[00:02:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T402925)', diff saved to https://phabricator.wikimedia.org/P81815 and previous config saved to /var/cache/conftool/dbconfig/20250827-000203-ladsgroup.json
[00:02:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:02:20] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[00:02:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81816 and previous config saved to /var/cache/conftool/dbconfig/20250827-000227-ladsgroup.json
[00:04:36] (PS1) Zabe: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244
[00:04:46] (PS1) Zabe: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245
[00:05:11] (CR) Zabe: [C:+2] BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244 (owner: Zabe)
[00:05:13] (CR) Zabe: [C:+2] BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245 (owner: Zabe)
[00:08:55] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247
[00:08:55] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247 (owner: TrainBranchBot)
[00:10:58] (CR) Papaul: "recheck" [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[00:16:20] (PS8) Papaul: Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845)
[00:17:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81817 and previous config saved to /var/cache/conftool/dbconfig/20250827-001701-ladsgroup.json
[00:17:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[00:22:47] (Merged) jenkins-bot: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1182244 (owner: Zabe)
[00:22:51] (Merged) jenkins-bot: BacklinkCache: Use LinksMigration for categorylinks [core] (wmf/1.45.0-wmf.16) - https://gerrit.wikimedia.org/r/1182245 (owner: Zabe)
[00:23:40] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]]
[00:29:41] !log zabe@deploy1003 zabe: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:30:10] !log zabe@deploy1003 zabe: Continuing with sync
[00:32:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81818 and previous config saved to /var/cache/conftool/dbconfig/20250827-003208-ladsgroup.json
[00:35:26] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1182247 (owner: TrainBranchBot)
[00:35:29] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182244|BacklinkCache: Use LinksMigration for categorylinks]], [[gerrit:1182245|BacklinkCache: Use LinksMigration for categorylinks]] (duration: 11m 49s)
[00:47:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81819 and previous config saved to /var/cache/conftool/dbconfig/20250827-004716-ladsgroup.json
[00:50:23] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2039
[00:50:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2039
[00:51:33] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11121998 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[01:00:06] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:01:06] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[01:02:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T402925)', diff saved to https://phabricator.wikimedia.org/P81821 and previous config saved to /var/cache/conftool/dbconfig/20250827-010223-ladsgroup.json
[01:02:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:02:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance
[01:02:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81822 and previous config saved to /var/cache/conftool/dbconfig/20250827-010246-ladsgroup.json
[01:12:32] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 25s)
[01:15:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81823 and previous config saved to /var/cache/conftool/dbconfig/20250827-011501-ladsgroup.json
[01:15:07] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[01:19:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:24:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:26:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81824 and previous config saved to /var/cache/conftool/dbconfig/20250827-013008-ladsgroup.json
[01:36:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:53] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122052 (phaultfinder)
[01:45:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81826 and previous config saved to /var/cache/conftool/dbconfig/20250827-014516-ladsgroup.json
[01:48:51] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122065 (phaultfinder)
[02:00:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T402925)', diff saved to https://phabricator.wikimedia.org/P81827 and previous config saved to /var/cache/conftool/dbconfig/20250827-020023-ladsgroup.json
[02:00:29] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:00:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance
[02:00:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81828 and previous config saved to /var/cache/conftool/dbconfig/20250827-020046-ladsgroup.json
[02:10:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81829 and previous config saved to /var/cache/conftool/dbconfig/20250827-021006-ladsgroup.json
[02:10:12] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:25:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P81833 and previous config saved to /var/cache/conftool/dbconfig/20250827-022513-ladsgroup.json
[02:40:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P81834 and previous config saved to /var/cache/conftool/dbconfig/20250827-024021-ladsgroup.json
[02:53:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:55:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T402925)', diff saved to https://phabricator.wikimedia.org/P81835 and previous config saved to /var/cache/conftool/dbconfig/20250827-025529-ladsgroup.json
[02:55:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[02:55:44] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance
[03:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:07:06] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[03:07:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81836 and previous config saved to /var/cache/conftool/dbconfig/20250827-030713-ladsgroup.json
[03:07:18] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:10:25] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122138 (Zache) > However, I remain concerned that a determined attacker or a widely used non-compliant script could create the same load again. This risk hi...
[03:20:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81837 and previous config saved to /var/cache/conftool/dbconfig/20250827-032019-ladsgroup.json
[03:20:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[03:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:35:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81838 and previous config saved to /var/cache/conftool/dbconfig/20250827-033527-ladsgroup.json
[03:50:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81839 and previous config saved to /var/cache/conftool/dbconfig/20250827-035035-ladsgroup.json
[03:59:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:03:05] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:03:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:04:01] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:04:05] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:04:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:05:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T402925)', diff saved to https://phabricator.wikimedia.org/P81840 and previous config saved to /var/cache/conftool/dbconfig/20250827-040542-ladsgroup.json
[04:05:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:05:58] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2223.codfw.wmnet with reason: Maintenance
[04:06:06] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81841 and previous config saved to /var/cache/conftool/dbconfig/20250827-040605-ladsgroup.json
[04:08:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:09:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:19:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81842 and previous config saved to /var/cache/conftool/dbconfig/20250827-041900-ladsgroup.json
[04:19:06] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[04:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:27:11] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (install6003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:34:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81843 and previous config saved to /var/cache/conftool/dbconfig/20250827-043407-ladsgroup.json
[04:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:46:40] SRE, Traffic, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11122213 (Abbe98) Affected SPARQL backends appear to at least include Fuseki and Virtuoso.
[04:49:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81844 and previous config saved to /var/cache/conftool/dbconfig/20250827-044915-ladsgroup.json
[05:02:22] (PS1) Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182258
[05:04:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T402925)', diff saved to https://phabricator.wikimedia.org/P81845 and previous config saved to /var/cache/conftool/dbconfig/20250827-050423-ladsgroup.json
[05:04:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:04:39] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2228.codfw.wmnet with reason: Maintenance
[05:04:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81846 and previous config saved to /var/cache/conftool/dbconfig/20250827-050446-ladsgroup.json
[05:04:57] (CR) CI reject: [V:-1] Revert^2 "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182258 (owner: Arnaudb)
[05:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81848 and previous config saved to /var/cache/conftool/dbconfig/20250827-051540-ladsgroup.json
[05:15:46] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925
[05:17:24] (CR) Ayounsi: [C:+1] "nice, lgtm!" [homer/public] - https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: Papaul)
[05:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[05:29:35] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:30:48] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81849 and previous config saved to /var/cache/conftool/dbconfig/20250827-053047-ladsgroup.json
[05:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:38:32] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11122268 (ayounsi) Resolved→Open Good job all!! @Ladsgroup from {T378715} do you need to upgrade any listed db* hosts to 10G?
[05:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:45:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81850 and previous config saved to /var/cache/conftool/dbconfig/20250827-054555-ladsgroup.json [05:48:56] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122281 (10phaultfinder) [05:53:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122283 (10phaultfinder) [05:56:53] (03PS8) 10Ayounsi: Nokia: Add initial Python files for nokia switch system config [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [05:56:53] (03PS11) 10Ayounsi: Nokia: module for interface configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180925 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [05:56:53] (03PS4) 10Ayounsi: Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [05:56:53] (03PS3) 10Ayounsi: Nokia: module for OSPF configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney) [05:58:24] (03CR) 10CI reject: [V:04-1] Nokia: module for network-instance configuration [homer/public] - 10https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [05:58:25] (03CR) 10CI reject: [V:04-1] Nokia: module for OSPF configuration 
[homer/public] - 10https://gerrit.wikimedia.org/r/1181132 (owner: 10Cathal Mooney) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0600) [06:01:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T402925)', diff saved to https://phabricator.wikimedia.org/P81851 and previous config saved to /var/cache/conftool/dbconfig/20250827-060103-ladsgroup.json [06:01:09] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [06:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:25:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:30:48] (03PS4) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182258 (https://phabricator.wikimedia.org/T402611) [06:36:34] (03PS1) 10Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182440 [06:39:08] (03CR) 10Arnaudb: [C:03+2] Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182440 (owner: 10Arnaudb) [06:52:06] (03CR) 10Muehlenhoff: [C:03+2] Assign installserver role to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182167 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [06:56:16] 06SRE, 06Commons, 06DBA, 
06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122371 (10Josve05a) (I meant to edit my comment but deleted it… ugh) [06:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:04] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:01:16] (03CR) 10Brouberol: [C:03+1] opensearch-k8s: allow setting vm.max_map_count [puppet] - 10https://gerrit.wikimedia.org/r/1182218 (https://phabricator.wikimedia.org/T402926) (owner: 10Ryan Kemper) [07:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:02:28] (03CR) 10Muehlenhoff: [C:03+2] Update DHCP server in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1182168 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:02:44] (03CR) 10Muehlenhoff: [C:03+2] homer: Update DHCP server in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1182165 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - 
https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:11:18] (03CR) 10Filippo Giunchedi: [C:03+1] Update the proxies used by cloudcumin to install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [07:14:36] (03CR) 10MVernon: [C:03+2] swift: re-add 3 codfw hosts, drain the next 3 [puppet] - 10https://gerrit.wikimedia.org/r/1182174 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [07:14:54] (03CR) 10MVernon: [C:03+2] thanos - put thanos-be2005 back into rings [puppet] - 10https://gerrit.wikimedia.org/r/1182182 (https://phabricator.wikimedia.org/T400876) (owner: 10MVernon) [07:22:30] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11122388 (10MatthewVernon) [07:23:09] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11122389 (10MatthewVernon) [07:28:56] (03PS7) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [07:29:20] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:37:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:57:16] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182189 (owner: 10Andrew Bogott) [07:57:35] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182190 (owner: 10Andrew Bogott) [07:59:47] (03CR) 10David Caro: "LGTM, does tofu wait/expect the domain to be active?" [puppet] - 10https://gerrit.wikimedia.org/r/1182188 (https://phabricator.wikimedia.org/T398712) (owner: 10Andrew Bogott) [08:00:05] andre and jnuche: gettimeofday() says it's time for MediaWiki train - Utc-0 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T0800) [08:00:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:00:09] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377) [08:00:11] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy1003" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [08:01:09] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182492 (https://phabricator.wikimedia.org/T396377) (owner: 10TrainBranchBot) [08:02:33] (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in codfw to install2005 [dns] - 10https://gerrit.wikimedia.org/r/1182166 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [08:02:39] !log jmm@dns1004 START - running authdns-update [08:03:50] !log jmm@dns1004 END - running authdns-update [08:07:45] (03CR) 10Fabfur: [C:03+1] varnish: Remove unused header X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez) [08:09:29] (03CR) 10Filippo Giunchedi: "Good questions; I have not looked into how exactly prometheus-openstack-exporter gather metrics, though I'm assuming a nova API call indee" [alerts] - 10https://gerrit.wikimedia.org/r/1182034 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [08:13:35] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.16 refs T396377 [08:13:40] T396377: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377 [08:20:07] (03CR) 10Muehlenhoff: [C:03+2] 
Update the proxies used by cloudcumin to install2005 [puppet] - https://gerrit.wikimedia.org/r/1182170 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff) [08:26:36] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122458 (Aklapper) @Josve05a: I could re-post your comment from my bugmail copy, if you want me to? [08:30:45] PROBLEM - Squid on install2004 is CRITICAL: connect to address 208.80.153.105 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [08:31:09] PROBLEM - HTTP on install2004 is CRITICAL: connect to address 208.80.153.105 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers [08:31:29] PROBLEM - TFTP service on install2004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [08:33:26] is install2004 a new host being set up?
[08:33:39] FIRING: [2x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:11] no, it's an old host being taken down, install2005 is the new one, I'll silence it some more [08:34:19] no worries [08:34:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:35:03] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on install2004.wikimedia.org with reason: being replaced by install2005 [08:35:43] (CR) Volans: "replies inline" [cookbooks] - https://gerrit.wikimedia.org/r/1181795 (owner: JHathaway) [08:36:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS trixie [08:56:55] (PS1) Arnaudb: gerrit: mod qos configuration [puppet] - https://gerrit.wikimedia.org/r/1182450 (https://phabricator.wikimedia.org/T402611) [08:56:55] (CR) Arnaudb: [C:+2] "the previous iteration was installing the wrong version of mod-qos on bookworm" [puppet] - https://gerrit.wikimedia.org/r/1182450 (https://phabricator.wikimedia.org/T402611) (owner: Arnaudb) [08:57:40] (PS1) Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182496 [09:02:03] (CR) Arnaudb: [C:+2] Revert "gerrit: mod qos configuration" [puppet] - https://gerrit.wikimedia.org/r/1182496 (owner: Arnaudb) [09:05:05] (PS8) Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - https://gerrit.wikimedia.org/r/1182106
(https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [09:08:41] (PS2) Tiziano Fogli: mirrormaker: add alerts directly in Prometheus [puppet] - https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) [09:09:07] (CR) CI reject: [V:-1] mirrormaker: add alerts directly in Prometheus [puppet] - https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli) [09:09:58] (PS8) Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:10:37] (PS3) Tiziano Fogli: mirrormaker: add alerts directly in Prometheus [puppet] - https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) [09:15:37] (CR) Tiziano Fogli: "I decided to split the checks into two different profiles because I wasn’t happy about grooming the parameters with regexps, as it didn’t " [puppet] - https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: Tiziano Fogli) [09:17:15] (PS5) Ayounsi: Nokia: module for network-instance configuration [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [09:17:15] (PS4) Ayounsi: Nokia: module for OSPF configuration [homer/public] - https://gerrit.wikimedia.org/r/1181132 (owner: Cathal Mooney) [09:17:29] (CR) Ayounsi: Nokia: module for network-instance configuration (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/1180979 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney) [09:19:18] (PS1) Muehlenhoff: Remove my old Neo-based key [puppet] - https://gerrit.wikimedia.org/r/1182498 [09:24:47] (PS9) Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [09:32:32]
!log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2157.codfw.wmnet with reason: Maintenance [09:32:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T401906)', diff saved to https://phabricator.wikimedia.org/P81854 and previous config saved to /var/cache/conftool/dbconfig/20250827-093239-fceratto.json [09:32:44] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [09:33:02] (CR) Tiziano Fogli: [C:+2] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - https://gerrit.wikimedia.org/r/1182148 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [09:34:22] (PS1) Cathal Mooney: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - https://gerrit.wikimedia.org/r/1182500 [09:34:55] (CR) Ayounsi: [C:+1] "lgtm" [homer/public] - https://gerrit.wikimedia.org/r/1182500 (owner: Cathal Mooney) [09:35:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T401906)', diff saved to https://phabricator.wikimedia.org/P81855 and previous config saved to /var/cache/conftool/dbconfig/20250827-093507-fceratto.json [09:35:29] (PS2) Cathal Mooney: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - https://gerrit.wikimedia.org/r/1182500 [09:37:01] (CR) Cathal Mooney: [C:+2] Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - https://gerrit.wikimedia.org/r/1182500 (owner: Cathal Mooney) [09:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:20] (CR) Vgutierrez:
P:puppetserver::volatile generate datacenter database (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [09:38:23] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11122675 (Bugreporter) >>! In T402749#11122138, @Zache wrote: >> However, I remain concerned that a determined attacker or a widely used non-compliant script... [09:39:39] SRE, Traffic, Wikidata, Wikidata-Query-Service: Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement - https://phabricator.wikimedia.org/T402959#11122678 (Lucas_Werkmeister_WMDE) [09:39:54] (Merged) jenkins-bot: Fix error where codfw kube-dse ASN listed under eqiad customers [homer/public] - https://gerrit.wikimedia.org/r/1182500 (owner: Cathal Mooney) [09:42:47] (PS1) Sergio Gimeno: Revert "changeprop beta: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - https://gerrit.wikimedia.org/r/1182501 [09:43:04] (PS1) Sergio Gimeno: Revert "changeprop: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - https://gerrit.wikimedia.org/r/1182502 [09:43:48] (CR) Vgutierrez: P:puppetserver::volatile generate datacenter database (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [09:43:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2005.codfw.wmnet with OS trixie [09:49:39] jmm@cumin2002 reimage (PID 474023) is awaiting input [09:49:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [09:50:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81856 and previous config saved to
/var/cache/conftool/dbconfig/20250827-095014-fceratto.json [09:53:05] (PS1) Peter Fischer: SUP: upgrade to flink 1.20.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1182503 (https://phabricator.wikimedia.org/T398159) [09:53:25] (PS5) Vgutierrez: varnish: Remove unused header X-Analytics-TLS [puppet] - https://gerrit.wikimedia.org/r/1181134 [09:53:51] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11122817 (phaultfinder) [09:54:14] (CR) Vgutierrez: [C:+2] varnish: Remove unused header X-Analytics-TLS [puppet] - https://gerrit.wikimedia.org/r/1181134 (owner: Vgutierrez) [09:56:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie [09:56:44] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [09:58:51] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11122825 (phaultfinder) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1000) [10:00:45] (CR) Vgutierrez: "we moved this to leverage `X-Provenance` signaled from HAProxy to Varnish: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppe" [puppet] - https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: Giuseppe Lavagetto) [10:01:57] !log installing libxslt security updates [10:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:05:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P81857 and previous config saved to /var/cache/conftool/dbconfig/20250827-100521-fceratto.json [10:07:17] (PS1) Kevin Bazira: ml-services: update revscoring staging image [deployment-charts] - https://gerrit.wikimedia.org/r/1182506 (https://phabricator.wikimedia.org/T400350) [10:10:06] (PS1) Federico Ceratto: Prepare new es2* nodes to replace old ones [puppet] - https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) [10:10:06] (CR) Federico Ceratto: "Deploying new es2* nodes (as discussed on IRC)" [puppet] - https://gerrit.wikimedia.org/r/1182507 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto) [10:13:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:17:00] (CR) Majavah: [C:+2] openstack: puppet: Set user-agent for ENC client script [puppet] - https://gerrit.wikimedia.org/r/1179121 (owner: Majavah) [10:19:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:20] (CR) Máté Szabó: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [10:20:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T401906)', diff saved to
https://phabricator.wikimedia.org/P81859 and previous config saved to /var/cache/conftool/dbconfig/20250827-102029-fceratto.json [10:20:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:20:35] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [10:20:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81860 and previous config saved to /var/cache/conftool/dbconfig/20250827-102041-fceratto.json [10:21:15] (PS1) Filippo Giunchedi: wmcs: add JobUnavailable alert [alerts] - https://gerrit.wikimedia.org/r/1182508 (https://phabricator.wikimedia.org/T402778) [10:23:40] (PS9) Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [10:24:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81861 and previous config saved to /var/cache/conftool/dbconfig/20250827-102414-fceratto.json [10:29:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS bookworm [10:31:39] (CR) Máté Szabó: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [10:33:00] (CR) FNegri: [C:+1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/1182508 (https://phabricator.wikimedia.org/T402778) (owner: Filippo Giunchedi) [10:38:27] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [10:38:34] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T402925)',
diff saved to https://phabricator.wikimedia.org/P81862 and previous config saved to /var/cache/conftool/dbconfig/20250827-103834-ladsgroup.json [10:38:40] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:39:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81863 and previous config saved to /var/cache/conftool/dbconfig/20250827-103921-fceratto.json [10:41:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS trixie [10:41:31] (CR) Urbanecm: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1182502 (owner: Sergio Gimeno) [10:41:43] (CR) Urbanecm: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1182501 (owner: Sergio Gimeno) [10:42:16] (CR) Máté Szabó: "pcc fail seems to be from Ic32d387689d6faabd233c2f357d7a34c7c083949" [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [10:42:29] ops-codfw, SRE, DBA, DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11122997 (Ladsgroup) I looked at them and they seems to be random replicas in random sections. I think they probably need rebalanacing to reduce their load...
[10:44:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS trixie [10:47:33] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2043.codfw.wmnet with OS bookworm [10:49:32] !log idm2001.wikimedia.org - Update EnvoyProxy to version 1.26.8 - https://phabricator.wikimedia.org/T402584 [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:13] (PS6) STran: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: Tchanders) [10:51:01] (CR) CI reject: [V:-1] Enable temporary accounts on remaining small-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: Tchanders) [10:51:19] (PS4) FNegri: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: Zabe) [10:51:21] (CR) Ladsgroup: [C:+2] maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: Zabe) [10:51:23] (CR) Ladsgroup: [V:+2 C:+2] maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: Zabe) [10:51:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T402925)', diff saved to https://phabricator.wikimedia.org/P81865 and previous config saved to /var/cache/conftool/dbconfig/20250827-105151-ladsgroup.json [10:51:57] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [10:54:03] !log idm1001.wikimedia.org - Update EnvoyProxy to version
1.26.8 - https://phabricator.wikimedia.org/T402584 [10:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:15] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [10:54:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P81866 and previous config saved to /var/cache/conftool/dbconfig/20250827-105428-fceratto.json [10:54:35] (CR) Vgutierrez: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [10:55:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS bookworm [10:55:27] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031 (cmooney) NEW p:Triage→Medium [10:56:01] (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1181123 (owner: PipelineBot) [10:56:56] (PS7) STran: Enable temporary accounts on remaining small-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: Tchanders) [10:57:36] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [10:57:42] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1181123 (owner: PipelineBot) [10:58:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2009.codfw.wmnet with OS bookworm [10:59:16] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:00:05] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1100). Please do the needful.
[11:00:24] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:00:53] !log ladsgroup@cumin1003 START - Cookbook sre.wikireplicas.update-views [11:01:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [11:03:03] (PS1) Tiziano Fogli: nrpewrapper: fix max parameters [puppet] - https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) [11:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:40] FIRING: KubernetesRsyslogDown: rsyslog on dse-k8s-worker2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=dse-k8s-worker2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:05:09] (PS10) Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) [11:05:23] (CR) Kosta Harlan: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [11:05:52] (CR) Vgutierrez: [C:+1] hcaptcha: Remap upstream Set-Cookie headers to use
the proxy domain [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [11:06:59] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P81867 and previous config saved to /var/cache/conftool/dbconfig/20250827-110659-ladsgroup.json [11:07:54] !log ladsgroup@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:08:47] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:08:51] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035 (cmooney) NEW p:Triage→Medium [11:09:07] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:09:20] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:09:36] (CR) Hnowlan: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6771/console" [puppet] - https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [11:09:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T401906)', diff saved to https://phabricator.wikimedia.org/P81868 and previous config saved to /var/cache/conftool/dbconfig/20250827-110936-fceratto.json [11:09:38] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11123210 (cmooney) [11:09:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:09:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on
db2178.codfw.wmnet with reason: Maintenance [11:09:41] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11123211 (cmooney) [11:09:42] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:09:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T401906)', diff saved to https://phabricator.wikimedia.org/P81869 and previous config saved to /var/cache/conftool/dbconfig/20250827-110948-fceratto.json [11:10:54] (CR) Hnowlan: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6772/console" [puppet] - https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [11:11:04] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:11:43] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [11:12:20] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:52] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:12:56] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:13:03] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'.
[11:13:03] (Abandoned) Tiziano Fogli: nrpewrapper: fix max parameters [puppet] - https://gerrit.wikimedia.org/r/1182511 (https://phabricator.wikimedia.org/T395446) (owner: Tiziano Fogli) [11:13:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T401906)', diff saved to https://phabricator.wikimedia.org/P81870 and previous config saved to /var/cache/conftool/dbconfig/20250827-111320-fceratto.json [11:13:23] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'sync'. [11:13:43] (PS1) Tiziano Fogli: Revert "nrpewrapper: correlate Prometheus "for:" duration with Icinga timing" [puppet] - https://gerrit.wikimedia.org/r/1182512 [11:13:52] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:14:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:15:35] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:16:17] (CR) Tiziano Fogli: [C:+2] Revert "nrpewrapper: correlate Prometheus "for:" duration with Icinga timing" [puppet] - https://gerrit.wikimedia.org/r/1182512 (owner: Tiziano Fogli) [11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:44] (PS11) Máté Szabó: hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) [11:19:04] (CR) Máté Szabó: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó) [11:19:40] RESOLVED: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:20:01] jmm@cumin2002 reimage (PID 519131) is awaiting input [11:22:07] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P81871 and previous config saved to /var/cache/conftool/dbconfig/20250827-112206-ladsgroup.json [11:28:15] (CR) Dreamy Jazz: [C:+1] Enable temporary accounts on remaining small-sized projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1180532 (https://phabricator.wikimedia.org/T402181) (owner: Tchanders) [11:28:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81872 and previous config saved to /var/cache/conftool/dbconfig/20250827-112827-fceratto.json [11:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:29:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm [11:33:37] jmm@cumin2002 reimage (PID 533938) is awaiting input [11:34:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2005.codfw.wmnet with OS bookworm [11:37:14] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T402925)', diff saved to https://phabricator.wikimedia.org/P81873 and previous config saved to
/var/cache/conftool/dbconfig/20250827-113714-ladsgroup.json [11:37:19] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:37:30] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:37:47] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:37:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T402925)', diff saved to https://phabricator.wikimedia.org/P81874 and previous config saved to /var/cache/conftool/dbconfig/20250827-113754-ladsgroup.json [11:39:21] (CR) Mvolz: [C:+2] Whitelist api-user-agent header for logs [deployment-charts] - https://gerrit.wikimedia.org/r/1180511 (https://phabricator.wikimedia.org/T345627) (owner: Mvolz) [11:40:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:41:01] (Merged) jenkins-bot: Whitelist api-user-agent header for logs [deployment-charts] - https://gerrit.wikimedia.org/r/1180511 (https://phabricator.wikimedia.org/T345627) (owner: Mvolz) [11:42:31] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:43:02] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:43:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P81875 and previous config saved to /var/cache/conftool/dbconfig/20250827-114335-fceratto.json [11:45:20] SRE, Commons,
06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11123289 (10Zache) >>! In T402749#11122675, @Bugreporter wrote: > I believe we should revert {T194864} so autopatroller, patroller and image reviewer will no lo... [11:46:44] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:47:17] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:47:48] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:47:53] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [11:48:07] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:51:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11123297 (10elukey) >>! In T392851#11068425, @elukey wrote: > 3) Last but not least, `late_command.sh` fails during the Bookworm debian install, and the reason... [11:51:23] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T402925)', diff saved to https://phabricator.wikimedia.org/P81876 and previous config saved to /var/cache/conftool/dbconfig/20250827-115122-ladsgroup.json [11:51:28] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [11:53:01] 06SRE, 06Commons, 06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11123303 (10Ladsgroup) I suggest something even more radical: Move CAL (and HotCat) to core. These two gadgets are one of the most widely used and installed gad... 
[11:53:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2005.codfw.wmnet with reason: host reimage [11:58:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T401906)', diff saved to https://phabricator.wikimedia.org/P81878 and previous config saved to /var/cache/conftool/dbconfig/20250827-115843-fceratto.json [11:58:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2192.codfw.wmnet with reason: Maintenance [11:58:48] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [11:58:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T401906)', diff saved to https://phabricator.wikimedia.org/P81879 and previous config saved to /var/cache/conftool/dbconfig/20250827-115854-fceratto.json [12:01:15] (03CR) 10Cathal Mooney: [C:03+2] Nokia: Add examples for Nokia password hashes commonly used [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:01:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T401906)', diff saved to https://phabricator.wikimedia.org/P81880 and previous config saved to /var/cache/conftool/dbconfig/20250827-120128-fceratto.json [12:01:46] (03Merged) 10jenkins-bot: Nokia: Add examples for Nokia password hashes commonly used [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:04:23] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1012.eqiad.wmnet with OS bookworm [12:06:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P81881 and previous config saved to 
/var/cache/conftool/dbconfig/20250827-120630-ladsgroup.json [12:09:39] (03CR) 10Cathal Mooney: [C:03+2] wmf-plugin: New function to expose generic interface data to modules [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1180553 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:09:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2005.codfw.wmnet with OS bookworm [12:13:34] !log stevemunene@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'sync'. [12:14:32] (03CR) 10David Caro: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1182508 (https://phabricator.wikimedia.org/T402778) (owner: 10Filippo Giunchedi) [12:14:55] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182515 (https://phabricator.wikimedia.org/T398162) [12:14:56] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) [12:14:58] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) [12:15:00] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) [12:15:02] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) [12:15:04] (03PS1) 10Brouberol: flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) [12:16:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to 
https://phabricator.wikimedia.org/P81882 and previous config saved to /var/cache/conftool/dbconfig/20250827-121635-fceratto.json [12:19:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:20:35] (03PS1) 10Muehlenhoff: Failover idp.w.o to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1182521 (https://phabricator.wikimedia.org/T402584) [12:21:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P81883 and previous config saved to /var/cache/conftool/dbconfig/20250827-122138-ladsgroup.json [12:21:39] (03CR) 10Slyngshede: [C:03+1] Failover idp.w.o to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1182521 (https://phabricator.wikimedia.org/T402584) (owner: 10Muehlenhoff) [12:22:32] (03CR) 10Slyngshede: [C:03+1] "I'll just assume that your other keys work." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182498 (owner: 10Muehlenhoff) [12:22:40] (03CR) 10Ayounsi: Nokia: Add initial Python files for nokia switch system config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [12:22:56] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178181 (owner: 10PipelineBot) [12:23:01] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181122 (owner: 10PipelineBot) [12:24:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:25:06] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182515 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [12:25:12] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 in staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182515 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [12:26:48] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:27:34] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [12:28:14] (03CR) 10David Caro: "I think we can drop this for now, if it happens again then we can rethink on raising it, but it's quite unlikely that it was related to th" [puppet] - 10https://gerrit.wikimedia.org/r/1182175 (https://phabricator.wikimedia.org/T402480) (owner: 10Andrew Bogott) [12:28:18] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[12:31:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P81884 and previous config saved to /var/cache/conftool/dbconfig/20250827-123143-fceratto.json [12:33:05] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1012.eqiad.wmnet with reason: host reimage [12:34:48] (03PS1) 10Tiziano Fogli: nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182524 [12:36:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T402925)', diff saved to https://phabricator.wikimedia.org/P81885 and previous config saved to /var/cache/conftool/dbconfig/20250827-123645-ladsgroup.json [12:36:51] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:37:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance [12:37:09] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T402925)', diff saved to https://phabricator.wikimedia.org/P81886 and previous config saved to /var/cache/conftool/dbconfig/20250827-123708-ladsgroup.json [12:37:09] (03CR) 10CI reject: [V:04-1] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182524 (owner: 10Tiziano Fogli) [12:37:21] (03PS3) 10Anzx: tlwiktionary: set sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182523 (https://phabricator.wikimedia.org/T402725) [12:37:57] (03PS2) 10Tiziano Fogli: nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182524 (https://phabricator.wikimedia.org/T395446) [12:38:39] FIRING: [2x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - 
https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:39:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:15] (03PS1) 10Mhorsey: Release CampaignEvents extension to all active wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182526 (https://phabricator.wikimedia.org/T402329) [12:42:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182526 (https://phabricator.wikimedia.org/T402329) (owner: 10Mhorsey) [12:43:08] (03PS1) 10Anzx: gotwiki: update wordmark and add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182530 (https://phabricator.wikimedia.org/T402706) [12:44:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182523 (https://phabricator.wikimedia.org/T402725) (owner: 10Anzx) [12:44:07] (03PS2) 10Brouberol: flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 
(https://phabricator.wikimedia.org/T398162) [12:44:07] (03PS2) 10Brouberol: flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) [12:44:07] (03PS2) 10Brouberol: flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) [12:44:08] (03PS2) 10Brouberol: flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) [12:44:09] (03PS2) 10Brouberol: flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) [12:44:13] (03PS1) 10Brouberol: flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182532 (https://phabricator.wikimedia.org/T398162) [12:44:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182530 (https://phabricator.wikimedia.org/T402706) (owner: 10Anzx) [12:44:58] (03CR) 10Vgutierrez: [C:03+2] hcaptcha: Remap upstream Set-Cookie headers to use the proxy domain [puppet] - 10https://gerrit.wikimedia.org/r/1182106 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [12:46:31] (03PS1) 10Jforrester: wikifunctions: Enable forthcoming wikidataImport feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182533 (https://phabricator.wikimedia.org/T402357) [12:46:32] (03PS1) 10Klausman: team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) [12:46:33] (03PS1) 10Jforrester: wikifunctions: Upgrade 
evaluators from 2025-08-20-203801 to 2025-08-26-213211 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182534 (https://phabricator.wikimedia.org/T395475) [12:46:40] (03PS2) 10Brouberol: flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182532 (https://phabricator.wikimedia.org/T398162) [12:46:44] (03PS3) 10Brouberol: flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) [12:46:48] (03PS3) 10Brouberol: flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) [12:46:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T401906)', diff saved to https://phabricator.wikimedia.org/P81887 and previous config saved to /var/cache/conftool/dbconfig/20250827-124650-fceratto.json [12:46:53] (03PS3) 10Brouberol: flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) [12:46:55] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2201.codfw.wmnet with reason: Maintenance [12:46:57] (03PS3) 10Brouberol: flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) [12:46:57] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:47:01] (03PS1) 10Brouberol: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182536 (https://phabricator.wikimedia.org/T398162) [12:47:05] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-20-210742 to 2025-08-25-145906 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1182535 (https://phabricator.wikimedia.org/T395475) [12:48:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2211.codfw.wmnet with reason: Maintenance [12:48:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1012.eqiad.wmnet with OS bookworm [12:48:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T401906)', diff saved to https://phabricator.wikimedia.org/P81888 and previous config saved to /var/cache/conftool/dbconfig/20250827-124832-fceratto.json [12:48:36] (03PS7) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1182497 (https://phabricator.wikimedia.org/T402611) [12:48:36] (03CR) 10Arnaudb: [C:03+2] "mod-qos is installed the right way on pcc: https://puppet-compiler.wmflabs.org/output/1182497/4819/gerrit2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1182497 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [12:49:09] (03PS1) 10Arnaudb: Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182537 [12:49:45] (03PS3) 10Brouberol: flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182532 (https://phabricator.wikimedia.org/T398162) [12:49:45] (03PS4) 10Brouberol: flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) [12:49:46] (03PS4) 10Brouberol: flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) [12:49:50] (03PS4) 10Brouberol: flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) [12:49:54] (03PS4) 10Brouberol: 
flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) [12:49:58] (03PS3) 10Brouberol: flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) [12:50:24] (03PS1) 10Jforrester: Wikifunctions: Enable Wikidata input types in embedded calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182538 (https://phabricator.wikimedia.org/T397403) [12:50:29] (03PS10) 10Slyngshede: P:puppetserver::volatile generate datacenter database [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) [12:50:33] (03Abandoned) 10Brouberol: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182536 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [12:50:38] (03CR) 10DCausse: [C:03+1] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182532 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [12:52:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T401906)', diff saved to https://phabricator.wikimedia.org/P81889 and previous config saved to /var/cache/conftool/dbconfig/20250827-125207-fceratto.json [12:52:09] (03PS1) 10Vgutierrez: hcaptcha: Create /etc/nginx/lua [puppet] - 10https://gerrit.wikimedia.org/r/1182539 (https://phabricator.wikimedia.org/T402713) [12:52:13] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [12:52:50] (03PS2) 10Vgutierrez: hcaptcha: Create /etc/nginx/lua [puppet] - 10https://gerrit.wikimedia.org/r/1182539 (https://phabricator.wikimedia.org/T402713) [12:52:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182539 
(https://phabricator.wikimedia.org/T402713) (owner: 10Vgutierrez) [12:53:17] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1182521 (https://phabricator.wikimedia.org/T402584) (owner: 10Muehlenhoff) [12:53:20] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T402925)', diff saved to https://phabricator.wikimedia.org/P81890 and previous config saved to /var/cache/conftool/dbconfig/20250827-125319-ladsgroup.json [12:53:25] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [12:53:28] !log jmm@dns1004 START - running authdns-update [12:54:39] !log jmm@dns1004 END - running authdns-update [12:55:34] (03PS1) 10Anzx: hawiki: revert temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) [12:57:22] (03CR) 10Brouberol: [C:03+2] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182532 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [12:57:59] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182542 [12:58:02] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11123725 (10FCeratto-WMF) [12:58:49] !log upgrading envoy on testreduce T402584 [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [12:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:59:04] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:59:22] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1300). [13:00:05] Mvolz, houseofm, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182537 (owner: 10Arnaudb) [13:00:15] o/ [13:00:21] I can deploy for a bit but have a meeting in half an hour [13:00:39] o/ [13:00:49] ty Lucas_WMDE [13:00:50] (03CR) 10Kosta Harlan: [C:03+1] hcaptcha: Create /etc/nginx/lua [puppet] - 10https://gerrit.wikimedia.org/r/1182539 (https://phabricator.wikimedia.org/T402713) (owner: 10Vgutierrez) [13:00:55] (03CR) 10Vgutierrez: [C:03+2] hcaptcha: Create /etc/nginx/lua [puppet] - 10https://gerrit.wikimedia.org/r/1182539 (https://phabricator.wikimedia.org/T402713) (owner: 10Vgutierrez) [13:00:58] o/ [13:01:14] let’s start with HouseOfM [13:01:16] o/ [13:02:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182526 (https://phabricator.wikimedia.org/T402329) (owner: 10Mhorsey) [13:02:28] (03PS1) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182545 [13:03:14] (03Merged) 10jenkins-bot: Release CampaignEvents extension to all active wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182526 
(https://phabricator.wikimedia.org/T402329) (owner: 10Mhorsey) [13:03:24] (03CR) 10Papaul: [C:03+2] Add BGP on mr1-ulsfo and temporary remove replace ospf [homer/public] - 10https://gerrit.wikimedia.org/r/1182185 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [13:03:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11123748 (10Jclark-ctr) Based on the cabling for spine D1 (T401238) last week, I will need an 8m fiber. @VRiley-WMF, will you be able to update the lengt... [13:03:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1182526|Release CampaignEvents extension to all active wikisources (T402329)]] [13:03:54] T402329: Release CampaignEvents extension to all Wikisources - week of AUGUST 25 - https://phabricator.wikimedia.org/T402329 [13:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:04:29] (03PS2) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182545 [13:06:21] (03PS2) 10Anzx: hawiki: revert temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) [13:06:46] (03PS3) 10Anzx: hawiki: revert temporary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) [13:07:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81891 and previous config saved to /var/cache/conftool/dbconfig/20250827-130714-fceratto.json [13:08:28] !log ladsgroup@cumin1003 dbctl commit 
(dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P81892 and previous config saved to /var/cache/conftool/dbconfig/20250827-130827-ladsgroup.json [13:08:53] 06SRE, 06Traffic, 13Patch-For-Review, 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): hCaptcha: Ensure GeoIP and WMF-Uniq cookies are removed in proxied requests - https://phabricator.wikimedia.org/T402713#11123784 (10kostajh) [13:09:19] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] tlwiktionary: set sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182523 (https://phabricator.wikimedia.org/T402725) (owner: 10Anzx) [13:09:23] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] gotwiki: update wordmark and add tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182530 (https://phabricator.wikimedia.org/T402706) (owner: 10Anzx) [13:10:16] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, mhorsey: Backport for [[gerrit:1182526|Release CampaignEvents extension to all active wikisources (T402329)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:10:21] T402329: Release CampaignEvents extension to all Wikisources - week of AUGUST 25 - https://phabricator.wikimedia.org/T402329 [13:10:50] HouseOfM: please test on WikimediaDebug :) [13:10:51] LGTM, ty [13:10:54] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, mhorsey: Continuing with sync [13:10:55] ok! 
[13:11:24] (03PS1) 10Federico Ceratto: data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 [13:11:24] (03CR) 10Federico Ceratto: "As discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (owner: 10Federico Ceratto) [13:11:57] (03PS2) 10Anzx: hawiki: remove temporary logo files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182541 (https://phabricator.wikimedia.org/T376049) [13:12:45] (03PS2) 10Federico Ceratto: data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) [13:13:17] 06SRE, 06Traffic, 13Patch-For-Review, 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): hCaptcha: Ensure GeoIP and WMF-Uniq cookies are removed in proxied requests - https://phabricator.wikimedia.org/T402713#11123816 (10kostajh) 05Open→03Resolved a:03kostajh [13:13:28] 06SRE, 06Traffic, 13Patch-For-Review, 10WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): hCaptcha: Ensure GeoIP and WMF-Uniq cookies are removed in proxied requests - https://phabricator.wikimedia.org/T402713#11123818 (10kostajh) a:05kostajh→03mszabo [13:13:30] (03CR) 10CI reject: [V:04-1] data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) (owner: 10Federico Ceratto) [13:16:01] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182526|Release CampaignEvents extension to all active wikisources (T402329)]] (duration: 12m 12s) [13:16:06] T402329: Release CampaignEvents extension to all Wikisources - week of AUGUST 25 - https://phabricator.wikimedia.org/T402329 [13:16:56] Tysm Lucas_WMDE [13:16:57] (03PS5) 10Brouberol: flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 
(https://phabricator.wikimedia.org/T398162) [13:16:57] (03PS5) 10Brouberol: flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) [13:16:57] (03PS5) 10Brouberol: flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) [13:16:58] (03PS5) 10Brouberol: flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) [13:16:59] (03PS4) 10Brouberol: flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) [13:16:59] np :) [13:17:00] (03PS1) 10Brouberol: flink-operator: grant the required permissions to manage it own CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182549 (https://phabricator.wikimedia.org/T398162) [13:17:20] I don’t think there’s enough time for me to start another deployment, sorry [13:17:31] Mvolz should be able to self-service… can anyone deploy for anzx? [13:17:56] (03CR) 10DCausse: [C:03+1] flink-operator: grant the required permissions to manage it own CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182549 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:19:11] Lucas_WMDE: I actually don't have spiderpig access :/ [13:19:17] apparently [13:19:24] huh, weird [13:19:29] But it's not critical I deploy today. [13:19:40] but you have deployment access according to puppet, right? [13:19:44] Yeah. [13:19:54] I thought all deployers had spiderpig access now o_O [13:20:09] I guess I should file a ticket! 
[13:20:21] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11123850 (10Papaul) Diff on mr1-ulsfo ` + bgp { + group Production { + type external; + import BGP_Default; + expo... [13:21:06] I seem to recall there was a change to 2FA that was stopping folks with access from using it and giving a message that made it appear like access wasn't allowed [13:21:36] jouncebot: now [13:21:36] For the next 0 hour(s) and 38 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1300) [13:22:05] 06SRE, 10SRE-swift-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918#11123872 (10MatthewVernon) 05Stalled→03Open We do now have all the same UID across the fleet: ` mvernon@cumin2002:~$ sudo cumin O:swift::storage 'id swift' [...] ===== NODE G... [13:22:09] (But I also can't +2 the config repo) [13:22:19] Mvolz: you don't need to! 
[13:22:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P81893 and previous config saved to /var/cache/conftool/dbconfig/20250827-132222-fceratto.json [13:22:26] just `scap backport changeid` and it'll +2 for you [13:22:39] Lucas_WMDE: i think they gave it to all deployers who recently deployed something [13:23:11] I am quickly restarting Gerrit [13:23:35] urbanecm: ah, that rings a bell… [13:23:35] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P81894 and previous config saved to /var/cache/conftool/dbconfig/20250827-132334-ladsgroup.json [13:23:44] for access to spiderpig, I have no idea how it is granted / what is required [13:23:52] (also +1 to the scap backport comment ^^) [13:23:59] spiderpig-access LDAP group [13:25:38] (03PS2) 10Klausman: team-ml: Add alert for outdated admin_ng config [alerts] - 10https://gerrit.wikimedia.org/r/1182531 (https://phabricator.wikimedia.org/T403047) [13:25:40] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182549 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:25:56] !log restarted Gerrit [13:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:09] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [13:29:15] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=tegola-vector-tiles,name=codfw [13:30:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and mr1-ulsfo (198.35.26.199) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-ulsfo - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:33:00] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:33:21] (03CR) 10Brouberol: [C:03+2] flink-operator: grant the required permissions to manage it own CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182549 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:34:15] (03PS5) 10Arnaudb: Revert^2 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182545 (https://phabricator.wikimedia.org/T402611) [13:34:15] (03CR) 10Arnaudb: [C:03+2] "previous iteration was properly installing mod-qos but was not properly configuring it with httpd::conf" [puppet] - 10https://gerrit.wikimedia.org/r/1182545 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [13:34:34] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:34:40] (03CR) 10Brouberol: [V:03+2 C:03+2] flink-operator: update operator to 1.12 in staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182516 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [13:34:47] (03PS1) 10Arnaudb: Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182570 [13:35:27] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11123959 (10FCeratto-WMF) Opened https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182543 [13:35:54] (03PS1) 10Tiziano Fogli: icinga/audit: add script to dump defined checks [software] - 10https://gerrit.wikimedia.org/r/1182571 (https://phabricator.wikimedia.org/T395443) [13:36:23] (03CR) 10CI reject: [V:04-1] icinga/audit: add 
script to dump defined checks [software] - 10https://gerrit.wikimedia.org/r/1182571 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [13:37:14] (03PS2) 10Tiziano Fogli: icinga/audit: add script to dump defined checks [software] - 10https://gerrit.wikimedia.org/r/1182571 (https://phabricator.wikimedia.org/T395443) [13:37:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T401906)', diff saved to https://phabricator.wikimedia.org/P81895 and previous config saved to /var/cache/conftool/dbconfig/20250827-133729-fceratto.json [13:37:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2223.codfw.wmnet with reason: Maintenance [13:37:35] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [13:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2223 (T401906)', diff saved to https://phabricator.wikimedia.org/P81896 and previous config saved to /var/cache/conftool/dbconfig/20250827-133741-fceratto.json [13:38:43] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T402925)', diff saved to https://phabricator.wikimedia.org/P81897 and previous config saved to /var/cache/conftool/dbconfig/20250827-133842-ladsgroup.json [13:38:45] (03PS1) 10Ottomata: eventgate-analytics, eventgate-logging-ext: upgrade to 1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182572 (https://phabricator.wikimedia.org/T376026) [13:38:47] (03CR) 10Cathal Mooney: Nokia: Add initial Python files for nokia switch system config (033 comments) 
[homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:38:48] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:38:49] (03CR) 10Arnaudb: [C:03+2] Revert^3 "gerrit: mod qos configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1182570 (owner: 10Arnaudb) [13:38:55] !log brouberol@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:38:57] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:39:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T402925)', diff saved to https://phabricator.wikimedia.org/P81898 and previous config saved to /var/cache/conftool/dbconfig/20250827-133904-ladsgroup.json [13:39:16] !log brouberol@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:40:29] urbanecm: Lucas_WMDE is the backport window finished? I would like to do an eventgate deploy, but I'll wait til the window is done. [13:40:43] I’m not deploying, at least. not sure about anyone else [13:40:43] not sure, i was not deploying something [13:40:45] * Lucas_WMDE in a meeting [13:40:50] but i'd like to do a changeprop deployment [13:41:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T401906)', diff saved to https://phabricator.wikimedia.org/P81899 and previous config saved to /var/cache/conftool/dbconfig/20250827-134116-fceratto.json [13:41:24] okay, I guess it is done then! :) proceeding! [13:41:30] (03CR) 10Ottomata: [C:03+2] eventgate-analytics, eventgate-logging-ext: upgrade to 1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182572 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [13:41:36] ottomata: ping me when done, please! 
:) [13:42:19] (03PS1) 10MVernon: swift: use admin to manage swift uid/gid, remove old bodges [puppet] - 10https://gerrit.wikimedia.org/r/1182573 (https://phabricator.wikimedia.org/T123918) [13:43:03] !log brouberol@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:43:10] (03Merged) 10jenkins-bot: eventgate-analytics, eventgate-logging-ext: upgrade to 1.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182572 (https://phabricator.wikimedia.org/T376026) (owner: 10Ottomata) [13:43:39] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:44:27] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [13:44:33] !log brouberol@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:44:49] !log deploying eventgate-analytics and eventgate-logging-external to pick up meta.dt change - T376026 [13:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:54] T376026: Update event-producing tools to overwrite `meta.dt` - https://phabricator.wikimedia.org/T376026 [13:44:58] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [13:45:23] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [13:45:58] 10SRE-Access-Requests: Request access to spiderpig for Mvolz - https://phabricator.wikimedia.org/T403061 (10Mvolz) 03NEW [13:46:09] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [13:46:39] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [13:47:08] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [13:47:32] 10SRE-Access-Requests: 
Request access to spiderpig for Mvolz - https://phabricator.wikimedia.org/T403061#11124038 (10MoritzMuehlenhoff) You can request is via Wikimedia IDM; please see here: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access The access is then reviewed/granted by Tyler. [13:47:35] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [13:48:17] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [13:48:33] (03PS10) 10Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) [13:48:37] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [13:49:26] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [13:49:55] 10SRE-Access-Requests: Request access to spiderpig for Mvolz - https://phabricator.wikimedia.org/T403061#11124044 (10Mvolz) Awesome, thanks! 
[13:50:00] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [13:50:12] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [13:50:17] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T402925)', diff saved to https://phabricator.wikimedia.org/P81900 and previous config saved to /var/cache/conftool/dbconfig/20250827-135017-ladsgroup.json [13:50:23] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [13:50:26] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11124046 (10Jgiannelos) Here is the latest benchmark between codfw/eqiad when it comes to latency: {F65921136} Here are some more generic stats: {F65921140} There is still a sligh... [13:50:56] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [13:50:59] 10SRE-Access-Requests: Request access to spiderpig for Mvolz - https://phabricator.wikimedia.org/T403061#11124050 (10Mvolz) 05Open→03Invalid [13:51:01] !log stop puppet/swift/rsync to vacuum large DBs on ms-be1066 T377827 [13:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:05] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 [13:51:38] (03PS2) 10Ottomata: eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 (https://phabricator.wikimedia.org/T383814) [13:51:47] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum [13:52:18] (03CR) 10Clément Goubert: [C:03+1] eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 
(https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [13:52:54] urbanecm: i'm done thank you! i will be messing with a staging canary helm release here in a sec, but it won't do anything prod so please proceed [13:53:19] (03PS1) 10Jcrespo: backup: Reenable notifications for doc1004, disable for install2005 [puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) [13:53:20] sounds good, ty [13:54:55] (03CR) 10Urbanecm: [C:03+2] Revert "changeprop: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182502 (owner: 10Sergio Gimeno) [13:55:38] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11124076 (10cmooney) >>! In T402640#11121128, @ayounsi wrote: > If I understand correctly we currently get some "per rack" load balancing, where `E3` might... [13:55:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and mr1-ulsfo (198.35.26.199) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:55:48] (03CR) 10Jcrespo: "Let me know what you think." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) (owner: 10Jcrespo) [13:56:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81901 and previous config saved to /var/cache/conftool/dbconfig/20250827-135623-fceratto.json [13:56:28] (03Merged) 10jenkins-bot: Revert "changeprop: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182502 (owner: 10Sergio Gimeno) [13:57:16] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:57:48] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:57:51] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [13:58:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11124104 (10phaultfinder) [13:58:51] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:59:04] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1400) [14:00:15] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:00:24] !log urbanecm@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:00:28] (03PS1) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) [14:00:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr3-ulsfo and mr1-ulsfo (198.35.26.199) - group Management - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Management&var-bgp_neighbor=mr1-ulsfo - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:01:05] (03PS2) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) [14:01:53] (03CR) 10Muehlenhoff: "install2005 is now fully in service, I don't think we need to exempt it anymore." [puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) (owner: 10Jcrespo) [14:01:55] !log urbanecm@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:01:58] (03CR) 10Urbanecm: [C:03+2] Revert "changeprop beta: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182501 (owner: 10Sergio Gimeno) [14:02:01] (03PS1) 10Michael Große: GrowthExperiments: remove unused wgGENewcomerTasksTopicType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182578 [14:02:15] !log deleted eventgate-analytics staging canary release [14:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:21] (03PS11) 10Scott French: hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) [14:02:25] (03CR) 10Ottomata: [C:03+2] eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [14:02:46] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:03:19] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - 
https://phabricator.wikimedia.org/T381565#11124123 (10MoritzMuehlenhoff) The new maps servers are arriving this week, as such there'll be a slight change of plans; we don't reimage the old nodes any more, but rather install t... [14:03:58] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11124128 (10phaultfinder) [14:04:06] (03CR) 10Jforrester: [C:03+2] wikifunctions: Enable forthcoming wikidataImport feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182533 (https://phabricator.wikimedia.org/T402357) (owner: 10Jforrester) [14:04:24] (03Merged) 10jenkins-bot: Revert "changeprop beta: Decrease reenqueue_delay for Getting Started notif job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182501 (owner: 10Sergio Gimeno) [14:04:24] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:04:25] (03Merged) 10jenkins-bot: eventgate-analytics remove canary release from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119206 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [14:04:34] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:04:36] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11124130 (10MoritzMuehlenhoff) p:05Triage→03High [14:04:53] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:05:07] (03CR) 10Eevans: [C:03+2] Revert "data-gateway: enable debug logging" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1182220 (owner: 10Eevans) [14:05:10] (03CR) 10DCausse: [C:03+1] flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:05:12] (03PS1) 10Elukey: Deploy linux 6.12 from bookworm-backports on ml-serve101[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) [14:05:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P81902 and previous config saved to /var/cache/conftool/dbconfig/20250827-140524-ladsgroup.json [14:05:31] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:05:33] (03CR) 10Brouberol: [V:03+2 C:03+2] flink-operator: update operator to 1.12 in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182517 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:06:01] (03Merged) 10jenkins-bot: wikifunctions: Enable forthcoming wikidataImport feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182533 (https://phabricator.wikimedia.org/T402357) (owner: 10Jforrester) [14:06:14] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6773/console" [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:06:22] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:07:20] (03Merged) 10jenkins-bot: Revert "data-gateway: enable debug logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182220 (owner: 10Eevans) [14:08:16] (03PS2) 10Jcrespo: backup: Reenable notifications for 
doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) [14:08:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11124157 (10elukey) Note for the future - we decided to use bookworm for these nodes, forcing the install of a backported kernel to get more... [14:08:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10MinT: Q4:rack/setup/install ml-serve101[45] - https://phabricator.wikimedia.org/T400626#11124158 (10elukey) [14:08:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182538 (https://phabricator.wikimedia.org/T397403) (owner: 10Jforrester) [14:09:00] (03CR) 10Jcrespo: "Thanks for the info, amended." [puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) (owner: 10Jcrespo) [14:09:10] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11124162 (10FCeratto-WMF) 05Stalled→03In progress [14:09:25] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [14:09:39] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [14:10:05] (03Merged) 10jenkins-bot: Wikifunctions: Enable Wikidata input types in embedded calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182538 (https://phabricator.wikimedia.org/T397403) (owner: 10Jforrester) [14:10:29] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1182538|Wikifunctions: Enable Wikidata input types in embedded calls (T397403)]] [14:10:35] T397403: Add support for Wikidata items and Wikidata lexemes as function inputs - https://phabricator.wikimedia.org/T397403 [14:10:50] !log urbanecm@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply 
[14:11:01] !log urbanecm@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:11:12] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:11:18] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:11:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223', diff saved to https://phabricator.wikimedia.org/P81903 and previous config saved to /var/cache/conftool/dbconfig/20250827-141131-fceratto.json [14:11:46] (03CR) 10Brouberol: [C:03+1] dumps: remove dead links. [puppet] - 10https://gerrit.wikimedia.org/r/1182225 (https://phabricator.wikimedia.org/T402976) (owner: 10Xcollazo) [14:11:48] (03CR) 10Brouberol: [C:03+2] dumps: remove dead links. [puppet] - 10https://gerrit.wikimedia.org/r/1182225 (https://phabricator.wikimedia.org/T402976) (owner: 10Xcollazo) [14:12:26] !log brouberol@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:13:56] !log eevans@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [14:14:14] !log brouberol@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:15:03] !log eevans@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [14:16:28] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [14:16:55] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1182538|Wikifunctions: Enable Wikidata input types in embedded calls (T397403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:16:58] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [14:17:00] T397403: Add support for Wikidata items and Wikidata lexemes as function inputs - https://phabricator.wikimedia.org/T397403 [14:17:34] (03PS1) 10Sergio Gimeno: changeprop: add rule for notificationReEngageJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182583 (https://phabricator.wikimedia.org/T400118) [14:18:24] !log jforrester@deploy1003 jforrester: Continuing with sync [14:18:30] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [14:18:51] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [14:19:07] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:19:09] FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:19:50] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182518 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:20:33] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P81905 and previous config saved to /var/cache/conftool/dbconfig/20250827-142032-ladsgroup.json [14:20:37] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182573 (https://phabricator.wikimedia.org/T123918) (owner: 10MVernon) [14:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:14] (03PS2) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-08-20-203801 to 2025-08-26-213211 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182534 (https://phabricator.wikimedia.org/T395475) [14:21:15] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-08-20-210742 to 2025-08-25-145906 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182535 (https://phabricator.wikimedia.org/T395475) [14:21:15] (03PS1) 10Jforrester: wikifunctions: JSON spec, I hate you [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182585 [14:21:32] (03CR) 10Jforrester: [C:03+2] wikifunctions: JSON spec, I hate you [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182585 (owner: 10Jforrester) [14:21:41] !log brouberol@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:22:56] !log brouberol@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:23:18] (03Merged) 10jenkins-bot: wikifunctions: JSON spec, I hate you [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182585 (owner: 10Jforrester) [14:24:06] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182538|Wikifunctions: Enable Wikidata input types in embedded calls (T397403)]] (duration: 13m 37s) [14:24:09] RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:24:10] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:24:13] T397403: Add support for Wikidata items and Wikidata lexemes as function inputs - https://phabricator.wikimedia.org/T397403 [14:24:16] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum1002.eqiad.wmnet with OS trixie [14:24:23] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:24:34] (03CR) 10MVernon: "I would appreciate a puppet-expert review of this change, please! It _shouldn't_ have any impact on the running systems..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182573 (https://phabricator.wikimedia.org/T123918) (owner: 10MVernon) [14:25:03] (03CR) 10Brouberol: profile::base: add an option to install linux 6.12 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:25:08] (03PS3) 10Jforrester: Enable Wikifunctions client mode on Wiktionaries, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172046 (https://phabricator.wikimedia.org/T397401) [14:25:26] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182519 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:26:05] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:26:38] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:26:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2223 (T401906)', diff saved to https://phabricator.wikimedia.org/P81906 and previous config saved to /var/cache/conftool/dbconfig/20250827-142638-fceratto.json [14:26:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2228.codfw.wmnet with reason: Maintenance [14:26:46] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [14:26:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2228 (T401906)', diff saved to 
https://phabricator.wikimedia.org/P81907 and previous config saved to /var/cache/conftool/dbconfig/20250827-142650-fceratto.json [14:26:52] jouncebot: nowandnext [14:26:52] For the next 0 hour(s) and 33 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1400) [14:26:52] In 0 hour(s) and 3 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1430) [14:26:56] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:27:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:27:30] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:27:52] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-08-20-203801 to 2025-08-26-213211 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182534 (https://phabricator.wikimedia.org/T395475) (owner: 10Jforrester) [14:28:30] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:29:19] (03CR) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:29:32] (03CR) 10Brouberol: [C:03+2] flink-operator: update operator to 1.12 by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182520 (https://phabricator.wikimedia.org/T398162) (owner: 10Brouberol) [14:29:42] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-08-20-203801 to 2025-08-26-213211 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182534 (https://phabricator.wikimedia.org/T395475) (owner: 10Jforrester) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1430) [14:30:18] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:30:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T401906)', diff saved to https://phabricator.wikimedia.org/P81908 and previous config saved to /var/cache/conftool/dbconfig/20250827-143024-fceratto.json [14:30:26] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11124308 (10MatthewVernon) 05Resolved→03Open ms-be1066 alerted again today for disk space, I deployed [[ https://phabricator.wikime... [14:31:10] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:31:19] (03CR) 10DCausse: [C:03+1] "Balthazar just deployed the new version of the operator everywhere, this should be good to go." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1182503 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer) [14:31:25] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:31:42] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS bookworm [14:31:48] (03CR) 10Muehlenhoff: "One suggestion in line, which is a little cleaner as it's explicit and not just a side effect of the current apt config." [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:32:11] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:32:17] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:32:19] (03CR) 10Urbanecm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182583 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno) [14:33:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:33:50] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11124335 (10Ladsgroup) FWIW, I think we have a new issue that adds a lot of data to random backends. See https://grafana.wikimedia.org/... 
[14:33:56] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:34:06] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-08-20-210742 to 2025-08-25-145906 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182535 (https://phabricator.wikimedia.org/T395475) (owner: 10Jforrester) [14:34:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172046 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [14:34:34] (03PS1) 10Dreamy Jazz: Temp accounts: Disable logged out editing on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182587 (https://phabricator.wikimedia.org/T403067) [14:34:44] jouncebot: nowandnext [14:34:45] For the next 0 hour(s) and 25 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1400) [14:34:45] For the next 0 hour(s) and 25 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1430) [14:34:45] In 2 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1700) [14:35:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T402925)', diff saved to https://phabricator.wikimedia.org/P81909 and previous config saved to /var/cache/conftool/dbconfig/20250827-143539-ladsgroup.json [14:35:45] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:35:55] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [14:36:01] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-08-20-210742 to 2025-08-25-145906 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1182535 (https://phabricator.wikimedia.org/T395475) (owner: 10Jforrester) [14:36:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T402925)', diff saved to https://phabricator.wikimedia.org/P81910 and previous config saved to /var/cache/conftool/dbconfig/20250827-143602-ladsgroup.json [14:36:17] (03Merged) 10jenkins-bot: Enable Wikifunctions client mode on Wiktionaries, Part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1172046 (https://phabricator.wikimedia.org/T397401) (owner: 10Jforrester) [14:36:43] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1172046|Enable Wikifunctions client mode on Wiktionaries, Part I (T397401)]] [14:36:48] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [14:37:58] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:38:20] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:38:43] !log starting etcd cfssl-PKI migration in eqiad - T352245 [14:38:43] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:47] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [14:39:18] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:39:22] moritzm: I'll be starting with conf1009. as yesterday, I'll give you a heads-up when the nginx restart can proceed. 
[14:39:25] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:39:53] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:40:17] * swfrench-wmf checks `etcd tlsproxy SSL` downtimes have started [14:40:40] (03CR) 10Scott French: [C:03+2] hieradata: use cfssl/pki for nginx on all configcluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1090586 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:42:46] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1172046|Enable Wikifunctions client mode on Wiktionaries, Part I (T397401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:42:50] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11124378 (10elukey) I repooled codfw today, so we are now running with this environment: * kartotherian and tegola in eqiad are using the maps1* hosts, running Buster. * kartotherian... [14:42:52] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. - https://phabricator.wikimedia.org/T397401 [14:43:10] !log jforrester@deploy1003 jforrester: Continuing with sync [14:43:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11124384 (10FCeratto-WMF) [14:43:41] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T402938#11124391 (10Jclark-ctr) @wiki_willy The error Returned in Dmesg. The best option might be to purchase a 25G Broadcom NIC to avoid future problems wi... 
[14:44:04] (03PS1) 10Federico Ceratto: data.yaml: Add dsaez to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1182588 (https://phabricator.wikimedia.org/T400344) [14:45:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81911 and previous config saved to /var/cache/conftool/dbconfig/20250827-144532-fceratto.json [14:46:16] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T402925)', diff saved to https://phabricator.wikimedia.org/P81912 and previous config saved to /var/cache/conftool/dbconfig/20250827-144615-ladsgroup.json [14:46:21] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [14:46:29] moritzm: you should be good to go to restart on conf1009. vgutierrez FYI [14:47:28] on it [14:48:19] all worker threads on 1009 are running the new binary [14:48:23] (03PS3) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1182574 (https://phabricator.wikimedia.org/T402611) [14:48:23] (03CR) 10Arnaudb: [C:03+1] "this iteration works as intended" [puppet] - 10https://gerrit.wikimedia.org/r/1182574 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [14:48:26] awesome, thank you! [14:48:28] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1172046|Enable Wikifunctions client mode on Wiktionaries, Part I (T397401)]] (duration: 11m 45s) [14:48:33] T397401: If we follow Parsoid’s rollout and integrate Wikifunctions on most Wiktionaries and some low-traffic Wikipedias, we will get the testing we need to confidently roll out to larger wikis. 
- https://phabricator.wikimedia.org/T397401 [14:48:42] !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [14:49:21] vgutierrez: if there's anything you'd like to verify around liberica cp daemons after the nginx restart on conf1009, now is a good time [14:49:33] * vgutierrez looking [14:50:01] swfrench-wmf: yeah.. we got the expected dance, but looking good :D [14:50:14] vgutierrez: awesome, thank you! [14:51:05] moritzm: the next will be conf1007, which is also the pybal host. I'll give you a heads-up when we're ready. vgutierrez FYI [14:51:58] ack [14:52:16] ack [14:53:46] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum1002.eqiad.wmnet with reason: host reimage [14:54:39] moritzm: all yours on conf1007 [14:54:55] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [14:55:08] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11124480 (10VRiley-WMF) Verified to try to avoid any racks that currently have es hosts. 
[14:55:22] * swfrench-wmf is watching pybal checks [14:57:41] (03PS3) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) [14:57:41] (03PS2) 10Elukey: Deploy linux 6.12 from bookworm-backports on ml-serve101[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) [14:58:07] (03CR) 10CI reject: [V:04-1] profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:58:12] (03CR) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:58:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [14:58:18] swfrench-wmf: pybal reconnecting :D [14:58:20] swfrench-wmf: all nginx worker threads are refreshed [14:58:40] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6774/console" [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [14:58:44] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T402938#11124506 (10wiki_willy) ++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one? >>! In T402938#11124390, @Jclark-ctr wrote: > @wi... [14:58:52] moritzm: vgutierrez: ack, watching - thank you! 
[15:00:11] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 11 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [15:00:21] (03PS4) 10Elukey: profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) [15:00:21] (03PS3) 10Elukey: Deploy linux 6.12 from bookworm-backports on ml-serve101[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) [15:00:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228', diff saved to https://phabricator.wikimedia.org/P81913 and previous config saved to /var/cache/conftool/dbconfig/20250827-150039-fceratto.json [15:00:41] ^ PROBLEM - PyBal connections to etcd is partially expected, but monitoring [15:01:01] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 17 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [15:01:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P81914 and previous config saved to /var/cache/conftool/dbconfig/20250827-150123-ladsgroup.json [15:01:37] swfrench-wmf: it looks like both lvs1018 and lvs1019 would require a restart [15:01:39] vgutierrez: we need restarts [15:01:47] yeah, bad index [15:01:49] on it [15:01:56] at least liberica recovers from that :) [15:03:06] (03CR) 10Cyndywikime: [C:03+1] GrowthExperiments: remove unused wgGENewcomerTasksTopicType [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182578 (owner: 10Michael Große) [15:03:12] (03PS1) 10Federico Ceratto: preseed.yaml: Remove es2049-es2057 [puppet] - 10https://gerrit.wikimedia.org/r/1182592 (https://phabricator.wikimedia.org/T402859) [15:03:22] (03PS1) 10Federico Ceratto: es2049.yaml, site.pp: Prepare es2049 to replace es2026 
[puppet] - 10https://gerrit.wikimedia.org/r/1182593 (https://phabricator.wikimedia.org/T402859) [15:03:38] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1013.eqiad.wmnet with OS bookworm [15:04:22] (03CR) 10Brouberol: [C:03+1] profile::base: add an option to install linux 6.12 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:04:25] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [15:04:30] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:37] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 18 connections established with conf1007.eqiad.wmnet:4001 (min=18) https://wikitech.wikimedia.org/wiki/PyBal [15:04:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqiad (T352245) [15:05:19] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [15:06:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182579 
(https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:06:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182582 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:08:35] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 83 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [15:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:47] \o/ [15:09:00] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T402938#11124630 (10RobH) >>! In T402938#11124505, @wiki_willy wrote: > ++ @RobH - can you work with John on getting a 25g Broadcom NIC for this one? > >>>! In... [15:09:15] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad (T352245) [15:10:02] moritzm: alright, not that that's cleaned up, the next and final host will be conf1008. I'll ping you when ready. [15:10:59] ack [15:12:07] (03CR) 10Sergio Gimeno: [C:03+2] changeprop: add rule for notificationReEngageJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182583 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno) [15:12:13] nice [15:12:23] moritzm: good to go on conf1008. 
vgutierrez FYI [15:12:49] heh, just realized I meant to say _now_ that that's cleaned up [15:12:51] lol [15:13:24] (03CR) 10Muehlenhoff: [C:03+1] profile::base: add an option to install linux 6.12 on Bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182579 (https://phabricator.wikimedia.org/T393948) (owner: 10Elukey) [15:13:58] swfrench-wmf: all nginx worker threads are refreshed on 1008 [15:14:09] (03Merged) 10jenkins-bot: changeprop: add rule for notificationReEngageJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182583 (https://phabricator.wikimedia.org/T400118) (owner: 10Sergio Gimeno) [15:14:09] amazing. thank you! [15:15:29] I'm planning to do changeprop-jobqueue deploy, unless there are any objections [15:15:41] alright, that should wrap things up - I'll get started on the confd "restart 60% of the world" [15:15:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2228 (T401906)', diff saved to https://phabricator.wikimedia.org/P81915 and previous config saved to /var/cache/conftool/dbconfig/20250827-151546-fceratto.json [15:15:52] T401906: Add default value for afl_ip and remove default value for afl_ip_hex in abuse_filter_log table - https://phabricator.wikimedia.org/T401906 [15:16:16] !log sgimeno@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:16:31] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P81916 and previous config saved to /var/cache/conftool/dbconfig/20250827-151630-ladsgroup.json [15:20:12] (03PS1) 10Zabe: Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182595 (https://phabricator.wikimedia.org/T403069) [15:20:27] (03PS1) 10Zabe: Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182596 
(https://phabricator.wikimedia.org/T403069) [15:20:57] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11124698 (10MatthewVernon) It would have to be adding a lot of objects (not necessarily data) to be filling up `sda3` (which has contai... [15:21:09] !log sgimeno@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:22:26] !log sgimeno@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:23:44] 06SRE, 06Traffic, 10API Platform (RESTBase Deprecation Roadmap): Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423#11124709 (10Tgr) Obsoleted by {T400119}, probably? [15:24:00] !log sgimeno@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:24:09] James_F: there are bursts of wikifunctionsclient_usage doesn't exist errors, are they expected? 
[15:24:34] !log sgimeno@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:25:26] (created T403079) [15:25:27] T403079: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'tpiwiktionary.wikifunctionsclient_usage' doesn't existFunction: MediaWiki\Extension\WikiLambda\WikifunctionsClientStore::fetchWikifunctionsUsageQuery: SELECT wfcu_targetPage A - https://phabricator.wikimedia.org/T403079 [15:25:27] (03PS1) 10Ssingh: Release 0.9.8-1+wmf13u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1182599 (https://phabricator.wikimedia.org/T401832) [15:25:39] !log sgimeno@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:27:25] jouncebot: nowandnext [15:27:25] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [15:27:25] In 1 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1700) [15:28:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:29:57] (03CR) 10Zabe: [C:03+2] Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182595 (https://phabricator.wikimedia.org/T403069) (owner: 10Zabe) [15:30:00] (03CR) 10Zabe: [C:03+2] Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/1182596 (https://phabricator.wikimedia.org/T403069) (owner: 10Zabe) [15:31:24] (03Merged) 10jenkins-bot: Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182595 (https://phabricator.wikimedia.org/T403069) (owner: 10Zabe) [15:31:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T402925)', diff saved to https://phabricator.wikimedia.org/P81917 and previous config saved to /var/cache/conftool/dbconfig/20250827-153138-ladsgroup.json [15:31:44] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:31:48] (03Merged) 10jenkins-bot: Use cl_timestamp_id instead of cl_timestamp [extensions/UploadWizard] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182596 (https://phabricator.wikimedia.org/T403069) (owner: 10Zabe) [15:31:54] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [15:32:22] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1182595|Use cl_timestamp_id instead of cl_timestamp (T403069)]], [[gerrit:1182596|Use cl_timestamp_id instead of cl_timestamp (T403069)]] [15:32:27] T403069: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T403069 [15:33:08] (03CR) 10Vgutierrez: P:puppetserver::volatile generate datacenter database (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181090 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [15:35:00] (03PS2) 10Ssingh: Release 0.9.8-1+wmf13u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1182599 (https://phabricator.wikimedia.org/T401832) [15:35:43] (03CR) 10STran: [C:03+1] Temp accounts: Disable logged out editing on 
wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182587 (https://phabricator.wikimedia.org/T403067) (owner: 10Dreamy Jazz) [15:36:06] jouncebot: nowandnext [15:36:06] No deployments scheduled for the next 1 hour(s) and 23 minute(s) [15:36:06] In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1700) [15:36:25] (03PS1) 10Muehlenhoff: profile::etcd::tlsproxy: Remove use of non-PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1182602 (https://phabricator.wikimedia.org/T352245) [15:36:33] zabe: Can you ping me when you are done? I'd like to deploy a config change [15:36:39] Sure [15:36:50] Thanks, and also thanks for your work on rc_source changes [15:36:54] !log zabe@deploy1003 zabe: Backport for [[gerrit:1182595|Use cl_timestamp_id instead of cl_timestamp (T403069)]], [[gerrit:1182596|Use cl_timestamp_id instead of cl_timestamp (T403069)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:37:19] yw, I still have to figure out if I just want to drop the recentchanges stuff from flow [15:37:26] it would make all our lives easier [15:37:43] (03PS1) 10Urbanecm: urbanecm's dotfiles: Remove proxy vars by defauilt [puppet] - 10https://gerrit.wikimedia.org/r/1182603 [15:37:46] and flow is not creating rc entries anymore anyway (unless I misunderstood the undeployment tasks) [15:37:49] !log zabe@deploy1003 zabe: Continuing with sync [15:37:56] That's a good point.
[15:39:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1182599 (https://phabricator.wikimedia.org/T401832) (owner: 10Ssingh) [15:40:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182602 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [15:40:17] (03CR) 10Ssingh: [C:03+2] Release 0.9.8-1+wmf13u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/1182599 (https://phabricator.wikimedia.org/T401832) (owner: 10Ssingh) [15:43:14] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182595|Use cl_timestamp_id instead of cl_timestamp (T403069)]], [[gerrit:1182596|Use cl_timestamp_id instead of cl_timestamp (T403069)]] (duration: 10m 52s) [15:43:20] T403069: Expectation (readQueryTime <= 5) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: {actualSeconds}) in trx #{trxId}:{query} - https://phabricator.wikimedia.org/T403069 [15:43:20] Dreamy_Jazz: done [15:43:26] Thanks [15:43:42] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [15:43:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182587 (https://phabricator.wikimedia.org/T403067) (owner: 10Dreamy Jazz) [15:43:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T402925)', diff saved to https://phabricator.wikimedia.org/P81918 and previous config saved to /var/cache/conftool/dbconfig/20250827-154350-ladsgroup.json [15:43:55] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:44:26] (03PS6) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 [15:44:45] (03Merged) 10jenkins-bot: Temp accounts: 
Disable logged out editing on wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182587 (https://phabricator.wikimedia.org/T403067) (owner: 10Dreamy Jazz) [15:45:11] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1182587|Temp accounts: Disable logged out editing on wikimaniawiki (T403067)]] [15:45:16] T403067: Temporary accounts: $wmgDisableAccountCreation causes temporary account autocreation to fail - https://phabricator.wikimedia.org/T403067 [15:45:52] !log finished etcd cfssl-PKI migration in eqiad - T352245 [15:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:57] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:50:29] (03PS1) 10Papaul: Update peer-as on mr1-ulsfo to 14907 [homer/public] - 10https://gerrit.wikimedia.org/r/1182608 (https://phabricator.wikimedia.org/T294845) [15:51:02] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1182587|Temp accounts: Disable logged out editing on wikimaniawiki (T403067)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:51:07] T403067: Temporary accounts: $wmgDisableAccountCreation causes temporary account autocreation to fail - https://phabricator.wikimedia.org/T403067 [15:51:21] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [15:52:42] (03CR) 10Papaul: [C:03+2] Update peer-as on mr1-ulsfo to 14907 [homer/public] - 10https://gerrit.wikimedia.org/r/1182608 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [15:53:29] (03PS3) 10Dzahn: data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) (owner: 10Federico Ceratto) [15:54:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T402925)', diff saved to https://phabricator.wikimedia.org/P81920 and previous config saved to /var/cache/conftool/dbconfig/20250827-155405-ladsgroup.json [15:54:08] !log reprepro -C main include trixie-wikimedia anycast-healthchecker_0.9.8-1+wmf13u1_amd64.changes: T401832 [15:54:11] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [15:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:16] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [15:54:44] (03CR) 10Dzahn: [C:03+1] "lgtm (took the liberty to remove blank line from commit message to fix CI vote)" [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) (owner: 10Federico Ceratto) [15:55:47] (03CR) 10Dzahn: [C:03+1] data.yaml: Add dsaez to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1182588 (https://phabricator.wikimedia.org/T400344) (owner: 10Federico Ceratto) [15:56:37] zabe: Yeah, oops, sorry about that. Forgot the DB creation step. 
[15:57:17] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [15:58:15] !log jforrester@deploy1003:~$ foreachwikiindblist wikifunctionsclient sql /srv/mediawiki-staging/php-1.45.0-wmf.16/extensions/WikiLambda/sql/mysql/table-usage.sql # T403079 [15:58:19] (03CR) 10Dzahn: [C:03+1] backup: Reenable notifications for doc1004 [puppet] - 10https://gerrit.wikimedia.org/r/1182576 (https://phabricator.wikimedia.org/T392130) (owner: 10Jcrespo) [15:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:20] T403079: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'tpiwiktionary.wikifunctionsclient_usage' doesn't existFunction: MediaWiki\Extension\WikiLambda\WikifunctionsClientStore::fetchWikifunctionsUsageQuery: SELECT wfcu_targetPage A - https://phabricator.wikimedia.org/T403079 [16:01:50] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum1002.eqiad.wmnet with OS trixie [16:02:40] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182587|Temp accounts: Disable logged out editing on wikimaniawiki (T403067)]] (duration: 17m 29s) [16:02:45] T403067: Temporary accounts: $wmgDisableAccountCreation causes temporary account autocreation to fail - https://phabricator.wikimedia.org/T403067 [16:03:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:06:53] (03PS1) 10Ssingh: dumps: remove reference to DVD page [puppet] - 10https://gerrit.wikimedia.org/r/1182611 (https://phabricator.wikimedia.org/T402976) [16:07:38] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6776/co" [puppet] - 10https://gerrit.wikimedia.org/r/1182611 (https://phabricator.wikimedia.org/T402976) (owner: 10Ssingh) [16:08:53] 06SRE, 06Traffic, 
13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11125111 (10hashar) [16:08:56] (03CR) 10FNegri: [C:03+1] dumps: remove reference to DVD page [puppet] - 10https://gerrit.wikimedia.org/r/1182611 (https://phabricator.wikimedia.org/T402976) (owner: 10Ssingh) [16:08:59] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11125114 (10hashar) This prevent us from updating the CI Jenkins. That is done over the API https://integration.wikimedia.org/ci/ using authentication. Ca... [16:09:12] (03CR) 10Ssingh: [V:03+1 C:03+2] dumps: remove reference to DVD page [puppet] - 10https://gerrit.wikimedia.org/r/1182611 (https://phabricator.wikimedia.org/T402976) (owner: 10Ssingh) [16:09:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P81922 and previous config saved to /var/cache/conftool/dbconfig/20250827-160912-ladsgroup.json [16:11:12] (03PS1) 10Papaul: Allow bgp for security zone production on interface facing the cr3/cr4 [homer/public] - 10https://gerrit.wikimedia.org/r/1182612 (https://phabricator.wikimedia.org/T294845) [16:13:04] (03CR) 10Papaul: [C:03+2] Allow bgp for security zone production on interface facing the cr3/cr4 [homer/public] - 10https://gerrit.wikimedia.org/r/1182612 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [16:14:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_atftpd.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:24:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P81923 and previous config saved to 
/var/cache/conftool/dbconfig/20250827-162420-ladsgroup.json [16:24:43] (03PS2) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) [16:24:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:27:21] (03CR) 10STran: "Does this mean we're not currently running `purge_temporary_accounts`?" [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:29:05] (03CR) 10Dreamy Jazz: "We are running it, because AFAICS `temporary_accounts.pp` is listed in those files: https://codesearch.wmcloud.org/search/?q=temporary_acc" [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:29:23] (03CR) 10Dreamy Jazz: "(Now that it's not in a different file, this can be resolved)." [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:29:39] (03CR) 10Xcollazo: "My bad, thanks for the fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1182611 (https://phabricator.wikimedia.org/T402976) (owner: 10Ssingh) [16:30:00] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:31:56] (03CR) 10Dreamy Jazz: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:32:21] (03CR) 10BCornwall: [V:03+1] "0 tests failed, 0 tests skipped, 40 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: 10Krinkle) [16:33:04] (03PS1) 10CDanis: tunnelencabulator: add integration.wm.o [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1182613 [16:33:30] (03PS2) 10CDanis: tunnelencabulator: add integration.wm.o [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1182613 (https://phabricator.wikimedia.org/T403089) [16:34:02] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:35:12] (03PS3) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) [16:35:40] (03CR) 10CI reject: [V:04-1] mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:35:47] (03CR) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:36:16] (03PS4) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 
(https://phabricator.wikimedia.org/T375115) [16:37:26] (03CR) 10Dreamy Jazz: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [16:38:09] (03CR) 10Ssingh: [C:03+1] "I think it can't harm to add this for later as well." [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1182613 (https://phabricator.wikimedia.org/T403089) (owner: 10CDanis) [16:39:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T402925)', diff saved to https://phabricator.wikimedia.org/P81924 and previous config saved to /var/cache/conftool/dbconfig/20250827-163928-ladsgroup.json [16:39:35] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [16:39:35] FIRING: [2x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:39:41] (03CR) 10CDanis: [V:03+2 C:03+2] tunnelencabulator: add integration.wm.o [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1182613 (https://phabricator.wikimedia.org/T403089) (owner: 10CDanis) [16:39:45] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:42:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:49:51] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [16:57:53] 
RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:58:46] !log upgrading envoyproxy on people* hosts T402584 [16:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:52] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T1700) [17:05:12] !log upgrading envoyproxy on aphlict* and zuul* hosts T402584 [17:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:18] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:06:43] (03PS1) 10KCVelaga: Disable User Agent collection for MinT for Readers streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) [17:07:41] (03PS2) 10KCVelaga: Disable User Agent collection for MinT for Readers streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) [17:10:57] 06SRE, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11125437 (10Krinkle) [17:12:58] 06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11125446 (10Krinkle) I believe this used to work in Turnilo via the `webrequest_sampled_128` dataset. Phabricator history seems to confir... 
[17:14:36] RECOVERY - TFTP service on install2004 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [17:16:10] RECOVERY - Squid on install2004 is OK: TCP OK - 0.031 second response time on 208.80.153.105 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [17:18:26] (03CR) 10Scott French: [C:03+1] "Thank you, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1182602 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [17:18:39] RESOLVED: [2x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:19:01] (03CR) 10Scott French: [C:03+2] profile::etcd::tlsproxy: Remove use of non-PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/1182602 (https://phabricator.wikimedia.org/T352245) (owner: 10Muehlenhoff) [17:19:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_atftpd.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:55] FIRING: [2x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:22] 06SRE, 06Data-Engineering, 06Traffic: Add pageview information to turnilo's webrequest_sampled_live (is_pageview is always "-") - https://phabricator.wikimedia.org/T402612#11125465 (10CDanis) I'm not sure if there's an easy way to re-use the logic already in Refine given the current setup of webrequest_sampl... 
[17:21:02] !log upgrading envoyproxy on releases* and planet* hosts T402584 [17:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:07] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:22:10] RECOVERY - HTTP on install2004 is OK: HTTP OK: HTTP/1.1 200 OK - 244 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Install_servers [17:23:10] (03PS1) 10Majavah: aptly: Remove use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1182622 [17:23:35] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: knowledgepark@lists.wikimedia.org - https://phabricator.wikimedia.org/T402280#11125471 (10FCeratto-WMF) Hello, the mailing list has been created with owner `zi.jony93@gmail.com` - please configure the secondary email address. You can update any other... [17:23:48] (03CR) 10Majavah: [C:03+2] aptly: Remove use of legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1182622 (owner: 10Majavah) [17:25:31] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: wikilovesramadan@lists.wikimedia.org - https://phabricator.wikimedia.org/T402279#11125473 (10FCeratto-WMF) Hello, the mailing list has been created with owner zi.jony93@gmail.com - please configure the three secondary email addresses. You can update a... 
[17:27:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) (owner: 10Anzx) [17:28:11] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11125489 (10Jhancock.wm) a:03Jhancock.wm [17:29:55] RESOLVED: [2x] SystemdUnitFailed: squid-logrotate.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:31:31] (03PS1) 10Majavah: puppetmaster: git-sync-upstream: Set custom user-agent [puppet] - 10https://gerrit.wikimedia.org/r/1182624 [17:32:13] (03CR) 10CI reject: [V:04-1] puppetmaster: git-sync-upstream: Set custom user-agent [puppet] - 10https://gerrit.wikimedia.org/r/1182624 (owner: 10Majavah) [17:32:43] (03PS2) 10Majavah: puppetmaster: git-sync-upstream: Set custom user-agent [puppet] - 10https://gerrit.wikimedia.org/r/1182624 [17:34:34] (03CR) 10CDanis: [C:03+1] puppetmaster: git-sync-upstream: Set custom user-agent [puppet] - 10https://gerrit.wikimedia.org/r/1182624 (owner: 10Majavah) [17:35:00] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [17:35:01] (03CR) 10Majavah: [C:03+2] puppetmaster: git-sync-upstream: Set custom user-agent [puppet] - 10https://gerrit.wikimedia.org/r/1182624 (owner: 10Majavah) [17:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:39] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera 
data: "Triggered by cookbooks.sre.dns.netbox: adding maps2011-2014 to codfw - jhancock@cumin1003" [17:38:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding maps2011-2014 to codfw - jhancock@cumin1003" [17:38:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:38:59] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps2011 [17:38:59] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps2012 [17:39:00] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps2013 [17:39:01] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host maps2014 [17:39:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2011 [17:39:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2012 [17:39:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2013 [17:39:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host maps2014 [17:39:42] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:40:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:40:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:40:38] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host 
maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:42:12] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 13Patch-For-Review: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#11125570 (10taavi) [17:43:20] (03CR) 10Majavah: "This came up when provisioning a new Trixie VM since DSA keys are no longer supported there. So we should get this moving again, sorry we'" [puppet] - 10https://gerrit.wikimedia.org/r/989993 (https://phabricator.wikimedia.org/T177371) (owner: 10Muehlenhoff) [17:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:44:36] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: knowledgepark@lists.wikimedia.org - https://phabricator.wikimedia.org/T402280#11125583 (10FCeratto-WMF) 05Open→03Resolved a:03FCeratto-WMF Closing the task, if there's any issue please comment and reopen it. [17:44:50] 06SRE, 10Wikimedia-Mailing-lists: Create new mailing lists: wikilovesramadan@lists.wikimedia.org - https://phabricator.wikimedia.org/T402279#11125586 (10FCeratto-WMF) 05Open→03Resolved a:03FCeratto-WMF Closing the task, if there's any issue please comment and reopen it. 
[17:45:55] 06SRE, 10envoy, 06serviceops, 06Traffic: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101 (10RLazarus) 03NEW [17:46:15] !log upgrading envoyproxy on doc* and etherpad* hosts T402584 [17:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:20] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [17:51:40] (03PS1) 10Majavah: aptrepo: Refresh Grafana signing key [puppet] - 10https://gerrit.wikimedia.org/r/1182625 [17:52:35] (03PS2) 10Majavah: aptrepo: Refresh Grafana signing key [puppet] - 10https://gerrit.wikimedia.org/r/1182625 [17:53:08] jhancock@cumin1003 provision (PID 3977180) is awaiting input [17:53:35] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1182625 (owner: 10Majavah) [17:54:43] jhancock@cumin1003 provision (PID 3977123) is awaiting input [17:55:13] (03CR) 10Majavah: [C:03+2] aptrepo: Refresh Grafana signing key [puppet] - 10https://gerrit.wikimedia.org/r/1182625 (owner: 10Majavah) [17:55:23] jhancock@cumin1003 provision (PID 3977147) is awaiting input [17:55:39] (03CR) 10Krinkle: "From the logs we already know there weren't any request with these user agents, but as gut/baseline check, here's the 24 hours after the c" [puppet] - 10https://gerrit.wikimedia.org/r/1180228 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [17:55:39] (03PS2) 10Bartosz Dziewoński: Set $wgPHPSessionHandling to 'disable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [17:55:47] jhancock@cumin1003 provision (PID 3977206) is awaiting input [17:57:16] (03CR) 10Bartosz Dziewoński: [C:03+1] "All dependencies are resolved now. Is it okay to just deploy this, or do we want some kind of phased schedule?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [18:03:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11125742 (10phaultfinder) [18:04:29] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:04:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:04:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:05:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:05:23] !log upgrading envoyproxy on phab2002, lists2001, contint2002 T402584 [18:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:28] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [18:07:55] !log reprepro: copy helmfile and helm-diff to trixie-wikimedia [18:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11125759 (10phaultfinder) [18:13:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T402938#11125792 (10RobH) [18:19:34] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2011.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:19:49] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:29] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2011.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:45] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2012.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:20:59] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:20:59] !log Upgrade envoyproxy on vrts2002 T402584 [18:21:01] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['maps2011'] [18:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:07] T402584: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584 [18:21:14] !log jhancock@cumin1003 END (FAIL) 
- Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['maps2011'] [18:21:25] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host maps2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:24:46] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host maps2011.codfw.wmnet with OS bookworm [18:25:00] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host maps2012.codfw.wmnet with OS bookworm [18:25:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11125886 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host maps2011.codfw.wmnet with OS bookworm [18:25:06] (03CR) 10Ssingh: "I think after this last comment we should merge it and then fix whatever new thing that comes up later!" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [18:25:13] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host maps2013.codfw.wmnet with OS bookworm [18:25:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11125888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host maps2012.codfw.wmnet with OS bookworm [18:25:22] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11125889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host maps2013.codfw.wmnet with OS bookworm [18:25:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host maps2014.codfw.wmnet with OS bookworm [18:25:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - 
https://phabricator.wikimedia.org/T400637#11125890 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host maps2014.codfw.wmnet with OS bookworm [18:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:29:48] (03CR) 10Ottomata: [C:03+1] Disable User Agent collection for MinT for Readers streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182621 (https://phabricator.wikimedia.org/T398057) (owner: 10KCVelaga) [18:34:32] (03CR) 10Gergő Tisza: "I'd go group by group, yeah." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144497 (https://phabricator.wikimedia.org/T362324) (owner: 10Gergő Tisza) [18:35:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) (owner: 10Ebernhardson) [18:35:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) (owner: 10Ebernhardson) [18:40:34] 06SRE, 10envoy, 06serviceops: Envoy config updates from v1.26 - https://phabricator.wikimedia.org/T403101#11125946 (10ssingh) [18:41:03] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11125948 (10ssingh) [18:41:15] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11125960 
(10ssingh) [18:41:57] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#11125964 (10ssingh) For awareness: I checked with @RLazarus and removing the Traffic tag. We can add back later as required. [18:42:27] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host maps2011.codfw.wmnet with OS bookworm [18:42:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11125967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host maps2011.codfw.wmnet with OS bookworm executed with errors: - maps2011 (**FA... [18:43:30] (03CR) 10Federico Ceratto: [C:03+2] data.yaml: Add dsaez to analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1182588 (https://phabricator.wikimedia.org/T400344) (owner: 10Federico Ceratto) [18:45:05] (03CR) 10Federico Ceratto: [C:03+1] data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) (owner: 10Federico Ceratto) [18:45:25] (03CR) 10Federico Ceratto: [C:03+2] data.yaml Add mszwarc (Marcin Szwarc) to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1182543 (https://phabricator.wikimedia.org/T402779) (owner: 10Federico Ceratto) [18:46:17] (03PS1) 10AOkoth: aptrepo: bump gitlab version [puppet] - 10https://gerrit.wikimedia.org/r/1182628 (https://phabricator.wikimedia.org/T403115) [18:47:18] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11125988 (10FCeratto-WMF) 05In progress→03Resolved The request has been processed, closing task. 
[18:48:30] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11125995 (10FCeratto-WMF) 05In progress→03Resolved a:03FCeratto-WMF The request has been processed, closing task. [18:51:30] 06SRE, 06Traffic: Add unique error IDs to 4xx responses - https://phabricator.wikimedia.org/T330973#11126015 (10ssingh) 05Open→03Resolved a:03ssingh This is now done in https://gitlab.wikimedia.org/repos/sre/hiddenparma/-/commit/9f1bd99010be4d84e643bc6ef038c28efb98a092 automatically so I am taking th... [18:55:09] (03CR) 10Dzahn: [C:03+1] aptrepo: bump gitlab version [puppet] - 10https://gerrit.wikimedia.org/r/1182628 (https://phabricator.wikimedia.org/T403115) (owner: 10AOkoth) [18:55:39] (03PS2) 10AOkoth: aptrepo: bump gitlab version [puppet] - 10https://gerrit.wikimedia.org/r/1182628 (https://phabricator.wikimedia.org/T403115) [18:56:26] (03CR) 10Dzahn: [C:03+1] aptrepo: bump gitlab version [puppet] - 10https://gerrit.wikimedia.org/r/1182628 (https://phabricator.wikimedia.org/T403115) (owner: 10AOkoth) [18:58:05] (03CR) 10AOkoth: [C:03+2] aptrepo: bump gitlab version [puppet] - 10https://gerrit.wikimedia.org/r/1182628 (https://phabricator.wikimedia.org/T403115) (owner: 10AOkoth) [18:58:49] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166#11126075 (10ssingh) Aug 2025 update: With Liberica rolled out to everywhere except `eqiad` and `codfw` (pending T352956 for the core sites), `cookbook sre.loadbalance... 
[19:00:07] (03CR) 10CDobbins: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1181819 (owner: 10Ncmonitor) [19:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:08:52] (03PS1) 10Ahmon Dancy: scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) [19:09:18] (03CR) 10CI reject: [V:04-1] scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [19:10:22] (03PS2) 10Ahmon Dancy: scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) [19:10:48] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:10:48] (03CR) 10CI reject: [V:04-1] scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 
(https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [19:10:50] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:10:54] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11126132 (10ssingh) @BCornwall: I think this is done but I will wait for your confirmation. And it seems like we are not moving `wikimedia.pt` (since it points to the s... [19:10:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:10:56] oh hi [19:11:10] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:11:19] ChrisDobbins901_: please run authdns-update from dns1004! 
[19:11:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:11:26] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:11:36] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:11:48] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:12:16] (03PS3) 10Ahmon Dancy: scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) [19:12:54] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:12:56] PROBLEM - check if 
authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:12:56] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:13:02] !log sukhe@dns1004 START - running authdns-update [19:13:14] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is f74e6a107ddd75b406e24553f111cb4d22fe133d, dns.git is 5ba0f32d018cdbfdc27d0d4f7fda7ac5b9cd7986) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:13:25] sukhe: ack [19:13:32] ChrisDobbins901_: on it [19:13:50] but yeah, since we merged the DNS patch, this is basically a reminder for us to do that [19:13:53] I merged it for now [19:13:56] !log cdobbins@dns1004 START - running authdns-update [19:14:13] ChrisDobbins901_: it won't work now since I am running it [19:14:13] !log sukhe@dns1004 END - running authdns-update [19:15:48] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:15:48] (03CR) 10Dzahn: "adding another reviewer since I will be out until Tuesday.if it has time until then I will get it deployed or it can be done before. 
I see" [puppet] - 10https://gerrit.wikimedia.org/r/1181198 (https://phabricator.wikimedia.org/T390119) (owner: 10Dduvall) [19:15:49] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [19:15:50] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:15:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:16:03] (03PS4) 10Ahmon Dancy: scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) [19:16:10] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:16:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:16:26] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:16:36] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:16:37] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1182629 
(https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [19:16:48] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:17:54] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:17:56] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:17:56] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:18:14] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [19:21:50] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1182629/7347/" [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [19:22:30] (03CR) 10CDobbins: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1181821 (owner: 10Ncmonitor) [19:23:22] (03CR) 10CDobbins: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1181820 (owner: 10Ncmonitor) [19:24:48] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11126211 (10BCornwall) `wikimedia.pt` redirects to the local 
chapter but `wikipedia.pt` redirects to to `pt.wikipedia.org`. I've tried emailing the representatives of t... [19:26:14] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11126215 (10ssingh) >>! In T401438#11126211, @BCornwall wrote: > `wikimedia.pt` redirects to the local chapter but `wikipedia.pt` redirects to to `pt.wikipedia.org`. I'... [19:29:19] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11126237 (10Mike_Peel) Maybe @Alchimista can help with the .pt domain? [19:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:31:38] (03CR) 10BCornwall: "Acknowledged the history check, this was merged a little prematurely." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181820 (owner: 10Ncmonitor) [19:36:46] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:37:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:38:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:38:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:38:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:42:37] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [19:42:48] (03CR) 10BCornwall: "Thanks for the information. Redirection to the diff post is what I believe to be best for now. 
We'll follow up with another CR" [puppet] - 10https://gerrit.wikimedia.org/r/1181820 (owner: 10Ncmonitor) [19:47:25] (03PS1) 10Bartosz Dziewoński: FixRenamedUserGlobalEditCount: Add --since and --until parameters [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182636 (https://phabricator.wikimedia.org/T313900) [19:47:40] (03PS1) 10Bartosz Dziewoński: FixRenamedUserGlobalEditCount: Improve script output [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182637 (https://phabricator.wikimedia.org/T313900) [19:47:50] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Old username may not be valid [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182638 (https://phabricator.wikimedia.org/T398177) [19:48:02] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve finding local log entries [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182639 (https://phabricator.wikimedia.org/T398177) [19:48:24] (03PS1) 10Bartosz Dziewoński: FixRenamedUserGlobalEditCount: Add --since and --until parameters [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182640 (https://phabricator.wikimedia.org/T313900) [19:48:33] (03PS1) 10Bartosz Dziewoński: FixRenamedUserGlobalEditCount: Improve script output [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182642 (https://phabricator.wikimedia.org/T313900) [19:48:41] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Old username may not be valid [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182643 (https://phabricator.wikimedia.org/T398177) [19:48:48] (03PS1) 10Bartosz Dziewoński: FixRenameUserLocalLogs: Improve finding local log entries [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182644 (https://phabricator.wikimedia.org/T398177) [19:49:06] (03CR) 10ScheduleDeploymentBot: 
"Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182636 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [19:49:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182637 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [19:49:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182640 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [19:49:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182638 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [19:49:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182642 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [19:49:52] (03PS1) 10CDobbins: ncredir: funnel wikimint.org to https://wikimint.org/index.php/Prashant_Ohol [puppet] - 10https://gerrit.wikimedia.org/r/1182645 [19:49:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182639 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [19:50:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182643 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [19:50:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182644 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [19:52:13] jouncebot: next [19:52:14] In 0 hour(s) and 7 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T2000) [19:52:22] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [19:53:01] hi, i just completely spammed the window, but these patches can all be deployed together and don't need any testing. they are fixes for maintenance scripts that i plan to run tomorrow. i hope they can be shipped out, but i can reschedule. 
[19:53:26] (03CR) 10BCornwall: [C:04-2] ncredir: funnel wikimint.org to https://wikimint.org/index.php/Prashant_Ohol (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [19:56:27] (03PS5) 10Ahmon Dancy: scap::master: Add /srv/patches git pre-commit hook for permissions [puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) [19:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:56:55] (03PS2) 10CDobbins: ncredir: funnel wikimint.org to https://wikimediafoundation.org/news/2018/08/22/dont-pay-for-wikipedia-articles/ [puppet] - 10https://gerrit.wikimedia.org/r/1182645 [19:57:20] !log aokoth@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Update [19:57:24] (03CR) 10CI reject: [V:04-1] ncredir: funnel wikimint.org to https://wikimediafoundation.org/news/2018/08/22/dont-pay-for-wikipedia-articles/ [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [19:59:25] (03PS3) 10CDobbins: ncredir: funnel wikimint.org to https://wikimediafoundation.org/news/2018/08/22/dont-pay-for-wikipedia-articles/ [puppet] - 10https://gerrit.wikimedia.org/r/1182645 [19:59:51] (03CR) 10Dzahn: "CI should be happy once you insert a newline after the first line." 
[puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [19:59:58] (03CR) 10CI reject: [V:04-1] ncredir: funnel wikimint.org to https://wikimediafoundation.org/news/2018/08/22/dont-pay-for-wikipedia-articles/ [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T2000). [20:00:05] anzx, ebernhardson, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:24] hi, i just completely spammed the window, but these patches can all be deployed together and don't need any testing. they are fixes for maintenance scripts that i plan to run tomorrow. i hope they can be shipped out, but i can reschedule. [20:00:54] (03CR) 10Thcipriani: [C:03+1] "I like it! Tagging in @sbassett@wikimedia.org — what do you think?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1182629 (https://phabricator.wikimedia.org/T401672) (owner: 10Ahmon Dancy) [20:01:02] (03PS4) 10CDobbins: ncredir: funnel wikimint.org [puppet] - 10https://gerrit.wikimedia.org/r/1182645 [20:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:02:01] (03CR) 10BCornwall: ncredir: funnel wikimint.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:03:03] MatmaRex: Sounds reasonable [20:07:12] !log aokoth@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Security Update [20:13:27] (03CR) 10Pppery: [C:03+1] "(Signalling I approve of the idea, not caring about the puppet-style and commit message issues)" [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:14:19] hmm, are there any deployers around? [20:15:45] (03CR) 10BCornwall: [C:03+1] ncredir: funnel wikimint.org [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:16:26] (03CR) 10Andrew Bogott: "I do not think anything checks the status of those domains. 
You can even add records to them while they're in 'pending' state, just not re" [puppet] - 10https://gerrit.wikimedia.org/r/1182188 (https://phabricator.wikimedia.org/T398712) (owner: 10Andrew Bogott) [20:17:40] (03PS2) 10Andrew Bogott: Keystone hooks: speed up domain creation [puppet] - 10https://gerrit.wikimedia.org/r/1182188 (https://phabricator.wikimedia.org/T398712) [20:17:41] (03PS2) 10Andrew Bogott: wmfkeystonehooks: format with Black [puppet] - 10https://gerrit.wikimedia.org/r/1182189 [20:17:41] (03PS2) 10Andrew Bogott: designatemakedomain.py: format with Black [puppet] - 10https://gerrit.wikimedia.org/r/1182190 [20:17:41] (03PS1) 10Andrew Bogott: Openstack: add wmcs-projectcleanup.py [puppet] - 10https://gerrit.wikimedia.org/r/1182648 (https://phabricator.wikimedia.org/T397648) [20:17:42] (03PS1) 10Andrew Bogott: Openstack wmfkeystonehooks: don't clean up after project delete [puppet] - 10https://gerrit.wikimedia.org/r/1182649 (https://phabricator.wikimedia.org/T397648) [20:19:47] MatmaRex: it looks like none of the usual suspects showed up. You say all 8 of yours can be bundled? [20:20:11] yeah [20:20:38] (if that works when they're on two branches) [20:21:12] it should. we haven't gone into the fancy realm of single version images for the live wikis yet. [20:21:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:23:48] ugh. 
managed to clear my form after putting all the patches in :/ [20:23:57] (03PS1) 10Papaul: Revert "Allow bgp for security zone production on interface facing the cr3/cr4" [homer/public] - 10https://gerrit.wikimedia.org/r/1182651 [20:24:55] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:25:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182636 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:25:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182637 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:25:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182638 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:25:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182639 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:25:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182640 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:25:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182642 (https://phabricator.wikimedia.org/T313900) (owner: 
10Bartosz Dziewoński) [20:25:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182643 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:25:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182644 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:25:31] MatmaRex: let's see how it goes :) [20:25:41] (03CR) 10Papaul: [C:03+2] Revert "Allow bgp for security zone production on interface facing the cr3/cr4" [homer/public] - 10https://gerrit.wikimedia.org/r/1182651 (owner: 10Papaul) [20:27:00] (03Merged) 10jenkins-bot: FixRenamedUserGlobalEditCount: Add --since and --until parameters [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182636 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:27:02] (03Merged) 10jenkins-bot: FixRenamedUserGlobalEditCount: Improve script output [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182637 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:27:03] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Old username may not be valid [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182638 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:27:07] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Improve finding local log entries [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1182639 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:27:08] (03Merged) 10jenkins-bot: FixRenamedUserGlobalEditCount: Add --since and --until parameters [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182640 
(https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:27:18] (03Merged) 10jenkins-bot: FixRenamedUserGlobalEditCount: Improve script output [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182642 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [20:27:20] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Old username may not be valid [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182643 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:28:03] (03PS1) 10Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T402206) [20:30:25] (03PS1) 10Papaul: Add bgp to security zone production for mr [homer/public] - 10https://gerrit.wikimedia.org/r/1182653 (https://phabricator.wikimedia.org/T294845) [20:30:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11126374 (10VRiley-WMF) es1049 - rack A5, U06 es1050 - rack B6, U30 es1051 - rack D1, U11 es1052 - rack D3, U08 es1053 - rack D6, U09 these have been racked. @Ladsgroup at this... [20:31:31] (03CR) 10CDobbins: ncredir: funnel wikimint.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:31:34] (03CR) 10CDobbins: [C:03+2] ncredir: funnel wikimint.org [puppet] - 10https://gerrit.wikimedia.org/r/1182645 (owner: 10CDobbins) [20:31:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps101[1-4] - https://phabricator.wikimedia.org/T400638#11126375 (10VRiley-WMF) a:03VRiley-WMF [20:35:36] MatmaRex: we apparently lost the jenkins lottery. the last patch is still running quibble-vendor-mysql-php81. It's hard to tell anymore how much longer with the parallel phpunit tests not outputting anything until each runner finishes. 
:/ [20:36:29] (03Merged) 10jenkins-bot: FixRenameUserLocalLogs: Improve finding local log entries [extensions/CentralAuth] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182644 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [20:36:30] yay. whining worked [20:36:36] heh [20:37:03] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11126381 (10Jhancock.wm) [20:37:04] !log bd808@deploy1003 Started scap sync-world: Backport for [[gerrit:1182636|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182637|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182638|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182639|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]], [[gerrit:1182640|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182642|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182643|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182644|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]] [20:37:10] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [20:37:11] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [20:37:23] that's not an ideal log message :) [20:37:34] E_TOO_MANY_PATCHES [20:37:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install maps201[1-4] - https://phabricator.wikimedia.org/T400637#11126384 (10Jhancock.wm) ran into an issue where the installer thinks that there is a raid when the config does not call for one.
need to go back and compare and check for issues. [20:38:15] bd808: i tried to set it up so that it would use the "success cache" from the test jobs, weird that it worked for 7 of the 8 patches [20:39:46] computers are fickle and mean. [20:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:42:39] (03PS3) 10JHathaway: provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 [20:43:11] !log bd808@deploy1003 matmarex, bd808: Backport for [[gerrit:1182636|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182637|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182638|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182639|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]], [[gerrit:1182640|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182642|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182643|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182644|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:43:18] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [20:43:19] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [20:43:34] MatmaRex: changes are on the test servers.
please verify [20:43:44] nothing to test with these. they're all maintenance scripts [20:44:05] (i'm planning to run them tomorrow) [20:44:17] that's easy enough. pushing the go button [20:44:26] !log bd808@deploy1003 matmarex, bd808: Continuing with sync [20:46:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:41] (03PS4) 10JHathaway: provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 [20:50:03] !log bd808@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182636|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182637|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182638|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182639|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]], [[gerrit:1182640|FixRenamedUserGlobalEditCount: Add --since and --until parameters (T313900)]], [[gerrit:1182642|FixRenamedUserGlobalEditCount: Improve script output (T313900)]], [[gerrit:1182643|FixRenameUserLocalLogs: Old username may not be valid (T398177)]], [[gerrit:1182644|FixRenameUserLocalLogs: Improve finding local log entries (T398177)]] (duration: 12m 59s) [20:50:10] T313900: Renaming a user doubles their edit count according to CentralAuthUser::getGlobalEditCount() / global_edit_count.gec_count field - https://phabricator.wikimedia.org/T313900 [20:50:10] T398177: 'renameuser' logs for a global rename use actor ID from metawiki instead of the local one when created by the fixStuckGlobalRename.php script - https://phabricator.wikimedia.org/T398177 [20:50:46] all done MatmaRex [20:51:18] thanks bd808
[20:51:30] anz seems not to be here for the 3 community config patches [20:51:34] yw MatmaRex [20:51:46] (03CR) 10JHathaway: provision: poll for reboot via Redfish (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [20:51:53] ebernhardson: should I ship your config change too? [20:55:08] ok, I declare this late starting and partially implemented backport window {{Done}} [20:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [20:56:45] (03CR) 10BryanDavis: "This was not deployed as the requestor did not show up on IRC during the window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182192 (https://phabricator.wikimedia.org/T402627) (owner: 10Ebernhardson) [20:57:09] (03CR) 10BryanDavis: "This was not deployed as the requestor did not show up on IRC during the window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182530 (https://phabricator.wikimedia.org/T402706) (owner: 10Anzx) [20:57:28] (03CR) 10BryanDavis: "This was not deployed as the requestor did not show up on IRC during the window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) (owner: 10Anzx) [20:58:32] (03CR) 10BryanDavis: "This was not deployed as the requestor did not show up on IRC during the window." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182523 (https://phabricator.wikimedia.org/T402725) (owner: 10Anzx) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T2100) [21:03:03] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on all the remaining Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182655 (https://phabricator.wikimedia.org/T402329) [21:04:03] 06SRE, 06Traffic-Icebox: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891#11126478 (10Pppery) [21:04:16] 06SRE, 06Traffic-Icebox, 07User-notice-archive: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891#11126479 (10Pppery) [21:04:18] (03CR) 10Cwhite: [C:03+2] cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [21:05:15] (03PS2) 10Cwhite: monitoring: ensure disk space check is absent [puppet] - 10https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) [21:06:11] (03Merged) 10jenkins-bot: cirrussearch: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179178 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [21:07:01] (03PS2) 10Daimona Eaytoy: Enable the CampaignEvents extension on all the remaining Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182655 (https://phabricator.wikimedia.org/T402329) [21:17:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182655 (https://phabricator.wikimedia.org/T402329) (owner: 10Daimona Eaytoy) [21:17:18] (03PS45) 10CDobbins: dnsrecursor: add recursor.yml.erb 
[puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [21:25:43] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6781/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [21:27:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:28:58] !log jhathaway@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2009.codfw.wmnet with OS bookworm [21:30:04] ^ that spike of edit conflicts is interesting [21:31:06] !log jhathaway@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2009.codfw.wmnet with OS bookworm [21:32:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [21:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:35] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:45:48] (03PS7) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 
10https://gerrit.wikimedia.org/r/1180137 [21:48:43] (03PS1) 10Cwhite: sre: add unit filter to systemd status dashboard link [alerts] - 10https://gerrit.wikimedia.org/r/1182657 (https://phabricator.wikimedia.org/T332764) [21:51:25] (03PS1) 10RLazarus: envoy: Update to v1.26.8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182659 (https://phabricator.wikimedia.org/T402584) [21:52:15] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [21:52:54] (03CR) 10RLazarus: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182659 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250827T2200) [22:00:05] (03CR) 10Scott French: [C:03+1] envoy: Update to v1.26.8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182659 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:02:25] (03CR) 10RLazarus: [V:03+2 C:03+2] envoy: Update to v1.26.8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1182659 (https://phabricator.wikimedia.org/T402584) (owner: 10RLazarus) [22:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:07:59] (Loading spider pig) [22:08:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11126703 (10phaultfinder) [22:09:50] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:10:14] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:11:28] (03CR) 10Cwhite: [C:03+1] "Nit inline but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1182092 (https://phabricator.wikimedia.org/T370153) (owner: 10Tiziano Fogli) [22:12:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:13:05] (03CR) 10Cwhite: [C:03+1] nrpewrapper: correlate Prometheus "for:" duration with Icinga timing [puppet] - 10https://gerrit.wikimedia.org/r/1182524 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [22:13:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:13:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:13:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:13:52] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11126722 (10phaultfinder) 
[22:16:41] (03PS8) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 [22:16:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182215 (https://phabricator.wikimedia.org/T397084) (owner: 10Jdlrobson) [22:18:07] (03CR) 10Cwhite: "Is this script meant to be manually invoked locally, following the instructions in the comments for raw data gathering?" [software] - 10https://gerrit.wikimedia.org/r/1182571 (https://phabricator.wikimedia.org/T395443) (owner: 10Tiziano Fogli) [22:23:11] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [22:24:44] (03PS9) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 [22:26:12] can I do a quick spiderpig backport? I seem to have caused a mess on wiktionary that I need to backport to clean up: https://phabricator.wikimedia.org/T403113 [22:27:51] i see Jdlrobson is currently running a spiderpig [22:28:26] (03Merged) 10jenkins-bot: Consolidate search config to match Minerva [skins/Vector] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182215 (https://phabricator.wikimedia.org/T397084) (owner: 10Jdlrobson) [22:28:54] (03PS1) 10C. Scott Ananian: Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182664 (https://phabricator.wikimedia.org/T403113) [22:28:54] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1182215|Consolidate search config to match Minerva (T397084)]] [22:29:04] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [22:29:08] (03CR) 10C. 
Scott Ananian: [C:03+1] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182664 (https://phabricator.wikimedia.org/T403113) (owner: 10C. Scott Ananian) [22:31:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:33] (03PS10) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) [22:33:09] (03CR) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [22:34:21] @jdlr [22:34:48] @jdlrobson im available to test! [22:34:54] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1182215|Consolidate search config to match Minerva (T397084)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:34:59] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [22:35:20] bwang: you can test on mwdebug now! [22:40:29] (03CR) 10Papaul: [C:03+2] Add bgp to security zone production for mr [homer/public] - 10https://gerrit.wikimedia.org/r/1182653 (https://phabricator.wikimedia.org/T294845) (owner: 10Papaul) [22:41:56] bwang: all good? 
[22:41:57] Jdlrobson, cscott: I'll have something once y'all are done, let me know but no hurry :) [22:41:59] (03CR) 10Arlolra: [C:03+1] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182664 (https://phabricator.wikimedia.org/T403113) (owner: 10C. Scott Ananian) [22:44:30] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [22:46:03] rzl: i need to leave for dinner soon, so feel free to jump in after Jdlrobson is done. I'll backport my patch a little later tonight, after you. [22:46:18] sure, thanks [22:46:48] cscott: rzl sorry for delay. [22:47:07] it's your window still, I'm just freeloading :) [22:48:37] (03PS1) 10RLazarus: mw-*: Upgrade to Envoy 1.26.8 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182666 (https://phabricator.wikimedia.org/T408254) [22:48:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11126828 (10Papaul) I follow up again with Juniper today for an update since i upload the log files requested but still no update. [22:49:34] Hi everyone, we are wondering about the process to make schema changes in beta. Do we need to run it by DBAs? 
[22:49:43] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182215|Consolidate search config to match Minerva (T397084)]] (duration: 20m 48s) [22:49:48] T397084: Clean up and consolidate typeahead search config across Minerva and Vector - https://phabricator.wikimedia.org/T397084 [22:51:07] Jdlrobson: please let me know when you're done so I can backport 1182664 [22:51:18] ok done with window arlolra go for it [22:51:27] * Jdlrobson releases the conch [22:51:33] thank you [22:52:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182664 (https://phabricator.wikimedia.org/T403113) (owner: 10C. Scott Ananian) [22:55:16] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#11126833 (10Papaul) BGP is up and running between mr1-ulsfo and both cr3 and cr4 ` Peer AS InPkt OutPkt OutQ Flaps Las... [22:55:23] (03PS2) 10RLazarus: mw-*: Upgrade to Envoy 1.26.8 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182666 (https://phabricator.wikimedia.org/T408254) [22:55:38] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/1182524 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [22:56:30] (03Merged) 10jenkins-bot: Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" [core] (wmf/1.45.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1182664 (https://phabricator.wikimedia.org/T403113) (owner: 10C. Scott Ananian) [22:56:56] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] [22:56:57] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1182657 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [22:57:02] T403113: Scribunto - mw.ustring.lower and mw.ustring.upper now automatically convert text to NFC, causing Module:grc-translit to fail on English Wiktionary - https://phabricator.wikimedia.org/T403113 [22:57:03] T400057: MW's Title uppercase first letter code can lead to non-NFC titles - https://phabricator.wikimedia.org/T400057 [22:57:52] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1180642 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [23:01:19] !log arlolra@deploy1003 arlolra, cscott: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:03:22] !log arlolra@deploy1003 arlolra, cscott: Continuing with sync [23:04:35] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:08:36] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] (duration: 11m 40s) [23:08:37] (03CR) 10Scott French: [C:03+1] mw-*: Upgrade to Envoy 1.26.8 in the MW canary releases 
and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182666 (https://phabricator.wikimedia.org/T408254) (owner: 10RLazarus) [23:08:42] T403113: Scribunto - mw.ustring.lower and mw.ustring.upper now automatically convert text to NFC, causing Module:grc-translit to fail on English Wiktionary - https://phabricator.wikimedia.org/T403113 [23:08:43] T400057: MW's Title uppercase first letter code can lead to non-NFC titles - https://phabricator.wikimedia.org/T400057 [23:08:59] (03PS1) 10RLazarus: mesh: Copy configuration_1.14.0 to 1.14.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182669 (https://phabricator.wikimedia.org/T403101) [23:09:01] (03PS1) 10RLazarus: mesh: Remove deprecated field config.core.v3.HeaderValueOption.append [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182670 (https://phabricator.wikimedia.org/T403101) [23:09:40] arlolra: let me know when you're finished, I'll start mine :) [23:10:07] rzl: I am done, go for it [23:10:11] thanks! [23:12:25] (03CR) 10RLazarus: [C:03+2] mw-*: Upgrade to Envoy 1.26.8 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182666 (https://phabricator.wikimedia.org/T408254) (owner: 10RLazarus) [23:14:25] (03Merged) 10jenkins-bot: mw-*: Upgrade to Envoy 1.26.8 in the MW canary releases and mw-debug [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182666 (https://phabricator.wikimedia.org/T408254) (owner: 10RLazarus) [23:16:06] !log rzl@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [23:16:33] !log rzl@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [23:17:04] !log rzl@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [23:17:26] !log rzl@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply [23:18:28] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [23:22:58] !log rzl@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply [23:23:25] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [23:25:48] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [23:26:59] mw-debug still happy, proceeding with the canaries [23:28:09] !log rzl@deploy1003 Started scap sync-world: https://gerrit.wikimedia.org/r/1182666 [23:29:35] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:30:35] !log rzl@deploy1003 Finished scap sync-world: https://gerrit.wikimedia.org/r/1182666 (duration: 03m 03s) [23:31:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:32:06] done for now, letting that bake overnight [23:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:37:16] (03CR) 10RLazarus: "For the mesh config, this is the only deprecation warning in the 1.23 -> 1.26 bump. 
The replacement field was added in 1.20 so we can swit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182670 (https://phabricator.wikimedia.org/T403101) (owner: 10RLazarus) [23:38:15] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182673 [23:38:15] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182673 (owner: 10TrainBranchBot) [23:38:25] (03PS1) 10RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182674 (https://phabricator.wikimedia.org/T403101) [23:38:27] (03PS1) 10RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182675 (https://phabricator.wikimedia.org/T403101) [23:38:29] (03PS1) 10RLazarus: api-gateway: Remove deprecated Envoy config fields [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182676 (https://phabricator.wikimedia.org/T403101) [23:41:12] (03CR) 10Anzx: "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182523 (https://phabricator.wikimedia.org/T402725) (owner: 10Anzx) [23:41:41] (03CR) 10Anzx: "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182530 (https://phabricator.wikimedia.org/T402706) (owner: 10Anzx) [23:42:01] (03CR) 10Anzx: "Acknowledged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1182540 (https://phabricator.wikimedia.org/T376049) (owner: 10Anzx) [23:49:44] (03PS2) 10RLazarus: mesh: Remove deprecated field config.core.v3.HeaderValueOption.append [deployment-charts] - 10https://gerrit.wikimedia.org/r/1182670 (https://phabricator.wikimedia.org/T403101) [23:53:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1182673 (owner: 10TrainBranchBot) [23:55:16] !log 
ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [23:55:34] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:55:41] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T402925)', diff saved to https://phabricator.wikimedia.org/P81926 and previous config saved to /var/cache/conftool/dbconfig/20250827-235540-ladsgroup.json [23:55:46] T402925: Drop cl_to and cl_collation from categorylinks in wmf production - https://phabricator.wikimedia.org/T402925 [23:59:51] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T402925)', diff saved to https://phabricator.wikimedia.org/P81927 and previous config saved to /var/cache/conftool/dbconfig/20250827-235950-ladsgroup.json