[00:32:46] FIRING: Traffic bill over quota: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:40:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1235605
[00:40:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1235605 (owner: TrainBranchBot)
[00:51:45] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1235605 (owner: TrainBranchBot)
[00:52:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233651 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[00:52:46] RESOLVED: Traffic bill over quota: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota Has worsened - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:10:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1235606
[01:10:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1235606 (owner: TrainBranchBot)
[01:33:58] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1235606 (owner: TrainBranchBot)
[02:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:09:15] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile -
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:14:02] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 22s)
[02:45:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:50:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[03:10:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:19:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:25:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:54:55] ops-eqiad, DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416106 (phaultfinder) NEW
[05:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:51:23] ops-eqiad, SRE, DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416106#11572222 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[05:51:36] (PS1) Gerrit maintenance bot: mariadb: Promote db1209 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1235613 (https://phabricator.wikimedia.org/T416107)
[05:53:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1209 with weight 0 T416107', diff saved to https://phabricator.wikimedia.org/P88351 and previous config saved to /var/cache/conftool/dbconfig/20260202-055304-marostegui.json
[05:53:11] T416107: Switchover s8 master (db1193 -> db1209) - https://phabricator.wikimedia.org/T416107
[05:53:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s8 T416107
[05:53:24] (CR) Marostegui: [C:+2] mariadb: Promote db1209 to s8 master [puppet] -
https://gerrit.wikimedia.org/r/1235613 (https://phabricator.wikimedia.org/T416107) (owner: Gerrit maintenance bot)
[05:56:42] !log Starting s8 eqiad failover from db1193 to db1209 - T416107
[05:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1209 to s8 primary T416107', diff saved to https://phabricator.wikimedia.org/P88352 and previous config saved to /var/cache/conftool/dbconfig/20260202-055717-marostegui.json
[05:57:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1193 T416107', diff saved to https://phabricator.wikimedia.org/P88353 and previous config saved to /var/cache/conftool/dbconfig/20260202-055755-marostegui.json
[05:59:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1193.eqiad.wmnet with reason: long schema change
[06:00:58] (PS1) Marostegui: db1193: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235614 (https://phabricator.wikimedia.org/T411164)
[06:02:05] !log Deploy schema change on old s8 eqiad master db1193 T411164 T411163
[06:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:02:14] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:04:05] (CR) Marostegui: [C:+2] db1193: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235614 (https://phabricator.wikimedia.org/T411164) (owner: Marostegui)
[06:04:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2165 with weight 0 T415748', diff saved to https://phabricator.wikimedia.org/P88354 and previous config saved to /var/cache/conftool/dbconfig/20260202-060437-marostegui.json
[06:04:41] T415748: Switchover s8 master (db2161 -> db2165) -
https://phabricator.wikimedia.org/T415748
[06:04:53] (CR) Marostegui: [C:+2] mariadb: Promote db2165 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1234158 (https://phabricator.wikimedia.org/T415748) (owner: Gerrit maintenance bot)
[06:07:45] (PS1) Marostegui: db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164)
[06:08:15] (CR) CI reject: [V:-1] db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164) (owner: Marostegui)
[06:09:03] (PS2) Marostegui: db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164)
[06:09:15] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:09:34] (CR) CI reject: [V:-1] db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164) (owner: Marostegui)
[06:11:04] (PS3) Marostegui: db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164)
[06:11:16] !log Starting s8 codfw failover from db2161 to db2165 - T415748
[06:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:21] T415748: Switchover s8 master (db2161 -> db2165) - https://phabricator.wikimedia.org/T415748
[06:11:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s8 codfw as read-only for maintenance - T415748', diff saved to https://phabricator.wikimedia.org/P88355 and previous config saved to /var/cache/conftool/dbconfig/20260202-061150-marostegui.json
[06:12:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2165
to s8 primary and set section read-write T415748', diff saved to https://phabricator.wikimedia.org/P88356 and previous config saved to /var/cache/conftool/dbconfig/20260202-061217-marostegui.json
[06:12:37] (CR) Marostegui: [C:+2] wmnet: Update s8-master alias [dns] - https://gerrit.wikimedia.org/r/1234159 (https://phabricator.wikimedia.org/T415748) (owner: Gerrit maintenance bot)
[06:12:43] !log marostegui@dns1006 START - running authdns-update
[06:13:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2161 T415748', diff saved to https://phabricator.wikimedia.org/P88357 and previous config saved to /var/cache/conftool/dbconfig/20260202-061310-marostegui.json
[06:13:40] (CR) Marostegui: [C:+2] db2161: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235615 (https://phabricator.wikimedia.org/T411164) (owner: Marostegui)
[06:13:56] !log marostegui@dns1006 END - running authdns-update
[06:14:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2161.codfw.wmnet with reason: long schema change
[06:21:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T415983
[06:22:01] T415983: Switchover s2 master (db1222 -> db1162) - https://phabricator.wikimedia.org/T415983
[06:22:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1162 with weight 0 T415983', diff saved to https://phabricator.wikimedia.org/P88358 and previous config saved to /var/cache/conftool/dbconfig/20260202-062212-marostegui.json
[06:22:38] (CR) Marostegui: [C:+2] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1235227 (https://phabricator.wikimedia.org/T415983) (owner: Gerrit maintenance bot)
[06:23:18] !log Starting s2 eqiad failover from db1222 to db1162 - T415983
[06:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:13]
(PS1) Marostegui: db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235619
[06:24:45] (CR) Marostegui: [C:+2] db1222: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235619 (owner: Marostegui)
[06:25:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1162 to s2 primary T415983', diff saved to https://phabricator.wikimedia.org/P88359 and previous config saved to /var/cache/conftool/dbconfig/20260202-062522-marostegui.json
[06:25:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1222 T415983', diff saved to https://phabricator.wikimedia.org/P88360 and previous config saved to /var/cache/conftool/dbconfig/20260202-062554-marostegui.json
[06:27:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[06:32:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance
[06:33:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2149 (T415786)', diff saved to https://phabricator.wikimedia.org/P88361 and previous config saved to /var/cache/conftool/dbconfig/20260202-063304-marostegui.json
[06:33:10] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[07:05:56] (PS2) Jforrester: [wikifunctions] Grant sysops permission to edit function of attached implementation and tester [mediawiki-config] - https://gerrit.wikimedia.org/r/1227748 (https://phabricator.wikimedia.org/T399934) (owner: Daphne Smit)
[07:06:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1227748 (https://phabricator.wikimedia.org/T399934) (owner: Daphne Smit)
[07:19:16] FIRING: [2x]
PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:19:42] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:36] (CR) Tiziano Fogli: [C:+2] admin: remove cwhite ssh keys [puppet] - https://gerrit.wikimedia.org/r/1235601 (owner: Cwhite)
[07:47:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235458 (https://phabricator.wikimedia.org/T415354) (owner: Kosta Harlan)
[07:50:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:46] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:59:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:00:05] Amir1, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T0800).
[08:00:05] samwilson and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:10] hi
[08:00:41] hullo
[08:01:15] samwilson: do you want me to deploy your patch?
[08:01:44] kostajh: yes sure, that'd be great
[08:01:46] thanks!
[08:02:08] !log Restarting druid middle-managers to recover from OOM - T415799
[08:02:09] ok, I'll sync yours first, then I'll backport mine
[08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:12] T415799: Since upgrade Druid realtime-ingestion tasks replication fails - https://phabricator.wikimedia.org/T415799
[08:02:12] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1233651 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:02:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:03:05] (Merged) jenkins-bot: Enable watchlist labels everywhere (prod and beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1233651 (https://phabricator.wikimedia.org/T413967) (owner: Samwilson)
[08:04:00] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1233651|Enable watchlist labels everywhere (prod and beta) (T413967)]]
[08:04:03] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:05:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:07:39] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: Stang)
[08:07:44] FIRING: SLOMetricAbsent:
charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:09:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:09:24] FIRING: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:09:50] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:47] !log installing openssl security updates
[08:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:13:15] (CR) Elukey: [C:+2] profile::pyrra: add second SLO for Abstract Wikipedia [puppet] - https://gerrit.wikimedia.org/r/1230259 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[08:17:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:17:44] RESOLVED: SLOMetricAbsent: charts-client-side-availability-v1 - https://slo.wikimedia.org/?search=charts-client-side-availability-v1 - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:19:16] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:24] RESOLVED: SLOMetricAbsent: edit-check-pre-save-checks-ratio - https://slo.wikimedia.org/?search=edit-check-pre-save-checks-ratio - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[08:20:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:23:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:06] (CR) Kosta Harlan: [C:+2] BlockUtils: Log x-provenance and IP reputation fields [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235458 (https://phabricator.wikimedia.org/T415354) (owner: Kosta Harlan)
[08:26:26] (CR) Jforrester: [C:+1] "Thanks! https://slo.wikimedia.org/objectives?expr={__name__=%22wikilambda-parsoid-combined-v1%22,%20revision=%221%22,%20service=%22parsoid" [puppet] - https://gerrit.wikimedia.org/r/1230259 (https://phabricator.wikimedia.org/T415067) (owner: Elukey)
[08:26:32] (Merged) jenkins-bot: BlockUtils: Log x-provenance and IP reputation fields [extensions/WikimediaEvents] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235458 (https://phabricator.wikimedia.org/T415354) (owner: Kosta Harlan)
[08:27:21] !log kharlan@deploy2002 kharlan, samwilson: Backport for [[gerrit:1233651|Enable watchlist labels everywhere (prod and beta) (T413967)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:27:24] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:27:40] (Abandoned) Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - https://gerrit.wikimedia.org/r/1218731 (owner: Elukey)
[08:27:44] (Abandoned) Elukey: [DNM] provision: remove some idrac10 cpu settings [cookbooks] - https://gerrit.wikimedia.org/r/1185057 (owner: Elukey)
[08:27:53] samwilson: can you please test your patch?
[08:28:32] (CR) Elukey: "I think we are ready for a review, we can proceed this week :)" [puppet] - https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: Elukey)
[08:30:09] (CR) Elukey: [C:+1] Remove support for Python 3.9 [software/cumin] - https://gerrit.wikimedia.org/r/1224029 (owner: Volans)
[08:30:13] kostajh, testing now
[08:30:53] (CR) Elukey: [C:+1] Update deprecated type hints [software/cumin] - https://gerrit.wikimedia.org/r/1224030 (owner: Volans)
[08:31:37] kostajh: yep, all looks good
[08:31:58] !log kharlan@deploy2002 kharlan, samwilson: Continuing with sync
[08:31:59] cool
[08:32:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:36:07] (CR) Elukey: [C:+1] transports: refactor State implementation [software/cumin] - https://gerrit.wikimedia.org/r/1224031 (owner: Volans)
[08:37:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55711 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:37:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:40:58] (PS1) Muehlenhoff: Add Cumin alias for ml-build [puppet] - https://gerrit.wikimedia.org/r/1235737
[08:41:21] (PS1) Giuseppe Lavagetto: Revert "hiddenparma: use sqlite for now" [puppet] - https://gerrit.wikimedia.org/r/1235738
[08:44:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1234457 (https://phabricator.wikimedia.org/T415638) (owner: Joal)
[08:45:08] (CR) Giuseppe Lavagetto: [C:+2] Revert "hiddenparma: use sqlite for now" [puppet] - https://gerrit.wikimedia.org/r/1235738 (owner: Giuseppe Lavagetto)
[08:45:46] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1233651|Enable watchlist labels everywhere (prod and beta) (T413967)]] (duration: 41m 47s)
[08:45:49] T413967: Deploy watchlist labels - https://phabricator.wikimedia.org/T413967
[08:46:31] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1235458|BlockUtils: Log x-provenance and IP reputation fields (T415354)]]
[08:46:34] T415354: Record CDN/Backend api and IP reputation values in editattemptsblocked schema - https://phabricator.wikimedia.org/T415354
[08:48:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T415786)', diff saved to
https://phabricator.wikimedia.org/P88363 and previous config saved to /var/cache/conftool/dbconfig/20260202-084806-marostegui.json
[08:48:13] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786
[08:48:27] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1235458|BlockUtils: Log x-provenance and IP reputation fields (T415354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:50:21] !log kharlan@deploy2002 kharlan: Continuing with sync
[08:54:00] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:55:17] (PS1) Brouberol: idp: flip oidc_id_token_claims to true [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752)
[08:56:10] (PS2) Brouberol: idp: flip oidc_id_token_claims to true [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752)
[08:56:33] (CR) Brouberol: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7962/console" [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: Brouberol)
[08:56:36] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235458|BlockUtils: Log x-provenance and IP reputation fields (T415354)]] (duration: 10m 05s)
[08:56:42] T415354: Record CDN/Backend api and IP reputation values in editattemptsblocked schema - https://phabricator.wikimedia.org/T415354
[08:57:28] (CR) Brouberol: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7963/console" [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: Brouberol)
[08:57:52] SRE-swift-storage, Ceph, ServiceOps new, Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11572476 (elukey) @dancy @Scott_French I think we are ready to move forward with https://gerrit.wikimedia...
[08:57:55] (CR) Brouberol: [V:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: Brouberol)
[09:02:53] (CR) Dpogorzelski: [C:+1] Add Cumin alias for ml-build [puppet] - https://gerrit.wikimedia.org/r/1235737 (owner: Muehlenhoff)
[09:03:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2249 T415358', diff saved to https://phabricator.wikimedia.org/P88364 and previous config saved to /var/cache/conftool/dbconfig/20260202-090328-marostegui.json
[09:03:32] T415358: Migrate 1P db* to Debian Trixie - https://phabricator.wikimedia.org/T415358
[09:03:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P88365 and previous config saved to /var/cache/conftool/dbconfig/20260202-090337-marostegui.json
[09:04:05] (PS1) Marostegui: db2249: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235741 (https://phabricator.wikimedia.org/T415358)
[09:04:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2249.codfw.wmnet with reason: Reimage to debian trixie
[09:04:59] (CR) Marostegui: [C:+2] db2249: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1235741 (https://phabricator.wikimedia.org/T415358) (owner: Marostegui)
[09:07:33] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2249.codfw.wmnet with OS trixie
[09:12:33] jouncebot: nowandnext
[09:12:33] No deployments scheduled for the next 1 hour(s) and 47 minute(s)
[09:12:33] In 1 hour(s) and 47 minute(s):
MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1100)
[09:13:38] hello, world - i'd like to deploy a new parsoid to fix a cite issue - if there's no counter-order i'll do that in 5 minutes or so
[09:15:06] (that's what c.scott considered deploying on friday evening and that was decided not to
[09:17:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2249.codfw.wmnet with reason: host reimage
[09:18:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P88366 and previous config saved to /var/cache/conftool/dbconfig/20260202-091845-marostegui.json
[09:20:23] (PS1) Slyngshede: data.yaml: Offboarding auglnkv [puppet] - https://gerrit.wikimedia.org/r/1235746
[09:21:02] (CR) TrainBranchBot: [C:+2] "Approved by ihurbain@deploy2002 using scap backport" [vendor] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235418 (https://phabricator.wikimedia.org/T416050) (owner: Reedy)
[09:21:03] (CR) TrainBranchBot: [C:+2] "Approved by ihurbain@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235386 (https://phabricator.wikimedia.org/T415328) (owner: C. Scott Ananian)
[09:21:03] (CR) TrainBranchBot: [C:+2] "Approved by ihurbain@deploy2002 using scap backport" [vendor] (wmf/1.46.0-wmf.13) - https://gerrit.wikimedia.org/r/1235384 (https://phabricator.wikimedia.org/T415888) (owner: C.
Scott Ananian)
[09:23:21] (PS1) Brouberol: Remove mpic-related placeholder secrets [labs/private] - https://gerrit.wikimedia.org/r/1235747
[09:23:28] (CR) Brouberol: [C:+2] Remove mpic-related placeholder secrets [labs/private] - https://gerrit.wikimedia.org/r/1235747 (owner: Brouberol)
[09:23:30] (CR) Brouberol: [V:+2 C:+2] Remove mpic-related placeholder secrets [labs/private] - https://gerrit.wikimedia.org/r/1235747 (owner: Brouberol)
[09:23:45] (CR) Brouberol: [V:+1] "check experiment" [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: Brouberol)
[09:24:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2249.codfw.wmnet with reason: host reimage
[09:26:12] (CR) Brouberol: [V:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: Brouberol)
[09:27:43] !log cleanup nginx-related packages and configs from urldownloader hosts to clean up alerts - T405631
[09:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:27] FIRING: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:30:15] this should clear soon --^
[09:33:23] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-staging-codfw: maintenance
[09:33:23] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-staging-codfw: maintenance
[09:33:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T415786)', diff saved to https://phabricator.wikimedia.org/P88367 and previous config saved to
/var/cache/conftool/dbconfig/20260202-093354-marostegui.json [09:33:58] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:34:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:34:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2156 (T415786)', diff saved to https://phabricator.wikimedia.org/P88368 and previous config saved to /var/cache/conftool/dbconfig/20260202-093418-marostegui.json [09:34:27] RESOLVED: [8x] SystemdUnitFailed: nginx.service on urldownloader1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:47] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.pool-depool-cluster depool all services in codfw/ml-staging-codfw: maintenance [09:35:47] !log dpogorzelski@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) depool all services in codfw/ml-staging-codfw: maintenance [09:36:00] (03Merged) 10jenkins-bot: Upgrading psy/psysh (v0.12.10 => v0.12.19) [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235418 (https://phabricator.wikimedia.org/T416050) (owner: 10Reedy) [09:36:54] (03CR) 10Volans: [C:03+2] "Correct, but they will be upgraded at some point and it might take a while to make a final new release with all the new features to be tes" [software/cumin] - 10https://gerrit.wikimedia.org/r/1224029 (owner: 10Volans) [09:37:18] (03CR) 10Volans: [C:03+2] Update deprecated type hints [software/cumin] - 10https://gerrit.wikimedia.org/r/1224030 (owner: 10Volans) [09:37:51] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a13.1 [vendor] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235384 (https://phabricator.wikimedia.org/T415888) (owner: 10C. 
Scott Ananian) [09:37:57] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.23.0-a13.1 [core] (wmf/1.46.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1235386 (https://phabricator.wikimedia.org/T415328) (owner: 10C. Scott Ananian) [09:38:19] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1235418|Upgrading psy/psysh (v0.12.10 => v0.12.19) (T416050)]], [[gerrit:1235386|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415328)]], [[gerrit:1235384|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415888 T415328)]] [09:38:27] T416050: CVE-2026-25129: PsySH has Local Privilege Escalation via CWD .psysh.php auto-load - https://phabricator.wikimedia.org/T416050 [09:38:27] T415328: CTT tasks week of 2026-01-23 - https://phabricator.wikimedia.org/T415328 [09:38:27] T415888: PHP Warning: Undefined property: Wikimedia\Parsoid\NodeData\NodeData::$mw - https://phabricator.wikimedia.org/T415888 [09:39:15] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster ml-staging-codfw: Kubernetes upgrade [09:40:12] (03PS1) 10Brouberol: idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235749 [09:40:14] !log ihurbain@deploy2002 reedy, cscott, ihurbain: Backport for [[gerrit:1235418|Upgrading psy/psysh (v0.12.10 => v0.12.19) (T416050)]], [[gerrit:1235386|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415328)]], [[gerrit:1235384|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415888 T415328)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:40:22] (03CR) 10Brouberol: [C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235749 (owner: 10Brouberol) [09:40:23] (03CR) 10Brouberol: [V:03+2 C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235749 (owner: 10Brouberol) [09:40:33] (03CR) 10Brouberol: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [09:40:51] !log ihurbain@deploy2002 reedy, cscott, ihurbain: Continuing with sync [09:42:05] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for trueg - https://phabricator.wikimedia.org/T415632#11572827 (10elukey) @thcipriani Hi! When you have a moment, please review this request :) @trueg Hi! One follow up question - I noticed that you already have an account and an ssh key registered... [09:42:13] (03PS1) 10Brouberol: idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235751 [09:42:20] (03CR) 10Brouberol: [C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235751 (owner: 10Brouberol) [09:42:21] (03CR) 10Brouberol: [V:03+2 C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235751 (owner: 10Brouberol) [09:42:58] dpogorzelski@cumin1003 wipe-cluster (PID 904355) is awaiting input [09:44:56] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235418|Upgrading psy/psysh (v0.12.10 => v0.12.19) (T416050)]], [[gerrit:1235386|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415328)]], [[gerrit:1235384|Bump wikimedia/parsoid to 0.23.0-a13.1 (T415888 T415328)]] (duration: 06m 36s) [09:45:14] T416050: CVE-2026-25129: PsySH has Local Privilege Escalation via CWD .psysh.php auto-load - https://phabricator.wikimedia.org/T416050 [09:45:15] T415328: CTT tasks week of 2026-01-23 - https://phabricator.wikimedia.org/T415328 [09:45:17] T415888: PHP Warning: Undefined 
property: Wikimedia\Parsoid\NodeData\NodeData::$mw - https://phabricator.wikimedia.org/T415888 [09:46:12] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2249.codfw.wmnet with OS trixie [09:46:29] (03PS1) 10Brouberol: idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235754 [09:46:39] (03CR) 10Brouberol: [C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235754 (owner: 10Brouberol) [09:46:41] (03CR) 10Brouberol: [V:03+2 C:03+2] idp_test: Add missing services [labs/private] - 10https://gerrit.wikimedia.org/r/1235754 (owner: 10Brouberol) [09:47:29] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7967/co" [puppet] - 10https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [09:48:14] (03Merged) 10jenkins-bot: Remove support for Python 3.9 [software/cumin] - 10https://gerrit.wikimedia.org/r/1224029 (owner: 10Volans) [09:48:50] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11572883 (10elukey) Hi @Sucheta-Salgaonkar-WMF! I'd need you to sign https://phabricator.wikimedia.org/L3 to kick off the process, in the meantime I'll seek Chris' appr... 
[09:48:59] (03Merged) 10jenkins-bot: Update deprecated type hints [software/cumin] - 10https://gerrit.wikimedia.org/r/1224030 (owner: 10Volans) [09:49:34] (03PS1) 10Dpogorzelski: ml-staging-codfw: k8s version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1235755 [09:50:28] (03PS1) 10Marostegui: Revert "db2249: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235756 [09:50:33] !log dpogorzelski@cumin1003 END (FAIL) - Cookbook sre.k8s.wipe-cluster (exit_code=99) Wipe the K8s cluster ml-staging-codfw: Kubernetes upgrade [09:51:08] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2249: After reimage [09:51:17] (03CR) 10Marostegui: [C:03+2] Revert "db2249: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235756 (owner: 10Marostegui) [09:51:35] brouberol: ok to merge all your changes? [09:53:01] oh, I forgot these private changes wind up in puppet-merge [09:53:01] tyes [09:53:03] *yes [09:53:05] sorry! [09:53:06] doing it! [09:53:12] brouberol: thanks, no worries! [09:53:13] thanks [09:54:35] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11572937 (10elukey) @lerickson Hi! Could you please add a little more context about the reason for the request? I am asking since we offer various access levels: https://wiki... [09:54:39] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1235746 (owner: 10Slyngshede) [09:55:37] (03PS1) 10Gehel: feat(airflow): add libssl1.1 to the airflow docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235757 (https://phabricator.wikimedia.org/T415667) [09:56:01] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11572947 (10elukey) @GGalofre-WMF Hi! Could you please add a little more context about the reason for the request? 
I am asking since we offer various access levels: https://wi... [09:56:13] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for ml-build [puppet] - 10https://gerrit.wikimedia.org/r/1235737 (owner: 10Muehlenhoff) [09:57:34] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11572963 (10elukey) 05Stalled→03Declined Declining for the moment since there seems to be no updates in a while. Please reopen if needed! :) [09:58:42] (03CR) 10Brouberol: [C:03+1] feat(airflow): add libssl1.1 to the airflow docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235757 (https://phabricator.wikimedia.org/T415667) (owner: 10Gehel) [09:58:45] 06SRE, 06ServiceOps new, 06Traffic: Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200#11572980 (10MLechvien-WMF) [09:59:07] 06SRE, 06ServiceOps new: Separate mediawiki latency metrics by endpoint - https://phabricator.wikimedia.org/T263727#11572985 (10MLechvien-WMF) [09:59:16] 06SRE, 06ServiceOps new, 10Continuous-Integration-Config, 06Release-Engineering-Team (Seen): operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855#11572984 (10MLechvien-WMF) [09:59:18] 06SRE, 10DNS, 06ServiceOps new, 06Traffic-Icebox: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability - https://phabricator.wikimedia.org/T162818#11572987 (10MLechvien-WMF) [10:01:25] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11573013 (10elukey) >>! In T413634#11505670, @bd808 wrote: > @DannyS712 In addition to your MediaWiki +2 which I just revoked, do you want to g... [10:02:38] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) 
for reedy - https://phabricator.wikimedia.org/T416062#11573035 (10elukey) Looping in @SLyngshede-WMF for a consult :) [10:09:16] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:09:39] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11573058 (10MoritzMuehlenhoff) Looks good to me, we don't have a formal approval process, but it seems fine to add him. I'll do that later and will also add Jesse since he recently the... [10:11:17] (03CR) 10Elukey: [C:03+2] "Checked via elukey@ldap-maint1001:~$ ldapsearch -x uid=j89, the change looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1234429 (https://phabricator.wikimedia.org/T414789) (owner: 10AOkoth) [10:11:22] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11573059 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [10:11:25] 06SRE, 10LDAP-Access-Requests: Grant Access to bitu-account-managers(?) for reedy - https://phabricator.wikimedia.org/T416062#11573062 (10LSobanski) Approved. [10:12:39] 06SRE, 06Data-Persistence, 06ServiceOps new, 07Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907#11573077 (10MLechvien-WMF) a:03Blake @Blake @Clement_Goubert @jasmine_ there are several child tasks open here, can we move them to a better home (... 
[10:12:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:14:57] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:15:17] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests, 13Patch-For-Review: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11573087 (10elukey) @FCeratto-WMF @Arnoldokoth - follow up note: https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Modify_LDAP_groups (cha... [10:16:16] (03CR) 10Elukey: [C:03+1] ml-staging-codfw: k8s version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1235755 (owner: 10Dpogorzelski) [10:16:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T415786)', diff saved to https://phabricator.wikimedia.org/P88371 and previous config saved to /var/cache/conftool/dbconfig/20260202-101658-marostegui.json [10:17:03] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:20:05] (03CR) 10Dpogorzelski: [C:03+2] ml-staging-codfw: k8s version upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1235755 (owner: 10Dpogorzelski) [10:20:15] !log dpogorzelski@cumin1003 START - Cookbook sre.k8s.wipe-cluster Wipe the K8s cluster ml-staging-codfw: Kubernetes upgrade [10:23:39] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled: ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers 
ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia [10:23:39] i/PyBal [10:23:39] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled: ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled: k8s-ingress-ml-staging_31443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia [10:23:39] i/PyBal [10:24:39] this is Dawid working on ml-staging --% [10:24:41] --^ [10:31:19] (03CR) 10Jelto: "beside the typo this looks good to me. Suggested edit in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [10:31:28] dpogorzelski@cumin1003 wipe-cluster (PID 909951) is awaiting input [10:36:42] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2249: After reimage [10:38:55] PROBLEM - Memcached on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [10:38:55] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:39:45] RECOVERY - Memcached on titan1002 is OK: TCP OK - 0.014 second response time on 10.64.48.167 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [10:39:45] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:11] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[10:41:05] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11573186 (10Johannnes89) Thanks! I just tested accessing Turnilo, everything works as expected :) [10:44:16] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:45:09] !log restarting Bitu on idm* [10:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:24] !incidents [10:47:24] 7378 (UNACKED) wmf - metamonitoring - prometheus - notified - vip is still DOWN [10:47:38] !ack [10:47:39] 7378 (ACKED) wmf - metamonitoring - prometheus - notified - vip is still DOWN [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1100) [11:05:25] OK, I am failing to even find out where I need to look for this ^ [11:06:33] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-codfw [11:07:33] (03PS1) 10Elukey: kubernetes: remove PodSecurityPolicy plugin from ml-staging's config [puppet] - 10https://gerrit.wikimedia.org/r/1235766 [11:07:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-codfw [11:08:14] (03CR) 10Elukey: [C:03+2] kubernetes: remove PodSecurityPolicy plugin from ml-staging's config [puppet] - 10https://gerrit.wikimedia.org/r/1235766 (owner: 10Elukey) [11:09:49] fyi, https://metamonitoring.wikimedia.org/prometheus/deadmanswitchnotified says ml-staging, so I am assuming this is related to dpogorzelski's ml-staging cluster wipe work. 
[11:10:26] (03PS2) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [11:10:31] anyway, alert ACKed, I think it should recover once that work is done [11:11:25] akosiaris: yeah he's upgrading the staging cluster to 1.31, thanks! [11:12:39] (03CR) 10CI reject: [V:04-1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [11:14:57] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling restart_daemons on A:schema-eqiad [11:15:42] (03PS3) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [11:16:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling restart_daemons on A:schema-eqiad [11:17:51] (03CR) 10CI reject: [V:04-1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [11:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:23:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:26:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:10] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding auglnkv [puppet] - 10https://gerrit.wikimedia.org/r/1235746 (owner: 10Slyngshede) [11:31:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T415786)', diff saved to https://phabricator.wikimedia.org/P88374 and previous config saved to /var/cache/conftool/dbconfig/20260202-113139-marostegui.json [11:31:46] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:35:50] (03PS1) 10Slyngshede: data.yaml: Offboarding sguebo [puppet] - 10https://gerrit.wikimedia.org/r/1235772 [11:38:12] akosiaris: Alerts from k8s-mlstaging have been suppressed (silence id: b0086861-4b7b-4e1a-907a-503eff4dc79b), so the Dead Man's Switch alert is not being received by the meta-monitoring tool (https://wikitech.wikimedia.org/wiki/Prometheus#Meta-Monitoring) [11:39:59] A cookbook to silence meta-monitoring during planned work is on the roadmap. 
[11:40:31] jouncebot: nowandnext [11:40:31] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1100) [11:40:32] In 2 hour(s) and 19 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1400) [11:40:59] I’d like to deploy some changes to PrivateSettings.php [11:41:20] I’ll wait to see if anyone uses the infrastructure window in 20 minutes [11:43:57] (03PS1) 10Jakob: Enable Wikibase GraphQL on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235773 (https://phabricator.wikimedia.org/T415516) [11:44:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235773 (https://phabricator.wikimedia.org/T415516) (owner: 10Jakob) [11:46:35] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AUgolnikova out of all services on: 2487 hosts [11:46:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P88375 and previous config saved to /var/cache/conftool/dbconfig/20260202-114648-marostegui.json [11:51:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T415786)', diff saved to https://phabricator.wikimedia.org/P88376 and previous config saved to /var/cache/conftool/dbconfig/20260202-115142-marostegui.json [11:52:03] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:55:56] (03CR) 10Btullis: [C:03+1] feat(airflow): add libssl1.1 to the airflow docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235757 (https://phabricator.wikimedia.org/T415667) (owner: 10Gehel) [11:57:13] (03PS1) 10Marostegui: Revert 
"db1222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235774 [11:57:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1222: After schema change [11:58:03] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.newpool (exit_code=99) pool db1222: After schema change [11:58:08] (03CR) 10Marostegui: [C:03+2] Revert "db1222: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235774 (owner: 10Marostegui) [11:59:42] (03PS1) 10Marostegui: Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235775 [12:00:07] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1222: After schema change [12:00:27] (03CR) 10Marostegui: [C:03+2] Revert "db1193: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1235775 (owner: 10Marostegui) [12:00:28] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1193: After schema change [12:02:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P88379 and previous config saved to /var/cache/conftool/dbconfig/20260202-120157-marostegui.json [12:02:27] starting PS sync [12:03:57] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:40] (03CR) 10Slyngshede: [C:03+1] idp: flip oidc_id_token_claims to true [puppet] - 10https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [12:05:47] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:06:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1235772 (owner: 10Slyngshede) [12:06:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P88380 and previous config saved to /var/cache/conftool/dbconfig/20260202-120654-marostegui.json [12:07:32] (03CR) 10Brouberol: [V:03+1 C:03+2] idp: flip oidc_id_token_claims to true [puppet] - 10https://gerrit.wikimedia.org/r/1235739 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [12:08:20] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [12:11:48] jmm@cumin2002 roll-restart-reboot-docker-registry (PID 3574877) is awaiting input [12:11:59] done with the privatesettings sync [12:16:01] (03PS1) 10Slyngshede: idp: Failover to idp2005 [dns] - 10https://gerrit.wikimedia.org/r/1235779 [12:16:09] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding sguebo [puppet] - 10https://gerrit.wikimedia.org/r/1235772 (owner: 10Slyngshede) [12:16:10] jmm@cumin2002 roll-restart-reboot-docker-registry (PID 3574877) is awaiting input [12:17:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T415786)', diff saved to https://phabricator.wikimedia.org/P88383 and previous config saved to /var/cache/conftool/dbconfig/20260202-121707-marostegui.json [12:17:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [12:17:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [12:17:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T415786)', diff saved to https://phabricator.wikimedia.org/P88384 and previous config saved to 
/var/cache/conftool/dbconfig/20260202-121735-marostegui.json [12:19:16] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:19:59] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1235781 (owner: 10L10n-bot) [12:20:06] tappof: thanks! [12:21:40] FIRING: [5x] KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:22:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P88385 and previous config saved to /var/cache/conftool/dbconfig/20260202-122203-marostegui.json [12:25:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1235779 (owner: 10Slyngshede) [12:27:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [12:29:18] (03CR) 10Slyngshede: [C:03+2] idp: Failover to idp2005 [dns] - 10https://gerrit.wikimedia.org/r/1235779 (owner: 10Slyngshede) [12:29:33] !log slyngshede@dns1004 START - running authdns-update [12:30:19] (03PS1) 10Muehlenhoff: sre.misc-clusters.roll-restart-reboot-docker-registry: Fix service names [cookbooks] - 10https://gerrit.wikimedia.org/r/1235790 [12:30:48] !log slyngshede@dns1004 END - running authdns-update [12:31:06] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest2006.codfw.wmnet [12:33:54] !log restarting nginx on puppetdb hosts [12:33:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:20] (03PS1) 10Brouberol: growthbook: re-integrate production release with idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235792 (https://phabricator.wikimedia.org/T411752) [12:36:36] (03PS2) 10Brouberol: growthbook: re-integrate production release with idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235792 (https://phabricator.wikimedia.org/T411752) [12:37:01] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Samuel (WMF) out of all services on: 2487 hosts [12:37:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T415786)', diff saved to https://phabricator.wikimedia.org/P88388 and previous config saved to /var/cache/conftool/dbconfig/20260202-123712-marostegui.json [12:37:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [12:37:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [12:37:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2177 (T415786)', diff saved to https://phabricator.wikimedia.org/P88389 and previous config saved to /var/cache/conftool/dbconfig/20260202-123726-marostegui.json [12:38:32] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db1222: After schema change [12:46:06] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.newpool (exit_code=99) pool db1193: After schema change [12:47:48] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11573523 (10Jclark-ctr) a:03Jclark-ctr [12:53:44] FIRING: [3x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:54:12] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1193: After schema change [12:54:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db1193: After schema change [12:54:29] (03PS1) 10Muehlenhoff: sre.cdn.roll-restart-reboot-ncredir: Fix construction of aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1235794 [12:55:46] !log restarting Postfix on the MXes to pick up OpenSSL security updates [12:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:44] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:59:21] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11573550 (10Jclark-ctr) @VRiley-WMF, you should leave this ticket open until @BTullis has had the opportunity to complete the final step of adding the disk to the RAID. [13:07:05] (03CR) 10Pmiazga: "can you solve merge conflicts in chart.mk and service.mk? then we can try to merge it later today as it seems to be no-op for prod" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 (owner: 10Daniel Kinzler) [13:07:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1175 - https://phabricator.wikimedia.org/T396703#11573571 (10Jclark-ctr) @VRiley-WMF, you should leave this ticket open until @BTullis has had the opportunity to complete the final step of adding the disk to the RAID. 
[13:12:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Follow-up: Degraded Disk Not Yet Added to RAID (an-worker1175, an-worker1199) - https://phabricator.wikimedia.org/T416166 (10Jclark-ctr) 03NEW [13:14:17] (03PS1) 10Giuseppe Lavagetto: Fix policy no reason [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1235796 [13:14:31] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Fix policy no reason [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1235796 (owner: 10Giuseppe Lavagetto) [13:15:23] (03CR) 10Btullis: [C:03+1] growthbook: re-integrate production release with idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235792 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:15:36] (03CR) 10Brouberol: [C:03+2] growthbook: re-integrate production release with idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235792 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [13:16:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11573613 (10Arnoldokoth) Thanks @elukey [13:16:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lerickson - https://phabricator.wikimedia.org/T415406#11573615 (10Arnoldokoth) a:05Arnoldokoth→03elukey [13:16:44] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix bugs with no reason policy and haproxy actions - oblivian@cumin1003" [13:16:46] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bugs with no reason policy and haproxy actions - oblivian@cumin1003 [13:16:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [13:17:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for 
ggalofre - https://phabricator.wikimedia.org/T415172#11573619 (10Arnoldokoth) a:05Arnoldokoth→03elukey [13:17:35] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix bugs with no reason policy and haproxy actions - oblivian@cumin1003 [13:17:37] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix bugs with no reason policy and haproxy actions - oblivian@cumin1003" [13:17:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [13:22:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11573636 (10Jclark-ctr) Dell ticket You have successfully submitted request SR222095997. [13:23:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:27:10] !log installing Postgresql 15 security updates [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:32:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:33:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T415786)', diff saved to https://phabricator.wikimedia.org/P88392 and previous config saved to /var/cache/conftool/dbconfig/20260202-133319-marostegui.json [13:33:25] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:34:58] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire 
on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:38:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11573680 (10Jclark-ctr) @elukey @RobH @Papaul @BTullis we have had addtional failures since the last comment an-worker1199 recently Not random disk... [13:42:01] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11573683 (10Jclark-ctr) @akosiaris This is the ticket previously discussed over irc. if these did not need to be as diverse and split between two racks in old cage. @ayounsi do you know if there is... [13:43:16] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11573686 (10Jclark-ctr) [13:45:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:47:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55711 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:47:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:48:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P88393 and previous config saved to /var/cache/conftool/dbconfig/20260202-134827-marostegui.json [13:49:16] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:51] (03PS1) 10Santiago Faci: Test Kitchen renaming: Updated references to old names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235798 (https://phabricator.wikimedia.org/T415843) [13:57:37] (03PS1) 10Joal: Update druid-analytics middlemanager JVM settings [puppet] - 10https://gerrit.wikimedia.org/r/1235800 (https://phabricator.wikimedia.org/T415799) [13:57:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:58:28] (03PS1) 10Dpogorzelski: ml-stading-codfw: Patch configs to adapt to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1400). 
[14:00:05] kipfel, joal, and jakob_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:38] o/
[14:00:42] o/
[14:00:42] o/
[14:00:50] I can deploy in principle
[14:00:59] though I looked at kipfel’s change earlier and I’m not yet convinced there’s community consensus
[14:01:08] so I might have to take another look at that
[14:01:43] joal: do you want to start with your change in the meantime?
[14:01:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:02:13] (CR) Gehel: [C:+2] feat(airflow): add libssl1.1 to the airflow docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1235757 (https://phabricator.wikimedia.org/T415667) (owner: Gehel)
[14:02:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:02:25] Works for me Lucas_WMDE
[14:03:03] Lucas_WMDE: the change is tiny, very low risk :)
[14:03:18] do you want to deploy it yourself or should I?
[14:03:35] Lucas_WMDE: I'm happy to try, but I'm completely unsure if I have permissions
[14:03:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P88394 and previous config saved to /var/cache/conftool/dbconfig/20260202-140336-marostegui.json
[14:03:48] I'm no SRE :)
[14:03:55] you’re in the deployment group according to puppet.git, at least
[14:03:58] * Lucas_WMDE isn’t an SRE either
[14:04:02] but I don’t know if you have spiderpig access
[14:04:11] I have not yet done that
[14:04:25] ok, then I can deploy
[14:04:51] Ack - the patch needs to be merged first - not yet done
[14:05:06] yes, that happens as part of the deployment
[14:05:13] great
[14:05:17] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1234457 (https://phabricator.wikimedia.org/T415638) (owner: Joal)
[14:06:06] (Merged) jenkins-bot: Update ext-EventStreamConfig [mediawiki-config] - https://gerrit.wikimedia.org/r/1234457 (https://phabricator.wikimedia.org/T415638) (owner: Joal)
[14:06:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[14:06:26] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1234457|Update ext-EventStreamConfig (T415638)]]
[14:06:31] T415638: Make canary-events for the `resource_change` stream - https://phabricator.wikimedia.org/T415638
[14:07:04] the proposal has been discussed for a long time, the objection has been answered by multiple users and believed is not valid enough, so I think it do has a consensus at the moment
[14:07:20] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:07:20] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:25] but it's fine to have someone else to have a look Lucas_WMDE [14:07:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [14:08:09] (03CR) 10Elukey: "Left a couple of comments, lemme know!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 (owner: 10Dpogorzelski) [14:08:19] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, joal: Backport for [[gerrit:1234457|Update ext-EventStreamConfig (T415638)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:20] (03CR) 10Btullis: [C:03+2] Update druid-analytics middlemanager JVM settings [puppet] - 10https://gerrit.wikimedia.org/r/1235800 (https://phabricator.wikimedia.org/T415799) (owner: 10Joal) [14:08:30] (03CR) 10Btullis: [C:03+2] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1235800 (https://phabricator.wikimedia.org/T415799) (owner: 10Joal) [14:09:01] joal: can you test the change on mwdebug? [14:09:16] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:09:24] Yes Lucas_WMDE [14:12:59] Lucas_WMDE: I confirm it's all good for me [14:13:01] thanks! [14:13:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, joal: Continuing with sync [14:15:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11573802 (10BTullis) >>! In T415002#11550039, @Jclark-ctr wrote: > The system is currently set to Performance Per Watt (OS). Given the intermittent disk dr... 
[14:17:10] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1234457|Update ext-EventStreamConfig (T415638)]] (duration: 10m 45s) [14:17:15] T415638: Make canary-events for the `resource_change` stream - https://phabricator.wikimedia.org/T415638 [14:18:05] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] zhwiki: Remove extra autoconfirmed limit for Tor user (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [14:18:12] (03PS2) 10Dpogorzelski: ml-staging-codfw: Patch configs to adapt to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 [14:18:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [14:18:35] (03CR) 10Dpogorzelski: ml-staging-codfw: Patch configs to adapt to k8s 1.31 (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 (owner: 10Dpogorzelski) [14:18:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T415786)', diff saved to https://phabricator.wikimedia.org/P88395 and previous config saved to /var/cache/conftool/dbconfig/20260202-141844-marostegui.json [14:18:50] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [14:18:54] Lucas_WMDE: I confirm the change is live for me - thank you so much for the deploy :) [14:19:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [14:19:11] (03Merged) 10jenkins-bot: zhwiki: Remove extra autoconfirmed limit for Tor user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230708 (https://phabricator.wikimedia.org/T415335) (owner: 10Stang) [14:19:12] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Depooling db1175 (T415786)', diff saved to https://phabricator.wikimedia.org/P88396 and previous config saved to /var/cache/conftool/dbconfig/20260202-141910-marostegui.json [14:19:30] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1230708|zhwiki: Remove extra autoconfirmed limit for Tor user (T415335)]] [14:19:34] T415335: Remove extra autoconfirmed limit for Tor user on zhwiki - https://phabricator.wikimedia.org/T415335 [14:20:50] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:21:23] (03CR) 10Elukey: [C:03+1] ml-staging-codfw: Patch configs to adapt to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 (owner: 10Dpogorzelski) [14:21:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, stang: Backport for [[gerrit:1230708|zhwiki: Remove extra autoconfirmed limit for Tor user (T415335)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:21:35] kipfel: anything to test on mwdebug? 
[14:21:47] (03CR) 10Elukey: [C:03+1] sre.misc-clusters.roll-restart-reboot-docker-registry: Fix service names [cookbooks] - 10https://gerrit.wikimedia.org/r/1235790 (owner: 10Muehlenhoff) [14:22:14] (03CR) 10Elukey: [C:03+1] sre.cdn.roll-restart-reboot-ncredir: Fix construction of aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1235794 (owner: 10Muehlenhoff) [14:22:18] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 196.08 ms [14:23:10] Lucas_WMDE, to be honest i dont think this can be easily tested, and enwiki's patch did not test itself [14:23:15] (03PS4) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [14:23:20] yeah, that’s what I expected ^^ [14:23:25] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, stang: Continuing with sync [14:23:40] so direct deploy will be fine i think [14:25:23] (03CR) 10CI reject: [V:04-1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:26:25] (03PS5) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [14:27:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55711 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:21] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1230708|zhwiki: Remove extra autoconfirmed limit for Tor user (T415335)]] (duration: 07m 51s) [14:27:29] T415335: 
Remove extra autoconfirmed limit for Tor user on zhwiki - https://phabricator.wikimedia.org/T415335 [14:27:43] jakob_WMDE: do you want to deploy your config change yourself? [14:28:36] (03CR) 10CI reject: [V:04-1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [14:28:44] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:31:00] (03CR) 10Dpogorzelski: [C:03+2] ml-staging-codfw: Patch configs to adapt to k8s 1.31 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235801 (owner: 10Dpogorzelski) [14:31:52] 06SRE, 10Bitu, 06Infrastructure-Foundations: Bitu: In account blocking also allow to remove an email address - https://phabricator.wikimedia.org/T404430#11573898 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [14:32:41] jakob_WMDE: are you around? [14:33:38] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:33:43] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:35:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:35:30] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:36:23] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:36:33] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:38:44] RESOLVED: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:39:28] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:41:40] RESOLVED: [5x] KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:43:02] Lucas_WMDE: sorry, yes, I'm around! I'd rather have someone else deploy the change [14:43:37] ok, sure [14:44:08] sorry, I mixed up the time zones [14:44:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235773 (https://phabricator.wikimedia.org/T415516) (owner: 10Jakob) [14:44:50] !log restart vrts-daemon on vrts1003 [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] (03PS1) 10Muehlenhoff: Add Cumin alias for staging maps node(s) [puppet] - 10https://gerrit.wikimedia.org/r/1235810 [14:44:57] yeah I saw it in your calendar [14:44:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:12] (03Merged) 10jenkins-bot: Enable Wikibase GraphQL on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235773 (https://phabricator.wikimedia.org/T415516) (owner: 10Jakob) [14:45:13] was gonna ping you on Mattermost after a few more minutes but got distracted by https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/233 [14:45:16] ^^ [14:45:30] !log lucaswerkmeister-wmde@deploy2002 
Started scap sync-world: Backport for [[gerrit:1235773|Enable Wikibase GraphQL on beta wikidata (T415516)]] [14:45:38] T415516: Enable Wikibase GraphQL on Beta - https://phabricator.wikimedia.org/T415516 [14:47:23] !log lucaswerkmeister-wmde@deploy2002 jakob, lucaswerkmeister-wmde: Backport for [[gerrit:1235773|Enable Wikibase GraphQL on beta wikidata (T415516)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:47:36] (03PS6) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [14:47:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:58] (03PS2) 10Bking: apt: mirror opensearch 2 and 3 repos in trixie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1235075 (https://phabricator.wikimedia.org/T415699) [14:47:59] jakob_WMDE: can you test that it *isn’t* enabled on production with mwdebug? ^^ [14:48:04] just to be sure [14:48:13] (03CR) 10Bking: apt: mirror opensearch 2 and 3 repos in trixie-wikimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1235075 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:48:14] Lucas_WMDE: yup, sec [14:48:29] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:49:01] Lucas_WMDE: wait, should I be seeing it on beta already? [14:49:09] no, that would take up to 10 more minutes [14:49:16] ah ok [14:49:18] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11573968 (10Arnoldokoth) @elukey Ack. Thank you. 
[14:49:23] unless the beta cluster happened to update at just the right time I guess [14:49:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1235075 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:50:25] (03CR) 10Muehlenhoff: [C:03+2] sre.cdn.roll-restart-reboot-ncredir: Fix construction of aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1235794 (owner: 10Muehlenhoff) [14:51:08] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:51:38] Lucas_WMDE: I can confirm that I don't see it on prod with mwdebug :) [14:51:45] yay, thanks ^^ [14:51:57] nothing special in mwdebug logstash either [14:52:00] !log lucaswerkmeister-wmde@deploy2002 jakob, lucaswerkmeister-wmde: Continuing with sync [14:52:06] just a warning “Message blob for wikibase.vector.scopedTypeaheadSearch should have been preloaded” [14:52:24] (which is apparently T409033) [14:52:25] T409033: "Message blob for wikibase.vector.scopedTypeaheadSearch should have been preloaded" - https://phabricator.wikimedia.org/T409033 [14:53:24] (03CR) 10Brouberol: [C:03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1223183 (owner: 10Muehlenhoff) [14:53:59] FIRING: [8x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:54:25] (03CR) 10Bking: [C:03+2] apt: mirror opensearch 2 and 3 repos in trixie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1235075 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [14:54:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T415786)', diff saved to https://phabricator.wikimedia.org/P88397 and previous config saved to /var/cache/conftool/dbconfig/20260202-145445-marostegui.json [14:54:54] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [14:56:00] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235773|Enable Wikibase GraphQL on beta wikidata (T415516)]] (duration: 10m 30s) [14:56:07] T415516: Enable Wikibase GraphQL on Beta - https://phabricator.wikimedia.org/T415516 [14:56:43] (03PS1) 10Muehlenhoff: sre.cdn.roll-restart-reboot-ncredir: Fix one more syntax error [cookbooks] - 10https://gerrit.wikimedia.org/r/1235814 [14:57:17] (03CR) 10Muehlenhoff: [C:03+2] sre.misc-clusters.roll-restart-reboot-docker-registry: Fix service names [cookbooks] - 10https://gerrit.wikimedia.org/r/1235790 (owner: 10Muehlenhoff) [14:58:26] !log UTC afternoon backport+config window done [14:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:45] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:00:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:00:55] !log restarting mailman-web on lists1004 to pick up openssl security updates
[15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:06:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:07:02] !log restarting Exim on lists1004 to pick up openssl security updates
[15:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:09:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P88398 and previous config saved to /var/cache/conftool/dbconfig/20260202-150955-marostegui.json
[15:10:39] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:11:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:11:28] (03PS7) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [15:12:44] (03PS1) 10Muehlenhoff: mailman: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235817 (https://phabricator.wikimedia.org/T135991) [15:12:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:19] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:13:59] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:14:01] Lucas_WMDE: everything works fine on beta now btw! thanks for deploying! [15:14:16] FIRING: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:40] RESOLVED: [5x] KubernetesRsyslogDown: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:14:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:11] RESOLVED: [3x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:33] (03CR) 10Elukey: "Personal opinion - with some extra documentation this new method will be simpler to review in some months time." [software/cumin] - 10https://gerrit.wikimedia.org/r/1224032 (owner: 10Volans) [15:19:05] !log restarting Mailman on lists1004 to pick up openssl security updates [15:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:19:39] (03CR) 10Elukey: [C:03+1] tests: fix integration tests error handling [software/cumin] - 10https://gerrit.wikimedia.org/r/1224034 (owner: 10Volans) [15:22:40] jakob_WMDE: \o/ [15:24:31] (03PS1) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) [15:25:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P88399 and previous config saved to /var/cache/conftool/dbconfig/20260202-152503-marostegui.json [15:29:10] (03PS1) 10Muehlenhoff: mailman: Enable profile::auto_restarts::service for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1235824 (https://phabricator.wikimedia.org/T135991) [15:30:05] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1530) [15:33:08] (03CR) 10Brouberol: [C:03+1] Add trixie-based openjdk-21-jre image [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [15:34:12] (03CR) 10Btullis: "I wonder whether we should update the openjdk-21-jdk image in the same patch, so that it builds from this one." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [15:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T415786)', diff saved to https://phabricator.wikimedia.org/P88400 and previous config saved to /var/cache/conftool/dbconfig/20260202-153447-marostegui.json [15:34:53] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:40:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T415786)', diff saved to https://phabricator.wikimedia.org/P88401 and previous config saved to /var/cache/conftool/dbconfig/20260202-154013-marostegui.json [15:40:19] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:40:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2190.codfw.wmnet with reason: Maintenance [15:40:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2190 (T415786)', diff saved to https://phabricator.wikimedia.org/P88402 and previous config saved to /var/cache/conftool/dbconfig/20260202-154038-marostegui.json [15:41:34] (03PS2) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) [15:41:46] (03PS1) 
10Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 [15:45:46] (03PS1) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) [15:46:08] (03CR) 10Elukey: "The bump of the changelogs are missing :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 (owner: 10Dpogorzelski) [15:46:09] (03CR) 10Muehlenhoff: Add trixie-based openjdk-21-jre image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [15:49:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P88403 and previous config saved to /var/cache/conftool/dbconfig/20260202-154956-marostegui.json [15:51:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1235824 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:52:06] (03CR) 10Majavah: [C:03+2] hieradata: Use dedicated memcache user by default in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1230336 (https://phabricator.wikimedia.org/T273950) (owner: 10Majavah) [15:52:21] (03PS8) 10Elukey: dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) [15:53:48] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [15:53:56] (03CR) 10Elukey: "This is an attempt to add a k8s Secret to hold the ssh private key needed to rsync files to puppetservers, lemme know!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [15:55:29] (03PS2) 10Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 [15:56:15] (03PS3) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) [15:56:32] (03CR) 10Bking: Add trixie-based openjdk-21-jre image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [15:57:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1235824 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:59:30] (03PS4) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) [16:00:41] (03CR) 10Muehlenhoff: "Two nits inline, otherwise LTGM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:00:54] 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11574372 (10Gehel) [16:02:15] (03CR) 10Elukey: "Sorry left some nits about the changelogs, docker-pkg throws some warnings etc.. if we diverge from the usual style. After it I think we a" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 (owner: 10Dpogorzelski) [16:03:17] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for trueg - https://phabricator.wikimedia.org/T415632#11574385 (10thcipriani) sorry for the delay, approved for the `deployment` group. 
[16:04:02] (03CR) 10Brouberol: "I had a look in the secret file, and I'm indeed seeing a private key, which is great. However, it seems that the key is `puppetservers-rsy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:05:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260202-160504-marostegui.json [16:05:40] (03PS5) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) [16:06:41] (03CR) 10Bking: Add trixie-based openjdk-21-jre image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:07:10] (03CR) 10Elukey: "Fixed, didn't realized that I made a typo :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:07:37] (03PS3) 10Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 [16:07:58] (03CR) 10Dpogorzelski: kserve: update image to 0.16 (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 (owner: 10Dpogorzelski) [16:08:08] (03CR) 10Alexandros Kosiaris: "Thanks for the review. PCC still looks good, merging this. Should be a noop across the board." 
[puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [16:08:10] (03CR) 10Alexandros Kosiaris: [C:03+2] base::sysctl: Allow more finegrained rp_filter behavior [puppet] - 10https://gerrit.wikimedia.org/r/1228582 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [16:09:30] 10ops-eqsin: Unresponsive management for cp5022.mgmt:22 - https://phabricator.wikimedia.org/T416193 (10phaultfinder) 03NEW [16:09:38] (03PS3) 10Tiziano Fogli: thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1235829 (https://phabricator.wikimedia.org/T410152) [16:12:07] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for Johannnes89 - https://phabricator.wikimedia.org/T414789#11574461 (10elukey) 05In progress→03Resolved [16:13:25] (03CR) 10Bking: [C:03+2] Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:13:33] (03CR) 10Bking: [V:03+2 C:03+2] Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235823 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:13:36] (03PS1) 10Muehlenhoff: Kerberos: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235834 (https://phabricator.wikimedia.org/T135991) [16:14:02] (03PS1) 10Dpogorzelski: kserve: update to version 0.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235835 [16:15:31] (03CR) 10CI reject: [V:04-1] kserve: update to version 0.16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235835 (owner: 10Dpogorzelski) [16:15:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1235834 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:17:11] (03PS1) 10Brouberol: 
hadoop/yarn: allow analytics-sre to submit jobs to the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1235837 (https://phabricator.wikimedia.org/T402512) [16:18:38] (03CR) 10Brouberol: [C:03+1] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:19:59] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1235837 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:20:12] (03CR) 10Elukey: [C:03+1] hadoop/yarn: allow analytics-sre to submit jobs to the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1235837 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:20:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T415786)', diff saved to https://phabricator.wikimedia.org/P88405 and previous config saved to /var/cache/conftool/dbconfig/20260202-162017-marostegui.json [16:20:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:20:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:20:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1189 (T415786)', diff saved to https://phabricator.wikimedia.org/P88406 and previous config saved to /var/cache/conftool/dbconfig/20260202-162042-marostegui.json [16:21:52] (03CR) 10Joal: "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1235837 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:21:58] (03CR) 10Brouberol: [C:03+2] hadoop/yarn: allow analytics-sre to submit jobs to the production queue [puppet] - 10https://gerrit.wikimedia.org/r/1235837 (https://phabricator.wikimedia.org/T402512) (owner: 10Brouberol) [16:22:00] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove scap_proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: 10Alexandros Kosiaris) [16:22:08] (03CR) 10Alexandros Kosiaris: [C:03+2] "Merging, thanks for the +1" [puppet] - 10https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: 10Alexandros Kosiaris) [16:23:27] (03CR) 10Volans: transports: add shortened method to Command class (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/1224032 (owner: 10Volans) [16:30:05] jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1630). 
[16:33:24] (03PS4) 10Tiziano Fogli: thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1235829 (https://phabricator.wikimedia.org/T410152) [16:33:30] (03CR) 10Elukey: [C:03+2] dse-k8s-services: add service-secrets to airflow-sre's helmfile config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230970 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [16:38:44] (03PS1) 10Bking: Correct formatting errors for java 21 image changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) [16:39:49] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11574722 (10dancy) >>! In T412951#11572476, @elukey wrote: > @dancy @Scott_French I think we are ready to m... [16:40:20] !log dancy@deploy2002 Installing scap version "4.241.0" for 2 host(s) [16:42:11] !log dancy@deploy2002 Installation of scap version "4.241.0" completed for 2 hosts [16:42:51] (03CR) 10Brouberol: Correct formatting errors for java 21 image changelogs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:43:28] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11574743 (10calbon) I approve! 
[16:44:00] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:44:37] (03PS2) 10Bking: Correct formatting errors for java 21 image changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) [16:44:55] (03CR) 10Bking: Correct formatting errors for java 21 image changelogs (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:45:06] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11574750 (10BTullis) I've checked with @JerryWang-WMF and he would like to include [[http... [16:45:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:47:48] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:48:25] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-sre: sync [16:49:12] !log elukey@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-sre: sync [16:50:41] (03CR) 10Brouberol: [C:03+1] Correct formatting errors for java 21 image changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:51:54] (03CR) 10SBassett: [C:03+1] WikimediaCustomizations: Set WMCBadEmailDomainsFile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) (owner: 10Gergő Tisza) [16:51:59] (03CR) 10Bking: [V:03+2 C:03+2] Correct formatting errors for java 21 image changelogs [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235839 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [16:55:11] (03CR) 10Elukey: "I'd suggest to build this locally to find little mistakes, I was about to add +1 and then I realized the changelog typos :(" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 (owner: 10Dpogorzelski) [17:07:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11574885 (10Sucheta-Salgaonkar-WMF) Thanks so much @elukey and Chris!! I attempted to sign https://phabricator.wikimedia.org/L3 but am asked for MFA credentials in the... [17:09:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11574902 (10elukey) @Sucheta-Salgaonkar-WMF strange, do you have MFA for Phabricator? It shouldn't ask you anything if you are logged in, in theory, but it has been a w... 
[17:12:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11574906 (10Sucheta-Salgaonkar-WMF) @elukey I thought I hadn't set it up, but I guess I must have done so at some point in my onboarding and totally forgot... sorry abo... [17:36:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T415786)', diff saved to https://phabricator.wikimedia.org/P88407 and previous config saved to /var/cache/conftool/dbconfig/20260202-173616-marostegui.json [17:36:20] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:46:58] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:47:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T415786)', diff saved to https://phabricator.wikimedia.org/P88408 and previous config saved to /var/cache/conftool/dbconfig/20260202-174721-marostegui.json [17:47:26] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:48:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 04 Apr 2026 07:22:16 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P88409 and previous config saved to /var/cache/conftool/dbconfig/20260202-175125-marostegui.json [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1800) [18:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T1800). [18:01:33] (03PS1) 10Bking: Rollback broken java image commits [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235849 (https://phabricator.wikimedia.org/T415699) [18:02:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P88410 and previous config saved to /var/cache/conftool/dbconfig/20260202-180230-marostegui.json [18:03:20] (03PS2) 10Bking: Rollback broken java image commits [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235849 (https://phabricator.wikimedia.org/T415699) [18:04:48] 07Puppet, 06ServiceOps new, 10ServiceOps-good-first-task, 13Patch-For-Review, 07Serviceops-easywins: network::constants::mw_appserver_networks is out of date (or named poorly?) - https://phabricator.wikimedia.org/T411508#11575180 (10MLechvien-WMF) [18:06:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P88411 and previous config saved to /var/cache/conftool/dbconfig/20260202-180633-marostegui.json [18:09:16] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:09:48] (03PS1) 10Bvibber: Revert "Update chart-renderer service for Parsoid template fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235853 (https://phabricator.wikimedia.org/T411319) [18:16:38] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging as things are currently broken" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235849 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [18:17:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after 
maintenance db2190', diff saved to https://phabricator.wikimedia.org/P88412 and previous config saved to /var/cache/conftool/dbconfig/20260202-181739-marostegui.json [18:19:38] (03PS3) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [18:21:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T415786)', diff saved to https://phabricator.wikimedia.org/P88413 and previous config saved to /var/cache/conftool/dbconfig/20260202-182144-marostegui.json [18:21:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:22:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [18:22:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1198 (T415786)', diff saved to https://phabricator.wikimedia.org/P88414 and previous config saved to /var/cache/conftool/dbconfig/20260202-182210-marostegui.json [18:23:12] (03PS1) 10Bking: Add trixie-based openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235855 (https://phabricator.wikimedia.org/T415699) [18:27:42] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging, as the same changes have already been approved in prior commits that had to be rolled back" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235855 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [18:29:21] (03CR) 10Bvibber: [C:03+2] "Self-merging revert to previous, known-good version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235853 (https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:31:17] (03Merged) 10jenkins-bot: Revert "Update chart-renderer service for Parsoid template fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235853 (https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:32:49] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T415786)', diff saved to https://phabricator.wikimedia.org/P88415 and previous config saved to /var/cache/conftool/dbconfig/20260202-183248-marostegui.json [18:32:51] fuck that was the wrong revert [18:32:53] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:33:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2194.codfw.wmnet with reason: Maintenance [18:33:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88416 and previous config saved to /var/cache/conftool/dbconfig/20260202-183312-marostegui.json [18:35:00] (03PS1) 10Bvibber: Reapply "Update chart-renderer service for Parsoid template fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235858 (https://phabricator.wikimedia.org/T411319) [18:35:37] (03PS1) 10Muehlenhoff: Record LDAP access for mpostoronca [puppet] - 10https://gerrit.wikimedia.org/r/1235859 [18:35:57] (03PS1) 10Bvibber: Revert "Update chart-renderer to 2026-01-29-153835-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235860 (https://phabricator.wikimedia.org/T411319) [18:36:32] (03CR) 10Bvibber: [C:03+2] "self-merge undoing revert of wrong commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235858 (https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:37:00] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for mpostoronca [puppet] - 10https://gerrit.wikimedia.org/r/1235859 (owner: 10Muehlenhoff) [18:37:07] (03PS1) 10Bking: openjdk-21-jre: fix changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235861 (https://phabricator.wikimedia.org/T415699) [18:37:16] (03CR) 10Bvibber: [C:03+2] "self-merging revert to previous known-good version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235860 
(https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:37:41] (03CR) 10Bking: [V:03+2 C:03+2] openjdk-21-jre: fix changelog entry [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235861 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [18:38:25] (03Merged) 10jenkins-bot: Reapply "Update chart-renderer service for Parsoid template fix" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235858 (https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:39:12] (03Merged) 10jenkins-bot: Revert "Update chart-renderer to 2026-01-29-153835-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235860 (https://phabricator.wikimedia.org/T411319) (owner: 10Bvibber) [18:40:30] !log bvibber@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [18:40:47] !log bvibber@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [18:41:28] !log bvibber@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply [18:41:58] !log bvibber@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply [18:42:09] !log bvibber@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply [18:42:39] !log bvibber@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply [18:48:59] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:53:59] FIRING: [6x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133212 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:08:59] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:09:22] (03CR) 10Herron: [C:03+1] thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1235829 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [19:16:47] (03CR) 10Scott French: [C:03+1] "Thanks, Luca!" [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [19:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:24:54] 10ops-eqiad, 06DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416234 (10phaultfinder) 03NEW [19:30:18] (03CR) 10Gergő Tisza: WikimediaCustomizations: Set WMCBadEmailDomainsFile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) (owner: 10Gergő Tisza) [19:30:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) (owner: 10Gergő Tisza) [19:32:29] (03PS2) 10Daniel Kinzler: rest gateway: include service values.yaml when testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 [19:38:37] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T415786)', diff saved to https://phabricator.wikimedia.org/P88417 and previous config saved to /var/cache/conftool/dbconfig/20260202-193837-marostegui.json [19:38:42] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [19:40:37] (03PS1) 10Bking: Rename `openjdk-21-jre` bookworm image to `openjdk-21-jre-bookworm` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235869 [19:41:42] (03PS2) 10Bking: Rename `openjdk-21-jre` bookworm image to `openjdk-21-jre-bookworm` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235869 [19:44:22] (03PS3) 10Bking: Rename `openjdk-21-jre` bookworm image to `openjdk-21-jre-bookworm` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235869 [19:47:12] (03CR) 10Bking: [V:03+2 C:03+2] Rename `openjdk-21-jre` bookworm image to `openjdk-21-jre-bookworm` [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235869 (owner: 10Bking) [19:53:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P88418 and previous config saved to /var/cache/conftool/dbconfig/20260202-195345-marostegui.json [19:58:29] (03PS1) 10Bking: openjdk-21-jdk: source image from new openjdk-21-jre image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235870 (https://phabricator.wikimedia.org/T415699) [20:06:57] (03PS16) 10Daniel Kinzler: rest gateway: add tests for chart rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [20:08:30] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1235872 [20:08:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P88419 and previous config saved to /var/cache/conftool/dbconfig/20260202-200855-marostegui.json 
[20:11:33] (03PS1) 10Bking: openjdk-21-jre: fix changelog date formatting [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235873 (https://phabricator.wikimedia.org/T415699) [20:13:06] (03CR) 10Bking: [V:03+2 C:03+2] openjdk-21-jre: fix changelog date formatting [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235873 (https://phabricator.wikimedia.org/T415699) (owner: 10Bking) [20:13:07] (03PS6) 10Daniel Kinzler: rest gateway: implement per-policy shadow mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225699 (https://phabricator.wikimedia.org/T413183) [20:13:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235392 (https://phabricator.wikimedia.org/T411914) (owner: 10DLynch) [20:13:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235111 (https://phabricator.wikimedia.org/T415504) (owner: 10Esanders) [20:18:40] (03PS1) 10Xcollazo: Mark XML content dump jobs as deprecated. [dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) [20:19:01] (03CR) 10CI reject: [V:04-1] Mark XML content dump jobs as deprecated. 
[dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) (owner: 10Xcollazo) [20:19:17] (03PS5) 10Daniel Kinzler: rest route: support multiple rate limit policies at once [deployment-charts] - 10https://gerrit.wikimedia.org/r/1228218 (https://phabricator.wikimedia.org/T413186) [20:24:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T415786)', diff saved to https://phabricator.wikimedia.org/P88420 and previous config saved to /var/cache/conftool/dbconfig/20260202-202404-marostegui.json [20:24:08] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:24:21] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [20:24:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 6 hosts with reason: Maintenance [20:24:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88421 and previous config saved to /var/cache/conftool/dbconfig/20260202-202451-marostegui.json [20:25:29] (03PS4) 10Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) [20:35:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 02 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235491 (https://phabricator.wikimedia.org/T328872) (owner: 10Func) [20:38:54] (03CR) 10Dzahn: [C:03+2] mailman: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235817 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:41:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling 
after maintenance db2194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88422 and previous config saved to /var/cache/conftool/dbconfig/20260202-204113-marostegui.json [20:41:19] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:42:00] (03PS1) 10Dzahn: admin: add trueg to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1235879 (https://phabricator.wikimedia.org/T415632) [20:45:37] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for trueg - https://phabricator.wikimedia.org/T415632#11575944 (10Dzahn) a:05thcipriani→03None [20:49:08] (03CR) 10Dzahn: [C:03+1] "the comment says should be removed after T402611 which is closed - so that seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) [20:51:26] (03CR) 10Dzahn: [C:03+1] gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [20:52:59] (03CR) 10Ottomata: "Nice!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [20:56:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P88423 and previous config saved to /var/cache/conftool/dbconfig/20260202-205621-marostegui.json [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T2100). [21:00:05] tgr, Kemayo, and Func: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:09] o/ with two config changes I can deploy for myself [21:01:09] o/ [21:01:15] o/ [21:01:55] (03CR) 10Ottomata: topic: Flink enrichment pipeline (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [21:02:21] Kemayo: would you mind adding in mine? it's a no-op (the variable is not used yet) [21:02:31] tgr_: Sure, I can do that. [21:02:47] thx [21:03:15] also a cleanup and nothing to test for me [21:03:27] I can throw that in as well, then. [21:03:49] thanks [21:03:56] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11576031 (10Dzahn) Hi @JerryWang-WMF you can start with getting access to the LDAP grou... [21:04:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235392 (https://phabricator.wikimedia.org/T411914) (owner: 10DLynch) [21:04:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235111 (https://phabricator.wikimedia.org/T415504) (owner: 10Esanders) [21:04:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) (owner: 10Gergő Tisza) [21:04:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235491 (https://phabricator.wikimedia.org/T328872) (owner: 10Func) [21:04:20] Okay, four config patches bundled together and starting their merging. [21:04:40] (03PS2) 10Xcollazo: Mark XML content dump jobs as deprecated. 
[dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) [21:04:50] (03Merged) 10jenkins-bot: Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235392 (https://phabricator.wikimedia.org/T411914) (owner: 10DLynch) [21:04:54] (03Merged) 10jenkins-bot: Enable suggestions BetaFeature on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235111 (https://phabricator.wikimedia.org/T415504) (owner: 10Esanders) [21:04:58] (03Merged) 10jenkins-bot: WikimediaCustomizations: Set WMCBadEmailDomainsFile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1230462 (https://phabricator.wikimedia.org/T397244) (owner: 10Gergő Tisza) [21:05:01] (03Merged) 10jenkins-bot: filebackend: Clean up removed config params for multi-write backends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235491 (https://phabricator.wikimedia.org/T328872) (owner: 10Func) [21:05:19] Ahh, it's so nice to not be doing backports with 20 minutes of tests. 
[21:05:24] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]] [21:05:40] T411914: [Config] Deploy config change to STOP the Tone Check A/B experiment - https://phabricator.wikimedia.org/T411914 [21:05:41] T415504: EditCheck: Create beta feature preference - https://phabricator.wikimedia.org/T415504 [21:05:41] T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits - https://phabricator.wikimedia.org/T397244 [21:05:42] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [21:07:21] !log kemayo@deploy2002 tgr, func, kemayo, esanders: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]] synced to [21:07:21] the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:07:29] Okay, got to test my own ones. [21:08:50] Hm. Wikimediadebug is broken. That's a fun time to discover that. [21:10:25] Ah, only for the beta cluster. Well, I can deal with that. 
[21:11:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P88424 and previous config saved to /var/cache/conftool/dbconfig/20260202-211129-marostegui.json [21:12:03] !log kemayo@deploy2002 tgr, func, kemayo, esanders: Continuing with sync [21:13:48] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install moss-fe200[5-8] - https://phabricator.wikimedia.org/T416243 (10Jhancock.wm) 03NEW [21:15:21] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install moss-fe200[5-8] - https://phabricator.wikimedia.org/T416243#11576134 (10Jhancock.wm) [21:15:54] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install moss-fe200[5-8] - https://phabricator.wikimedia.org/T416243#11576138 (10Jhancock.wm) [21:16:18] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1235392|Edit check: turn off the tone a/b test on frwiki, jawiki, ptwiki (T411914)]], [[gerrit:1235111|Enable suggestions BetaFeature on beta wikis (T415504)]], [[gerrit:1230462|WikimediaCustomizations: Set WMCBadEmailDomainsFile (T397244)]], [[gerrit:1235491|filebackend: Clean up removed config params for multi-write backends (T328872)]] (duration: 10 [21:16:18] m 54s) [21:16:25] T411914: [Config] Deploy config change to STOP the Tone Check A/B experiment - https://phabricator.wikimedia.org/T411914 [21:16:26] T415504: EditCheck: Create beta feature preference - https://phabricator.wikimedia.org/T415504 [21:16:26] T397244: Private mitigation blocks registration from certain email domains but gives misleading error about rate limits - https://phabricator.wikimedia.org/T397244 [21:16:27] T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 [21:17:55] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:18:02] Okay, all done if anyone else needs to make use of the window. [21:19:44] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245 (10Jhancock.wm) 03NEW [21:22:34] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416234#11576212 (10Jclark-ctr) a:03Jclark-ctr [21:24:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11576218 (10Jclark-ctr) a:03VRiley-WMF [21:26:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88425 and previous config saved to /var/cache/conftool/dbconfig/20260202-212638-marostegui.json [21:26:42] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:26:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2209.codfw.wmnet with reason: Maintenance [21:27:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T415786)', diff saved to https://phabricator.wikimedia.org/P88426 and previous config saved to /var/cache/conftool/dbconfig/20260202-212703-marostegui.json [21:27:58] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249 (10Jhancock.wm) 03NEW [21:30:13] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe202[1-4] - https://phabricator.wikimedia.org/T416243#11576286 (10Jhancock.wm) [21:33:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88427 and previous config saved to 
/var/cache/conftool/dbconfig/20260202-213347-marostegui.json [21:33:52] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:36:40] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install frqueue2004 - https://phabricator.wikimedia.org/T416251 (10Jhancock.wm) 03NEW [21:46:48] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install payments10[09-11] - https://phabricator.wikimedia.org/T416252 (10Jhancock.wm) 03NEW [21:48:52] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:48:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P88428 and previous config saved to /var/cache/conftool/dbconfig/20260202-214855-marostegui.json [21:48:58] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:49:15] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11576378 (10Jhancock.wm) [21:49:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (195.200.68.151) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:50:10] FIRING: [4x] BFDdown: BFD session down between cr1-magru and 94.142.103.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:50:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Transit6&var-bgp_neighbor=Telxius - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:50:50] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install payments1009 - https://phabricator.wikimedia.org/T416253 (10Jhancock.wm) 03NEW [21:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:55:10] FIRING: [4x] BFDdown: BFD session down between cr1-magru and 94.142.103.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:55:31] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254 (10Jhancock.wm) 03NEW [21:55:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:58:52] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416234#11576446 (10Jclark-ctr) Rebalanced pdu pulling migrated devices off L1,L2 leg [22:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260202T2200). Please do the needful. 
[22:03:52] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:04:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P88429 and previous config saved to /var/cache/conftool/dbconfig/20260202-220404-marostegui.json [22:05:08] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:05:10] FIRING: [4x] BFDdown: BFD session down between cr1-magru and 94.142.103.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:05:17] 10ops-codfw, 06SRE, 06DC-Ops, 06ServiceOps new: wikikube-worker2346 DOA - https://phabricator.wikimedia.org/T414708#11576467 (10Jhancock.wm) refreshed the email with supermicro support AGAIN [22:09:16] FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:09:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:10:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-magru and 94.142.103.58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:10:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:19:13] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T415786)', diff saved to https://phabricator.wikimedia.org/P88430 and previous config saved to /var/cache/conftool/dbconfig/20260202-221912-marostegui.json [22:19:16] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [22:19:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [22:19:46] (03PS1) 10Bking: WIP: openjdk-21-jre: Yet another changelog formatting patch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235888 (https://phabricator.wikimedia.org/T415699) [22:21:39] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11576546 (10herron) [22:22:19] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11576562 (10herron) 05Open→03Resolved a:03herron SLO WG has decided together to proceed with a production roll out of sloth, which will be tracked in the parent task! [22:23:27] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11576566 (10herron) 05Open→03Resolved a:03herron Closing this as pilot onboarding has finished, wider onboarding will be tracked in parent task! 
[22:24:54] !log bking@apt1002 `sudo -E reprepro -C thirdparty/opensearch3 copy trixie-wikimedia bookworm-wikimedia opensearch` [22:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:11] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262 (10herron) 03NEW p:05Triage→03Medium [22:31:22] 10SRE-SLO: Sloth: create Debian package - https://phabricator.wikimedia.org/T416263 (10herron) 03NEW [22:31:55] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11576648 (10herron) [22:36:22] 10SRE-SLO: Sloth: create Debian package - https://phabricator.wikimedia.org/T416263#11576677 (10herron) 05Open→03Resolved a:03herron Sloth package for 0.15.0 has been built via gitlab CI and uploaded to apt: `titan1001:~$ apt-cache policy sloth sloth: Installed: (none) Candidate: 0.15.0-1 Versio... [22:37:27] 10SRE-SLO: Sloth: productionize and onboard all SLOs - https://phabricator.wikimedia.org/T416262#11576682 (10herron) [22:37:31] 10SRE-SLO, 10Observability-Alerting, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11576683 (10herron) [22:40:47] !log added 500G to the lv on mwlog1002 [22:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:12] 10SRE-swift-storage, 10Ceph, 06ServiceOps new, 07Epic, and 3 others: Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph - https://phabricator.wikimedia.org/T412951#11576712 (10Scott_French) Thank you, @elukey! No objections to targeting the "UTC mid-day" infra window (a... [23:04:02] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [23:11:17] (03PS1) 10Ryan Kemper: pyrra: fix wdqs availability SLO config [puppet] - 10https://gerrit.wikimedia.org/r/1235891 (https://phabricator.wikimedia.org/T393966) [23:11:19] (03PS1) 10Ryan Kemper: pyrra: absent old per-dc wdqs availability configs [puppet] - 10https://gerrit.wikimedia.org/r/1235892 (https://phabricator.wikimedia.org/T393966) [23:11:21] (03PS1) 10Ryan Kemper: pyrra: remove previously absented wdqs avail SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1235893 (https://phabricator.wikimedia.org/T393966) [23:14:25] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:15:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Arelion (2001:2035:0:cf1::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:20:36] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1187 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T416268 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [23:20:45] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T416268 (10ops-monitoring-bot) 03NEW [23:28:18] (03CR) 10Bking: [C:03+1] 
"Conditional +1, LGTM once tests are passing." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [23:29:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T415786)', diff saved to https://phabricator.wikimedia.org/P88431 and previous config saved to /var/cache/conftool/dbconfig/20260202-232921-marostegui.json [23:29:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:32:23] (03PS3) 10Ryan Kemper: elasticsearch_cluster: allow checking last reboot [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) [23:44:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P88432 and previous config saved to /var/cache/conftool/dbconfig/20260202-234429-marostegui.json [23:52:55] (03CR) 10Ryan Kemper: "@volans/elukey: I forget which one of you is handling spicerack stuff these days, so I just tagged both, feel free to remove yourselves" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [23:59:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P88433 and previous config saved to /var/cache/conftool/dbconfig/20260202-235937-marostegui.json