[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0000) [00:00:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Arelion (2001:2035:0:cf1::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [00:04:43] !log eqsin cp5022 troubleshooting onsite in progress [00:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T415786)', diff saved to https://phabricator.wikimedia.org/P88434 and previous config saved to /var/cache/conftool/dbconfig/20260203-001445-marostegui.json [00:14:49] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [00:15:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2227.codfw.wmnet with reason: Maintenance [00:15:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88435 and previous config saved to /var/cache/conftool/dbconfig/20260203-001511-marostegui.json [00:40:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235898 [00:40:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235898 (owner: 10TrainBranchBot) [00:54:33] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1235898 (owner: 10TrainBranchBot) [01:10:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235900 [01:10:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next 
[core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235900 (owner: 10TrainBranchBot) [01:11:14] RECOVERY - dump of s8 in codfw on backupmon1001 is OK: Last dump for s8 at codfw (db2198) taken on 2026-02-03 00:00:05 (184 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:18:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:18:26] (03PS1) 10Scott French: Rebuild to pick up new PHP packages (8.3.30) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 [01:37:20] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1235900 (owner: 10TrainBranchBot) [02:00:55] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:05:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:10:03] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.14 [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1235903 (https://phabricator.wikimedia.org/T413805) [02:10:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.14 [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1235903 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [02:13:34] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 39s) [02:21:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88436 and previous config saved to /var/cache/conftool/dbconfig/20260203-022119-marostegui.json [02:21:25] T415786: Update imagelinks primary key on 
wmf production - https://phabricator.wikimedia.org/T415786 [02:22:28] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.14 [core] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1235903 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [02:36:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P88437 and previous config saved to /var/cache/conftool/dbconfig/20260203-023627-marostegui.json [02:42:12] (03CR) 10RLazarus: [C:03+1] "Testing locally, I don't see this taking effect:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 (owner: 10Scott French) [02:46:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:51:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:51:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P88438 and previous config saved to /var/cache/conftool/dbconfig/20260203-025135-marostegui.json [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0300) [03:06:16] RECOVERY - dump of s1 in eqiad on backupmon1001 is OK: Last dump for s1 at eqiad (db1240) taken on 2026-02-03 00:00:10 (154 GiB, +0.1 %) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:06:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T415786)', diff saved to https://phabricator.wikimedia.org/P88439 and previous config saved to /var/cache/conftool/dbconfig/20260203-030644-marostegui.json [03:06:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:06:51] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [03:07:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2239.codfw.wmnet with reason: Maintenance [03:11:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [03:16:28] (03CR) 10Scott French: "Thanks for the review! FYI, I'm updating the commit message, because I forgot to prefix it with what I'm changing." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 (owner: 10Scott French) [03:16:43] (03PS2) 10Scott French: php8.3: Rebuild to pick up new PHP packages (8.3.30) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 [03:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:20:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:25:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:26:18] RECOVERY - dump of m1 in codfw on backupmon1001 is OK: Last dump for m1 at codfw (db2160) taken on 2026-02-03 00:15:00 (77 GiB, +1.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:41:18] RECOVERY - dump of s8 in eqiad on backupmon1001 is OK: Last dump for s8 at eqiad (db1171) taken on 2026-02-03 00:00:03 (184 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0400) [04:02:07] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.14 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1235912 (https://phabricator.wikimedia.org/T413805) [04:02:14] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235912 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [04:03:02] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235912 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [04:03:34] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.14 refs T413805 [04:03:40] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [04:48:03] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.14 refs T413805 (duration: 44m 29s) [04:48:07] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0500) [05:02:55] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.11 (duration: 02m 53s) [05:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:19] RECOVERY - dump of m1 in eqiad on backupmon1001 is OK: Last dump for m1 at eqiad (db1217) taken on 2026-02-03 02:58:17 (77 GiB, +1.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:16:39] FIRING: TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:18:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:21:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:34:16] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:46:55] (03PS1) 10Kevin Bazira: ml-services: Cap maxReplicas at 7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236103 (https://phabricator.wikimedia.org/T414060) [05:48:55] (03PS1) 10Marostegui: filtered_tables.txt: Remove ar_sha1 and rev_sha1 [puppet] - 10https://gerrit.wikimedia.org/r/1236104 (https://phabricator.wikimedia.org/T411164) [05:50:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db2229 with weight 0 T415862', diff saved to https://phabricator.wikimedia.org/P88440 and previous config saved to /var/cache/conftool/dbconfig/20260203-055010-marostegui.json [05:50:18] T415862: Switchover s6 
master (db2214 -> db2229) - https://phabricator.wikimedia.org/T415862 [05:50:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s6 T415862 [05:50:59] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2229 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1234788 (https://phabricator.wikimedia.org/T415862) (owner: 10Gerrit maintenance bot) [05:51:52] !log Starting s6 codfw failover from db2214 to db2229 - T415862 [05:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s6 codfw as read-only for maintenance - T415862', diff saved to https://phabricator.wikimedia.org/P88441 and previous config saved to /var/cache/conftool/dbconfig/20260203-055823-marostegui.json [05:58:29] T415862: Switchover s6 master (db2214 -> db2229) - https://phabricator.wikimedia.org/T415862 [05:58:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2229 to s6 primary and set section read-write T415862', diff saved to https://phabricator.wikimedia.org/P88442 and previous config saved to /var/cache/conftool/dbconfig/20260203-055844-marostegui.json [05:59:09] (03CR) 10Marostegui: [C:03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1234789 (https://phabricator.wikimedia.org/T415862) (owner: 10Gerrit maintenance bot) [05:59:15] !log marostegui@dns1006 START - running authdns-update [06:00:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2214 T415862', diff saved to https://phabricator.wikimedia.org/P88443 and previous config saved to /var/cache/conftool/dbconfig/20260203-060000-marostegui.json [06:00:15] !log marostegui@dns1006 END - running authdns-update [06:03:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2214.codfw.wmnet with reason: Maintenance [06:04:12] !log marostegui@cumin1003 dbctl commit 
(dc=all): 'Set db2213 with weight 0 T415900', diff saved to https://phabricator.wikimedia.org/P88444 and previous config saved to /var/cache/conftool/dbconfig/20260203-060411-marostegui.json [06:04:16] T415900: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T415900 [06:04:22] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1235054 (https://phabricator.wikimedia.org/T415900) (owner: 10Gerrit maintenance bot) [06:04:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T415900 [06:04:39] !log Starting s5 codfw failover from db2192 to db2213 - T415900 [06:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set s5 codfw as read-only for maintenance - T415900', diff saved to https://phabricator.wikimedia.org/P88445 and previous config saved to /var/cache/conftool/dbconfig/20260203-061002-marostegui.json [06:10:07] T415900: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T415900 [06:10:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db2213 to s5 primary and set section read-write T415900', diff saved to https://phabricator.wikimedia.org/P88446 and previous config saved to /var/cache/conftool/dbconfig/20260203-061025-marostegui.json [06:10:56] (03CR) 10Marostegui: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1235055 (https://phabricator.wikimedia.org/T415900) (owner: 10Gerrit maintenance bot) [06:11:15] (03Abandoned) 10Marostegui: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1235055 (https://phabricator.wikimedia.org/T415900) (owner: 10Gerrit maintenance bot) [06:11:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2192 T415900', diff saved to https://phabricator.wikimedia.org/P88447 and 
previous config saved to /var/cache/conftool/dbconfig/20260203-061142-marostegui.json [06:12:46] (03PS1) 10Marostegui: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1236105 (https://phabricator.wikimedia.org/T415900) [06:13:42] (03CR) 10Marostegui: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1236105 (https://phabricator.wikimedia.org/T415900) (owner: 10Marostegui) [06:13:45] !log marostegui@dns1006 START - running authdns-update [06:14:45] !log marostegui@dns1006 END - running authdns-update [06:16:13] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2192.codfw.wmnet with reason: Maintenance [06:17:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2161.codfw.wmnet with reason: schema change [06:19:25] (03CR) 10Marostegui: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1236104 (https://phabricator.wikimedia.org/T411164) (owner: 10Marostegui) [06:19:28] (03CR) 10Marostegui: [C:03+2] filtered_tables.txt: Remove ar_sha1 and rev_sha1 [puppet] - 10https://gerrit.wikimedia.org/r/1236104 (https://phabricator.wikimedia.org/T411164) (owner: 10Marostegui) [06:22:45] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1236106 (https://phabricator.wikimedia.org/T416298) [06:23:14] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1236107 (https://phabricator.wikimedia.org/T416299) [06:23:23] (03PS1) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1236108 (https://phabricator.wikimedia.org/T416299) [06:27:52] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1236109 (https://phabricator.wikimedia.org/T416300) [06:28:06] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master 
alias [dns] - 10https://gerrit.wikimedia.org/r/1236110 (https://phabricator.wikimedia.org/T416300) [06:31:34] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11577438 (10Joe) 05Open→03Resolved [06:39:16] FIRING: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:16] RESOLVED: [2x] ProbeDown: Service titan1002:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:55:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:55:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1163 (T415786)', diff saved to https://phabricator.wikimedia.org/P88448 and previous config saved to /var/cache/conftool/dbconfig/20260203-065541-marostegui.json [06:56:02] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0700) [07:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0700). 
[07:09:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [07:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [07:17:55] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:24:02] (03PS2) 10Muehlenhoff: Kerberos: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235834 (https://phabricator.wikimedia.org/T135991) [07:24:29] !log installing openssl security updates [07:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:00] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women' 'Event:Celebrate Women' Ammarpad # T416031 [07:34:04] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [07:41:36] (03CR) 10Tiziano Fogli: [C:03+2] thanos/compact: increase meta-fetch goroutines to fix compactor inactivity [puppet] - 10https://gerrit.wikimedia.org/r/1235829 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [08:00:05] Amir1, Urbanecm, and awight: #bothumor When 
your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:12:32] !log Ran refreshImageMetadata.php for multiple files for T414643 [08:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:39] T414643: Opus file has unrecognized codecs - https://phabricator.wikimedia.org/T414643 [08:13:09] (03PS1) 10Muehlenhoff: Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1236232 [08:14:30] (03PS1) 10Tiziano Fogli: centralauth: add recording rules for grafana widgets (write) [puppet] - 10https://gerrit.wikimedia.org/r/1236233 (https://phabricator.wikimedia.org/T415035) [08:15:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1003.wikimedia.org [08:19:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1003.wikimedia.org [08:19:10] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Add citations' 'Event:Celebrate Women/Add citations' Ammarpad # T416031 [08:19:13] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [08:20:39] (03CR) 10Muehlenhoff: [C:03+2] Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1236232 (owner: 10Muehlenhoff) [08:20:52] !log jmm@dns1004 START - running authdns-update [08:21:52] !log jmm@dns1004 END - running authdns-update [08:25:40] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Create an article' 'Event:Celebrate Women/Create an article' Ammarpad # T416031 [08:25:46] T416031: Request to move translatable page: 
m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [08:27:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org [08:27:57] !log failover irc.wikimedia.org to irc1003.wikimedia.org [08:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org [08:37:14] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11577605 (10FCeratto-WMF) a:05FCeratto-WMF→03None [08:38:33] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Events' 'Event:Celebrate Women/Events' Ammarpad # T416031 [08:38:39] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [08:40:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T415786)', diff saved to https://phabricator.wikimedia.org/P88449 and previous config saved to /var/cache/conftool/dbconfig/20260203-084014-marostegui.json [08:40:20] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:45:18] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Events/2024' 'Event:Celebrate Women/Events/2024' Ammarpad # T416031 [08:45:21] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [08:45:42] (03PS1) 10Muehlenhoff: Add Cumin alias for crm [puppet] - 10https://gerrit.wikimedia.org/r/1236238 [08:47:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2145.codfw.wmnet with 
reason: Maintenance [08:47:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T415786)', diff saved to https://phabricator.wikimedia.org/P88450 and previous config saved to /var/cache/conftool/dbconfig/20260203-084737-marostegui.json [08:47:42] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [08:49:34] !log installing libcommons-lang3-java security updates [08:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P88451 and previous config saved to /var/cache/conftool/dbconfig/20260203-085022-marostegui.json [08:52:37] (03PS1) 10Btullis: Double the default postgresql WAL volume size to 30 GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236239 (https://phabricator.wikimedia.org/T375846) [08:56:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:58:05] o/ [08:58:25] (03PS2) 10Btullis: Increase the default PostgreSQL cluster volume sizes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236239 (https://phabricator.wikimedia.org/T375846) [08:58:45] I am the conductor this week for the MediaWiki train [08:58:59] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Events/2025' 'Event:Celebrate Women/Events/2025' Ammarpad # T416031 [08:59:06] T416031: Request to move translatable page: m:Celebrate 
Women - https://phabricator.wikimedia.org/T416031 [08:59:30] and Logstash dies with some oauth2-proxy 500 server error ( 95fe764d-7863-48e5-88cf-09e3efcfd706 ) [08:59:44] (03CR) 10Brouberol: [C:03+1] Increase the default PostgreSQL cluster volume sizes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236239 (https://phabricator.wikimedia.org/T375846) (owner: 10Btullis) [08:59:52] thankfully that was transient [09:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0900) [09:00:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P88452 and previous config saved to /var/cache/conftool/dbconfig/20260203-090031-marostegui.json [09:01:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:01:51] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236242 (https://phabricator.wikimedia.org/T413805) [09:02:01] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236242 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:03:09] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236242 (https://phabricator.wikimedia.org/T413805) (owner: 10TrainBranchBot) [09:07:55] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) 
{#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:10:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T415786)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20260203-091039-marostegui.json [09:11:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:11:09] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [09:11:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1169 (T415786)', diff saved to https://phabricator.wikimedia.org/P88454 and previous config saved to /var/cache/conftool/dbconfig/20260203-091110-marostegui.json [09:12:05] (03PS1) 10Muehlenhoff: Add an option to the flag generated firewall rules with low QoS [puppet] - 10https://gerrit.wikimedia.org/r/1236243 [09:12:55] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:13:37] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.14 refs T413805 [09:13:43] T413805: 1.46.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T413805 [09:17:29] (03CR) 10Btullis: [C:03+2] Increase the default PostgreSQL cluster volume sizes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236239 (https://phabricator.wikimedia.org/T375846) (owner: 10Btullis) [09:17:37] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Improve an 
article' 'Event:Celebrate Women/Improve an article' Ammarpad # T416031 [09:17:43] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [09:18:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [09:20:06] (03Merged) 10jenkins-bot: Increase the default PostgreSQL cluster volume sizes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236239 (https://phabricator.wikimedia.org/T375846) (owner: 10Btullis) [09:25:13] (03CR) 10Jelto: [C:03+1] "lgtm, thanks for the addition" [puppet] - 10https://gerrit.wikimedia.org/r/1235824 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:26:35] (03CR) 10Elukey: [C:03+1] Kerberos: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235834 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:37:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1189 with weight 0 T416298', diff saved to https://phabricator.wikimedia.org/P88455 and previous config saved to /var/cache/conftool/dbconfig/20260203-093736-marostegui.json [09:37:41] T416298: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T416298 [09:37:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T416298 [09:38:14] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1189 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1236106 (https://phabricator.wikimedia.org/T416298) (owner: 10Gerrit maintenance bot) [09:38:42] !log Starting s3 eqiad failover from db1223 to db1189 - T416298 [09:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:49] (03PS1) 10Marostegui: Revert "db2161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1236246 [09:40:15] (03PS4) 10Elukey: docker_registry: move /v2/restricted 
to the s3 restricted backend [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) [09:40:22] (03CR) 10Elukey: docker_registry: move /v2/restricted to the s3 restricted backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [09:40:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1189 to s3 primary T416298', diff saved to https://phabricator.wikimedia.org/P88456 and previous config saved to /var/cache/conftool/dbconfig/20260203-094038-marostegui.json [09:41:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1223 T416298', diff saved to https://phabricator.wikimedia.org/P88457 and previous config saved to /var/cache/conftool/dbconfig/20260203-094116-marostegui.json [09:41:32] (03CR) 10Marostegui: [C:03+2] Revert "db2161: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1236246 (owner: 10Marostegui) [09:42:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2161: After schema change [09:43:19] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2214: After schema change [09:44:17] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db2192: After schema change [09:48:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1223.eqiad.wmnet with reason: Maintenance [09:50:57] (03PS1) 10Samwilson: Remove unused SpecialMobileEditWatchlist::outputSubtitle() [extensions/MobileFrontend] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236247 (https://phabricator.wikimedia.org/T416294) [09:59:55] (03CR) 10Elukey: "Hey Ryan! Thanks a lot for the context and the comments, they helped a lot in the review." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1235112 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [10:02:02] jouncebot: nowandnext [10:02:02] For the next 0 hour(s) and 57 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T0900) [10:02:02] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1100) [10:02:13] (03CR) 10Elukey: [C:03+2] "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/1235879 (https://phabricator.wikimedia.org/T415632) (owner: 10Dzahn) [10:04:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for trueg - https://phabricator.wikimedia.org/T415632#11577871 (10elukey) The change is merged and it will be propagated by puppet during the next 30 mins :) [10:06:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for trueg - https://phabricator.wikimedia.org/T415632#11577876 (10elukey) 05Open→03Resolved [10:06:27] (03PS1) 10Elukey: admin: add the krb flag to the pham user [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) [10:07:11] (03CR) 10Muehlenhoff: "I've prepared a patch for an alternative solution, please have a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1236243" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [10:07:12] (03CR) 10CI reject: [V:04-1] admin: add the krb flag to the pham user [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) (owner: 10Elukey) [10:09:31] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Learn how Wikipedia works' 'Event:Celebrate Women/Learn how Wikipedia works' Ammarpad # T416031 [10:09:40] 
(03PS4) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [10:09:43] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [10:10:00] (03PS2) 10Elukey: admin: add the krb flag to the pham user [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) [10:10:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply [10:10:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply [10:10:54] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product: apply [10:11:01] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product: apply [10:11:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [10:11:17] (03PS1) 10Slyngshede: LDAP: Use the escaping mechanism provided by LDAP3 [software/bitu] - 10https://gerrit.wikimedia.org/r/1236250 (https://phabricator.wikimedia.org/T412420) [10:11:35] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [10:11:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) (owner: 10Elukey) [10:12:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-ml: apply [10:12:28] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-ml: apply [10:13:07] (03CR) 10Muehlenhoff: [C:03+2] Kerberos: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1235834 (https://phabricator.wikimedia.org/T135991) 
(owner: 10Muehlenhoff) [10:13:32] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-platform-eng: apply [10:13:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-platform-eng: apply [10:14:59] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:15:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:15:22] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-research: apply [10:15:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11577916 (10elukey) @Sucheta-Salgaonkar-WMF perfect! After reviewing your request I didn't get what level of access you'd need (see https://wikitech.wikimedia.org/wiki/... [10:16:16] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-research: apply [10:16:29] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Resources' 'Event:Celebrate Women/Resources' Ammarpad # T416031 [10:16:37] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [10:16:38] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [10:16:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [10:17:04] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply [10:17:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-sre: apply 
[10:18:21] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Decom cookbook: run Homer when needed - https://phabricator.wikimedia.org/T416313 (10ayounsi) 03NEW [10:20:54] (03PS3) 10Elukey: admin: add the krb flag to the pham user [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) [10:20:54] (03PS1) 10Elukey: admin: add user ssalgaonkar-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236251 (https://phabricator.wikimedia.org/T415594) [10:22:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, February 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/MobileFrontend] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236247 (https://phabricator.wikimedia.org/T416294) (owner: 10Samwilson) [10:22:30] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [10:23:18] (03PS5) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [10:23:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11577933 (10elukey) [10:24:15] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Suggested activities' 'Event:Celebrate Women/Suggested activities' Ammarpad # T416031 [10:24:18] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [10:26:10] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11577937 (10elukey) >>! 
In T414660#11537922, @Novem_Linguae wrote: > I don't see krb: present in the patch, so looks like this was done a... [10:26:52] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [10:27:57] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-wikidata: apply [10:28:12] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-wikidata: apply [10:28:18] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2161: After schema change [10:28:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2214: After schema change [10:29:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1236251 (https://phabricator.wikimedia.org/T415594) (owner: 10Elukey) [10:29:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11577940 (10elukey) Nevermind, let's just add you to analytics-privatedata so you'll be future proof :) [10:29:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-wmde: apply [10:29:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-wmde: apply [10:29:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db2192: After schema change [10:30:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T415786)', diff saved to https://phabricator.wikimedia.org/P88471 and previous config saved to /var/cache/conftool/dbconfig/20260203-103037-marostegui.json [10:30:47] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:33:15] (03CR) 10Elukey: [C:03+2] admin: add the krb flag to 
the pham user [puppet] - 10https://gerrit.wikimedia.org/r/1236249 (https://phabricator.wikimedia.org/T414660) (owner: 10Elukey) [10:33:24] (03CR) 10Elukey: [C:03+2] admin: add user ssalgaonkar-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1236251 (https://phabricator.wikimedia.org/T415594) (owner: 10Elukey) [10:33:51] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki --reason 'Requested at [[phab:T416031]]' 'Celebrate Women/Translate an article' 'Event:Celebrate Women/Translate an article' Ammarpad # T416031 [10:33:54] T416031: Request to move translatable page: m:Celebrate Women - https://phabricator.wikimedia.org/T416031 [10:35:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11577954 (10ayounsi) Please use the per rack cloud hosts vlans for all those hosts: cloud-hosts1-f4-eqiad (1124) cloud-hosts1-c8-eqiad (1128) cloud-hosts1-d5-eqiad (1127) cloud-hosts1-e4-eqiad (1123) T... [10:36:14] (03PS4) 10Dpogorzelski: kserve: update image to 0.16 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 [10:36:23] (03CR) 10Dpogorzelski: kserve: update image to 0.16 (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235826 (owner: 10Dpogorzelski) [10:39:19] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Access to systems owned by data platform engineering team for Jerry Wang - https://phabricator.wikimedia.org/T416191#11577989 (10elukey) @JerryWang-WMF @BTullis Hi! I think that the production shell access... 
[10:43:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [10:44:09] (03PS2) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1235827 (https://phabricator.wikimedia.org/T360794) [10:45:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P88472 and previous config saved to /var/cache/conftool/dbconfig/20260203-104547-marostegui.json [10:47:11] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-c7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T416234#11578049 (10Jclark-ctr) 05Open→03Resolved [10:48:11] (03PS1) 10Muehlenhoff: Record LDAP access for alexsanford [puppet] - 10https://gerrit.wikimedia.org/r/1236254 [10:49:47] (03CR) 10Elukey: [C:03+2] "To keep archives happy - reviewed within the I/F team, approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/1230280 (owner: 10Dpogorzelski) [10:50:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T416268#11578061 (10Jclark-ctr) →14Duplicate dup:03T415002 [10:50:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11578059 (10Jclark-ctr) [10:51:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T415786)', diff saved to https://phabricator.wikimedia.org/P88473 and previous config saved to /var/cache/conftool/dbconfig/20260203-105059-marostegui.json [10:51:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [10:54:06] (03PS6) 10Daniel Kinzler: rest gateway: define new limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1234512 [10:54:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack 
switch for rack C1 - https://phabricator.wikimedia.org/T403031#11578074 (10VRiley-WMF) @RobH I was able to locate 2 that we have that are new in the box. We do have a few others that are loose on site. [10:59:04] 06SRE, 10SRE-Access-Requests: Requesting access to L3 data access for kimpham (developer name Kim.pham) - https://phabricator.wikimedia.org/T414660#11578095 (10elukey) ` elukey@krb1002:~$ sudo manage_principals.py create pham --email_address=kim.pham@wikimedia.de Principal successfully created. Make sure t... [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1100) [11:01:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P88474 and previous config saved to /var/cache/conftool/dbconfig/20260203-110057-marostegui.json [11:01:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P88475 and previous config saved to /var/cache/conftool/dbconfig/20260203-110108-marostegui.json [11:01:17] (03PS1) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236258 (https://phabricator.wikimedia.org/T360794) [11:05:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Eqiad: Replacement top-of-rack switch for rack C1 - https://phabricator.wikimedia.org/T403031#11578130 (10Peachey88) [11:07:01] (03PS1) 10Joal: Update druid_analytics middlemaanger java opts [puppet] - 10https://gerrit.wikimedia.org/r/1236260 (https://phabricator.wikimedia.org/T415799) [11:11:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P88476 and previous config saved to /var/cache/conftool/dbconfig/20260203-111120-marostegui.json [11:12:43] !log jclark@cumin1003 START - Cookbook sre.dns.netbox 
[11:14:02] (03CR) 10Btullis: [C:03+2] Update druid_analytics middlemaanger java opts [puppet] - 10https://gerrit.wikimedia.org/r/1236260 (https://phabricator.wikimedia.org/T415799) (owner: 10Joal) [11:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [11:15:48] 06SRE, 06Infrastructure-Foundations: Build OpenGear serial port config from Netbox - https://phabricator.wikimedia.org/T415345#11578174 (10ayounsi) Nice, I copy pasted what you did in Netbox's "render config" feature, that's the result : https://netbox-next.wikimedia.org/dcim/devices/2258/render-config/ Ques... [11:16:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T415786)', diff saved to https://phabricator.wikimedia.org/P88477 and previous config saved to /var/cache/conftool/dbconfig/20260203-111607-marostegui.json [11:16:26] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:16:28] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2146.codfw.wmnet with reason: Maintenance [11:16:29] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11578175 (10Jclark-ctr) [11:16:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T415786)', diff saved to https://phabricator.wikimedia.org/P88478 and previous config saved to /var/cache/conftool/dbconfig/20260203-111636-marostegui.json [11:16:53] (03PS1) 10Jcrespo: backups: Remove analytics_meta regular backups: hue & airflow* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) [11:18:38] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: added network and mgmt bast1004 - jclark@cumin1003" [11:18:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt bast1004 - jclark@cumin1003" [11:18:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:19:22] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11578185 (10Jclark-ctr) a:03Jclark-ctr [11:20:23] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host bast1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:20:31] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pool db1223: After schema change [11:21:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T415786)', diff saved to https://phabricator.wikimedia.org/P88480 and previous config saved to /var/cache/conftool/dbconfig/20260203-112130-marostegui.json [11:21:41] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [11:21:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [11:21:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T415786)', diff saved to https://phabricator.wikimedia.org/P88481 and previous config saved to /var/cache/conftool/dbconfig/20260203-112156-marostegui.json [11:23:30] jclark@cumin1003 provision (PID 1955543) is awaiting input [11:31:49] jclark@cumin1003 provision (PID 1955543) 
is awaiting input [11:32:31] 10SRE-SLO, 06Product Safety and Integrity, 10iPoid-Service (iPoid 1.0): IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11578210 (10MLechvien-WMF) [11:37:54] (03CR) 10Jcrespo: "Please review the proposed removal, both in case there are more things to remove left, or additional dbs to add to backups. It can be appl" [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [11:41:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host bast1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:43:54] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [11:46:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11578247 (10Jclark-ctr) [11:47:38] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1015 - jclark@cumin1003" [11:47:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt backup1015 - jclark@cumin1003" [11:47:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:52:10] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11578273 (10Jclark-ctr) a:05Jclark-ctr→03Andrew @andrew would you be able to help with adding server to Site.pp and updating preseed.yaml for efi booting? 
[11:52:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:01:11] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11578302 (10Jclark-ctr) a:03Jclark-ctr [12:02:25] jclark@cumin1003 provision (PID 1989531) is awaiting input [12:02:41] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1236254 (owner: 10Muehlenhoff) [12:06:02] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:06:04] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pool db1223: After schema change [12:06:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:08:38] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [12:08:57] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:09:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1158 (T415786)', diff saved to https://phabricator.wikimedia.org/P88485 and previous config saved to /var/cache/conftool/dbconfig/20260203-120905-marostegui.json [12:09:10] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [12:12:18] jclark@cumin1003 provision (PID 2002953) is awaiting input [12:13:00] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v.1.1.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236276 
(https://phabricator.wikimedia.org/T415325) [12:14:45] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploy v.1.1.8 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236277 (https://phabricator.wikimedia.org/T415325) [12:17:24] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:20:33] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:22:27] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:22:51] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:24:44] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:25:55] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:28:31] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:32:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11578372 (10Jclark-ctr) @elukey I’m having issues with this server failing to provision. I've manually set the username, password, and idrac network, but it continues to fail to pi... 
[12:32:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11578373 (10Jclark-ctr) a:05jcrespo→03Jclark-ctr [12:39:40] FIRING: [2x] ProbeDown: Service etherpad1004:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:40] RESOLVED: [2x] ProbeDown: Service etherpad1004:7443 has failed probes (http_etherpad_envoy_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T415786)', diff saved to https://phabricator.wikimedia.org/P88486 and previous config saved to /var/cache/conftool/dbconfig/20260203-125912-marostegui.json [12:59:35] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1300) [13:07:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T415786)', diff saved to https://phabricator.wikimedia.org/P88487 and previous config saved to /var/cache/conftool/dbconfig/20260203-130724-marostegui.json [13:07:43] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:08:14] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:09:43] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v.1.1.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236276 (https://phabricator.wikimedia.org/T415325) (owner: 10Santiago Faci) [13:09:49] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v.1.1.8 release to production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1236277 (https://phabricator.wikimedia.org/T415325) (owner: 10Santiago Faci) [13:10:59] !log joal@deploy2002 Started deploy [analytics/refinery@fc72bd3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fc72bd31] [13:11:35] (03CR) 10Brouberol: "All occurrences of `search_airflow` can go as well." [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [13:12:00] !log joal@deploy2002 Finished deploy [analytics/refinery@fc72bd3] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@fc72bd31] (duration: 01m 01s) [13:12:05] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ms-fe - jclark@cumin1003" [13:12:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt ms-fe - jclark@cumin1003" [13:12:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:11] (03CR) 10Brouberol: [C:03+1] Mark XML content dump jobs as deprecated. 
[dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) (owner: 10Xcollazo) [13:12:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T415786)', diff saved to https://phabricator.wikimedia.org/P88488 and previous config saved to /var/cache/conftool/dbconfig/20260203-131245-marostegui.json [13:12:52] !log joal@deploy2002 Started deploy [analytics/refinery@fc72bd3] (thin): Regular analytics weekly train THIN [analytics/refinery@fc72bd31] [13:12:54] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:13:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:14:06] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:14:12] !log joal@deploy2002 Finished deploy [analytics/refinery@fc72bd3] (thin): Regular analytics weekly train THIN [analytics/refinery@fc72bd31] (duration: 01m 20s) [13:14:21] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:14:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P88489 and previous config saved to /var/cache/conftool/dbconfig/20260203-131424-marostegui.json [13:14:46] !log joal@deploy2002 Started deploy [analytics/refinery@fc72bd3]: Regular analytics weekly train [analytics/refinery@fc72bd31] [13:14:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART 
and with Dell SCP reboot policy FORCED [13:14:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:16:52] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:16:57] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11578520 (10Jclark-ctr) [13:17:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P88490 and previous config saved to /var/cache/conftool/dbconfig/20260203-131735-marostegui.json [13:19:59] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:20:10] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:20:18] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:21:36] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11578528 (10Jclark-ctr) @elukey These are failing as well, just like backup1015 in T414725. I haven’t made any changes to the server — it’s still using the default user and password. 
[13:21:57] !log joal@deploy2002 Finished deploy [analytics/refinery@fc72bd3]: Regular analytics weekly train [analytics/refinery@fc72bd31] (duration: 07m 11s) [13:23:09] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11578531 (10Jclark-ctr) [13:27:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P88491 and previous config saved to /var/cache/conftool/dbconfig/20260203-132745-marostegui.json [13:27:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P88492 and previous config saved to /var/cache/conftool/dbconfig/20260203-132755-marostegui.json [13:28:06] (03CR) 10Pmiazga: [C:03+1] "checked locally - tests works, service gets up properly" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1229119 (owner: 10Daniel Kinzler) [13:29:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P88493 and previous config saved to /var/cache/conftool/dbconfig/20260203-132936-marostegui.json [13:31:18] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:37:05] (03CR) 10Elukey: docker_registry: move /v2/restricted to the s3 restricted backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [13:37:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T415786)', diff saved to https://phabricator.wikimedia.org/P88494 and previous config saved to /var/cache/conftool/dbconfig/20260203-133754-marostegui.json [13:37:57] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:37:59] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:38:11] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1195.eqiad.wmnet with reason: Maintenance [13:38:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1195 (T415786)', diff saved to https://phabricator.wikimedia.org/P88495 and previous config saved to /var/cache/conftool/dbconfig/20260203-133818-marostegui.json [13:38:20] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [13:43:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P88496 and previous config saved to /var/cache/conftool/dbconfig/20260203-134303-marostegui.json [13:43:21] (03CR) 10Jcrespo: "Ack, amending." 
[puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [13:44:48] (03PS2) 10Jcrespo: backups: Remove analytics_meta regular backups: hue & airflow* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) [13:44:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T415786)', diff saved to https://phabricator.wikimedia.org/P88497 and previous config saved to /var/cache/conftool/dbconfig/20260203-134445-marostegui.json [13:44:58] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:45:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:45:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T415786)', diff saved to https://phabricator.wikimedia.org/P88498 and previous config saved to /var/cache/conftool/dbconfig/20260203-134514-marostegui.json [13:45:27] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [13:46:03] (03CR) 10Jcrespo: [C:03+1] "Ok to merge AND deploy, or do you want to wait more for any reason?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [13:46:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236247 (https://phabricator.wikimedia.org/T416294) (owner: 10Samwilson) [13:48:51] (03Merged) 10jenkins-bot: Remove unused SpecialMobileEditWatchlist::outputSubtitle() [extensions/MobileFrontend] (wmf/1.46.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1236247 (https://phabricator.wikimedia.org/T416294) (owner: 10Samwilson) [13:49:31] (03PS1) 10Elukey: sre.hosts.provision: initialize dict when setting lldp [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [13:49:57] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1236247|Remove unused SpecialMobileEditWatchlist::outputSubtitle() (T416294)]] [13:50:13] T416294: SpecialMobileEditWatchlist not compatible with SpecialEditWatchlist - https://phabricator.wikimedia.org/T416294 [13:51:00] (03CR) 10Brouberol: [C:03+1] "Yep, feel free to merge and deploy!" [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [13:51:59] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:52:11] !log samtar@deploy2002 samwilson, samtar: Backport for [[gerrit:1236247|Remove unused SpecialMobileEditWatchlist::outputSubtitle() (T416294)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:52:23] * TheresNoTime testing [13:52:34] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:52:51] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:53:52] !log samtar@deploy2002 samwilson, samtar: Continuing with sync [13:55:26] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: initialize dict when setting lldp [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [13:57:38] (03PS1) 10Slyngshede: Permission: command for expiring permission requests [software/bitu] - 10https://gerrit.wikimedia.org/r/1236298 (https://phabricator.wikimedia.org/T416152) [13:58:00] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1236247|Remove unused SpecialMobileEditWatchlist::outputSubtitle() (T416294)]] (duration: 08m 03s) [13:58:12] T416294: SpecialMobileEditWatchlist not compatible with SpecialEditWatchlist - https://phabricator.wikimedia.org/T416294 [13:58:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T415786)', diff saved to https://phabricator.wikimedia.org/P88500 and previous config saved to /var/cache/conftool/dbconfig/20260203-135813-marostegui.json [13:58:23] (03CR) 10Ayounsi: [C:03+1] sre.hosts.provision: initialize dict when setting lldp [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [13:58:24] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [13:58:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [13:58:41] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Depooling db1170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88501 and previous config saved to /var/cache/conftool/dbconfig/20260203-135840-marostegui.json [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1400) [14:00:05] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:46] (03CR) 10Brouberol: [C:03+1] "LG!" [puppet] - 10https://gerrit.wikimedia.org/r/1233836 (https://phabricator.wikimedia.org/T414389) (owner: 10Xcollazo) [14:01:04] jouncebot: 2slow4me :D [14:04:07] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:04:30] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for alexsanford [puppet] - 10https://gerrit.wikimedia.org/r/1236254 (owner: 10Muehlenhoff) [14:11:39] (03PS1) 10Ayounsi: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 [14:12:08] (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [14:12:17] (03PS2) 10Ayounsi: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 [14:12:25] (03CR) 10Brouberol: [C:03+2] an-test-druid: disable noisy GC stat logging [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [14:13:30] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [14:13:31] (03PS2) 10Brouberol: an-test-druid: disable noisy GC stat logging [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) [14:14:26] (03CR) 10Brouberol: [C:03+2] an-test-druid: disable noisy GC stat logging [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [14:15:08] (03Abandoned) 10Brouberol: an-test-druid: disable noisy GC stat logging [puppet] - 10https://gerrit.wikimedia.org/r/1227786 (https://phabricator.wikimedia.org/T278056) (owner: 10Brouberol) [14:16:18] (03PS1) 10Ayounsi: vlan_migration report: add eqiad row C/D vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236301 [14:16:47] (03CR) 10Cathal Mooney: [C:03+1] vlan_migration report: add eqiad row C/D vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236301 (owner: 10Ayounsi) [14:17:03] (03CR) 10CI reject: [V:04-1] Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [14:17:32] (03PS1) 10JavierMonton: topic: Flink enrichment pipeline [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236302 (https://phabricator.wikimedia.org/T360794) [14:22:31] (03PS2) 10Elukey: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [14:22:31] (03PS3) 10Elukey: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [14:22:31] (03PS1) 10Elukey: sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 [14:23:27] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:23:49] (03CR) 10Ayounsi: [C:03+2] 
vlan_migration report: add eqiad row C/D vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236301 (owner: 10Ayounsi) [14:24:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1015.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:24:28] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:24:58] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:25:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:25:33] (03Merged) 10jenkins-bot: vlan_migration report: add eqiad row C/D vlans [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236301 (owner: 10Ayounsi) [14:25:49] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:26:20] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:27:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11578759 (10GGalofre-WMF) Hi team, I want to be able to run some queries for some existing datasets. I'm starting with data on content contributions by topic or category for... 
[14:29:36] (03PS1) 10JavierMonton: topic: New Flink application [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) [14:30:05] elukey@cumin1003 provision (PID 2145139) is awaiting input [14:30:14] !log installing bind9 security updates [14:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:31] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v.1.1.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236276 (https://phabricator.wikimedia.org/T415325) (owner: 10Santiago Faci) [14:30:55] (03PS1) 10Ayounsi: vlan_migration: also look for hosts in eqiad [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236306 [14:31:03] (03PS2) 10JavierMonton: topic: New Flink application [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) [14:31:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11578785 (10elukey) The host is provisioned now! I didn't have any issue in picking up the NIC, the main problem was related to LLDP not being set correctly (an error... 
[14:32:21] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v.1.1.8 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236276 (https://phabricator.wikimedia.org/T415325) (owner: 10Santiago Faci) [14:32:53] (03CR) 10Ayounsi: [C:03+2] vlan_migration: also look for hosts in eqiad [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236306 (owner: 10Ayounsi) [14:34:33] (03Merged) 10jenkins-bot: vlan_migration: also look for hosts in eqiad [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1236306 (owner: 10Ayounsi) [14:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:35:37] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:36:07] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:37:47] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [14:41:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11578842 (10BTullis) I have unmounted /dev/sdd1 on this host, so feel free to replace the drive. [14:42:36] 10SRE-SLO, 10Observability-Alerting, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11578843 (10tappof) Requested a new Gerrit repository to store Sloth manifests: https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests Develo... 
[14:43:57] (03CR) 10Ayounsi: [C:03+1] sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 (owner: 10Elukey) [14:44:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11578845 (10BTullis) Thanks. I've unmounted `/dev/sdl1` which was still showing errors on `dmesg -T` so you can feel free to swap the drive now. Here is the current state of the physical disks. ` ----... [14:44:17] (03PS1) 10Slyngshede: Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1236307 [14:44:29] (03PS3) 10JavierMonton: topic: New Flink application [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) [14:44:53] (03PS1) 10Muehlenhoff: VRTS: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) [14:46:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 06Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11578851 (10VRiley-WMF) I have moved equipment that is currently in E16 to the recommended locations according to the google doc [14:48:59] (03CR) 10Pmiazga: rest gateway: add tests for chart rendering (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 (owner: 10Daniel Kinzler) [14:49:05] (03CR) 10JavierMonton: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1236305 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [14:49:22] (03CR) 10Xcollazo: [C:03+1] "Thanks for review @brouberol@wikimedia.org, if you have +2, please go ahead." [dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) (owner: 10Xcollazo) [14:49:39] (03CR) 10Brouberol: [C:03+2] Mark XML content dump jobs as deprecated. 
[dumps] - 10https://gerrit.wikimedia.org/r/1235874 (https://phabricator.wikimedia.org/T416180) (owner: 10Xcollazo) [14:49:51] (03PS4) 10Jelto: gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 [14:50:35] (03CR) 10Ayounsi: [C:03+1] sre.hardware.upgrade-firmware: fix logging warning from prospector (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 (owner: 10Elukey) [14:50:45] (03CR) 10Xcollazo: [C:03+1] "Thanks for review @brouberol@wikimedia.org and @joal@wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/1233836 (https://phabricator.wikimedia.org/T414389) (owner: 10Xcollazo) [14:51:07] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [14:52:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:53:17] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:53:21] (03PS5) 10Jelto: gitlab: set qos to low in rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1234984 [14:53:25] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:53:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:21] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7968/co" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [14:55:22] (03CR) 10Brouberol: [C:03+2] analytics: 
refinery: add data purge for File Export. [puppet] - 10https://gerrit.wikimedia.org/r/1233836 (https://phabricator.wikimedia.org/T414389) (owner: 10Xcollazo) [14:55:29] (03CR) 10Ayounsi: [C:03+1] sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [14:56:22] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:56:32] (03CR) 10Jelto: [C:03+1] "I like the more generic approach, I rebased Ic530008b013625358b3670a90de660011e4269e9 and PCC looks reasonable as well" [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [14:57:01] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [14:58:08] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11578947 (10Jclark-ctr) Sorry i gave you wrong ticket. thank you will take care of it shortly >>! In T409060#11578845, @BTullis wrote: > Thanks. I've unmounted `/dev/sdl1` which was still showing erro... 
[14:58:12] (03CR) 10Ssingh: [C:03+1] Meta IP location changes [dns] - 10https://gerrit.wikimedia.org/r/1236307 (owner: 10Slyngshede) [14:59:29] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1500) [15:01:48] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:09:16] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:23] !log installing openjdk-17 security updates [15:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:24] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [15:10:33] (03PS1) 10Cathal Mooney: Revert^2 "plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1236313 [15:11:00] (03CR) 10Ssingh: [C:03+1] Revert^2 "plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1236313 (owner: 10Cathal Mooney) [15:12:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:13:12] (03CR) 10Cathal Mooney: [C:03+2] Revert^2 "plugins/wmf-netbox: remove ipv4 only for DNS hosts BGP" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1236313 (owner: 10Cathal Mooney) [15:13:13] (03PS1) 10Gerrit 
maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1236315 (https://phabricator.wikimedia.org/T416356) [15:14:27] (03CR) 10Brouberol: [C:03+1] Stop logging batch start [dumps] - 10https://gerrit.wikimedia.org/r/1229127 (https://phabricator.wikimedia.org/T408423) (owner: 10Jakob) [15:14:31] (03CR) 10Brouberol: [C:03+2] Stop logging batch start [dumps] - 10https://gerrit.wikimedia.org/r/1229127 (https://phabricator.wikimedia.org/T408423) (owner: 10Jakob) [15:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [15:15:28] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358 (10Jacob_WMDE) 03NEW [15:15:36] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:15:43] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:16:09] !log cmooney@cumin1003 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.1 - enable IPv6 SAFI for DNS hosts - cmooney@cumin1003 [15:17:47] !log cmooney@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1003.eqiad.wmnet with reason: Release v0.11.1 - enable IPv6 SAFI for DNS hosts - cmooney@cumin1003 [15:18:08] (03PS2) 10Elukey: sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 [15:18:13] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:18:19] (03CR) 10Elukey: sre.hardware.upgrade-firmware: fix logging warning from prospector (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 (owner: 10Elukey) [15:18:51] (03PS3) 10Elukey: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [15:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:20:25] (03Abandoned) 10Alexandros Kosiaris: multi-dc: Switch www.wikifunctions to Single-DC [puppet] - 10https://gerrit.wikimedia.org/r/1218338 (https://phabricator.wikimedia.org/T405461) (owner: 10Alexandros Kosiaris) [15:22:22] (03Abandoned) 10Alexandros Kosiaris: [DNM] Showcase atomic: false for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057907 (owner: 10Alexandros Kosiaris) [15:23:05] (03CR) 10CI reject: [V:04-1] sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 (owner: 10Elukey) [15:23:22] (03Abandoned) 10Alexandros Kosiaris: Add a blubber.yaml file [software/debmonitor] - 10https://gerrit.wikimedia.org/r/434020 (owner: 10Alexandros Kosiaris) [15:23:22] (03Abandoned) 10Alexandros Kosiaris: Add a basic requirements.txt file for the pipeline [software/debmonitor] - 10https://gerrit.wikimedia.org/r/434021 (owner: 10Alexandros Kosiaris) [15:23:22] (03Abandoned) 10Alexandros Kosiaris: Revert "Revert "Revert "Revert "Add the LVS blocks to url_downloader"""" [puppet] - 10https://gerrit.wikimedia.org/r/346287 (owner: 10Alexandros Kosiaris) [15:23:22] (03Abandoned) 10Alexandros Kosiaris: scap: 
Move prefix from confd to key creation [puppet] - 10https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [15:23:23] (03Abandoned) 10Alexandros Kosiaris: cassandra::single_instance: Remove thrift ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/513122 (owner: 10Alexandros Kosiaris) [15:23:24] (03Abandoned) 10Alexandros Kosiaris: Narrow down ferm etcd allow_from. Take #2 [puppet] - 10https://gerrit.wikimedia.org/r/482655 (owner: 10Alexandros Kosiaris) [15:23:28] (03Abandoned) 10Alexandros Kosiaris: build_envoy_config: Allow data to be a list [puppet] - 10https://gerrit.wikimedia.org/r/816810 (owner: 10Alexandros Kosiaris) [15:23:32] (03Abandoned) 10Alexandros Kosiaris: Remove the puppet lvm module [puppet] - 10https://gerrit.wikimedia.org/r/787708 (https://phabricator.wikimedia.org/T67270) (owner: 10Alexandros Kosiaris) [15:23:36] (03Abandoned) 10Alexandros Kosiaris: interface: Add a new define for handling /e/n/i config [puppet] - 10https://gerrit.wikimedia.org/r/351603 (owner: 10Alexandros Kosiaris) [15:23:40] (03Abandoned) 10Alexandros Kosiaris: Remove fc-list file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [15:23:44] (03Abandoned) 10Alexandros Kosiaris: Stop pinning the cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/544966 (https://phabricator.wikimedia.org/T200803) (owner: 10Alexandros Kosiaris) [15:23:48] (03Abandoned) 10Alexandros Kosiaris: DNM: tilerator: Remove as much as possible of the last cruft [puppet] - 10https://gerrit.wikimedia.org/r/1169223 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [15:23:52] (03Abandoned) 10Alexandros Kosiaris: Run sextant update charts/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/942421 (owner: 10Alexandros Kosiaris) [15:23:57] (03Abandoned) 10Alexandros Kosiaris: WIP: Support arm64 in sre.hosts.provision [cookbooks] - 10https://gerrit.wikimedia.org/r/1163330 
(https://phabricator.wikimedia.org/T397653) (owner: 10Alexandros Kosiaris) [15:24:13] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [15:25:43] (03PS3) 10Elukey: sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 [15:25:46] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [15:26:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T415786)', diff saved to https://phabricator.wikimedia.org/P88503 and previous config saved to /var/cache/conftool/dbconfig/20260203-152602-marostegui.json [15:26:07] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:28:56] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing authdns IPv6 change] [15:29:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T415786)', diff saved to https://phabricator.wikimedia.org/P88504 and previous config saved to /var/cache/conftool/dbconfig/20260203-152941-marostegui.json [15:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1530) [15:30:15] (03PS3) 10Ssingh: dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) [15:31:07] (03CR) 10Scott French: [C:03+1] "Picking this up while folks are out. LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1233161 (owner: 10Daniel Kinzler) [15:31:18] (03CR) 10Elukey: [C:03+2] sre.hardware.upgrade-firmware: fix logging warning from prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/1236303 (owner: 10Elukey) [15:31:27] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [15:31:33] (03PS4) 10Elukey: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [15:31:33] (03CR) 10CI reject: [V:04-1] sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [15:31:45] (03CR) 10Ayounsi: [C:03+2] Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [15:32:06] !log sudo cumin "A:dnsbox" "disable-puppet 'merging CR 1230351'": T81605 [15:32:07] (03PS4) 10Elukey: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [15:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:09] T81605: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 [15:32:27] (03PS5) 10Elukey: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [15:32:27] (03PS5) 10Elukey: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [15:33:03] (03CR) 10Ssingh: [C:03+2] dnsbox: advertise ns[0-2] IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1230351 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [15:34:16] FIRING: [2x] JobUnavailable: Reduced availability 
for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:41] 06SRE, 10LDAP-Access-Requests: Add Jacob Thwaites WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T416358#11579161 (10conny-kawohl_WMDE) Hi my name is Conny Kawohl, and I am the Engineering Manager of @Jacob_WMDE. Please add Jacob to the requested groups. [15:35:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P88505 and previous config saved to /var/cache/conftool/dbconfig/20260203-153611-marostegui.json [15:37:16] (03CR) 10Dzahn: [C:03+1] Add an option to the flag generated firewall rules with low QoS [puppet] - 10https://gerrit.wikimedia.org/r/1236243 (owner: 10Muehlenhoff) [15:39:07] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: testing authdns IPv6 change] [15:40:02] 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 13Patch-For-Review: Request: wdqs shell access for user lerickson - https://phabricator.wikimedia.org/T415373#11579187 (10Gehel) [15:40:35] (03CR) 10Scott French: [C:03+1] "Picking this up as well while folks are out. LGTM, with one highlight." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1230444 (owner: 10Daniel Kinzler) [15:40:44] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:44] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:44] PROBLEM - Host ns2-v6 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:17] hmmm [15:41:27] (03PS6) 10Elukey: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) [15:41:45] expected, let's wait for homer to finish running [15:43:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88506 and previous config saved to /var/cache/conftool/dbconfig/20260203-154308-marostegui.json [15:43:12] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:44:08] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org,service=authdns-ns2 [reason: testing authdns IPv6 change] [15:44:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P88507 and previous config saved to /var/cache/conftool/dbconfig/20260203-154449-marostegui.json [15:44:58] !log root@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover test-s4 None [15:45:56] (03PS1) 10TChin: [eventgate-analytics-external] Bump to 1.27.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236324 (https://phabricator.wikimedia.org/T411454) [15:46:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P88508 and previous config saved to /var/cache/conftool/dbconfig/20260203-154619-marostegui.json [15:46:31] (03CR) 10Slyngshede: [C:03+2] Meta IP location 
changes [dns] - 10https://gerrit.wikimedia.org/r/1236307 (owner: 10Slyngshede) [15:46:36] !log slyngshede@dns1004 START - running authdns-update [15:47:55] !log slyngshede@dns1004 END - running authdns-update [15:47:58] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org,service=authdns-ns2 [reason: testing authdns IPv6 change] [15:49:36] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 116.61 ms [15:49:46] ^ ok great [15:50:31] (03CR) 10Arendpieter: [C:03+1] Docker build [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1229106 (https://phabricator.wikimedia.org/T412826) (owner: 10Slyngshede) [15:50:58] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [15:51:23] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [15:56:15] (03CR) 10Milimetric: [C:03+2] "part of a hotfix deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236324 (https://phabricator.wikimedia.org/T411454) (owner: 10TChin) [15:56:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T415786)', diff saved to https://phabricator.wikimedia.org/P88510 and previous config saved to /var/cache/conftool/dbconfig/20260203-155628-marostegui.json [15:56:31] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [15:56:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [15:57:04] (03CR) 10Phuedx: [C:03+1] [eventgate-analytics-external] Bump to 1.27.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236324 (https://phabricator.wikimedia.org/T411454) (owner: 10TChin) [15:57:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:57:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T415786)', diff saved to https://phabricator.wikimedia.org/P88511 and previous config saved to /var/cache/conftool/dbconfig/20260203-155713-marostegui.json [15:58:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P88513 and previous config saved to /var/cache/conftool/dbconfig/20260203-155816-marostegui.json [15:58:18] (03Merged) 10jenkins-bot: [eventgate-analytics-external] Bump to 1.27.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236324 (https://phabricator.wikimedia.org/T411454) (owner: 10TChin) [15:59:14] (03PS1) 10Federico Ceratto: hiera: Promote db-test1003 to test-s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1236327 (https://phabricator.wikimedia.org/T409926) [15:59:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P88515 and previous config saved to /var/cache/conftool/dbconfig/20260203-155957-marostegui.json [16:00:05] jelto, arnoldokoth, mutante, and arnaudb: How many deployers does it take to do SRE Collaboration Services office hours deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1600). 
[16:00:23] (03CR) 10Elukey: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [16:00:38] (03CR) 10Scott French: [C:03+1] docker_registry: move /v2/restricted to the s3 restricted backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1229145 (https://phabricator.wikimedia.org/T412951) (owner: 10Elukey) [16:00:52] (03PS17) 10Daniel Kinzler: rest gateway: add tests for chart rendering [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 [16:01:26] (03CR) 10Daniel Kinzler: rest gateway: add tests for chart rendering (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1225085 (owner: 10Daniel Kinzler) [16:04:32] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:04:58] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [16:05:10] (03Merged) 10jenkins-bot: sre.hosts.provision: set self.config_changes as defaultdict [cookbooks] - 10https://gerrit.wikimedia.org/r/1236297 (https://phabricator.wikimedia.org/T414725) (owner: 10Elukey) [16:05:28] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [16:06:23] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [16:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:10:21] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [16:11:10] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [16:11:15] 10SRE-SLO, 10Observability-Alerting, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): sloth deployment - 
https://phabricator.wikimedia.org/T414579#11579392 (10herron) For the purposes of sloth onboarding I'd strongly prefer SLO definitions continue to live in puppet, for several reasons: * Abstra... [16:12:34] (03CR) 10Federico Ceratto: [C:03+2] hiera: Promote db-test1003 to test-s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1236327 (https://phabricator.wikimedia.org/T409926) (owner: 10Federico Ceratto) [16:13:20] !log disable Hurricane Electric IPv6 BGP session on cr2-magru to troubleshoot ns2 IPv6 routing issue [16:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P88517 and previous config saved to /var/cache/conftool/dbconfig/20260203-161325-marostegui.json [16:14:16] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236331 [16:15:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T415786)', diff saved to https://phabricator.wikimedia.org/P88519 and previous config saved to /var/cache/conftool/dbconfig/20260203-161506-marostegui.json [16:15:09] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:15:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:15:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88520 and previous config saved to /var/cache/conftool/dbconfig/20260203-161530-marostegui.json [16:15:55] (03PS6) 10Elukey: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [16:16:04] (03CR) 10Elukey: [C:03+2] "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) 
[16:19:05] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236331 (owner: 10PipelineBot) [16:19:40] (03CR) 10CI reject: [V:04-1] Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [16:21:19] (03CR) 10Dzahn: [C:03+2] mailman: Enable profile::auto_restarts::service for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1235824 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:21:52] (03Merged) 10jenkins-bot: Add eqiad row C/D to LEGACY_VLANS [cookbooks] - 10https://gerrit.wikimedia.org/r/1236300 (owner: 10Ayounsi) [16:21:58] (03CR) 10AOkoth: [C:03+1] "Thanks for this @mmuhlenhoff@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:24:04] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11579465 (10elukey) @Jclark-ctr sadly from Redfish I don't see any LinkUp: ` >>> pprint(r.request("GET", f"{r.system_manager}/EthernetInterfaces/NIC.Integrated.1-2-1").json()['LinkStatus']) None >>>... 
[16:25:25] (03CR) 10Dzahn: [C:03+1] ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:28:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88521 and previous config saved to /var/cache/conftool/dbconfig/20260203-162833-marostegui.json [16:28:37] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [16:28:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:30:10] (03PS1) 10Jgiannelos: pcs: Configure node runtime memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236332 (https://phabricator.wikimedia.org/T410296) [16:31:17] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236333 [16:32:10] (03PS3) 10Jcrespo: backups: Remove analytics_meta regular backups: hue & airflow* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) [16:32:16] (03PS5) 10Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) [16:32:43] (03CR) 10Dzahn: [C:03+2] VRTS: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:33:57] (03CR) 10Dzahn: "let's use I591dcb365702812348 - it seems nice for other use cases too" [puppet] - 10https://gerrit.wikimedia.org/r/1234984 (owner: 10Jelto) [16:34:43] (03CR) 10Scott French: [C:03+1] pcs: Configure node runtime memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236332 
(https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [16:35:35] (03CR) 10Dzahn: [C:03+1] "@Brett just confirming that this does not point to ncredir anymore in DNS - so it's just a cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [16:40:12] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11579521 (10Dzahn) 05Stalled→03Open a:05DannyS712→03None [16:41:34] (03CR) 10Jgiannelos: [C:03+2] pcs: Configure node runtime memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236332 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [16:43:41] (03Merged) 10jenkins-bot: pcs: Configure node runtime memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1236332 (https://phabricator.wikimedia.org/T410296) (owner: 10Jgiannelos) [16:44:24] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:44:53] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:44:58] (03CR) 10Dzahn: [C:03+2] gerrit: remove differenciated logs for mod_qos [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) [16:45:11] (03CR) 10Dzahn: [C:03+2] "deploying based on disk space alerts and discussion on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) [16:46:54] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:47:46] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:47:52] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:47:55] (03CR) 10Dzahn: [C:03+2] "16:33 < hashar> there are 21G in /var/log out of 73G on the root partition" [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) [16:48:19] (03CR) 10Jcrespo: 
[C:03+2] backups: Remove analytics_meta regular backups: hue & airflow* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1236262 (https://phabricator.wikimedia.org/T369612) (owner: 10Jcrespo) [16:48:36] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:00:05] jhathaway and rzl: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:12] (03CR) 10Dzahn: [C:03+2] "deleted *qos*.log.gz files on gerrit hosts - disk usage down to 83% (13G free on / on prod host)" [puppet] - 10https://gerrit.wikimedia.org/r/1234269 (owner: 10Arnaudb) [17:02:00] !log gerrit - deployed gerrit:1234269 to remove separate *qos* apache logs - deleted *qos* logs to fix disk space issues - back to 83% usage on / on gerrit1003 [17:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11579700 (10jcrespo) [17:13:10] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:14:48] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11579717 (10jcrespo) [17:15:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11579718 (10jcrespo) [17:16:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11579745 
(10jcrespo) [17:16:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup10[16-20] - https://phabricator.wikimedia.org/T414728#11579746 (10jcrespo) a:05jcrespo→03None [17:17:29] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup2015 - https://phabricator.wikimedia.org/T414724#11579751 (10jcrespo) a:05jcrespo→03None [17:17:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup20[16-20] - https://phabricator.wikimedia.org/T414727#11579766 (10jcrespo) a:05jcrespo→03None [17:31:05] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: On using Wikimedia Maps to build Kiwix Openstreetmap ZIMs - https://phabricator.wikimedia.org/T416374#11579842 (10Aklapper) Hi, please follow https://wikitech.wikimedia.org/wiki/Maps/External_usage and fill out the linked form - thanks! [17:40:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026.01.23 - 2026.02.13): Degraded RAID on an-worker1187 - https://phabricator.wikimedia.org/T415002#11579875 (10Jclark-ctr) @BTullis drive has been swapped Thank you [17:40:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T416066#11579880 (10Jclark-ctr) @BTullis drive has been swapped Thank you [17:42:55] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:45:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1015.eqiad.wmnet with OS bookworm [17:45:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11579898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1015.eqiad.wmnet 
with OS bookworm [17:45:13] !log jclark@cumin1003 START - Cookbook sre.hosts.move-vlan for host backup1015 [17:45:18] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [17:46:00] !log reprepro include php8.3_8.3.30-1+wmf11u2 in component/php83 [17:46:00] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11579910 (10Jclark-ctr) [17:46:25] !log sudo cumin -b1 -s120 "A:dnsbox and not P{dns1004* or dns7001*}" "run-puppet-agent --enable 'merging CR 1230351'": T81605 [17:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:28] T81605: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 [17:48:16] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:48:25] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:39] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1015 - jclark@cumin1003" [17:48:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host backup1015 - jclark@cumin1003" [17:48:43] !log jclark@cumin1003 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [17:48:43] !log jclark@cumin1003 START - Cookbook sre.dns.wipe-cache backup1015.eqiad.wmnet 169.32.64.10.in-addr.arpa 9.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:48:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) backup1015.eqiad.wmnet 169.32.64.10.in-addr.arpa 9.6.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:48:47] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host backup1015 [17:49:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host backup1015 [17:49:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host backup1015 [17:50:30] PROBLEM - Bird Internet Routing Daemon on dns1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:50:40] ^ resolving [17:51:10] FIRING: BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:51:14] nope [17:51:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:51:30] RECOVERY - Bird Internet Routing Daemon on dns1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:51:37] :P [17:52:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T415786)', diff saved to https://phabricator.wikimedia.org/P88522 and previous config saved to 
/var/cache/conftool/dbconfig/20260203-175213-marostegui.json [17:52:16] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [17:53:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1021.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:54:35] (03CR) 10Scott French: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 (owner: 10Scott French) [17:55:12] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:55:24] (03CR) 10Scott French: [V:03+2 C:03+2] php8.3: Rebuild to pick up new PHP packages (8.3.30) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1235833 (owner: 10Scott French) [17:56:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:00:05] swfrench-wmf: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1800). 
[18:00:10] o/ [18:00:51] * swfrench-wmf will be starting a `scap sync-world` momentarily [18:01:21] jclark@cumin1003 provision (PID 2360932) is awaiting input [18:02:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P88523 and previous config saved to /var/cache/conftool/dbconfig/20260203-180221-marostegui.json [18:03:29] !log swfrench@deploy2002 Started scap sync-world: Rebuild deployment to pick up new production image [18:03:38] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:04:24] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:06:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [18:06:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11579975 (10Jclark-ctr) @elukey Thanks for the help. I’m having issues now getting it to start PXE. I’ll have to look around and see if anything els... 
[18:06:45] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe1022.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:06:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88524 and previous config saved to /var/cache/conftool/dbconfig/20260203-180650-marostegui.json [18:06:54] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:07:10] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:07:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88525 and previous config saved to /var/cache/conftool/dbconfig/20260203-180737-marostegui.json [18:08:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::7a4f:9b00:d4e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:10:18] (03CR) 10Dwisehaupt: [C:03+1] Add Cumin alias for crm [puppet] - 10https://gerrit.wikimedia.org/r/1236238 (owner: 10Muehlenhoff) [18:11:25] RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:12:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P88526 and previous config saved to 
/var/cache/conftool/dbconfig/20260203-181229-marostegui.json [18:20:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:22:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T415786)', diff saved to https://phabricator.wikimedia.org/P88527 and previous config saved to /var/cache/conftool/dbconfig/20260203-182238-marostegui.json [18:22:41] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:22:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P88528 and previous config saved to /var/cache/conftool/dbconfig/20260203-182245-marostegui.json [18:22:54] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [18:23:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T415786)', diff saved to https://phabricator.wikimedia.org/P88529 and previous config saved to /var/cache/conftool/dbconfig/20260203-182302-marostegui.json [18:23:08] * swfrench-wmf is still waiting on image builds [18:25:25] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:27:15] swfrench-wmf: https://xkcd.com/303/ but for image builds? [18:29:22] sukhe: heh, yes exactly [18:30:38] ... 
but with a bit more "anxious glance at how swift is behaving" than goofing around :) [18:30:56] PROBLEM - Bird Internet Routing Daemon on dns3004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:31:04] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:31:13] ^ expected [18:31:15] resolving shortly [18:31:58] RECOVERY - Bird Internet Routing Daemon on dns3004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:32:04] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:37:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P88530 and previous config saved to /var/cache/conftool/dbconfig/20260203-183753-marostegui.json [18:39:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe1024.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:49:24] PROBLEM - Bird Internet Routing Daemon on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:49:26] !log swfrench@deploy2002 Finished scap sync-world: Rebuild deployment to pick up new production image (duration: 46m 41s) [18:49:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ggalofre - https://phabricator.wikimedia.org/T415172#11580151 (10KCVelaga_WMF) @elukey didn't realize the newly labelled levels, nice! Gerard would want //analytics-privatedata-users level 1// for now, as he needs to access Supe... 
[18:50:24] RECOVERY - Bird Internet Routing Daemon on dns5003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:51:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:52:17] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [18:52:23] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [18:52:54] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [18:52:55] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [18:53:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T415786)', diff saved to https://phabricator.wikimedia.org/P88531 and previous config saved to /var/cache/conftool/dbconfig/20260203-185302-marostegui.json [18:53:05] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [18:53:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2173.codfw.wmnet with reason: Maintenance [18:53:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T415786)', diff saved to https://phabricator.wikimedia.org/P88532 and previous config saved to /var/cache/conftool/dbconfig/20260203-185326-marostegui.json [18:53:48] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385 (10Jhancock.wm) 03NEW [18:54:02] alright, done with the infra window for today [18:56:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:56:26] PROBLEM - Bird Internet Routing Daemon on dns5004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:56:31] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1021.eqiad.wmnet with OS bullseye [18:56:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580204 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye [18:57:26] RECOVERY - Bird Internet Routing Daemon on dns5004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:57:28] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install apus-fe100[4-5] - https://phabricator.wikimedia.org/T416386 (10Jhancock.wm) 03NEW [18:57:40] FIRING: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:59:50] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install apus-fe200[4-5] - https://phabricator.wikimedia.org/T416387 (10Jhancock.wm) 03NEW [19:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T1900) [19:02:04] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:02:24] PROBLEM - Bird Internet Routing Daemon on dns7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:02:40] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:02:58] PROBLEM - Bird Internet Routing Daemon on dns6001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:03:04] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:03:24] RECOVERY - Bird Internet Routing Daemon on dns7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:03:58] RECOVERY - Bird Internet Routing Daemon on dns6001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:04:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1023.eqiad.wmnet with OS bullseye [19:04:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host ms-fe1023.eqiad.wmnet with OS bullseye [19:04:12] !log jclark@cumin1003 START - Cookbook sre.hosts.move-vlan for host ms-fe1023 [19:04:22] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [19:04:35] 10ops-eqiad, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390 (10Jhancock.wm) 03NEW [19:07:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11580357 (10Jclark-ctr) @MatthewVernon can you help with update preseed.yaml for efi booting? 
{F71660637} [19:08:58] PROBLEM - Bird Internet Routing Daemon on dns6002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:09:42] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [19:09:57] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1015.eqiad.wmnet with OS bookworm [19:09:58] RECOVERY - Bird Internet Routing Daemon on dns6002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [19:10:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm ex... [19:10:06] jclark@cumin1003 reimage (PID 2386601) is awaiting input [19:11:45] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1023 - jclark@cumin1003" [19:11:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1023 - jclark@cumin1003" [19:11:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:11:50] !log jclark@cumin1003 START - Cookbook sre.dns.wipe-cache ms-fe1023.eqiad.wmnet 170.32.64.10.in-addr.arpa 0.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:11:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1023.eqiad.wmnet 170.32.64.10.in-addr.arpa 0.7.1.0.2.3.0.0.4.6.0.0.0.1.0.0.3.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:11:54] !log jclark@cumin1003 START - Cookbook 
sre.network.configure-switch-interfaces for host ms-fe1023 [19:12:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1023 [19:12:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1023 [19:14:35] (03PS9) 10Ryan Kemper: hadoop.reboot-workers: make host override smarter [cookbooks] - 10https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) [19:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [19:15:12] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [19:15:13] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.reboot-workers (exit_code=99) for Hadoop analytics cluster [19:15:38] RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:15:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88533 and previous config saved to /var/cache/conftool/dbconfig/20260203-191541-marostegui.json [19:15:44] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [19:16:39] ^ nice, all ns[0-2]-v6 are up [19:18:49] (03PS10) 10Ryan Kemper: hadoop.reboot-workers: make host override smarter [cookbooks] - 10https://gerrit.wikimedia.org/r/1214664 (https://phabricator.wikimedia.org/T411568) [19:19:01] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [19:19:14] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.hadoop.reboot-workers (exit_code=97) for Hadoop analytics cluster [19:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate 
eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:20:14] !log sukhe@dns1004 START - running authdns-update [19:20:25] !log testing authdns-update (NOOP run) [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:55] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics cluster [19:21:12] !log sukhe@dns1004 END - running authdns-update [19:23:19] jclark@cumin1003 reimage (PID 2386190) is awaiting input [19:23:21] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1024.eqiad.wmnet with OS bullseye [19:23:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye [19:23:33] !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host ms-fe1024 [19:23:36] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1021.eqiad.wmnet with OS bullseye [19:23:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host ms-fe1021.eqiad.wmnet with OS bullseye exe... 
[19:23:46] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [19:25:56] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install ms-fe102[14] - https://phabricator.wikimedia.org/T416245#11580452 (10Jclark-ctr) a:05Jclark-ctr→03MatthewVernon [19:27:05] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install cloudcephosd2007-dev - https://phabricator.wikimedia.org/T416396 (10Jhancock.wm) 03NEW [19:27:21] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1024 - cmooney@cumin1003" [19:27:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-fe1024 - cmooney@cumin1003" [19:27:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:27:26] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache ms-fe1024.eqiad.wmnet 205.48.64.10.in-addr.arpa 5.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:27:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-fe1024.eqiad.wmnet 205.48.64.10.in-addr.arpa 5.0.2.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:27:30] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe1024 [19:28:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe1024 [19:28:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-fe1024 [19:30:13] (03PS1) 10Ssingh: wikimedia.org: add IPv6 AAAA glue record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) [19:30:30] 06SRE, 06Data-Platform-SRE (2026.01.23 - 2026.02.13), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned 
hosts - https://phabricator.wikimedia.org/T411568#11580484 (10RKemper) Resuming an-worker reboots. Identified 46 hosts needing reboot (kernel < `5.10.244`): `an-worker[111... [19:30:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P88534 and previous config saved to /var/cache/conftool/dbconfig/20260203-193049-marostegui.json [19:30:56] (03CR) 10CI reject: [V:04-1] wikimedia.org: add IPv6 AAAA glue record for ns1 [dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:38:35] !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-fe1024.eqiad.wmnet with OS bullseye [19:38:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye ex... [19:39:36] (03CR) 10Ssingh: [C:04-2] "Do not merge because we are not ready and plus we need to add the v6 PTR." 
[dns] - 10https://gerrit.wikimedia.org/r/1236354 (https://phabricator.wikimedia.org/T81605) (owner: 10Ssingh) [19:39:58] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe1024.eqiad.wmnet with OS bullseye [19:40:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye [19:40:13] 10ops-codfw, 06SRE, 06DC-Ops: Q3:rack/setup/install cumin2003 - https://phabricator.wikimedia.org/T416385#11580539 (10Jhancock.wm) a:03MoritzMuehlenhoff @MoritzMuehlenhoff when you or someone you can delegate this to can, could you fill out the racking instructions for this server and update the required... [19:45:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P88536 and previous config saved to /var/cache/conftool/dbconfig/20260203-194557-marostegui.json [19:49:25] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11580560 (10A_smart_kitten) >>! In T413634#11489204, @A_smart_kitten wrote: > @dannys712 should your entries on https://www.mediawiki.org/wiki/... [19:50:30] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11580561 (10sbassett) Anything else left to do here? [19:51:47] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: DannyS712 "offboarding" - https://phabricator.wikimedia.org/T413634#11580566 (10A_smart_kitten) >>! In T413634#11580561, @sbassett wrote: > Anything else left to do here? FWICS there are still the additional Ge... 
[19:56:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T415786)', diff saved to https://phabricator.wikimedia.org/P88537 and previous config saved to /var/cache/conftool/dbconfig/20260203-195652-marostegui.json [19:56:56] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [19:58:00] 06SRE, 06Infrastructure-Foundations, 10netops: Cookbook sre.hosts.reimage: DHCP snippet created with old IP when --move-vlan is used - https://phabricator.wikimedia.org/T416401 (10cmooney) 03NEW p:05Triage→03Medium [19:59:12] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host backup1015.eqiad.wmnet with OS bookworm [19:59:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580615 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm [20:01:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88538 and previous config saved to /var/cache/conftool/dbconfig/20260203-200106-marostegui.json [20:01:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [20:01:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T415786)', diff saved to https://phabricator.wikimedia.org/P88539 and previous config saved to /var/cache/conftool/dbconfig/20260203-200130-marostegui.json [20:07:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P88540 and previous config saved to /var/cache/conftool/dbconfig/20260203-200700-marostegui.json [20:07:58] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: 
Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580692 (10Jclark-ctr) @jcrespo with eLukey and Topranks help we were able to get it to start imaging but is failing because preseed.yaml is mis... [20:08:34] (03CR) 10Bartosz Dziewoński: [C:03+1] "TBH, I think it'd be fine to keep the old variable name, to make the refactoring simpler, but this is fine too." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1235551 (https://phabricator.wikimedia.org/T404334) (owner: 10Gergő Tisza) [20:17:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P88541 and previous config saved to /var/cache/conftool/dbconfig/20260203-201709-marostegui.json [20:17:49] (03PS2) 10Seawolf35gerrit: Add map domains for ruwiki to the list of externallinks-excluded domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1236361 (https://phabricator.wikimedia.org/T416174) [20:25:45] (03CR) 10Dzahn: [C:03+2] "we got this one: https://phabricator.wikimedia.org/T416380" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:27:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T415786)', diff saved to https://phabricator.wikimedia.org/P88542 and previous config saved to /var/cache/conftool/dbconfig/20260203-202718-marostegui.json [20:27:21] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:27:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1218.eqiad.wmnet with reason: Maintenance [20:27:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T415786)', diff saved to https://phabricator.wikimedia.org/P88543 and previous config saved to /var/cache/conftool/dbconfig/20260203-202743-marostegui.json [20:28:07] (03CR) 10Dzahn: [C:03+2] "should be just
"rsync" - Service vrts_rsync not present or not running -" [puppet] - 10https://gerrit.wikimedia.org/r/1236308 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [20:29:31] cmooney@cumin1003 reimage (PID 2390917) is awaiting input [20:30:18] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe1024.eqiad.wmnet with OS bullseye [20:30:25] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580820 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host ms-fe1024.eqiad.wmnet with OS bullseye ex... [20:31:43] (03PS1) 10Dzahn: VRTS: fix service name for profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1236364 (https://phabricator.wikimedia.org/T416380) [20:32:44] (03CR) 10Dzahn: [C:03+2] VRTS: fix service name for profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1236364 (https://phabricator.wikimedia.org/T416380) (owner: 10Dzahn) [20:40:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T415786)', diff saved to https://phabricator.wikimedia.org/P88544 and previous config saved to /var/cache/conftool/dbconfig/20260203-204024-marostegui.json [20:40:28] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [20:40:54] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ssalgaonkar-wmf - https://phabricator.wikimedia.org/T415594#11580864 (10Sucheta-Salgaonkar-WMF) @elukey you seriously rock, thank you so much!! you just unlocked a dashboard that I *really* needed to access today, feels so good... 
[20:46:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install bast1004 - https://phabricator.wikimedia.org/T416254#11580878 (10Jclark-ctr) a:05Andrew→03Jclark-ctr [20:47:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11580883 (10Jclark-ctr) [20:50:24] jclark@cumin1003 reimage (PID 2393945) is awaiting input [20:55:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P88545 and previous config saved to /var/cache/conftool/dbconfig/20260203-205532-marostegui.json [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T2100). nyaa~ [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:10:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P88547 and previous config saved to /var/cache/conftool/dbconfig/20260203-211041-marostegui.json [21:11:31] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11580981 (10Aklapper) [21:11:33] 10ops-codfw, 06DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11580983 (10Aklapper) [21:12:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T415786)', diff saved to https://phabricator.wikimedia.org/P88548 and previous config saved to /var/cache/conftool/dbconfig/20260203-211201-marostegui.json [21:12:04] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:25:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T415786)', diff saved to 
https://phabricator.wikimedia.org/P88549 and previous config saved to /var/cache/conftool/dbconfig/20260203-212550-marostegui.json [21:25:54] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:26:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [21:26:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88550 and previous config saved to /var/cache/conftool/dbconfig/20260203-212616-marostegui.json [21:26:54] (03PS5) 10Dwisehaupt: frack dns cleanup and reconfig [dns] - 10https://gerrit.wikimedia.org/r/1233877 (https://phabricator.wikimedia.org/T364185) [21:27:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P88551 and previous config saved to /var/cache/conftool/dbconfig/20260203-212709-marostegui.json [21:28:58] (03CR) 10Herron: [C:03+1] centralauth: add recording rules for grafana widgets (write) [puppet] - 10https://gerrit.wikimedia.org/r/1236233 (https://phabricator.wikimedia.org/T415035) (owner: 10Tiziano Fogli) [21:33:17] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:42:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P88552 and previous config saved to /var/cache/conftool/dbconfig/20260203-214218-marostegui.json [21:42:25] (03CR) 10Jgreen: [C:03+1] frack dns cleanup and reconfig [dns] - 10https://gerrit.wikimedia.org/r/1233877 (https://phabricator.wikimedia.org/T364185) 
(owner: 10Dwisehaupt) [21:48:40] FIRING: SystemdUnitFailed: wmf_auto_restart_kerberos_rsync.service on krb1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:51:30] (03CR) 10Herron: [C:03+1] "Nice one! I'll help keep an eye on Thanos for the hours/days after deployment" [puppet] - 10https://gerrit.wikimedia.org/r/1219145 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [21:52:26] (03CR) 10Dwisehaupt: [C:03+2] frack dns cleanup and reconfig [dns] - 10https://gerrit.wikimedia.org/r/1233877 (https://phabricator.wikimedia.org/T364185) (owner: 10Dwisehaupt) [21:52:40] !log dwisehaupt@dns1004 START - running authdns-update [21:54:14] !log dwisehaupt@dns1004 END - running authdns-update [21:56:21] (03CR) 10Herron: [C:03+1] "Looks ok to me. What are your thoughts re: deployment strategy, maybe canary approach? Should we depool?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1219146 (https://phabricator.wikimedia.org/T412924) (owner: 10Tiziano Fogli) [21:57:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T415786)', diff saved to https://phabricator.wikimedia.org/P88553 and previous config saved to /var/cache/conftool/dbconfig/20260203-215726-marostegui.json [21:57:30] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [21:57:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [21:57:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T415786)', diff saved to https://phabricator.wikimedia.org/P88554 and previous config saved to /var/cache/conftool/dbconfig/20260203-215751-marostegui.json [21:59:08] 10SRE-SLO, 10Observability-Alerting, 13Patch-For-Review, 06SRE Observability (FY2025/2026-Q3): sloth deployment - https://phabricator.wikimedia.org/T414579#11581131 (10tappof) >>! In T414579#11579392, @herron wrote: > For the purposes of sloth onboarding I'd strongly prefer SLO definitions continue to live... 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260203T2200) [22:01:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T415786)', diff saved to https://phabricator.wikimedia.org/P88555 and previous config saved to /var/cache/conftool/dbconfig/20260203-220126-marostegui.json [22:10:02] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:11:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P88556 and previous config saved to /var/cache/conftool/dbconfig/20260203-221134-marostegui.json [22:14:49] (03CR) 10Ryan Kemper: [C:03+2] opensearch-semantic-search: provision ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702) (owner: 10Ryan Kemper) [22:16:34] PROBLEM - Host an-worker1187 is DOWN: PING CRITICAL - Packet loss = 100% [22:21:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P88558 and previous config saved to /var/cache/conftool/dbconfig/20260203-222142-marostegui.json [22:22:16] (03Merged) 10jenkins-bot: opensearch-semantic-search: provision ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1230512 (https://phabricator.wikimedia.org/T414702) (owner: 10Ryan Kemper) [22:29:09] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:29:28] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[22:31:29] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:31:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T415786)', diff saved to https://phabricator.wikimedia.org/P88559 and previous config saved to /var/cache/conftool/dbconfig/20260203-223151-marostegui.json [22:31:54] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [22:32:03] !log ryankemper@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [22:32:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1219.eqiad.wmnet with reason: Maintenance [22:32:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T415786)', diff saved to https://phabricator.wikimedia.org/P88560 and previous config saved to /var/cache/conftool/dbconfig/20260203-223216-marostegui.json [22:33:28] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:34:22] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [22:34:32] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [22:35:25] !log ryankemper@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[22:39:28] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:39:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 1/3 UP : OSPFv3: 1/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:41:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:41:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:41:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:44:55] (CR) Ryan Kemper: [C:+2] opensearch-semantic-search: deploy eqiad & codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1234128 (https://phabricator.wikimedia.org/T414691) (owner: Ryan Kemper) [22:46:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 208.80.153.216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:46:36] (Merged) jenkins-bot: opensearch-semantic-search: deploy eqiad & codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1234128 (https://phabricator.wikimedia.org/T414691) (owner: Ryan Kemper) [22:49:13] ryankemper@cumin2002 reboot-workers (PID 319111) is awaiting input [23:00:10] !log bking@laptop roll-restarting wdqs codfw as it's lagging heavily [23:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:44] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T415786)', diff saved to https://phabricator.wikimedia.org/P88561 and previous config saved to /var/cache/conftool/dbconfig/20260203-230343-marostegui.json [23:03:47] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:10:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88562 and previous config saved to /var/cache/conftool/dbconfig/20260203-231044-marostegui.json [23:10:48] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:14:40] FIRING: MirrorHighLag: Mirrors - /srv/mirrors/ubuntu synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag [23:18:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P88563 and previous config saved to /var/cache/conftool/dbconfig/20260203-231851-marostegui.json [23:19:16] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventstreams-internal.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:23:29] !log vrts1003 - fix systemd state: sed -i 's/vrts_rsync/rsync/' /lib/systemd/system/wmf_auto_restart_vrts_rsync.service ; systemctl daemon-reload - T416380 T135991 [23:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:33] T416380: SystemdUnitFailed - wmf_auto_restart_vrts_rsync.service on vrts1003 - https://phabricator.wikimedia.org/T416380 [23:23:34] T135991: Automated service restarts for common low-level system services 
- https://phabricator.wikimedia.org/T135991 [23:25:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P88564 and previous config saved to /var/cache/conftool/dbconfig/20260203-232552-marostegui.json [23:27:32] (CR) Dzahn: [C:+2] "https://phabricator.wikimedia.org/T416380#11581327" [puppet] - https://gerrit.wikimedia.org/r/1236364 (https://phabricator.wikimedia.org/T416380) (owner: Dzahn) [23:28:09] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations: Eqiad: Fr-tech expansion - https://phabricator.wikimedia.org/T403035#11581330 (RobH) I've taken the comment T403035#11518969 and updated it with our meeting plan of moving mgmt links to E7/E8. [[ https://docs.google.co... [23:34:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P88565 and previous config saved to /var/cache/conftool/dbconfig/20260203-233400-marostegui.json [23:40:52] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1015.eqiad.wmnet with OS bookworm [23:40:59] ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install backup1015 - https://phabricator.wikimedia.org/T414725#11581338 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host backup1015.eqiad.wmnet with OS bookworm ex... 
[23:41:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P88566 and previous config saved to /var/cache/conftool/dbconfig/20260203-234100-marostegui.json [23:41:19] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [23:45:10] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003" [23:45:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt tools-k8 - jclark@cumin1003" [23:45:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:46:52] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-ctrl1001 [23:47:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-ctrl1001 [23:47:18] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-ctrl1002 [23:47:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-ctrl1002 [23:47:46] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1001 [23:47:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1001 [23:48:07] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1002 [23:48:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1002 [23:49:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T415786)', diff saved to https://phabricator.wikimedia.org/P88567 
and previous config saved to /var/cache/conftool/dbconfig/20260203-234908-marostegui.json [23:49:11] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:49:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [23:49:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T415786)', diff saved to https://phabricator.wikimedia.org/P88568 and previous config saved to /var/cache/conftool/dbconfig/20260203-234932-marostegui.json [23:53:44] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1003 [23:54:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1003 [23:54:18] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1004 [23:54:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1004 [23:54:40] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1005 [23:54:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1005 [23:54:54] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1006 [23:55:03] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1006 [23:55:07] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host tools-k8s-worker1007 [23:55:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1007 [23:55:20] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host 
tools-k8s-worker1008 [23:55:29] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host tools-k8s-worker1008 [23:56:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T415786)', diff saved to https://phabricator.wikimedia.org/P88569 and previous config saved to /var/cache/conftool/dbconfig/20260203-235609-marostegui.json [23:56:12] T415786: Update imagelinks primary key on wmf production - https://phabricator.wikimedia.org/T415786 [23:56:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [23:56:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T415786)', diff saved to https://phabricator.wikimedia.org/P88570 and previous config saved to /var/cache/conftool/dbconfig/20260203-235634-marostegui.json [23:57:17] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie [23:57:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie [23:57:29] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581364 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-ctrl1001.eqiad.wmnet with OS trixie [23:57:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11581365 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host tools-k8s-ctrl1002.eqiad.wmnet with OS trixie