[00:07:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1169269 [00:08:00] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1169269 (owner: 10TrainBranchBot) [00:39:40] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1169269 (owner: 10TrainBranchBot) [00:49:42] 10ops-esams, 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11002922 (10BCornwall) [00:56:41] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/656116f558b545d0be774668bc593e31de0367f572473a72482afc9b5accedfa/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:07:41] PROBLEM - Host mr1-magru IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:07:53] PROBLEM - Host mr1-magru is DOWN: PING CRITICAL - Packet loss = 100% [01:07:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180) [01:07:56] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [01:09:58] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:10:27] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:42] FIRING: JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:41] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:18:35] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [01:19:10] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [01:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0200) [02:03:35] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0300) [03:01:57] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180) [03:01:59] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [03:02:42] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot) [03:03:03] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.10 refs T392180 [03:03:07] T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180 [03:07:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:07:43] 10ops-magru: Power Supply - PS Redundancy - issue on ganeti7001:9290 - https://phabricator.wikimedia.org/T399525 (10phaultfinder) 03NEW [03:20:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:21:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.418 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:48:40] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.10 refs T392180 (duration: 45m 36s) [03:48:44] T392180: 1.45.0-wmf.10 deployment 
blockers - https://phabricator.wikimedia.org/T392180 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0400) [04:00:24] (03CR) 10Arnaudb: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: 10Arnaudb) [04:01:54] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.7 (duration: 01m 42s) [04:24:10] FIRING: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:27:47] (03CR) 10Arnaudb: [C:03+1] mailman: avoid pint linting alerts related to backup instance [alerts] - 10https://gerrit.wikimedia.org/r/1169107 (owner: 10Tiziano Fogli) [04:28:31] 10SRE-swift-storage, 10MinT, 10LPL Essential (2025 Jul-Sep), 10LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11003045 (10KartikMistry) Since logs are fine, we don't have anything specific to QA for th... [04:29:10] RESOLVED: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:30:06] (03CR) 10Arnaudb: [C:03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [04:34:00] (03CR) 10Arnaudb: [C:03+1] "lgtm!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 (owner: 10Volans) [05:03:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:53] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:03:58] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:04:03] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:07:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:13] FIRING: [19x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:35] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [05:35:45] (03PS1) 10Novem Linguae: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 [05:36:33] (03PS2) 10Novem Linguae: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) [05:49:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T399446 [05:49:48] T399446: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T399446 [05:50:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1210 with weight 0 T399446', diff saved to https://phabricator.wikimedia.org/P79039 and previous config saved to /var/cache/conftool/dbconfig/20250715-055011-root.json [05:51:57] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1210 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1169049 (https://phabricator.wikimedia.org/T399446) (owner: 10Gerrit maintenance bot) [05:53:42] (03CR) 10SD0001: "The issue seems to be due to a mw core bug in the process of automatically 
inserting the new content model entry in the db table. The work" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae) [05:54:10] !log Starting s5 eqiad failover from db1230 to db1210 - T399446 [05:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:58:15] (03CR) 10Novem Linguae: "If deployers have edit access to the production SQL database and are willing to try this, +1 from me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae) [06:00:08] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0600) [06:00:08] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0600) [06:01:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T399446', diff saved to https://phabricator.wikimedia.org/P79040 and previous config saved to /var/cache/conftool/dbconfig/20250715-060114-root.json [06:01:20] T399446: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T399446 [06:02:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T399446', diff saved to https://phabricator.wikimedia.org/P79041 and previous config saved to /var/cache/conftool/dbconfig/20250715-060223-marostegui.json [06:03:35] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:04:18] (03CR) 10Marostegui: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1169050 (https://phabricator.wikimedia.org/T399446) (owner: 10Gerrit maintenance bot) [06:04:21] !log marostegui@dns1006 START - running authdns-update [06:05:11] !log marostegui@dns1006 END - running authdns-update [06:06:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 T399446', diff saved to https://phabricator.wikimedia.org/P79042 and previous config saved to /var/cache/conftool/dbconfig/20250715-060600-root.json [06:13:57] (03PS1) 10Muehlenhoff: Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1169301 [06:15:10] FIRING: [3x] BFDdown: BFD session down between cr1-magru and 2001:1498:1:966:1::251 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:15:55] (03CR) 
10Muehlenhoff: [C:03+2] Remove access for andyrussg [puppet] - 10https://gerrit.wikimedia.org/r/1169301 (owner: 10Muehlenhoff) [06:17:53] (03PS1) 10Marostegui: db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169302 (https://phabricator.wikimedia.org/T399446) [06:18:33] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1230.eqiad.wmnet with reason: maintenance [06:18:45] (03CR) 10Marostegui: [C:03+2] db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169302 (https://phabricator.wikimedia.org/T399446) (owner: 10Marostegui) [06:19:00] moritzm: ok to merge? [06:20:10] RESOLVED: [3x] BFDdown: BFD session down between cr1-magru and 2001:1498:1:966:1::251 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [06:22:29] moritzm: ping [06:23:00] moritzm: I've merged as the commit says the contract has ended [06:24:33] (03PS1) 10Marostegui: db1230: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169303 (https://phabricator.wikimedia.org/T398928) [06:26:03] (03CR) 10Marostegui: [C:03+2] db1230: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169303 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [06:26:26] marostegui: thanks, sorry got distracted by something else [06:26:39] no worries [06:28:58] (03CR) 10Tiziano Fogli: [C:03+2] mailman: avoid pint linting alerts related to backup instance [alerts] - 10https://gerrit.wikimedia.org/r/1169107 (owner: 10Tiziano Fogli) [06:29:16] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 2394 hosts [06:29:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1185.eqiad.wmnet onto db1230.eqiad.wmnet [06:29:55] !log 
marostegui@cumin1002 START - Cookbook sre.mysql.depool db1185 - Depool db1185.eqiad.wmnet to then clone it to db1230.eqiad.wmnet - marostegui@cumin1002 [06:30:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1185 - Depool db1185.eqiad.wmnet to then clone it to db1230.eqiad.wmnet - marostegui@cumin1002 [06:36:11] (03PS1) 10Muehlenhoff: Extend access until end of month [puppet] - 10https://gerrit.wikimedia.org/r/1169305 [06:36:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:36:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:36:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79044 and previous config saved to /var/cache/conftool/dbconfig/20250715-063651-marostegui.json [06:36:55] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [06:38:00] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2229 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1169306 (https://phabricator.wikimedia.org/T399533) [06:39:10] (03CR) 10Muehlenhoff: [C:03+2] Extend access until end of month [puppet] - 10https://gerrit.wikimedia.org/r/1169305 (owner: 10Muehlenhoff) [06:46:28] (03CR) 10Elukey: [C:03+1] deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (owner: 10Alexandros Kosiaris) [06:48:12] (03CR) 10Muehlenhoff: [C:04-1] "This patches the wrong files, these are only used by the old buster nodes, which will be entirely decommissioned once the new ones based on" [puppet] - 10https://gerrit.wikimedia.org/r/1169221 (owner: 10Alexandros Kosiaris) [06:48:25] (03CR) 10Elukey: admin: Empty out kartotherian-admin 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris) [06:48:40] (03CR) 10Elukey: [C:03+1] admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (owner: 10Alexandros Kosiaris) [06:52:22] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (owner: 10Alexandros Kosiaris) [06:54:49] (03CR) 10Muehlenhoff: "We can simply leave the Hiera changes to master.yaml and replica.yaml as-is, they will be entirely removed in a few weeks (when the old ro" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris) [06:55:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (owner: 10Alexandros Kosiaris) [06:57:34] (03CR) 10Muehlenhoff: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (owner: 10Alexandros Kosiaris) [07:00:04] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:03:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79045 and previous config saved to /var/cache/conftool/dbconfig/20250715-070305-marostegui.json [07:03:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:06:28] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11003252 (10Marostegui) Thank you! 
[07:07:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:11:09] marostegui@cumin1002 clone (PID 1530088) is awaiting input [07:11:49] (03CR) 10Elukey: statistics: Add Python script for model uploading to statistics machines. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: 10Bartosz Wójtowicz) [07:12:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:13:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:14:52] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:18:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P79046 and previous config saved to /var/cache/conftool/dbconfig/20250715-071813-marostegui.json [07:18:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:19:35] (03CR) 10Jelto: [C:03+1] "lgtm now!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 (owner: 10Volans) [07:20:59] !log installing rubygems security updates [07:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:26:50] (03PS3) 10Volans: Collab: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 [07:28:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:28:58] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P79047 and previous config saved to 
/var/cache/conftool/dbconfig/20250715-073322-marostegui.json [07:33:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:33:48] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:34:57] (03CR) 10Jelto: [C:03+2] Collab: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 (owner: 10Volans) [07:36:31] (03CR) 10Vgutierrez: [C:03+2] hiera: use the alt chain on half upload@magru for measure cert [puppet] - 10https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) (owner: 10Vgutierrez) [07:38:22] !log use GTS alt chain for the measure cert on cp[7013-7016] - T398596 [07:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:27] T398596: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596 [07:38:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:40:08] (03PS1) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [07:41:18] (03Merged) 10jenkins-bot: Collab: simplify Phabricator usage 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1167889 (owner: 10Volans) [07:42:00] (03CR) 10Vgutierrez: "this is no longer needed" [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: 10Fabfur) [07:43:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:35] (03Abandoned) 10Fabfur: varnish: pass WME HEAD reqs to pass for ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: 10Fabfur) [07:48:00] (03CR) 10Vgutierrez: [C:03+1] varnish: Implement translation analytics vars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [07:48:10] (03CR) 10Tryvix1509: [C:03+1] Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [07:48:27] (03CR) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [07:48:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79048 and previous config saved to /var/cache/conftool/dbconfig/20250715-074829-marostegui.json [07:48:34] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [07:48:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:48:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:48:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:48:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79049 and previous config saved to /var/cache/conftool/dbconfig/20250715-074851-marostegui.json [07:50:58] !log more Bird test on ganeti2034 & testvm2006 - T362392 [07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:02] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [07:53:15] (03PS4) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [07:53:43] FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:55:27] (03CR) 10CI reject: [V:04-1] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:56:00] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [07:58:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:58:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:58:51] (03PS5) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [07:58:58] FIRING: [13x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:57] (03CR) 10Elukey: [C:03+2] httpbb(liftwing): update edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou) [08:01:17] (03CR) 10CI reject: [V:04-1] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:01:33] (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 
10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:03:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:05:38] jouncebot: nowandnext [08:05:39] No deployments scheduled for the next 1 hour(s) and 54 minute(s) [08:05:39] In 1 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000) [08:08:25] (03PS6) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) [08:08:43] RESOLVED: [6x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:14:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79051 and previous config saved to /var/cache/conftool/dbconfig/20250715-081458-marostegui.json [08:15:03] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [08:23:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning [08:24:35] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:24:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:25:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:25:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:26:37] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad [08:27:08] !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bookworm [08:27:43] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:27:52] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:28:51] (03CR) 10Muehlenhoff: [C:03+2] No longer use mirrors.debian.org on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1160171 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff) [08:30:04] (03PS2) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [08:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P79054 and previous config saved to /var/cache/conftool/dbconfig/20250715-083006-marostegui.json [08:32:04] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deal with archival of Buster on Debian mirrors - https://phabricator.wikimedia.org/T397209#11003387 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Buster has been archived on the Debian mirrors last weekend and all fallout shoul... 
[08:33:31] PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [08:36:07] (03CR) 10Btullis: [C:03+2] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu) [08:36:25] ^ gitlab alert is expected, reimaging [08:37:26] (03PS1) 10Muehlenhoff: Stop using debug repository on Buster [puppet] - 10https://gerrit.wikimedia.org/r/1169610 (https://phabricator.wikimedia.org/T397209) [08:37:42] FIRING: [7x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:40:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [08:40:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169610 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff) [08:43:53] !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [08:45:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P79058 and previous config saved to /var/cache/conftool/dbconfig/20250715-084513-marostegui.json [08:46:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-For-Review: Move OpenSSH server config away from using a Puppet template - https://phabricator.wikimedia.org/T393762#11003445 (10MoritzMuehlenhoff) 05Open→03Resolved This is implemented for Trixie and later [08:47:30] 06SRE, 06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#11003447 (10MoritzMuehlenhoff) 05Open→03Resolved 
Trixie uses a forward port of Puppet 7 which gets correctly installed during d-i. [08:48:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#11003454 (10MoritzMuehlenhoff) [08:48:31] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage [08:52:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:53:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [08:53:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [08:54:54] !log Restart mariadb on pc1 T399540 [08:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:58] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [08:59:05] RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:00:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79061 and previous config saved to /var/cache/conftool/dbconfig/20250715-090021-marostegui.json [09:00:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:00:48] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:00:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T399249)', diff saved to https://phabricator.wikimedia.org/P79062 and previous config saved to /var/cache/conftool/dbconfig/20250715-090055-marostegui.json 
[09:01:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:01:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:02:42] FIRING: [7x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:03:33] (03CR) 10Btullis: [C:03+2] "Thanks Scott. I have done that now." [puppet] - 10https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) (owner: 10Btullis) [09:03:33] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6267/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [09:05:04] (03CR) 10Elukey: [V:03+1 C:03+2] Pyrra-filesystem: purge unmanaged files from config directory [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [09:05:06] (03CR) 10Elukey: [V:03+2 C:03+2] Pyrra-filesystem: purge unmanaged files from config directory [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [09:09:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning [09:09:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1185.eqiad.wmnet onto db1230.eqiad.wmnet [09:10:01] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[09:10:14] FIRING: [19x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:11:06] !log btullis@cumin1003 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [09:11:17] PROBLEM - TFTP service on install2004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [09:11:34] (03PS1) 10Marostegui: instances.yaml: Remove db1246 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1169612 (https://phabricator.wikimedia.org/T399449) [09:12:08] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bookworm [09:12:20] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db1246 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1169612 (https://phabricator.wikimedia.org/T399449) (owner: 10Marostegui) [09:13:26] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:13:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1246 T399449', diff saved to https://phabricator.wikimedia.org/P79068 and previous config saved to /var/cache/conftool/dbconfig/20250715-091328-marostegui.json [09:13:34] T399449: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449 [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:40] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:15:20] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#11003583 (10BTullis) 05Open→03Resolved This is now done... [09:16:23] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:16:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:17:24] !log Restart mariadb on pc2 T399540 [09:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:27] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [09:17:31] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:18:35] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:18:48] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [09:19:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:19:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:19:52] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
[09:19:54] (03PS4) 10Muehlenhoff: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [09:20:36] (03CR) 10CI reject: [V:04-1] admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [09:20:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79072 and previous config saved to /var/cache/conftool/dbconfig/20250715-092050-root.json [09:21:36] (03PS1) 10Marostegui: db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169613 (https://phabricator.wikimedia.org/T398928) [09:21:40] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [09:22:33] (03CR) 10Marostegui: [C:03+2] db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169613 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [09:22:42] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:39] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:25:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399249)', diff saved to https://phabricator.wikimedia.org/P79073 and previous config saved to /var/cache/conftool/dbconfig/20250715-092551-marostegui.json [09:25:55] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:27:16] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[09:28:15] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [09:28:31] (03PS3) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [09:29:06] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:30:16] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:30:21] (03PS1) 10Marostegui: db1258: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169614 (https://phabricator.wikimedia.org/T399298) [09:30:53] (03CR) 10Marostegui: [C:03+2] db1258: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169614 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [09:31:45] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:31:57] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1258.eqiad.wmnet with reason: Maintenance [09:32:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1258 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79074 and previous config saved to /var/cache/conftool/dbconfig/20250715-093200-marostegui.json [09:33:06] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:34:52] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:35:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79076 and previous config saved to /var/cache/conftool/dbconfig/20250715-093556-root.json [09:36:33] (03CR) 10Btullis: "Happy in principle with this, when the CI passes." [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [09:36:59] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:37:12] (03CR) 10Btullis: admin: Remove platform-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [09:37:43] 10ops-magru: Power Supply - Status - issue on dns7002:9290 - https://phabricator.wikimedia.org/T399549 (10phaultfinder) 03NEW [09:38:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:38:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:38:52] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:39:01] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [09:39:02] !log Restart mariadb on pc3 T399540 [09:39:04] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [09:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:06] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540 [09:39:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79079 and previous config saved to /var/cache/conftool/dbconfig/20250715-093943-root.json [09:40:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P79080 and previous config saved to /var/cache/conftool/dbconfig/20250715-094058-marostegui.json [09:41:46] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:41:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:42:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. 
[09:43:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:44:39] RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms [09:44:51] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [09:44:51] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING WARNING - Packet loss = 50%, RTA = 123.54 ms [09:46:45] (03PS5) 10Muehlenhoff: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [09:47:20] RECOVERY - Host mr1-magru IPv6 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms [09:47:31] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:47:42] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:47:42] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 123.46 ms [09:47:55] !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache [09:48:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [09:48:26] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:48:35] RESOLVED: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:48:53] RESOLVED: 
KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79082 and previous config saved to /var/cache/conftool/dbconfig/20250715-095101-root.json [09:51:06] !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad [09:54:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79083 and previous config saved to /var/cache/conftool/dbconfig/20250715-095449-root.json [09:54:58] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:55:47] (03CR) 10Arnaudb: [C:03+1] "lgtm, ping me when you need me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1168619 (owner: 10Hashar) [09:56:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P79084 and previous config saved to /var/cache/conftool/dbconfig/20250715-095605-marostegui.json [09:56:13] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [09:56:48] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [09:57:46] (03PS1) 10Muehlenhoff: Record LDAP access for mszwarc [puppet] - 10https://gerrit.wikimedia.org/r/1169616 [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000) [10:02:30] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for mszwarc [puppet] - 10https://gerrit.wikimedia.org/r/1169616 (owner: 10Muehlenhoff) [10:03:35] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:04:41] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:04:58] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:05:26] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:05:33] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:05:46] !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [10:06:04] !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [10:06:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79085 and previous config saved to /var/cache/conftool/dbconfig/20250715-100607-root.json [10:06:34] (03CR) 10Muehlenhoff: [C:03+2] icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff) [10:09:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79086 and previous config saved to /var/cache/conftool/dbconfig/20250715-100955-root.json [10:11:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399249)', diff saved to 
https://phabricator.wikimedia.org/P79087 and previous config saved to /var/cache/conftool/dbconfig/20250715-101113-marostegui.json [10:11:18] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:11:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:11:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79088 and previous config saved to /var/cache/conftool/dbconfig/20250715-101135-marostegui.json [10:11:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#11003801 (10ayounsi) Netbox is unfortunately not made to track inventory items (as in on a shelf). There are some plugins tha... [10:12:54] (03CR) 10Ayounsi: [C:03+2] magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi) [10:13:30] (03Merged) 10jenkins-bot: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi) [10:14:10] (03PS1) 10Effie Mouzeli: mcrouter: assign pods a higher priority class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169620 (https://phabricator.wikimedia.org/T397683) [10:15:26] (03CR) 10Effie Mouzeli: [C:03+2] profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli) [10:16:55] (03CR) 10Effie Mouzeli: [C:03+2] hieradata: migrate memcached gutter pool to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli) [10:17:23] !log magru: setup BGP to Ufinet - T389767 [10:17:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:30] jouncebot: nowandnext [10:17:30] For the next 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000) [10:17:30] In 1 hour(s) and 42 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200) [10:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [10:18:40] (03Merged) 10jenkins-bot: Configure Special:CreateAccount instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [10:19:22] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1167896|Configure Special:CreateAccount instrument (T394744)]] [10:19:27] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [10:19:52] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:19:56] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2005.codfw.wmnet [10:20:07] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [10:22:05] !log installing debian-archive-keyring updates from Bookworm 
point release [10:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] (03PS1) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) [10:22:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:23:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [10:23:25] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1167896|Configure Special:CreateAccount instrument (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:24:52] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:25:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79090 and previous config saved to /var/cache/conftool/dbconfig/20250715-102500-root.json [10:26:07] (03CR) 10Dragoniez: Create "abusefilter" editor user group for Vietnamese Wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [10:26:41] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2005.codfw.wmnet [10:26:44] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [10:28:01] !log mszabo@deploy1003 
Sync cancelled. [10:28:14] we'll be back after a commercial break [10:28:31] (03CR) 10Muehlenhoff: [C:03+2] admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle) [10:30:39] (03PS1) 10Máté Szabó: Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) [10:31:16] (03CR) 10Kosta Harlan: [C:03+1] Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [10:31:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [10:32:30] (03Merged) 10jenkins-bot: Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó) [10:32:49] (03PS2) 10Muehlenhoff: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) [10:32:50] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11003846 (10elukey) @Mvolz hi! I added the success-ratio SLO, but the error budget looks not ok so I'd need your help to figure out what I am doin... 
[10:32:51] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]] [10:32:55] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [10:33:19] (03CR) 10Muehlenhoff: "With https://phabricator.wikimedia.org/T390139 resolved, this is ready for review again" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff) [10:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79093 and previous config saved to /var/cache/conftool/dbconfig/20250715-103641-marostegui.json [10:36:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [10:36:48] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:38:09] (03PS1) 10Muehlenhoff: Deprecate dumpsdata-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1169627 [10:38:34] (03CR) 10Alexandros Kosiaris: "Fine by me, I 'll split in 2 patches, 1 to fix data.yaml and 1 to fully remove the files (and we piggyback on that one the rest of the cha" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris) [10:39:33] (03CR) 10Muehlenhoff: "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris) [10:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:40:53] !log mszabo@deploy1003 mszabo: Continuing with sync [10:45:01] (03CR) 10Alexandros Kosiaris: "Fair enough. I 'll merge this patch then with the one removing hiera files which would be removing all old maps nodes stuff. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1169221 (owner: 10Alexandros Kosiaris) [10:46:05] jouncebot: nowandnext [10:46:05] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000) [10:46:05] In 1 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200) [10:46:45] (03PS4) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) [10:47:42] RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:48:11] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]] (duration: 15m 19s) [10:48:16] T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744 [10:49:46] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11003919 (10elukey) Maybe we are counting also the Zotero's calls? If so I'd suggest to exclude them, since IIUC Citoid calls Zotero, but from the... 
[10:51:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P79096 and previous config saved to /var/cache/conftool/dbconfig/20250715-105148-marostegui.json [10:52:25] (03CR) 10Dragoniez: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [10:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:56:00] (03PS1) 10Muehlenhoff: Remove ldap-admins from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1169633 [11:01:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#11003958 (10MoritzMuehlenhoff) [11:02:57] (03PS1) 10Zabe: Set categorylinks to read new on jawiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912) [11:03:23] !log fceratto@cumin1002 dbctl commit (dc=eqiad): 'Configure db1259', diff saved to https://phabricator.wikimedia.org/P79097 and previous config saved to /var/cache/conftool/dbconfig/20250715-110322-fceratto.json [11:04:59] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#11003962 (10MoritzMuehlenhoff) 05Open→03Invalid >>! In T396660#10967608, @Jclark-ctr wrote: > @MoritzMuehlenhoff is this still an issue could you verify again and we can try a di... 
[11:06:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P79098 and previous config saved to /var/cache/conftool/dbconfig/20250715-110655-marostegui.json [11:07:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:07:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [11:07:57] (03Merged) 10jenkins-bot: Set categorylinks to read new on jawiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [11:08:20] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]] [11:08:25] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [11:08:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560 (10Vgutierrez) 03NEW [11:10:26] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:10:58] (03PS2) 10Alexandros Kosiaris: mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565) [11:11:00] (03PS2) 10Alexandros Kosiaris: deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565) [11:11:02] (03PS2) 10Alexandros Kosiaris: admin: Empty out kartotherian-admin [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565) [11:11:04] (03CR) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:11:04] (03PS2) 10Alexandros Kosiaris: admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565) [11:11:06] (03PS2) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) [11:11:07] (03PS2) 10Alexandros Kosiaris: DNM: tilerator: Remove as much as possible of the last cruft [puppet] - 10https://gerrit.wikimedia.org/r/1169223 (https://phabricator.wikimedia.org/T381565) [11:11:09] (03PS1) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565) [11:11:27] !log zabe@deploy1003 zabe: Continuing with sync [11:11:42] (03CR) 10Effie Mouzeli: [C:03+2] k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [11:11:56] (03CR) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169118 
(https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [11:13:47] (03PS2) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565) [11:14:18] (03Abandoned) 10Alexandros Kosiaris: maps: Cleanup DB grants, add tegola, prep tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169221 (owner: 10Alexandros Kosiaris) [11:15:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [11:17:07] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]] (duration: 08m 46s) [11:17:11] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [11:18:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:18:35] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:20:00] (03CR) 10Alexandros Kosiaris: [C:03+2] mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:20:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:20:09] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) 
[11:20:18] (03CR) 10Alexandros Kosiaris: [C:03+2] admin: Empty out kartotherian-admin [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:21:44] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:22:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79099 and previous config saved to /var/cache/conftool/dbconfig/20250715-112202-marostegui.json [11:22:08] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:22:18] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:22:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79100 and previous config saved to /var/cache/conftool/dbconfig/20250715-112225-marostegui.json [11:24:03] (03CR) 10Alexandros Kosiaris: [C:03+2] admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:26:43] !log restart atftp daemon @ install2004, it had crashed [11:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:45] (03PS3) 10Dreamy Jazz: mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) [11:26:59] RECOVERY - TFTP service on install2004 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd [11:27:04] (03CR) 10Muehlenhoff: maps: Add tegola user in DB, mark tilerator for removal (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [11:27:18] (03CR) 10Ladsgroup: [C:03+2] mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: 10Dreamy Jazz) [11:27:22] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: 10Dreamy Jazz) [11:28:33] moritzm: not filing a task because it is not an issue atm nor I think it requires further action, but FYI but atftpd crashed in close times on both install1004 and instal2004, on the last it didn't restart correctly back [11:34:25] ok, those will be upgraded to bookworm in the next months anyway, and that version will have a systemd-socket-activated atftpd [11:34:40] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [11:34:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:34:46] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2006.codfw.wmnet [11:41:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [11:41:20] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2006.codfw.wmnet [11:44:12] (03PS1) 10Jcrespo: admin: Add new systemctl alias and update $? 
output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640 [11:45:39] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet [11:48:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79101 and previous config saved to /var/cache/conftool/dbconfig/20250715-114833-marostegui.json [11:48:38] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [11:50:09] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16347 [11:50:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347 [11:51:34] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 36351 [11:51:59] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet [11:56:27] ayounsi@cumin1002 peering (PID 1896552) is awaiting input [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200) [12:03:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P79102 and previous config saved to /var/cache/conftool/dbconfig/20250715-120340-marostegui.json [12:06:17] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11004180 (10Jhancock.wm) [12:08:00] (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168214 (owner: 10PipelineBot) [12:10:02] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168214 (owner: 10PipelineBot) [12:14:18] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply 
[12:14:26] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11004195 (10elukey) @DLynch Hi! I have a couple of questions for you: * This is a preview of the metrics, https://w.wiki/EjUp, coul... [12:14:49] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:15:28] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:16:16] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:16:26] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:17:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36351 [12:17:14] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:18:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P79105 and previous config saved to /var/cache/conftool/dbconfig/20250715-121849-marostegui.json [12:23:41] !log update AS14907 RIPE import/export policies [12:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:56] (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [12:29:35] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399571 (10phaultfinder) 03NEW [12:29:36] 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399570 (10phaultfinder) 03NEW [12:33:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 139009 [12:33:58] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79108 and previous config saved to /var/cache/conftool/dbconfig/20250715-123357-marostegui.json [12:34:02] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [12:34:07] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:34:13] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance [12:34:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 139009 [12:34:24] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399573 (10phaultfinder) 03NEW [12:34:27] 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399572 (10phaultfinder) 03NEW [12:38:09] 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#11004255 (10Jclark-ctr) Thanks for Verifying [12:44:47] 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399573#11004280 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable. iDRAC now shows as healthy. Updated iDRAC firmware while logged in. [12:45:09] 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399571#11004284 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable. iDRAC now shows as healthy. Updated iDRAC firmware while logged in. 
[12:51:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove weight from the master T395771', diff saved to https://phabricator.wikimedia.org/P79109 and previous config saved to /var/cache/conftool/dbconfig/20250715-125157-marostegui.json [12:52:02] T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771 [12:54:07] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:57:02] ^^ I'll have a look, might be my fault [12:57:03] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991#11004304 (10ayounsi) 05Open→03Resolved [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:17] tappof: nah, it's a recurring thing caused by a lot of Hadoop workers being decommed [13:00:43] o/ [13:00:59] moritzm: Well, thank you! 
the timing was a bit suspicious :) [13:01:05] yes, I mentioned it to the team before, they are aware [13:01:08] nothing to deploy (disappointing – means I can’t test T399462 being fixed ^^) [13:01:09] T399462: SpiderPig live job log view (terminal / console) sometimes freezes - https://phabricator.wikimedia.org/T399462 [13:01:26] it just goes under the threshold for the alert [13:02:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:07:16] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004320 (10KOfori) Approved. [13:11:21] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11004337 (10elukey) Just sent the email to Willy explaining the issue, fingers crossed to get some help from Dell :) [13:14:07] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance [13:14:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79110 and previous config saved to /var/cache/conftool/dbconfig/20250715-131450-marostegui.json [13:14:54] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:17:46] (03CR) 10Fabfur: cache::haproxy: Provide X-Trusted-Request score (033 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [13:23:10] (03CR) 10Alexandros Kosiaris: [C:04-1] mcrouter: assign pods a higher priority class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169620 (https://phabricator.wikimedia.org/T397683) (owner: 10Effie Mouzeli) [13:25:42] hello hello, is someone still around to help deploy a configuration change? [13:29:56] !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 [13:31:40] !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 01m 43s) [13:31:44] !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 [13:32:37] !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 00m 53s) [13:32:57] !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 [13:32:58] (03CR) 10Muehlenhoff: "Adding Scott as reviewer" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [13:33:32] !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 00m 35s) [13:36:24] (03PS1) 10Zabe: BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579) [13:37:05] (03PS1) 10Brennen Bearnes: phabricator deployment: skip storage upgrade during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1169654 (https://phabricator.wikimedia.org/T370266) 
[13:37:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79111 and previous config saved to /var/cache/conftool/dbconfig/20250715-133712-marostegui.json [13:37:16] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [13:37:19] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [13:41:14] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:41:42] (03PS2) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) [13:41:51] (03CR) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [13:43:43] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11004524 (10Jclark-ctr) @eevans @VRiley-WMF {F64589137} KN09N7919I0709R1S serial looks like it was in slot 1 not 0 according to Hardware Inventory in idrac [13:43:50] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:44:41] (03PS1) 10Hashar: Revert "Gerrit: Set cache for groups" [puppet] - 10https://gerrit.wikimedia.org/r/1169658 [13:46:18] (03PS2) 10Hashar: Revert "Gerrit: Set cache for groups" [puppet] - 10https://gerrit.wikimedia.org/r/1169658 [13:47:05] (03CR) 10Fabfur: cache::haproxy: Provide X-Trusted-Request score (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [13:48:24] (03PS1) 10Vgutierrez: site: Remove cp[5013-5016] entries [puppet] - 
10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830) [13:49:58] (03CR) 10Fabfur: [C:03+1] site: Remove cp[5013-5016] entries [puppet] - 10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830) (owner: 10Vgutierrez) [13:51:09] (03CR) 10Vgutierrez: [C:03+2] site: Remove cp[5013-5016] entries [puppet] - 10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830) (owner: 10Vgutierrez) [13:52:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P79112 and previous config saved to /var/cache/conftool/dbconfig/20250715-135219-marostegui.json [13:55:02] (03PS2) 10Scott French: configcluster.yaml - remove eventlogging from profile::etcd::tlsproxy::acls [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [13:55:14] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [13:55:56] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [13:56:35] (03PS1) 10Hashar: gerrit: remove GWT-only theme configuration [puppet] - 10https://gerrit.wikimedia.org/r/1169660 [13:57:11] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399570#11004556 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power supply failed. server out of warranty. replaced with one from a decommed server. [13:57:42] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399572#11004560 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power supply failed. server out of warranty. replaced with one from a decommed server. 
[13:58:20] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11004564 (10Jhancock.wm) @klausman I have a few servers of yours in codfw that need this updated. The PXE settings need to be updated. It shouldn't cause a reboot to reset the pxe, but if anyt... [13:59:08] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:03:30] (03CR) 10Scott French: [C:03+1] "Thanks, Andrew! There should be no harm in cleaning this up, and better to get rid of it to avoid future confusion." [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:04:07] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11004586 (10elukey) For future notes, these are the BIOS's Attributes: ` {'ACPICSTC2Latency': 800, 'ACPISRATL3CacheAsNUMADomain': 'Auto', 'ACSEnable': 'Auto', 'APBD... 
[14:05:00] (03CR) 10Tryvix1509: [C:03+1] Create "abusefilter" editor user group for Vietnamese Wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [14:05:59] (03CR) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509) [14:06:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [14:06:48] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11004594 (10Eevans) >>! In T396970#11004524, @Jclark-ctr wrote: > @eevans @VRiley-WMF {F64589137} > > KN09N7919I0709R1S serial looks like it was in slot 1 not 0 according to Hardware Inventory in idrac >... [14:06:50] !log reprepro include php8.3_8.3.23-1+wmf11u2 in component/php83 - T398245 [14:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:56] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245 [14:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P79113 and previous config saved to /var/cache/conftool/dbconfig/20250715-140726-marostegui.json [14:08:34] (03PS1) 10Ayounsi: WIP: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) [14:12:02] (03PS3) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) [14:14:11] (03CR) 10Elukey: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros 
Kosiaris) [14:15:36] (03PS1) 10Ayounsi: Routed ganeti: disable IPv4 ICMP redirects [puppet] - 10https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392) [14:19:15] (03PS1) 10Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) [14:20:46] (03CR) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [14:20:51] (03CR) 10Ssingh: [C:03+2] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [14:21:40] (03PS3) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) [14:22:17] !log reprepro include php8.1_8.1.33-1+wmf11u1 in component/php81 [14:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:25] (03CR) 10CI reject: [V:04-1] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [14:22:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79114 and previous config saved to /var/cache/conftool/dbconfig/20250715-142234-marostegui.json [14:22:40] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:22:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:52] !log 
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:22:58] (03CR) 10Andrew Bogott: [C:03+2] Neutron: include a python dependency for wmcs-netns-events [puppet] - 10https://gerrit.wikimedia.org/r/1168648 (owner: 10Andrew Bogott) [14:23:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [14:25:14] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [14:25:41] (03PS4) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) [14:26:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004652 (10ssingh) [14:28:25] (03CR) 10Ssingh: "Rebased, no code change." [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [14:28:42] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11004658 (10Jhancock.wm) (not trying to rush, just making sure i didn't miss something) Is there anything I can help with on this one? 
[14:29:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1430) [14:30:28] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:32:38] (03CR) 10Ssingh: [C:03+2] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh) [14:33:00] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:04] (03CR) 10Zabe: [C:03+2] BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [14:33:23] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:33:36] (03PS1) 10Tiziano Fogli: prom/metamonitor: add CNAMEs for metamonitoring endpoints [dns] - 10https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003) [14:33:50] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:34:22] (03Merged) 10jenkins-bot: BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [14:35:45] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, I think this setting is safe in any way we operate the hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:36:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:36:35] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [14:37:24] (03PS4) 10Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) [14:37:37] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11004701 (10ssingh) Expiry has been sent to end of FY (June 2026) and contact has been set to Suman to get this request going. We... [14:37:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: 10Dreamy Jazz) [14:38:39] (03PS1) 10Ssingh: admin: add vgutierrez to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560) [14:39:24] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560) (owner: 10Ssingh) [14:39:32] (03CR) 10Ssingh: [C:03+2] admin: add vgutierrez to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560) (owner: 10Ssingh) [14:39:47] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004722 (10Jhancock.wm) @elukey looks like this server 
and the one in T396365 are having this same issue with the provisioning script. they're both the 1 CPU test servers from su... [14:40:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004725 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1002:~$ sudo manage_principals.py create vgutierrez --email_address=vgutierrez@wi... [14:40:50] (03Abandoned) 10Herron: thanos: add recording rules for varnish SLO [puppet] - 10https://gerrit.wikimedia.org/r/740209 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [14:41:04] (03Abandoned) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009) (owner: 10Herron) [14:41:12] (03Abandoned) 10Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron) [14:43:13] 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004737 (10MoritzMuehlenhoff) [14:43:14] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004736 (10elukey) @Jhancock.wm Interesting! The absence of Console Redirection is new... Did you find anything in the BIOS about the console redirection by any chance? [14:43:54] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004751 (10Jhancock.wm) I have not. I can take a closer look this afternoon. 
[14:44:07] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:45:19] 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004755 (10ssingh) ` sukhe@krb1002:~$ sudo manage_principals.py reset-password htriedman --email_address=htriedman-ctr@wikimedia.org Password reset successfully. Successfully sent... [14:48:21] (03PS1) 10Effie Mouzeli: dsh.yaml: removed conftool entries for testservers [puppet] - 10https://gerrit.wikimedia.org/r/1169673 [14:50:11] (03CR) 10Cathal Mooney: WIP: Ganeti Bird BGP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi) [14:53:16] (03PS1) 10Btullis: Bump hive metastore heap to support the refine migration [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845) [14:54:39] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6275/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [14:55:07] (03CR) 10Fabfur: [C:03+1] cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [14:56:15] !log dancy@deploy1003 Installing scap version "4.188.2" for 1 host(s) [14:56:48] (03PS2) 10Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) [14:57:10] !log dancy@deploy1003 Installation of scap version "4.188.2" completed for 1 hosts [14:58:00] (03CR) 10Vgutierrez: 
"text tests are happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez) [14:58:32] (03CR) 10Btullis: [V:03+1 C:03+2] Bump hive metastore heap to support the refine migration [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [15:00:04] jelto, arnoldokoth, and mutante: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1500). [15:00:37] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet with reason: Phorge upgrade [15:02:05] !log stop replica @ db1217:m3, db2160:m3 T370266 [15:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266 [15:03:26] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1004.eqiad.wmnet with reason: version upgrade [15:03:53] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2002.codfw.wmnet with reason: version upgrade [15:05:42] (03CR) 10Dzahn: [C:03+2] phabricator deployment: skip storage upgrade during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1169654 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:06] 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - 
https://phabricator.wikimedia.org/T398501#11004989 (10Htriedman) This seems to have worked! Thank you for the lightning-fast response time :) [15:09:01] !log brennen@deploy1003 Started deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266 [15:09:05] T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266 [15:09:07] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:39] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266 (duration: 00m 38s) [15:10:51] 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11005013 (10ssingh) 05Open→03Resolved a:03ssingh [15:11:38] !log phabricator version upgrade in progress - expect short downtime [15:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:06] (03PS1) 10Btullis: Fail over hive services to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1169683 (https://phabricator.wikimedia.org/T369845) [15:12:14] !log brennen@deploy1003 Started deploy [phabricator/deployment@ed8270c]: deploy phab1004 for T370266 [15:12:44] !log brennen@deploy1003 Finished deploy [phabricator/deployment@ed8270c]: deploy phab1004 for T370266 (duration: 00m 30s) [15:13:05] (03CR) 10Btullis: [C:03+2] Fail over hive services to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1169683 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [15:13:18] !log btullis@dns1004 START - running authdns-update [15:14:12] !log btullis@dns1004 END - running authdns-update [15:14:50] andrew@cumin2002 reimage (PID 3814243) is awaiting input [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in 
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:10] (03PS1) 10Elukey: pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686 [15:18:20] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6276/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey) [15:19:40] (03CR) 10Herron: [C:03+1] pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey) [15:20:20] (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey) [15:28:36] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [15:29:05] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [15:36:38] PROBLEM - snapshot of s3 in codfw on backupmon1001 is CRITICAL: Last snapshot for s3 at codfw (db2239) taken on 2025-07-14 08:26:30 is 1197 GiB, but the previous one was 1999 GiB, a change of -40.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:36:41] (03PS1) 10Brennen Bearnes: Revert "phabricator deployment: skip storage upgrade during deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266) [15:36:57] (03CR) 10Dzahn: [C:03+2] Revert "phabricator deployment: skip storage upgrade during deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes) [15:36:59] (03CR) 10Dzahn: [V:03+2 C:03+2] Revert "phabricator deployment: skip storage upgrade during deploy" 
[puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes) [15:38:52] (03PS1) 10Ebernhardson: Repoint oss.sonatype.org to repo1.maven.org [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169693 [15:38:52] (03PS1) 10Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162) [15:39:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:42:10] !log phabricator version upgrade finished [15:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:50] \o/ [15:45:04] (03PS1) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 [15:46:21] andrew@cumin2002 reimage (PID 3847355) is awaiting input [15:46:58] !log start replica @ db1217:m3, db2160:m3 T370266 [15:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:04] T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266 [15:47:18] Hey, how do we check whether something got deployed? This was +2ed but I wasn't around to check it on mwdebug... https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164179 would it have gone out automatically the last time we did a config change, or can I just put it on a deployment window at some point? 
[15:48:52] (03PS1) 10Btullis: Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 [15:49:14] (03PS2) 10Btullis: Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845) [15:51:15] Mvolz: in mediawiki-config, changes are only merged when they then get deployed. And since your change did not get reverted, it should be live. [15:51:23] (03PS2) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 [15:51:47] (03CR) 10Btullis: [C:03+2] Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [15:51:58] (03CR) 10BCornwall: [C:03+1] Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [15:52:02] !log btullis@dns1004 START - running authdns-update [15:52:13] You can always take a look at srv/mediawiki/ on deploy1003.eqiad.wmnet to see what the currently live code is [15:52:54] !log btullis@dns1004 END - running authdns-update [15:55:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005152 (10BCornwall) 05Resolved→03Open Ah, @VRiley-WMF, it seems that connectivity is no longer through the Mellanox card: ` [ 9.128067] mlx5_core 0000:3b:00.0: Port module event: module 0...
[15:55:57] (03PS2) 10Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162) [15:55:57] (03PS3) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 [16:00:01] (03PS6) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) [16:00:05] jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:14] (03CR) 10BCornwall: varnish: Implement translation analytics vars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [16:00:34] (03PS4) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 [16:03:32] (03CR) 10Ebernhardson: "recheck" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 (owner: 10Ebernhardson) [16:03:42] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:03:52] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:36] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.126 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:42] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.134 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:08:16] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11005227 (10VRiley-WMF) Swapped both of the failed SSDs with spares. Will wait for the reimage. [16:10:36] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet [16:10:38] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet [16:17:53] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11005301 (10Mvolz) I think it's less likely it's miscalculated and more likely it's just bad. Does it seem really very different from https://graf... [16:18:57] (03Abandoned) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164703 (owner: 10Pppery) [16:19:45] (03CR) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris) [16:21:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [16:22:00] (03CR) 10Aklapper: "Yeah, sorry - moving targets and priorities :(" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164703 (owner: 10Pppery) [16:22:21] (03CR) 10Mvolz: "Usually config changes don't get +2ed unless they are ready to deploy so they can be tested on mwdebug before deploying during a deployment wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [16:22:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 15 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae) [16:25:19] PROBLEM - mysqld processes #page on es1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:25:19] PROBLEM - MariaDB read only es1 on es1032 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:25:52] looking [16:26:01] I am around if you need help [16:26:04] hmm [16:26:50] here [16:27:00] trying to connect [16:27:18] cwhite: I see you logged in, are you making any changes? [16:27:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005406 (10VRiley-WMF) Originally put the cable into the onboard port. Once it was able to reimage, that's when I just moved it over. It should be all set now. [16:27:41] no, still investigating [16:27:59] acking to prevent escalation to batphone... for now [16:28:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:28:59] ● wmf_auto_restart_prometheus-mysqld-exporter.service loaded failed failed [16:29:07] this is the only failed unit I see [16:29:11] not mysqld itself? [16:29:19] I don't see this host in orchestrator [16:29:20] that's a downtime expiration: 1d 0h 1m 57s [16:29:26] ah! [16:29:41] what is a reasonable time frame to extend it? [16:29:44] few more days? 
[16:29:44] my guess, I don't know [16:30:06] double check it is depooled [16:30:22] yes [16:30:40] then no emergency [16:30:49] actually I don't see any systemd service here called mysql or maria [16:30:51] ok [16:31:05] yet it's in es1 [16:31:36] mysql has been down there for at least 24 hours [16:32:27] searching for tickets [16:32:50] https://phabricator.wikimedia.org/P75467 [16:33:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:33:18] Apr 28 2025 [16:33:45] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm [16:33:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm [16:34:12] https://phabricator.wikimedia.org/T391921 is closed but I left a comment there [16:34:43] jynus: you saw that date somewhere outside the paste bin above? then it matches [16:36:28] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es1032.eqiad.wmnet with reason: T391921 [16:36:31] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921 [16:36:39] !log downtiming es1032 for 3 days - expired downtime for T391921? [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:42] mariadb-common is 10.11.11 on that host, like what that ticket was about [16:39:03] alright, with a new downtime and a comment on that ticket.. and no emergency.. 
I will declare it no incident [16:40:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [16:44:05] (03PS1) 10Pppery: Update source strings to 2024.35 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1169700 (https://phabricator.wikimedia.org/T399604) [16:44:17] 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11005556 (10Mvolz) Here's an example: on July 14 from 12:41 to 12:42 we received 63 requests for www.espncricinfo.com which all failed. (403 forbi... [16:45:44] (03CR) 10Pppery: "Cc abijeet for awareness; this is going to require a non-trivial amount of manual rename processing on the translatewiki side. Less than t" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1169700 (https://phabricator.wikimedia.org/T399604) (owner: 10Pppery) [16:49:06] (03CR) 10Scott French: [V:03+2] "Built locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [16:49:34] (03CR) 10Scott French: [V:03+2] "Thank you both for the reviews!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [16:49:35] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up new php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French) [16:53:15] FYI, please refrain from starting any new mediawiki deployments, as I'll be deploying at the top of the hour to pick up a new production image [16:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:55:07] (03PS1) 10Peter Fischer: Bump flink to 1.20.1. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159) [16:57:43] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1017 [16:57:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:58:11] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017 [16:59:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'update es1032', diff saved to https://phabricator.wikimedia.org/P79117 and previous config saved to /var/cache/conftool/dbconfig/20250715-165930-fceratto.json [17:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1700). 
[17:00:14] o/ [17:01:09] !log swfrench@deploy1003 Started scap sync-world: Rebuild to pick up new php8.1 production image [17:02:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:04:12] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [17:04:18] RECOVERY - mysqld processes #page on es1032 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:04:37] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm [17:04:49] RECOVERY - MariaDB read only es1 on es1032 is OK: Version 10.11.13-MariaDB-log, Uptime 38s, read_only: True, event_scheduler: True, 6.94 QPS, connection latency: 0.034040s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [17:04:55] !log brett@cumin2002 START - Cookbook sre.dns.netbox [17:07:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Pooling in after update es1032', diff saved to https://phabricator.wikimedia.org/P79118 and previous config saved to /var/cache/conftool/dbconfig/20250715-170724-fceratto.json [17:07:32] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:09:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set es1032 back as master', diff saved to https://phabricator.wikimedia.org/P79119 and previous config saved to /var/cache/conftool/dbconfig/20250715-170919-fceratto.json [17:09:44] 
!log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1017
[17:09:56] !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017
[17:14:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:30] !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:22:54] brett@cumin2002 provision (PID 3897999) is awaiting input
[17:24:24] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:26:00] andrew@cumin2002 reimage (PID 3893701) is awaiting input
[17:34:19] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm
[17:34:26] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm
[17:34:53] !log swfrench@deploy1003 Finished scap sync-world: Rebuild to pick up new php8.1 production image (duration: 34m 16s)
[17:49:25] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bookworm
[17:49:37] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm executed with errors: - lvs1017 (**FAIL**) - Downtimed...
[17:51:32] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[17:54:42] jouncebot: nowandnext
[17:54:42] For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1700)
[17:54:42] In 0 hour(s) and 5 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1800)
[17:54:43] (PS5) Ssingh: add start of recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (owner: CDobbins)
[17:54:43] (CR) Ssingh: "Nice first attempt!" [puppet] - https://gerrit.wikimedia.org/r/1169156 (owner: CDobbins)
[17:55:09] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[17:55:15] (CR) Ssingh: add start of recursor.yml.erb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1169156 (owner: CDobbins)
[17:56:06] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11005916 (ssingh) Open→Resolved a:ssingh Things look fine so marking as resolved; please re-open if there are an...
[17:58:44] !log swfrench@deploy1003 Started scap sync-world: Stop building buster-based webserver flavour images - T378128
[17:58:49] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:00:04] dancy and andre: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version .
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1800).
[18:00:19] o/
[18:00:37] (CR) Bking: [C:+2] Bump flink to 1.20.1. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159) (owner: Peter Fischer)
[18:00:48] (CR) Bking: [V:+2 C:+2] Bump flink to 1.20.1. [docker-images/production-images] - https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159) (owner: Peter Fischer)
[18:01:00] dancy: my deploy should wrap up momentarily. appears to have worked as expected.
[18:01:05] !log swfrench@deploy1003 Finished scap sync-world: Stop building buster-based webserver flavour images - T378128 (duration: 02m 21s)
[18:01:47] (CR) Ssingh: [C:+1] "Hi folks: Checking if you want Traffic to merge this? Happy to but asking in case you are waiting for something." [dns] - https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) (owner: Slyngshede)
[18:02:13] dancy: I should be out of your way now
[18:03:08] Thanks. Running the train via spiderpig today
[18:03:40] uh
[18:04:13] oh Andre, do you want to press the button?
[18:04:25] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:54] dancy: ehehe I'm already a bit braindead today (Phab deploy) but maybe tomorrow?
[18:05:43] haha ok
[18:06:39] (PS1) TrainBranchBot: group0 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180)
[18:06:40] (CR) TrainBranchBot: [C:+2] group0 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[18:07:35] (Merged) jenkins-bot: group0 to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[18:08:27] (CR) Ssingh: [C:+1] "@krinkle@fastmail.com: I am going to merge this chain after code review. Any concerns with that? I know they are cherry-picked to beta but" [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:08:34] (CR) Ssingh: [C:+1] beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle)
[18:09:57] (CR) Ssingh: [C:+1] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:10:15] (CR) Krinkle: "Sounds good to me."
[puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:10:24] !log bking@build2001 /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*flink*' T398159
[18:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:35] T398159: SUP: Use flink 1.20.1 - https://phabricator.wikimedia.org/T398159
[18:11:01] (CR) Ssingh: [C:+1] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:11:06] (CR) Ssingh: [C:+2] deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:11:14] (CR) Ssingh: [C:+2] beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: Krinkle)
[18:11:59] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bookworm
[18:12:06] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005979 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm completed: - lvs1017 (**PASS**) - Removed from Puppet...
[18:12:41] (CR) Ssingh: [C:+2] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:12:52] (PS3) Krinkle: beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318)
[18:15:05] (CR) Ssingh: [C:+2] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:15:52] (PS2) Krinkle: beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318)
[18:16:28] (CR) Ssingh: [C:+2] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:16:35] ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005982 (BCornwall) Open→Resolved The link was re-connected to the Mellanox card; We then reconfigured the interface with: ` $ sudo -i cookbook sre.dns.netbox -t T387145 'update lvs1...
[18:16:49] (CR) Ssingh: [C:+1] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:18:28] (CR) Ssingh: [C:+2] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:18:38] (CR) Ssingh: [C:+2] "Chain merged, thanks for the patches!"
[puppet] - https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[18:19:05] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.10 refs T392180
[18:19:11] T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180
[18:19:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:28:45] (PS1) Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1169723 (https://phabricator.wikimedia.org/T399619)
[18:28:50] (PS1) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1169725 (https://phabricator.wikimedia.org/T399619)
[18:29:07] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:30:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance
[18:30:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79120 and previous config saved to /var/cache/conftool/dbconfig/20250715-183047-marostegui.json
[18:30:56] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[18:31:26] (PS1) Legoktm: admin: temporarily disable legoktm's SSH key [puppet] - https://gerrit.wikimedia.org/r/1169727
[18:36:00] (CR) Dzahn: [C:+1] Revert "Gerrit: Set cache for groups" [puppet] -
https://gerrit.wikimedia.org/r/1169658 (owner: Hashar)
[18:36:08] (PS1) Ahmon Dancy: logspam.pl: Consolidate T322462 log messages [puppet] - https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462)
[18:36:36] (CR) CI reject: [V:-1] logspam.pl: Consolidate T322462 log messages [puppet] - https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462) (owner: Ahmon Dancy)
[18:37:40] SRE, Beta-Cluster-Infrastructure, serviceops, Wikidata, wmde-wikidata-tech: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#11006122 (Krinkle) Open→Resolved a:Krinkle This appears to be working now, and seemingly has been for a wh...
[18:37:56] (PS2) Ahmon Dancy: logspam.pl: Consolidate T322462 log messages [puppet] - https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462)
[18:38:35] (CR) Dzahn: [C:+2] admin: temporarily disable legoktm's SSH key [puppet] - https://gerrit.wikimedia.org/r/1169727 (owner: Legoktm)
[18:39:15] (PS28) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - https://gerrit.wikimedia.org/r/1154085
[18:39:43] (CR) CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1154085 (owner: CDobbins)
[18:39:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:44:38] (CR) Dzahn: [C:+2] logspam.pl: Consolidate T322462 log messages [puppet] - https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462) (owner: Ahmon Dancy)
[18:45:06] (PS1) Ebernhardson: Move repository to gitlab [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1169730
(https://phabricator.wikimedia.org/T399617)
[18:47:03] puppet is still failing on all(?) analytics hosts
[18:47:31] seems an alerting issue
[18:50:19] (PS1) Eevans: aqs1022: default to partition reuse [puppet] - https://gerrit.wikimedia.org/r/1169733
[18:52:06] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[18:52:15] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[18:54:52] (CR) Eevans: [C:+2] aqs1022: default to partition reuse [puppet] - https://gerrit.wikimedia.org/r/1169733 (owner: Eevans)
[18:56:41] (PS1) DDesouza: Undeploy Readers Use Cases Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870)
[18:57:18] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza)
[18:57:56] RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[19:01:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79121 and previous config saved to /var/cache/conftool/dbconfig/20250715-190120-marostegui.json
[19:01:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[19:04:58] (Abandoned) Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1169696 (owner: Ebernhardson)
[19:05:15] (Abandoned) Ebernhardson: Repoint oss.sonatype.org to repo1.maven.org [software/opensearch/plugins]
- https://gerrit.wikimedia.org/r/1169693 (owner: Ebernhardson)
[19:05:23] (Abandoned) Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162) (owner: Ebernhardson)
[19:09:07] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:09:41] (PS1) Krinkle: multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169737
[19:09:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:16:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P79122 and previous config saved to /var/cache/conftool/dbconfig/20250715-191627-marostegui.json
[19:17:06] (PS1) Eevans: aqs1012: perform a complete reimage [puppet] - https://gerrit.wikimedia.org/r/1169739 (https://phabricator.wikimedia.org/T396970)
[19:19:07] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:33] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:24:07] RESOLVED: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:25:05] (PS1) Eevans: aqs1012: perform a complete reimage (JBOD) [puppet] - https://gerrit.wikimedia.org/r/1169742 (https://phabricator.wikimedia.org/T396970)
[19:27:43] (CR) Eevans: [C:+2] aqs1012: perform a complete reimage (JBOD) [puppet] - https://gerrit.wikimedia.org/r/1169742 (https://phabricator.wikimedia.org/T396970) (owner: Eevans)
[19:28:53] (Abandoned) Ssingh: C:bird::anycast_healthchecker: notify service on conf file change [puppet] - https://gerrit.wikimedia.org/r/1166238 (owner: Ssingh)
[19:31:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P79123 and previous config saved to /var/cache/conftool/dbconfig/20250715-193134-marostegui.json
[19:33:14] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:33:20] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[19:37:50] (PS4) Scott French: httpd: Rebase on bookworm and cascade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128)
[19:40:00] (CR) Ottomata: "Great, okay, should I merge?"
[puppet] - https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[19:41:39] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:41:45] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet...
[19:42:08] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:42:15] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006282 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[19:46:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79124 and previous config saved to /var/cache/conftool/dbconfig/20250715-194642-marostegui.json
[19:46:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[19:46:58] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance
[19:47:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79125 and previous config saved to /var/cache/conftool/dbconfig/20250715-194704-marostegui.json
[19:49:25] (PS1) Scott French: shellbox: bump image versions [deployment-charts] - https://gerrit.wikimedia.org/r/1169752
[19:53:32] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:53:40] ops-eqiad, SRE, DC-Ops:
Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006319 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet...
[19:53:53] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:54:00] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006323 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T2000).
[20:00:05] NovemLinguae and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] o/
[20:01:24] I can deploy
[20:01:51] ty :)
[20:05:11] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[20:05:27] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[20:05:31] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006406 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet...
[20:05:47] ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006413 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:06:16] (CR) Zabe: [C:+2] Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[20:06:40] (CR) TrainBranchBot: [C:+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[20:06:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:07:05] (Merged) jenkins-bot: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[20:07:26] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169298|Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" (T398080 T399372)]]
[20:07:32] T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[20:07:32] T399372: MediaWiki\Storage\NameTableAccessException: No insert possible but primary DB didn't give us a record for 'SecurePoll' in 'content_models' - https://phabricator.wikimedia.org/T399372
[20:08:34] oh i almost forgot. there's a comment in that patch about, instead of deploying it, doing an SQL query instead
[20:09:00] deployer discretion though. thoughts?
[20:09:36] !log zabe@deploy1003 novemlinguae, zabe: Backport for [[gerrit:1169298|Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" (T398080 T399372)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:09:36] NovemLinguae: Is it intended that there will be a new content model?
[20:10:08] (PS6) CDobbins: add start of recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156
[20:10:28] so the patch that was deployed a week ago turns on SecurePoll logging to subpages of MediaWiki:SecurePoll/*. and those pages do use a new content model, yes. the first edit of this logging on enwiki would create a new content model SecurePoll
[20:10:38] due to what we suspect is a MediaWiki core bug, this is throwing an exception
[20:10:54] we suspect that the SQL query would solve the issue, OR we can revert the patch from a week ago
[20:11:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:14:11] Let me take a quick look
[20:16:41] huh
[20:17:08] (PS7) CDobbins: add start of recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156
[20:17:17] https://phabricator.wikimedia.org/P79126
[20:17:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79127 and previous config saved to /var/cache/conftool/dbconfig/20250715-201715-marostegui.json
[20:17:22] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[20:17:26] NovemLinguae: apparently it took a few tries ^
[20:18:03] (CR) CDobbins: [V:+1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6287/console" [puppet] - https://gerrit.wikimedia.org/r/1169156 (owner: CDobbins)
[20:18:05] sounds like you ran the query. let me go test if it worked
[20:18:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:18:16] yes
[20:18:31] (PS1) Eevans: aqs1012: must use partman/raid1-2dev-efi.cfg preseed [puppet] - https://gerrit.wikimedia.org/r/1169762 (https://phabricator.wikimedia.org/T396970)
[20:19:37] alright, the query worked. the logging is working now. https://en.wikipedia.org/w/index.php?title=MediaWiki:SecurePoll/834/msg/en&action=history
[20:19:49] we can abort the revert patch backport
[20:19:56] !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[20:20:06] ops-eqiad, SRE, DC-Ops, Patch-For-Review: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**)...
[20:20:21] (PS8) CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[20:20:36] if you have any insights about this weird bug feel free to post in https://phabricator.wikimedia.org/T399372
[20:20:52] yup
[20:21:01] !log zabe@deploy1003 Sync cancelled.
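Editorial aside on the SecurePoll exchange above: the exception quoted at 20:07:32 reads like the failure case of a get-or-create lookup on a name table. The sketch below is a minimal Python/sqlite3 illustration of that general pattern, not MediaWiki's code. The table name `content_models` comes from the exception message; the column names, the `acquire_id` helper and the `NameTableAccessError` class are assumptions made for the example. The query that was actually run is in the paste linked at 20:17:17 and is not reproduced here.

```python
import sqlite3


class NameTableAccessError(Exception):
    """Hypothetical error: the name could be neither inserted nor read back."""


def acquire_id(conn, name):
    """Return the numeric id for `name`, creating the row on first use."""
    select = "SELECT model_id FROM content_models WHERE model_name = ?"
    row = conn.execute(select, (name,)).fetchone()
    if row is not None:
        return row[0]
    # OR IGNORE: a concurrent writer creating the same name is not an error.
    conn.execute(
        "INSERT OR IGNORE INTO content_models (model_name) VALUES (?)", (name,)
    )
    # Read back; if the row is still missing, neither path produced a record.
    row = conn.execute(select, (name,)).fetchone()
    if row is None:
        raise NameTableAccessError(
            f"no insert possible and no record found for {name!r}"
        )
    return row[0]


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed two-column schema, for illustration only.
    conn.execute(
        "CREATE TABLE content_models ("
        "model_id INTEGER PRIMARY KEY, model_name TEXT UNIQUE NOT NULL)"
    )
    conn.execute("INSERT INTO content_models (model_name) VALUES ('wikitext')")
    first = acquire_id(conn, "SecurePoll")   # first use inserts the row
    second = acquire_id(conn, "SecurePoll")  # later uses only read it
    return first, second
```

If the row already exists, `acquire_id` returns on the first read and never reaches the insert path. That is one way a manual query could sidestep a failing insert; whether it matches the query that was run is not stated in the log.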
[20:21:16] (PS1) Zabe: Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169763
[20:21:21] (CR) Zabe: [C:+2] Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169763 (owner: Zabe)
[20:21:32] (CR) Eevans: [C:+2] aqs1012: must use partman/raid1-2dev-efi.cfg preseed [puppet] - https://gerrit.wikimedia.org/r/1169762 (https://phabricator.wikimedia.org/T396970) (owner: Eevans)
[20:22:12] (Merged) jenkins-bot: Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169763 (owner: Zabe)
[20:22:17] sorry for the curveball. thanks for fixing it :)
[20:22:31] no problem
[20:22:44] (CR) Zabe: [C:+2] Undeploy Readers Use Cases Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza)
[20:22:50] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[20:23:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[20:23:23] (PS9) CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[20:23:33] (Merged) jenkins-bot: Undeploy Readers Use Cases Survey [mediawiki-config] - https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: DDesouza)
[20:23:36] (CR) CDobbins: dnsrecursor: add recursor.yml.erb (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[20:24:18] sorry I was unable to login earlier
[20:24:34] no problem
[20:24:39] (CR) CDobbins: [V:+1] "PCC
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6288/console" [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[20:25:07] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]]
[20:25:17] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870
[20:25:25] (CR) Dzahn: [V:+1 C:+2] profile::httpd: include prometheus::apache_exporter [puppet] - https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: Dzahn)
[20:27:18] !log zabe@deploy1003 dani, zabe: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:33] danisztls: is it possible to test your patch?
[20:27:50] zabe: yes
[20:28:06] zabe: looks good
[20:28:09] nice
[20:28:11] syncing
[20:28:11] !log zabe@deploy1003 dani, zabe: Continuing with sync
[20:30:21] !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[20:30:37] ops-eqiad, SRE, DC-Ops, Patch-For-Review: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006573 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:30:43] (CR) Scott French: [C:+1] "This falls squarely in "should be fine" territory, but it wouldn't hurt to do mildly carefully [0].
If you'd like, I can merge this and ta" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:32:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P79128 and previous config saved to /var/cache/conftool/dbconfig/20250715-203224-marostegui.json [20:33:35] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]] (duration: 08m 27s) [20:33:42] T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870 [20:33:45] danisztls: should be live [20:35:07] (03CR) 10Dzahn: [V:03+1 C:03+2] "watched it being added on puppetserver1001" [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn) [20:37:51] zabe: thanks [20:39:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [20:44:49] !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage [20:45:29] yw [20:47:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P79129 and previous config saved to /var/cache/conftool/dbconfig/20250715-204732-marostegui.json [20:48:59] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage [20:50:09] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [20:50:49] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [20:53:05] (03PS1) 10Zabe: Also join 
linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765 [20:53:13] (03PS1) 10Zabe: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766 [20:53:32] (03CR) 10Zabe: [C:03+2] Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765 (owner: 10Zabe) [20:53:36] (03CR) 10Zabe: [C:03+2] Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766 (owner: 10Zabe) [20:54:07] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:11] (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169769 (https://phabricator.wikimedia.org/T399579) [20:56:31] (03PS5) 10Scott French: httpd: Rebase on bookworm and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T2100) [21:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79130 and previous config saved to /var/cache/conftool/dbconfig/20250715-210240-marostegui.json [21:02:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:02:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [21:02:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T399249)', 
diff saved to https://phabricator.wikimedia.org/P79131 and previous config saved to /var/cache/conftool/dbconfig/20250715-210251-marostegui.json [21:05:10] !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1012.eqiad.wmnet with OS bullseye [21:05:21] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye completed: - aqs1012 (**PASS**) - Removed from Puppet and PuppetD... [21:06:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [21:08:50] (03Merged) 10jenkins-bot: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765 (owner: 10Zabe) [21:08:55] (03Merged) 10jenkins-bot: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766 (owner: 10Zabe) [21:09:33] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]] [21:10:26] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage [21:11:39] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:12:28] !log zabe@deploy1003 zabe: Continuing with sync [21:13:42] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006699 (10Dzahn) deployed. watched the prometheus apache exporter getting installed on puppetserver1001. no issues there. It's also running on config-master1001 now. [21:14:07] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:13] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006700 (10Dzahn) 05Open→03Resolved a:03Dzahn [21:17:15] (03CR) 10Ryan Kemper: [C:03+1] cirrus: Drop absented periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1169209 (owner: 10Ebernhardson) [21:17:16] (03CR) 10Ryan Kemper: [C:03+2] cirrus: Drop absented periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1169209 (owner: 10Ebernhardson) [21:17:44] (03PS1) 10Eevans: aqs1012: setup data directories for 8-ssd JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970) [21:17:46] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]] (duration: 08m 12s) [21:18:35] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans) [21:22:43] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006731 (10Dzahn) regarding my previous comments about cloud VPS: While some instances / projects will
use the httpd module.. nothing seems to include the `profile::htt... [21:23:16] (03CR) 10Eevans: [C:03+2] aqs1012: setup data directories for 8-ssd JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans) [21:26:54] (03PS1) 10Zabe: CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) [21:26:56] (03PS1) 10Zabe: IS: Undeploy Interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 [21:26:56] (03PS1) 10Zabe: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) [21:27:24] (03PS2) 10Zabe: IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636) [21:28:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye [21:29:38] (03CR) 10Zabe: [C:03+2] CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [21:30:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T399249)', diff saved to https://phabricator.wikimedia.org/P79132 and previous config saved to /var/cache/conftool/dbconfig/20250715-213021-marostegui.json [21:30:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:30:35] (03Merged) 10jenkins-bot: CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [21:31:49] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]] [21:31:53] T399636: Undeploy the Interwiki extension from WMF prod - 
https://phabricator.wikimedia.org/T399636 [21:33:58] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:35:19] !log zabe@deploy1003 zabe: Continuing with sync [21:39:07] FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:40:45] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]] (duration: 08m 55s) [21:40:49] T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636 [21:44:08] (03CR) 10Zabe: [C:03+2] IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [21:45:06] (03PS2) 10Zabe: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) [21:45:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P79133 and previous config saved to /var/cache/conftool/dbconfig/20250715-214528-marostegui.json [21:48:41] (03Merged) 10jenkins-bot: IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [21:49:12] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) (T399636)]] [21:49:16] T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636 [21:51:22] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) 
(T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:52:23] !log zabe@deploy1003 zabe: Continuing with sync [21:57:55] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) (T399636)]] (duration: 08m 42s) [21:57:59] T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636 [21:59:25] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:00:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P79134 and previous config saved to /var/cache/conftool/dbconfig/20250715-220036-marostegui.json [22:11:08] (03CR) 10Zabe: [C:03+2] extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [22:11:58] (03Merged) 10jenkins-bot: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe) [22:12:23] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]] [22:12:27] T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636 [22:14:33] !log zabe@deploy1003 zabe: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:15:19] !log zabe@deploy1003 zabe: Continuing with sync [22:15:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T399249)', diff saved to https://phabricator.wikimedia.org/P79135 and previous config saved to /var/cache/conftool/dbconfig/20250715-221543-marostegui.json [22:15:49] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:15:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2192.codfw.wmnet with reason: Maintenance [22:16:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79136 and previous config saved to /var/cache/conftool/dbconfig/20250715-221606-marostegui.json [22:20:41] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]] (duration: 08m 17s) [22:20:48] T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636 [22:27:20] !log reprepro include php-excimer_1.2.5-1+wmf11u1 php-imagick_3.7.0-13+wmf11u1 php-luasandbox_4.1.2-1+wmf11u1 php-memcached_3.3.0-1+wmf11u1 php-pcov_1.0.12-1+wmf11u1 php-redis_6.2.0-1+wmf11u1 php-uuid_1.3.0-1+wmf11u1 php-wmerrors_2.0.0-1+wmf11u1 php-yaml_2.2.4-1+wmf11u1 wikidiff2_1.14.1-2+wmf11u1 xdebug_3.4.4-1+wmf11u1 in component/php83 - T398245 [22:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:26] T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245 [22:29:25] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:36:46] (03CR) 10Dzahn: [C:03+2] "still need to 
figure out: " failed to expand includes and copies: processing includes for variant 'sourcebot': There is no key 'sourcebot'" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:38:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:41:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79137 and previous config saved to /var/cache/conftool/dbconfig/20250715-224117-marostegui.json [22:41:23] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [22:43:53] (03PS1) 10Dzahn: add variant sourcebot to blubber file [container/codesearch] - 10https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199) [22:44:24] (03CR) 10Dzahn: [C:03+2] add variant sourcebot to blubber file [container/codesearch] - 10https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:44:52] (03CR) 10Dzahn: [C:03+2] "recheck" [container/codesearch] - 10https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [22:49:08] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [22:50:16] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [22:56:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P79138 and previous config saved to /var/cache/conftool/dbconfig/20250715-225624-marostegui.json [23:09:25] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about 
to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:11:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P79139 and previous config saved to /var/cache/conftool/dbconfig/20250715-231132-marostegui.json [23:23:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79141 and previous config saved to /var/cache/conftool/dbconfig/20250715-232640-marostegui.json [23:26:44] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [23:26:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2201.codfw.wmnet with reason: Maintenance [23:38:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1169789 [23:38:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1169789 (owner: 10TrainBranchBot) [23:50:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1169789 (owner: 10TrainBranchBot) [23:52:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2211.codfw.wmnet with reason: Maintenance [23:52:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T399249)', diff saved to 
https://phabricator.wikimedia.org/P79142 and previous config saved to /var/cache/conftool/dbconfig/20250715-235236-marostegui.json [23:52:41] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249