[00:07:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169269
[00:08:00] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169269 (owner: TrainBranchBot)
[00:39:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1169269 (owner: TrainBranchBot)
[00:49:42] <wikibugs>	 ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11002922 (BCornwall)
[00:56:41] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/656116f558b545d0be774668bc593e31de0367f572473a72482afc9b5accedfa/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:07:41] <icinga-wm>	 PROBLEM - Host mr1-magru IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:07:53] <icinga-wm>	 PROBLEM - Host mr1-magru is DOWN: PING CRITICAL - Packet loss = 100%
[01:07:54] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180)
[01:07:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[01:09:58] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:10:27] <icinga-wm>	 PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[01:12:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:16:41] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:18:35] <jinxer-wm>	 FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[01:19:10] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.45.0-wmf.10 [core] (wmf/1.45.0-wmf.10) - https://gerrit.wikimedia.org/r/1169273 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[01:55:07] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0200)
[02:03:35] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0300)
[03:01:57] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180)
[03:01:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[03:02:42] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.45.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1169277 (https://phabricator.wikimedia.org/T392180) (owner: TrainBranchBot)
[03:03:03] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.10  refs T392180
[03:03:07] <stashbot>	 T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180
[03:07:02] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:07:43] <wikibugs>	 ops-magru: Power Supply - PS Redundancy - issue on ganeti7001:9290 - https://phabricator.wikimedia.org/T399525 (phaultfinder) NEW
[03:20:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:21:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.418 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:48:40] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.10  refs T392180 (duration: 45m 36s)
[03:48:44] <stashbot>	 T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0400)
[04:00:24] <wikibugs>	 (CR) Arnaudb: "Done" [puppet] - https://gerrit.wikimedia.org/r/1167857 (https://phabricator.wikimedia.org/T398854) (owner: Arnaudb)
[04:01:54] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.7 (duration: 01m 42s)
[04:24:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:27:47] <wikibugs>	 (CR) Arnaudb: [C:+1] mailman: avoid pint linting alerts related to backup instance [alerts] - https://gerrit.wikimedia.org/r/1169107 (owner: Tiziano Fogli)
[04:28:31] <wikibugs>	 SRE-swift-storage, MinT, LPL Essential (2025 Jul-Sep), LPL Projects (MinT for Wikireaders – FY26 WE 3.1.5): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#11003045 (KartikMistry) Since logs are fine, we don't have anything specific to QA for th...
[04:29:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:30:06] <wikibugs>	 (CR) Arnaudb: [C:+1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[04:34:00] <wikibugs>	 (CR) Arnaudb: [C:+1] "lgtm!" [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[05:03:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:53] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:58] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:04:03] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:07:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:13] <jinxer-wm>	 FIRING: [19x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:18:35] <jinxer-wm>	 FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[05:35:45] <wikibugs>	 (PS1) Novem Linguae: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298
[05:36:33] <wikibugs>	 (PS2) Novem Linguae: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080)
[05:49:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T399446
[05:49:48] <stashbot>	 T399446: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T399446
[05:50:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1210 with weight 0 T399446', diff saved to https://phabricator.wikimedia.org/P79039 and previous config saved to /var/cache/conftool/dbconfig/20250715-055011-root.json
[05:51:57] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1210 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1169049 (https://phabricator.wikimedia.org/T399446) (owner: Gerrit maintenance bot)
[05:53:42] <wikibugs>	 (CR) SD0001: "The issue seems to be due to a mw core bug in the process of automatically inserting the new content model entry in the db table. The work" [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[05:54:10] <marostegui>	 !log Starting s5 eqiad failover from db1230 to db1210 - T399446
[05:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:07] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:58:15] <wikibugs>	 (CR) Novem Linguae: "If deployers have edit access to the production SQL database and are willing to try this, +1 from me." [mediawiki-config] - https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: Novem Linguae)
[06:00:08] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0600)
[06:00:08] <jouncebot>	 marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0600)
[06:01:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T399446', diff saved to https://phabricator.wikimedia.org/P79040 and previous config saved to /var/cache/conftool/dbconfig/20250715-060114-root.json
[06:01:20] <stashbot>	 T399446: Switchover s5 master (db1230 -> db1210) - https://phabricator.wikimedia.org/T399446
[06:02:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T399446', diff saved to https://phabricator.wikimedia.org/P79041 and previous config saved to /var/cache/conftool/dbconfig/20250715-060223-marostegui.json
[06:03:35] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:18] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s5-master alias [dns] - https://gerrit.wikimedia.org/r/1169050 (https://phabricator.wikimedia.org/T399446) (owner: Gerrit maintenance bot)
[06:04:21] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:05:11] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[06:06:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1230 T399446', diff saved to https://phabricator.wikimedia.org/P79042 and previous config saved to /var/cache/conftool/dbconfig/20250715-060600-root.json
[06:13:57] <wikibugs>	 (PS1) Muehlenhoff: Remove access for andyrussg [puppet] - https://gerrit.wikimedia.org/r/1169301
[06:15:10] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-magru and 2001:1498:1:966:1::251 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:15:55] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for andyrussg [puppet] - https://gerrit.wikimedia.org/r/1169301 (owner: Muehlenhoff)
[06:17:53] <wikibugs>	 (PS1) Marostegui: db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1169302 (https://phabricator.wikimedia.org/T399446)
[06:18:33] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1230.eqiad.wmnet with reason: maintenance
[06:18:45] <wikibugs>	 (CR) Marostegui: [C:+2] db1230: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1169302 (https://phabricator.wikimedia.org/T399446) (owner: Marostegui)
[06:19:00] <marostegui>	 moritzm: ok to merge?
[06:20:10] <jinxer-wm>	 RESOLVED: [3x] BFDdown: BFD session down between cr1-magru and 2001:1498:1:966:1::251 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:22:29] <marostegui>	 moritzm: ping
[06:23:00] <marostegui>	 moritzm: I've merged as the commit says the contract has ended
[06:24:33] <wikibugs>	 (PS1) Marostegui: db1230: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169303 (https://phabricator.wikimedia.org/T398928)
[06:26:03] <wikibugs>	 (CR) Marostegui: [C:+2] db1230: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1169303 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:26:26] <moritzm>	 marostegui: thanks, sorry got distracted by something else
[06:26:39] <marostegui>	 no worries
[06:28:58] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] mailman: avoid pint linting alerts related to backup instance [alerts] - https://gerrit.wikimedia.org/r/1169107 (owner: Tiziano Fogli)
[06:29:16] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AndyRussG out of all services on: 2394 hosts
[06:29:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1185.eqiad.wmnet onto db1230.eqiad.wmnet
[06:29:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1185 - Depool db1185.eqiad.wmnet to then clone it to db1230.eqiad.wmnet - marostegui@cumin1002
[06:30:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1185 - Depool db1185.eqiad.wmnet to then clone it to db1230.eqiad.wmnet - marostegui@cumin1002
[06:36:11] <wikibugs>	 (PS1) Muehlenhoff: Extend access until end of month [puppet] - https://gerrit.wikimedia.org/r/1169305
[06:36:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[06:36:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[06:36:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79044 and previous config saved to /var/cache/conftool/dbconfig/20250715-063651-marostegui.json
[06:36:55] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[06:38:00] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2229 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1169306 (https://phabricator.wikimedia.org/T399533)
[06:39:10] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Extend access until end of month [puppet] - https://gerrit.wikimedia.org/r/1169305 (owner: Muehlenhoff)
[06:46:28] <wikibugs>	 (CR) Elukey: [C:+1] deployment: Remove tilerator from scap::sources [puppet] - https://gerrit.wikimedia.org/r/1169218 (owner: Alexandros Kosiaris)
[06:48:12] <wikibugs>	 (CR) Muehlenhoff: [C:-1] "This patches the wrong files, these are only used by the old buster nodes, which will be entirely decommissioned once the new ones based on" [puppet] - https://gerrit.wikimedia.org/r/1169221 (owner: Alexandros Kosiaris)
[06:48:25] <wikibugs>	 (CR) Elukey: admin: Empty out kartotherian-admin (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1169219 (owner: Alexandros Kosiaris)
[06:48:40] <wikibugs>	 (CR) Elukey: [C:+1] admin: Remove tilerator/tileratorui system users [puppet] - https://gerrit.wikimedia.org/r/1169220 (owner: Alexandros Kosiaris)
[06:52:22] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1169218 (owner: Alexandros Kosiaris)
[06:54:49] <wikibugs>	 (CR) Muehlenhoff: "We can simply leave the Hiera changes to master.yaml and replica.yaml as-is, they will be entirely removed in a few weeks (when the old ro" [puppet] - https://gerrit.wikimedia.org/r/1169219 (owner: Alexandros Kosiaris)
[06:55:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1169220 (owner: Alexandros Kosiaris)
[06:57:34] <wikibugs>	 (CR) Muehlenhoff: maps: Add tegola user in DB, mark tilerator for removal (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1169222 (owner: Alexandros Kosiaris)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:03:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79045 and previous config saved to /var/cache/conftool/dbconfig/20250715-070305-marostegui.json
[07:03:25] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:06:28] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#11003252 (Marostegui) Thank you!
[07:07:02] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:11:09] <logmsgbot>	 marostegui@cumin1002 clone (PID 1530088) is awaiting input
[07:11:49] <wikibugs>	 (CR) Elukey: statistics: Add Python script for model uploading to statistics machines. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1166345 (https://phabricator.wikimedia.org/T394301) (owner: Bartosz Wójtowicz)
[07:12:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:13:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:14:52] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:18:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P79046 and previous config saved to /var/cache/conftool/dbconfig/20250715-071813-marostegui.json
[07:18:43] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:19:35] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm now!" [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[07:20:59] <moritzm>	 !log installing rubygems security updates
[07:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:43] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:26:50] <wikibugs>	 (PS3) Volans: Collab: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167889
[07:28:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:43] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:58] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:33:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P79047 and previous config saved to /var/cache/conftool/dbconfig/20250715-073322-marostegui.json
[07:33:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:33:48] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:34:57] <wikibugs>	 (CR) Jelto: [C:+2] Collab: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[07:36:31] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: use the alt chain on half upload@magru for measure cert [puppet] - https://gerrit.wikimedia.org/r/1169200 (https://phabricator.wikimedia.org/T398596) (owner: Vgutierrez)
[07:38:22] <vgutierrez>	 !log use GTS alt chain for the measure cert on cp[7013-7016] - T398596
[07:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:27] <stashbot>	 T398596: Consider using the alternate chain of Google Trust Services certificates - https://phabricator.wikimedia.org/T398596
[07:38:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:40:08] <wikibugs>	 (PS1) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[07:41:18] <wikibugs>	 (Merged) jenkins-bot: Collab: simplify Phabricator usage [cookbooks] - https://gerrit.wikimedia.org/r/1167889 (owner: Volans)
[07:42:00] <wikibugs>	 (CR) Vgutierrez: "this is no longer needed" [puppet] - https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: Fabfur)
[07:43:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:47:35] <wikibugs>	 (Abandoned) Fabfur: varnish: pass WME HEAD reqs to pass for ATS [puppet] - https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: Fabfur)
[07:48:00] <wikibugs>	 (CR) Vgutierrez: [C:+1] varnish: Implement translation analytics vars (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[07:48:10] <wikibugs>	 (CR) Tryvix1509: [C:+1] Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[07:48:27] <wikibugs>	 (CR) Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: Tryvix1509)
[07:48:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T399249)', diff saved to https://phabricator.wikimedia.org/P79048 and previous config saved to /var/cache/conftool/dbconfig/20250715-074829-marostegui.json
[07:48:34] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[07:48:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:48:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79049 and previous config saved to /var/cache/conftool/dbconfig/20250715-074851-marostegui.json
[07:50:58] <XioNoX>	 !log more Bird test on ganeti2034 & testvm2006 - T362392
[07:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:02] <stashbot>	 T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392
[07:53:15] <wikibugs>	 (PS4) Aqu: data-engineering: Refine switch-over preparation [puppet] - https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845)
[07:53:43] <jinxer-wm>	 FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:55:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[07:56:00] <wikibugs>	 (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[07:58:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:58:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:58:51] <wikibugs>	 (03PS5) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845)
[07:58:58] <jinxer-wm>	 FIRING: [13x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:00:57] <wikibugs>	 (03CR) 10Elukey: [C:03+2] httpbb(liftwing): update edit-check tests [puppet] - 10https://gerrit.wikimedia.org/r/1167858 (https://phabricator.wikimedia.org/T397013) (owner: 10AikoChou)
[08:01:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[08:01:33] <wikibugs>	 (03CR) 10Aqu: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[08:03:43] <jinxer-wm>	 FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:05:38] <kostajh>	 jouncebot: nowandnext
[08:05:39] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 54 minute(s)
[08:05:39] <jouncebot>	 In 1 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000)
[08:08:25] <wikibugs>	 (03PS6) 10Aqu: data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845)
[08:08:43] <jinxer-wm>	 RESOLVED: [6x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[08:14:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79051 and previous config saved to /var/cache/conftool/dbconfig/20250715-081458-marostegui.json
[08:15:03] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[08:23:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning
[08:24:35] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:24:45] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:35] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 9.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:25:41] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54226 bytes in 6.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:26:37] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=eqiad
[08:27:08] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.reimage for host gitlab1003.wikimedia.org with OS bookworm
[08:27:43] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:27:52] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:28:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] No longer use mirrors.debian.org on Buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/1160171 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff)
[08:30:04] <wikibugs>	 (03PS2) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[08:30:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P79054 and previous config saved to /var/cache/conftool/dbconfig/20250715-083006-marostegui.json
[08:32:04] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deal with archival of Buster on Debian mirrors - https://phabricator.wikimedia.org/T397209#11003387 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Buster has been archived on the Debian mirrors last weekend and all fallout shoul...
[08:33:31] <icinga-wm>	 PROBLEM - Host gitlab-replica-b.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:36:07] <wikibugs>	 (03CR) 10Btullis: [C:03+2] data-engineering: Refine switch-over preparation [puppet] - 10https://gerrit.wikimedia.org/r/1167572 (https://phabricator.wikimedia.org/T369845) (owner: 10Aqu)
[08:36:25] <jelto>	 ^ gitlab alert is expected, reimaging 
[08:37:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Stop using debug repository on Buster [puppet] - 10https://gerrit.wikimedia.org/r/1169610 (https://phabricator.wikimedia.org/T397209)
[08:37:42] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:40:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[08:40:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169610 (https://phabricator.wikimedia.org/T397209) (owner: 10Muehlenhoff)
[08:43:53] <logmsgbot>	 !log jelto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[08:45:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P79058 and previous config saved to /var/cache/conftool/dbconfig/20250715-084513-marostegui.json
[08:46:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core, 13Patch-For-Review: Move OpenSSH server config away from using a Puppet template - https://phabricator.wikimedia.org/T393762#11003445 (10MoritzMuehlenhoff) 05Open→03Resolved This is implemented for Trixie and later
[08:47:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Use a forward port of Puppet 7 on Trixie hosts - https://phabricator.wikimedia.org/T392790#11003447 (10MoritzMuehlenhoff) 05Open→03Resolved Trixie uses a forward port of Puppet 7, which gets correctly installed during d-i.
[08:48:25] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Move ganeti2045-ganeti2050 into production and decom ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T396590#11003454 (10MoritzMuehlenhoff)
[08:48:31] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab1003.wikimedia.org with reason: host reimage
[08:52:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:53:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[08:53:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[08:54:54] <marostegui>	 !log Restart mariadb on pc1 T399540
[08:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:58] <stashbot>	 T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[08:59:05] <icinga-wm>	 RECOVERY - Host gitlab-replica-b.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[09:00:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T399249)', diff saved to https://phabricator.wikimedia.org/P79061 and previous config saved to /var/cache/conftool/dbconfig/20250715-090021-marostegui.json
[09:00:26] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:00:48] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:00:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T399249)', diff saved to https://phabricator.wikimedia.org/P79062 and previous config saved to /var/cache/conftool/dbconfig/20250715-090055-marostegui.json
[09:01:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:01:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:02:42] <jinxer-wm>	 FIRING: [7x] JobUnavailable: Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:03:33] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "Thanks Scott. I have done that now." [puppet] - 10https://gerrit.wikimedia.org/r/1169106 (https://phabricator.wikimedia.org/T380866) (owner: 10Btullis)
[09:03:33] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6267/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[09:05:04] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] Pyrra-filesystem: purge unmanaged files from config directory [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[09:05:06] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Pyrra-filesystem: purge unmanaged files from config directory [puppet] - 10https://gerrit.wikimedia.org/r/1169234 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[09:09:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning
[09:09:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1185.eqiad.wmnet onto db1230.eqiad.wmnet
[09:10:01] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[09:10:14] <jinxer-wm>	 FIRING: [19x] CertAlmostExpired: Certificate for service cr1-magru.wikimedia.org:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:11:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons.
[09:11:17] <icinga-wm>	 PROBLEM - TFTP service on install2004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[09:11:34] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Remove db1246 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1169612 (https://phabricator.wikimedia.org/T399449)
[09:12:08] <logmsgbot>	 !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab1003.wikimedia.org with OS bookworm
[09:12:20] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db1246 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1169612 (https://phabricator.wikimedia.org/T399449) (owner: 10Marostegui)
[09:13:26] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[09:13:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1246 T399449', diff saved to https://phabricator.wikimedia.org/P79068 and previous config saved to /var/cache/conftool/dbconfig/20250715-091328-marostegui.json
[09:13:34] <stashbot>	 T399449: decommission db1246.eqiad.wmnet - https://phabricator.wikimedia.org/T399449
[09:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:40] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:15:20] <wikibugs>	 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.07.05 - 2025.07.25): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#11003583 (10BTullis) 05Open→03Resolved This is now done...
[09:16:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:16:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:17:24] <marostegui>	 !log Restart mariadb on pc2 T399540
[09:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] <stashbot>	 T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[09:17:31] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:18:35] <jinxer-wm>	 FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[09:18:48] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:19:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:19:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:19:52] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
[09:19:54] <wikibugs>	 (03PS4) 10Muehlenhoff: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[09:20:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[09:20:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P79072 and previous config saved to /var/cache/conftool/dbconfig/20250715-092050-root.json
[09:21:36] <wikibugs>	 (03PS1) 10Marostegui: db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169613 (https://phabricator.wikimedia.org/T398928)
[09:21:40] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[09:22:33] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1230: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1169613 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui)
[09:22:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:39] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:25:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399249)', diff saved to https://phabricator.wikimedia.org/P79073 and previous config saved to /var/cache/conftool/dbconfig/20250715-092551-marostegui.json
[09:25:55] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:27:16] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:28:15] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[09:28:31] <wikibugs>	 (03PS3) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[09:29:06] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:30:16] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:30:21] <wikibugs>	 (03PS1) 10Marostegui: db1258: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169614 (https://phabricator.wikimedia.org/T399298)
[09:30:53] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1258: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1169614 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui)
[09:31:45] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:31:57] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1258.eqiad.wmnet with reason: Maintenance
[09:32:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1258 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P79074 and previous config saved to /var/cache/conftool/dbconfig/20250715-093200-marostegui.json
[09:33:06] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:34:52] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:35:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P79076 and previous config saved to /var/cache/conftool/dbconfig/20250715-093556-root.json
[09:36:33] <wikibugs>	 (03CR) 10Btullis: "Happy in principle with this, when the CI passes." [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[09:36:59] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:37:12] <wikibugs>	 (03CR) 10Btullis: admin: Remove platform-engineering group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[09:37:43] <wikibugs>	 10ops-magru: Power Supply - Status - issue on dns7002:9290 - https://phabricator.wikimedia.org/T399549 (10phaultfinder) 03NEW
[09:38:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:38:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:38:52] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:39:01] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[09:39:02] <marostegui>	 !log Restart mariadb on pc3 T399540
[09:39:04] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[09:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:06] <stashbot>	 T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[09:39:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P79079 and previous config saved to /var/cache/conftool/dbconfig/20250715-093943-root.json
[09:40:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P79080 and previous config saved to /var/cache/conftool/dbconfig/20250715-094058-marostegui.json
[09:41:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:41:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:42:24] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons.
[09:43:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:39] <icinga-wm>	 RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 110.93 ms
[09:44:51] <icinga-wm>	 PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:44:51] <icinga-wm>	 RECOVERY - Host mr1-magru.oob IPv6 is UP: PING WARNING - Packet loss = 50%, RTA = 123.54 ms
[09:46:45] <wikibugs>	 (03PS5) 10Muehlenhoff: admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[09:47:20] <icinga-wm>	 RECOVERY - Host mr1-magru IPv6 is UP: PING OK - Packet loss = 0%, RTA = 111.03 ms
[09:47:31] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:47:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:47:42] <icinga-wm>	 RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 123.46 ms
[09:47:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.parsercache
[09:48:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0)
[09:48:26] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:48:35] <jinxer-wm>	 RESOLVED: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown
[09:48:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P79082 and previous config saved to /var/cache/conftool/dbconfig/20250715-095101-root.json
[09:51:06] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=eqiad
[09:54:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P79083 and previous config saved to /var/cache/conftool/dbconfig/20250715-095449-root.json
[09:54:58] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:55:47] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm, ping me when you need me to merge" [puppet] - 10https://gerrit.wikimedia.org/r/1168619 (owner: 10Hashar)
[09:56:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P79084 and previous config saved to /var/cache/conftool/dbconfig/20250715-095605-marostegui.json
[09:56:13] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[09:56:48] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[09:57:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for mszwarc [puppet] - 10https://gerrit.wikimedia.org/r/1169616
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000)
[10:02:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for mszwarc [puppet] - 10https://gerrit.wikimedia.org/r/1169616 (owner: 10Muehlenhoff)
[10:03:35] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:04:41] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:04:58] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:05:26] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:05:33] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:05:46] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:06:04] <logmsgbot>	 !log gmodena@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:06:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P79085 and previous config saved to /var/cache/conftool/dbconfig/20250715-100607-root.json
[10:06:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff)
[10:09:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P79086 and previous config saved to /var/cache/conftool/dbconfig/20250715-100955-root.json
[10:11:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T399249)', diff saved to https://phabricator.wikimedia.org/P79087 and previous config saved to /var/cache/conftool/dbconfig/20250715-101113-marostegui.json
[10:11:18] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:11:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:11:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79088 and previous config saved to /var/cache/conftool/dbconfig/20250715-101135-marostegui.json
[10:11:44] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#11003801 (10ayounsi) Netbox is unfortunately not made to track inventory items (as in on a shelf). There are some plugins tha...
[10:12:54] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi)
[10:13:30] <wikibugs>	 (03Merged) 10jenkins-bot: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi)
[10:14:10] <wikibugs>	 (03PS1) 10Effie Mouzeli: mcrouter: assign pods a higher priority class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169620 (https://phabricator.wikimedia.org/T397683)
[10:15:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] profile::kubernetes::mediawiki_runner: add feature_flag support [puppet] - 10https://gerrit.wikimedia.org/r/1169108 (owner: 10Effie Mouzeli)
[10:16:55] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] hieradata: migrate memcached gutter pool to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1166194 (https://phabricator.wikimedia.org/T398611) (owner: 10Effie Mouzeli)
[10:17:23] <XioNoX>	 !log magru: setup BGP to Ufinet - T389767
[10:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:30] <mszabo>	 jouncebot: nowandnext
[10:17:30] <jouncebot>	 For the next 0 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000)
[10:17:30] <jouncebot>	 In 1 hour(s) and 42 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200)
[10:17:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:17:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[10:18:40] <wikibugs>	 (03Merged) 10jenkins-bot: Configure Special:CreateAccount instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167896 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[10:19:22] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1167896|Configure Special:CreateAccount instrument (T394744)]]
[10:19:27] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[10:19:52] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:19:56] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2005.codfw.wmnet
[10:20:07] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet
[10:22:05] <moritzm>	 !log installing debian-archive-keyring updates from Bookworm point release
[10:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:21] <wikibugs>	 (03PS1) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058)
[10:22:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqord and cr3-ulsfo (198.35.26.192) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[10:23:10] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[10:23:25] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1167896|Configure Special:CreateAccount instrument (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:24:52] <jinxer-wm>	 FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:25:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1258 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P79090 and previous config saved to /var/cache/conftool/dbconfig/20250715-102500-root.json
[10:26:07] <wikibugs>	 (03CR) 10Dragoniez: Create "abusefilter" editor user group for Vietnamese Wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[10:26:41] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2005.codfw.wmnet
[10:26:44] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet
[10:28:01] <logmsgbot>	 !log mszabo@deploy1003 Sync cancelled.
[10:28:14] <mszabo>	 we'll be back after a commercial break
[10:28:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] admin: Remove platform-engineering group [puppet] - 10https://gerrit.wikimedia.org/r/1138828 (owner: 10Krinkle)
[10:30:39] <wikibugs>	 (03PS1) 10Máté Szabó: Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744)
[10:31:16] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[10:31:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[10:32:30] <wikibugs>	 (03Merged) 10jenkins-bot: Register mediawiki.product_metrics.special_create_account stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169623 (https://phabricator.wikimedia.org/T394744) (owner: 10Máté Szabó)
[10:32:49] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove golang-1.17 and golang-1.18 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557)
[10:32:50] <wikibugs>	 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11003846 (10elukey) @Mvolz hi! I added the success-ratio SLO, but the error budget looks not ok so I'd need your help to figure out what I am doin...
[10:32:51] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]]
[10:32:55] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[10:33:19] <wikibugs>	 (03CR) 10Muehlenhoff: "With https://phabricator.wikimedia.org/T390139 resolved, this is ready for review again" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1131630 (https://phabricator.wikimedia.org/T383557) (owner: 10Muehlenhoff)
[10:36:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79093 and previous config saved to /var/cache/conftool/dbconfig/20250715-103641-marostegui.json
[10:36:46] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[10:36:48] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:38:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Deprecate dumpsdata-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1169627
[10:38:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Fine by me, I 'll split in 2 patches, 1 to fix data.yaml and 1 to fully remove the files (and we piggyback on that one the rest of the cha" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris)
[10:39:33] <wikibugs>	 (03CR) 10Muehlenhoff: "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (owner: 10Alexandros Kosiaris)
[10:40:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:40:53] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[10:45:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Fair enough. I 'll merge this patch then with the one removing hiera files which would be removing all old maps nodes stuff. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1169221 (owner: 10Alexandros Kosiaris)
[10:46:05] <Dreamy_Jazz>	 jouncebot: nowandnext
[10:46:05] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1000)
[10:46:05] <jouncebot>	 In 1 hour(s) and 13 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200)
[10:46:45] <wikibugs>	 (03PS4) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535)
[10:47:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:48:11] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169623|Register mediawiki.product_metrics.special_create_account stream (T394744)]] (duration: 15m 19s)
[10:48:16] <stashbot>	 T394744: Instrument account creation funnel (analytics for Special:CreateAccount) - https://phabricator.wikimedia.org/T394744
[10:49:46] <wikibugs>	 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11003919 (10elukey) Maybe we are counting also the Zotero's calls? If so I'd suggest to exclude them, since IIUC Citoid calls Zotero, but from the...
[10:51:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P79096 and previous config saved to /var/cache/conftool/dbconfig/20250715-105148-marostegui.json
[10:52:25] <wikibugs>	 (03CR) 10Dragoniez: [C:03+1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[10:55:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:56:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ldap-admins from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/1169633
[11:01:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#11003958 (10MoritzMuehlenhoff)
[11:02:57] <wikibugs>	 (03PS1) 10Zabe: Set categorylinks to read new on jawiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912)
[11:03:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=eqiad): 'Configure db1259', diff saved to https://phabricator.wikimedia.org/P79097 and previous config saved to /var/cache/conftool/dbconfig/20250715-110322-fceratto.json
[11:04:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#11003962 (10MoritzMuehlenhoff) 05Open→03Invalid >>! In T396660#10967608, @Jclark-ctr wrote: > @MoritzMuehlenhoff  is this still an issue could you verify again and we can try a di...
[11:06:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P79098 and previous config saved to /var/cache/conftool/dbconfig/20250715-110655-marostegui.json
[11:07:02] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:07:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[11:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: Set categorylinks to read new on jawiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169635 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[11:08:20] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]]
[11:08:25] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[11:08:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560 (10Vgutierrez) 03NEW
[11:10:26] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:10:58] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565)
[11:11:00] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565)
[11:11:02] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: admin: Empty out kartotherian-admin [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565)
[11:11:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:11:04] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565)
[11:11:06] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565)
[11:11:07] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: DNM: tilerator: Remove as much as possible of the last cruft [puppet] - 10https://gerrit.wikimedia.org/r/1169223 (https://phabricator.wikimedia.org/T381565)
[11:11:09] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565)
[11:11:27] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[11:11:42] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] k8s::mediawiki_runner: allow outgoing connections to memcached [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[11:11:56] <wikibugs>	 (03CR) 10Effie Mouzeli: k8s::mediawiki_runner: allow outgoing connections to memcached (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169118 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[11:13:47] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565)
[11:14:18] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: maps: Cleanup DB grants, add tegola, prep tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169221 (owner: 10Alexandros Kosiaris)
[11:15:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[11:17:07] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169635|Set categorylinks to read new on jawiki and ruwiki (T397912)]] (duration: 08m 46s)
[11:17:11] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[11:18:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:18:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:20:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mtail: Remove tilerator from tests [puppet] - 10https://gerrit.wikimedia.org/r/1169217 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:20:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:20:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] deployment: Remove tilerator from scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/1169218 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:20:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] admin: Empty out kartotherian-admin [puppet] - 10https://gerrit.wikimedia.org/r/1169219 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:21:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:22:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T399249)', diff saved to https://phabricator.wikimedia.org/P79099 and previous config saved to /var/cache/conftool/dbconfig/20250715-112202-marostegui.json
[11:22:08] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[11:22:18] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[11:22:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79100 and previous config saved to /var/cache/conftool/dbconfig/20250715-112225-marostegui.json
[11:24:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] admin: Remove tilerator/tileratorui system users [puppet] - 10https://gerrit.wikimedia.org/r/1169220 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:26:43] <jynus>	 !log restart atftp daemon @ install2004, it had crashed
[11:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:45] <wikibugs>	 (03PS3) 10Dreamy Jazz: mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302)
[11:26:59] <icinga-wm>	 RECOVERY - TFTP service on install2004 is OK: PROCS OK: 1 process with UID = 65534 (nobody), regex args .*/usr/sbin/atftpd .* https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[11:27:04] <wikibugs>	 (03CR) 10Muehlenhoff: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[11:27:18] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: 10Dreamy Jazz)
[11:27:22] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Document Trust and Safety Product Team database tables [puppet] - 10https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: 10Dreamy Jazz)
[11:28:33] <jynus>	 moritzm: not filing a task because it is not an issue atm, nor do I think it requires further action, but FYI atftpd crashed at around the same time on both install1004 and install2004; on the latter it didn't restart correctly
[11:34:25] <moritzm>	 ok, those will be upgraded to bookworm in the coming months anyway, and that version will have a systemd-socket-activated atftpd
[11:34:40] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet
[11:34:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:34:46] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp2006.codfw.wmnet
[11:41:07] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet
[11:41:20] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2006.codfw.wmnet
[11:44:12] <wikibugs>	 (03PS1) 10Jcrespo: admin: Add new systemctl alias and update $? output for jynus [puppet] - 10https://gerrit.wikimedia.org/r/1169640
[11:45:39] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet
[11:48:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79101 and previous config saved to /var/cache/conftool/dbconfig/20250715-114833-marostegui.json
[11:48:38] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[11:50:09] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 16347
[11:50:45] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16347
[11:51:34] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 36351
[11:51:59] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet
[11:56:27] <logmsgbot>	 ayounsi@cumin1002 peering (PID 1896552) is awaiting input
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1200)
[12:03:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P79102 and previous config saved to /var/cache/conftool/dbconfig/20250715-120340-marostegui.json
[12:06:17] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2009 - https://phabricator.wikimedia.org/T396365#11004180 (10Jhancock.wm)
[12:08:00] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168214 (owner: 10PipelineBot)
[12:10:02] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168214 (owner: 10PipelineBot)
[12:14:18] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:14:26] <wikibugs>	 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#11004195 (10elukey) @DLynch Hi! I have a couple of questions for you:  * This is a preview of the metrics, https://w.wiki/EjUp, coul...
[12:14:49] <logmsgbot>	 !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:15:28] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:16:16] <logmsgbot>	 !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:16:26] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:17:03] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 36351
[12:17:14] <logmsgbot>	 !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:18:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P79105 and previous config saved to /var/cache/conftool/dbconfig/20250715-121849-marostegui.json
[12:23:41] <XioNoX>	 !log update AS14907 RIPE import/export policies
[12:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:56] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] prom/metamonitor: add dead man switch and public endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[12:29:35] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399571 (10phaultfinder) 03NEW
[12:29:36] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply - Status - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399570 (10phaultfinder) 03NEW
[12:33:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 139009
[12:33:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T399249)', diff saved to https://phabricator.wikimedia.org/P79108 and previous config saved to /var/cache/conftool/dbconfig/20250715-123357-marostegui.json
[12:34:02] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[12:34:07] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:34:13] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[12:34:14] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 139009
[12:34:24] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399573 (10phaultfinder) 03NEW
[12:34:27] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399572 (10phaultfinder) 03NEW
[12:38:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Upgrade firmware (NIC and system) on ganeti1047 - https://phabricator.wikimedia.org/T396660#11004255 (10Jclark-ctr) Thanks for Verifying
[12:44:47] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399573#11004280 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable. iDRAC now shows as healthy. Updated iDRAC firmware while logged in.
[12:45:09] <wikibugs>	 10ops-eqiad, 06DC-Ops: Power Supply - Status - issue on cirrussearch1088:9290 - https://phabricator.wikimedia.org/T399571#11004284 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable. iDRAC now shows as healthy. Updated iDRAC firmware while logged in.
[12:51:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove weight from the master T395771', diff saved to https://phabricator.wikimedia.org/P79109 and previous config saved to /var/cache/conftool/dbconfig/20250715-125157-marostegui.json
[12:52:02] <stashbot>	 T395771: Productionize es2047, es2048, es1047, es1048 - https://phabricator.wikimedia.org/T395771
[12:54:07] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:57:02] <tappof>	 ^^ I'll have a look, might be my fault
[12:57:03] <wikibugs>	 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-drmrs) - https://phabricator.wikimedia.org/T393991#11004304 (10ayounsi) 05Open→03Resolved
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1300).
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:17] <moritzm>	 tappof: nah, it's a recurring thing caused by a lot of Hadoop workers being decommed
[13:00:43] <Lucas_WMDE>	 o/
[13:00:59] <tappof>	 moritzm: Well, thank you! the timing was a bit suspicious :)
[13:01:05] <jynus>	 yes, I mentioned it to the team before, they are aware
[13:01:08] <Lucas_WMDE>	 nothing to deploy (disappointing – means I can’t test T399462 being fixed ^^)
[13:01:09] <stashbot>	 T399462: SpiderPig live job log view (terminal / console) sometimes freezes - https://phabricator.wikimedia.org/T399462
[13:01:26] <jynus>	 it just goes under the threshold for the alert
[13:02:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:07:16] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004320 (10KOfori) Approved.
[13:11:21] <wikibugs>	 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11004337 (10elukey) Just sent the email to Willy explaining the issue, fingers crossed to get some help from Dell :)
[13:14:07] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:14:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[13:14:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79110 and previous config saved to /var/cache/conftool/dbconfig/20250715-131450-marostegui.json
[13:14:54] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:17:46] <wikibugs>	 (03CR) 10Fabfur: cache::haproxy: Provide X-Trusted-Request score (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[13:23:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] mcrouter: assign pods a higher priority class (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169620 (https://phabricator.wikimedia.org/T397683) (owner: 10Effie Mouzeli)
[13:25:42] <abijeet>	 hello hello, is someone still around to help deploy a configuration change?
[13:29:56] <logmsgbot>	 !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154
[13:31:40] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 01m 43s)
[13:31:44] <logmsgbot>	 !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154
[13:32:37] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 00m 53s)
[13:32:57] <logmsgbot>	 !log hashar@deploy1003 Started deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154
[13:32:58] <wikibugs>	 (03CR) 10Muehlenhoff: "Adding Scott as reviewer" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[13:33:32] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [releng/jenkins-deploy@ea02eb9] (releasing): jenkins-rel: update plugins to address vulnerabilities - T399154 (duration: 00m 35s)
[13:36:24] <wikibugs>	 (03PS1) 10Zabe: BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579)
[13:37:05] <wikibugs>	 (03PS1) 10Brennen Bearnes: phabricator deployment: skip storage upgrade during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1169654 (https://phabricator.wikimedia.org/T370266)
[13:37:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79111 and previous config saved to /var/cache/conftool/dbconfig/20250715-133712-marostegui.json
[13:37:16] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[13:37:19] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[13:41:14] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:41:42] <wikibugs>	 (03PS2) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058)
[13:41:51] <wikibugs>	 (03CR) 10Vgutierrez: cache::haproxy: Provide X-Trusted-Request score (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[13:43:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11004524 (10Jclark-ctr) @eevans @VRiley-WMF   {F64589137}  KN09N7919I0709R1S serial looks like it was in slot 1 not 0 according to Hardware Inventory in idrac
[13:43:50] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-serve1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:44:41] <wikibugs>	 (03PS1) 10Hashar: Revert "Gerrit: Set cache for groups" [puppet] - 10https://gerrit.wikimedia.org/r/1169658
[13:46:18] <wikibugs>	 (03PS2) 10Hashar: Revert "Gerrit: Set cache for groups" [puppet] - 10https://gerrit.wikimedia.org/r/1169658
[13:47:05] <wikibugs>	 (03CR) 10Fabfur: cache::haproxy: Provide X-Trusted-Request score (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[13:48:24] <wikibugs>	 (03PS1) 10Vgutierrez: site: Remove cp[5013-5016] entries [puppet] - 10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830)
[13:49:58] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] site: Remove cp[5013-5016] entries [puppet] - 10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830) (owner: 10Vgutierrez)
[13:51:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] site: Remove cp[5013-5016] entries [puppet] - 10https://gerrit.wikimedia.org/r/1169659 (https://phabricator.wikimedia.org/T323830) (owner: 10Vgutierrez)
[13:52:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P79112 and previous config saved to /var/cache/conftool/dbconfig/20250715-135219-marostegui.json
[13:55:02] <wikibugs>	 (03PS2) 10Scott French: configcluster.yaml - remove eventlogging from profile::etcd::tlsproxy::acls [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[13:55:14] <wikibugs>	 (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[13:55:56] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[13:56:35] <wikibugs>	 (03PS1) 10Hashar: gerrit: remove GWT-only theme configuration [puppet] - 10https://gerrit.wikimedia.org/r/1169660
[13:57:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399570#11004556 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power supply failed. server out of warranty. replaced with one from a decommed server.
[13:57:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on maps2008:9290 - https://phabricator.wikimedia.org/T399572#11004560 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm power supply failed. server out of warranty. replaced with one from a decommed server.
[13:58:20] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11004564 (10Jhancock.wm) @klausman I have a few servers of yours in codfw that need this updated. The PXE settings need to be updated. It shouldn't cause a reboot to reset the pxe, but if anyt...
[13:59:08] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:03:30] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Thanks, Andrew! There should be no harm in cleaning this up, and better to get rid of it to avoid future confusion." [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[14:04:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:04:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11004586 (10elukey) For future notes, these are the BIOS's Attributes:  ` {'ACPICSTC2Latency': 800,  'ACPISRATL3CacheAsNUMADomain': 'Auto',  'ACSEnable': 'Auto',  'APBD...
[14:05:00] <wikibugs>	 (03CR) 10Tryvix1509: [C:03+1] Create "abusefilter" editor user group for Vietnamese Wikipedia (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[14:05:59] <wikibugs>	 (03CR) 10Tryvix1509: Create "abusefilter" editor user group for Vietnamese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169603 (https://phabricator.wikimedia.org/T399535) (owner: 10Tryvix1509)
[14:06:41] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[14:06:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11004594 (10Eevans) >>! In T396970#11004524, @Jclark-ctr wrote: > @eevans @VRiley-WMF   {F64589137} >  > KN09N7919I0709R1S serial looks like it was in slot 1 not 0 according to Hardware Inventory in idrac    >...
[14:06:50] <swfrench-wmf>	 !log reprepro include php8.3_8.3.23-1+wmf11u2 in component/php83 - T398245
[14:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:56] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[14:07:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P79113 and previous config saved to /var/cache/conftool/dbconfig/20250715-140726-marostegui.json
[14:08:34] <wikibugs>	 (03PS1) 10Ayounsi: WIP: Ganeti Bird BGP [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392)
[14:12:02] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565)
[14:14:11] <wikibugs>	 (03CR) 10Elukey: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[14:15:36] <wikibugs>	 (03PS1) 10Ayounsi: Routed ganeti: disable IPv4 ICMP redirects [puppet] - 10https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392)
[14:19:15] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058)
[14:20:46] <wikibugs>	 (03CR) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[14:20:51] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[14:21:40] <wikibugs>	 (03PS3) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436)
[14:22:17] <swfrench-wmf>	 !log reprepro include php8.1_8.1.33-1+wmf11u1 in component/php81
[14:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[14:22:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T399249)', diff saved to https://phabricator.wikimedia.org/P79114 and previous config saved to /var/cache/conftool/dbconfig/20250715-142234-marostegui.json
[14:22:40] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[14:22:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:52] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[14:22:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Neutron: include a python dependency for wmcs-netns-events [puppet] - 10https://gerrit.wikimedia.org/r/1168648 (owner: 10Andrew Bogott)
[14:23:54] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[14:25:14] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[14:25:41] <wikibugs>	 (03PS4) 10Ssingh: admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436)
[14:26:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004652 (10ssingh)
[14:28:25] <wikibugs>	 (03CR) 10Ssingh: "Rebased, no code change." [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[14:28:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#11004658 (10Jhancock.wm) (not trying to rush, just making sure i didn't miss something) Is there anything I can help with on this one?
[14:29:07] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1430)
[14:30:28] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:32:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add OKryva-WMF to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1169189 (https://phabricator.wikimedia.org/T399436) (owner: 10Ssingh)
[14:33:00] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:33:04] <wikibugs>	 (03CR) 10Zabe: [C:03+2] BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe)
[14:33:23] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:33:36] <wikibugs>	 (03PS1) 10Tiziano Fogli: prom/metamonitor: add CNAMEs for metamonitoring endpoints [dns] - 10https://gerrit.wikimedia.org/r/1169668 (https://phabricator.wikimedia.org/T397003)
[14:33:50] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:34:22] <wikibugs>	 (03Merged) 10jenkins-bot: BETA: Stop writing to cl_to and cl_collation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169653 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe)
[14:35:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM, I think this setting is safe in any way we operate the hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1169663 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[14:36:24] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:36:35] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2010.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:37:24] <wikibugs>	 (03PS4) 10Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148)
[14:37:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11004701 (10ssingh) Expiry has been set to end of FY (June 2026) and contact has been set to Suman to get this request going. We...
[14:37:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) (owner: 10Dreamy Jazz)
[14:38:39] <wikibugs>	 (03PS1) 10Ssingh: admin: add vgutierrez to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560)
[14:39:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560) (owner: 10Ssingh)
[14:39:32] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] admin: add vgutierrez to analytics-privatedata-users (with krb) [puppet] - 10https://gerrit.wikimedia.org/r/1169672 (https://phabricator.wikimedia.org/T399560) (owner: 10Ssingh)
[14:39:47] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004722 (10Jhancock.wm) @elukey looks like this server and the one in T396365 are having this same issue with the provisioning script. they're both the 1 CPU test servers from su...
[14:40:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for vgutierrez - https://phabricator.wikimedia.org/T399560#11004725 (10ssingh) 05Open→03Resolved a:03ssingh ` sukhe@krb1002:~$ sudo manage_principals.py create vgutierrez --email_address=vgutierrez@wi...
[14:40:50] <wikibugs>	 (03Abandoned) 10Herron: thanos: add recording rules for varnish SLO [puppet] - 10https://gerrit.wikimedia.org/r/740209 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron)
[14:41:04] <wikibugs>	 (03Abandoned) 10Herron: add error and latency budget burndown graph panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/715536 (https://phabricator.wikimedia.org/T290009) (owner: 10Herron)
[14:41:12] <wikibugs>	 (03Abandoned) 10Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - 10https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: 10Herron)
[14:43:13] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004737 (10MoritzMuehlenhoff)
[14:43:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004736 (10elukey) @Jhancock.wm Interesting! The absence of Console Redirection is new... Did you find anything in the BIOS about the console redirection by any chance?
[14:43:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11004751 (10Jhancock.wm) I have not. I can take a closer look this afternoon.
[14:44:07] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:45:19] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004755 (10ssingh) ` sukhe@krb1002:~$ sudo manage_principals.py reset-password htriedman --email_address=htriedman-ctr@wikimedia.org Password reset successfully. Successfully sent...
[14:48:21] <wikibugs>	 (03PS1) 10Effie Mouzeli: dsh.yaml: removed conftool entries for testservers [puppet] - 10https://gerrit.wikimedia.org/r/1169673
[14:50:11] <wikibugs>	 (03CR) 10Cathal Mooney: WIP: Ganeti Bird BGP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169662 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[14:53:16] <wikibugs>	 (03PS1) 10Btullis: Bump hive metastore heap to support the refine migration [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845)
[14:54:39] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6275/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis)
[14:55:07] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] cache::haproxy: Provide X-Trusted-Request score [puppet] - 10https://gerrit.wikimedia.org/r/1169621 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[14:56:15] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.188.2" for 1 host(s)
[14:56:48] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Apply requestctl rules based on X-Trusted-Request [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058)
[14:57:10] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.188.2" completed for 1 hosts
[14:58:00] <wikibugs>	 (03CR) 10Vgutierrez: "text tests are happy: `0 tests failed, 0 tests skipped, 39 tests passed`" [puppet] - 10https://gerrit.wikimedia.org/r/1169664 (https://phabricator.wikimedia.org/T399058) (owner: 10Vgutierrez)
[14:58:32] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Bump hive metastore heap to support the refine migration [puppet] - 10https://gerrit.wikimedia.org/r/1169675 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis)
[15:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1500).
[15:00:37] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet with reason: Phorge upgrade
[15:02:05] <jynus>	 !log stop replica @ db1217:m3, db2160:m3 T370266
[15:02:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:09] <stashbot>	 T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266
[15:03:26] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1004.eqiad.wmnet with reason: version upgrade
[15:03:53] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab2002.codfw.wmnet with reason: version upgrade
[15:05:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] phabricator deployment: skip storage upgrade during deploy [puppet] - 10https://gerrit.wikimedia.org/r/1169654 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:06] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11004989 (10Htriedman) This seems to have worked! Thank you for the lightning-fast response time :)
[15:09:01] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266
[15:09:05] <stashbot>	 T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266
[15:09:07] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:39] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@ed8270c]: test deploy phab2002 for T370266 (duration: 00m 38s)
[15:10:51] <wikibugs>	 10SRE-Access-Requests, 06Data-Engineering: Requesting a kerberos identity - htriedman - https://phabricator.wikimedia.org/T398501#11005013 (10ssingh) 05Open→03Resolved a:03ssingh
[15:11:38] <mutante>	 !log phabricator version upgrade in progress - expect short downtime
[15:11:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:06] <wikibugs>	 (03PS1) 10Btullis: Fail over hive services to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1169683 (https://phabricator.wikimedia.org/T369845)
[15:12:14] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@ed8270c]: deploy phab1004 for T370266
[15:12:44] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@ed8270c]: deploy phab1004 for T370266 (duration: 00m 30s)
[15:13:05] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Fail over hive services to an-coord1004 [dns] - 10https://gerrit.wikimedia.org/r/1169683 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis)
[15:13:18] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[15:14:12] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[15:14:50] <logmsgbot>	 andrew@cumin2002 reimage (PID 3814243) is awaiting input
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:10] <wikibugs>	 (03PS1) 10Elukey: pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686
[15:18:20] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6276/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey)
[15:19:40] <wikibugs>	 (03CR) 10Herron: [C:03+1] pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey)
[15:20:20] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] pyrra: fix istio latency SLI metric selector [puppet] - 10https://gerrit.wikimedia.org/r/1169686 (owner: 10Elukey)
[15:28:36] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[15:29:05] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[15:36:38] <icinga-wm>	 PROBLEM - snapshot of s3 in codfw on backupmon1001 is CRITICAL: Last snapshot for s3 at codfw (db2239) taken on 2025-07-14 08:26:30 is 1197 GiB, but the previous one was 1999 GiB, a change of -40.1 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:36:41] <wikibugs>	 (03PS1) 10Brennen Bearnes: Revert "phabricator deployment: skip storage upgrade during deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266)
[15:36:57] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "phabricator deployment: skip storage upgrade during deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes)
[15:36:59] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] Revert "phabricator deployment: skip storage upgrade during deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1169692 (https://phabricator.wikimedia.org/T370266) (owner: 10Brennen Bearnes)
[15:38:52] <wikibugs>	 (03PS1) 10Ebernhardson: Repoint oss.sonatype.org to repo1.maven.org [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169693
[15:38:52] <wikibugs>	 (03PS1) 10Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162)
[15:39:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:42:10] <mutante>	 !log phabricator version upgrade finished
[15:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:50] <thcipriani>	 \o/
[15:45:04] <wikibugs>	 (03PS1) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696
[15:46:21] <logmsgbot>	 andrew@cumin2002 reimage (PID 3847355) is awaiting input
[15:46:58] <jynus>	 !log start replica @ db1217:m3, db2160:m3 T370266
[15:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:04] <stashbot>	 T370266: Update to Phorge upstream 2024.35 release - https://phabricator.wikimedia.org/T370266
[15:47:18] <Mvolz>	 Hey, how do we check whether something got deployed? This was +2ed but I wasn't around to check it on mwdebug... https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1164179 would it have gone out automatically the last time we did a config change, or can I just put it on a deployment window at some point? 
[15:48:52] <wikibugs>	 (03PS1) 10Btullis: Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697
[15:49:14] <wikibugs>	 (03PS2) 10Btullis: Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845)
[15:51:15] <zabe>	 Mvolz: in mediawiki-config, changes are only merged when they get deployed. And since your change did not get reverted, it should be live.
[15:51:23] <wikibugs>	 (03PS2) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696
[15:51:47] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis)
[15:51:58] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Revert "Fail over hive services to an-coord1004" [dns] - 10https://gerrit.wikimedia.org/r/1169697 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis)
[15:52:02] <logmsgbot>	 !log btullis@dns1004 START - running authdns-update
[15:52:13] <zabe>	 You can always take a look at /srv/mediawiki/ on deploy1003.eqiad.wmnet to see what the currently live code is
[15:52:54] <logmsgbot>	 !log btullis@dns1004 END - running authdns-update
[15:55:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005152 (10BCornwall) 05Resolved→03Open Ah, @VRiley-WMF, it seems that connectivity is no longer through the Mellanox card:  ` [    9.128067] mlx5_core 0000:3b:00.0: Port module event: module 0...
[15:55:57] <wikibugs>	 (03PS2) 10Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162)
[15:55:57] <wikibugs>	 (03PS3) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696
[16:00:01] <wikibugs>	 (03PS6) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550)
[16:00:05] <jouncebot>	 jhathaway and moritzm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:14] <wikibugs>	 (03CR) 10BCornwall: varnish: Implement translation analytics vars (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[16:00:34] <wikibugs>	 (03PS4) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696
[16:03:32] <wikibugs>	 (03CR) 10Ebernhardson: "recheck" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 (owner: 10Ebernhardson)
[16:03:42] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:03:52] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:36] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.126 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:05:42] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54225 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:08:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11005227 (10VRiley-WMF) Swapped both of the failed SSDs with spares. Will await the reimage.
[16:10:36] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet
[16:10:38] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db[2160,2234].codfw.wmnet,db[1217,1250].eqiad.wmnet
[16:17:53] <wikibugs>	 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11005301 (10Mvolz) I think it's less likely it's miscalculated and more likely it's just bad. Does it seem really very different from https://graf...
[16:18:57] <wikibugs>	 (03Abandoned) 10Pppery: Update translations [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164703 (owner: 10Pppery)
[16:19:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: maps: Add tegola user in DB, mark tilerator for removal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169222 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[16:21:54] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[16:22:00] <wikibugs>	 (03CR) 10Aklapper: "Yeah, sorry - moving targets and priorities :(" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1164703 (owner: 10Pppery)
[16:22:21] <wikibugs>	 (03CR) 10Mvolz: "Usually config changes don't get +2ed unless they are ready to deploy so they can be tested on mwdebug before deploying during a deployment wi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1164179 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[16:22:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae)
[16:25:19] <icinga-wm>	 PROBLEM - mysqld processes #page on es1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[16:25:19] <icinga-wm>	 PROBLEM - MariaDB read only es1 on es1032 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:25:52] <federico3>	 looking
[16:26:01] <akosiaris>	 I am around if you need help
[16:26:04] <cwhite>	 hmm
[16:26:50] <mutante>	 here
[16:27:00] <mutante>	 trying to connect
[16:27:18] <federico3>	 cwhite: I see you logged in, are you making any change?
[16:27:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005406 (10VRiley-WMF) Originally put the cable into the onboard port. Once it was able to reimage, that's when I just moved it over. It should be all set now.
[16:27:41] <cwhite>	 no, still investigating
[16:27:59] <mutante>	 acking to prevent escalation to batphone..for now
[16:28:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:28:59] <mutante>	 ● wmf_auto_restart_prometheus-mysqld-exporter.service loaded failed failed
[16:29:07] <mutante>	 this is the only failed unit i see
[16:29:11] <mutante>	 not mysqld itself?
[16:29:19] <cwhite>	 I don't see this host in orchestrator
[16:29:20] <jynus>	 that's a downtime expiration: 1d 0h 1m 57s
[16:29:26] <mutante>	 ah!
[16:29:41] <mutante>	 what is a reasonable time frame to extend it?
[16:29:44] <mutante>	 few more days?
[16:29:44] <jynus>	 my guess, I don't know
[16:30:06] <jynus>	 double check it is depooled
[16:30:22] <federico3>	 yes
[16:30:40] <jynus>	 then no emergency
[16:30:49] <mutante>	 actually I don't see any systemd service here called mysql or maria
[16:30:51] <mutante>	 ok
[16:31:05] <federico3>	 yet it's in es1
[16:31:36] <jynus>	 mysql has been down there at least for 24 hours
[16:32:27] <mutante>	 searching for tickets
[16:32:50] <mutante>	 https://phabricator.wikimedia.org/P75467
[16:33:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and 2a02:ec80:700:fe0b::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:33:18] <jynus>	 Apr 28 2025
[16:33:45] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm
[16:33:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005483 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm
[16:34:12] <mutante>	 https://phabricator.wikimedia.org/T391921 is closed but I left a comment there
[16:34:43] <mutante>	 jynus: you saw that date somewhere outside the paste bin above? then it matches
[16:36:28] <logmsgbot>	 !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on es1032.eqiad.wmnet with reason: T391921
[16:36:31] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[16:36:39] <mutante>	 !log downtiming es1032 for 3 days - expired downtime for T391921?
[16:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:42] <mutante>	 mariadb-common is 10.11.11 on that host, like what that ticket was about
[16:39:03] <mutante>	 alright, with a new downtime and a comment on that ticket.. and no emergency.. I will declare it no incident
[16:40:57] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[16:44:05] <wikibugs>	 (03PS1) 10Pppery: Update source strings to 2024.35 [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1169700 (https://phabricator.wikimedia.org/T399604)
[16:44:17] <wikibugs>	 06SRE, 10SRE-SLO, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#11005556 (10Mvolz) Here's an example: on July 14 from 12:41 to 12:42 we received 63 requests for www.espncricinfo.com which all failed. (403 forbi...
[16:45:44] <wikibugs>	 (03CR) 10Pppery: "Cc abijeet for awareness; this is going to require a non-trivial amount of manual rename processing on the translatewiki side. Less than t" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1169700 (https://phabricator.wikimedia.org/T399604) (owner: 10Pppery)
[16:49:06] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Built locally with docker-pkg." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French)
[16:49:34] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Thank you both for the reviews!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French)
[16:49:35] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: rebuild to pick up new php version [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169170 (owner: 10Scott French)
[16:53:15] <swfrench-wmf>	 FYI, please refrain from starting any new mediawiki deployments, as I'll be deploying at the top of the hour to pick up a new production image
[16:54:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:55:07] <wikibugs>	 (03PS1) 10Peter Fischer: Bump flink to 1.20.1. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159)
[16:57:43] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1017
[16:57:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:58:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017
[16:59:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'update es1032', diff saved to https://phabricator.wikimedia.org/P79117 and previous config saved to /var/cache/conftool/dbconfig/20250715-165930-fceratto.json
[17:00:05] <jouncebot>	 swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1700).
[17:00:14] <swfrench-wmf>	 o/
[17:01:09] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Rebuild to pick up new php8.1 production image
[17:02:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:04:12] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[17:04:18] <icinga-wm>	 RECOVERY - mysqld processes #page on es1032 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[17:04:37] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[17:04:49] <icinga-wm>	 RECOVERY - MariaDB read only es1 on es1032 is OK: Version 10.11.13-MariaDB-log, Uptime 38s, read_only: True, event_scheduler: True, 6.94 QPS, connection latency: 0.034040s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[17:04:55] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.netbox
[17:07:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Pooling in after update es1032', diff saved to https://phabricator.wikimedia.org/P79118 and previous config saved to /var/cache/conftool/dbconfig/20250715-170724-fceratto.json
[17:07:32] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:09:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:09:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Set es1032 back as master', diff saved to https://phabricator.wikimedia.org/P79119 and previous config saved to /var/cache/conftool/dbconfig/20250715-170919-fceratto.json
[17:09:44] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs1017
[17:09:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1017
[17:14:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.provision for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:22:54] <logmsgbot>	 brett@cumin2002 provision (PID 3897999) is awaiting input
[17:24:24] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs1017.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[17:26:00] <logmsgbot>	 andrew@cumin2002 reimage (PID 3893701) is awaiting input
[17:34:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1017.eqiad.wmnet with OS bookworm
[17:34:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005847 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm
[17:34:53] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Rebuild to pick up new php8.1 production image (duration: 34m 16s)
[17:49:25] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs1017.eqiad.wmnet with OS bookworm
[17:49:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005905 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm executed with errors: - lvs1017 (**FAIL**)   - Downtimed...
[17:51:32] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[17:54:42] <swfrench-wmf>	 jouncebot: nowandnext
[17:54:42] <jouncebot>	 For the next 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1700)
[17:54:42] <jouncebot>	 In 0 hour(s) and 5 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1800)
[17:54:43] <wikibugs>	 (03PS5) 10Ssingh: add start of recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (owner: 10CDobbins)
[17:54:43] <wikibugs>	 (03CR) 10Ssingh: "Nice first attempt!" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (owner: 10CDobbins)
[17:55:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1017.eqiad.wmnet with reason: host reimage
[17:55:15] <wikibugs>	 (03CR) 10Ssingh: add start of recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (owner: 10CDobbins)
[17:56:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset for OKryva-WMF - https://phabricator.wikimedia.org/T399436#11005916 (10ssingh) 05Open→03Resolved a:03ssingh Things look fine so marking as resolved; please re-open if there are an...
[17:58:44] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Stop building buster-based webserver flavour images - T378128
[17:58:49] <stashbot>	 T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128
[17:59:25] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:00:04] <jouncebot>	 dancy and andre: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T1800).
[18:00:19] <dancy>	 o/ 
[18:00:37] <wikibugs>	 (03CR) 10Bking: [C:03+2] Bump flink to 1.20.1. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer)
[18:00:48] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Bump flink to 1.20.1. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1169704 (https://phabricator.wikimedia.org/T398159) (owner: 10Peter Fischer)
[18:01:00] <swfrench-wmf>	 dancy: my deploy should wrap up momentarily. appears to have worked as expected.
[18:01:05] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Stop building buster-based webserver flavour images - T378128 (duration: 02m 21s)
[18:01:47] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Hi folks: Checking if you want Traffic to merge this? Happy to but asking in case you are waiting for something." [dns] - 10https://gerrit.wikimedia.org/r/1164124 (https://phabricator.wikimedia.org/T362985) (owner: 10Slyngshede)
[18:02:13] <swfrench-wmf>	 dancy: I should be out of your way now
[18:03:08] <dancy>	 Thanks.  Running the train via spiderpig today
[18:03:40] <andre>	 uh
[18:04:13] <dancy>	 oh Andre, do you want to press the button?
[18:04:25] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:54] <andre>	 dancy: ehehe I'm already a bit braindead today (Phab deploy) but maybe tomorrow?
[18:05:43] <dancy>	 haha ok
[18:06:39] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180)
[18:06:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot)
[18:07:35] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169715 (https://phabricator.wikimedia.org/T392180) (owner: 10TrainBranchBot)
[18:08:27] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "@krinkle@fastmail.com: I am going to merge this chain after code review. Any concerns with that? I know they are cherry-picked to beta but" [puppet] - 10https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:08:34] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - 10https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: 10Krinkle)
[18:09:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:10:15] <wikibugs>	 (03CR) 10Krinkle: "Sounds good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:10:24] <inflatador>	 !log bking@build2001 /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*flink*' T398159
[18:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:35] <stashbot>	 T398159: SUP: Use flink 1.20.1 - https://phabricator.wikimedia.org/T398159
[18:11:01] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:11:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] deployment-prep: Add Apache vhost aliases for *.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1153764 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:11:14] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] beta: Document beta-specific "w.beta.wmcloud.org" handling [puppet] - 10https://gerrit.wikimedia.org/r/1160441 (https://phabricator.wikimedia.org/T396012) (owner: 10Krinkle)
[18:11:59] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1017.eqiad.wmnet with OS bookworm
[18:12:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005979 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs1017.eqiad.wmnet with OS bookworm completed: - lvs1017 (**PASS**)   - Removed from Puppet...
[18:12:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:12:52] <wikibugs>	 (03PS3) 10Krinkle: beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318)
[18:15:05] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] beta: Add redirect for upload.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1167268 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:15:52] <wikibugs>	 (03PS2) 10Krinkle: beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318)
[18:16:28] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:16:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#11005982 (10BCornwall) 05Open→03Resolved The link was re-connected to the Mellanox card; We then reconfigured the interface with:  ` $ sudo -i cookbook sre.dns.netbox -t T387145 'update lvs1...
[18:16:49] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:18:28] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] beta: Change beta.wmcloud.org stub redirect to new Meta-Wiki canonical [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:18:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] "Chain merged, thanks for the patches!" [puppet] - 10https://gerrit.wikimedia.org/r/1167273 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle)
[18:19:05] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.10  refs T392180
[18:19:11] <stashbot>	 T392180: 1.45.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T392180
[18:19:15] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:28:45] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1173 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1169723 (https://phabricator.wikimedia.org/T399619)
[18:28:50] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1169725 (https://phabricator.wikimedia.org/T399619)
[18:29:07] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:30:40] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance
[18:30:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79120 and previous config saved to /var/cache/conftool/dbconfig/20250715-183047-marostegui.json
[18:30:56] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[18:31:26] <wikibugs>	 (03PS1) 10Legoktm: admin: temporarily disable legoktm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1169727
[18:36:00] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Revert "Gerrit: Set cache for groups" [puppet] - 10https://gerrit.wikimedia.org/r/1169658 (owner: 10Hashar)
[18:36:08] <wikibugs>	 (03PS1) 10Ahmon Dancy: logspam.pl: Consolidate T322462 log messages [puppet] - 10https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462)
[18:36:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logspam.pl: Consolidate T322462 log messages [puppet] - 10https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462) (owner: 10Ahmon Dancy)
[18:37:40] <wikibugs>	 06SRE, 10Beta-Cluster-Infrastructure, 06serviceops, 10Wikidata, 10wmde-wikidata-tech: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#11006122 (10Krinkle) 05Open→03Resolved a:03Krinkle This appears to be working now, and seemingly has been for a wh...
[18:37:56] <wikibugs>	 (03PS2) 10Ahmon Dancy: logspam.pl: Consolidate T322462 log messages [puppet] - 10https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462)
[18:38:35] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: temporarily disable legoktm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1169727 (owner: 10Legoktm)
[18:39:15] <wikibugs>	 (03PS28) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var [puppet] - 10https://gerrit.wikimedia.org/r/1154085
[18:39:43] <wikibugs>	 (03CR) 10CDobbins: varnish: Replace X-RB-NOREDIR with rb_noredir var (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1154085 (owner: 10CDobbins)
[18:39:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:44:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] logspam.pl: Consolidate T322462 log messages [puppet] - 10https://gerrit.wikimedia.org/r/1169728 (https://phabricator.wikimedia.org/T322462) (owner: 10Ahmon Dancy)
[18:45:06] <wikibugs>	 (03PS1) 10Ebernhardson: Move repository to gitlab [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169730 (https://phabricator.wikimedia.org/T399617)
[18:47:03] <mutante>	 puppet is still failing on all(?) analytics hosts
[18:47:31] <mutante>	 seems an alerting issue
[18:50:19] <wikibugs>	 (03PS1) 10Eevans: aqs1022: default to partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1169733
[18:52:06] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[18:52:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[18:54:52] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1022: default to partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/1169733 (owner: 10Eevans)
[18:56:41] <wikibugs>	 (03PS1) 10DDesouza: Undeploy Readers Use Cases Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870)
[18:57:18] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza)
[18:57:56] <icinga-wm>	 RECOVERY - Host aqs1012 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[19:01:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79121 and previous config saved to /var/cache/conftool/dbconfig/20250715-190120-marostegui.json
[19:01:25] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[19:04:58] <wikibugs>	 (03Abandoned) 10Ebernhardson: WIP: Change packaging format to 3.0 (native) [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169696 (owner: 10Ebernhardson)
[19:05:15] <wikibugs>	 (03Abandoned) 10Ebernhardson: Repoint oss.sonatype.org to repo1.maven.org [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169693 (owner: 10Ebernhardson)
[19:05:23] <wikibugs>	 (03Abandoned) 10Ebernhardson: Update plugins for bugfix to extended regex support [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1169694 (https://phabricator.wikimedia.org/T399162) (owner: 10Ebernhardson)
[19:09:07] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:09:41] <wikibugs>	 (03PS1) 10Krinkle: multiversion: Fix "Class Wikimedia\MWConfig\Exception not found" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169737
[19:09:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[19:16:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P79122 and previous config saved to /var/cache/conftool/dbconfig/20250715-191627-marostegui.json
[19:17:06] <wikibugs>	 (03PS1) 10Eevans: aqs1012: perform a complete reimage [puppet] - 10https://gerrit.wikimedia.org/r/1169739 (https://phabricator.wikimedia.org/T396970)
[19:19:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:33] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:24:07] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:25:05] <wikibugs>	 (03PS1) 10Eevans: aqs1012: perform a complete reimage (JBOD) [puppet] - 10https://gerrit.wikimedia.org/r/1169742 (https://phabricator.wikimedia.org/T396970)
[19:27:43] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1012: perform a complete reimage (JBOD) [puppet] - 10https://gerrit.wikimedia.org/r/1169742 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans)
[19:28:53] <wikibugs>	 (03Abandoned) 10Ssingh: C:bird::anycast_healthchecker: notify service on conf file change [puppet] - 10https://gerrit.wikimedia.org/r/1166238 (owner: 10Ssingh)
[19:31:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P79123 and previous config saved to /var/cache/conftool/dbconfig/20250715-193134-marostegui.json
[19:33:14] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:33:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[19:37:50] <wikibugs>	 (03PS4) 10Scott French: httpd: Rebase on bookworm and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128)
[19:40:00] <wikibugs>	 (03CR) 10Ottomata: "Great, okay, should I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[19:41:39] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:41:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006281 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**)   - Removed from Puppet...
[19:42:08] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:42:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006282 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[19:46:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T399249)', diff saved to https://phabricator.wikimedia.org/P79124 and previous config saved to /var/cache/conftool/dbconfig/20250715-194642-marostegui.json
[19:46:46] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[19:46:58] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance
[19:47:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79125 and previous config saved to /var/cache/conftool/dbconfig/20250715-194704-marostegui.json
[19:49:25] <wikibugs>	 (03PS1) 10Scott French: shellbox: bump image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1169752
[19:53:32] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[19:53:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**)   - Removed from Puppet...
[19:53:53] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[19:54:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006323 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T2000).
[20:00:05] <jouncebot>	 NovemLinguae and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <NovemLinguae>	 o/
[20:01:24] <zabe>	 I can deploy
[20:01:51] <NovemLinguae>	 ty :)
[20:05:11] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[20:05:27] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[20:05:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006406 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**)   - Removed from Puppet...
[20:05:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006413 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:06:16] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae)
[20:06:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae)
[20:06:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:07:05] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169298 (https://phabricator.wikimedia.org/T398080) (owner: 10Novem Linguae)
[20:07:26] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169298|Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" (T398080 T399372)]]
[20:07:32] <stashbot>	 T398080: Set $wgSecurePollUseMediaWikiNamespace = true on English Wikipedia - https://phabricator.wikimedia.org/T398080
[20:07:32] <stashbot>	 T399372: MediaWiki\Storage\NameTableAccessException: No insert possible but primary DB didn't give us a record for 'SecurePoll' in 'content_models' - https://phabricator.wikimedia.org/T399372
[20:08:34] <NovemLinguae>	 oh i almost forgot. there's a comment in that patch about doing an SQL query instead of deploying it
[20:09:00] <NovemLinguae>	 deployer discretion though. thoughts?
[20:09:36] <logmsgbot>	 !log zabe@deploy1003 novemlinguae, zabe: Backport for [[gerrit:1169298|Revert "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" (T398080 T399372)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:09:36] <zabe>	 NovemLinguae: Is it intended that there will be a new content model?
[20:10:08] <wikibugs>	 (03PS6) 10CDobbins: add start of recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156
[20:10:28] <NovemLinguae>	 so the patch that was deployed a week ago turns on SecurePoll logging to subpages of MediaWiki:SecurePoll/*. and those pages do use a new content model, yes. the first edit of this logging on enwiki would create a new content model SecurePoll
[20:10:38] <NovemLinguae>	 due to what we suspect is a MediaWiki core bug, this is throwing an exception
[20:10:54] <NovemLinguae>	 we suspect that the SQL query would solve the issue, OR we can revert the patch from a week ago
[20:11:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:14:11] <zabe>	 Let me take a quick look
[20:16:41] <zabe>	 huh
[20:17:08] <wikibugs>	 (03PS7) 10CDobbins: add start of recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156
[20:17:17] <zabe>	 https://phabricator.wikimedia.org/P79126
[20:17:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79127 and previous config saved to /var/cache/conftool/dbconfig/20250715-201715-marostegui.json
[20:17:22] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[20:17:26] <zabe>	 NovemLinguae: apparently it took a few tries ^
[20:18:03] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6287/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (owner: 10CDobbins)
[20:18:05] <NovemLinguae>	 sounds like you ran the query. let me go test if it worked
[20:18:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[20:18:16] <zabe>	 yes
[20:18:31] <wikibugs>	 (03PS1) 10Eevans: aqs1012: must use partman/raid1-2dev-efi.cfg preseed [puppet] - 10https://gerrit.wikimedia.org/r/1169762 (https://phabricator.wikimedia.org/T396970)
[20:19:37] <NovemLinguae>	 alright, the query worked. the logging is working now. https://en.wikipedia.org/w/index.php?title=MediaWiki:SecurePoll/834/msg/en&action=history
[20:19:49] <NovemLinguae>	 we can abort the revert patch backport
[20:19:56] <logmsgbot>	 !log eevans@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[20:20:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**)...
[20:20:21] <wikibugs>	 (03PS8) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[20:20:36] <NovemLinguae>	 if you have any insights about this weird bug feel free to post in https://phabricator.wikimedia.org/T399372
[20:20:52] <zabe>	 yup
[20:21:01] <logmsgbot>	 !log zabe@deploy1003 Sync cancelled.
[20:21:16] <wikibugs>	 (03PS1) 10Zabe: Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169763
[20:21:21] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169763 (owner: 10Zabe)
[20:21:32] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1012: must use partman/raid1-2dev-efi.cfg preseed [puppet] - 10https://gerrit.wikimedia.org/r/1169762 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans)
[20:22:12] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169763 (owner: 10Zabe)
[20:22:17] <NovemLinguae>	 sorry for the curveball. thanks for fixing it :)
[20:22:31] <zabe>	 no problem
[20:22:44] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Undeploy Readers Use Cases Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza)
[20:22:50] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2007-dev.codfw.wmnet with OS bookworm
[20:23:22] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[20:23:23] <wikibugs>	 (03PS9) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[20:23:33] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy Readers Use Cases Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169734 (https://phabricator.wikimedia.org/T398870) (owner: 10DDesouza)
[20:23:36] <wikibugs>	 (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[20:24:18] <danisztls>	 sorry I was unable to login earlier
[20:24:34] <zabe>	 no problem
[20:24:39] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6288/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[20:25:07] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]]
[20:25:17] <stashbot>	 T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870
[20:25:25] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] profile::httpd: include prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn)
[20:27:18] <logmsgbot>	 !log zabe@deploy1003 dani, zabe: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:27:33] <zabe>	 danisztls: is it possible to test your patch?
[20:27:50] <danisztls>	 zabe: yes
[20:28:06] <danisztls>	 zabe: looks good
[20:28:09] <zabe>	 nice
[20:28:11] <zabe>	 syncing
[20:28:11] <logmsgbot>	 !log zabe@deploy1003 dani, zabe: Continuing with sync
[20:30:21] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[20:30:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye
[20:30:43] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "This falls squarely in "should be fine" territory, but it wouldn't hurt to do mildly carefully [0]. If you'd like, I can merge this and ta" [puppet] - 10https://gerrit.wikimedia.org/r/1111239 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[20:32:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P79128 and previous config saved to /var/cache/conftool/dbconfig/20250715-203224-marostegui.json
[20:33:35] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169763|Revert^2 "initialiseSettings: set wgSecurePollUseMediaWikiNamespace = true for enwiki"]], [[gerrit:1169734|Undeploy Readers Use Cases Survey (T398870)]] (duration: 08m 27s)
[20:33:42] <stashbot>	 T398870: Open-ended survey of enwiki readers - https://phabricator.wikimedia.org/T398870
[20:33:45] <zabe>	 danisztls: should be live
[20:35:07] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "watched it being added on puppetserver1001" [puppet] - 10https://gerrit.wikimedia.org/r/1167962 (https://phabricator.wikimedia.org/T187434) (owner: 10Dzahn)
[20:37:51] <danisztls>	 zabe: thanks
[20:39:17] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[20:44:49] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage
[20:45:29] <zabe>	 yw
[20:47:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P79129 and previous config saved to /var/cache/conftool/dbconfig/20250715-204732-marostegui.json
[20:48:59] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: host reimage
[20:50:09] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[20:50:49] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[20:53:05] <wikibugs>	 (03PS1) 10Zabe: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765
[20:53:13] <wikibugs>	 (03PS1) 10Zabe: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766
[20:53:32] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765 (owner: 10Zabe)
[20:53:36] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766 (owner: 10Zabe)
[20:54:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:55:11] <wikibugs>	 (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169769 (https://phabricator.wikimedia.org/T399579)
[20:56:31] <wikibugs>	 (03PS5) 10Scott French: httpd: Rebase on bookworm and cascade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1162030 (https://phabricator.wikimedia.org/T378128)
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250715T2100)
[21:02:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T399249)', diff saved to https://phabricator.wikimedia.org/P79130 and previous config saved to /var/cache/conftool/dbconfig/20250715-210240-marostegui.json
[21:02:44] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[21:02:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance
[21:02:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T399249)', diff saved to https://phabricator.wikimedia.org/P79131 and previous config saved to /var/cache/conftool/dbconfig/20250715-210251-marostegui.json
[21:05:10] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1012.eqiad.wmnet with OS bullseye
[21:05:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#11006694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1003 for host aqs1012.eqiad.wmnet with OS bullseye completed: - aqs1012 (**PASS**)   - Removed from Puppet and PuppetD...
[21:06:56] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[21:08:50] <wikibugs>	 (03Merged) 10jenkins-bot: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1169765 (owner: 10Zabe)
[21:08:55] <wikibugs>	 (03Merged) 10jenkins-bot: Also join linktarget on namespace to allow index usage [core] (wmf/1.45.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1169766 (owner: 10Zabe)
[21:09:33] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]]
[21:10:26] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd2007-dev.codfw.wmnet with reason: host reimage
[21:11:39] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:12:28] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[21:13:42] <wikibugs>	 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006699 (10Dzahn) deployed. watched the prometheus apache exporter getting installed on puppetserver1001. no issues there.  It's also running on config-master1001 now.
[21:14:07] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:13] <wikibugs>	 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006700 (10Dzahn) 05Open→03Resolved a:03Dzahn
[21:17:15] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrus: Drop absented periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1169209 (owner: 10Ebernhardson)
[21:17:16] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+2] cirrus: Drop absented periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/1169209 (owner: 10Ebernhardson)
[21:17:44] <wikibugs>	 (03PS1) 10Eevans: aqs1012: setup data directories for 8-ssd JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970)
[21:17:46] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169765|Also join linktarget on namespace to allow index usage]], [[gerrit:1169766|Also join linktarget on namespace to allow index usage]] (duration: 08m 12s)
[21:18:35] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans)
[21:22:43] <wikibugs>	 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#11006731 (10Dzahn) regarding my previous comments about cloud VPS:  While some instances / projects will use the httpd module.. nothing seems to include the `profile::htt...
[21:23:16] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs1012: setup data directories for 8-ssd JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1169773 (https://phabricator.wikimedia.org/T396970) (owner: 10Eevans)
[21:26:54] <wikibugs>	 (03PS1) 10Zabe: CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636)
[21:26:56] <wikibugs>	 (03PS1) 10Zabe: IS: Undeploy Interwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776
[21:26:56] <wikibugs>	 (03PS1) 10Zabe: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636)
[21:27:24] <wikibugs>	 (03PS2) 10Zabe: IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636)
[21:28:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd2007-dev.codfw.wmnet with OS bullseye
[21:29:38] <wikibugs>	 (03CR) 10Zabe: [C:03+2] CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[21:30:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T399249)', diff saved to https://phabricator.wikimedia.org/P79132 and previous config saved to /var/cache/conftool/dbconfig/20250715-213021-marostegui.json
[21:30:26] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[21:30:35] <wikibugs>	 (03Merged) 10jenkins-bot: CS: Undeploy Interwiki (step 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169775 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[21:31:49] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]]
[21:31:53] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[21:33:58] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:35:19] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[21:39:07] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service aqs1012-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:40:45] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169775|CS: Undeploy Interwiki (step 1) (T399636)]] (duration: 08m 55s)
[21:40:49] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[21:44:08] <wikibugs>	 (03CR) 10Zabe: [C:03+2] IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[21:45:06] <wikibugs>	 (03PS2) 10Zabe: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636)
[21:45:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P79133 and previous config saved to /var/cache/conftool/dbconfig/20250715-214528-marostegui.json
[21:48:41] <wikibugs>	 (03Merged) 10jenkins-bot: IS: Undeploy Interwiki (step 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169776 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[21:49:12] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) (T399636)]]
[21:49:16] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[21:51:22] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) (T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:52:23] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[21:57:55] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169776|IS: Undeploy Interwiki (step 2) (T399636)]] (duration: 08m 42s)
[21:57:59] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[21:59:25] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:00:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P79134 and previous config saved to /var/cache/conftool/dbconfig/20250715-220036-marostegui.json
[22:11:08] <wikibugs>	 (03CR) 10Zabe: [C:03+2] extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[22:11:58] <wikibugs>	 (03Merged) 10jenkins-bot: extension-list: Undeploy Interwiki (step 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1169777 (https://phabricator.wikimedia.org/T399636) (owner: 10Zabe)
[22:12:23] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]]
[22:12:27] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[22:14:33] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:15:19] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[22:15:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T399249)', diff saved to https://phabricator.wikimedia.org/P79135 and previous config saved to /var/cache/conftool/dbconfig/20250715-221543-marostegui.json
[22:15:49] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:15:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2192.codfw.wmnet with reason: Maintenance
[22:16:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79136 and previous config saved to /var/cache/conftool/dbconfig/20250715-221606-marostegui.json
[22:20:41] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1169777|extension-list: Undeploy Interwiki (step 3) (T399636)]] (duration: 08m 17s)
[22:20:48] <stashbot>	 T399636: Undeploy the Interwiki extension from WMF prod - https://phabricator.wikimedia.org/T399636
[22:27:20] <swfrench-wmf>	 !log reprepro include php-excimer_1.2.5-1+wmf11u1 php-imagick_3.7.0-13+wmf11u1 php-luasandbox_4.1.2-1+wmf11u1 php-memcached_3.3.0-1+wmf11u1 php-pcov_1.0.12-1+wmf11u1 php-redis_6.2.0-1+wmf11u1 php-uuid_1.3.0-1+wmf11u1 php-wmerrors_2.0.0-1+wmf11u1 php-yaml_2.2.4-1+wmf11u1 wikidiff2_1.14.1-2+wmf11u1 xdebug_3.4.4-1+wmf11u1 in component/php83 - T398245
[22:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:26] <stashbot>	 T398245: Prepare WMF PHP 8.3 packages for bullseye - https://phabricator.wikimedia.org/T398245
[22:29:25] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:36:46] <wikibugs>	 (CR) Dzahn: [C:+2] "still need to figure out: " failed to expand includes and copies: processing includes for variant 'sourcebot': There is no key 'sourcebot'" [container/codesearch] - https://gerrit.wikimedia.org/r/1167290 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:38:48] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on wdqs1022:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:41:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79137 and previous config saved to /var/cache/conftool/dbconfig/20250715-224117-marostegui.json
[22:41:23] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[22:43:53] <wikibugs>	 (PS1) Dzahn: add variant sourcebot to blubber file [container/codesearch] - https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199)
[22:44:24] <wikibugs>	 (CR) Dzahn: [C:+2] add variant sourcebot to blubber file [container/codesearch] - https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:44:52] <wikibugs>	 (CR) Dzahn: [C:+2] "recheck" [container/codesearch] - https://gerrit.wikimedia.org/r/1169785 (https://phabricator.wikimedia.org/T268199) (owner: Dzahn)
[22:49:08] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[22:50:16] <logmsgbot>	 !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[22:56:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P79138 and previous config saved to /var/cache/conftool/dbconfig/20250715-225624-marostegui.json
[23:09:25] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:11:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P79139 and previous config saved to /var/cache/conftool/dbconfig/20250715-231132-marostegui.json
[23:23:15] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[23:26:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T399249)', diff saved to https://phabricator.wikimedia.org/P79141 and previous config saved to /var/cache/conftool/dbconfig/20250715-232640-marostegui.json
[23:26:44] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[23:26:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2201.codfw.wmnet with reason: Maintenance
[23:38:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169789
[23:38:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169789 (owner: TrainBranchBot)
[23:50:58] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1169789 (owner: TrainBranchBot)
[23:52:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2211.codfw.wmnet with reason: Maintenance
[23:52:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T399249)', diff saved to https://phabricator.wikimedia.org/P79142 and previous config saved to /var/cache/conftool/dbconfig/20250715-235236-marostegui.json
[23:52:41] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249