[00:00:17] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050810 (owner: TrainBranchBot)
[00:04:15] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:13:41] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:25:25] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 127 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:35:13] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 74 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:46:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:47:53] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:58:51] <icinga-wm>	 PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:58:52] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368866 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:59:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368866 (ops-monitoring-bot) NEW
[01:03:37] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:03:51] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:41] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:07:27] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 352.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:10:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:14:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:15:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:19:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:28:41] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:07:41] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 48.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:09:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:31] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:15:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:31] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[02:45:33] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:33] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:32] <wikibugs>	 (PS2) David Martin: Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435)
[02:59:15] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:28] <wikibugs>	 (CR) David Martin: "The table creation patch was successfully deployed and is operating correctly on production.  Would be very helpful to get this patch land" [puppet] - https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: David Martin)
[03:06:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:07:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:16:41] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:19:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:20:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:54:27] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:58:27] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 27.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:15:40] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9938078 (Papaul) @elukey we received last week a temporary license from SuperMicro to test out Redfish, I uploaded the license to the server, you can test and let me k...
[04:19:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:20:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:32:41] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:36:35] <wikibugs>	 (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1050341 (owner: L10n-bot)
[04:49:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:49:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[04:49:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:49:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:49:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65556 and previous config saved to /var/cache/conftool/dbconfig/20240701-044945-marostegui.json
[04:49:48] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[04:50:44] <marostegui>	 !log dbmaint eqiad Rebuild pagelinks table on s8 master T364069
[04:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:51:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494
[04:51:55] <stashbot>	 T368494: Switchover m2 master db1195 -> db1228 - https://phabricator.wikimedia.org/T368494
[04:52:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494
[04:54:39] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1228 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1050814 (https://phabricator.wikimedia.org/T368494)
[04:55:44] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1228 to m2 master [puppet] - https://gerrit.wikimedia.org/r/1050814 (https://phabricator.wikimedia.org/T368494) (owner: Marostegui)
[04:56:42] <marostegui>	 !log Failover m2 from db1195 to db1228 - T368494
[04:56:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:41] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 19.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:01:24] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9938126 (Marostegui)
[05:02:01] <wikibugs>	 (PS1) Marostegui: db1195: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1050815
[05:02:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: Reboot
[05:02:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: Reboot
[05:02:45] <wikibugs>	 (CR) Marostegui: [C:+2] db1195: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1050815 (owner: Marostegui)
[05:24:16] <wikibugs>	 (CR) Ayounsi: [C:+1] "All seem reasonable and/or needed to me !" [puppet] - https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney)
[05:33:38] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:33:50] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:28] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:36:40] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52339 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:22] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 155440808 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:56:22] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 35696 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:57:15] <wikibugs>	 (PS5) Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275)
[05:57:24] <wikibugs>	 (CR) Ayounsi: Homer: fix Netbox 4 breaking changes (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:24:36] <wikibugs>	 (CR) Ayounsi: "The cookbook sends a long cli string or commands separated by semi colons. I worry that we will hit some limit at some point sending too m" [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[06:25:26] <wikibugs>	 (CR) Ayounsi: [C:+1] Add class-of-service scheduler and classifiers plus var to control [homer/public] - https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[06:33:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:33:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:33:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65557 and previous config saved to /var/cache/conftool/dbconfig/20240701-063344-marostegui.json
[06:33:48] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:35:30] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1195 to s1 [puppet] - https://gerrit.wikimedia.org/r/1050943 (https://phabricator.wikimedia.org/T368871)
[06:36:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T368871', diff saved to https://phabricator.wikimedia.org/P65558 and previous config saved to /var/cache/conftool/dbconfig/20240701-063601-root.json
[06:36:05] <stashbot>	 T368871: Move db1195 to s1 - https://phabricator.wikimedia.org/T368871
[06:36:41] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: Dzahn)
[06:36:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1169.eqiad.wmnet onto db1195.eqiad.wmnet
[06:38:59] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Move db1195 to s1 [puppet] - https://gerrit.wikimedia.org/r/1050943 (https://phabricator.wikimedia.org/T368871) (owner: Marostegui)
[06:41:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:14] <wikibugs>	 (CR) Ayounsi: [C:+1] "a few nits, but overall lgtm. Could be worth another reviewer at least for the python side of things." [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[06:59:47] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1195 [puppet] - https://gerrit.wikimedia.org/r/1050946 (https://phabricator.wikimedia.org/T368871)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T0700). nyaa~
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:48] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add db1195 [puppet] - https://gerrit.wikimedia.org/r/1050946 (https://phabricator.wikimedia.org/T368871) (owner: Marostegui)
[07:02:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1195 in s1 T368871', diff saved to https://phabricator.wikimedia.org/P65559 and previous config saved to /var/cache/conftool/dbconfig/20240701-070243-marostegui.json
[07:02:47] <stashbot>	 T368871: Move db1195 to s1 - https://phabricator.wikimedia.org/T368871
[07:07:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:08:24] <wikibugs>	 (CR) Urbanecm: [C:+1] "LGTM now" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: Jon Harald Søby)
[07:12:56] <wikibugs>	 (PS1) Slyngshede: data.yaml: Offboarding of akhatun [puppet] - https://gerrit.wikimedia.org/r/1050949
[07:18:46] <wikibugs>	 (PS1) Slyngshede: data.yaml: Offboarding for febinbellamy [puppet] - https://gerrit.wikimedia.org/r/1050951
[07:19:28] <wikibugs>	 (CR) CI reject: [V:-1] data.yaml: Offboarding for febinbellamy [puppet] - https://gerrit.wikimedia.org/r/1050951 (owner: Slyngshede)
[07:21:03] <wikibugs>	 (PS2) Slyngshede: data.yaml: Offboarding for febinbellamy [puppet] - https://gerrit.wikimedia.org/r/1050951
[07:25:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:27:20] <icinga-wm>	 PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 12401 MB (5% inode=73%): /tmp 12401 MB (5% inode=73%): /var/tmp 12401 MB (5% inode=73%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops
[07:29:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:36:20] <wikibugs>	 (CR) Brouberol: "l" [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[07:41:48] <wikibugs>	 (CR) Brouberol: [C:+1] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[07:44:51] <elukey>	 !log `apt-get clean` on build2001 to free some space in the root partition
[07:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:00] <wikibugs>	 (CR) Brouberol: [C:+1] "Looks good!" [deployment-charts] - https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: Scott French)
[07:49:27] <wikibugs>	 (CR) Brouberol: dse-k8s-services: Add net-new chart for Airflow (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[07:58:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:06:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1169.eqiad.wmnet onto db1195.eqiad.wmnet
[08:10:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:13:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65560 and previous config saved to /var/cache/conftool/dbconfig/20240701-081307-root.json
[08:13:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:13:57] <wikibugs>	 (PS1) Marostegui: db1195: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1051059 (https://phabricator.wikimedia.org/T368871)
[08:14:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:14:21] <wikibugs>	 (CR) Marostegui: [C:+2] db1195: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1051059 (https://phabricator.wikimedia.org/T368871) (owner: Marostegui)
[08:15:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65561 and previous config saved to /var/cache/conftool/dbconfig/20240701-081514-root.json
[08:18:12] <logmsgbot>	 !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es1025 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65562 and previous config saved to /var/cache/conftool/dbconfig/20240701-081811-jynus.json
[08:18:15] <stashbot>	 T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812
[08:21:18] <wikibugs>	 (PS1) Urbanecm: JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245)
[08:28:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65563 and previous config saved to /var/cache/conftool/dbconfig/20240701-082813-root.json
[08:28:18] <urbanecm>	 jouncebot: nowandnext
[08:28:18] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 31 minute(s)
[08:28:18] <jouncebot>	 In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000)
[08:28:29] <wikibugs>	 (CR) Urbanecm: [C:+2] JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: Urbanecm)
[08:30:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65564 and previous config saved to /var/cache/conftool/dbconfig/20240701-083020-root.json
[08:36:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: Urbanecm)
[08:36:17] <wikibugs>	 SRE-tools, conftool, Infrastructure-Foundations, Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9938525 (ABran-WMF)
[08:36:31] <wikibugs>	 SRE-tools, conftool, DBA, Infrastructure-Foundations, Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9938526 (ABran-WMF)
[08:36:34] <wikibugs>	 (CR) Btullis: [C:+2] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[08:37:31] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[08:38:11] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[08:39:52] <wikibugs>	 (CR) JMeybohm: api,rest-gateway: upgrade Envoy version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[08:39:55] <wikibugs>	 (Merged) jenkins-bot: cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[08:40:09] <wikibugs>	 (CR) JMeybohm: [C:+1] admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[08:40:47] <wikibugs>	 (PS27) DCausse: wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950)
[08:40:47] <wikibugs>	 (PS7) DCausse: wdqs: enable throttling only for requests coming from the CDN [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950)
[08:43:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65565 and previous config saved to /var/cache/conftool/dbconfig/20240701-084318-root.json
[08:44:16] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:45:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65566 and previous config saved to /var/cache/conftool/dbconfig/20240701-084525-root.json
[08:45:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:51:20] <wikibugs>	 (Merged) jenkins-bot: JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: Urbanecm)
[08:51:55] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]]
[08:51:58] <stashbot>	 T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245
[08:54:57] <wikibugs>	 (CR) DCausse: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: DCausse)
[08:56:34] <wikibugs>	 (PS1) Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973)
[08:56:35] <icinga-wm>	 PROBLEM - MariaDB disk space #page on es1025 is CRITICAL: DISK CRITICAL - /run/credentials/systemd-tmpfiles-clean.service is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:56:55] <jynus>	 ^ marostegui
[08:56:58] <volans>	 doh
[08:57:07] <jynus>	 it is depooled, I am backing it up
[08:57:12] <fabfur>	 ack!
[08:57:18] <jynus>	 but why did it alert?
[08:57:35] <icinga-wm>	 RECOVERY - MariaDB disk space #page on es1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:57:36] <volans>	 ah ok false alarm
[08:57:55] <jynus>	 it seems to be a special filesystem under /run
[08:58:04] <jynus>	 no real "disk", right?
[08:58:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65567 and previous config saved to /var/cache/conftool/dbconfig/20240701-085824-root.json
[08:58:34] <jynus>	 I am going to downtime it
[08:58:41] <jynus>	 ^ slyngs fabfur
[08:58:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:58:48] <jynus>	 as it is depooled
[08:58:57] <slyngs>	 Thanks
[08:59:16] <marostegui>	 That's weird 
[08:59:32] <jynus>	 although feel free to follow up on the disk alert generalities
[08:59:43] <marostegui>	 What would be the reason that filled up?
[08:59:48] <marostegui>	 I don't get it
[08:59:55] <jynus>	 I think it is a script failure
[09:00:16] <jynus>	 possibly a bug (?)
[09:00:21] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[09:00:22] <jynus>	 of the alerting
[09:00:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65568 and previous config saved to /var/cache/conftool/dbconfig/20240701-090031-root.json
[09:00:33] <jynus>	 may need debugging, I'm in a meeting, will check later
[09:00:59] <jynus>	 maybe a race condition
[09:01:26] <marostegui>	 don't worry I will check now
[09:04:07] <wikibugs>	 (CR) CI reject: [V:-1] REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[09:06:14] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:06:17] <stashbot>	 T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245
[09:06:21] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[09:08:22] <wikibugs>	 (PS7) Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949)
[09:08:38] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127)
[09:09:41] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[09:12:32] <dcausse>	 jouncebot: nowandnext
[09:12:32] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[09:12:32] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000)
[09:13:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65569 and previous config saved to /var/cache/conftool/dbconfig/20240701-091329-root.json
[09:14:10] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]] (duration: 22m 15s)
[09:14:12] <stashbot>	 T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245
[09:15:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65570 and previous config saved to /var/cache/conftool/dbconfig/20240701-091536-root.json
[09:15:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:16:04] <urbanecm>	 dcausse: fwiw I was deploying, but I'm now done. 
[09:16:27] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: enable "new" image diff UI [puppet] - https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291) (owner: Hashar)
[09:16:34] <dcausse>	 urbanecm: ok thanks :)
[09:16:49] <duesen>	 urbanecm, dcausse: I'd like to deploy a core patch... what are you up to? This one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1051076
[09:17:54] <dcausse>	 duesen: was about to deploy something not directly related to MW, feel free to go ahead
[09:18:30] * urbanecm doesn't have anything else to deploy
[09:20:06] <duesen>	 dcausse: actually I realized that I have to run an errand now, I'll do it in a couple of hours.
[09:20:17] <dcausse>	 ack
[09:20:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:22:25] <wikibugs>	 (CR) JMeybohm: [C:+1] "Seems reasonable :)" [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[09:23:41] <wikibugs>	 (CR) Clément Goubert: [C:+2] envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[09:25:33] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:26:24] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:28:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65572 and previous config saved to /var/cache/conftool/dbconfig/20240701-092835-root.json
[09:30:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65573 and previous config saved to /var/cache/conftool/dbconfig/20240701-093042-root.json
[09:36:51] <wikibugs>	 (PS1) Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259)
[09:38:18] <wikibugs>	 (CR) Brouberol: [C:+1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:40:12] <wikibugs>	 (CR) CI reject: [V:-1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:40:26] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9938810 (Volans)
[09:40:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65574 and previous config saved to /var/cache/conftool/dbconfig/20240701-094050-marostegui.json
[09:40:53] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[09:41:41] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: DCausse)
[09:41:45] <wikibugs>	 (CR) Btullis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:42:19] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9938825 (Volans) Pending @leila 's approval.
[09:42:36] <wikibugs>	 (PS1) Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086
[09:42:49] <wikibugs>	 (CR) CI reject: [V:-1] cirrus-streaming-updater: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: DCausse)
[09:42:51] <wikibugs>	 (PS2) Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086
[09:42:53] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051086 (owner: Clément Goubert)
[09:43:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65575 and previous config saved to /var/cache/conftool/dbconfig/20240701-094341-root.json
[09:45:28] <wikibugs>	 (CR) Filippo Giunchedi: "See inline re: dashboard links, other than that LGTM" [alerts] - https://gerrit.wikimedia.org/r/1046608 (owner: Clément Goubert)
[09:45:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65576 and previous config saved to /var/cache/conftool/dbconfig/20240701-094547-root.json
[09:45:55] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] frack: Remove old ingenico/globalcollect job checks [puppet] - https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) (owner: Dwisehaupt)
[09:46:20] <wikibugs>	 (PS3) Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086
[09:46:20] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051086 (owner: Clément Goubert)
[09:46:53] <wikibugs>	 (PS2) Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259)
[09:49:38] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to 2.8 on drmrs [puppet] - https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[09:49:53] <wikibugs>	 (CR) CI reject: [V:-1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:53:01] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9938867 (fgiunchedi) Those are SSH probes from local prometheus hosts indeed, in this case the probe consists of a TCP conne...
[09:55:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P65577 and previous config saved to /var/cache/conftool/dbconfig/20240701-095557-marostegui.json
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000)
[10:00:24] <wikibugs>	 (PS1) Clément Goubert: fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949)
[10:02:10] <wikibugs>	 (CR) DCausse: [C:+1] fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:02:20] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Change gnmi sampling interval and enable timestamps for prom output [puppet] - https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney)
[10:02:21] <wikibugs>	 (CR) Clément Goubert: [C:+2] fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:02:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9938917 (fgiunchedi) >>! In T326322#9934200, @cmooney wrote: > @fgiunchedi I was perhaps a little cheeky and merged this, but it was c...
[10:03:08] <wikibugs>	 (Merged) jenkins-bot: fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[10:03:59] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] pontoon: Remove more puppet 5 leftovers [puppet] - https://gerrit.wikimedia.org/r/1047502 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[10:05:33] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:09:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:10:27] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Change gnmi sampling interval and enable timestamps for prom output [puppet] - https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: Cathal Mooney)
[10:11:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P65578 and previous config saved to /var/cache/conftool/dbconfig/20240701-101104-marostegui.json
[10:14:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9938962 (elukey) >>! In T365167#9938078, @Papaul wrote: > @elukey we received last week a temporary license from SuperMicro to test out Redfish, I upload the license...
[10:16:48] <wikibugs>	 (CR) Elukey: api,rest-gateway: upgrade Envoy version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:18:55] <wikibugs>	 (PS2) Elukey: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366)
[10:18:56] <wikibugs>	 (PS2) Elukey: admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366)
[10:18:56] <wikibugs>	 (PS2) Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366)
[10:18:57] <wikibugs>	 (PS2) Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366)
[10:19:43] <wikibugs>	 (PS3) Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259)
[10:20:12] <wikibugs>	 (PS4) Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086
[10:20:15] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051086 (owner: Clément Goubert)
[10:23:38] <fabfur>	 !log upgrading A:cp-drmrs to haproxy 2.8.10 (T367756)
[10:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:41] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[10:25:36] <wikibugs>	 (CR) Elukey: "Folks can you add a bit more info about why you need this configuration in the commit msg? I get the socket creation, but I am wondering i" [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[10:26:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65579 and previous config saved to /var/cache/conftool/dbconfig/20240701-102611-marostegui.json
[10:26:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[10:26:14] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:26:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[10:26:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65580 and previous config saved to /var/cache/conftool/dbconfig/20240701-102633-marostegui.json
[10:29:57] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table dewiki.archive: Index for table archive is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1161-bin.002587, end_log_pos 631041457 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:30:52] <marostegui>	 ^ i will get that
[10:31:21] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:34:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:34:29] <wikibugs>	 (Merged) jenkins-bot: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:35:05] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table dewiki.archive: Index for table archive is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1161-bin.002587, end_log_pos 631041457 Marostegui working on it https://wikitech.wikimedia.org/wiki/MariaDB/troubleshoo
[10:35:05] <icinga-wm>	 epooling_a_replica
[10:36:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086 (owner: Clément Goubert)
[10:36:51] <wikibugs>	 (CR) Clément Goubert: [C:+2] service::node::config::scap3: Fix scap deploy-local call [puppet] - https://gerrit.wikimedia.org/r/1051086 (owner: Clément Goubert)
[10:36:57] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:37:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[10:37:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[10:37:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[10:38:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[10:38:27] <duesen>	 I'd like to deploy a core patch in about half an hour, any objections? This one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1051076
[10:38:33] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:38:35] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:39:19] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs
[10:39:27] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs
[10:39:30] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[10:41:11] <duesen>	 marostegui, fabfur, _joe_: is it ok if I deploy a core patch in about half an hour? I'd hit +2 on it now, so it can go through CI. 
[10:41:45] <duesen>	 I can also wait for the backport window, but I'd prefer to get this out of the way early.
[10:42:33] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:43:06] <claime>	 !log running puppet on maps servers
[10:43:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:35] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:44:02] <wikibugs>	 (PS2) Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973)
[10:44:05] <marostegui>	 duesen: fine by me :)
[10:44:32] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] "prepare for backport deployment" [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[10:46:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:46:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:47:02] <claime>	 !log running /usr/local/bin/apply-config-kartotherian on maps-replica
[10:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[10:48:30] <_joe_>	 duesen: I think you're good to go, but next time ask in #serviceops
[10:49:27] <claime>	 !log running /usr/local/bin/apply-config-kartotherian on maps-master
[10:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:32] <duesen>	 _joe_: ok. I keep getting confused about where to ask. tech, sre, serviceops, operations...
[10:49:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:50:34] <duesen>	 _joe_: Should I add this to https://wikitech.wikimedia.org/wiki/How_to_deploy_code ? It currently says "Join the IRC channels #wikimedia-operations connect and #wikimedia-tech connect on libera.chat and be available before and after all changes."
[10:51:03] <_joe_>	 duesen: well this isn't a change in a backport window
[10:51:16] <_joe_>	 jouncebot: nowandnext
[10:51:16] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000)
[10:51:16] <jouncebot>	 In 2 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1300)
[10:51:28] <_joe_>	 the mediawiki infra window is managed by serviceops
[10:51:32] <_joe_>	 maybe we should clarify that
[10:51:35] <_joe_>	 :)
[10:52:45] <duesen>	 _joe_:  Maybe each block in https://wikitech.wikimedia.org/wiki/Deployments should mention the associated IRC channel?
[10:52:47] <wikibugs>	 (CR) Volans: "Until T354410 is resolved 3.12 can't be officially supported" [software/spicerack] - https://gerrit.wikimedia.org/r/1050452 (owner: Ayounsi)
[10:54:16] <wikibugs>	 (PS3) Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973)
[10:54:49] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] "once again, after fixing a missing constant in tests" [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[10:57:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:00:20] <wikibugs>	 (CR) CI reject: [V:-1] REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[11:01:26] <wikibugs>	 (PS4) Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973)
[11:01:56] <duesen>	 grrr, I'm having trouble getting this to pass CI, for silly reasons >:(
[11:02:38] <wikibugs>	 (CR) Daniel Kinzler: [C:+2] "grr, once again..." [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[11:02:41] <wikibugs>	 (PS2) Clément Goubert: team-sre/redis: Alert on replica down [alerts] - https://gerrit.wikimedia.org/r/1046608
[11:03:17] <wikibugs>	 (CR) Clément Goubert: team-sre/redis: Alert on replica down (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1046608 (owner: Clément Goubert)
[11:05:30] <wikibugs>	 (PS3) Clément Goubert: team-sre/redis: Alert on replica down [alerts] - https://gerrit.wikimedia.org/r/1046608
[11:07:41] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:11:14] <wikibugs>	 (PS4) Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265)
[11:11:40] <wikibugs>	 (CR) Ayounsi: "I worked around the Tox issue locally by installing `kafka-python-ng`, noted that this patch is only to have tox working, not for spicerac" [software/spicerack] - https://gerrit.wikimedia.org/r/1050452 (owner: Ayounsi)
[11:15:17] <wikibugs>	 (PS1) Jelto: sre.gitlab.upgrade: lock backups during upgrades [cookbooks] - https://gerrit.wikimedia.org/r/1051107 (https://phabricator.wikimedia.org/T367501)
[11:18:35] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9939276 (Sfaci) Hi @Scott_French!  Thanks for your suggestion!.  Just for cu...
[11:19:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye
[11:22:09] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9939303 (SGupta-WMF) @Scott_French I have updated the repo , and tagged the...
[11:27:58] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views
[11:29:10] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
[11:30:08] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs
[11:31:28] <wikibugs>	 (Merged) jenkins-bot: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: Daniel Kinzler)
[11:32:51] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050949 (owner: Slyngshede)
[11:33:15] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs
[11:34:33] <wikibugs>	 (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1050951 (owner: Slyngshede)
[11:35:25] <wikibugs>	 (CR) Slyngshede: [C:+2] data.yaml: Offboarding of akhatun [puppet] - https://gerrit.wikimedia.org/r/1050949 (owner: Slyngshede)
[11:37:26] <wikibugs>	 (CR) Slyngshede: [C:+2] data.yaml: Offboarding for febinbellamy [puppet] - https://gerrit.wikimedia.org/r/1050951 (owner: Slyngshede)
[11:37:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[11:37:35] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[11:39:12] <logmsgbot>	 !log daniel@deploy1002 Started scap: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]]
[11:39:15] <stashbot>	 T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973
[11:40:03] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[11:41:11] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 2188 hosts
[11:41:56] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 2188 hosts
[11:43:01] <logmsgbot>	 !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging FebinBellamy out of all services on: 2188 hosts
[11:43:12] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[11:43:43] <logmsgbot>	 !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging FebinBellamy out of all services on: 2188 hosts
[11:45:07] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:45:28] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[11:46:09] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[11:49:05] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
[11:51:59] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[11:55:44] <wikibugs>	 (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756)
[11:56:36] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[11:58:09] <wikibugs>	 (03PS2) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127)
[11:58:14] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[11:58:40] <wikibugs>	 (03PS1) 10JMeybohm: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978)
[11:59:44] <wikibugs>	 (03PS2) 10JMeybohm: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978)
[12:00:37] <logmsgbot>	 !log daniel@deploy1002 daniel: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:00:47] <stashbot>	 T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973
[12:01:05] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2026.codfw.wmnet with OS bullseye
[12:01:34] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye
[12:03:19] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:04:23] <logmsgbot>	 !log daniel@deploy1002 daniel: Continuing with sync
[12:05:33] <wikibugs>	 (03CR) 10DCausse: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse)
[12:05:50] <wikibugs>	 (03PS4) 10Filippo Giunchedi: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:05:58] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[12:06:10] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:06:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] team-sre/redis: Alert on replica down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:06:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:13] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:08:59] <wikibugs>	 (03Merged) 10jenkins-bot: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[12:09:27] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:12:00] <logmsgbot>	 !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]] (duration: 32m 48s)
[12:12:03] <stashbot>	 T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973
[12:13:55] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9939480 (10dcaro) With the current data, you can start observing that `cloudcephosd1034-sdh` (the new drive that has...
[12:14:03] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[12:16:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:17:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:18:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:19:01] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:20:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:21:16] <wikibugs>	 (03PS1) 10Daniel Kinzler: Revert "REST: detect mismatching value types in json request" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051119
[12:21:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:21:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:22:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:23:03] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[12:24:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] hieradata: remove thanos-query settings from thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi)
[12:27:33] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:28:33] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:29:46] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:30:47] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:31:15] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:31:31] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:32:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:32:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:32:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:32:36] <wikibugs>	 (03PS5) 10Clément Goubert: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608
[12:32:42] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[12:32:45] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:33:05] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:33:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:33:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:33:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:34:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:34:26] <wikibugs>	 (03CR) 10Clément Goubert: team-sre/redis: Alert on replica down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:35:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:35:05] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:35:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:35:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:35:24] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[12:35:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:35:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:39:56] <claime>	 !log Running update-netboot-image bullseye for 11.10 release
[12:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline, depends on mysqld-exporter version" [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[12:41:30] <wikibugs>	 (03Abandoned) 10Daniel Kinzler: Revert "REST: detect mismatching value types in json request" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051119 (owner: 10Daniel Kinzler)
[12:41:39] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051129
[12:43:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Nice! LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[12:47:03] <wikibugs>	 (03PS1) 10Fabfur: hiera: removed unused haproxy28 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756)
[12:47:51] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:48:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: mariadb: monitoring memory pressure (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb)
[12:48:40] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131
[12:48:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: mariadb: add monitoring on io pressure for mariadb hosts (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb)
[12:49:09] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru
[12:49:10] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru
[12:49:11] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131
[12:49:18] <fabfur>	 !log upgrading A:cp-magru to haproxy 2.8.10 (T367756)
[12:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:21] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[12:49:32] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:50:53] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131 (owner: 10Jgiannelos)
[12:51:12] <claime>	 !log Running update-netboot-image bullseye for 11.10 release on puppetserver1001
[12:51:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:51:49] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131 (owner: 10Jgiannelos)
[12:54:10] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse)
[12:54:22] <wikibugs>	 (03PS3) 10Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366)
[12:54:23] <wikibugs>	 (03PS3) 10Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366)
[12:54:23] <wikibugs>	 (03PS1) 10Elukey: admin_ng: remove coredns image tag override for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366)
[12:54:36] <logmsgbot>	 !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2026.codfw.wmnet with OS bullseye
[12:55:01] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse)
[12:55:19] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye
[12:55:24] <wikibugs>	 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9939575 (10ABran-WMF) p:05Low→03Medium
[12:55:49] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:56:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "Please remove the related probe at hieradata/common/profile/prometheus/ops.yaml too:" [puppet] - 10https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) (owner: 10Dwisehaupt)
[12:56:12] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:56:16] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:56:25] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:56:38] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:57:26] <jinxer-wm>	 RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[12:57:29] <wikibugs>	 (03PS3) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[12:57:36] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert)
[12:57:52] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507)
[12:58:47] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] api,rest-gateway: upgrade Envoy version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1300). Please do the needful.
[13:00:05] <jouncebot>	 MatmaRex and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <MatmaRex>	 hi
[13:00:21] <urbanecm>	 i can deploy today
[13:00:32] <urbanecm>	 unless Lucas_WMDE wants to :)
[13:01:02] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] FixTrailingWhitespaceIds: Don't crash on complex conflicts [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) (owner: 10Bartosz Dziewoński)
[13:01:37] <urbanecm>	 MatmaRex: i assume i need to backport the things before running them :D
[13:01:49] <wikibugs>	 (03PS2) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134
[13:01:54] <MatmaRex>	 urbanecm: yes please
[13:01:58] <urbanecm>	 will do
[13:02:34] <wikibugs>	 (03CR) 10JMeybohm: "I was planning on flipping the switch my morning tomorrow (maybe before the backport window). That way the train run tomorrow will run with" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[13:02:38] <Lucas_WMDE>	 o/
[13:02:41] <Lucas_WMDE>	 sorry for the delay
[13:03:03] <wikibugs>	 (03PS1) 10Urbanecm: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862)
[13:03:28] <urbanecm>	 Lucas_WMDE: no worries. til about https://wikitech.wikimedia.org/wiki/Update_the_interwiki_cache, we should update it to match reality :)
[13:03:44] <Lucas_WMDE>	 oh that sounds promising -.-
[13:04:53] <urbanecm>	 Lucas_WMDE: this is what _actually_ happens ;) https://www.irccloud.com/pastebin/Et9c7FR1/
[13:05:14] <urbanecm>	 would you mind helping with updating it while i do the backport itself?
[13:05:22] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862) (owner: 10Urbanecm)
[13:05:28] <Lucas_WMDE>	 o_O
[13:05:37] <Lucas_WMDE>	 and you had to run all those commands manually?
[13:05:44] <urbanecm>	 yeah...
[13:05:55] <Lucas_WMDE>	 is `git push-for-review` a standard command on the deployment servers? I don’t think I’ve seen it before
[13:05:56] <urbanecm>	 it _used_ to automagically create the patch for you
[13:06:02] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862) (owner: 10Urbanecm)
[13:06:04] <urbanecm>	 but not anymore
[13:06:06] <Lucas_WMDE>	 (but I’m guessing it’s used by scap backport --revert?)
[13:06:32] <urbanecm>	 Lucas_WMDE: oh, sorry. that's my .gitconfig alias. it stands for `git push origin HEAD:refs/for/master`
[13:06:42] <Lucas_WMDE>	 ok
[13:06:54] <Lucas_WMDE>	 ah, now I see the username+password below, I skipped over that
[13:06:58] <Lucas_WMDE>	 ok so no magic there
[13:07:04] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1051135|Update interwiki map (T368862)]]
[13:07:04] <urbanecm>	 yep
[13:07:06] <wikibugs>	 (03PS3) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134
[13:07:07] <stashbot>	 T368862: Please run maintenance task "scap update-interwiki-cache" (30 June 2024) - https://phabricator.wikimedia.org/T368862
[13:07:13] <urbanecm>	 just some gitfu
[13:08:09] <urbanecm>	 Lucas_WMDE: fwiw, T247107 is why it no longer autocommits anything
[13:08:10] <stashbot>	 T247107: Make 'scap update-interwiki-cache' less scary - https://phabricator.wikimedia.org/T247107
[13:09:21] <wikibugs>	 (03Merged) 10jenkins-bot: FixTrailingWhitespaceIds: Don't crash on complex conflicts [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) (owner: 10Bartosz Dziewoński)
[13:09:45] <Lucas_WMDE>	 “This however is not documented anywhere”
[13:09:46] <Lucas_WMDE>	 well
[13:09:56] <Lucas_WMDE>	 except on the wikitech page which is now badly outdated
[13:10:00] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1051135|Update interwiki map (T368862)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:10:15] <Lucas_WMDE>	 I’ll see if I can update that
[13:10:54] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[13:10:56] <urbanecm>	 thanks
[13:11:36] <Lucas_WMDE>	 do you know if there’s an existing task to further improve update-interwiki-cache?
[13:11:48] <Lucas_WMDE>	 because at least as I see it now it still seems far from ideal
[13:11:52] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: remove coredns image tag override for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[13:11:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: removed unused haproxy28 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[13:11:59] <wikibugs>	 (03CR) 10Elukey: [C:03+2] api,rest-gateway: upgrade Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[13:12:01] <Lucas_WMDE>	 it silently updates the production config and leaves you alone to figure out what to do with that
[13:12:06] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[13:12:26] <Lucas_WMDE>	 it would be nice if you could run it against your local config checkout outside deploy1002 but that’s not gonna happen while it’s still part of scap
[13:13:26] <urbanecm>	 Lucas_WMDE: technically you can run `extensions/WikimediaMaintenance/dumpInterwiki.php` locally. not the easiest, as it assumes certain things (like MEDIAWIKI_DEPLOYMENT_DIR) exist, but...
[13:13:47] <urbanecm>	 but the scap part doesn't really do much more than that
[13:14:19] <urbanecm>	 no idea if we have a task tho
[13:14:22] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[13:15:43] <Lucas_WMDE>	 ok, thanks
[13:16:05] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1051135|Update interwiki map (T368862)]] (duration: 09m 01s)
[13:16:07] <stashbot>	 T368862: Please run maintenance task "scap update-interwiki-cache" (30 June 2024) - https://phabricator.wikimedia.org/T368862
[13:16:20] <urbanecm>	 okay, interwiki done
[13:16:35] <urbanecm>	 now the second part
[13:16:48] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1050406|FixTrailingWhitespaceIds: Don't crash on complex conflicts (T356196)]]
[13:16:52] <stashbot>	 T356196: Auto triming of internal links is breaking anchors if the last character is a space - https://phabricator.wikimedia.org/T356196
[13:17:21] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru
[13:17:43] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[13:19:20] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru
[13:21:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:21:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:24:05] <Lucas_WMDE>	 urbanecm: can I link that irccloud snippet in the documentation?
[13:24:11] <urbanecm>	 Lucas_WMDE: sure
[13:24:13] <Lucas_WMDE>	 (idk how long irccloud keeps snippets)
[13:24:13] <Lucas_WMDE>	 ok
[13:24:29] <urbanecm>	 Lucas_WMDE: but it might be wiser to copy it. just in case irccloud deletes it
[13:24:52] * Lucas_WMDE looks for a collapse template
[13:25:34] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050406|FixTrailingWhitespaceIds: Don't crash on complex conflicts (T356196)]] (duration: 08m 46s)
[13:25:37] <stashbot>	 T356196: Auto triming of internal links is breaking anchors if the last character is a space - https://phabricator.wikimedia.org/T356196
[13:26:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:26:12] <urbanecm>	 MatmaRex: okay, script backported. do you want me to run it for one wiki first for testing?
[13:26:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:26:41] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: deploy1003: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416)
[13:26:41] <MatmaRex>	 urbanecm: you could, it won't hurt
[13:26:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[13:26:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[13:27:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[13:27:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[13:27:29] <MatmaRex>	 (it's a logged update so it will skip that wiki automatically later if you run it with `foreachwiki`)
[13:27:39] <Lucas_WMDE>	 urbanecm: updated https://wikitech.wikimedia.org/wiki/Update_the_interwiki_cache
[13:27:51] <urbanecm>	 MatmaRex: looks like it works. anything to verify before running it all? https://www.irccloud.com/pastebin/UDn1Bbob/
[13:28:24] <wikibugs>	 (03PS3) 10Clare Ming: extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234)
[13:29:05] <MatmaRex>	 urbanecm: don't think so. i ran it on the beta cluster and locally a bunch of times
[13:29:14] <urbanecm>	 MatmaRex: okay, proceeding then
[13:29:25] <wikibugs>	 (03PS4) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[13:29:45] <MatmaRex>	 thank you!
[13:29:45] <urbanecm>	 !log mwmaint1002: [urbanecm@mwmaint1002 ~]$ foreachwiki DiscussionTools:FixTrailingWhitespaceIds (T356196)
[13:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: sync
[13:29:56] <urbanecm>	 anything else MatmaRex?
[13:30:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: sync
[13:30:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[13:30:11] <MatmaRex>	 nope
[13:30:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync
[13:30:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] deploy1003: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416) (owner: 10Alexandros Kosiaris)
[13:30:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync
[13:31:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65581 and previous config saved to /var/cache/conftool/dbconfig/20240701-133118-marostegui.json
[13:31:22] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[13:32:52] <Lucas_WMDE>	 urbanecm: and it looks like T347982 is sort of the general “improve scap update-interwiki-cache” task I was looking for
[13:32:53] <stashbot>	 T347982: scap update-interwiki-cache is broken - https://phabricator.wikimedia.org/T347982
[13:33:08] <urbanecm>	 Lucas_WMDE: documentation page looks good to me
[13:33:13] <urbanecm>	 (process doesn't, but that's beside the point)
[13:33:15] <urbanecm>	 thanks for the update!
[13:33:45] <Lucas_WMDE>	 yay, thanks ^^
[13:33:48] <urbanecm>	 MatmaRex: is it a reason for concern if it prints out "Failed to update sth sth" a little too frequently?
[13:34:52] <MatmaRex>	 urbanecm: not sure. can you copy a few examples?
[13:34:53] <urbanecm>	 and a bunch of more for commonswiki https://www.irccloud.com/pastebin/8URdlJEt/
[13:34:53] <wikibugs>	 (03PS1) 10Elukey: blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1051139
[13:34:59] <urbanecm>	 MatmaRex: was just doing that, see above!
[13:37:11] <wikibugs>	 (03PS1) 10JMeybohm: flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978)
[13:37:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bullseye
[13:37:29] <urbanecm>	 afaics, the db errors are logged at https://logstash.wikimedia.org/goto/650fa401ec1a1e99491516292afb0d65
[13:38:04] <wikibugs>	 (03CR) 10DCausse: [C:03+1] flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[13:39:15] <MatmaRex>	 urbanecm: i think these messages are expected, commons just has a bit more of these cases than i thought it would. but the scenario is exactly the same as we found on the beta cluster
[13:39:31] <urbanecm>	 okay, that's good to know.
[13:39:34] <urbanecm>	 leaving it running then :)
[13:39:57] <MatmaRex>	 e.g. https://commons.wikimedia.org/wiki/User_talk:Reneschuler#File:Alden_-_12x12_Mixed_Media_on_Canvas_by_Fine_Artist_Rene_Romero_Schuler.jpg where they post multiple messages with identical topic titles
[13:39:58] <wikibugs>	 (CR) JMeybohm: [C:+2] flink-app: Add securityContext to the flink container [deployment-charts] - https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[13:40:20] <wikibugs>	 (PS1) Marostegui: orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141
[13:40:53] <wikibugs>	 (PS5) Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[13:40:55] <wikibugs>	 (Merged) jenkins-bot: flink-app: Add securityContext to the flink container [deployment-charts] - https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[13:41:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1040
[13:41:15] <wikibugs>	 (CR) Arnaudb: [C:+1] orchestrator: Add volans to powerusers [puppet] - https://gerrit.wikimedia.org/r/1051141 (owner: Marostegui)
[13:41:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1040
[13:41:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:42:48] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[13:42:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[13:43:16] <Lucas_WMDE>	 urbanecm: random question – do you see a list of passwords at https://gerrit.wikimedia.org/r/settings/#HTTPCredentials ?
[13:43:18] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: removed unused haproxy28 overrides [puppet] - https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[13:43:26] <Lucas_WMDE>	 because I only see the username and the “generate new password” button
[13:43:47] <urbanecm>	 Lucas_WMDE: that's what i see, but there should not be any list of passwords there
[13:43:55] <urbanecm>	 it will give you one password
[13:43:58] <urbanecm>	 and that's what you use
[13:44:17] <Lucas_WMDE>	 makes sense, but then the notification email sounds a bit outdated IMHO :)
[13:44:19] <Lucas_WMDE>	 I’ll file a task
[13:44:43] <urbanecm>	 fair
[13:44:58] <urbanecm>	 it does fall under "manage your HTTP password" imo, but it could definitely be improved
[13:46:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P65583 and previous config saved to /var/cache/conftool/dbconfig/20240701-134626-marostegui.json
[13:46:41] <wikibugs>	 (CR) Elukey: [C:+2] blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1051139 (owner: Elukey)
[13:48:11] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:48:23] <wikibugs>	 (CR) Btullis: "Thanks Elukey. I agree, it does seem like a lot of privileges just for a unix socket. We're still hoping to find another workable way arou" [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[13:48:29] <wikibugs>	 (CR) Elukey: profile::puppetserver::git: add an option to exclude servers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:48:38] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:49:01] <wikibugs>	 (PS1) Fabfur: hiera: upgrade haproxy to 2.8 on codfw [puppet] - https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756)
[13:49:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:49:42] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[13:50:07] <wikibugs>	 (Merged) jenkins-bot: blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1051139 (owner: Elukey)
[13:50:23] <Lucas_WMDE>	 filed T368912
[13:50:24] <stashbot>	 T368912: Gerrit email about added or updated HTTP password is a bit misleading - https://phabricator.wikimedia.org/T368912
[13:51:25] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:03] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9939770 (jcrespo) No action will be needed for backup1010 in the end.
[13:52:11] <wikibugs>	 (CR) JHathaway: [C:+1] profile::puppetserver::git: add an option to exclude servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[13:52:43] <wikibugs>	 (PS4) Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259)
[13:53:11] <wikibugs>	 (PS6) Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023)
[13:56:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:56:59] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[13:57:13] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:01:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P65584 and previous config saved to /var/cache/conftool/dbconfig/20240701-140133-marostegui.json
[14:03:06] <wikibugs>	 (PS5) Btullis: cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259)
[14:03:31] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[14:03:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939826 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye exec...
[14:04:03] <wikibugs>	 (CR) Btullis: "I have tried a different technique. Let's see if we can configure the socket permissions appropriately like this." [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:05:20] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:07:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65585 and previous config saved to /var/cache/conftool/dbconfig/20240701-140725-root.json
[14:09:29] <wikibugs>	 (CR) Krinkle: Handle sso.wikimedia.org domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[14:10:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[14:10:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[14:11:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:12:02] <wikibugs>	 (CR) Krinkle: Handle sso.wikimedia.org domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[14:13:13] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368743#9939862 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:16:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:16:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65586 and previous config saved to /var/cache/conftool/dbconfig/20240701-141640-marostegui.json
[14:16:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:16:43] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:16:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:21:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65587 and previous config saved to /var/cache/conftool/dbconfig/20240701-142231-root.json
[14:24:11] <wikibugs>	 (CR) Vgutierrez: [C:+1] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050484 (owner: Ncmonitor)
[14:25:07] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: upgrade haproxy to 2.8 on codfw [puppet] - https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[14:25:42] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM." [software/homer/deploy] - https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[14:26:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:27:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage
[14:28:52] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9939925 (elukey) p:Triage→Medium
[14:30:51] <wikibugs>	 (CR) Brouberol: [C:+1] "Let's see if this works" [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[14:31:27] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye
[14:32:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage
[14:34:03] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/1051147
[14:35:20] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:35:40] <fabfur>	 !log upgrading A:cp-codfw to haproxy 2.8.10 (T367756)
[14:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:43] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[14:35:50] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/1051147 (owner: DCausse)
[14:35:51] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: upgrade haproxy to 2.8 on codfw [puppet] - https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: Fabfur)
[14:36:23] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw
[14:36:38] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw
[14:36:40] <wikibugs>	 (CR) Phuedx: [C:+1] extension-list: Add Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[14:36:42] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - https://gerrit.wikimedia.org/r/1051147 (owner: DCausse)
[14:37:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65589 and previous config saved to /var/cache/conftool/dbconfig/20240701-143736-root.json
[14:39:15] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:03] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:40:16] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:43:39] <wikibugs>	 ops-eqiad, SRE, Cassandra, DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9940046 (Eevans) 💥 `/dev/sde` is failed again...  {F56126743}  {F56126744}  {F56126745}
[14:43:57] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[14:44:12] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:44:42] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[14:45:00] <wikibugs>	 (PS1) Alexandros Kosiaris: Repackage for bullseye [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/1051149
[14:45:34] <wikibugs>	 (PS1) Elukey: services: update thumbor-plugin Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1051150
[14:45:51] <wikibugs>	 (CR) Kamila Součková: [C:+1] Repackage for bullseye [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/1051149 (owner: Alexandros Kosiaris)
[14:47:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Repackage for bullseye [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/1051149 (owner: Alexandros Kosiaris)
[14:48:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[14:48:30] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[14:48:41] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:49:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:50:04] <wikibugs>	 (Merged) jenkins-bot: Repackage for bullseye [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/1051149 (owner: Alexandros Kosiaris)
[14:50:07] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM, I don't fully understand the changes to the query in _get_devices() but I assume it works and what is now needed so +1" [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[14:50:29] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[14:52:10] <claime>	 jouncebot: nowandnext
[14:52:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 37 minute(s)
[14:52:10] <jouncebot>	 In 0 hour(s) and 37 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1530)
[14:52:34] <claime>	 herron: I'll deploy the last major statsd for mw-on-k8s
[14:52:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65590 and previous config saved to /var/cache/conftool/dbconfig/20240701-145242-root.json
[14:52:45] <wikibugs>	 (CR) Clément Goubert: [C:+2] mw-web: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[14:53:04] <herron>	 claime: excellent
[14:53:36] <wikibugs>	 (Merged) jenkins-bot: mw-web: send statsd data to the exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: Clément Goubert)
[14:54:42] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:54:52] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[14:55:14] <claime>	 !log deploying statsd-exporter for mw-web - T365265
[14:55:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:22] <stashbot>	 T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265
[14:56:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:56:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:57:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:58:22] <MatmaRex>	 urbanecm: if you have a moment, can you check how that script run is going? (probably not done yet, i'm just curious how far along it is)
[14:59:15] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:16] <wikibugs>	 (CR) Ottomata: [C:+1] EventStreamConfig: Add hive ingestion defaults (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[14:59:40] <urbanecm>	 MatmaRex: currently at enwiki
[14:59:44] <urbanecm>	 enwiki:  107801
[15:00:23] <wikibugs>	 (CR) Ottomata: [C:+1] "ActuActual" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[15:01:03] <MatmaRex>	 urbanecm: nice, thanks. enwiki had around 140k rows to fix fwiw (https://phabricator.wikimedia.org/T356196#9908208)
[15:01:13] <urbanecm>	 so almost done there in that case?
[15:01:24] <MatmaRex>	 urbanecm: are there lots of warnings on other wikis too, or was that just on commons? (just curious)
[15:02:23] <MatmaRex>	 urbanecm: yeah, although it gets slower towards the end, because the query scans the fixed rows again
[15:02:34] <urbanecm>	 MatmaRex: similar amount of warnings at enwiki as at commons
[15:02:56] <MatmaRex>	 alright. thanks for checking :)
[15:03:40] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:04:12] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:05:31] <akosiaris>	 !log reboot deploy1003 T364416
[15:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:34] <stashbot>	 T364416: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416
[15:06:22] <icinga-wm>	 PROBLEM - Host deploy1003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:34] <icinga-wm>	 RECOVERY - Host deploy1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[15:07:11] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:07:23] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:07:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65591 and previous config saved to /var/cache/conftool/dbconfig/20240701-150747-root.json
[15:08:09] <wikibugs>	 (CR) Aqu: "Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[15:10:33] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bullseye
[15:11:11] <wikibugs>	 (PS3) Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151)
[15:11:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: Jdlrobson)
[15:11:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:56] <wikibugs>	 (CR) Ottomata: [C:+1] EventStreamConfig: Add hive ingestion defaults (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[15:13:28] <wikibugs>	 (PS1) Alexandros Kosiaris: deployment_server: if guard php-readline to buster [puppet] - https://gerrit.wikimedia.org/r/1051154 (https://phabricator.wikimedia.org/T364416)
[15:14:55] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:15:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:15:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[15:15:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940169 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye comp...
[15:15:33] <wikibugs>	 (CR) Btullis: [C:+2] cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[15:16:25] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:18:51] <wikibugs>	 (PS1) JHathaway: nskaggs: remove references from icinga config [puppet] - https://gerrit.wikimedia.org/r/1051155
[15:18:52] <wikibugs>	 (Merged) jenkins-bot: cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[15:20:17] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:21:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:21:42] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:22:32] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw
[15:22:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:22:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65592 and previous config saved to /var/cache/conftool/dbconfig/20240701-152253-root.json
[15:25:13] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw
[15:30:05] <jouncebot>	 jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1530).
[15:32:47] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:36:42] <wikibugs>	 (PS2) TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134)
[15:37:04] <wikibugs>	 (CR) TChin: EventStreamConfig: Add hive ingestion defaults (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[15:37:14] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:37:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65593 and previous config saved to /var/cache/conftool/dbconfig/20240701-153758-root.json
[15:39:34] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:-1] "Okay to deploy tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: Lucas Werkmeister (WMDE))
[15:44:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65594 and previous config saved to /var/cache/conftool/dbconfig/20240701-154427-marostegui.json
[15:44:30] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[15:51:01] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] doc: redirect doc.wikimedia.org/analytics-api [puppet] - https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: Dzahn)
[15:55:37] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/doc/test_doc.yaml --hosts=doc2002.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: Dzahn)
[15:56:35] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9940509 (leila) approved. thanks!
[15:58:59] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/doc/test_doc.yaml --hosts=doc1003.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: Dzahn)
[15:59:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P65595 and previous config saved to /var/cache/conftool/dbconfig/20240701-155934-marostegui.json
[16:04:18] <wikibugs>	 (CR) Phuedx: [C:-1] "A couple of points inline." [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[16:07:34] <wikibugs>	 (CR) Phuedx: [C:-1] Deploy MetricsPlatform to beta cluster (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[16:11:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[16:11:51] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[16:12:15] <wikibugs>	 (03PS5) 10BCornwall: hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174)
[16:14:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P65596 and previous config saved to /var/cache/conftool/dbconfig/20240701-161441-marostegui.json
[16:16:54] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3132/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[16:17:03] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[16:17:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[16:18:39] <wikibugs>	 10ops-codfw, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940 (10Jhancock.wm) 03NEW
[16:18:47] <urandom>	 !log restarting Cassandra —restbase2023-{a,b,c}— troubleshooting storage utilization
[16:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:26] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051155 (owner: 10JHathaway)
[16:20:39] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] nskaggs: remove references from icinga config [puppet] - 10https://gerrit.wikimedia.org/r/1051155 (owner: 10JHathaway)
[16:20:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1039
[16:20:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1039
[16:21:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:27] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[16:22:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[16:27:21] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9940710 (10elukey) After a chat with Riccardo some things came up:  * It seems that the issue comes up when debmonitor-client is upgraded...
[16:27:40] <wikibugs>	 (03PS3) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524)
[16:27:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:29:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65597 and previous config saved to /var/cache/conftool/dbconfig/20240701-162948-marostegui.json
[16:29:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[16:29:52] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[16:30:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[16:30:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65598 and previous config saved to /var/cache/conftool/dbconfig/20240701-163010-marostegui.json
[16:33:46] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts
[16:34:29] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts
[16:34:56] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts
[16:35:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage
[16:38:23] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage
[16:39:53] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9940781 (10Dzahn) No, I did not get a response.  For one of the owner addresses I got an "550 5.1.1 The email account that you tried to reach does not exist."  So I can confirm there does see...
[16:41:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:42:22] <wikibugs>	 (03PS1) 10Pppery: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915)
[16:42:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery)
[16:43:34] <wikibugs>	 (03PS2) 10Pppery: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915)
[16:46:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] deployment_server: if guard php-readline to buster [puppet] - 10https://gerrit.wikimedia.org/r/1051154 (https://phabricator.wikimedia.org/T364416) (owner: 10Alexandros Kosiaris)
[16:48:17] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9940819 (10akosiaris)
[16:50:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9940823 (10akosiaris) 05Open→03Resolved Host is imaged, rest of the work is ongoing in T364417
[16:50:58] <wikibugs>	 (03PS4) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T367826)
[16:51:10] <wikibugs>	 (03PS5) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T367826)
[16:51:16] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[16:51:31] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[16:51:37] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1700)
[17:00:05] <jouncebot>	 ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1700).
[17:00:53] <icinga-wm>	 PROBLEM - Disk space on restbase2023 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 90521 MB (5% inode=99%): /srv/sdc4 66626 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops
[17:03:24] <wikibugs>	 (03CR) 10Klausman: [C:03+1] "Post-facto +1, already rolled out. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[17:04:53] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[17:05:09] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[17:08:07] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[17:08:23] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[17:11:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:35] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[17:15:38] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[17:16:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[17:16:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65599 and previous config saved to /var/cache/conftool/dbconfig/20240701-171609-marostegui.json
[17:16:12] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:24:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9941093 (10Papaul)
[17:26:46] <wikibugs>	 (03PS1) 10Herron: thanos: increase query frontend and store cache sizes [puppet] - 10https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953)
[17:26:54] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9941103 (10Papaul) All the cabling is done. I am leaving this task open so when we move the console cables from asw-c*/d*-codfw to ssw1-* and lsw1-* I can u...
[17:27:07] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9941105 (10Scott_French)
[17:27:26] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:27:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye
[17:27:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9941106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye comp...
[17:30:35] <wikibugs>	 (03PS4) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:34:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:35:47] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: reboot ssw1-d8-codfw
[17:36:01] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: reboot ssw1-d8-codfw
[17:37:17] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[17:38:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9941207 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye exec...
[17:40:48] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] stewards-onboarder: Add gitlab API to config [puppet] - 10https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: 10Urbanecm)
[17:40:57] <urbanecm>	 ohoho :)
[17:41:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[17:41:16] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[17:42:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[17:42:05] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[17:44:21] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[17:44:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[17:45:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[17:46:57] <wikibugs>	 (03PS1) 10Brennen Bearnes: gitlab-settings: v1.6.0 for squash commit templates [puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624)
[17:48:13] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[17:49:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002"
[17:49:19] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:49:23] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[17:49:25] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[17:52:09] <wikibugs>	 (03PS5) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:53:32] <wikibugs>	 (03PS6) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:55:39] <wikibugs>	 (03CR) 10David Caro: "Running now on toolsbeta with envvars-api 0.0.49" [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:57:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[17:57:29] <wikibugs>	 (03PS1) 10Cwhite: logstash: add normalize_labels script [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867)
[17:57:36] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9941344 (10Scott_French) Thanks, @SGupta-WMF !  The service is up and running...
[17:59:00] <wikibugs>	 (03CR) 10Brennen Bearnes: "I already ran this against all projects, so future runs should only catch a handful of new ones." [puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624) (owner: 10Brennen Bearnes)
[17:59:41] <wikibugs>	 (03CR) 10David Caro: envvars-backend: update endpoint to new schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[18:00:23] <wikibugs>	 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9941358 (10Dzahn) 05Open→03Stalled Per IRC chat: curre...
[18:01:03] <wikibugs>	 (03PS7) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[18:04:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova)
[18:04:53] <wikibugs>	 06SRE, 06SRE-OnFire, 10Stewards-Onboarding-Tool, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9941430 (10CDanis) >>! In T343377#9937115, @Urbanecm wrote: >>>! In T343377#9931101, @MoritzMuehlenhoff wrote: >>...
[18:23:06] <wikibugs>	 (03CR) 10Ottomata: EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin)
[18:24:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:24:19] <icinga-wm>	 PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:24:19] <icinga-wm>	 PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:25:36] <wikibugs>	 (03PS1) 10Jdlrobson: Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120)
[18:26:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 125, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:26:19] <icinga-wm>	 RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:26:19] <icinga-wm>	 RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:28:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: 10Jdlrobson)
[18:31:09] <wikibugs>	 06SRE, 06SRE-OnFire, 10Stewards-Onboarding-Tool, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9941628 (10Urbanecm) >>! In T343377#9941430, @CDanis wrote: > This is great, thanks.  One remaining piece here is...
[18:31:20] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9941625 (10Dzahn) Hello @KFrancis Andy is a special case since he moved from WMF staff to WMDE.  If he was WMF staff we wouldn't have to do a separa...
[18:42:28] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall)
[18:44:34] <wikibugs>	 (03PS1) 10Jdlrobson: Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483)
[18:44:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: 10Jdlrobson)
[18:46:44] <wikibugs>	 10ops-eqsin, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9941690 (10BCornwall) 05Open→03Resolved
[18:54:52] <wikibugs>	 (03PS2) 10Cwhite: logstash: update ecs patch version to 7 [puppet] - 10https://gerrit.wikimedia.org/r/1032737 (https://phabricator.wikimedia.org/T290020)
[18:56:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041
[18:56:39] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041
[18:57:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[18:57:12] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9941755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[18:59:09] <wikibugs>	 (03CR) 10Gergő Tisza: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[19:02:37] <wikibugs>	 (03PS6) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[19:03:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[19:03:46] <wikibugs>	 (03PS7) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234)
[19:04:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[19:04:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery)
[19:04:51] <wikibugs>	 (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[19:05:38] <wikibugs>	 (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming)
[19:09:42] <wikibugs>	 (03PS1) 10Cwhite: test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451)
[19:09:44] <wikibugs>	 (03PS1) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451)
[19:10:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) (owner: 10Cwhite)
[19:10:10] <wikibugs>	 (03CR) 10Krinkle: [C:03+1] varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[19:11:31] <wikibugs>	 (03PS2) 10Cwhite: test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451)
[19:11:32] <wikibugs>	 (03PS2) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451)
[19:13:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage
[19:13:44] <wikibugs>	 (03PS3) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451)
[19:14:47] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.91.0" for 234 hosts
[19:15:21] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.91.0" for 234 hosts
[19:16:41] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage
[19:17:38] <wikibugs>	 (03PS2) 10Scott French: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[19:19:09] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.91.0" for 233 hosts
[19:19:36] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Your plan SGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[19:19:42] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.91.0" completed for 233 hosts
[19:33:56] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[19:45:34] <wikibugs>	 (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867) (owner: 10Cwhite)
[19:55:48] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: add normalize_labels script [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867) (owner: 10Cwhite)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T2000).
[20:00:05] <jouncebot>	 jdlrobson, pppery, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] <Pppery>	 Here
[20:00:12] <cjming>	 o/
[20:00:18] <cjming>	 i can deploy
[20:00:27] <Jdlrobson>	 hey cjming im here :)
[20:00:33] <cjming>	 yay!
[20:00:43] <cjming>	 jdlrobson: can your 2 backports go out together?
[20:01:03] <wikibugs>	 (03PS4) 10Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151)
[20:02:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[20:02:54] <wikibugs>	 (03Merged) 10jenkins-bot: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[20:03:13] <logmsgbot>	 !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]]
[20:03:18] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151
[20:03:51] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: 10Jdlrobson)
[20:04:22] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: 10Jdlrobson)
[20:05:11] <wikibugs>	 (03PS1) 10RLazarus: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966)
[20:05:43] <cjming>	 Pppery: i will do yours next after Jon's 1st config patch and while i wait for Jon's 2 backports to merge
[20:05:49] <Pppery>	 Ok
[20:07:25] <cjming>	 Jdlrobson: I've +2'd your two backports for Minerva since it looks like it'll be ~20+ mins for each -- ok if i scap backport them together?
[20:07:32] <wikibugs>	 (PS2) RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966)
[20:10:07] <wikibugs>	 (CR) RLazarus: "You're right, I only needed to kube_env as the -deploy user for the `kubectl attach`, and that user already has the privileges it needs. N" [deployment-charts] - https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[20:10:10] <wikibugs>	 (Abandoned) RLazarus: admin_ng: RBAC to allow mw-script user to attach to pods [deployment-charts] - https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: RLazarus)
[20:13:30] <Jdlrobson>	 cjming: yeh that's fine
[20:14:15] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[20:15:41] <cjming>	 any SRE around? i seem to be stuck syncing to test servers with: 20:04:19 sync-masters:  50% (in-flight: 1; ok: 1; fail: 0; left: 0) /
[20:15:44] <cjming>	 not sure if i should just wait or if there's something to do - usually doesn't take this long
[20:18:52] <rzl>	 cjming: hm, that might be related to deploy1003 which is being set up in https://phabricator.wikimedia.org/T364417
[20:19:45] <cjming>	 huh - what should i do in the meantime?
[20:19:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65600 and previous config saved to /var/cache/conftool/dbconfig/20240701-201949-marostegui.json
[20:19:53] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[20:19:56] <rzl>	 I believe it should be fine to not sync to that master, but I don't know offhand how to tell scap that
[20:21:04] <rzl>	 I'm digging around in the scap source a little, but dancy might know the answer offhand if he's around
[20:21:08] <cjming>	 anyone else know how i should intervene with scap?
[20:21:51] * dancy taking a look
[20:22:00] * cjming grateful to dancy
[20:22:04] <dancy>	 I had to do some fighting earlier today to work around partially-deployed deploy1003.
[20:22:39] <cjming>	 oooh - it actually just started up again - maybe it's ok?
[20:22:53] <dancy>	 Yeah, should be ok.
[20:23:04] <cjming>	 i just had to wait an unusually long time to sync to test servers
[20:23:16] <dancy>	 For the next backport, when it hangs there, start another shell, do "ps uaxwwww | grep deploy1003" and kill the associated ssh process.
[20:23:34] <cjming>	 will do - thanks!
[20:23:42] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:23:42] <logmsgbot>	 !log cjming@deploy1002 Sync cancelled.
[20:23:42] <dancy>	 I'll hang around.
[20:23:51] <cjming>	 except it cancelled the sync
[20:23:51] <rzl>	 thanks dancy :)
[20:23:51] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin	 - https://phabricator.wikimedia.org/T367151
[20:24:13] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[20:24:15] <dancy>	 cjming: Can you send me the transcript?  
[20:24:16] <cjming>	 and logged me out - bec timeout?
[20:24:53] <cjming>	 yup - 1 sec
[20:25:36] <cjming>	 should i just re-scap backport the same patch?
[20:25:39] <dancy>	 yes
[20:26:08] <logmsgbot>	 !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]]
[20:27:55] <wikibugs>	 (PS1) Fabfur: benthos:cache: encode referer field as hex [puppet] - https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718)
[20:28:44] <logmsgbot>	 !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:28:50] <cjming>	 ok finally
[20:28:59] <cjming>	 Jdlrobson: 1st patch on test servers - can i sync?
[20:29:05] <wikibugs>	 (Merged) jenkins-bot: Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: Jdlrobson)
[20:29:06] <wikibugs>	 (Merged) jenkins-bot: Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: Jdlrobson)
[20:29:22] <Jdlrobson>	 cjming: looking now
[20:30:05] <Jdlrobson>	 cjming: can we sync all 3 of these together?
[20:30:17] <Jdlrobson>	 it looks good but ideally i'd like the other fixes to go out before or at the same time.
[20:30:32] <cjming>	 sure - let me do that - 1 sec
[20:30:41] <logmsgbot>	 !log cjming@deploy1002 Sync cancelled.
[20:31:45] <logmsgbot>	 !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]]
[20:31:50] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin	 - https://phabricator.wikimedia.org/T367151
[20:31:51] <stashbot>	 T368120: [Short term fix] Notification icon not same color as other icons - https://phabricator.wikimedia.org/T368120
[20:31:51] <stashbot>	 T368483: Regression: Global invert broke VisualEditor "Add a link" workflow - https://phabricator.wikimedia.org/T368483
[20:33:07] <wikibugs>	 (CR) Vgutierrez: "two things:" [puppet] - https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[20:34:29] <logmsgbot>	 !log cjming@deploy1002 cjming, jdlrobson: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:34:32] <cjming>	 Jdlrobson: ok all 3 are up on test servers - lmk if/when to sync
[20:34:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P65601 and previous config saved to /var/cache/conftool/dbconfig/20240701-203456-marostegui.json
[20:34:57] <Jdlrobson>	 cjming: looking now :)
[20:36:22] <Jdlrobson>	 cjming: please sync!
[20:36:27] <cjming>	 yay!
[20:36:32] <logmsgbot>	 !log cjming@deploy1002 cjming, jdlrobson: Continuing with sync
[20:39:15] <wikibugs>	 (CR) Vgutierrez: "actually any request header logged needs to be encoded" [puppet] - https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur)
[20:41:36] <wikibugs>	 (Abandoned) Gergő Tisza: Profiler: Handle X-Wikimedia-Debug cookie [mediawiki-config] - https://gerrit.wikimedia.org/r/1024932 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[20:42:24] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]] (duration: 10m 39s)
[20:42:29] <stashbot>	 T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin	 - https://phabricator.wikimedia.org/T367151
[20:42:30] <stashbot>	 T368120: [Short term fix] Notification icon not same color as other icons - https://phabricator.wikimedia.org/T368120
[20:42:30] <stashbot>	 T368483: Regression: Global invert broke VisualEditor "Add a link" workflow - https://phabricator.wikimedia.org/T368483
[20:42:32] <cjming>	 Jdlrobson: should be live!
[20:42:45] <cjming>	 Pppery: doing yours now - pardon the wait
[20:43:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[20:43:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: Pppery)
[20:43:57] <wikibugs>	 (CR) Cwhite: [C:+2] test normalize_labels ruby script on beta [puppet] - https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) (owner: Cwhite)
[20:44:25] <wikibugs>	 (Merged) jenkins-bot: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: Pppery)
[20:44:43] <logmsgbot>	 !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]]
[20:44:45] <stashbot>	 T86915: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915
[20:45:42] <wikibugs>	 (PS1) Cwhite: beta-logs: remove dlq spam mitigation [puppet] - https://gerrit.wikimedia.org/r/1051200
[20:47:17] <logmsgbot>	 !log cjming@deploy1002 cjming, pppery: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:47:25] <cjming>	 Pppery: your patch is up on test servers - lmk if i can sync
[20:47:40] <Pppery>	 Looks good
[20:47:53] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:47:57] <logmsgbot>	 !log cjming@deploy1002 cjming, pppery: Continuing with sync
[20:48:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[20:49:36] <wikibugs>	 (PS2) Cwhite: beta-logs: remove dlq spam mitigation [puppet] - https://gerrit.wikimedia.org/r/1051200
[20:49:55] <wikibugs>	 (PS19) Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[20:50:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P65602 and previous config saved to /var/cache/conftool/dbconfig/20240701-205003-marostegui.json
[20:51:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:53:47] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]] (duration: 09m 03s)
[20:53:49] <stashbot>	 T86915: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915
[20:54:00] <cjming>	 Pppery: should be live!
[20:54:12] <Pppery>	 Yep
[20:54:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[20:55:09] <wikibugs>	 (Merged) jenkins-bot: extension-list: Add Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: Clare Ming)
[20:55:28] <logmsgbot>	 !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]]
[20:55:33] <stashbot>	 T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234
[20:59:20] <wikibugs>	 (PS1) Jforrester: Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T2100).
[21:00:51] <cjming>	 ^^ i'm almost done - just need to sync last patch
[21:00:53] <icinga-wm>	 RECOVERY - Disk space on restbase2023 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops
[21:05:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65603 and previous config saved to /var/cache/conftool/dbconfig/20240701-210512-marostegui.json
[21:05:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[21:05:15] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:05:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[21:05:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65604 and previous config saved to /var/cache/conftool/dbconfig/20240701-210534-marostegui.json
[21:12:33] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:13] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52340 bytes in 1.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:25] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:08] <logmsgbot>	 !log cjming@deploy1002 cjming: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:16:10] <stashbot>	 T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234
[21:16:11] <logmsgbot>	 !log cjming@deploy1002 cjming: Continuing with sync
[21:22:34] <wikibugs>	 (PS1) Herron: prom-ipmi-exporter: add sel-events collector [puppet] - https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088)
[21:23:16] <wikibugs>	 (CR) CI reject: [V:-1] prom-ipmi-exporter: add sel-events collector [puppet] - https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: Herron)
[21:23:44] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]] (duration: 28m 16s)
[21:23:47] <stashbot>	 T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234
[21:24:22] <cjming>	 !log end of UTC late backport window
[21:24:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:12] <wikibugs>	 (PS2) Herron: prom-ipmi-exporter: add sel-events collector [puppet] - https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088)
[21:27:30] <wikibugs>	 (PS1) Ahmon Dancy: gitlab::runner: Add buildkitd to no_proxy [puppet] - https://gerrit.wikimedia.org/r/1051208
[21:27:53] <wikibugs>	 (CR) CI reject: [V:-1] gitlab::runner: Add buildkitd to no_proxy [puppet] - https://gerrit.wikimedia.org/r/1051208 (owner: Ahmon Dancy)
[21:28:15] <sbassett>	 cjming: Looking good?  sec.team actually has a couple of patches to go out today.
[21:28:33] <cjming>	 all good and all yours!
[21:29:11] <wikibugs>	 (PS2) Ahmon Dancy: gitlab::runner: Add buildkitd to no_proxy [puppet] - https://gerrit.wikimedia.org/r/1051208
[21:30:59] <dancy>	 sbassett: Watch out for a hanging "syncing masters" deployment phase.  If this happens to you, start another shell and kill any hanging ssh process for deploy1003.
[21:31:35] <sbassett>	 dancy: Ok.  Are other deploys ok?  I’m on 1002…
[21:31:40] <dancy>	 xref https://phabricator.wikimedia.org/T364417
[21:31:47] <sbassett>	 Cc mstyles ^^
[21:32:00] <zabe>	 !log zabe@mwmaint1002:/tmp/upload$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --sleep=3600 --user=Yann . # T368703
[21:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:03] <stashbot>	 T368703: Server side upload for Yann - https://phabricator.wikimedia.org/T368703
[21:32:53] <dancy>	 sbassett: I'm not sure what you mean by other deploys.  
[21:33:06] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051208 (owner: Ahmon Dancy)
[21:33:26] <sbassett>	 dancy: It looked to be an issue with deploy1003?  Or is it all of the deploy hosts?
[21:33:52] <dancy>	 deployments from deploy1002 will be affected by the fact that deploy1003 is listed in /etc/dsh/group/scap-masters even though it's not ready.
[21:34:03] <sbassett>	 Oh ok
[21:34:48] <wikibugs>	 (PS2) Fabfur: benthos:cache: encode problematic fields as hex [puppet] - https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718)
[21:36:08] <wikibugs>	 (CR) Krinkle: Handle sso.wikimedia.org domain (6 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[21:36:35] <wikibugs>	 (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1051208/3865/" [puppet] - https://gerrit.wikimedia.org/r/1051208 (owner: Ahmon Dancy)
[21:46:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:47:31] <wikibugs>	 (CR) Cwhite: [C:+2] beta-logs: remove dlq spam mitigation [puppet] - https://gerrit.wikimedia.org/r/1051200 (owner: Cwhite)
[21:47:53] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:55:42] <maryum>	 !log deployed patch for T366991
[21:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1043920328 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:58:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1089-1090,1104].eqiad.wmnet with reason: T348977
[21:58:48] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[21:58:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1089-1090,1104].eqiad.wmnet with reason: T348977
[21:59:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1089*,elastic1090*,elastic1104* for T348977 - bking@cumin2002
[21:59:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1089*,elastic1090*,elastic1104* for T348977 - bking@cumin2002
[21:59:55] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:10:38] <logmsgbot>	 !log sbassett@deploy1002 Synchronized private/PrivateSettings.php: Un-deployed a PS.php mitigation for T341908 (duration: 07m 24s)
[22:15:33] <wikibugs>	 (Abandoned) Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949) (owner: Clare Ming)
[22:47:06] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:47:09] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[22:47:14] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9942633 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye completed: - cloudcep...
[22:48:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9942634 (Jclark-ctr)
[22:49:37] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:49:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9942635 (Jclark-ctr) Open→Resolved
[22:50:15] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:05] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:27] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:54:25] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1038
[22:54:27] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1038
[22:58:02] <wikibugs>	 (PS1) Cwhite: logstash: route thumbor logs in routing filter [puppet] - https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180)
[23:01:45] <wikibugs>	 (PS6) Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T365509)
[23:02:26] <wikibugs>	 (PS5) Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[23:02:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:02:46] <wikibugs>	 (CR) Jdlrobson: [C:-1] "I need to confirm the stage 1 wikis - seems we overlooked an issue when defining those groups." [mediawiki-config] - https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: Jdlrobson)
[23:02:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942655 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:05:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:05:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942656 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:12:02] <wikibugs>	 (PS1) Cwhite: logstash: remove ecs gating from kubernetes_docker filter [puppet] - https://gerrit.wikimedia.org/r/1051215 (https://phabricator.wikimedia.org/T314381)
[23:12:03] <wikibugs>	 (PS1) Andrew Bogott: Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905)
[23:17:51] <wikibugs>	 (PS2) Andrew Bogott: Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905)
[23:18:05] <wikibugs>	 (CR) BryanDavis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905) (owner: Andrew Bogott)
[23:19:20] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[23:22:08] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[23:25:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye
[23:25:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[23:25:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye
[23:25:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942691 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[23:34:02] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[23:36:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[23:38:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1051218
[23:38:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1051218 (owner: TrainBranchBot)
[23:39:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:40:58] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:41:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:41:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye comp...
[23:43:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942746 (Jclark-ctr)
[23:45:30] <wikibugs>	 (CR) Scott French: "One maybe-typo and one question. Otherwise LGTM." [puppet] - https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: RLazarus)
[23:47:59] <wikibugs>	 (CR) Scott French: [C:+1] mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: RLazarus)
[23:51:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[23:51:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
[23:54:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:54:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[23:55:15] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:55:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:55:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942761 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye comp...
[23:57:50] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
[23:59:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65605 and previous config saved to /var/cache/conftool/dbconfig/20240701-235941-marostegui.json
[23:59:44] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069