[00:00:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1050810 (owner: 10TrainBranchBot) [00:04:15] RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:41] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:25:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 127 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:35:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 74 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:46:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:47:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:58:51] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:58:52] ACKNOWLEDGEMENT - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T368866 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [00:59:00] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T368866 (10ops-monitoring-bot) 03NEW [01:03:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:03:51] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:05:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:05:41] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:07:27] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 352.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:10:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:14:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:15:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:19:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:28:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:07:41] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 48.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:09:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:31] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [02:15:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:31] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [02:45:33] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:33] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:32] (03PS2) 10David Martin: Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) [02:59:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:28] (03CR) 10David Martin: "The table creation patch was successfully deployed and is operating correctly on production. 
Would be very helpful to get this patch land" [puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: 10David Martin) [03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:07:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:16:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:19:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:20:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:54:27] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:58:27] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 27.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:15:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install 
sretest2001 - https://phabricator.wikimedia.org/T365167#9938078 (10Papaul) @elukey we received last week a temporary license from SuperMicro to test out Redfish, I uploaded the license to the server, you can test and let me k... [04:19:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:20:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:32:41] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:36:35] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1050341 (owner: 10L10n-bot) [04:49:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [04:49:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [04:49:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:49:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65556 and previous config saved to /var/cache/conftool/dbconfig/20240701-044945-marostegui.json [04:49:48] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:50:44] !log dbmaint eqiad Rebuild pagelinks table on s8 master T364069 [04:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:51:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494 [04:51:55] T368494: Switchover m2 master db1195 -> db1228 - https://phabricator.wikimedia.org/T368494 [04:52:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494 [04:54:39] (03PS1) 10Marostegui: mariadb: Promote db1228 to m2 master [puppet] - 
10https://gerrit.wikimedia.org/r/1050814 (https://phabricator.wikimedia.org/T368494) [04:55:44] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1228 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/1050814 (https://phabricator.wikimedia.org/T368494) (owner: 10Marostegui) [04:56:42] !log Failover m2 from db1195 to db1228 - T368494 [04:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:41] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 19.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:01:24] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9938126 (10Marostegui) [05:02:01] (03PS1) 10Marostegui: db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1050815 [05:02:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1195.eqiad.wmnet with reason: Reboot [05:02:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: Reboot [05:02:45] (03CR) 10Marostegui: [C:03+2] db1195: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1050815 (owner: 10Marostegui) [05:24:16] (03CR) 10Ayounsi: [C:03+1] "All seem reasonable and/or needed to me !" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney) [05:33:38] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:33:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:36:28] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:36:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52339 bytes in 0.209 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:55:22] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 155440808 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:56:22] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 35696 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:57:15] (03PS5) 10Ayounsi: Homer: fix Netbox 4 breaking changes [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) [05:57:24] (03CR) 10Ayounsi: Homer: fix Netbox 4 breaking changes (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: 
PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:24:36] (03CR) 10Ayounsi: "The cookbook sends a long cli string or commands separated by semi colons. I worry that we will hit some limit at some point sending too m" [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [06:25:26] (03CR) 10Ayounsi: [C:03+1] Add class-of-service scheduler and classifiers plus var to control [homer/public] - 10https://gerrit.wikimedia.org/r/1049917 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [06:33:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [06:33:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [06:33:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65557 and previous config saved to /var/cache/conftool/dbconfig/20240701-063344-marostegui.json [06:33:48] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:35:30] (03PS1) 10Marostegui: mariadb: Move db1195 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/1050943 (https://phabricator.wikimedia.org/T368871) [06:36:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T368871', diff saved to https://phabricator.wikimedia.org/P65558 and previous config saved to /var/cache/conftool/dbconfig/20240701-063601-root.json [06:36:05] T368871: Move db1195 to s1 - https://phabricator.wikimedia.org/T368871 [06:36:41] (03CR) 10Jelto: [C:03+1] "lgtm" 
[puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: 10Dzahn) [06:36:45] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1169.eqiad.wmnet onto db1195.eqiad.wmnet [06:38:59] (03CR) 10Marostegui: [C:03+2] mariadb: Move db1195 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/1050943 (https://phabricator.wikimedia.org/T368871) (owner: 10Marostegui) [06:41:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:50:14] (03CR) 10Ayounsi: [C:03+1] "a few nits, but overall lgtm. Could be worth another reviewer at least for the python side of things." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [06:59:47] (03PS1) 10Marostegui: instances.yaml: Add db1195 [puppet] - 10https://gerrit.wikimedia.org/r/1050946 (https://phabricator.wikimedia.org/T368871) [07:00:05] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T0700). nyaa~ [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:00:48] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db1195 [puppet] - 10https://gerrit.wikimedia.org/r/1050946 (https://phabricator.wikimedia.org/T368871) (owner: 10Marostegui) [07:02:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Place db1195 in s1 T368871', diff saved to https://phabricator.wikimedia.org/P65559 and previous config saved to /var/cache/conftool/dbconfig/20240701-070243-marostegui.json [07:02:47] T368871: Move db1195 to s1 - https://phabricator.wikimedia.org/T368871 [07:07:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:08:24] (03CR) 10Urbanecm: [C:03+1] "LGTM now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [07:12:56] (03PS1) 10Slyngshede: data.yaml: Offboarding of akhatun [puppet] - 10https://gerrit.wikimedia.org/r/1050949 [07:18:46] (03PS1) 10Slyngshede: data.yaml: Offboarding for febinbellamy [puppet] - 10https://gerrit.wikimedia.org/r/1050951 [07:19:28] (03CR) 10CI reject: [V:04-1] data.yaml: Offboarding for febinbellamy [puppet] - 10https://gerrit.wikimedia.org/r/1050951 (owner: 10Slyngshede) [07:21:03] (03PS2) 10Slyngshede: data.yaml: Offboarding for febinbellamy [puppet] - 10https://gerrit.wikimedia.org/r/1050951 [07:25:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:27:20] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 12401 MB (5% inode=73%): /tmp 12401 MB (5% inode=73%): /var/tmp 12401 MB (5% inode=73%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [07:29:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:36:20] (03CR) 10Brouberol: "l" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [07:41:48] (03CR) 10Brouberol: [C:03+1] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [07:44:51] !log `apt-get clean` on build2001 to free some space in the root partition [07:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:00] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) (owner: 10Scott French) [07:49:27] (03CR) 10Brouberol: dse-k8s-services: Add net-new chart for Airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [07:58:44] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:06:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1169.eqiad.wmnet onto db1195.eqiad.wmnet [08:10:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:13:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65560 and previous config saved to /var/cache/conftool/dbconfig/20240701-081307-root.json [08:13:44] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:13:57] (03PS1) 10Marostegui: db1195: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1051059 (https://phabricator.wikimedia.org/T368871) [08:14:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:14:21] (03CR) 10Marostegui: [C:03+2] db1195: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1051059 (https://phabricator.wikimedia.org/T368871) (owner: 10Marostegui) [08:15:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65561 and previous config saved to /var/cache/conftool/dbconfig/20240701-081514-root.json [08:18:12] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool es1025 for backups T363812', diff saved to https://phabricator.wikimedia.org/P65562 and previous config saved to /var/cache/conftool/dbconfig/20240701-081811-jynus.json [08:18:15] T363812: Setup backups for es6, es7 and archive old read only backups - https://phabricator.wikimedia.org/T363812 [08:21:18] (03PS1) 10Urbanecm: JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) [08:28:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65563 and previous config saved to /var/cache/conftool/dbconfig/20240701-082813-root.json [08:28:18] jouncebot: nowandnext [08:28:18] No deployments scheduled for the next 1 hour(s) and 31 minute(s) [08:28:18] In 1 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000) [08:28:29] (03CR) 10Urbanecm: [C:03+2] JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: 10Urbanecm) [08:30:20] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db1169 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65564 and previous config saved to /var/cache/conftool/dbconfig/20240701-083020-root.json [08:36:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: 10Urbanecm) [08:36:17] 10SRE-tools, 10conftool, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9938525 (10ABran-WMF) [08:36:31] 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9938526 (10ABran-WMF) [08:36:34] (03CR) 10Btullis: [C:03+2] cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [08:37:31] (03CR) 10JMeybohm: [C:03+1] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:38:11] (03CR) 10JMeybohm: [C:03+1] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:39:52] (03CR) 10JMeybohm: api,rest-gateway: upgrade Envoy version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:39:55] (03Merged) 10jenkins-bot: cephcsi: Use fsGroup 900 to allow /csi/csi.sock to be shared [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050648 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [08:40:09] (03CR) 10JMeybohm: [C:03+1] admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [08:40:47] (03PS27) 10DCausse: wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [08:40:47] (03PS7) 10DCausse: wdqs: enable throttling only for requests coming from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [08:43:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65565 and previous config saved to /var/cache/conftool/dbconfig/20240701-084318-root.json [08:44:16] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:45:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65566 and previous config saved to /var/cache/conftool/dbconfig/20240701-084525-root.json [08:45:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:51:20] (03Merged) 10jenkins-bot: JsonSchemaValidator: Measure duration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051060 (https://phabricator.wikimedia.org/T365245) (owner: 10Urbanecm) [08:51:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]] [08:51:58] T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245 [08:54:57] (03CR) 
10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [08:56:34] (03PS1) 10Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) [08:56:35] PROBLEM - MariaDB disk space #page on es1025 is CRITICAL: DISK CRITICAL - /run/credentials/systemd-tmpfiles-clean.service is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:56:55] ^ marostegui [08:56:58] doh [08:57:07] it is depooled, I am backing it up [08:57:12] ack! [08:57:18] but why did it alert? [08:57:35] RECOVERY - MariaDB disk space #page on es1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:57:36] ah ok false alarm [08:57:55] it seems to be a special filesystem under /run [08:58:04] no real "disk", right? [08:58:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65567 and previous config saved to /var/cache/conftool/dbconfig/20240701-085824-root.json [08:58:34] I am going to downtime it [08:58:41] ^ slyngs fabfur [08:58:44] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 356.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:58:48] as it is depooled [08:58:57] Thanks [08:59:16] That's weird [08:59:32] although feel free to follow up on the disk alert generalities [08:59:43] What would be the reason that filled up? [08:59:48] I don't get it [08:59:55] I think it is a script failure [09:00:16] possibly a bug (?) 
[09:00:21] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:00:22] of the alerting [09:00:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65568 and previous config saved to /var/cache/conftool/dbconfig/20240701-090031-root.json [09:00:33] may need debugging, I'm in a meeting, will check later [09:00:59] maybe a race condition [09:01:26] don't worry I will check now [09:04:07] (03CR) 10CI reject: [V:04-1] REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [09:06:14] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:06:17] T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245 [09:06:21] !log urbanecm@deploy1002 urbanecm: Continuing with sync [09:08:22] (03PS7) 10Clément Goubert: envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) [09:08:38] (03PS1) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) [09:09:41] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:12:32] jouncebot: nowandnext [09:12:32] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [09:12:32] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000) 
[09:13:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65569 and previous config saved to /var/cache/conftool/dbconfig/20240701-091329-root.json [09:14:10] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1051060|JsonSchemaValidator: Measure duration (T365245)]] (duration: 22m 15s) [09:14:12] T365245: Benchmark validation usages - https://phabricator.wikimedia.org/T365245 [09:15:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65570 and previous config saved to /var/cache/conftool/dbconfig/20240701-091536-root.json [09:15:56] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:16:04] dcausse: fwiw I was deploying, but I'm now done. [09:16:27] (03CR) 10Jelto: [C:03+2] gerrit: enable "new" image diff UI [puppet] - 10https://gerrit.wikimedia.org/r/1050614 (https://phabricator.wikimedia.org/T341291) (owner: 10Hashar) [09:16:34] urbanecm: ok thanks :) [09:16:49] urbanecm, dcausse: I'd like to deploy a core patch... what are you up to? This one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1051076 [09:17:54] duesen: was about to deploy something not directly related to MW, feel free to go ahead [09:18:30] * urbanecm doesn't have anything else to deploy [09:20:06] dcausse: actually I realized that I have to run an errand now, I'll do it in a couple of hours. 
[09:20:17] ack [09:20:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:22:25] (03CR) 10JMeybohm: [C:03+1] "Seems reasonable :)" [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:23:41] (03CR) 10Clément Goubert: [C:03+2] envoy: Ensure legacy listeners point to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047447 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [09:25:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:26:24] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[09:28:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65572 and previous config saved to /var/cache/conftool/dbconfig/20240701-092835-root.json [09:30:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65573 and previous config saved to /var/cache/conftool/dbconfig/20240701-093042-root.json [09:36:51] (03PS1) 10Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) [09:38:18] (03CR) 10Brouberol: [C:03+1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:40:12] (03CR) 10CI reject: [V:04-1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:40:26] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9938810 (10Volans) [09:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65574 and previous config saved to /var/cache/conftool/dbconfig/20240701-094050-marostegui.json [09:40:53] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:41:41] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [09:41:45] (03CR) 10Btullis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 
(https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:42:19] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9938825 (10Volans) Pending @leila 's approval. [09:42:36] (03PS1) 10Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 [09:42:49] (03CR) 10CI reject: [V:04-1] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [09:42:51] (03PS2) 10Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 [09:42:53] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051086 (owner: 10Clément Goubert) [09:43:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65575 and previous config saved to /var/cache/conftool/dbconfig/20240701-094341-root.json [09:45:28] (03CR) 10Filippo Giunchedi: "See inline re: dashboard links, other than that LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [09:45:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65576 and previous config saved to /var/cache/conftool/dbconfig/20240701-094547-root.json [09:45:55] (03CR) 10Filippo Giunchedi: [C:03+2] frack: Remove old ingenico/globalcollect job checks [puppet] - 10https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) (owner: 10Dwisehaupt) [09:46:20] (03PS3) 10Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 [09:46:20] (03CR) 10Clément Goubert: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1051086 (owner: 10Clément Goubert) [09:46:53] (03PS2) 10Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) [09:49:38] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1049104 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [09:49:53] (03CR) 10CI reject: [V:04-1] cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:53:01] 06SRE, 06Infrastructure-Foundations, 10netops: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts - https://phabricator.wikimedia.org/T368513#9938867 (10fgiunchedi) Those are SSH probes from local prometheus hosts indeed, in this case the probe consists of a TCP conne... 
[09:55:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P65577 and previous config saved to /var/cache/conftool/dbconfig/20240701-095557-marostegui.json [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000) [10:00:24] (03PS1) 10Clément Goubert: fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) [10:02:10] (03CR) 10DCausse: [C:03+1] fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:02:20] (03CR) 10Filippo Giunchedi: [C:03+1] Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney) [10:02:21] (03CR) 10Clément Goubert: [C:03+2] fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:02:58] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9938917 (10fgiunchedi) >>! In T326322#9934200, @cmooney wrote: > @fgiunchedi I was perhaps a little cheeky and merged this, but it was c... 
[10:03:08] (03Merged) 10jenkins-bot: fixtures: Remove mw-api-int-async-transition listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051087 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:03:59] (03CR) 10Filippo Giunchedi: [C:03+1] pontoon: Remove more puppet 5 leftovers [puppet] - 10https://gerrit.wikimedia.org/r/1047502 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:05:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:27] (03CR) 10Cathal Mooney: [C:03+2] Change gnmi sampling interval and enable timestamps for prom output [puppet] - 10https://gerrit.wikimedia.org/r/1050598 (https://phabricator.wikimedia.org/T326322) (owner: 10Cathal Mooney) [10:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P65578 and previous config saved to /var/cache/conftool/dbconfig/20240701-101104-marostegui.json [10:14:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9938962 (10elukey) >>! In T365167#9938078, @Papaul wrote: > @elukey we received last week a temporally license from SuperMicro to test out Redflish, I upload the license... 
[10:16:48] (03CR) 10Elukey: api,rest-gateway: upgrade Envoy version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:18:55] (03PS2) 10Elukey: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) [10:18:56] (03PS2) 10Elukey: admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) [10:18:56] (03PS2) 10Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) [10:18:57] (03PS2) 10Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) [10:19:43] (03PS3) 10Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) [10:20:12] (03PS4) 10Clément Goubert: service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 [10:20:15] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051086 (owner: 10Clément Goubert) [10:23:38] !log upgrading A:cp-drmrs to haproxy 2.8.10 (T367756) [10:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:41] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [10:25:36] (03CR) 10Elukey: "Folks can you add a bit more info about why you need this configuration in the commit msg? 
I get the socket creation, but I am wondering i" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [10:26:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T364069)', diff saved to https://phabricator.wikimedia.org/P65579 and previous config saved to /var/cache/conftool/dbconfig/20240701-102611-marostegui.json [10:26:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [10:26:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:26:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [10:26:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65580 and previous config saved to /var/cache/conftool/dbconfig/20240701-102633-marostegui.json [10:29:57] PROBLEM - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table dewiki.archive: Index for table archive is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1161-bin.002587, end_log_pos 631041457 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:30:52] ^ i will get that [10:31:21] (03CR) 10Elukey: [C:03+2] admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:34:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:29] (03Merged) 10jenkins-bot: admin_ng: upgrade coredns to 1.8.7-2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050568 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:35:05] ACKNOWLEDGEMENT - MariaDB Replica SQL: s5 on db1154 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table dewiki.archive: Index for table archive is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1161-bin.002587, end_log_pos 631041457 Marostegui working on it https://wikitech.wikimedia.org/wiki/MariaDB/troubleshoo [10:35:05] epooling_a_replica [10:36:20] (03CR) 10Giuseppe Lavagetto: [C:03+1] service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 (owner: 10Clément Goubert) [10:36:51] (03CR) 10Clément Goubert: [C:03+2] service::node::config::scap3: Fix scap deploy-local call [puppet] - 10https://gerrit.wikimedia.org/r/1051086 (owner: 10Clément Goubert) [10:36:57] RECOVERY - MariaDB Replica SQL: s5 on db1154 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:37:34] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:37:43] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [10:37:51] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [10:38:02] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [10:38:27] I'd like to deploy a core patch in about half an hour, any objections? 
This one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1051076 [10:38:33] PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 642.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:38:35] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 643.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:39:19] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs [10:39:27] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs [10:39:30] (03CR) 10Elukey: [C:03+2] admin_ng: upgrade cfssl-issuer's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050569 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [10:41:11] marostegui, fabfur, _joe_: is it ok if I deploy a core patch in about half an hour? I'd hit +2 on it now, so it can go through CI. [10:41:45] I can also wait for the backport window, but I'd prefer to get this out of the way early. 
[10:42:33] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:43:06] !log running puppet on maps servers [10:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:35] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:44:02] (03PS2) 10Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) [10:44:05] duesen: fine by me :) [10:44:32] (03CR) 10Daniel Kinzler: [C:03+2] "prepare for backport deployment" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [10:46:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:46:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:47:02] !log running /usr/local/bin/apply-config-kartotherian on maps-replica [10:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:15] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:48:30] <_joe_> duesen: I think you're good to go, but next time ask in #serviceops [10:49:27] !log running /usr/local/bin/apply-config-kartotherian on maps-master [10:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:32] _joe_: ok. I keep getting confused about where to ask. tech, sre, serviceops, operations... 
[10:49:49] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:50:34] _joe_: Should I add this to https://wikitech.wikimedia.org/wiki/How_to_deploy_code ? It currently says "Join the IRC channels #wikimedia-operations connect and #wikimedia-tech connect on libera.chat and be available before and after all changes." [10:51:03] <_joe_> duesen: well this isn't a change in a backport window [10:51:16] <_joe_> jouncebot: nowandnext [10:51:16] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1000) [10:51:16] In 2 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1300) [10:51:28] <_joe_> the mediawiki infra window is managed by serviceops [10:51:32] <_joe_> maybe we should clarify that [10:51:35] <_joe_> :) [10:52:45] _joe_: Maybe each block in https://wikitech.wikimedia.org/wiki/Deployments should mention the associated IRC channel? [10:52:47] (03CR) 10Volans: "Until T354410 is resolved 3.12 can't be officially supported" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [10:54:16] (03PS3) 10Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) [10:54:49] (03CR) 10Daniel Kinzler: [C:03+2] "once again, after fixing a missing constant in tests" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [10:57:19] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[11:00:20] (03CR) 10CI reject: [V:04-1] REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [11:01:26] (03PS4) 10Daniel Kinzler: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) [11:01:56] grrr, I'm having trouble getting this to pass CI, for silly reasons >:( [11:02:38] (03CR) 10Daniel Kinzler: [C:03+2] "grr, once again..." [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [11:02:41] (03PS2) 10Clément Goubert: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 [11:03:17] (03CR) 10Clément Goubert: team-sre/redis: Alert on replica down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [11:05:30] (03PS3) 10Clément Goubert: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 [11:07:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:11:14] (03PS4) 10Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) [11:11:40] (03CR) 10Ayounsi: "I worked around the Tox issue locally by installing `kafka-python-ng`, noted that this patch is only to have tox working, not for spicerac" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1050452 (owner: 10Ayounsi) [11:15:17] (03PS1) 10Jelto: sre.gitlab.upgrade: lock backups during upgrades [cookbooks] - 10https://gerrit.wikimedia.org/r/1051107 
(https://phabricator.wikimedia.org/T367501) [11:18:35] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9939276 (10Sfaci) Hi @Scott_French! Thanks for your suggestion!. Just for cu... [11:19:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [11:22:09] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9939303 (10SGupta-WMF) @Scott_French I have updated the repo , and tagged the... [11:27:58] !log btullis@cumin1002 START - Cookbook sre.wikireplicas.update-views [11:29:10] !log btullis@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) [11:30:08] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs [11:31:28] (03Merged) 10jenkins-bot: REST: detect mismatching value types in json request [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051076 (https://phabricator.wikimedia.org/T305973) (owner: 10Daniel Kinzler) [11:32:51] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1050949 (owner: 10Slyngshede) [11:33:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs [11:34:33] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1050951 (owner: 10Slyngshede) [11:35:25] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding of akhatun [puppet] - 10https://gerrit.wikimedia.org/r/1050949 (owner: 10Slyngshede) [11:37:26] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding for febinbellamy [puppet] - 
10https://gerrit.wikimedia.org/r/1050951 (owner: 10Slyngshede) [11:37:28] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [11:37:35] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [11:39:12] !log daniel@deploy1002 Started scap: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]] [11:39:15] T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973 [11:40:03] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [11:41:11] !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 2188 hosts [11:41:56] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 2188 hosts [11:43:01] !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging FebinBellamy out of all services on: 2188 hosts [11:43:12] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [11:43:43] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging FebinBellamy out of all services on: 2188 hosts [11:45:07] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:45:28] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:46:09] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [11:49:05] !log klausman@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[11:51:59] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [11:55:44] (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) [11:56:36] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [11:58:09] (03PS2) 10DCausse: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) [11:58:14] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:58:40] (03PS1) 10JMeybohm: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) [11:59:44] (03PS2) 10JMeybohm: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) [12:00:37] !log daniel@deploy1002 daniel: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:00:47] T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973 [12:01:05] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2026.codfw.wmnet with OS bullseye [12:01:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [12:03:19] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[12:04:23] !log daniel@deploy1002 daniel: Continuing with sync [12:05:33] (03CR) 10DCausse: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [12:05:50] (03PS4) 10Filippo Giunchedi: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:05:58] (03CR) 10JMeybohm: [C:03+2] cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:06:10] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:06:29] (03CR) 10Filippo Giunchedi: [C:03+1] team-sre/redis: Alert on replica down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:06:40] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:13] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:08:59] (03Merged) 10jenkins-bot: cfssl-issuer: Add container securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051111 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:09:27] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:12:00] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1051076|REST: detect mismatching value types in json request (T305973)]] (duration: 32m 48s) [12:12:03] T305973: JsonBodyValidator does not validate the parameter types - https://phabricator.wikimedia.org/T305973 [12:13:55] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9939480 (10dcaro) With the current data, you can start observing that `cloudcephosd1034-sdh` (the new drive that has... [12:14:03] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:16:44] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:17:51] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:18:09] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:19:01] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:20:36] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:21:16] (03PS1) 10Daniel Kinzler: Revert "REST: detect mismatching value types in json request" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051119 [12:21:26] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:21:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:22:58] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:23:03] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[12:24:58] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] hieradata: remove thanos-query settings from thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [12:27:33] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:28:33] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:29:46] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:30:47] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:31:15] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:31:31] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:32:09] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:32:22] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:32:32] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:32:36] (03PS5) 10Clément Goubert: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 [12:32:42] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [12:32:45] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:33:05] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:33:31] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:33:33] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:33:45] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:34:14] !log jayme@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:34:26] (03CR) 10Clément Goubert: team-sre/redis: Alert on replica down (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:35:02] !log jayme@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:35:05] !log jayme@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:35:21] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:35:24] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:35:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [12:35:32] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:35:34] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:39:56] !log Running update-netboot-image bullseye for 11.10 release [12:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:41] (03CR) 10Filippo Giunchedi: "See inline, depends on mysqld-exporter version" [puppet] - 10https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [12:41:30] (03Abandoned) 10Daniel Kinzler: Revert "REST: detect mismatching value types in json request" [core] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051119 (owner: 10Daniel Kinzler) [12:41:39] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051129 [12:43:15] (03CR) 10Filippo Giunchedi: "Nice! 
LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/1047983 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [12:47:03] (03PS1) 10Fabfur: hiera: removed unused haproxy28 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) [12:47:51] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1051110 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:48:13] (03CR) 10Filippo Giunchedi: mariadb: monitoring memory pressure (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049159 (https://phabricator.wikimedia.org/T367280) (owner: 10Arnaudb) [12:48:40] (03PS1) 10Jgiannelos: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131 [12:48:53] (03CR) 10Filippo Giunchedi: mariadb: add monitoring on io pressure for mariadb hosts (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [12:49:09] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru [12:49:10] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru [12:49:11] (03PS2) 10Jgiannelos: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131 [12:49:18] !log upgrading A:cp-magru to haproxy 2.8.10 (T367756) [12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:21] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [12:49:32] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:50:53] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Bump latest image on staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1051131 (owner: 10Jgiannelos) [12:51:12] !log Running update-netboot-image bullseye for 11.10 release on puppetserver1001 [12:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:37] (03CR) 10Filippo Giunchedi: [C:03+1] team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:51:49] (03Merged) 10jenkins-bot: mobileapps: Bump latest image on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051131 (owner: 10Jgiannelos) [12:54:10] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [12:54:22] (03PS3) 10Elukey: api,rest-gateway: upgrade Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) [12:54:23] (03PS3) 10Elukey: admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) [12:54:23] (03PS1) 10Elukey: admin_ng: remove coredns image tag override for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366) [12:54:36] !log cgoubert@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2026.codfw.wmnet with OS bullseye [12:55:01] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051077 (https://phabricator.wikimedia.org/T331127) (owner: 10DCausse) [12:55:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [12:55:24] 10SRE-tools, 10conftool, 06DBA, 06Infrastructure-Foundations, 10Spicerack: Spicerack support for dbctl - https://phabricator.wikimedia.org/T362893#9939575 
(10ABran-WMF) p:05Low→03Medium [12:55:49] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:56:05] (03CR) 10Filippo Giunchedi: [C:03+2] "Please remove the related probe at hieradata/common/profile/prometheus/ops.yaml too:" [puppet] - 10https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) (owner: 10Dwisehaupt) [12:56:12] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:56:16] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:56:25] (03CR) 10Clément Goubert: [C:03+2] team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:56:38] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:57:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:57:29] (03PS3) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [12:57:36] (03Merged) 10jenkins-bot: team-sre/redis: Alert on replica down [alerts] - 10https://gerrit.wikimedia.org/r/1046608 (owner: 10Clément Goubert) [12:57:52] (03PS1) 10JMeybohm: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) [12:58:47] (03CR) 10JMeybohm: [C:03+1] api,rest-gateway: upgrade Envoy version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: UTC afternoon 
backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1300). Please do the needful. [13:00:05] MatmaRex and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] hi [13:00:21] i can deploy today [13:00:32] unless Lucas_WMDE wants to :) [13:01:02] (03CR) 10Urbanecm: [C:03+2] FixTrailingWhitespaceIds: Don't crash on complex conflicts [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) (owner: 10Bartosz Dziewoński) [13:01:37] MatmaRex: i assume i need to backport the things before running them :D [13:01:49] (03PS2) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134 [13:01:54] urbanecm: yes please [13:01:58] will do [13:02:34] (03CR) 10JMeybohm: "I was panning on flipping the switch my morning tomorrow (maybe before the backport window). That way the train run tomorrow will run with" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [13:02:38] o/ [13:02:41] sorry for the delay [13:03:03] (03PS1) 10Urbanecm: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862) [13:03:28] Lucas_WMDE: no worries. til about https://wikitech.wikimedia.org/wiki/Update_the_interwiki_cache, we should update it to match reality :) [13:03:44] oh that sounds promising -.- [13:04:53] Lucas_WMDE: this is what _actually_ happens ;) https://www.irccloud.com/pastebin/Et9c7FR1/ [13:05:14] would you mind helping with updating it while i do the backport itself? 
[13:05:22] (03CR) 10Urbanecm: [C:03+2] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862) (owner: 10Urbanecm) [13:05:28] o_O [13:05:37] and you had to run all those commands manually? [13:05:44] yeah... [13:05:55] is `git push-for-review` a standard command on the deployment servers? I don’t think I’ve seen it before [13:05:56] it _used_ to automagically create the patch for you [13:06:02] (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051135 (https://phabricator.wikimedia.org/T368862) (owner: 10Urbanecm) [13:06:04] but not anymore [13:06:06] (but I’m guessing it’s used by scap backport --revert?) [13:06:32] Lucas_WMDE: oh, sorry. that's my .gitconfig alias. it stands for `git push origin HEAD:refs/for/master` [13:06:42] ok [13:06:54] ah, now I see the username+password below, I skipped over that [13:06:58] ok so no magic there [13:07:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1051135|Update interwiki map (T368862)]] [13:07:04] yep [13:07:06] (03PS3) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134 [13:07:07] T368862: Please run maintenance task "scap update-interwiki-cache" (30 June 2024) - https://phabricator.wikimedia.org/T368862 [13:07:13] just some gitfu [13:08:09] Lucas_WMDE: fwiw, T247107 is why it no longer autocommits anything [13:08:10] T247107: Make 'scap update-interwiki-cache' less scary - https://phabricator.wikimedia.org/T247107 [13:09:21] (03Merged) 10jenkins-bot: FixTrailingWhitespaceIds: Don't crash on complex conflicts [extensions/DiscussionTools] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1050406 (https://phabricator.wikimedia.org/T356196) (owner: 10Bartosz Dziewoński) [13:09:45] “This however is not documented anywhere” [13:09:46] well [13:09:56] except on the wikitech page which is now badly outdated [13:10:00] 
!log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1051135|Update interwiki map (T368862)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:15] I’ll see if I can update that [13:10:54] !log urbanecm@deploy1002 urbanecm: Continuing with sync [13:10:56] thanks [13:11:36] do you know if there’s an existing task to further improve update-interwiki-cache? [13:11:48] because at least as I see it now it still seems far from ideal [13:11:52] (03CR) 10Elukey: [C:03+2] admin_ng: remove coredns image tag override for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [13:11:59] (03CR) 10Vgutierrez: [C:03+1] hiera: removed unused haproxy28 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [13:11:59] (03CR) 10Elukey: [C:03+2] api,rest-gateway: upgrade Envoy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050570 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [13:12:01] it silently updates the production config and leaves you alone to figure out what to do with that [13:12:06] (03CR) 10Elukey: [C:03+2] admin_ng: update helm-state-metrics' Docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1050571 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [13:12:26] it would be nice if you could run it against your local config checkout outside deploy1002 but that’s not gonna happen while it’s still part of scap [13:13:26] Lucas_WMDE: technically you can run `extensions/WikimediaMaintenance/dumpInterwiki.php` locally. not the easiest, as it assumes certain things (like MEDIAWIKI_DEPLOYMENT_DIR) exists, but... 
[13:13:47] but the scap part doesn't really do much more than that [13:14:19] no idea if we have a task tho [13:14:22] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [13:15:43] ok, thanks [13:16:05] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1051135|Update interwiki map (T368862)]] (duration: 09m 01s) [13:16:07] T368862: Please run maintenance task "scap update-interwiki-cache" (30 June 2024) - https://phabricator.wikimedia.org/T368862 [13:16:20] okay, interwiki done [13:16:35] now the second part [13:16:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1050406|FixTrailingWhitespaceIds: Don't crash on complex conflicts (T356196)]] [13:16:52] T356196: Auto triming of internal links is breaking anchors if the last character is a space - https://phabricator.wikimedia.org/T356196 [13:17:21] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru [13:17:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [13:19:20] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru [13:21:46] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:21:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:24:05] urbanecm: can I link that irccloud snippet in the documentation? [13:24:11] Lucas_WMDE: sure [13:24:13] (idk how long irccloud keeps snippets) [13:24:13] ok [13:24:29] Lucas_WMDE: but it might be wiser to copy it. 
just in case irccloud deletes it [13:24:52] * Lucas_WMDE looks for a collapse template [13:25:34] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050406|FixTrailingWhitespaceIds: Don't crash on complex conflicts (T356196)]] (duration: 08m 46s) [13:25:37] T356196: Auto triming of internal links is breaking anchors if the last character is a space - https://phabricator.wikimedia.org/T356196 [13:26:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:26:12] MatmaRex: okay, script backported. do you want me to run it for one wiki first for testing? [13:26:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:26:41] (03PS2) 10Alexandros Kosiaris: deploy1003: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416) [13:26:41] urbanecm: you could, it won't hurt [13:26:44] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [13:26:54] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [13:27:03] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [13:27:12] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [13:27:29] (it's a logged update so it will skip that wiki automatically later if you run it with `foreachwiki`) [13:27:39] urbanecm: updated https://wikitech.wikimedia.org/wiki/Update_the_interwiki_cache [13:27:51] MatmaRex: looks like it works. anything to verify before running it all? https://www.irccloud.com/pastebin/UDn1Bbob/ [13:28:24] (03PS3) 10Clare Ming: extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) [13:29:05] urbanecm: don't think so. 
i ran it on the beta cluster and locally a bunch of times [13:29:14] MatmaRex: okay, proceeding then [13:29:25] (03PS4) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [13:29:45] thank you! [13:29:45] !log mwmaint1002: [urbanecm@mwmaint1002 ~]$ foreachwiki DiscussionTools:FixTrailingWhitespaceIds (T356196) [13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:54] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: sync [13:29:56] anything else MatmaRex? [13:30:04] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: sync [13:30:08] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [13:30:11] nope [13:30:32] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [13:30:33] (03CR) 10Alexandros Kosiaris: [C:03+2] deploy1003: Assign role [puppet] - 10https://gerrit.wikimedia.org/r/1050628 (https://phabricator.wikimedia.org/T364416) (owner: 10Alexandros Kosiaris) [13:30:44] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [13:31:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65581 and previous config saved to /var/cache/conftool/dbconfig/20240701-133118-marostegui.json [13:31:22] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:32:52] urbanecm: and it looks like T347982 is sort of the general “improve scap update-interwiki-cache” task I was looking for [13:32:53] T347982: scap update-interwiki-cache is broken - https://phabricator.wikimedia.org/T347982 [13:33:08] Lucas_WMDE: documentation page looks good to me [13:33:13] 
(process doesn't, but that's beside the point) [13:33:15] thanks for the update! [13:33:45] yay, thanks ^^ [13:33:48] MatmaRex: is it a reason for concern if it prints out "Failed to update sth sth" a little too frequently? [13:34:52] urbanecm: not sure. can you copy a few examples? [13:34:53] and a bunch more for commonswiki https://www.irccloud.com/pastebin/8URdlJEt/ [13:34:53] (03PS1) 10Elukey: blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1051139 [13:34:59] MatmaRex: was just doing that, see above! [13:37:11] (03PS1) 10JMeybohm: flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) [13:37:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bullseye [13:37:29] afaics, the db errors are logged at https://logstash.wikimedia.org/goto/650fa401ec1a1e99491516292afb0d65 [13:38:04] (03CR) 10DCausse: [C:03+1] flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:39:15] urbanecm: i think these messages are expected, commons just has a bit more of these cases than i thought it would. but the scenario is exactly the same as we found on the beta cluster [13:39:31] okay, that's good to know. [13:39:34] leaving it running then :) [13:39:57] e.g. 
https://commons.wikimedia.org/wiki/User_talk:Reneschuler#File:Alden_-_12x12_Mixed_Media_on_Canvas_by_Fine_Artist_Rene_Romero_Schuler.jpg where they post multiple messages with identical topic titles [13:39:58] (03CR) 10JMeybohm: [C:03+2] flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:40:20] (03PS1) 10Marostegui: orchestrator: Add volans to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/1051141 [13:40:53] (03PS5) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [13:40:55] (03Merged) 10jenkins-bot: flink-app: Add securityContext to the flink container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051140 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:41:15] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1040 [13:41:15] (03CR) 10Arnaudb: [C:03+1] orchestrator: Add volans to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/1051141 (owner: 10Marostegui) [13:41:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1040 [13:41:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [13:42:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye [13:43:16] urbanecm: random question – do you see a list of passwords at https://gerrit.wikimedia.org/r/settings/#HTTPCredentials ? [13:43:18] (03CR) 10Fabfur: [C:03+2] hiera: removed unused haproxy28 overrides [puppet] - 10https://gerrit.wikimedia.org/r/1051130 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [13:43:26] because I only see the username and the “generate new password” button [13:43:47] Lucas_WMDE: that's what i see, but there should not be any list of passwords there [13:43:55] it will give you one password [13:43:58] and that's what you use [13:44:17] makes sense, but then the notification email sounds a bit outdated IMHO :) [13:44:19] I’ll file a task [13:44:43] fair [13:44:58] it does fall under "manage your HTTP password" imo, but it could definitely be improved [13:46:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P65583 and previous config saved to /var/cache/conftool/dbconfig/20240701-134626-marostegui.json [13:46:41] (03CR) 10Elukey: [C:03+2] blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1051139 (owner: 10Elukey) [13:48:11] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:48:23] (03CR) 10Btullis: "Thanks Elukey. I agree, it does seem like a lot of privileges just for a unix socket. 
We're still hoping to find another workable way arou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [13:48:29] (03CR) 10Elukey: profile::puppetserver::git: add an option to exclude servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:48:38] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:49:01] (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) [13:49:28] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:49:42] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [13:50:07] (03Merged) 10jenkins-bot: blubber: no-op change to trigger a rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1051139 (owner: 10Elukey) [13:50:23] filed T368912 [13:50:24] T368912: Gerrit email about added or updated HTTP password is a bit misleading - https://phabricator.wikimedia.org/T368912 [13:51:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:03] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9939770 (10jcrespo) No action will be needed for backup1010 in the end. 
[13:52:11] (03CR) 10JHathaway: [C:03+1] profile::puppetserver::git: add an option to exclude servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [13:52:43] (03PS4) 10Btullis: cephcsi: Run the nodeplugin-registrar with elevated privileges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) [13:53:11] (03PS6) 10Elukey: profile::puppetserver::git: add an option to exclude servers [puppet] - 10https://gerrit.wikimedia.org/r/1050601 (https://phabricator.wikimedia.org/T368023) [13:56:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:59] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [13:57:13] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P65584 and previous config saved to /var/cache/conftool/dbconfig/20240701-140133-marostegui.json [14:03:06] (03PS5) 10Btullis: cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) [14:03:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [14:03:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS 
bullseye exec... [14:04:03] (03CR) 10Btullis: "I have tried a different technique. Let's see if we can configure the socket permissions appropriately like this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [14:05:20] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2017 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:07:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65585 and previous config saved to /var/cache/conftool/dbconfig/20240701-140725-root.json [14:09:29] (03CR) 10Krinkle: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [14:10:39] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [14:10:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9939859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye [14:11:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:12:02] (03CR) 10Krinkle: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [14:13:13] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: 
Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T368743#9939862 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [14:16:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T364069)', diff saved to https://phabricator.wikimedia.org/P65586 and previous config saved to /var/cache/conftool/dbconfig/20240701-141640-marostegui.json [14:16:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:16:43] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:16:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:21:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65587 and previous config saved to /var/cache/conftool/dbconfig/20240701-142231-root.json [14:24:11] (03CR) 10Vgutierrez: [C:03+1] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1050484 (owner: 10Ncmonitor) [14:25:07] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:25:42] (03CR) 10Cathal Mooney: [C:03+1] "LGTM." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1050379 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:26:25] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:24] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [14:28:52] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9939925 (10elukey) p:05Triage→03Medium [14:30:51] (03CR) 10Brouberol: [C:03+1] "Let's see if this works" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [14:31:27] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bullseye [14:32:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [14:34:03] (03PS1) 10DCausse: cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051147 [14:35:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2017 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:35:40] !log upgrading A:cp-codfw to haproxy 2.8.10 (T367756) [14:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:43] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [14:35:50] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1051147 (owner: 10DCausse) [14:35:51] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1051143 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [14:36:23] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [14:36:38] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [14:36:40] (03CR) 10Phuedx: [C:03+1] extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [14:36:42] (03Merged) 10jenkins-bot: cirrus-streaming-updater: staging should listen to eqiad topics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051147 (owner: 10DCausse) [14:37:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65589 and previous config saved to /var/cache/conftool/dbconfig/20240701-143736-root.json [14:39:15] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:03] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:40:16] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:43:39] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9940046 (10Eevans) 💥 `/dev/sde` is failed again... 
{F56126743} {F56126744} {F56126745} [14:43:57] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [14:44:12] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [14:44:42] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [14:45:00] (03PS1) 10Alexandros Kosiaris: Repackage for bullseye [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/1051149 [14:45:34] (03PS1) 10Elukey: services: update thumbor-plugin Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051150 [14:45:51] (03CR) 10Kamila Součková: [C:03+1] Repackage for bullseye [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/1051149 (owner: 10Alexandros Kosiaris) [14:47:49] (03CR) 10Alexandros Kosiaris: [C:03+2] Repackage for bullseye [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/1051149 (owner: 10Alexandros Kosiaris) [14:48:14] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [14:48:30] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [14:48:41] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [14:49:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:50:04] (03Merged) 10jenkins-bot: Repackage for bullseye [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/1051149 (owner: 10Alexandros Kosiaris) [14:50:07] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, I don't fully understand the changes to the query in _get_devices() but I assume it works and what is now needed so +1" [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:50:29] !log 
cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage [14:52:10] jouncebot: nowandnext [14:52:10] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [14:52:10] In 0 hour(s) and 37 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1530) [14:52:34] herron: I'll deploy the last major statsd for mw-on-k8s [14:52:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65590 and previous config saved to /var/cache/conftool/dbconfig/20240701-145242-root.json [14:52:45] (03CR) 10Clément Goubert: [C:03+2] mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:53:04] claime: excellent [14:53:36] (03Merged) 10jenkins-bot: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [14:54:42] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:54:52] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [14:55:14] !log deploying statsd-exporter for mw-web - T365265 [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:22] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [14:56:10] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:56:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:57:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:58:22] urbanecm: if you have a moment, can you check how that 
script run is going? (probably not done yet, i'm just curious how far along it is) [14:59:15] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:16] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [14:59:40] MatmaRex: currently at enwiki [14:59:44] enwiki: 107801 [15:00:23] (03CR) 10Ottomata: [C:03+1] "ActuActual" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [15:01:03] urbanecm: nice, thanks. enwiki had around 140k rows to fix fwiw (https://phabricator.wikimedia.org/T356196#9908208) [15:01:13] so almost done there in that case? [15:01:24] urbanecm: are there lots of warnings on other wikis too, or was that just on commons? (just curious) [15:02:23] urbanecm: yeah, although it gets slower towards the end, because the query scans the fixed rows again [15:02:34] MatmaRex: similar amount of warnings at enwiki than at commons [15:02:56] alright. 
thanks for checking :) [15:03:40] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [15:04:12] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:05:31] !log reboot deploy1003 T364416 [15:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:34] T364416: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416 [15:06:22] PROBLEM - Host deploy1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:34] RECOVERY - Host deploy1003 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [15:07:11] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [15:07:23] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:07:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65591 and previous config saved to /var/cache/conftool/dbconfig/20240701-150747-root.json [15:08:09] (03CR) 10Aqu: "Thanks." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [15:10:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bullseye [15:11:11] (03PS3) 10Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) [15:11:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [15:11:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:56] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [15:13:28] (03PS1) 10Alexandros Kosiaris: deployment_server: if guard php-readline to buster [puppet] - 10https://gerrit.wikimedia.org/r/1051154 (https://phabricator.wikimedia.org/T364416) [15:14:55] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:15:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:15:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [15:15:30] 10ops-eqiad, 06SRE, 06DC-Ops, 
10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940169 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye comp... [15:15:33] (03CR) 10Btullis: [C:03+2] cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:16:25] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:18:51] (03PS1) 10JHathaway: nskaggs: remove references from icinga config [puppet] - 10https://gerrit.wikimedia.org/r/1051155 [15:18:52] (03Merged) 10jenkins-bot: cephcsi: Run the csi-rbdplugin container as gid 900 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051083 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:20:17] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:21:02] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:21:42] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:22:32] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw [15:22:35] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:22:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65592 and previous config saved to /var/cache/conftool/dbconfig/20240701-152253-root.json [15:25:13] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw [15:30:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1530). [15:32:47] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:36:42] (03PS2) 10TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) [15:37:04] (03CR) 10TChin: EventStreamConfig: Add hive ingestion defaults (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [15:37:14] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:37:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65593 and previous config saved to /var/cache/conftool/dbconfig/20240701-153758-root.json [15:39:34] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "Okay to deploy tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [15:44:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65594 and previous config saved to /var/cache/conftool/dbconfig/20240701-154427-marostegui.json [15:44:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [15:51:01] (03CR) 10Dzahn: [V:03+1 C:03+2] doc: redirect doc.wikimedia.org/analytics-api [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: 10Dzahn) [15:55:37] (03CR) 10Dzahn: [V:03+1 C:03+2] "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/doc/test_doc.yaml --hosts=doc2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: 10Dzahn) [15:56:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9940509 (10leila) approved. thanks! [15:58:59] (03CR) 10Dzahn: [V:03+1 C:03+2] "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/doc/test_doc.yaml --hosts=doc1003.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1050469 (https://phabricator.wikimedia.org/T368482) (owner: 10Dzahn) [15:59:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P65595 and previous config saved to /var/cache/conftool/dbconfig/20240701-155934-marostegui.json [16:04:18] (03CR) 10Phuedx: [C:04-1] "A couple of points inline." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [16:07:34] (03CR) 10Phuedx: [C:04-1] Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [16:11:43] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041 [16:11:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [16:12:15] (03PS5) 10BCornwall: hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) [16:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P65596 and previous config saved to /var/cache/conftool/dbconfig/20240701-161441-marostegui.json [16:16:54] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3132/console" [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [16:17:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye [16:17:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye [16:18:39] 10ops-codfw, 06DC-Ops: Cabling for FR - https://phabricator.wikimedia.org/T368940 (10Jhancock.wm) 03NEW [16:18:47] !log restarting Cassandra —restbase2023-{a,b,c}— troubleshooting storage utilization [16:18:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:26] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1051155 (owner: 10JHathaway) [16:20:39] (03CR) 10JHathaway: [C:03+2] nskaggs: remove references from icinga config [puppet] - 10https://gerrit.wikimedia.org/r/1051155 (owner: 10JHathaway) [16:20:58] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1039 [16:20:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1039 [16:21:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:27] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [16:22:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9940658 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye [16:27:21] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#9940710 (10elukey) After a chat with Riccardo some things came up: * It seems that the issue comes up when debmonitor-client is upgraded... 
[16:27:40] (03PS3) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T366524) [16:27:48] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:29:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T367856)', diff saved to https://phabricator.wikimedia.org/P65597 and previous config saved to /var/cache/conftool/dbconfig/20240701-162948-marostegui.json [16:29:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:29:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [16:30:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:30:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T367856)', diff saved to https://phabricator.wikimedia.org/P65598 and previous config saved to /var/cache/conftool/dbconfig/20240701-163010-marostegui.json [16:33:46] !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts [16:34:29] !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts [16:34:56] !log dancy@deploy1002 Installing scap version "4.90.0" for 234 hosts [16:35:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [16:38:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [16:39:53] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9940781 
(10Dzahn) No, I did not get a response. For one of the owner addresses I got an "550 5.1.1 The email account that you tried to reach does not exist." So I can confirm there does see... [16:41:25] FIRING: [4x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:42:22] (03PS1) 10Pppery: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) [16:42:59] (03CR) 10CI reject: [V:04-1] Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery) [16:43:34] (03PS2) 10Pppery: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) [16:46:16] (03CR) 10Alexandros Kosiaris: [C:03+2] deployment_server: if guard php-readline to buster [puppet] - 10https://gerrit.wikimedia.org/r/1051154 (https://phabricator.wikimedia.org/T364416) (owner: 10Alexandros Kosiaris) [16:48:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9940819 (10akosiaris) [16:50:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install deploy1003 - https://phabricator.wikimedia.org/T364416#9940823 (10akosiaris) 05Open→03Resolved Host is imaged, rest of the work is ongoing in T364417 [16:50:58] (03PS4) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T367826) [16:51:10] (03PS5) 10Jdlrobson: [July 8th] Reduce list of exclusions for 
dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T367826) [16:51:16] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [16:51:31] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [16:51:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1700) [17:00:05] ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T1700). [17:00:53] PROBLEM - Disk space on restbase2023 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 90521 MB (5% inode=99%): /srv/sdc4 66626 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops [17:03:24] (03CR) 10Klausman: [C:03+1] "Post-facto +1, already rolled out. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051132 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey) [17:04:53] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [17:05:09] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [17:08:07] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [17:08:23] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[17:11:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:35] (03CR) 10Vgutierrez: [C:03+1] hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [17:15:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [17:16:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [17:16:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65599 and previous config saved to /var/cache/conftool/dbconfig/20240701-171609-marostegui.json [17:16:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:24:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9941093 (10Papaul) [17:26:46] (03PS1) 10Herron: thanos: increase query frontend and store cache sizes [puppet] - 10https://gerrit.wikimedia.org/r/1051177 (https://phabricator.wikimedia.org/T368953) [17:26:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9941103 (10Papaul) All the cabling is done. I am leaving this task open so when we move the console cables from asw-c*/d*-codfw to ssw1-* and lsw1-* I can u... 
[17:27:07] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9941105 (10Scott_French) [17:27:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:27:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [17:27:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9941106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1039.eqiad.wmnet with OS bullseye comp... [17:30:35] (03PS4) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:34:11] (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:35:47] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: reboot ssw1-d8-codfw [17:36:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cr2-codfw,ssw1-a[1,8]-codfw.mgmt with reason: reboot ssw1-d8-codfw [17:37:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1041.eqiad.wmnet with OS bullseye [17:38:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9941207 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye exec... [17:40:48] (03CR) 10Dzahn: [C:03+2] stewards-onboarder: Add gitlab API to config [puppet] - 10https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: 10Urbanecm) [17:40:57] ohoho :) [17:41:15] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041 [17:41:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [17:42:04] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041 [17:42:05] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [17:44:21] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041 [17:44:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [17:45:31] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:46:57] (03PS1) 10Brennen Bearnes: gitlab-settings: v1.6.0 for squash commit templates [puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624) [17:48:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [17:49:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [17:49:19] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:23] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host 
cloudcephosd1041 [17:49:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [17:52:09] (03PS5) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:53:32] (03PS6) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:55:39] (03CR) 10David Caro: "Running now on toolsbeta with envvars-api 0.0.49" [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:57:05] (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [17:57:29] (03PS1) 10Cwhite: logstash: add normalize_labels script [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867) [17:57:36] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9941344 (10Scott_French) Thanks, @SGupta-WMF ! The service is up and running... [17:59:00] (03CR) 10Brennen Bearnes: "I already ran this against all projects, so future runs should only catch a handful of new ones." 
[puppet] - 10https://gerrit.wikimedia.org/r/1051178 (https://phabricator.wikimedia.org/T366624) (owner: 10Brennen Bearnes) [17:59:41] (03CR) 10David Caro: envvars-backend: update endpoint to new schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [18:00:23] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9941358 (10Dzahn) 05Open→03Stalled Per IRC chat: curre... [18:01:03] (03PS7) 10David Caro: envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [18:04:37] (03CR) 10CI reject: [V:04-1] envvars-backend: update endpoint to new schema [puppet] - 10https://gerrit.wikimedia.org/r/1050567 (https://phabricator.wikimedia.org/T368516) (owner: 10Slavina Stefanova) [18:04:53] 06SRE, 06SRE-OnFire, 10Stewards-Onboarding-Tool, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9941430 (10CDanis) >>! In T343377#9937115, @Urbanecm wrote: >>>! In T343377#9931101, @MoritzMuehlenhoff wrote: >>... 
[18:23:06] (03CR) 10Ottomata: EventStreamConfig: Add hive ingestion defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin) [18:24:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:24:19] PROBLEM - BFD status on ssw1-a8-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:24:19] PROBLEM - BFD status on ssw1-a1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:25:36] (03PS1) 10Jdlrobson: Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) [18:26:19] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 125, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:26:19] RECOVERY - BFD status on ssw1-a1-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:26:19] RECOVERY - BFD status on ssw1-a8-codfw.mgmt is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:28:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: 10Jdlrobson) [18:31:09] 06SRE, 06SRE-OnFire, 10Stewards-Onboarding-Tool, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - 
https://phabricator.wikimedia.org/T343377#9941628 (10Urbanecm) >>! In T343377#9941430, @CDanis wrote: > This is great, thanks. One remaining piece here is... [18:31:20] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9941625 (10Dzahn) Hello @KFrancis Andy is a special case since he moved from WMF staff to WMDE. If he was WMF staff we wouldn't have to do a separa... [18:42:28] (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify all trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1050480 (https://phabricator.wikimedia.org/T344174) (owner: 10BCornwall) [18:44:34] (03PS1) 10Jdlrobson: Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) [18:44:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: 10Jdlrobson) [18:46:44] 10ops-eqsin, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9941690 (10BCornwall) 05Open→03Resolved [18:54:52] (03PS2) 10Cwhite: logstash: update ecs patch version to 7 [puppet] - 10https://gerrit.wikimedia.org/r/1032737 (https://phabricator.wikimedia.org/T290020) [18:56:38] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1041 [18:56:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1041 [18:57:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS 
bullseye [18:57:12] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9941755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye [18:59:09] (03CR) 10Gergő Tisza: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [19:02:37] (03PS6) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [19:03:20] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [19:03:46] (03PS7) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [19:04:27] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [19:04:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery) [19:04:51] (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [19:05:38] (03CR) 10Clare Ming: Deploy MetricsPlatform to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) 
(owner: 10Clare Ming) [19:09:42] (03PS1) 10Cwhite: test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) [19:09:44] (03PS1) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) [19:10:07] (03CR) 10CI reject: [V:04-1] test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) (owner: 10Cwhite) [19:10:10] (03CR) 10Krinkle: [C:03+1] varnish: Copy value of X-Wikimedia-Debug cookie to header [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [19:11:31] (03PS2) 10Cwhite: test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) [19:11:32] (03PS2) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) [19:13:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [19:13:44] (03PS3) 10Cwhite: logstash: enable normalize_labels on production [puppet] - 10https://gerrit.wikimedia.org/r/1051189 (https://phabricator.wikimedia.org/T342451) [19:14:47] !log dancy@deploy1002 Installing scap version "4.91.0" for 234 hosts [19:15:21] !log dancy@deploy1002 Installing scap version "4.91.0" for 234 hosts [19:16:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [19:17:38] (03PS2) 10Scott French: admin_ng: Switch eqiad und codfw wikikube clusters to PSS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [19:19:09] !log dancy@deploy1002 
Installing scap version "4.91.0" for 233 hosts [19:19:36] (03CR) 10Scott French: [C:03+1] "Your plan SGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051133 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [19:19:42] !log dancy@deploy1002 Installation of scap version "4.91.0" completed for 233 hosts [19:33:56] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:45:34] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867) (owner: 10Cwhite) [19:55:48] (03CR) 10Cwhite: [C:03+2] logstash: add normalize_labels script [puppet] - 10https://gerrit.wikimedia.org/r/1051180 (https://phabricator.wikimedia.org/T368867) (owner: 10Cwhite) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T2000). [20:00:05] jdlrobson, pppery, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:09] Here [20:00:12] o/ [20:00:18] i can deploy [20:00:27] hey cjming im here :) [20:00:33] yay! [20:00:43] jdlrobson: can your 2 backports go out together? 
[20:01:03] (03PS4) 10Jdlrobson: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) [20:02:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [20:02:54] (03Merged) 10jenkins-bot: [July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050084 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson) [20:03:13] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] [20:03:18] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:03:51] (03CR) 10Clare Ming: [C:03+2] Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: 10Jdlrobson) [20:04:22] (03CR) 10Clare Ming: [C:03+2] Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: 10Jdlrobson) [20:05:11] (03PS1) 10RLazarus: mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) [20:05:43] Pppery: i will do yours next after Jon's 1st config patch and while i wait for Jon's 2 backports to merge [20:05:49] Ok [20:07:25] Jdlrobson: I've +2'd your two backports for Minerva since it looks like it'll be ~20+ mins for each -- ok if i scap backport them together? 
[20:07:32] (03PS2) 10RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) [20:10:07] (03CR) 10RLazarus: "You're right, I only needed to kube_env as the -deploy user for the `kubectl attach`, and that user already has the privileges it needs. N" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [20:10:10] (03Abandoned) 10RLazarus: admin_ng: RBAC to allow mw-script user to attach to pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [20:13:30] cjming: yeh that's fine [20:14:15] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [20:15:41] any SRE around? i seem to be stuck syncing to test servers with: 20:04:19 sync-masters: 50% (in-flight: 1; ok: 1; fail: 0; left: 0) / [20:15:44] not sure if i should just wait or if there's something to do - usually doesn't take this long [20:18:52] cjming: hm, that might be related to deploy1003 which is being set up in https://phabricator.wikimedia.org/T364417 [20:19:45] huh - what should i do in the meantime? 
[20:19:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65600 and previous config saved to /var/cache/conftool/dbconfig/20240701-201949-marostegui.json [20:19:53] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:19:56] I believe it should be fine to not sync to that master, but I don't know offhand how to tell scap that [20:21:04] I'm digging around in the scap source a little, but dancy might know the answer offhand if he's around [20:21:08] anyone else know how i should intervene with scap? [20:21:51] * dancy taking a look [20:22:00] * cjming grateful to dancy [20:22:04] I had to do some fighting earlier today to work around partially-deployed deploy1003. [20:22:39] oooh - it actually just started up again - maybe it's ok? [20:22:53] Yeah, should be ok. [20:23:04] i just had wait an unusually long time to sync to test servers [20:23:16] For the next backport, when it hangs there, start another shell, do "ps uaxwwww | grep deploy1003" and kill the associated ssh process. [20:23:34] will do - thanks! [20:23:42] !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:42] !log cjming@deploy1002 Sync cancelled. [20:23:42] I'll hang around. [20:23:51] except it cancelled the sync [20:23:51] thanks dancy :) [20:23:51] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:24:13] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [20:24:15] cjming: Can you send me the transcript? [20:24:16] and logged me out - bec timeout? 
[20:24:53] yup - 1 sec [20:25:36] should i just re-scap backport the same patch? [20:25:39] yes [20:26:08] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] [20:27:55] (03PS1) 10Fabfur: benthos:cache: encode referer field as hex [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) [20:28:44] !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:28:50] ok finally [20:28:59] Jdlrobson: 1st patch on test servers - can i sync? [20:29:05] (03Merged) 10jenkins-bot: Change color of notification icon in dark-mode [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051184 (https://phabricator.wikimedia.org/T368120) (owner: 10Jdlrobson) [20:29:06] (03Merged) 10jenkins-bot: Do not invert images that have been tagged with no invert classes [skins/MinervaNeue] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051186 (https://phabricator.wikimedia.org/T368483) (owner: 10Jdlrobson) [20:29:22] cjming: looking now [20:30:05] cjming: can we sync all 3 of these together? [20:30:17] it looks good but ideally i'd like the other fixes to go out before or at the same time. [20:30:32] sure - let me do that - 1 sec [20:30:41] !log cjming@deploy1002 Sync cancelled. 
[20:31:45] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]] [20:31:50] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:31:51] T368120: [Short term fix] Notification icon not same color as other icons - https://phabricator.wikimedia.org/T368120 [20:31:51] T368483: Regression: Global invert broke VisualEditor "Add a link" workflow - https://phabricator.wikimedia.org/T368483 [20:33:07] (03CR) 10Vgutierrez: "two things:" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [20:34:29] !log cjming@deploy1002 cjming, jdlrobson: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:34:32] Jdlrobson: ok all 3 are up on test servers - lmk if/when to sync [20:34:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P65601 and previous config saved to /var/cache/conftool/dbconfig/20240701-203456-marostegui.json [20:34:57] cjming: looking now :) [20:36:22] cjming: please sync! [20:36:27] yay! 
[20:36:32] !log cjming@deploy1002 cjming, jdlrobson: Continuing with sync [20:39:15] (03CR) 10Vgutierrez: "actually any request header logged needs to be encoded" [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur) [20:41:36] (03Abandoned) 10Gergő Tisza: Profiler: Handle X-Wikimedia-Debug cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024932 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [20:42:24] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1050084|[July 1st] Mobile: Enable dark mode for all tier 1 wikis (logged in) (T367151)]], [[gerrit:1051184|Change color of notification icon in dark-mode (T368120)]], [[gerrit:1051186|Do not invert images that have been tagged with no invert classes (T368483)]] (duration: 10m 39s) [20:42:29] T367151: [Config] Deploy dark mode to all users in tier 1 wikis on the Minerva skin - https://phabricator.wikimedia.org/T367151 [20:42:30] T368120: [Short term fix] Notification icon not same color as other icons - https://phabricator.wikimedia.org/T368120 [20:42:30] T368483: Regression: Global invert broke VisualEditor "Add a link" workflow - https://phabricator.wikimedia.org/T368483 [20:42:32] Jdlrobson: should be live! 
[20:42:45] Pppery: doing yours now - pardon the wait [20:43:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:43:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery) [20:43:57] (03CR) 10Cwhite: [C:03+2] test normalize_labels ruby script on beta [puppet] - 10https://gerrit.wikimedia.org/r/1051188 (https://phabricator.wikimedia.org/T342451) (owner: 10Cwhite) [20:44:25] (03Merged) 10jenkins-bot: Missing.php: don't redirect to unprefixed nan incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051170 (https://phabricator.wikimedia.org/T86915) (owner: 10Pppery) [20:44:43] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]] [20:44:45] T86915: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 [20:45:42] (03PS1) 10Cwhite: beta-logs: remove dlq spam mitigation [puppet] - 10https://gerrit.wikimedia.org/r/1051200 [20:47:17] !log cjming@deploy1002 cjming, pppery: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:47:25] Pppery: your patch is up on test servers - lmk if i can sync [20:47:40] Looks good [20:47:53] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[20:47:57] !log cjming@deploy1002 cjming, pppery: Continuing with sync [20:48:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [20:49:36] (03PS2) 10Cwhite: beta-logs: remove dlq spam mitigation [puppet] - 10https://gerrit.wikimedia.org/r/1051200 [20:49:55] (03PS19) 10Gergő Tisza: Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [20:50:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P65602 and previous config saved to /var/cache/conftool/dbconfig/20240701-205003-marostegui.json [20:51:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:47] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1051170|Missing.php: don't redirect to unprefixed nan incubator (T86915)]] (duration: 09m 03s) [20:53:49] T86915: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 [20:54:00] Pppery: should be live! 
[20:54:12] Yep [20:54:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [20:55:09] (03Merged) 10jenkins-bot: extension-list: Add Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046710 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [20:55:28] !log cjming@deploy1002 Started scap sync-world: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]] [20:55:33] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234 [20:59:20] (03PS1) 10Jforrester: Reference widget: check for undefined config [extensions/WikibaseMediaInfo] (wmf/1.43.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1051202 (https://phabricator.wikimedia.org/T368736) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240701T2100). 
[21:00:51] ^^ i'm almost done - just need to sync last patch
[21:00:53] RECOVERY - Disk space on restbase2023 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2023&var-datasource=codfw+prometheus/ops
[21:05:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T364069)', diff saved to https://phabricator.wikimedia.org/P65603 and previous config saved to /var/cache/conftool/dbconfig/20240701-210512-marostegui.json
[21:05:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[21:05:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:05:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance
[21:05:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65604 and previous config saved to /var/cache/conftool/dbconfig/20240701-210534-marostegui.json
[21:12:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:13:13] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52340 bytes in 1.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:14:25] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:16:08] !log cjming@deploy1002 cjming: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:16:10] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234
[21:16:11] !log cjming@deploy1002 cjming: Continuing with sync
[21:22:34] (03PS1) 10Herron: prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088)
[21:23:16] (03CR) 10CI reject: [V:04-1] prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088) (owner: 10Herron)
[21:23:44] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1046710|extension-list: Add Metrics Platform (T366234)]] (duration: 28m 16s)
[21:23:47] T366234: Deploy the Metrics Platform extension - https://phabricator.wikimedia.org/T366234
[21:24:22] !log end of UTC late backport window
[21:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:12] (03PS2) 10Herron: prom-ipmi-exporter: add sel-events collector [puppet] - 10https://gerrit.wikimedia.org/r/1051207 (https://phabricator.wikimedia.org/T368088)
[21:27:30] (03PS1) 10Ahmon Dancy: gitlab::runner: Add buildkitd to no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1051208
[21:27:53] (03CR) 10CI reject: [V:04-1] gitlab::runner: Add buildkitd to no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy)
[21:28:15] cjming: Looking good? sec.team actually has a couple of patches to go out today.
[21:28:33] all good and all yours!
[21:29:11] (03PS2) 10Ahmon Dancy: gitlab::runner: Add buildkitd to no_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1051208
[21:30:59] sbassett: Watch out for a hanging "syncing masters" deployment phase. If this happens to you, start another shell and kill any hanging ssh process for deploy1003.
[21:31:35] dancy: Ok. Are other deploys ok? I’m on 1002…
[21:31:40] xref https://phabricator.wikimedia.org/T364417
[21:31:47] Cc mstyles ^^
[21:32:00] !log zabe@mwmaint1002:/tmp/upload$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --sleep=3600 --user=Yann . # T368703
[21:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:03] T368703: Server side upload for Yann - https://phabricator.wikimedia.org/T368703
[21:32:53] sbassett: I'm not sure what you mean by other deploys.
[21:33:06] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy)
[21:33:26] dancy: It looked to be an issue with deploy1003? Or is it all of the deploy hosts?
[21:33:52] deployments from deploy1002 will be affected by the fact that deploy1003 is listed in /etc/dsh/group/scap-masters even though it's not ready.
[21:34:03] Oh ok
[21:34:48] (03PS2) 10Fabfur: benthos:cache: encode problematic fields as hex [puppet] - 10https://gerrit.wikimedia.org/r/1051198 (https://phabricator.wikimedia.org/T365718)
[21:36:08] (03CR) 10Krinkle: Handle sso.wikimedia.org domain (036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[21:36:35] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/1051208/3865/" [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy)
[21:46:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:47:31] (03CR) 10Cwhite: [C:03+2] beta-logs: remove dlq spam mitigation [puppet] - 10https://gerrit.wikimedia.org/r/1051200 (owner: 10Cwhite)
[21:47:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:55:42] !log deployed patch for T366991
[21:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1043920328 and 50 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:58:43] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1089-1090,1104].eqiad.wmnet with reason: T348977
[21:58:48] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[21:58:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1089-1090,1104].eqiad.wmnet with reason: T348977
[21:59:17] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1089*,elastic1090*,elastic1104* for T348977 - bking@cumin2002
[21:59:20] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1089*,elastic1090*,elastic1104* for T348977 - bking@cumin2002
[21:59:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 6456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:10:38] !log sbassett@deploy1002 Synchronized private/PrivateSettings.php: Un-deployed a PS.php mitigation for T341908 (duration: 07m 24s)
[22:15:33] (03Abandoned) 10Clare Ming: Add test streams for Metrics Platform app + web base instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050678 (https://phabricator.wikimedia.org/T366949) (owner: 10Clare Ming)
[22:47:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:47:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[22:47:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9942633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1041.eqiad.wmnet with OS bullseye completed: - cloudcep...
[22:48:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9942634 (10Jclark-ctr)
[22:49:37] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:49:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[39-41] - https://phabricator.wikimedia.org/T363341#9942635 (10Jclark-ctr) 05Open→03Resolved
[22:50:15] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:05] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:27] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8997 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:54:25] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1038
[22:54:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1038
[22:58:02] (03PS1) 10Cwhite: logstash: route thumbor logs in routing filter [puppet] - 10https://gerrit.wikimedia.org/r/1051214 (https://phabricator.wikimedia.org/T368180)
[23:01:45] (03PS6) 10Jdlrobson: [July 8th] Reduce list of exclusions for dark mode (1.43.0-wmf.12) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050671 (https://phabricator.wikimedia.org/T365509)
[23:02:26] (03PS5) 10Jdlrobson: [July 2nd] Mobile: Enable dark mode for all users for tier 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151)
[23:02:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:02:46] (03CR) 10Jdlrobson: [C:04-1] "I need to confirm the stage 1 wikis - seems we overlooked an issue when defining those groups." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050085 (https://phabricator.wikimedia.org/T367151) (owner: 10Jdlrobson)
[23:02:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942655 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:05:26] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:05:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:12:02] (03PS1) 10Cwhite: logstash: remove ecs gating from kubernetes_docker filter [puppet] - 10https://gerrit.wikimedia.org/r/1051215 (https://phabricator.wikimedia.org/T314381)
[23:12:03] (03PS1) 10Andrew Bogott: Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905)
[23:17:51] (03PS2) 10Andrew Bogott: Toolforge elasticsearch haproxy: update CORS syntax for modern haproxy [puppet] - 10https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905)
[23:18:05] (03CR) 10BryanDavis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1051216 (https://phabricator.wikimedia.org/T311905) (owner: 10Andrew Bogott)
[23:19:20] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[23:22:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[23:25:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye
[23:25:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[23:25:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1038.eqiad.wmnet with OS bullseye
[23:25:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942691 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[23:34:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[23:36:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[23:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051218
[23:38:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1051218 (owner: 10TrainBranchBot)
[23:39:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:40:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:41:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[23:41:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1035.eqiad.wmnet with OS bullseye comp...
[23:43:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942746 (10Jclark-ctr)
[23:45:30] (03CR) 10Scott French: "One maybe-typo and one question. Otherwise LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[23:47:59] (03CR) 10Scott French: [C:03+1] mediawiki: Allow setting `tty` and `stdin` for Jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051196 (https://phabricator.wikimedia.org/T368966) (owner: 10RLazarus)
[23:51:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[23:51:38] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
[23:54:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:54:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[23:55:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:55:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[23:55:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#9942761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1036.eqiad.wmnet with OS bullseye comp...
[23:57:50] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
[23:59:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T364069)', diff saved to https://phabricator.wikimedia.org/P65605 and previous config saved to /var/cache/conftool/dbconfig/20240701-235941-marostegui.json
[23:59:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069