[00:07:03] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1069715 (owner: TrainBranchBot)
[00:15:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:53] (PS1) Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716
[00:29:34] (CR) CI reject: [V:-1] logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716 (owner: Bartosz Dziewoński)
[00:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:49] (PS2) Bartosz Dziewoński: logging: Fix local variables leaking into global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/1069716
[00:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[00:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:45] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:35:40] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:28] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:35:40] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[04:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:49] SRE, Commons, Wikimedia-production-error: Cannot delete file on Commons: DBQueryError (File:Logo-headlinejabar.jpg) - https://phabricator.wikimedia.org/T373748#10109207 (Yann) Again with https://commons.wikimedia.org/wiki/File:Ricardo_Monastero_1.png A database query error has occurred. This may ind...
[05:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:46] SRE, Commons, Wikimedia-production-error: Cannot delete file on Commons: DBQueryError (File:Logo-headlinejabar.jpg) - https://phabricator.wikimedia.org/T373748#10109215 (Yann) https://commons.wikimedia.org/w/index.php?title=File:Cinematic_Afternoon.jpg&action=delete `Request from 89.248.174.2 via cp...
[05:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:36] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:36] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:15:40] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[06:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:23:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) (owner: KartikMistry)
[06:29:24] (CR) Slyngshede: [C:+1] "From reading the code it seems like this only displays information about the users and their groups. I don't see any harm in adding the ld" [puppet] - https://gerrit.wikimedia.org/r/1069229 (owner: Thcipriani)
[06:34:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db1206 replag', diff saved to https://phabricator.wikimedia.org/P68506 and previous config saved to /var/cache/conftool/dbconfig/20240902-063432-arnaudb.json
[06:35:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1206.eqiad.wmnet with reason: replag
[06:35:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: replag
[06:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68507 and previous config saved to /var/cache/conftool/dbconfig/20240902-064749-arnaudb.json
[06:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T0700).
[07:00:05] msz2001 and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:41] Hi! I'm around and ready
[07:00:49] here
[07:01:33] Msz2001: I can deploy your change if you want.
[07:01:50] Yes, please
[07:02:12] cool. I'll go ahead and ask when it is ready to test on mwdebug servers.
[07:02:54] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1069257 (https://phabricator.wikimedia.org/T373079) (owner: Msz2001)
[07:02:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 2%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68508 and previous config saved to /var/cache/conftool/dbconfig/20240902-070255-arnaudb.json
[07:03:36] (Merged) jenkins-bot: Enable EditCheck references on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1069257 (https://phabricator.wikimedia.org/T373079) (owner: Msz2001)
[07:04:03] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1069257|Enable EditCheck references on plwiki (T373079)]]
[07:04:05] T373079: Enable EditCheck on Polish Wikipedia - https://phabricator.wikimedia.org/T373079
[07:15:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:16:57] (CR) Jelto: "I'm not a big fan of maintaining a list of external services/IPs. IMHO this service could either live in WMCS or use a proper user agent a" [puppet] - https://gerrit.wikimedia.org/r/1069387 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[07:18:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 3%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68509 and previous config saved to /var/cache/conftool/dbconfig/20240902-071800-arnaudb.json
[07:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:28] !log kartik@deploy1003 msz2001, kartik: Backport for [[gerrit:1069257|Enable EditCheck references on plwiki (T373079)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:33] T373079: Enable EditCheck on Polish Wikipedia - https://phabricator.wikimedia.org/T373079
[07:19:48] Msz2001: you can test it on mwdebug servers
[07:20:00] let me know if everything is okay
[07:20:42] It's okay
[07:20:57] nice. going ahead for deployment.
[07:22:07] !log kartik@deploy1003 msz2001, kartik: Continuing with sync
[07:25:20] (CR) Ilias Sarantopoulos: [C:+2] APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[07:30:02] (CR) Jgiannelos: [C:+2] changeprop: Update references to latest beta restbase node [deployment-charts] - https://gerrit.wikimedia.org/r/1069145 (https://phabricator.wikimedia.org/T370460) (owner: Jgiannelos)
[07:31:22] (Merged) jenkins-bot: changeprop: Update references to latest beta restbase node [deployment-charts] - https://gerrit.wikimedia.org/r/1069145 (https://phabricator.wikimedia.org/T370460) (owner: Jgiannelos)
[07:32:44] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069257|Enable EditCheck references on plwiki (T373079)]] (duration: 28m 41s)
[07:32:46] T373079: Enable EditCheck on Polish Wikipedia - https://phabricator.wikimedia.org/T373079
[07:32:54] Msz2001: Done!
[07:32:58] Thanks!
[07:33:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 4%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68510 and previous config saved to /var/cache/conftool/dbconfig/20240902-073306-arnaudb.json
[07:33:25] (PS5) KartikMistry: Enable Section Translation in bdr, btm, and dtp Wikpedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420)
[07:33:34] SRE, Commons, Wikimedia-production-error: Cannot delete file on Commons: DBQueryError (File:Logo-headlinejabar.jpg) - https://phabricator.wikimedia.org/T373748#10109286 (Aklapper) >>! In T373748#10109207, @Yann wrote: > Again with https://commons.wikimedia.org/wiki/File:Ricardo_Monastero_1.png > > A...
[07:35:17] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) (owner: KartikMistry)
[07:35:57] (Merged) jenkins-bot: Enable Section Translation in bdr, btm, and dtp Wikpedias [mediawiki-config] - https://gerrit.wikimedia.org/r/1067898 (https://phabricator.wikimedia.org/T371420) (owner: KartikMistry)
[07:36:06] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1067898|Enable Section Translation in bdr, btm, and dtp Wikpedias (T371420)]]
[07:36:09] T371420: Complete enablement Section Translation in new wikis and make the process less manual for the future - https://phabricator.wikimedia.org/T371420
[07:38:49] !log kartik@deploy1003 kartik: Backport for [[gerrit:1067898|Enable Section Translation in bdr, btm, and dtp Wikpedias (T371420)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:38:49] (Abandoned) Vgutierrez: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1069642 (owner: Ncmonitor)
[07:39:14] !log kartik@deploy1003 kartik: Continuing with sync
[07:43:42] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1067898|Enable Section Translation in bdr, btm, and dtp Wikpedias (T371420)]] (duration: 07m 35s)
[07:43:45] T371420: Complete enablement Section Translation in new wikis and make the process less manual for the future - https://phabricator.wikimedia.org/T371420
[07:45:19] (CR) JMeybohm: cfssl-issuer: Add external-services support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1068768 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm)
[07:45:29] (PS2) JMeybohm: cfssl-issuer: Add external-services support [deployment-charts] - https://gerrit.wikimedia.org/r/1068768 (https://phabricator.wikimedia.org/T359423)
[07:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:46:55] (CR) Filippo Giunchedi: [C:+1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/1068019 (owner: Ayounsi)
[07:48:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68511 and previous config saved to /var/cache/conftool/dbconfig/20240902-074811-arnaudb.json
[07:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:17] !log restart swift-proxy on ms-fe2012 T360913
[07:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:19] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913
[07:54:49] (PS1) Brouberol: dse-k8s-eqiad: Disable mutating parts of the restricted PSP [deployment-charts] - https://gerrit.wikimedia.org/r/1069943 (https://phabricator.wikimedia.org/T369492)
[07:54:50] (PS1) Brouberol: dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1069944 (https://phabricator.wikimedia.org/T369492)
[07:54:52] (PS1) Brouberol: dse-k8s-eqiad: Disable PSP [puppet] - https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492)
[07:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:45] (PS2) Hashar: tox: only install flake8 when running flake8 [software/spicerack] - https://gerrit.wikimedia.org/r/1069226 (https://phabricator.wikimedia.org/T372485)
[08:00:14] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: Brouberol)
[08:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 6%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68512 and previous config saved to /var/cache/conftool/dbconfig/20240902-080317-arnaudb.json
[08:03:20] SRE, Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783 (MoritzMuehlenhoff) NEW
[08:09:06] (CR) Filippo Giunchedi: [C:+1] "Neat!" [puppet] - https://gerrit.wikimedia.org/r/1069230 (https://phabricator.wikimedia.org/T269333) (owner: Herron)
[08:09:11] !log installing Linux 6.1.106 on Bookworm hosts
[08:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:55] ops-codfw, SRE, DBA, DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10109414 (ABran-WMF) | es2032| es1 standalone| |es2031 |es2 standalone| |db2207 |s2 candidate master| |db2138| s2| |db2125 |s2| |db2149 |s3| |...
[08:17:49] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2008.codfw.wmnet
[08:18:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 15%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68513 and previous config saved to /var/cache/conftool/dbconfig/20240902-081822-arnaudb.json
[08:18:23] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2008.codfw.wmnet
[08:18:27] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes2027.codfw.wmnet
[08:21:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes2027.codfw.wmnet
[08:25:40] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:05] (PS2) Ilias Sarantopoulos: APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455)
[08:30:40] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:02] (PS1) JMeybohm: Rename kubernetes2008,2027 to wikikube-worker206[67] [puppet] - https://gerrit.wikimedia.org/r/1069951 (https://phabricator.wikimedia.org/T372878)
[08:33:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68514 and previous config saved to /var/cache/conftool/dbconfig/20240902-083328-arnaudb.json
[08:35:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:29] (CR) JMeybohm: [C:+2] Rename kubernetes2008,2027 to wikikube-worker206[67] [puppet] - https://gerrit.wikimedia.org/r/1069951 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[08:37:11] (PS1) Brouberol: airflow: enable management of remote connections configuration file [deployment-charts] - https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026)
[08:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:16] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2008 to wikikube-worker2066
[08:41:16] (PS2) Brouberol: airflow: enable management of remote connections configuration file [deployment-charts] - https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026)
[08:41:20] !log jayme@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2027 to wikikube-worker2067
[08:41:32] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:43:27] (PS4) Clément Goubert: sre.k8s.renumber-node: Handle renamed host [cookbooks] - https://gerrit.wikimedia.org/r/1068779
[08:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:07] !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[08:48:04] (CR) Ilias Sarantopoulos: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[08:48:20] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2008 to wikikube-worker2066 - jayme@cumin1002"
[08:48:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68515 and previous config saved to /var/cache/conftool/dbconfig/20240902-084833-arnaudb.json
[08:48:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2008 to wikikube-worker2066 - jayme@cumin1002"
[08:48:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:48:38] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2066
[08:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2066
[08:49:15] (Merged) jenkins-bot: APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[08:49:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:49:23] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2067
[08:49:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2008 to wikikube-worker2066
[08:49:40] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109497 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes200...
[08:49:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2067 [08:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [08:50:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2027 to wikikube-worker2067 [08:50:31] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by jayme@cumin1002 from kubernetes202... [08:50:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:40] (03PS1) 10Jelto: admin_ng: add query service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) [08:54:54] (03PS1) 10Muehlenhoff: Readd profile::idp::build to idp-test [puppet] - 10https://gerrit.wikimedia.org/r/1069954 [08:54:56] (03CR) 10Elukey: [V:03+1 C:03+2] profile::puppetserver: set java_start_mem to 40g [puppet] - 10https://gerrit.wikimedia.org/r/1069185 (https://phabricator.wikimedia.org/T373527) (owner: 10Elukey) [08:56:11] FIRING: [4x] SystemdUnitFailed: 
generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:05] (03CR) 10CI reject: [V:04-1] sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 (owner: 10Clément Goubert) [08:59:29] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:00:01] (03PS12) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:00:40] RESOLVED: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:46] !log restart puppetserver on puppetserver1002 to pick up new JVM settings - T373527 [09:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:48] T373527: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527 [09:02:17] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2066.codfw.wmnet with OS bullseye [09:02:27] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2066 [09:02:30] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - 
https://phabricator.wikimedia.org/T372878#10109565 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [09:02:38] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:03:03] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2067.codfw.wmnet with OS bullseye [09:03:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109571 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki... [09:03:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68516 and previous config saved to /var/cache/conftool/dbconfig/20240902-090339-arnaudb.json [09:05:08] (03PS13) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:05:40] FIRING: [4x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:45] (03PS16) 10Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 [09:05:57] (03CR) 10Clément Goubert: sre.k8s.renumber-node: vlan, IP change k8s workers (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:06:03] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3805/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [09:08:29] (03CR) 10Jaime Nuche: [C:03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1069325 (https://phabricator.wikimedia.org/T359795) (owner: 10Dzahn) [09:08:29] (03CR) 10Zabe: [C:03+2] Do not log failed autocreations on closed wikis as diagnostic errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068879 (https://phabricator.wikimedia.org/T373650) (owner: 10Zabe) [09:08:50] (03PS14) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:09:05] (03PS5) 10Clément Goubert: sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 [09:09:15] (03Merged) 10jenkins-bot: Do not log failed autocreations on closed wikis as diagnostic errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1068879 (https://phabricator.wikimedia.org/T373650) (owner: 10Zabe) [09:09:34] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1068879|Do not log failed autocreations on closed wikis as diagnostic errors (T373650)]] [09:09:37] T373650: CentralAuth logspam "Account autocreation denied for {name} by ClosedWikiProvider" - https://phabricator.wikimedia.org/T373650 [09:10:43] (03PS6) 10Tiziano Fogli: ripeatlas: add ping to wmf anchors check [alerts] - 10https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) [09:11:40] !log zabe@deploy1003 zabe: Backport for [[gerrit:1068879|Do not log failed autocreations on closed wikis as diagnostic errors (T373650)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:11:45] !log zabe@deploy1003 zabe: Continuing with sync [09:11:57] (03PS15) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:12:44] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2066 - jayme@cumin1002" [09:12:47] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3807/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [09:12:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2066 - jayme@cumin1002" [09:12:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:12:53] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2066.codfw.wmnet 197.0.192.10.in-addr.arpa 7.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:12:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2066.codfw.wmnet 197.0.192.10.in-addr.arpa 7.9.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:12:57] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2066 [09:14:40] !log update netboot images for Bullseye and Bookworm point releases (11.11 and 12.7) following https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_Foundations/Debian-installer [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:40] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:09] !log zabe@deploy1003 Finished scap 
sync-world: Backport for [[gerrit:1068879|Do not log failed autocreations on closed wikis as diagnostic errors (T373650)]] (duration: 06m 34s) [09:16:12] T373650: CentralAuth logspam "Account autocreation denied for {name} by ClosedWikiProvider" - https://phabricator.wikimedia.org/T373650 [09:16:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2066 [09:16:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2066 [09:16:28] (03CR) 10Btullis: airflow: enable management of remote connections configuration file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026) (owner: 10Brouberol) [09:17:19] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: Disable mutating parts of the restricted PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069943 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:17:29] !log jayme@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2067 [09:18:03] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: Enforce the `restricted` PSS for all namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069944 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:18:06] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: puppetserver1002 thrashing and requiring a power cycle as a result - https://phabricator.wikimedia.org/T373527#10109603 (10elukey) Next steps: * Wait some hours for puppetserver on puppetserver1002 to get to a steady state and observe if we have any m... 
[09:18:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: post replag depool', diff saved to https://phabricator.wikimedia.org/P68517 and previous config saved to /var/cache/conftool/dbconfig/20240902-091844-arnaudb.json [09:19:19] (03CR) 10Btullis: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [09:21:27] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:24:23] (03PS3) 10Elukey: doc: add intersphinx_timeout [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) [09:25:32] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1060855 (https://phabricator.wikimedia.org/T367410) (owner: 10Elukey) [09:27:11] (03CR) 10JMeybohm: [C:04-1] "Can we find a more descriptive name for this please? "Query" is super generic..." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:28:27] (03PS2) 10Jelto: admin_ng: add query service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) [09:29:00] (03PS3) 10Jelto: admin_ng: add wikidata-query-gui service namespace to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) [09:29:29] (03CR) 10Jelto: "What about `wikidata-query-gui` :)?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069953 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:29:34] (03PS16) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:30:29] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2067 - jayme@cumin1002" [09:30:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2067 - jayme@cumin1002" [09:30:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:30:34] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2067.codfw.wmnet 88.0.192.10.in-addr.arpa 8.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:30:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2067.codfw.wmnet 88.0.192.10.in-addr.arpa 8.8.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [09:30:38] !log jayme@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2067 [09:30:40] RESOLVED: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2067 [09:30:50] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2067 [09:33:48] (03PS17) 10Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/1068786 [09:34:11] (03PS1) 10Elukey: CHANGELOG: add changelogs for release v8.11.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069958 [09:34:15] (03PS1) 10Filippo Giunchedi: Revert "thanos: temp disable compact" [puppet] - 10https://gerrit.wikimedia.org/r/1069959 [09:34:36] (03PS7) 10Clément Goubert: sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 [09:34:42] (03CR) 10CI reject: [V:04-1] Revert "thanos: temp disable compact" [puppet] - 10https://gerrit.wikimedia.org/r/1069959 (owner: 10Filippo Giunchedi) [09:34:46] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3809/co" [puppet] - 10https://gerrit.wikimedia.org/r/1068786 (owner: 10Slyngshede) [09:35:56] (03PS2) 10Filippo Giunchedi: Revert "thanos: temp disable compact" [puppet] - 10https://gerrit.wikimedia.org/r/1069959 [09:36:50] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "thanos: temp disable compact" [puppet] - 10https://gerrit.wikimedia.org/r/1069959 (owner: 10Filippo Giunchedi) [09:38:12] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: remove check_etcd_mw_config_lastindex [puppet] - 10https://gerrit.wikimedia.org/r/1060769 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi) [09:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:41:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:37] (03CR) 10Hnowlan: [C:03+1] "Good stuff, lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:42:48] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.renumber-node: vlan, IP change k8s workers [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:44:50] (03PS1) 10Filippo Giunchedi: icinga: add systemd::timer::job required parameters [puppet] - 10https://gerrit.wikimedia.org/r/1069961 [09:45:28] (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v8.11.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069958 (owner: 10Elukey) [09:45:31] (03PS2) 10Filippo Giunchedi: icinga: add systemd::timer::job required parameters [puppet] - 10https://gerrit.wikimedia.org/r/1069961 (https://phabricator.wikimedia.org/T322523) [09:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:00] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10109681 (10dcaro) >>! In T348643#10104246, @wiki_willy wrote: > Hi @dcaro - just following up on this to see if you... 
[09:47:22] (03PS2) 10Elukey: CHANGELOG: add changelogs for release v8.11.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069958 [09:50:20] (03CR) 10Filippo Giunchedi: [C:03+2] icinga: add systemd::timer::job required parameters [puppet] - 10https://gerrit.wikimedia.org/r/1069961 (https://phabricator.wikimedia.org/T322523) (owner: 10Filippo Giunchedi) [09:53:48] (03Merged) 10jenkins-bot: sre.k8s.renumber-node: vlan, IP change k8s workers [cookbooks] - 10https://gerrit.wikimedia.org/r/1067989 (owner: 10Clément Goubert) [09:53:51] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:54:42] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:55:48] !log isaranto@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: sync [09:55:53] (03Abandoned) 10DCausse: cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 (owner: 10DCausse) [09:55:58] !log isaranto@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [09:56:11] (03PS1) 10Hnowlan: k8s: rename mw238[6789] to wikikube-worker hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1069962 (https://phabricator.wikimedia.org/T372878) [09:56:35] (03CR) 10CI reject: [V:04-1] k8s: rename mw238[6789] to wikikube-worker hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1069962 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [09:58:39] (03PS2) 10Hnowlan: k8s: rename mw238[6789] to use wikikube-worker hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1069962 (https://phabricator.wikimedia.org/T372878) [09:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[09:59:58] !log isaranto@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: sync [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1000) [10:00:14] (03CR) 10Elukey: [C:03+2] CHANGELOG: add changelogs for release v8.11.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1069958 (owner: 10Elukey) [10:00:20] !log isaranto@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [10:00:59] !log isaranto@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [10:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:11] !log isaranto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [10:02:36] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: Disable mutating parts of the restricted PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069943 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [10:03:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:03:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:04:22] (03PS1) 10Elukey: Upstream release v8.11.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1069983 [10:04:38] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [10:04:43] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet [10:04:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet [10:04:56] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host stat1011.eqiad.wmnet [10:05:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [10:06:15] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2066.codfw.wmnet with OS bullseye [10:06:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [10:06:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109751 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... 
[10:07:00] (03CR) 10Brouberol: airflow: enable management of remote connections configuration file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026) (owner: 10Brouberol) [10:07:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [10:07:20] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [10:08:16] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on an-worker1127.eqiad.wmnet with reason: Cold booting due to RAID controller battery issue [10:08:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on an-worker1127.eqiad.wmnet with reason: Cold booting due to RAID controller battery issue [10:08:44] 07sre-alert-triage, 10Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10109755 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9998aca3-dd42-4bee-ad30-407ad1cac83b) set by btullis@cumin1002 f... 
[10:12:02] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet [10:12:05] (03CR) 10Elukey: [V:03+2 C:03+2] Upstream release v8.11.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1069983 (owner: 10Elukey) [10:12:54] (03CR) 10Clément Goubert: [C:03+1] k8s: rename mw238[6789] to use wikikube-worker hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1069962 (https://phabricator.wikimedia.org/T372878) (owner: 10Hnowlan) [10:13:47] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet [10:14:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1011.eqiad.wmnet [10:15:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [10:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:21:07] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2067.codfw.wmnet with OS bullseye [10:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:21:18] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack 
subnets - https://phabricator.wikimedia.org/T372878#10109761 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube... [10:23:22] I uploaded a large batch of patches and then rebased them all to the master branch. That seems to have caused zuul to stop processing jobs [10:25:40] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:26:29] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2386.codfw.wmnet [10:27:24] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:28:11] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:29:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2386.codfw.wmnet [10:29:43] Posted in #wikimedia-releng [10:31:45] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:34:12] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:34:44] !log stevemunene@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:34:47] (03PS1) 10EoghanGaffney: apt-staging: Remove 'staging' flag from distribution name [puppet] - 10https://gerrit.wikimedia.org/r/1069991 [10:42:45] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2387.codfw.wmnet [10:43:20] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2387.codfw.wmnet [10:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[10:44:18] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2388.codfw.wmnet
[10:44:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2388.codfw.wmnet
[10:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:22] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1069954 (owner: Muehlenhoff)
[10:49:41] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2389.codfw.wmnet
[10:50:15] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2389.codfw.wmnet
[10:50:30] (CR) JMeybohm: [C:+2] Make k8s/pool-depool-node work on control-planes as well [cookbooks] - https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[10:52:12] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack errors out when building without connectivity - https://phabricator.wikimedia.org/T373794 (elukey) NEW
[10:53:14] (PS11) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:54:11] (CR) Hnowlan: [C:+2] k8s: rename mw238[6789] to use wikikube-worker hostnames [puppet] - https://gerrit.wikimedia.org/r/1069962 (https://phabricator.wikimedia.org/T372878) (owner: Hnowlan)
[10:55:57] (PS12) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:57:05] (PS13) Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083)
[10:57:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2386 to wikikube-worker2068
[10:57:47] (PS4) JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - https://gerrit.wikimedia.org/r/1063006
[10:57:52] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[10:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:02] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2387 to wikikube-worker2069
[11:01:52] (CR) Tiziano Fogli: "I also restricted the alerts scope to the datahub and production-elk7-.* clusters." [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: Tiziano Fogli)
[11:01:52] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2066.codfw.wmnet with OS bullseye
[11:01:56] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2066.codfw.wmnet with OS bullseye
[11:02:03] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109838 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[11:02:06] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109839 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[11:02:20] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2388 to wikikube-worker2070
[11:02:23] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2389 to wikikube-worker2071
[11:02:27] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2066.codfw.wmnet with OS bullseye
[11:02:38] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109841 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[11:02:55] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:03:22] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2386 to wikikube-worker2068 - hnowlan@cumin1002"
[11:03:40] SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10109851 (MoritzMuehlenhoff) p:Triage→Medium
[11:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:06:49] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3812/console" [puppet] - https://gerrit.wikimedia.org/r/1068786 (owner: Slyngshede)
[11:08:08] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:11:29] !log Manual puppet node deactivate for mw2295 mw2296 mw2377 mw2378 mw2385 - T372878
[11:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:32] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878
[11:13:45] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:17:04] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): Alert in need of triage: MegaRAID (instance an-worker1127) - https://phabricator.wikimedia.org/T373081#10109861 (BTullis) Open→Resolved I cold booted the host. The RAID array controller reported a foreign configuration, which I im...
[11:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2261-2262,2268-2270].codfw.wmnet
[11:19:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2386 to wikikube-worker2068 - hnowlan@cumin1002"
[11:19:53] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:19:54] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2068
[11:20:08] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2068
[11:20:10] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2067.codfw.wmnet with OS bullseye
[11:20:18] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:20:26] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109865 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[11:20:28] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:20:43] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[11:20:59] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2387 to wikikube-worker2069 - hnowlan@cumin1002"
[11:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:13] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2067.codfw.wmnet with OS bullseye
[11:21:17] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host mw2386
[11:21:22] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[11:22:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2067.codfw.wmnet with OS bullseye
[11:22:28] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wiki...
[11:22:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2387 to wikikube-worker2069 - hnowlan@cumin1002"
[11:22:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:22:30] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2069
[11:22:44] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2069
[11:23:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2387 to wikikube-worker2069
[11:23:38] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2387 to w...
[11:24:04] (PS1) Clément Goubert: decommission mw226[1-2].codfw.wmnet mw22[68-70] [puppet] - https://gerrit.wikimedia.org/r/1069999 (https://phabricator.wikimedia.org/T371262)
[11:25:44] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:26:52] (Merged) jenkins-bot: Make k8s/pool-depool-node work on control-planes as well [cookbooks] - https://gerrit.wikimedia.org/r/1069186 (https://phabricator.wikimedia.org/T372878) (owner: JMeybohm)
[11:27:23] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2388 to wikikube-worker2070 - hnowlan@cumin1002"
[11:28:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw2386
[11:28:24] (PS18) Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - https://gerrit.wikimedia.org/r/1068786
[11:29:37] (PS2) Clément Goubert: decommission mw226[1-2].codfw.wmnet mw22[68-77] [puppet] - https://gerrit.wikimedia.org/r/1069999 (https://phabricator.wikimedia.org/T371262)
[11:29:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2388 to wikikube-worker2070 - hnowlan@cumin1002"
[11:29:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:29:59] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2070
[11:30:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2070
[11:30:17] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2389 to wikikube-worker2071 - hnowlan@cumin1002"
[11:30:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2389 to wikikube-worker2071 - hnowlan@cumin1002"
[11:30:21] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:30:22] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2071
[11:30:23] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:30:34] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2071
[11:30:52] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2388 to wikikube-worker2070
[11:31:02] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109917 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2388 to w...
[11:31:07] (PS1) Brouberol: dse-k8s-eqiad: re-enable the Flink operator [deployment-charts] - https://gerrit.wikimedia.org/r/1070000 (https://phabricator.wikimedia.org/T368787)
[11:31:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2389 to wikikube-worker2071
[11:31:26] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109920 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2389 to w...
[11:32:02] (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1070000 (https://phabricator.wikimedia.org/T368787) (owner: Brouberol)
[11:32:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:32:41] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from mw2386 to wikikube-worker2068
[11:32:53] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:32:56] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2386 to w...
[11:33:15] FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:33:33] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10109925 (BTullis) There is a drive that is showing clear signs of imminent failure. ` btullis@an-worker1085:~$ sudo smartctl --info --health -d "sat+megaraid,1" /de...
[11:35:13] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:35:14] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2261-2262,2268-2270].codfw.wmnet
[11:35:41] (CR) Brouberol: [C:+2] dse-k8s-eqiad: re-enable the Flink operator [deployment-charts] - https://gerrit.wikimedia.org/r/1070000 (https://phabricator.wikimedia.org/T368787) (owner: Brouberol)
[11:36:27] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:36:37] !log cgoubert@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2271-2277].codfw.wmnet
[11:37:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:37:26] (PS19) Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - https://gerrit.wikimedia.org/r/1068786
[11:38:10] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2386 to wikikube-worker2068
[11:38:12] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from mw2386 to wikikube-worker2068
[11:38:26] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10109948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2386 to w...
[11:38:45] (PS8) Andrew Bogott: keystone + oidc [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590)
[11:39:17] (CR) Andrew Bogott: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1068786 (owner: Slyngshede)
[11:40:26] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[11:41:15] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[11:42:01] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10109957 (BTullis) From `megacli -PDList -aall` we can see that the slot number is 1, but this still doesn't tell us which device name I should stop. ` Enclosure De...
[11:43:32] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: Andrew Bogott)
[11:43:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[11:43:40] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1009.eqiad.wmnet
[11:43:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:27] (CR) Brouberol: airflow: enable management of remote connections configuration file (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026) (owner: Brouberol)
[11:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:53] (PS1) Slyngshede: cloudidp-dev for Horizon OIDC test. [dns] - https://gerrit.wikimedia.org/r/1070001
[11:48:33] (CR) CI reject: [V:-1] cloudidp-dev for Horizon OIDC test. [dns] - https://gerrit.wikimedia.org/r/1070001 (owner: Slyngshede)
[11:48:33] (PS20) Slyngshede: R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - https://gerrit.wikimedia.org/r/1068786
[11:49:34] (PS2) Slyngshede: cloudidp-dev for Horizon OIDC test. [dns] - https://gerrit.wikimedia.org/r/1070001
[11:49:46] (CR) Slyngshede: [C:+2] R:codfw1dev:cloudweb Add CAS IDP installation. [puppet] - https://gerrit.wikimedia.org/r/1068786 (owner: Slyngshede)
[11:49:53] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1009.eqiad.wmnet
[11:50:55] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1010.eqiad.wmnet
[11:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:23] jouncebot: nowandnext
[11:51:24] No deployments scheduled for the next 1 hour(s) and 8 minute(s)
[11:51:24] In 1 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1300)
[11:53:22] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[11:53:22] RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:16] (Merged) jenkins-bot: Add CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[11:54:29] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1062977|Add CommunityRequests (T372527)]]
[11:54:31] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527
[11:55:16] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1010.eqiad.wmnet
[11:56:32] (PS1) Hnowlan: sre.k8s.renumber-node: fix error checking when remote host is not found [cookbooks] - https://gerrit.wikimedia.org/r/1070003
[11:56:52] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw2386 to wikikube-worker2068
[11:57:08] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[11:58:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[11:59:07] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve1011.eqiad.wmnet
[11:59:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[11:59:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:59:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[11:59:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T371742)', diff saved to https://phabricator.wikimedia.org/P68519 and previous config saved to /var/cache/conftool/dbconfig/20240902-115944-ladsgroup.json
[11:59:47] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742
[12:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:01:36] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2386 to wikikube-worker2068 - hnowlan@cumin1002"
[12:03:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2067.codfw.wmnet with OS bullseye
[12:03:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2386 to wikikube-worker2068 - hnowlan@cumin1002"
[12:03:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:03:37] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2068
[12:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:03:44] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10110036 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[12:04:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2068
[12:04:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2386 to wikikube-worker2068
[12:05:03] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10110037 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw2386 to w...
[12:05:10] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1011.eqiad.wmnet
[12:06:26] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[12:10:47] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve2009.codfw.wmnet
[12:13:16] !log samtar@deploy1003 samtar: Backport for [[gerrit:1062977|Add CommunityRequests (T372527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:13:18] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527
[12:13:38] (CR) Brouberol: dse-k8s-eqiad: Disable PSP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: Brouberol)
[12:14:10] !log samtar@deploy1003 samtar: Continuing with sync
[12:16:03] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1068786 (owner: Slyngshede)
[12:17:03] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2009.codfw.wmnet
[12:17:43] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10110057 (BTullis) OK, we only have 11 hadoop data volumes mounted, so I believe that this has already been excluded. We are not mounting the volume with the label `...
[12:17:54] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-serve2011.codfw.wmnet
[12:18:26] (PS9) Andrew Bogott: keystone + oidc [puppet] - https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590)
[12:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:42] (PS1) Filippo Giunchedi: Revert^2 "prometheus: enable oidc auth" [puppet] - https://gerrit.wikimedia.org/r/1070011
[12:19:11] hah nice gerrit compressing revert: revert: into revert^2
[12:19:32] (CR) Filippo Giunchedi: [C:+2] Revert^2 "prometheus: enable oidc auth" [puppet] - https://gerrit.wikimedia.org/r/1070011 (owner: Filippo Giunchedi)
[12:19:35] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2271-2277].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[12:19:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2271-2277].codfw.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1002"
[12:19:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:19:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2271-2277].codfw.wmnet
[12:21:04] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2066.codfw.wmnet with OS bullseye
[12:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:21:17] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10110083 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host wikikube...
[12:22:43] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062977|Add CommunityRequests (T372527)]] (duration: 28m 14s)
[12:22:48] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527
[12:22:49] quite slow(?) deployment of that config patch o_o
[12:23:47] (PS2) Samtar: IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527)
[12:23:49] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2011.codfw.wmnet
[12:24:00] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10110107 (BTullis)
[12:24:04] !log enable oidc for prometheus public web interface - T326657
[12:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:06] T326657: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657
[12:24:22] ops-eqiad, DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10110110 (BTullis)
[12:25:37] jouncebot: nowandnext
[12:25:37] No deployments scheduled for the next 0 hour(s) and 34 minute(s)
[12:25:38] In 0 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1300)
[12:26:25] (PS3) Samtar: IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527)
[12:27:15] (PS2) Samtar: CS: Load CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527)
[12:27:41] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068302 (owner: Ebrahim)
[12:28:17] ops-eqiad, DC-Ops: hw troubleshooting: disk failure for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T373800#10110120 (BTullis) I have executed the following command to enable the drive locator LED for enclosure device ID: 32, slot number: 1 ` btullis@an-worker1085:~$ sudo megacli -PdLoc...
[12:28:19] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1009.eqiad.wmnet
[12:28:31] (CR) CI reject: [V:-1] IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[12:28:38] (CR) CI reject: [V:-1] CS: Load CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[12:28:42] (Merged) jenkins-bot: Enable dark mode for Creator: namespace in Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1068302 (owner: Ebrahim)
[12:28:50] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1068302|Enable dark mode for Creator: namespace in Commons]]
[12:28:58] (PS4) Samtar: IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527)
[12:29:40] sre-alert-triage, Data-Platform-SRE (2024.08.17 - 2024.09.06): SmartNotHealthy on an-worker1085 - https://phabricator.wikimedia.org/T371077#10110128 (BTullis) I have enabled the drive locator LED. Now awaiting a hot-swap from dc-ops, at which point I will follow the procedure outlined here: https://wikit...
[12:30:23] (PS3) Samtar: CS: Load CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1063847 (https://phabricator.wikimedia.org/T372527)
[12:30:53] !log ladsgroup@deploy1003 ladsgroup, ebrahim: Backport for [[gerrit:1068302|Enable dark mode for Creator: namespace in Commons]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:32:01] (PS1) Filippo Giunchedi: prometheus: listen for public vhost on lo v4 and v6 [puppet] - https://gerrit.wikimedia.org/r/1070019
[12:32:41] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1009.eqiad.wmnet
[12:33:07] FIRING: ProbeDown: Service prometheus3003:443 has failed probes (http_prometheus_esams_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus3003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:33:26] (CR) Filippo Giunchedi: [V:+2 C:+2] prometheus: listen for public vhost on lo v4 and v6 [puppet] - https://gerrit.wikimedia.org/r/1070019 (owner: Filippo Giunchedi)
[12:34:01] !log ladsgroup@deploy1003 ladsgroup, ebrahim: Continuing with sync
[12:38:25] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068302|Enable dark mode for Creator: namespace in Commons]] (duration: 09m 34s)
[12:39:15] (PS1) Samtar: IS-labs: Add CommunityRequests to InitialiseSettings-labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1070021 (https://phabricator.wikimedia.org/T372527)
[12:40:19] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068120 (owner: Ladsgroup)
[12:41:22] (PS2) Ladsgroup: Remove the "powered by mediawiki" override [mediawiki-config] - https://gerrit.wikimedia.org/r/1068120
[12:41:28] (CR) TrainBranchBot: "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1068120 (owner: Ladsgroup)
[12:42:13] (PS1) Samtar: CS: Load CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1070022 (https://phabricator.wikimedia.org/T372527)
[12:42:25] (Merged) jenkins-bot: Remove the "powered by mediawiki" override [mediawiki-config] - https://gerrit.wikimedia.org/r/1068120 (owner: Ladsgroup)
[12:42:37] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1068120|Remove the "powered by mediawiki" override]]
[12:42:48] FIRING: [2x] KubernetesCalicoDown: kubernetes2008.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:42:48] (PS2) Samtar: CS-labs: Load CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1070022 (https://phabricator.wikimedia.org/T372527)
[12:43:07] FIRING: [3x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:28] (CR) Btullis: [C:+1] airflow: enable management of remote connections configuration file [deployment-charts] - https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026) (owner: Brouberol)
[12:44:42] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1068120|Remove the "powered by mediawiki" override]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:45:05] jouncebot: nowandnext
[12:45:05] No deployments scheduled for the next 0 hour(s) and 14 minute(s)
[12:45:05] In 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1300)
[12:45:32] (CR) Volans: [C:+1] "Change looks ok to me, question/nit inline" [cookbooks] - https://gerrit.wikimedia.org/r/1063006 (owner: JMeybohm)
[12:47:12] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[12:47:18] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/1070012 (owner: L10n-bot)
[12:47:36] (CR) Brouberol: [C:+2] airflow: enable management of remote connections configuration file [deployment-charts] - https://gerrit.wikimedia.org/r/1069952 (https://phabricator.wikimedia.org/T373026) (owner: Brouberol)
[12:49:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[12:49:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1206.eqiad.wmnet with reason: dump replag
[12:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[12:49:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070021 (https://phabricator.wikimedia.org/T372527) (owner: Samtar)
[12:49:45] (PS1) Filippo Giunchedi: prometheus: let envoy listen on ipv6 too [puppet] - https://gerrit.wikimedia.org/r/1070023
[12:49:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1206.eqiad.wmnet with reason: dump replag [12:50:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, September 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070022 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [12:50:53] FIRING: [8x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:43] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1068120|Remove the "powered by mediawiki" override]] (duration: 09m 05s) [12:52:47] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: let envoy listen on ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/1070023 (owner: 10Filippo Giunchedi) [12:54:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T371742)', diff saved to https://phabricator.wikimedia.org/P68520 and previous config saved to /var/cache/conftool/dbconfig/20240902-125406-ladsgroup.json [12:54:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:55:18] (03PS10) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [12:55:53] FIRING: [9x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:01] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [12:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:10] Dreamy_Jazz: might you be self-deploying in the next window? [12:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:17] Yes [12:56:24] ack :) [12:56:51] (03PS1) 10Jelto: gitlab: adjust throttling threshold for GitLab [puppet] - 10https://gerrit.wikimedia.org/r/1070025 (https://phabricator.wikimedia.org/T366882) [12:56:56] I'll start deploying now (might as well) [12:57:09] yeah good idea [12:57:37] (03PS3) 10Dreamy Jazz: Remove wgCheckUserPurgeOldClientHintsData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) [12:57:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) (owner: 10Dreamy Jazz) [12:58:23] (03PS11) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [12:58:32] (03Merged) 10jenkins-bot: Remove wgCheckUserPurgeOldClientHintsData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064458 (https://phabricator.wikimedia.org/T359560) (owner: 10Dreamy Jazz) [12:58:45] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1064458|Remove wgCheckUserPurgeOldClientHintsData (T359560)]] [12:58:47] T359560: Create the CheckUserDataPruner service - 
https://phabricator.wikimedia.org/T359560 [12:58:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [12:58:52] !log apply qos classifiers and schedulers to server interfaces on asw-d-codfw T339850 [12:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:55] T339850: Configure QoS marking and policy across network - https://phabricator.wikimedia.org/T339850 [12:59:15] (03PS5) 10Samtar: IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1300). nyaa~ [13:00:05] Dreamy_Jazz, MatmaRex, and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] \o [13:00:13] hi [13:00:35] I can run your script after Dreamy_Jazz is done (unless they want to :)) [13:00:42] thanks [13:01:10] Thanks, I'll let you run the script. [13:01:23] (ack) [13:01:36] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1064458|Remove wgCheckUserPurgeOldClientHintsData (T359560)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:01:49] Might be able to start it now, considering that it shouldn't conflict with the deployment? 
[13:01:53] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [13:02:43] good point [13:05:42] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10110260 (10akosiaris) [13:05:47] (03PS2) 10Brouberol: global_config: define an external-services entry for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1070024 (https://phabricator.wikimedia.org/T368033) [13:06:27] !log `[samtar@mwmaint1002 ~]$ mwscript maintenance/cleanupTitles.php --wiki=ptwiki --prefix=T195546 2>&1 | tee ~/T195546-ptwiki.log` for T195546 [13:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages - https://phabricator.wikimedia.org/T195546 [13:07:32] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064458|Remove wgCheckUserPurgeOldClientHintsData (T359560)]] (duration: 08m 47s) [13:07:37] T359560: Create the CheckUserDataPruner service - https://phabricator.wikimedia.org/T359560 [13:07:41] MatmaRex: do you want the full output? [13:08:21] TheresNoTime: if you could copy it into a phab paste, yeah [13:08:33] i guess you can remove the progress info rows if you want [13:09:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P68522 and previous config saved to /var/cache/conftool/dbconfig/20240902-130914-ladsgroup.json [13:09:21] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack errors out when building without connectivity - https://phabricator.wikimedia.org/T373794#10110286 (10elukey) I tried to build 8.10.0 to make sure that it wasn't related to the 8.11.0 changes, and I get the same result. 
[13:11:21] (03PS6) 10Volans: sre.switchdc.databases: new cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1059052 (https://phabricator.wikimedia.org/T371351) [13:12:04] MatmaRex: done :) [13:12:27] thanks [13:12:54] Dreamy_Jazz: your patch is done, yes? [13:13:02] Yeah [13:13:04] (03PS12) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:13:14] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:13:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [13:14:06] (03Merged) 10jenkins-bot: IS: Add CommunityRequests to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063843 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [13:14:20] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1063843|IS: Add CommunityRequests to InitialiseSettings (T372527)]] [13:14:21] (03PS5) 10JMeybohm: reimage: Don't fail when d-i takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 [13:14:22] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527 [13:14:24] (03CR) 10JMeybohm: reimage: Don't fail when d-i takes a long time (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (owner: 10JMeybohm) [13:14:30] (03CR) 10JMeybohm: reimage: Don't fail when d-i takes a long time (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (owner: 10JMeybohm) [13:15:34] (03PS1) 10Brouberol: airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) [13:16:19] !log samtar@deploy1003 samtar: 
Backport for [[gerrit:1063843|IS: Add CommunityRequests to InitialiseSettings (T372527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:16:38] !log samtar@deploy1003 samtar: Continuing with sync [13:16:46] (03CR) 10JMeybohm: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:17:44] (03CR) 10Brouberol: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:18:33] (03PS13) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:18:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:03] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063843|IS: Add CommunityRequests to InitialiseSettings (T372527)]] (duration: 06m 43s) [13:21:09] T372527: Deploy CommunityRequests to Meta - https://phabricator.wikimedia.org/T372527 [13:21:18] 06SRE, 06Infrastructure-Foundations, 10netops: EX4600 does not support class-of-service 'port scheduling' - https://phabricator.wikimedia.org/T373594#10110339 (10cmooney) Despite this[[ https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/concept/cos-qfx-series-support-by-qf... 
[13:22:36] (03Merged) 10jenkins-bot: IS-labs: Add CommunityRequests to InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070021 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [13:22:39] (03PS1) 10Cathal Mooney: Modify class-of-service scheduler config for qfx5100 [homer/public] - 10https://gerrit.wikimedia.org/r/1070027 (https://phabricator.wikimedia.org/T373594) [13:22:41] (03Merged) 10jenkins-bot: CS-labs: Load CommunityRequests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1070022 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [13:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:19] !log done UTC afternoon backport window [13:24:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P68523 and previous config saved to /var/cache/conftool/dbconfig/20240902-132421-ladsgroup.json [13:24:26] (03PS14) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:24:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:24:34] !log done UTC afternoon backport window [13:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:41] (03PS2) 10Cathal Mooney: Modify class-of-service scheduler config for qfx5100 [homer/public] - 10https://gerrit.wikimedia.org/r/1070027 (https://phabricator.wikimedia.org/T373594) [13:27:37] (03PS1) 10Elukey: setup.py: update pynetbox to 7.4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070028 
(https://phabricator.wikimedia.org/T373794) [13:29:42] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10110383 (10Ladsgroup) Noting that es2032 is a "perceived master" (since dbctl requires a master) so you can't just depool it. You need to switc... [13:30:13] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack errors out when building without connectivity - https://phabricator.wikimedia.org/T373794#10110390 (10elukey) p:05Triage→03High [13:30:38] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10110392 (10MoritzMuehlenhoff) [13:32:02] (03CR) 10JMeybohm: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:32:03] (03CR) 10Volans: [C:03+1] "LGTM, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070028 (https://phabricator.wikimedia.org/T373794) (owner: 10Elukey) [13:33:05] (03CR) 10Cathal Mooney: [C:03+2] Modify class-of-service scheduler config for qfx5100 [homer/public] - 10https://gerrit.wikimedia.org/r/1070027 (https://phabricator.wikimedia.org/T373594) (owner: 10Cathal Mooney) [13:33:38] (03Merged) 10jenkins-bot: Modify class-of-service scheduler config for qfx5100 [homer/public] - 10https://gerrit.wikimedia.org/r/1070027 (https://phabricator.wikimedia.org/T373594) (owner: 10Cathal Mooney) [13:34:13] (03CR) 10Btullis: airflow: perform an initial git-sync of the dags in an init container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:34:13] (03PS3) 10Slyngshede: cloudidp-dev for Horizon OIDC test. 
[dns] - 10https://gerrit.wikimedia.org/r/1070001 [13:34:24] (03CR) 10Brouberol: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:35:50] (03PS15) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:36:13] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10110407 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:36:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:39:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T371742)', diff saved to https://phabricator.wikimedia.org/P68524 and previous config saved to /var/cache/conftool/dbconfig/20240902-133928-ladsgroup.json [13:39:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:39:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:39:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1162.eqiad.wmnet with reason: Maintenance [13:39:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T371742)', diff saved to https://phabricator.wikimedia.org/P68525 and previous config saved to /var/cache/conftool/dbconfig/20240902-133950-ladsgroup.json [13:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:41:20] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, and 2 others: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10110436 (10ABran-WMF) >>! In T373095#10110383, @Ladsgroup wrote: > Noting that es2032 is a "perceived master" (since dbctl requires a master) s... [13:41:41] (03PS3) 10Brouberol: global_config: define an external-services entry for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1070024 (https://phabricator.wikimedia.org/T368033) [13:42:10] (03CR) 10Brouberol: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070024 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:43:32] (03PS16) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:43:47] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2066.codfw.wmnet with OS bullseye [13:43:54] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10110447 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host wikikube-worker2066.codfw.... [13:45:38] (03PS1) 10Slyngshede: P:trafficserver::backend add cloudidp-dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070030 [13:45:41] (03PS17) 10Andrew Bogott: keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) [13:45:41] (03PS1) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 [13:46:22] (03PS2) 10Andrew Bogott: Horizon: enable OIDC auth [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) [13:46:36] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [13:47:47] (03PS1) 10Clément Goubert: trafficserver: Fix /w/rest.php and /api/ regex_map [puppet] - 10https://gerrit.wikimedia.org/r/1070032 (https://phabricator.wikimedia.org/T364400) [13:48:17] (03CR) 10Elukey: [C:03+2] setup.py: update pynetbox to 7.4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070028 (https://phabricator.wikimedia.org/T373794) (owner: 10Elukey) [13:49:15] (03CR) 10Brouberol: airflow: perform an initial git-sync of the dags in an init container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:50:24] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3815/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [13:50:30] (03PS6) 10JMeybohm: reimage: Don't fail when d-i takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (https://phabricator.wikimedia.org/T372648) [13:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[13:51:19] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [13:51:30] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s) [13:51:42] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: sre.hosts.reimage failing due to mkfs.ext4 taking to long - https://phabricator.wikimedia.org/T372648#10110491 (10JMeybohm) >>! In T372648#10073513, @SLyngshede-WMF wrote: > It's probably enough to bump the default timeout as a qui... [13:53:45] (03CR) 10Brouberol: airflow: perform an initial git-sync of the dags in an init container (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) (owner: 10Brouberol) [13:54:43] (03PS4) 10Slyngshede: cloudidp-dev for Horizon OIDC test. [dns] - 10https://gerrit.wikimedia.org/r/1070001 [13:55:28] 10SRE-tools, 06Infrastructure-Foundations: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran) - https://phabricator.wikimedia.org/T373810 (10JMeybohm) 03NEW [13:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:21] (03PS8) 10Hnowlan: sre.k8s.renumber-node: Handle renamed host [cookbooks] - 10https://gerrit.wikimedia.org/r/1068779 (owner: 10Clément Goubert) [13:56:22] (03PS2) 10Hnowlan: sre.k8s.renumber-node: fix error checking when remote host is not found [cookbooks] - 10https://gerrit.wikimedia.org/r/1070003 [13:57:19] (03PS2) 10Slyngshede: P:trafficserver::backend add cloudidp-dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070030 [13:58:05] (03CR) 10JMeybohm: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [13:59:32] (03CR) 10Brouberol: dse-k8s-eqiad: Disable PSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [14:00:05] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [14:00:15] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 10s) [14:00:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2066.codfw.wmnet with reason: host reimage [14:00:31] (03Merged) 10jenkins-bot: setup.py: update pynetbox to 7.4 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1070028 (https://phabricator.wikimedia.org/T373794) (owner: 10Elukey) [14:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:32] (03CR) 10Andrew Bogott: [C:03+2] keystone + oidc [puppet] - 10https://gerrit.wikimedia.org/r/1068877 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:01:34] 10SRE-tools, 06Infrastructure-Foundations: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran) - https://phabricator.wikimedia.org/T373810#10110513 (10JMeybohm) [14:02:00] (03PS2) 10Clément Goubert: trafficserver: Fix /w/rest.php and /api/ regex_map [puppet] - 10https://gerrit.wikimedia.org/r/1070032 (https://phabricator.wikimedia.org/T364400) [14:02:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T371742)', diff saved 
to https://phabricator.wikimedia.org/P68527 and previous config saved to /var/cache/conftool/dbconfig/20240902-140208-ladsgroup.json [14:02:11] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:02:43] (03CR) 10Volans: "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (https://phabricator.wikimedia.org/T372648) (owner: 10JMeybohm) [14:03:29] (03CR) 10JMeybohm: [C:03+2] reimage: Don't fail when d-i takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (https://phabricator.wikimedia.org/T372648) (owner: 10JMeybohm) [14:04:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2066.codfw.wmnet with reason: host reimage [14:05:49] (03PS3) 10Hnowlan: sre.k8s.renumber-node: fix remote_host behaviours when renaming [cookbooks] - 10https://gerrit.wikimedia.org/r/1070003 [14:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:11] (03CR) 10Vgutierrez: [C:04-1] "public hostname `cloudidp-dev.wikimedia.org` is missing on the backend server TLS certificate SNI list:" [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [14:12:10] (03CR) 10Btullis: "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1069945 (https://phabricator.wikimedia.org/T369492) (owner: 10Brouberol) [14:12:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070031 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [14:15:18] (03PS1) 10Slyngshede: R:codfw1dev:cloudweb: Add cloudidp-dev TLS. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070034 [14:15:53] FIRING: [9x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:16:02] (03Merged) 10jenkins-bot: reimage: Don't fail when d-i takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 (https://phabricator.wikimedia.org/T372648) (owner: 10JMeybohm) [14:16:45] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3816/co" [puppet] - 10https://gerrit.wikimedia.org/r/1070034 (owner: 10Slyngshede) [14:17:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P68528 and previous config saved to /var/cache/conftool/dbconfig/20240902-141715-ladsgroup.json [14:18:07] FIRING: [9x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:19:55] (03CR) 10Andrew Bogott: [C:03+1] cloudidp-dev for Horizon OIDC test. [dns] - 10https://gerrit.wikimedia.org/r/1070001 (owner: 10Slyngshede) [14:20:16] (03CR) 10Andrew Bogott: [C:03+1] R:codfw1dev:cloudweb: Add cloudidp-dev TLS. [puppet] - 10https://gerrit.wikimedia.org/r/1070034 (owner: 10Slyngshede) [14:20:45] (03CR) 10Andrew Bogott: [C:03+1] P:trafficserver::backend add cloudidp-dev. 
[puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [14:20:53] RESOLVED: [9x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:01] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-ctrl2001.codfw.wmnet [14:22:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-ctrl2001.codfw.wmnet [14:22:09] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T373811 (10ihurbain) 03NEW [14:22:34] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-ctrl2003.codfw.wmnet [14:22:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-ctrl2003.codfw.wmnet [14:27:31] (03CR) 10Andrew Bogott: P:trafficserver::backend add cloudidp-dev. [puppet] - 10https://gerrit.wikimedia.org/r/1070030 (owner: 10Slyngshede) [14:32:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P68529 and previous config saved to /var/cache/conftool/dbconfig/20240902-143222-ladsgroup.json [14:37:58] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:48] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:45:05] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:46:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:47:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T371742)', diff saved to https://phabricator.wikimedia.org/P68530 and previous config saved to /var/cache/conftool/dbconfig/20240902-144729-ladsgroup.json [14:47:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:47:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [14:47:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [14:47:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2066.codfw.wmnet with OS bullseye [14:47:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68531 and previous config saved to /var/cache/conftool/dbconfig/20240902-144751-ladsgroup.json [14:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:55] !log Disabling puppet on O:cache::text to merge 1070032 - T364400 [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:57] T364400: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400 [15:01:28] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:54] !log Enabling puppet on cp2027.codfw.wmnet to test 1070032 - T364400 [15:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:04:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:04:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:05:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:09:48] !log running homer 'lsw1-a6-codfw*' commit 'T372878' [15:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:54] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [15:11:10] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2066.codfw.wmnet [15:11:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2066.codfw.wmnet [15:11:17] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2067.codfw.wmnet [15:11:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2067.codfw.wmnet [15:12:14] !log running homer 'cr*codfw*' commit 'T372878' [15:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:48] RESOLVED: [2x] KubernetesCalicoDown: kubernetes2008.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:18:29] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2070.codfw.wmnet with OS bullseye [15:18:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2069.codfw.wmnet with OS bullseye [15:18:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2068.codfw.wmnet with OS bullseye [15:18:39] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2070 [15:19:12] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:24:17] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update 
records for host wikikube-worker2070 - hnowlan@cumin1002" [15:24:22] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2070 - hnowlan@cumin1002" [15:24:23] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:24:23] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2070.codfw.wmnet 51.0.192.10.in-addr.arpa 1.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:24:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2070.codfw.wmnet 51.0.192.10.in-addr.arpa 1.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:24:27] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2070 [15:24:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2070 [15:24:42] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2070 [15:25:13] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2069 [15:25:33] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:26:19] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:26:32] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:29:01] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2069 - hnowlan@cumin1002" [15:29:05] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2069 - hnowlan@cumin1002" [15:29:06] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:06] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2069.codfw.wmnet 50.0.192.10.in-addr.arpa 0.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:29:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2069.codfw.wmnet 50.0.192.10.in-addr.arpa 0.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:29:09] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2069 [15:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1530). 
[15:30:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2069 [15:30:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2069 [15:31:31] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2068 [15:31:53] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:35:22] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2068 - hnowlan@cumin1002" [15:35:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2068 - hnowlan@cumin1002" [15:35:26] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:35:26] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2068.codfw.wmnet 49.0.192.10.in-addr.arpa 9.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:35:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2068.codfw.wmnet 49.0.192.10.in-addr.arpa 9.4.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:35:31] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2068 [15:36:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2068 [15:36:16] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2068 [15:38:18] FIRING: [2x] KubernetesCalicoDown: mw2387.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:39:18] !log Enabling puppet on O:cache::text for 1070032 - T364400 [15:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:20] T364400: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400 [15:40:34] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage [15:44:14] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage [15:46:59] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage [15:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:50:08] !log uploaded spicerack_8.12.0 to apt.wikimedia.org bullseye-wikimedia [15:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:34] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage [15:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:36] !log spicerack 8.12.0 installed on cumin2002 [15:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2068.codfw.wmnet with reason: host reimage [15:53:41] FIRING: [4x] SystemdUnitFailed: 
generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:55:05] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2068.codfw.wmnet with reason: host reimage [15:58:18] FIRING: [3x] KubernetesCalicoDown: mw2386.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68533 and previous config saved to /var/cache/conftool/dbconfig/20240902-155947-ladsgroup.json [15:59:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:04:40] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:04] FIRING: PuppetDisabled: Puppet disabled on wikikube-worker2001:9100 - 
https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kubernetes&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [16:08:12] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2071.codfw.wmnet with OS bullseye [16:08:22] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2071 [16:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:11:13] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [16:14:32] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2071 - hnowlan@cumin1002" [16:14:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2071 - hnowlan@cumin1002" [16:14:36] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:36] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2071.codfw.wmnet 52.0.192.10.in-addr.arpa 2.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:14:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2071.codfw.wmnet 52.0.192.10.in-addr.arpa 2.5.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:14:40] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2071 [16:14:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff 
saved to https://phabricator.wikimedia.org/P68534 and previous config saved to /var/cache/conftool/dbconfig/20240902-161454-ladsgroup.json [16:15:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2071 [16:15:00] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2071 [16:21:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:18] FIRING: [4x] KubernetesCalicoDown: mw2386.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:24:09] SRE-tools, Infrastructure-Foundations: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran) - https://phabricator.wikimedia.org/T373810#10110980 (JMeybohm) Fine by me. We have quite an amount of reimages still to do - will see if this hits us again. I did the...
[16:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T367856)', diff saved to https://phabricator.wikimedia.org/P68535 and previous config saved to /var/cache/conftool/dbconfig/20240902-162836-marostegui.json [16:28:39] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [16:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P68536 and previous config saved to /var/cache/conftool/dbconfig/20240902-163001-ladsgroup.json [16:31:03] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage [16:33:34] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage [16:37:37] (PS1) Daimona Eaytoy: Enable CampaignEvents Invitation Lists on igwiki and swwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1070055 (https://phabricator.wikimedia.org/T372582) [16:38:20] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070055 (https://phabricator.wikimedia.org/T372582) (owner: Daimona Eaytoy) [16:43:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P68537 and previous config saved to /var/cache/conftool/dbconfig/20240902-164343-marostegui.json [16:45:09] !log ladsgroup@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db1182 (T371742)', diff saved to https://phabricator.wikimedia.org/P68538 and previous config saved to /var/cache/conftool/dbconfig/20240902-164508-ladsgroup.json [16:45:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:45:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [16:45:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:45:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T371742)', diff saved to https://phabricator.wikimedia.org/P68539 and previous config saved to /var/cache/conftool/dbconfig/20240902-164530-ladsgroup.json [16:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [16:51:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:21] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10111017 (Ladsgroup) Hi, this has caused ~12 alerts just since this weekend
(https://wm-bot.wmflabs.org/libera_logs/%23w... [16:58:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P68540 and previous config saved to /var/cache/conftool/dbconfig/20240902-165848-marostegui.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T1700). [17:02:59] (PS4) Brouberol: airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) [17:04:36] (PS5) Brouberol: airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) [17:04:40] (CR) Stevemunene: [C:+1] airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) (owner: Brouberol) [17:05:03] (PS6) Brouberol: airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) [17:06:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T371742)', diff saved to https://phabricator.wikimedia.org/P68541 and previous config saved to
/var/cache/conftool/dbconfig/20240902-170654-ladsgroup.json [17:06:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:07:20] SRE, Infrastructure-Foundations, netops, serviceops: Issues reimaging kubernetes workers due to package user issues in systemd-timesyncd - https://phabricator.wikimedia.org/T373819 (hnowlan) NEW [17:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:13:41] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2069.codfw.wmnet with OS bullseye [17:13:52] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10111061 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[17:13:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T367856)', diff saved to https://phabricator.wikimedia.org/P68542 and previous config saved to /var/cache/conftool/dbconfig/20240902-171356-marostegui.json [17:13:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2195.codfw.wmnet with reason: Maintenance [17:14:00] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [17:14:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2195.codfw.wmnet with reason: Maintenance [17:14:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T367856)', diff saved to https://phabricator.wikimedia.org/P68543 and previous config saved to /var/cache/conftool/dbconfig/20240902-171418-marostegui.json [17:14:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2071.codfw.wmnet with OS bullseye [17:15:13] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10111066 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [17:15:17] !log homer 'lsw1-a3-codfw*' commit [17:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:17] (CR) Btullis: [C:+1] "Looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/1070024 (https://phabricator.wikimedia.org/T368033) (owner: Brouberol) [17:16:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2070.codfw.wmnet with OS bullseye [17:16:49] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10111067 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku... [17:19:03] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2068.codfw.wmnet with OS bullseye [17:19:14] SRE, Infrastructure-Foundations, netops, serviceops, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10111068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikiku...
[17:19:43] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2068.codfw.wmnet [17:19:45] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2068.codfw.wmnet [17:19:49] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2069.codfw.wmnet [17:19:51] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2069.codfw.wmnet [17:19:56] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2070.codfw.wmnet [17:19:58] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2070.codfw.wmnet [17:20:03] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2071.codfw.wmnet [17:20:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2071.codfw.wmnet [17:21:18] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T373591#10111073 (hnowlan) [17:21:59] !log homer 'cr*codfw*' commit [17:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P68544 and previous config saved to /var/cache/conftool/dbconfig/20240902-172202-ladsgroup.json [17:23:18] FIRING: [4x] KubernetesCalicoDown: mw2386.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:25:48] RESOLVED: [4x] KubernetesCalicoDown: mw2386.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:37:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P68545 and previous config saved to /var/cache/conftool/dbconfig/20240902-173709-ladsgroup.json [17:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:48:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:50:52] sigh [17:50:56] * Emperor here [17:51:20] nothing to be done, just down time it [17:51:26] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10111087 (Ladsgroup) It just caused a page [17:51:27] it's T368098 [17:51:28] T368098: Dumps generation without prefetch cause disruption to the production environment -
https://phabricator.wikimedia.org/T368098 [17:51:43] Amir1: you going to downtime, or would you like me to? [17:51:50] on it [17:51:53] ack, thanks [17:52:12] I'll ack the page in the mean time [17:52:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T371742)', diff saved to https://phabricator.wikimedia.org/P68547 and previous config saved to /var/cache/conftool/dbconfig/20240902-175216-ladsgroup.json [17:52:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:52:23] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [17:52:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:52:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T371742)', diff saved to https://phabricator.wikimedia.org/P68548 and previous config saved to /var/cache/conftool/dbconfig/20240902-175238-ladsgroup.json [17:52:42] Am I crazy or these pages don't end up replicated here? [17:52:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1206.eqiad.wmnet with reason: Dumps causing issues [17:53:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1206.eqiad.wmnet with reason: Dumps causing issues [17:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:53] claime: we see them in the data-persistence channel [17:54:18] while I'm here, that generate_vrts_aliases.service has been failed for days now...
[17:54:28] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10111093 (Ladsgroup) ` | 417670674 | wikiadmin2023 | 10.64.0.157:44926 | enwiki | Query | 3 | Creating so... [17:56:04] That's kind of my point, when on batphone they're paging everyone, but are not in -operations. I agree that most team-specific alerts should go to a team's channel but if they're paging they should end up here as well imo [17:56:28] SRE, Dumps 2.0, Dumps-Generation, Patch-For-Review: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10111094 (JJMC89) Even when it doesn't page, the increased lag causes bots that respect a reasonable `maxlag` to not be... [17:57:03] anyhow, back to making dinner [17:57:12] we must stop meeting like this... [17:57:22] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T373820 (Idxntcx) NEW [17:57:38] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T373820#10111105 (Idxntcx) a:Idxntcx [17:58:24] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T373820#10111106 (Idxntcx) Open→Invalid CHRISTIANITY [18:00:31] dumps need to be fixed - having db lag issues for 2+ months is unacceptable [18:13:45] (CR) Brouberol: [C:+2] global_config: define an external-services entry for gitlab.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1070024 (https://phabricator.wikimedia.org/T368033) (owner: Brouberol) [18:13:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T371742)', diff saved to https://phabricator.wikimedia.org/P68549 and previous config saved to /var/cache/conftool/dbconfig/20240902-181345-ladsgroup.json [18:13:52] T371742: Change page.page_links_updated to
fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:15:28] (CR) Brouberol: [C:+2] airflow: perform an initial git-sync of the dags in an init container [deployment-charts] - https://gerrit.wikimedia.org/r/1070026 (https://phabricator.wikimedia.org/T368033) (owner: Brouberol) [18:18:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:18:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:18:18] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:19:40] (PS1) Brouberol: airflow-test-k8s: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1070061 [18:23:12] (CR) Brouberol: [C:+2] airflow-test-k8s: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1070061 (owner: Brouberol) [18:23:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P68550 and previous config saved to /var/cache/conftool/dbconfig/20240902-182852-ladsgroup.json [18:41:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 18.26% idle -
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:43:25] (CR) AOkoth: "Yeah, we discussed this with Eoghan. I think there is no safe way to predict how this will behave when installed. I think on the current h" [puppet] - https://gerrit.wikimedia.org/r/1063733 (owner: AOkoth) [18:44:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P68551 and previous config saved to /var/cache/conftool/dbconfig/20240902-184359-ladsgroup.json [18:45:54] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [18:46:03] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 09s) [18:51:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.03% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:55:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 19.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:57:00] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 19.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:59:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T371742)', diff saved to https://phabricator.wikimedia.org/P68552 and previous config saved to /var/cache/conftool/dbconfig/20240902-185906-ladsgroup.json [18:59:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [18:59:10] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:59:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:03:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:08:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.82% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:14:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:15:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [19:20:17] (PS1) Brouberol: airflow: fix configuration checksum [deployment-charts] - https://gerrit.wikimedia.org/r/1070063 [19:23:31] (PS2) Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [19:28:29] (PS3) Brouberol: airflow: enable statsd metric reporting when monitoring is enabled [deployment-charts] - https://gerrit.wikimedia.org/r/1066756 (https://phabricator.wikimedia.org/T369098) [19:34:59] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [19:35:08] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 09s) [19:42:22] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow [19:42:31] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@5315c8d]: Test Refine through Airflow (duration: 00m 09s) [19:45:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [19:45:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [19:45:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T371742)', diff saved to https://phabricator.wikimedia.org/P68553 and previous config saved to
/var/cache/conftool/dbconfig/20240902-194545-ladsgroup.json [19:45:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:54:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.42% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.73% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T2000). [20:00:04] RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:06:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T371742)', diff saved to https://phabricator.wikimedia.org/P68554 and previous config saved to /var/cache/conftool/dbconfig/20240902-200606-ladsgroup.json [20:06:09] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:08:04] FIRING: PuppetDisabled: Puppet disabled on wikikube-worker2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=kubernetes&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:12:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext 
at eqiad: 22.33% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:16:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:21:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P68555 and previous config saved to /var/cache/conftool/dbconfig/20240902-202113-ladsgroup.json [20:21:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 24.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:23:15] Sorry for the delay, deploying now [20:23:32] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1069310 (https://phabricator.wikimedia.org/T373676) (owner: Catrope) [20:30:34] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10111289 (phaultfinder) [20:31:30] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.01% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:33:16] FIRING: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:36:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P68556 and previous config saved to /var/cache/conftool/dbconfig/20240902-203620-ladsgroup.json [20:36:30] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 9.527% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:49:42] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [20:50:58] (Merged) jenkins-bot: CodexModule: Fix double-flipping in RTL [core] (wmf/1.43.0-wmf.20) - https://gerrit.wikimedia.org/r/1069310 (https://phabricator.wikimedia.org/T373676) (owner: Catrope) [20:51:12] !log catrope@deploy1003 Started scap sync-world: Backport for [[gerrit:1069310|CodexModule: Fix double-flipping in RTL (T373676)]] [20:51:15] T373676: Search results items have incorrect spacing in Persian Wikipedia - https://phabricator.wikimedia.org/T373676 [20:51:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T371742)', diff saved to https://phabricator.wikimedia.org/P68557 and previous config saved to /var/cache/conftool/dbconfig/20240902-205128-ladsgroup.json [20:51:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on
db1233.eqiad.wmnet with reason: Maintenance [20:51:31] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:51:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [20:51:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T371742)', diff saved to https://phabricator.wikimedia.org/P68558 and previous config saved to /var/cache/conftool/dbconfig/20240902-205149-ladsgroup.json [20:53:16] RESOLVED: [3x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [20:55:34] !log catrope@deploy1003 catrope: Backport for [[gerrit:1069310|CodexModule: Fix double-flipping in RTL (T373676)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:58:01] !log catrope@deploy1003 catrope: Continuing with sync [20:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:00:05] Reedy, sbassett, Maryum, and manfredi: gettimeofday() says it's time for Weekly Security deployment window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240902T2100) [21:00:24] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10111307 (phaultfinder) [21:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:02:28] (CR) Krinkle: trafficserver: Fix /w/rest.php and /api/ regex_map (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1070032 (https://phabricator.wikimedia.org/T364400) (owner: Clément Goubert) [21:02:44] !log catrope@deploy1003 Finished scap sync-world: Backport for [[gerrit:1069310|CodexModule: Fix double-flipping in RTL (T373676)]] (duration: 11m 31s) [21:02:46] T373676: Search results items have incorrect spacing in Persian Wikipedia - https://phabricator.wikimedia.org/T373676 [21:08:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:25:25] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10111320 (phaultfinder) [21:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:31:11]
FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:40:25] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:40:25] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T373755#10111324 (phaultfinder) [21:48:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T371742)', diff saved to https://phabricator.wikimedia.org/P68559 and previous config saved to /var/cache/conftool/dbconfig/20240902-214802-ladsgroup.json [21:48:07] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:58:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:03:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P68560 and previous config saved to /var/cache/conftool/dbconfig/20240902-220310-ladsgroup.json [22:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:13:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:18:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P68561 and previous config saved to /var/cache/conftool/dbconfig/20240902-221817-ladsgroup.json [22:26:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:33:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T371742)', diff saved to https://phabricator.wikimedia.org/P68562 and previous config saved to /var/cache/conftool/dbconfig/20240902-223324-ladsgroup.json [22:33:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [22:33:28] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:33:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [23:01:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:03:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[23:22:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [23:22:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [23:22:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T371742)', diff saved to https://phabricator.wikimedia.org/P68563 and previous config saved to /var/cache/conftool/dbconfig/20240902-232222-ladsgroup.json [23:22:26] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [23:53:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:11] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed