[00:08:25] FIRING: [2x] SystemdUnitFailed: man-db.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:17] PROBLEM - SSH on bast7001 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:19:17] RECOVERY - SSH on bast7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091924 [00:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091924 (owner: 10TrainBranchBot) [01:08:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091925 [01:08:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091925 (owner: 10TrainBranchBot) [01:13:45] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1091924 (owner: 10TrainBranchBot) [01:41:13] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1091925 (owner: 10TrainBranchBot) [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:51] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.009e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [04:08:25] FIRING: [2x] SystemdUnitFailed: man-db.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:52:04] (03PS1) 10KartikMistry: Enable the Contribute menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) [05:52:53] (03CR) 10CI reject: [V:04-1] Enable the Contribute menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) (owner: 10KartikMistry) [05:55:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) (owner: 10KartikMistry) [06:12:02] (03PS2) 10KartikMistry: Enable the Contribute 
menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) [06:12:41] (03CR) 10CI reject: [V:04-1] Enable the Contribute menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) (owner: 10KartikMistry) [06:14:21] (03PS3) 10KartikMistry: Enable the Contribute menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) [06:18:20] Doing quick installation of MinT on eqiad.. [06:19:03] err. deployment :) [06:19:19] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:28:50] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:31:17] !log Updated MinT to 2024-10-16-065051-production on eqiad [06:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:49] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:22:29] (03CR) 10JMeybohm: [C:03+1] wikikube-staging: put kubestage2003 and 2004 into production [puppet] - 10https://gerrit.wikimedia.org/r/1091783 (https://phabricator.wikimedia.org/T377011) (owner: 10Jasmine) [07:35:01] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144 (10MoritzMuehlenhoff) 03NEW [07:46:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [07:46:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T373037, host is not pooled [07:46:10] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [07:46:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:46:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:46:17] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [07:47:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [07:48:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330008 (10ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs [07:50:17] (03PS2) 10Stevemunene: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 
(https://phabricator.wikimedia.org/T378440) [07:50:17] (03PS2) 10Stevemunene: airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) [07:51:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [07:52:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1020.eqiad.wmnet [07:52:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330017 (10ops-monitoring-bot) Draining ganeti1020.eqiad.wmnet of running VMs [07:54:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [07:56:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [07:56:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330021 (10ops-monitoring-bot) Draining ganeti1021.eqiad.wmnet of running VMs [07:57:39] (03PS1) 10Stevemunene: airflow-analytics-product: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1092180 (https://phabricator.wikimedia.org/T378440) [07:57:41] (03PS1) 10Stevemunene: airflow-analytics-product: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1092181 (https://phabricator.wikimedia.org/T378440) [07:57:42] (03PS1) 10Stevemunene: airflow-analytics-product: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) [07:59:25] (03CR) 10Joal: [C:03+1] "Thank you for the investigation and findings @btullis" [puppet] - 10https://gerrit.wikimedia.org/r/1090842 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) 
[07:59:27] (03CR) 10Stevemunene: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [08:00:04] * Hamishcz says hi [08:00:05] Amir1, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T0800). [08:00:05] Hamishcz and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:02] here [08:01:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [08:01:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [08:02:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330027 (10ops-monitoring-bot) Draining ganeti1021.eqiad.wmnet of running VMs [08:03:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [08:05:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [08:05:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330028 (10ops-monitoring-bot) Draining ganeti1021.eqiad.wmnet of running VMs [08:06:23] Hamishcz: Do you need help in deployment? 
[08:07:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [08:07:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [08:07:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10330029 (10ops-monitoring-bot) Draining ganeti1021.eqiad.wmnet of running VMs [08:08:09] kart_: what kind of help, for example? [08:08:25] FIRING: [2x] SystemdUnitFailed: man-db.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:12:41] kart_: maybe u mean, you can help me deploy my patch? [08:15:23] (03CR) 10Arnaudb: sre.mysql.sanitize-wiki: sanitize wiki cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [08:17:31] Hamishcz: yes. Do you want me to deploy? [08:17:46] ah yes, appreciate [08:17:50] :) [08:18:32] :) I'm sorry I misunderstood at first [08:18:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091912 (https://phabricator.wikimedia.org/T375054) (owner: 10Hamish) [08:19:31] (03Merged) 10jenkins-bot: bjnwikiquote: Add local logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091912 (https://phabricator.wikimedia.org/T375054) (owner: 10Hamish) [08:20:12] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091912|bjnwikiquote: Add local logo (T375054)]] [08:20:16] T375054: Requesting logo change for bjn.wikiquote.org - https://phabricator.wikimedia.org/T375054 [08:20:29] (03PS1) 10Slyngshede: Version 0.2.0. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1092184 [08:29:35] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:30:34] confirmed good on debug server [08:30:48] !log kartik@deploy2002 kartik, hamishz: Backport for [[gerrit:1091912|bjnwikiquote: Add local logo (T375054)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:30:52] T375054: Requesting logo change for bjn.wikiquote.org - https://phabricator.wikimedia.org/T375054 [08:31:09] Hamishcz: nice! [08:31:15] Hamishcz: going ahead.. [08:31:19] !log kartik@deploy2002 kartik, hamishz: Continuing with sync [08:33:42] (03PS2) 10Slyngshede: Version 0.1.0. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1092184 [08:37:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM. At this point we can stop building bitu-ldap for buster, it's still installed on mwmaint*, but no longer used since the functionalit" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1092184 (owner: 10Slyngshede) [08:37:56] (03CR) 10Slyngshede: [C:03+2] Version 0.1.0. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1092184 (owner: 10Slyngshede) [08:38:31] (03PS3) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) [08:39:34] (03Merged) 10jenkins-bot: Version 0.1.0. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1092184 (owner: 10Slyngshede) [08:40:02] (03PS4) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) [08:40:33] (03CR) 10Muehlenhoff: [C:03+2] Add two new Airflow LDAP groups to be considered for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1091735 (https://phabricator.wikimedia.org/T375729) (owner: 10Muehlenhoff) [08:40:51] (03CR) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [08:43:07] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091912|bjnwikiquote: Add local logo (T375054)]] (duration: 22m 55s) [08:43:11] T375054: Requesting logo change for bjn.wikiquote.org - https://phabricator.wikimedia.org/T375054 [08:44:13] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on registry1004.eqiad.wmnet with reason: testing [08:44:19] Hamishcz: Done! [08:44:25] I'm going with my patch.. [08:44:27] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on registry1004.eqiad.wmnet with reason: testing [08:44:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) (owner: 10KartikMistry) [08:45:04] but I cannot load the logo from my end, why? 
[08:45:06] https://bjn.wikiquote.org/wiki/Laman_Tatambaian [08:45:36] (03Merged) 10jenkins-bot: Enable the Contribute menu in 2nd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091932 (https://phabricator.wikimedia.org/T375300) (owner: 10KartikMistry) [08:45:52] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1091932|Enable the Contribute menu in 2nd group of Wikis (T375300)]] [08:45:56] T375300: Enable the Contribute menu in 2nd group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375300 [08:49:10] (03PS5) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) [08:49:35] !log kartik@deploy2002 kartik: Backport for [[gerrit:1091932|Enable the Contribute menu in 2nd group of Wikis (T375300)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:50:10] ah good now [08:50:15] maybe cache problem, [08:50:21] kart_: thanks! 
[08:52:54] (03PS1) 10Muehlenhoff: Add one more Airflow LDAP group to be considered for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1092186 (https://phabricator.wikimedia.org/T375729) [08:53:01] !log kartik@deploy2002 kartik: Continuing with sync [08:55:22] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 40850 [08:55:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40850 [08:57:10] (03PS6) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) [08:57:37] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091932|Enable the Contribute menu in 2nd group of Wikis (T375300)]] (duration: 11m 45s) [08:57:41] T375300: Enable the Contribute menu in 2nd group of wikis where translation experience is available on mobile - https://phabricator.wikimedia.org/T375300 [08:59:01] (03CR) 10Elukey: docker_registry_ha: limit /v2/_catalog to internal IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091597 (https://phabricator.wikimedia.org/T378618) (owner: 10Elukey) [09:05:09] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092186 (https://phabricator.wikimedia.org/T375729) (owner: 10Muehlenhoff) [09:12:13] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:45] (03CR) 10Elukey: [C:03+1] Drop Python support for 3.7, 3.8, add 3.11 (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [09:17:04] (03CR) 10Vgutierrez: [C:04-1] trafficserver: remove inbound TLS and related settings (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [09:17:59] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:18:13] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:18:18] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [09:18:18] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two typos inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/1090852 (owner: 10Slyngshede) [09:24:35] !log installing openssl security updates [09:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:12] (03CR) 10Volans: [V:03+2 C:03+2] "Force merging the next CR in the series fixes mypy" [software/cumin] - 10https://gerrit.wikimedia.org/r/1029209 (owner: 10Volans) [09:26:24] (03CR) 10Volans: [C:03+2] Use importlib.metadata instead of pkg_resources [software/cumin] - 10https://gerrit.wikimedia.org/r/1029210 (owner: 10Volans) [09:34:35] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 114, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:53] (03PS3) 10DCausse: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T376598) [09:34:53] (03PS1) 10DCausse: rdf-streaming-updater: produce rdf_change v2 events [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) [09:35:06] jouncebot: nowandnext [09:35:06] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [09:35:06] In 1 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1100) [09:35:12] (03CR) 10Nikerabbit: [C:03+1] Add new namespaces to hsb wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1090502 (https://phabricator.wikimedia.org/T373634) (owner: 10Srishakatux) [09:35:59] (03CR) 10DCausse: [C:04-1] "needs Ife016662f5fde835c21457ef457b567d9be61d2a to be fully deployed everywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092191 (https://phabricator.wikimedia.org/T374919) (owner: 10DCausse) [09:42:09] (03Merged) 10jenkins-bot: Use importlib.metadata instead of pkg_resources [software/cumin] - 10https://gerrit.wikimedia.org/r/1029210 (owner: 10Volans) [09:42:28] !log restarting nginx on acmechief hosts to pick up openssl updates [09:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:54] (03CR) 10Volans: [C:03+2] Add support for Python 3.12 [software/cumin] - 10https://gerrit.wikimedia.org/r/1090504 (owner: 10Volans) [09:43:22] (03PS2) 10Slyngshede: Prevalidation of permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1090852 [09:44:59] (03CR) 10DCausse: [C:03+2] rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T376598) (owner: 10DCausse) [09:45:00] (03PS1) 10Jelto: wikidata-query-gui: 
add querybuilder releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092192 (https://phabricator.wikimedia.org/T350793) [09:46:16] (03Merged) 10jenkins-bot: rdf-streaming-updater: bump to 0.3.150 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091306 (https://phabricator.wikimedia.org/T376598) (owner: 10DCausse) [09:47:33] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [09:47:59] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [09:48:11] (03PS2) 10Arnaudb: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [09:48:25] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:49:00] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:49:10] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1092180 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:49:25] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1092181 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:50:02] (03CR) 10Brouberol: [C:04-1] "You're missing the caching config in `hieradata/role/common/cache/text.yaml`" [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:50:09] (03CR) 10Btullis: [C:03+1] airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:50:47] (03CR) 10Btullis: [C:03+1] airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:51:07] (03CR) 10Btullis: [C:03+1] airflow-analytics-product: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1092180 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:51:10] (03PS1) 10Elukey: redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 [09:51:26] (03CR) 10Btullis: [C:03+1] airflow-analytics-product: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1092181 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [09:51:36] (03PS2) 10Elukey: redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 [09:53:36] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [09:54:38] (03CR) 10Slyngshede: [C:03+2] Prevalidation of permissions (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/1090852 (owner: 10Slyngshede) [09:55:18] (03PS1) 10Btullis: Add spark version 3.5.3 to production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) [09:57:08] (03Merged) 10jenkins-bot: Prevalidation of permissions [software/bitu] - 10https://gerrit.wikimedia.org/r/1090852 (owner: 10Slyngshede) [09:57:53] (03PS2) 10Stevemunene: airflow-analytics-product: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) [09:58:04] (03Merged) 10jenkins-bot: Add support for Python 3.12 [software/cumin] - 
10https://gerrit.wikimedia.org/r/1090504 (owner: 10Volans) [09:58:07] (03CR) 10Volans: [C:04-1] "Makes sense, needs a tweak because of old requests on bullseye." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [09:58:17] (03CR) 10Volans: [C:03+2] Integration tests: use linuxserver/openssh-server [software/cumin] - 10https://gerrit.wikimedia.org/r/1090505 (owner: 10Volans) [09:59:52] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for hiddenparma [puppet] - 10https://gerrit.wikimedia.org/r/1092195 (https://phabricator.wikimedia.org/T135991) [10:02:31] (03CR) 10JMeybohm: [C:03+1] wikidata-query-gui: add querybuilder releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092192 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:03:12] (03PS3) 10Elukey: redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 [10:03:35] (03CR) 10Elukey: redfish: add response logging for request() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [10:07:52] (03PS1) 10Muehlenhoff: Add Cumin alias for liberica [puppet] - 10https://gerrit.wikimedia.org/r/1092196 [10:10:29] (03PS3) 10Stevemunene: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) [10:10:29] (03PS3) 10Stevemunene: airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) [10:10:29] (03PS1) 10Stevemunene: airflow-analytics-product: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092197 (https://phabricator.wikimedia.org/T378443) [10:11:05] (03CR) 10Vgutierrez: [C:04-1] "please do not merge this till the applayer endpoint is ready:" [puppet] - 10https://gerrit.wikimedia.org/r/1092182 
(https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [10:13:16] (03CR) 10Vgutierrez: [C:03+1] "thx!" [puppet] - 10https://gerrit.wikimedia.org/r/1092196 (owner: 10Muehlenhoff) [10:13:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:13:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:14:06] (03CR) 10CI reject: [V:04-1] redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [10:14:14] (03Merged) 10jenkins-bot: Integration tests: use linuxserver/openssh-server [software/cumin] - 10https://gerrit.wikimedia.org/r/1090505 (owner: 10Volans) [10:14:46] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo [10:14:54] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:14:56] !log upgrade haproxy on cp-ulsfo (T379891) [10:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:00] T379891: Upgrade haproxy to 2.8.12 on cp hosts - https://phabricator.wikimedia.org/T379891 [10:15:05] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:16:57] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092197 (https://phabricator.wikimedia.org/T378443) (owner: 10Stevemunene) [10:17:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10330345 (10eoghan) a:03eoghan [10:21:27] (03CR) 10Lucas Werkmeister 
(WMDE): [C:03+1] "I don’t fully understand it, but IMHO it’s fine to try this out and revert if needed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092192 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:22:50] (03PS1) 10Volans: doc: don't fail on warning on readthedocs [software/cumin] - 10https://gerrit.wikimedia.org/r/1092199 [10:25:44] (03CR) 10Elukey: [C:03+1] doc: don't fail on warning on readthedocs [software/cumin] - 10https://gerrit.wikimedia.org/r/1092199 (owner: 10Volans) [10:26:17] (03CR) 10Brouberol: Add spark version 3.5.3 to production images (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [10:27:20] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:27:31] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:27:59] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add querybuilder releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092192 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:29:21] (03Merged) 10jenkins-bot: wikidata-query-gui: add querybuilder releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092192 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:33:02] (03CR) 10Elukey: redfish: add response logging for request() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [10:33:16] (03CR) 10Brouberol: Add spark version 3.5.3 to production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [10:35:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host, interfaces up: 70, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:35:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:36:53] (03CR) 10Btullis: Add spark version 3.5.3 to production images (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [10:37:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:38:58] (03CR) 10Btullis: Add spark version 3.5.3 to production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [10:39:42] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:41:16] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [10:41:26] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [10:41:38] (03CR) 10Volans: "LGTM, just run `tox -e py3-format` to fix CI" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [10:41:48] (03CR) 10Volans: [C:03+2] doc: don't fail on warning on readthedocs [software/cumin] - 10https://gerrit.wikimedia.org/r/1092199 (owner: 10Volans) [10:43:12] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:43:23] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:43:31] (03CR) 10FNegri: [C:03+2] "TIL! 
I didn't know about `keep_firing_for`, it looks like it's mostly designed for flapping alerts, I wonder if setting it to "24h" could " [alerts] - 10https://gerrit.wikimedia.org/r/1088585 (https://phabricator.wikimedia.org/T379378) (owner: 10FNegri) [10:45:10] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:45:20] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:46:31] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [10:46:45] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [10:47:03] (03CR) 10Brouberol: Add spark version 3.5.3 to production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [10:49:50] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:50:00] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:50:18] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:50:28] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:55:50] (03CR) 10Vgutierrez: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [10:56:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - 
https://phabricator.wikimedia.org/T370452#10330511 (10elukey) My bad, I misremembered that we got the firmware for config J from Supermicro already (somehow I thought it was for the ganeti nodes,... [10:57:47] (03Merged) 10jenkins-bot: doc: don't fail on warning on readthedocs [software/cumin] - 10https://gerrit.wikimedia.org/r/1092199 (owner: 10Volans) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1100) [11:04:38] (03PS1) 10Aklapper: phabricator weekly changes email: Sort newcomers by claim date [puppet] - 10https://gerrit.wikimedia.org/r/1092205 [11:04:48] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Allow other input and changes to trigger searchsuggestions to update" [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091605 (https://phabricator.wikimedia.org/T379983) (owner: 10Samtar) [11:09:47] (03PS6) 10Fabfur: haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) [11:12:54] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:14:21] (03PS2) 10Btullis: Add spark version 3.5.3 to production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) [11:14:54] (03CR) 10Btullis: Add spark version 3.5.3 to production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [11:16:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:18:05] (03CR) 10Fabfur: haproxykafka: working on TLS client authentication to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1090915 
(https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur) [11:18:42] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: create user kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1092180 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:20:03] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092197 (https://phabricator.wikimedia.org/T378443) (owner: 10Stevemunene) [11:21:16] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:21:21] (03CR) 10Btullis: [V:03+1 C:03+2] Enable deletion of unused segments on the druid-analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1090842 (https://phabricator.wikimedia.org/T376118) (owner: 10Btullis) [11:23:15] (03PS2) 10Aqu: EventStreamConfig: Enable Hive Ingestion for most streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [11:23:51] (03Merged) 10jenkins-bot: airflow-analytics-product: define namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092197 (https://phabricator.wikimedia.org/T378443) (owner: 10Stevemunene) [11:24:32] (03CR) 10Aqu: [C:03+1] "I've activated canary events for some streams." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [11:25:19] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:25:30] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:25:51] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:30:00] (03Merged) 10jenkins-bot: airflow-analytics-product: register namespace in ceph-csi and cloudnative-pg operator configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091199 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:33:19] !log btullis@cumin1002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[11:36:23] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1092181 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:38:16] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:38:30] (03CR) 10Muehlenhoff: [C:03+2] Add Cumin alias for liberica [puppet] - 10https://gerrit.wikimedia.org/r/1092196 (owner: 10Muehlenhoff) [11:39:02] (03CR) 10Muehlenhoff: [C:03+2] Add one more Airflow LDAP group to be considered for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1092186 (https://phabricator.wikimedia.org/T375729) (owner: 10Muehlenhoff) [11:39:27] (03Merged) 10jenkins-bot: airflow-analytics-product: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091200 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [11:40:59] !log mwmaint2002: Run `extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php` at `testwiki` for a bunch of pages (P71064 is list of commands executed; T378983) [11:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:03] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:41:04] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [11:41:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2216.codfw.wmnet with reason: T380131 - table corruption [11:41:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2216.codfw.wmnet with reason: T380131 - table corruption [11:41:31] T380131: Corrupt index on db2216 - https://phabricator.wikimedia.org/T380131 [11:41:32] !log 
elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [11:41:52] RECOVERY - MariaDB Replica SQL: s1 #page on db2216 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:43:11] (03PS1) 10Btullis: Add the thirdparty/bigtop15 component to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1092210 (https://phabricator.wikimedia.org/T378954) [11:43:40] OK to deploy ml-service ie recommendation-api? [11:44:00] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4543/console" [puppet] - 10https://gerrit.wikimedia.org/r/1092210 (https://phabricator.wikimedia.org/T378954) (owner: 10Btullis) [11:45:47] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:45:59] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:47:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [11:54:57] I'll wait till current window is over.. [11:56:05] jouncebot: next [11:56:05] In 2 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1400) [11:58:22] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:58:55] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:59:14] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[11:59:40] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:00:38] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:02:14] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:03:06] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: create OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1092181 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [12:06:59] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10330858 (10elukey) Ok I found the issue, I asked Jenn to turn off IPv6 last week for the BMC network to test if that was the issue, but it was before upg... [12:07:09] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10330860 (10elukey) @Jclark-ctr I updated the firmware to the correct one, but I'd need the BMC label password in pvt when you are in the DC (it is needed... [12:08:25] FIRING: [2x] SystemdUnitFailed: man-db.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:36] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1018.eqiad.wmnet with OS bullseye [12:08:45] elukey: I'll be deploying recommendation-api-ng now.. 
[12:09:10] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:10:02] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [12:10:38] (03CR) 10KartikMistry: [C:03+2] Update recommendation api to 2024-11-13-183159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379592) (owner: 10KartikMistry) [12:11:44] (03Merged) 10jenkins-bot: Update recommendation api to 2024-11-13-183159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089964 (https://phabricator.wikimedia.org/T379592) (owner: 10KartikMistry) [12:11:48] kart_: yes yes go ahead! [12:11:59] Thanks! [12:12:07] I think there is no policy for it, just ping the ml-team on their chan for notification [12:12:22] sure. noted! [12:12:26] ty :) [12:13:06] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:13:23] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-ulsfo [12:14:33] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:15:46] !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [12:17:12] 06SRE, 06collaboration-services: gitlab runners don't have the apt.wikimedia.org key - https://phabricator.wikimedia.org/T380164#10330906 (10MatthewVernon) [12:19:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:19:50] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092223 [12:21:33] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1018.eqiad.wmnet with OS bullseye [12:22:07] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:22:11] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1018.eqiad.wmnet with OS bullseye [12:24:24] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:29:55] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10330937 (10BTullis) 05Open→03Resolved I think that this is fixed now. I'm able to reimage an-presto1018 and connect to a SOL session, so... [12:32:47] (03CR) 10Stevemunene: "endpoint is ready" [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [12:36:17] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:36:19] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10330969 (10Clement_Goubert) p:05Triage→03Medium a:03Jclark-ctr [12:36:19] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db2150 slowly with 10 steps - slow repool db2150 T380117 [12:36:23] T380117: Corrupt index on db2150 - https://phabricator.wikimedia.org/T380117 [12:37:19] !log Updated recommendation api to 2024-11-13-183159-production (T379592, T379037) [12:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:23] T379592: Unable to deploy new version of recommendation-api to production due to connectivity issues - https://phabricator.wikimedia.org/T379592 [12:37:23] T379037: Implement batching for 
collections data - https://phabricator.wikimedia.org/T379037 [12:38:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10330973 (10BTullis) Maybe I spoke too soon. I've had this error twice now, suggesting a failure to pull the boot image with TFTP, or similar.... [12:38:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:45] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-presto1018.eqiad.wmnet with OS bullseye [12:39:16] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1018.eqiad.wmnet with OS bullseye [12:40:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10330996 (10BTullis) Trying the reimage again with the note from https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_docum... [12:42:21] (03PS1) 10Jelto: wikidata-query-gui: update readiness_probe for querybuilder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092232 (https://phabricator.wikimedia.org/T350793) [12:48:02] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: fullstack: use git clone instead of direct fetch [puppet] - 10https://gerrit.wikimedia.org/r/1092233 (https://phabricator.wikimedia.org/T379356) [12:48:23] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092233 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [12:49:41] jouncebot: nowandnext [12:49:41] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [12:49:41] In 1 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1400) [12:50:16] Anyone mind if I use the open window to do a deploy on k8s? 
[12:53:33] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova: fullstack: use git clone instead of direct fetch [puppet] - 10https://gerrit.wikimedia.org/r/1092233 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [12:54:03] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1018.eqiad.wmnet with reason: host reimage [12:55:00] (03CR) 10Brouberol: [C:03+1] Add spark version 3.5.3 to production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [12:55:56] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: update readiness_probe for querybuilder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092232 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:56:52] (03CR) 10Brouberol: [C:03+1] "Very good news!" [puppet] - 10https://gerrit.wikimedia.org/r/1092210 (https://phabricator.wikimedia.org/T378954) (owner: 10Btullis) [12:56:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1018.eqiad.wmnet with reason: host reimage [12:57:04] (03Merged) 10jenkins-bot: wikidata-query-gui: update readiness_probe for querybuilder [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092232 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:57:13] (03CR) 10Brouberol: [C:03+1] airflow-analytics-product: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [12:58:33] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow: define the webserver.base_url configuration [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) (owner: 10Brouberol) [13:00:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1092210 (https://phabricator.wikimedia.org/T378954) (owner: 10Btullis) 
[13:01:58] !log removing ganeti1021 from active Ganeti nodes T378921 [13:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:02] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [13:03:07] (03CR) 10Btullis: [V:03+2 C:03+2] Add spark version 3.5.3 to production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [13:03:49] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:04:00] (03CR) 10Btullis: [V:03+1 C:03+2] Add the thirdparty/bigtop15 component to bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1092210 (https://phabricator.wikimedia.org/T378954) (owner: 10Btullis) [13:04:10] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:04:26] (03PS1) 10Muehlenhoff: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1092236 [13:05:13] PROBLEM - ganeti-noded running on ganeti1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [13:05:13] PROBLEM - ganeti-confd running on ganeti1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [13:06:27] (03CR) 10Stevemunene: [C:03+2] airflow-analytics-product: create ATS mapping and caching config [puppet] - 10https://gerrit.wikimedia.org/r/1092182 (https://phabricator.wikimedia.org/T378440) (owner: 10Stevemunene) [13:07:04] FIRING: ProbeDown: Service ganeti1021:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:29] (03CR) 10Muehlenhoff: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092236 (owner: 10Muehlenhoff) [13:13:33] 06SRE-OnFire, 06SRE Observability: Harden corto systemd service - https://phabricator.wikimedia.org/T372437#10331077 (10lmata) [13:16:01] !log mwmaint2002: Run `extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php` at `testwiki` for a bunch of pages (P71064 is list of commands executed; T378983) [13:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983 [13:20:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1018.eqiad.wmnet with OS bullseye [13:24:33] (03PS1) 10Effie Mouzeli: memcached: add mc-gp100[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092243 (https://phabricator.wikimedia.org/T377033) [13:25:40] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:25:56] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:25:58] (03CR) 10Stevemunene: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1091654 (https://phabricator.wikimedia.org/T379267) (owner: 10Brouberol) [13:26:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): an-presto1018.eqiad.wmnet: DRAC is down - https://phabricator.wikimedia.org/T378854#10331151 (10BTullis) That worked, so we're all good. 
[13:26:26] !log stopping netbox service on netbox-next test server to restore new database backup from production [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:47] (03CR) 10Effie Mouzeli: [C:03+1] chromium-render: Add cli flag to avoid flooding with crashpad processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) (owner: 10Jgiannelos) [13:26:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: fullstack: file link depends on git clone [puppet] - 10https://gerrit.wikimedia.org/r/1092244 (https://phabricator.wikimedia.org/T379356) [13:27:27] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [13:27:46] (03PS2) 10Effie Mouzeli: memcached: add mc-gp100[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092243 (https://phabricator.wikimedia.org/T377033) [13:27:50] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092244 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [13:27:57] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092243 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [13:28:25] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089701 (owner: 10PipelineBot) [13:28:48] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:28:57] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:29:26] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1089701 (owner: 10PipelineBot) [13:30:39] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [13:31:04] !log mvolz@deploy2002 helmfile 
[staging] DONE helmfile.d/services/citoid: apply [13:31:17] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:31:27] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:31:57] (03CR) 10Muehlenhoff: [C:03+2] Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1092236 (owner: 10Muehlenhoff) [13:33:34] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [13:34:07] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [13:34:22] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova: fullstack: file link depends on git clone [puppet] - 10https://gerrit.wikimedia.org/r/1092244 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [13:34:44] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [13:35:09] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [13:35:26] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:35:34] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:37:23] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:37:31] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:39:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:39:46] (03PS1) 10Jelto: wikidata-query-gui: fix namespace typo in gateway and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092247 (https://phabricator.wikimedia.org/T350793) [13:40:39] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova: fullstack: subscribe 
service to git clone [puppet] - 10https://gerrit.wikimedia.org/r/1092248 (https://phabricator.wikimedia.org/T379356) [13:40:58] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092248 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [13:42:13] (03CR) 10JMeybohm: [C:03+1] wikidata-query-gui: fix namespace typo in gateway and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092247 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:42:20] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: fix namespace typo in gateway and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092247 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:42:42] (03PS1) 10Muehlenhoff: Update site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1092250 (https://phabricator.wikimedia.org/T378921) [13:43:43] (03Merged) 10jenkins-bot: wikidata-query-gui: fix namespace typo in gateway and service name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092247 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:44:55] (03CR) 10Effie Mouzeli: [C:03+1] debug.json: add support for mwdebug-next [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [13:45:26] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: nova: fullstack: subscribe service to git clone [puppet] - 10https://gerrit.wikimedia.org/r/1092248 (https://phabricator.wikimedia.org/T379356) (owner: 10Arturo Borrero Gonzalez) [13:45:45] (03PS1) 10Ssingh: Revert^2 "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1092252 [13:46:36] (03CR) 10Clément Goubert: [C:03+1] memcached: add mc-gp100[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092243 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [13:46:48] (03CR) 10Effie Mouzeli: [C:03+2] 
memcached: add mc-gp100[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092243 (https://phabricator.wikimedia.org/T377033) (owner: 10Effie Mouzeli) [13:46:54] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:46:57] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:47:51] (03CR) 10Ssingh: [C:03+2] Revert^2 "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1092252 (owner: 10Ssingh) [13:47:57] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Upgrade IDPs to CAS 6.6/Bullseye and enable webauthn - https://phabricator.wikimedia.org/T305518#10331270 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This old task can be closed, the update to CAS 6.6 was resolved with T311235 and th... [13:48:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:48:28] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [13:49:09] RESOLVED: ProbeDown: Service ganeti1021:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:49:24] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [13:49:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1082726 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:49:52] !log 
jelto@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [13:50:32] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [13:54:05] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Adapt WMF theming for webauthn - https://phabricator.wikimedia.org/T380172 (10MoritzMuehlenhoff) 03NEW [13:54:18] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Adapt WMF theming for webauthn - https://phabricator.wikimedia.org/T380172#10331304 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:56:33] (03PS2) 10Ssingh: trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 [13:57:11] (03CR) 10Ssingh: trafficserver: remove inbound TLS and related settings (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [13:58:14] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4544/co" [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [13:58:15] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Select data store for webauthn devices - https://phabricator.wikimedia.org/T380173 (10MoritzMuehlenhoff) 03NEW [13:58:20] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Select data store for webauthn devices - https://phabricator.wikimedia.org/T380173#10331325 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:02:05] (03PS5) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [14:02:05] (03PS5) 10Volans: mysql_legacy: rename to mysql [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [14:02:06] (03PS2) 10Volans: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 [14:02:06] (03PS1) 10Volans: mysql_legacy: improve DRY-RUN support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092253 [14:04:36] !log jiji@cumin1002 
START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [14:09:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. [14:09:17] (03PS3) 10Volans: sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 [14:09:18] (03PS2) 10Volans: Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 [14:10:44] (03CR) 10Xcollazo: "I didn't see Iceberg being put in the `/jars` folder of this Spark distribution?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092194 (https://phabricator.wikimedia.org/T380035) (owner: 10Btullis) [14:11:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [14:11:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:13:25] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:10] (03CR) 10CI reject: [V:04-1] Adapt to new Spicerack API renaming mysql_legacy [cookbooks] - 10https://gerrit.wikimedia.org/r/1087861 (owner: 10Volans) [14:15:21] bgp issues are probably me putting a k8s node into failed [14:15:33] FIRING: KubernetesCalicoDown: wikikube-worker1306.eqiad.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1306.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:15:33] (03CR) 10CI reject: [V:04-1] sre.switchdc.databases: use mysql native methods [cookbooks] - 10https://gerrit.wikimedia.org/r/1087860 (owner: 10Volans) [14:15:55] (03CR) 10Arnaudb: [C:03+1] "found a typo, otherwise lgtm!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092253 (owner: 10Volans) [14:16:34] !log running homer 'cr*-eqiad' 'T379454' [14:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:39] T379454: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454 [14:18:25] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1306:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1306 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:19:17] (03PS4) 10Effie Mouzeli: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:20:31] (03CR) 10Effie Mouzeli: "missing chart bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:27:12] (03PS1) 10KartikMistry: Enable the Contribute menu in 3rd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) 
[14:27:40] (03PS1) 10Peter Fischer: CirrusSearch: enable offloading weighted tags via EventBus for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) [14:27:56] jouncebot: now [14:27:56] For the next 0 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1400) [14:28:04] did it not announce the beginning of the backport window? o_O [14:28:25] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:25] (03CR) 10CI reject: [V:04-1] Enable the Contribute menu in 3rd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) (owner: 10KartikMistry) [14:28:32] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1305-1312].eqiad.wmnet [14:28:45] anyway… if it’s okay with everyone else, I’d quite like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1091605 (cc TheresNoTime, Jdlrobson) [14:29:02] I could reproduce the issue, so I’d be comfortable testing it myself [14:29:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [14:29:06] (03PS2) 10Volans: mysql_legacy: improve DRY-RUN support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092253 [14:29:06] (03PS6) 10Volans: mysql: remove unused module [software/spicerack] - 10https://gerrit.wikimedia.org/r/1087855 [14:29:06] (03PS6) 10Volans: mysql_legacy: rename to mysql 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1087856 [14:29:07] (03PS3) 10Volans: mysql: make fetch_one_row return always a dict [software/spicerack] - 10https://gerrit.wikimedia.org/r/1091278 [14:29:07] and it sounds like users are getting antsy about it [14:30:00] (03PS1) 10Sbisson: Unified dashboard: Add UI for page collection recommendations [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092259 (https://phabricator.wikimedia.org/T368718) [14:30:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091605 (https://phabricator.wikimedia.org/T379983) (owner: 10Samtar) [14:30:29] I’ll go ahead and start the scap, there’s plenty of time during gate-and-submit if anyone wants to stop me :) [14:31:11] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Select optin method for webauthn - https://phabricator.wikimedia.org/T380178 (10MoritzMuehlenhoff) 03NEW [14:31:21] jouncebot: next [14:31:21] In 1 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1630) [14:31:26] ok, good, there’s a break after this window [14:31:33] because the gate-and-submit might not finish in time otherwise :| [14:32:15] (03PS2) 10KartikMistry: Enable the Contribute menu in 3rd group of Wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092257 (https://phabricator.wikimedia.org/T375301) [14:32:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1305-1312].eqiad.wmnet [14:32:51] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Select opt-in method for webauthn - https://phabricator.wikimedia.org/T380178#10331471 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:32:55] * Lucas_WMDE peeks at jouncebot’s logs [14:33:50] well, it says “Deploy timer kicked. 
Attempting to notify.” [14:33:53] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Evaluate supported for trusted devices - https://phabricator.wikimedia.org/T380179 (10MoritzMuehlenhoff) 03NEW [14:33:54] at 14:00 UTC [14:35:54] I also want to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1092259 :D [14:36:08] oh dear [14:36:17] i18n changes make for a very slow backport :/ [14:36:32] Yeah, but it is some unbreak change :/ [14:36:41] 'UBN' :D [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:48] > Medium [14:36:49] o_O [14:37:09] * Lucas_WMDE looks at CI time of other CX changes [14:37:37] 18 minutes [14:37:45] I don’t think we can fit that in before the core backport, then [14:38:03] I guess we can still do it out-of-window before the portals update… [14:38:19] I can wait, no issue. Dinner on the desk! [14:38:48] We need to check if portal updates are happening. 
[14:40:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10331504 (10Jhancock.wm) [14:40:57] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180 (10MoritzMuehlenhoff) 03NEW [14:41:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10331515 (10Jhancock.wm) [14:42:37] filed T380181 for jouncebot’s issue FTR [14:42:38] T380181: jouncebot did not announce 2024-11-18 UTC afternoon backport window for no apparent reason - https://phabricator.wikimedia.org/T380181 [14:43:54] RECOVERY - MariaDB Replica Lag: s1 #page on db2216 is OK: OK slave_sql_lag Replication lag: 5.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:44:09] welcome back db2216 [14:47:56] (03CR) 10Volans: [C:03+2] mysql_legacy: improve DRY-RUN support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092253 (owner: 10Volans) [14:49:37] 10ops-eqiad, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182 (10phaultfinder) 03NEW [14:51:03] Lucas_WMDE: we need to wait till core change is merged, right? Can I do +2 to my change after that or wait till deployment is over? 
[14:51:22] kart_: you can +2 it once the deployment for the core change has properly started, I’d say [14:51:37] (03CR) 10Urbanecm: [C:04-1] CirrusSearch: enable offloading weighted tags via EventBus for testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [14:51:38] if you +2 it now there’s a risk it’ll merge before the core change, and then get included in that deployment, which we don’t want [14:51:42] (03PS4) 10Elukey: redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 [14:52:07] (03CR) 10Elukey: redfish: add response logging for request() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [14:52:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2150 slowly with 10 steps - slow repool db2150 T380117 [14:52:20] T380117: Corrupt index on db2150 - https://phabricator.wikimedia.org/T380117 [14:52:53] Right Lucas_WMDE [14:53:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091197 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [14:53:29] Lucas_WMDE: Please ping me when it starts.. I would `git fetch dinner` meanwhile.. 
[14:53:45] okay :) [14:54:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10331550 (10phaultfinder) [14:56:14] (03CR) 10Volans: [C:03+1] "LGTM, possible idea inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [14:56:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.pool db2216 slowly with 10 steps - slow motion repool T380131 [14:56:46] T380131: Corrupt index on db2216 - https://phabricator.wikimedia.org/T380131 [14:56:47] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db2216 slowly with 10 steps - slow motion repool T380131 [14:57:52] (03Merged) 10jenkins-bot: mysql_legacy: improve DRY-RUN support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092253 (owner: 10Volans) [14:59:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'manual repool commit', diff saved to https://phabricator.wikimedia.org/P71076 and previous config saved to /var/cache/conftool/dbconfig/20241118-145946-arnaudb.json [15:00:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'manual depool commit', diff saved to https://phabricator.wikimedia.org/P71077 and previous config saved to /var/cache/conftool/dbconfig/20241118-150020-arnaudb.json [15:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:13] (03Merged) 10jenkins-bot: Revert "Allow other input and changes to trigger searchsuggestions to update" [core] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1091605 (https://phabricator.wikimedia.org/T379983) (owner: 10Samtar) [15:03:25] FIRING: [7x] SystemdUnitFailed: confd_prometheus_metrics.service on wikikube-worker1306:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:30] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1091605|Revert "Allow other input and changes to trigger searchsuggestions to update" (T379983)]] [15:03:35] T379983: RangeError: Maximum call stack size exceeded in mediawiki.searchSuggest - https://phabricator.wikimedia.org/T379983 [15:05:32] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Evaluate supported for trusted devices - https://phabricator.wikimedia.org/T380179#10331597 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:05:44] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Registry of multiple webauthn devices - https://phabricator.wikimedia.org/T380180#10331598 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:06:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "+2ing ahead of deployment" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092259 (https://phabricator.wikimedia.org/T368718) (owner: 10Sbisson) [15:06:24] kart_: ^ fyi [15:06:32] (core deployment still ongoing) [15:06:38] !log lucaswerkmeister-wmde@deploy2002 samtar, lucaswerkmeister-wmde: Backport for [[gerrit:1091605|Revert "Allow other input and changes to trigger searchsuggestions to update" (T379983)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:06:42] testing… [15:06:44] Thanks! 
[15:07:03] !log lucaswerkmeister-wmde@deploy2002 samtar, lucaswerkmeister-wmde: Continuing with sync [15:07:10] yup, fixes the weird search arrow key issue at least [15:09:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10331608 (10phaultfinder) [15:11:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091605|Revert "Allow other input and changes to trigger searchsuggestions to update" (T379983)]] (duration: 08m 14s) [15:11:57] T379983: RangeError: Maximum call stack size exceeded in mediawiki.searchSuggest - https://phabricator.wikimedia.org/T379983 [15:13:27] (03CR) 10TChin: [C:03+2] EventStreamConfig: Enable Hive Ingestion for most streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [15:14:55] (03Merged) 10jenkins-bot: EventStreamConfig: Enable Hive Ingestion for most streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [15:16:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092259 (https://phabricator.wikimedia.org/T368718) (owner: 10Sbisson) [15:17:30] kart_: ^ fyi [15:17:43] scap backport is running now (and waiting for the merge) [15:18:31] though I don’t know what happens to TChin’s config change above… [15:19:04] (I don’t see any other scap locks being held, at least) [15:21:37] Nice! [15:22:25] (03CR) 10Hnowlan: [C:03+1] "Oops, didn't realise I hadn't +1ed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) (owner: 10Jgiannelos) [15:23:00] tchin: is it okay to deploy your EventStreamConfig change? 
[15:23:14] because IIUC, it will be included in my ongoing backport (unless it gets reverted in the meantime) [15:24:11] (03PS1) 10Hnowlan: team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) [15:26:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1305.eqiad.wmnet with OS bookworm [15:26:55] (03CR) 10Lucas Werkmeister (WMDE): "Note: if I’m not mistaken, I’m about to deploy this as part of the backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [15:27:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1306.eqiad.wmnet with OS bookworm [15:28:39] CI almost finished, apparently [15:28:47] (03CR) 10Clément Goubert: [C:03+1] team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:29:04] (03Merged) 10jenkins-bot: Unified dashboard: Add UI for page collection recommendations [extensions/ContentTranslation] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092259 (https://phabricator.wikimedia.org/T368718) (owner: 10Sbisson) [15:29:27] “The following are unexpected commits pulled from origin for /srv/mediawiki-staging” [15:29:28] there it is [15:29:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1307.eqiad.wmnet with OS bookworm [15:29:55] I guess I’ll go ahead with that in a moment if I don’t hear anything else [15:30:15] ah. 
[15:30:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1308.eqiad.wmnet with OS bookworm [15:30:59] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1092259|Unified dashboard: Add UI for page collection recommendations (T368718)]] [15:31:13] T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718 [15:31:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1309.eqiad.wmnet with OS bookworm [15:31:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1310.eqiad.wmnet with OS bookworm [15:33:18] (03CR) 10Scott French: team-sre: add thumbor alert for pods with high error rates (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:33:19] PROBLEM - BGP status on lsw1-e5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:17] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:17] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:15] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:02] that's my reimages, no worries [15:36:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1311.eqiad.wmnet with OS bookworm [15:36:51] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1312.eqiad.wmnet with OS bookworm [15:39:19] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:45] Hi, I can test gerrit:1092259 when it's on a test server, let me know [15:39:51] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:40:16] Lucas_WMDE: still syncing to mwdebugs? :/ [15:40:36] yup [15:40:50] i18n changes mean a big image diff, IIUC [15:40:52] (03PS1) 10Ssingh: Revert^3 "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1092267 [15:41:20] (03PS2) 10Hnowlan: team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) [15:41:33] (03CR) 10Hnowlan: team-sre: add thumbor alert for pods with high error rates (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:41:45] (03CR) 10Ssingh: [C:03+2] Revert^3 "cp7001: temporarily set check_min_fe_mem to true" [puppet] - 10https://gerrit.wikimedia.org/r/1092267 (owner: 10Ssingh) [15:41:54] 06SRE, 10Observability-Alerting, 06Traffic: PuppetFailure alert is not being fired for host(s) where agent has failed - https://phabricator.wikimedia.org/T379807#10331748 (10ssingh) 05Open→03Resolved a:03ssingh ` 10:38:48 < jinxer-wm> FIRING: PuppetZeroResources: Puppet has failed generate resource... 
[15:42:50] (03CR) 10Effie Mouzeli: team-sre: add thumbor alert for pods with high error rates (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:42:59] 06SRE, 10Bitu, 06Infrastructure-Foundations: Allow to provide links for Bitu permissions - https://phabricator.wikimedia.org/T379926#10331754 (10SLyngshede-WMF) p:05Triage→03Low [15:45:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [15:45:37] !log lucaswerkmeister-wmde@deploy2002 sbisson, lucaswerkmeister-wmde: Backport for [[gerrit:1092259|Unified dashboard: Add UI for page collection recommendations (T368718)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:45:43] T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718 [15:45:51] kart_ / stephanebisson: please test :) [15:46:19] Which server do I pick in the browser extension? 
[15:46:25] (03CR) 10Elukey: [C:03+2] redfish: add response logging for request() (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [15:46:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [15:47:09] (03CR) 10Ahmon Dancy: [C:03+1] debug.json: add support for mwdebug-next [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1076848 (https://phabricator.wikimedia.org/T372605) (owner: 10Scott French) [15:47:14] stephanebisson: You can pick mwdebug1001/1002/2001/2002 either of these [15:47:50] kart_ Lucas_WMDE Working fine AFAICT [15:47:57] you should be testing on k8s actually [15:48:10] yeah, k8s-mwdebug is the one to pick most of the time [15:48:15] mwdebugs are going away in the near-ish future [15:48:31] (though changes still get deployed to them at the moment, so you *can* also test there IIUC) [15:48:32] and 100% of client-facing prod is on k8s [15:48:39] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [15:48:39] (03CR) 10Scott French: [C:03+1] team-sre: add thumbor alert for pods with high error rates (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:48:51] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10331780 (10cmooney) p:05Triage→03Medium [15:48:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1305.eqiad.wmnet with reason: host reimage [15:48:53] yes, they are still scap targets and get the new code, so it will work [15:49:03] !log lucaswerkmeister-wmde@deploy2002 sbisson, lucaswerkmeister-wmde: Continuing with sync [15:49:08] anyway, I’ll continue [15:49:11] stephanebisson: cool! 
[15:49:11] it’ll take long enough [15:49:11] (03CR) 10Effie Mouzeli: [C:03+1] team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:49:13] jouncebot: next [15:49:13] In 0 hour(s) and 40 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1630) [15:49:18] (03CR) 10TChin: [C:03+2] "That's perfectly fine, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1089967 (https://phabricator.wikimedia.org/T369845) (owner: 10TChin) [15:49:20] ah ok, it’s at half past not at the full hour [15:49:22] should finish in time then [15:49:53] claime: That's new info :) Please let wikitech-l know also! [15:49:53] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [15:50:36] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [15:51:00] (03CR) 10Hnowlan: team-sre: add thumbor alert for pods with high error rates (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:51:10] (03CR) 10Hnowlan: [C:03+2] team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:51:27] kart_: I could have sworn we'd sent out an email about mwdebug targets but apparently not since we went to 1% of global traffic... 
[15:51:41] https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/2DXHPFD22DUO2EWNL6AVMYF74VPDBYQM/ was the most recent relevant email I found [15:51:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [15:51:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1307.eqiad.wmnet with reason: host reimage [15:51:52] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Available_backends also sounds more outdated than I realized :/ [15:52:18] About the end of mwdebugs we are not really at the announcement stage yet but it'll come, we'll make an announcement in due time [15:52:21] (03Merged) 10jenkins-bot: team-sre: add thumbor alert for pods with high error rates [alerts] - 10https://gerrit.wikimedia.org/r/1092265 (https://phabricator.wikimedia.org/T379559) (owner: 10Hnowlan) [15:52:24] hi folks, i have some puppet config and MW config changes - all beta-only - that would ideally be deployed around the same time: https://gerrit.wikimedia.org/r/q/bug:T379811 is that possible, or should i just schedule them for their separate windows? [15:52:25] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [15:52:35] claime: Thanks! [15:52:57] MatmaRex: `scap backport` should handle beta-only changes efficiently [15:53:00] there will be an announcement for mwdebug-next soon, I think we can group all mwdebug target info in there [15:53:48] Lucas_WMDE: I'll update the available backend section after my meeting, thanks for pointing it out [15:54:01] sounds good, thanks! 
[15:54:31] (I’d try it myself but I have no idea if the other WikimediaDebug features work on k8s by now or not, so happy to leave that to you) [15:54:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1309.eqiad.wmnet with reason: host reimage [15:54:42] kart_: deployment is ongoing fyi (53% rn) [15:55:11] dancy: puppet too? [15:55:13] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [15:55:42] MatmaRex: Ah, didn't realize you were referencing puppet changes. Disregard. :-) [15:56:02] Lucas_WMDE: noted! [15:56:02] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [15:56:25] Lucas_WMDE: xhgui, excimer etc. worj [15:56:29] work* [15:56:37] nice [15:56:38] I have to check the verbose logging one [15:57:57] (03Merged) 10jenkins-bot: redfish: add response logging for request() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092193 (owner: 10Elukey) [15:58:06] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1308.eqiad.wmnet with reason: host reimage [15:58:16] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092259|Unified dashboard: Add UI for page collection recommendations (T368718)]] (duration: 27m 17s) [15:58:19] T368718: Community-defined Translation Collections: Single selection mode UI - https://phabricator.wikimedia.org/T368718 [15:58:23] * Lucas_WMDE done deploying [15:58:35] !log UTC afternoon backport+config window done [15:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:04] Thanks a lot Lucas_WMDE! 
[16:01:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1306.eqiad.wmnet with reason: host reimage [16:01:41] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading: Unable to obtain exclusive write permission. Someone else is doing something with this file. - https://phabricator.wikimedia.org/T379234#10331888 (10Aklapper) [16:03:49] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10331908 (10RobH) [16:04:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1312.eqiad.wmnet with reason: host reimage [16:06:53] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet [16:07:08] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1310.eqiad.wmnet with reason: host reimage [16:07:25] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.16.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092278 [16:08:35] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1305.eqiad.wmnet with OS bookworm [16:10:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092250 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [16:10:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1307.eqiad.wmnet with OS bookworm [16:11:13] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.16.2 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1092278 (owner: 10Volans) [16:11:31] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1311.eqiad.wmnet with reason: host reimage [16:12:32] (03CR) 10Muehlenhoff: [C:03+2] Update 
site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1092250 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [16:12:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet [16:13:25] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:14:02] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1309.eqiad.wmnet with OS bookworm [16:16:01] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:16:25] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:16:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1308.eqiad.wmnet with OS bookworm [16:17:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:07] RECOVERY - BGP status on lsw1-e5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:23] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:41] RECOVERY - 
BGP status on cr1-eqiad is OK: BGP OK - up: 670, down: 7, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:45] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 713, down: 13, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:18:46] (03CR) 10Bking: "Re: link to conversation, it's in #wikimedia-k8s-sig IRC channel. Exact quote: " IIRC gets you two things basically, ProbeDown alerts for " [puppet] - 10https://gerrit.wikimedia.org/r/1090977 (https://phabricator.wikimedia.org/T365659) (owner: 10Bking) [16:18:48] (03PS1) 10Effie Mouzeli: memcached: add mc-gp100[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092280 (https://phabricator.wikimedia.org/T377033) [16:19:01] PROBLEM - BGP status on lsw1-f7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:19:19] RECOVERY - Disk space on wikikube-worker1306 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=wikikube-worker1306&var-datasource=eqiad+prometheus/ops [16:19:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1306.eqiad.wmnet with OS bookworm [16:19:51] (03PS1) 10Volans: Upstream release v8.16.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1092281 [16:20:01] RECOVERY - BGP status on lsw1-f7-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:20:03] (03CR) 10Volans: [C:03+2] Upstream release v8.16.2 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1092281 (owner: 10Volans) [16:22:23] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - 
kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:22:25] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:22:44] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1312.eqiad.wmnet with OS bookworm [16:23:18] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10332035 (10Seddon) Approved [16:23:25] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:25:18] (03PS1) 10Effie Mouzeli: memcached: add mc-gp200[4-6] gutter servers [puppet] - 10https://gerrit.wikimedia.org/r/1092282 (https://phabricator.wikimedia.org/T377033) [16:25:57] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1310.eqiad.wmnet with OS bookworm [16:26:27] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:27] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:28] (03PS1) 10Muehlenhoff: Add ferm macro/nftables set for aux pods like for other k8s installations [puppet] - 10https://gerrit.wikimedia.org/r/1092283 [16:28:51] 10SRE-swift-storage, 06Commons, 10MediaWiki-Uploading: Unable to obtain exclusive write permission. Someone else is doing something with this file. - https://phabricator.wikimedia.org/T379234#10332105 (10MatthewVernon) I'm afraid we don't keep swift logs far enough back to 7th November, so I can't provide an... 
[16:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1630). [16:30:15] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10332108 (10Papaul) maps-test2001 - ganeti2009 maps-test2002 - ganeti2010 maps-test2003 - ganeti2013 maps-test2004 - ganeti2014 maps-test2005 - gsneti2015 maps-test2001 - g... [16:30:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1311.eqiad.wmnet with OS bookworm [16:34:23] !log uploaded spicerack_8.16.2 to apt.wikimedia.org bullseye-wikimedia [16:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:37] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1305-1312].eqiad.wmnet [16:34:40] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1305-1312].eqiad.wmnet [16:35:30] (03CR) 10Alexandros Kosiaris: [C:03+1] Add ferm macro/nftables set for aux pods like for other k8s installations [puppet] - 10https://gerrit.wikimedia.org/r/1092283 (owner: 10Muehlenhoff) [16:37:49] (03CR) 10CDanis: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092283 (owner: 10Muehlenhoff) [16:38:48] !log installing spicerack v8.16.2 on cumin2002 [16:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:50:54] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10332241 (10RobH) a:05RobH→03None [16:50:58] !log installing spicerack v8.16.2 on cumin1002 [16:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:20] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Decom prod infra side of the ulsfo-office link - https://phabricator.wikimedia.org/T379778#10332236 (10RobH) 05Open→03Resolved a:03RobH @wiki_willy: I just wanted to notify you of this task's resolution and you'll see the N... [16:54:27] PROBLEM - MariaDB Replica Lag: s2 on db1246 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:41] PROBLEM - MariaDB read only s2 on db1246 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:55:09] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: set DNS for new maps-test nodes - pt1979@cumin2002" [16:55:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: set DNS for new maps-test nodes - pt1979@cumin2002" [16:55:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:50] (03PS1) 10Effie Mouzeli: memcached: add mc-gp200[4-6] gutter servers to pool [puppet] - 10https://gerrit.wikimedia.org/r/1092290 
(https://phabricator.wikimedia.org/T377033) [16:57:24] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10332303 (10Jhancock.wm) [16:58:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10332308 (10Jhancock.wm) 2163 is being a pain. gonna take a closer look today. failed during imaging but didn't catch the error. [17:00:30] (03PS1) 10Dreamy Jazz: [Beta] Re-enable IP masking on beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092292 (https://phabricator.wikimedia.org/T379108) [17:01:30] jouncebot: nowandnext [17:01:31] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [17:01:31] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1800) [17:01:31] In 0 hour(s) and 58 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1800) [17:01:41] Going to do a beta only deploy now if that's okay [17:02:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092292 (https://phabricator.wikimedia.org/T379108) (owner: 10Dreamy Jazz) [17:02:57] (03Merged) 10jenkins-bot: [Beta] Re-enable IP masking on beta metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092292 (https://phabricator.wikimedia.org/T379108) (owner: 10Dreamy Jazz) [17:09:01] PROBLEM - SSH on bast7001 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:10:01] RECOVERY - SSH on bast7001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:21:45] (03CR) 10Ebernhardson: [C:03+2] cirrus: Drop labtestwiki exclude [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1091589 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [17:22:51] (03Merged) 10jenkins-bot: cirrus: Drop labtestwiki exclude [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091589 (https://phabricator.wikimedia.org/T378260) (owner: 10Majavah) [17:23:29] 06SRE, 06Editing-team, 10MediaWiki-Debug-Logger, 10observability, and 4 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10332629 (10Urbanecm_WMF) [17:24:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:25:49] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@16a5867]: Deploy latest DAGs to analytics Airflow instance. T368755. [17:25:52] T368755: Python job that reads from wmf_dumps.wikitext_inconsistent_row and produced reconciliation events. - https://phabricator.wikimedia.org/T368755 [17:27:59] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@16a5867]: Deploy latest DAGs to analytics Airflow instance. T368755. 
(duration: 02m 10s) [17:30:30] (03PS1) 10Urbanecm: [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380161) [17:30:34] (03PS2) 10Urbanecm: [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) [17:31:31] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:39] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:31:43] (03CR) 10CI reject: [V:04-1] [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [17:31:56] (03PS1) 10Jdlrobson: Promote Vector 2022 as default on 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) [17:33:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:11] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:31] (03PS3) 10Urbanecm: [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) [17:34:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:37:15] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:37:31] (03PS1) 10Urbanecm: Create no-link-recommendation variant [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092300 (https://phabricator.wikimedia.org/T377787) [17:37:42] 10ops-codfw, 06SRE, 06DC-Ops: lsw-d[18]-codfw missing console port info in netbox - https://phabricator.wikimedia.org/T376917#10332761 (10Jhancock.wm) 05Open→03Resolved [17:41:10] (03CR) 10Urbanecm: [C:04-1] CirrusSearch: enable offloading weighted tags via EventBus for testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [17:41:19] (03PS2) 10Urbanecm: CirrusSearch: enable offloading weighted tags via EventBus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [17:41:27] (03CR) 10Urbanecm: CirrusSearch: enable offloading weighted tags via EventBus (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092258 (https://phabricator.wikimedia.org/T378983) (owner: 10Peter Fischer) [17:43:05] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 08 Feb 2025 11:19:52 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:43:23] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:43:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:47:37] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:49:37] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:50:28] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211 (10JMeybohm) 03NEW [17:50:41] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10332849 (10JMeybohm) [17:50:49] 06SRE, 10envoy, 06serviceops, 06Traffic, 13Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324#10332850 (10JMeybohm) [17:51:07] (03PS1) 10Bvibber: Use WAN cache for JsonConfig remote fetch cache [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) [17:51:35] 06SRE, 10envoy, 06serviceops, 07Kubernetes, 07Service-Architecture: 
Upgrade envoy configuration to use the v3 API - https://phabricator.wikimedia.org/T265880#10332798 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I believe this is done https://gerrit.wikimedia.org/r/c/operations/puppet/+/754460 [17:51:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [17:52:21] (03PS2) 10Jdlrobson: Promote Vector 2022 as default on 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) [17:53:28] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [17:53:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10332867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [17:54:37] (03CR) 10Stoyofuku-wmf: [C:03+1] "This looks correct to me!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson) [17:57:36] (03PS1) 10Papaul: Add test maps nodes to site.pp and preseed.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) [17:58:14] (03CR) 10CI reject: [V:04-1] Add test maps nodes to site.pp and preseed.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) (owner: 10Papaul) [17:59:07] (03PS3) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1800) [18:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1800). 
[18:00:41] (03CR) 10CI reject: [V:04-1] dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:01:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:02:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092300 (https://phabricator.wikimedia.org/T377787) (owner: 10Urbanecm) [18:02:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [18:03:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:03:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:04:11] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:08:45] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:09:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:09:27] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10332939 (10eoghan) We had a quick chat with ITS today where they disabled the change that caused the routing to change, an... 
[18:11:29] (03CR) 10CI reject: [V:04-1] Use WAN cache for JsonConfig remote fetch cache [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [18:12:32] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bullseye [18:12:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10332960 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [18:13:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:14:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:15:24] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:15:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:16:22] (03CR) 10Bvibber: "recheck" [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [18:17:11] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [18:24:14] (03PS1) 10Scott French: mw-debug: remove replicas override on -next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092309 (https://phabricator.wikimedia.org/T372604) [18:26:34] (03PS1) 10Bking: dse-k8s: raise quota for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092311 (https://phabricator.wikimedia.org/T371994) [18:27:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [18:35:06] (03PS2) 10Scott French: scap: add mw-debug "next" testservers check [puppet] - 
10https://gerrit.wikimedia.org/r/1087984 (https://phabricator.wikimedia.org/T372604) [18:36:49] (03PS2) 10Papaul: Add test maps nodes to site.pp and preseed.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) [18:37:36] jouncebot: nowandnext [18:37:36] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T1800) [18:37:36] In 2 hour(s) and 22 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T2100) [18:39:29] FYI, I'm going to make a minor helmfile-only change to the mw-debug "next" deployments [18:39:39] (03CR) 10Scott French: [C:03+2] mw-debug: remove replicas override on -next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092309 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:40:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1183.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:40:24] (03CR) 10Ahmon Dancy: [C:03+1] scap: add mw-debug "next" testservers check [puppet] - 10https://gerrit.wikimedia.org/r/1087984 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:40:49] (03Merged) 10jenkins-bot: mw-debug: remove replicas override on -next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092309 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [18:41:58] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:43:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:45:07] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:46:05] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:50:43] (03CR) 10Bking: [C:03+2] "self-merging, as this does not 
affect production services" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092311 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:54:12] (03Merged) 10jenkins-bot: dse-k8s: raise quota for blunderbuss [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092311 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:55:52] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092311 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [18:56:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:57:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [18:58:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [19:00:05] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10333210 (10Jclark-ctr) Opened ticket with Dell Advised of i/o errors on sda and uploaded tsr report ` [Sat Nov 9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(... [19:01:03] 10ops-eqiad, 06DC-Ops, 06serviceops: Degraded RAID on wikikube-worker1256 - https://phabricator.wikimedia.org/T379454#10333216 (10Jclark-ctr) Confirmed: Service Request 201149035 [19:06:22] FYI, unless there are any objections, I'll be making a second mw-debug related change that will require a noop scap deployment. this will happen in 5-10 minutes. 
[19:07:06] (03CR) 10Scott French: [C:03+2] scap: add mw-debug "next" testservers check [puppet] - 10https://gerrit.wikimedia.org/r/1087984 (https://phabricator.wikimedia.org/T372604) (owner: 10Scott French) [19:08:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [19:15:20] moving ahead with the noop scap deployment to test [19:15:38] !log swfrench@deploy2002 Started scap sync-world: Test deployment after adding mwdebug-next check command - T372604 [19:15:42] T372604: Turn up PHP 8.1-flavored mw-debug k8s deployment - https://phabricator.wikimedia.org/T372604 [19:15:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2163.codfw.wmnet with OS bookworm [19:17:10] !log swfrench@deploy2002 Finished scap sync-world: Test deployment after adding mwdebug-next check command - T372604 (duration: 01m 31s) [19:17:35] all done on my end [19:17:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333264 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm [19:18:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2037.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:18:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:18:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2113.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:21:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:35] RECOVERY - mailman list info on 
lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2110.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:28:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase2037.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:29:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2113.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:32:29] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['restbase2037'] [19:33:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2113'] [19:33:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['restbase2037'] [19:33:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2113'] [19:34:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2163.codfw.wmnet with reason: host reimage [19:35:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2037.codfw.wmnet with OS bullseye [19:35:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2113.codfw.wmnet with OS bullseye [19:35:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host 
elastic2113.co... [19:35:56] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10333377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host restbase2037.codfw.wmnet with OS bullseye [19:36:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2112.codfw.wmnet with OS bullseye [19:36:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333390 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host elastic2112.co... [19:37:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2163.codfw.wmnet with reason: host reimage [19:42:04] (03PS1) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [19:42:51] (03PS1) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) [19:44:37] (03PS2) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [19:45:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4546/console" [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:46:27] (03CR) 10Ssingh: [V:03+1] "Updating hiera with incorrect data to fail PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:46:55] (03PS3) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [19:48:06] (03CR) 10Ssingh: "Error: Could not call 'find' on 'catalog': Evaluation Error: Error while evaluating a Function Call, HW config check error: cpu_core_count" [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:48:58] (03PS4) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) [19:50:00] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4548/console" [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh) [19:51:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2113.codfw.wmnet with reason: host reimage [19:51:17] (03CR) 10Muehlenhoff: Add test maps nodes to site.pp and preseed.yaml file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) (owner: 10Papaul) [19:52:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2112.codfw.wmnet with reason: host reimage [19:54:24] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@594d3b5]: T377153 Release glent 0.3.5 [19:54:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2113.codfw.wmnet with reason: host reimage [19:54:48] T377153: Migrate Glent to Gitlab for publication of artifacts - https://phabricator.wikimedia.org/T377153 [19:54:52] !log ebernhardson@deploy2002 Finished deploy 
[airflow-dags/search@594d3b5]: T377153 Release glent 0.3.5 (duration: 00m 27s) [19:55:50] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:56:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [19:56:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2163.codfw.wmnet with OS bookworm [19:56:55] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333502 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2163.codfw.wmnet with OS bookworm completed: - wi... [19:57:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2112.codfw.wmnet with reason: host reimage [19:57:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2037.codfw.wmnet with reason: host reimage [19:58:33] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333508 (10Jhancock.wm) [19:58:42] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333509 (10Jhancock.wm) @Clement_Goubert last batch done! 
[20:00:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2037.codfw.wmnet with reason: host reimage [20:03:02] (03PS3) 10Papaul: Add test maps nodes to site.pp and preseed.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) [20:04:03] 10ops-codfw, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T380228 (10phaultfinder) 03NEW [20:07:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) (owner: 10Papaul) [20:11:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:11:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10333568 (10phaultfinder) [20:12:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:12:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2113.codfw.wmnet with OS bullseye [20:12:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333569 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2113.codfw.... 
[20:14:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:18:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333580 (10jhathaway) >>! In T380009#10332939, @eoghan wrote: > We had a quick chat with ITS today where they disabled the... [20:19:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2112.codfw.wmnet with OS bullseye [20:19:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333581 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host elastic2112.codfw.... [20:19:43] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:19:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333582 (10Jhancock.wm) [20:20:24] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: corto: failure to create google doc should not be fatal - https://phabricator.wikimedia.org/T379858#10333583 (10Eevans) Done ([[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/4c0104f0581b6db91b9d379163abcc50b504d20d | 4c0104f ]]). 
[20:20:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.09 - 2024.11.29): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10333584 (10Jhancock.wm) need to double check the mgmt port on 2110 [20:23:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:23:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2037.codfw.wmnet with OS bullseye [20:23:23] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333590 (10revi) >>! In T380009#10332939, @eoghan wrote: > We had a quick chat with ITS today where they disabled the chan... [20:23:27] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10333591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host restbase2037.codfw.wmnet with OS bullseye completed: - restbase203... [20:23:48] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: corto: failure to create google doc should not be fatal - https://phabricator.wikimedia.org/T379858#10333585 (10Eevans) 05Open→03Resolved a:03Eevans [20:23:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10333592 (10Jhancock.wm) [20:25:35] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install restbase203[6-8] - https://phabricator.wikimedia.org/T377896#10333593 (10Jhancock.wm) 05Open→03Resolved @Eevans this is complete! 
[20:25:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10333608 (10Jhancock.wm) 05Open→03Resolved [20:26:34] (03CR) 10Papaul: [C:03+2] Add test maps nodes to site.pp and preseed.yaml file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092305 (https://phabricator.wikimedia.org/T380144) (owner: 10Papaul) [20:29:07] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:29:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333634 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:30:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 30015952 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:31:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:33:07] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:37:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2041 to codfw - jhancock@cumin2002" [20:37:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding es2041 to codfw - jhancock@cumin2002" [20:37:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:37:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:09] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bullseye [20:39:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... [20:39:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2041.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:39:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:39:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:39:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:39:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:39:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:42:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [20:49:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10333742 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [20:49:58] !log disabling auto-reboot on re-imaging for debugging [20:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:06] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye [20:51:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye [20:51:58] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bullseye [20:52:07] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333744 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex... 
[20:52:41] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bookworm [20:52:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333753 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T2100). [21:00:05] MatmaRex and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:00:20] hi [21:01:15] dear deployer: my patches should all go out together, affect the beta cluster only, and can't be tested (because they depend on a puppet patch to function correctly, which is scheduled for the next window tomorrow). thanks [21:01:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:01:52] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bookworm [21:02:29] my patch is cleanup for multi-dc caching so can't be tested on debug servers :D [21:03:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333768 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm ex... 
[21:03:38] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bookworm [21:04:27] * TheresNoTime can't deploy this evening ^^ hopefully another deployer appears shortly [21:04:28] hm, actually i might, there's separate servers for each [21:04:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm [21:04:34] <3 [21:10:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2041.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2043.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:10:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:11:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2044.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:14:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2042.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:15:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2041'] [21:15:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2041'] [21:15:33] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts 
['es2042'] [21:15:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2042'] [21:16:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS bookworm [21:16:34] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10333798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm [21:16:49] bvibber: if you're deploying, could you do my patches afterwards too? i don't have access [21:17:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm [21:17:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2042.codfw.wmnet with OS bookworm [21:17:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2043.codfw.wmnet with OS bookworm [21:17:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2044.codfw.wmnet with OS bookworm [21:17:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2045.codfw.wmnet with OS bookworm [21:17:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm [21:17:14] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2042.codfw.wmnet with OS bookworm [21:17:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2046.codfw.wmnet with OS bookworm 
[21:17:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2043.codfw.wmnet with OS bookworm [21:17:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2044.codfw.wmnet with OS bookworm [21:17:20] (03CR) 10BCornwall: [C:03+1] trafficserver: remove inbound TLS and related settings [puppet] - 10https://gerrit.wikimedia.org/r/1091748 (owner: 10Ssingh) [21:17:26] (or is anyone else deploying today's window?) [21:17:26] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2045.codfw.wmnet with OS bookworm [21:17:32] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: VRTS e-mail address unreachable / e-mail routing issue - https://phabricator.wikimedia.org/T380009#10333786 (10eoghan) @jhathaway It was a rule set up to change the envelope-to of a mail from a given source. When we disabl...
[21:17:42] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10333804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2046.codfw.wmnet with OS bookworm [21:18:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [21:19:08] (not sure we have a deployer) [21:20:56] No I don't have all the rights to deploy consistently [21:20:59] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [21:21:41] I also need a deployer :) [21:21:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [21:22:23] might be worth pinging RoanKattouw urbanecm cjming and kindrobot again :) [21:22:36] let's deploy then [21:22:39] hi bvibber [21:22:53] :D [21:22:54] (03CR) 10Urbanecm: [C:03+2] Use WAN cache for JsonConfig remote fetch cache [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [21:22:56] \o/ [21:23:02] thx urbanecm [21:23:26] (03CR) 10Urbanecm: [C:03+2] Rename everything referring to "SSO domain" to use "shared domain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:23:28] (03CR) 10Urbanecm: [C:03+2] Rename shared domain sso.wikimedia.org to auth.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091841 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:23:29] (03CR) 10Urbanecm: [C:03+2] Use DB name rather than server name in shared domain path prefix 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091842 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:23:46] (03CR) 10Urbanecm: [C:03+2] Create no-link-recommendation variant [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092300 (https://phabricator.wikimedia.org/T377787) (owner: 10Urbanecm) [21:23:48] and since i'm deploying anyway... [21:24:19] (03Merged) 10jenkins-bot: Rename everything referring to "SSO domain" to use "shared domain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091839 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:24:21] (03Merged) 10jenkins-bot: Rename shared domain sso.wikimedia.org to auth.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091841 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:24:24] (03Merged) 10jenkins-bot: Use DB name rather than server name in shared domain path prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091842 (https://phabricator.wikimedia.org/T379811) (owner: 10Bartosz Dziewoński) [21:26:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1091839|Rename everything referring to "SSO domain" to use "shared domain" (T379811)]], [[gerrit:1091841|Rename shared domain sso.wikimedia.org to auth.wikimedia.org (T379811)]], [[gerrit:1091842|Use DB name rather than server name in shared domain path prefix (T379811)]] [21:26:10] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [21:26:38] bvibber: btw, you do seem to have all the rights to deploy? [21:26:58] (thanks urbanecm) [21:26:58] urbanecm: last i checked i couldn't +2 into some stuff i needed. that might've been config patches [21:27:03] and i might be wrong hah [21:27:06] might've gotten fixed [21:27:24] bvibber: you shouldn't _need_ to +2 manually. 
if you run `scap backport XXXXX`, the bot will +2 for you [21:27:25] in which case i just need to read up to make sure i know how to do mediawiki deploys right as well as service deploys :D [21:27:29] ok [21:27:33] nice [21:27:37] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:27:48] ok so that's good news that'll save me time >:-) [21:27:51] (i manually +2 to speed things up, as CI can run while i deploy something else, but that's just to do things in parallel) [21:28:03] but next time i try it i'll want someone hovering over my shoulder in case i fuck it up hehe [21:29:18] !log Add bvibber to wmf-deployment Gerrit group (existing deployer) [21:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:23] i guess trying to escape old permissions never works. the perm bits keep coming back ;) [21:29:33] thx [21:29:45] bvibber: i just made them in sync, if you want them revoked, can be done ig :D [21:29:58] hehe [21:30:33] !log urbanecm@deploy2002 matmarex, urbanecm: Backport for [[gerrit:1091839|Rename everything referring to "SSO domain" to use "shared domain" (T379811)]], [[gerrit:1091841|Rename shared domain sso.wikimedia.org to auth.wikimedia.org (T379811)]], [[gerrit:1091842|Use DB name rather than server name in shared domain path prefix (T379811)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:44] MatmaRex: i assume nothing to test? :) [21:31:03] urbanecm: yep. 
will test tomorrow when i can get the puppet patch deployed [21:31:09] !log urbanecm@deploy2002 matmarex, urbanecm: Continuing with sync [21:31:11] ack, proceeding [21:31:43] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:33:09] (03PS4) 10Urbanecm: [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) [21:33:21] (03CR) 10Urbanecm: [C:03+2] [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [21:34:04] (03Merged) 10jenkins-bot: [GrowthExperiments] testwiki: Enable no-link-recommendation experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092295 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [21:36:45] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 229.23 ms [21:36:54] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091839|Rename everything referring to "SSO domain" to use "shared domain" (T379811)]], [[gerrit:1091841|Rename shared domain sso.wikimedia.org to auth.wikimedia.org (T379811)]], [[gerrit:1091842|Use DB name rather than server name in shared domain path prefix (T379811)]] (duration: 10m 54s) [21:37:10] T379811: Update URL structure for SUL3 shared domain - https://phabricator.wikimedia.org/T379811 [21:38:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [21:38:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092300 (https://phabricator.wikimedia.org/T377787) 
(owner: 10Urbanecm) [21:38:07] \o/ [21:39:21] just the ci, just the ci... [21:40:07] (03PS2) 10Gergő Tisza: Add 'auth' wiki tag when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091922 (https://phabricator.wikimedia.org/T373737) [21:40:49] (03CR) 10CI reject: [V:04-1] Add 'auth' wiki tag when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091922 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [21:42:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:43:46] (03Merged) 10jenkins-bot: Use WAN cache for JsonConfig remote fetch cache [extensions/JsonConfig] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092304 (https://phabricator.wikimedia.org/T374746) (owner: 10Bvibber) [21:43:54] here we go [21:43:58] yay [21:44:35] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:46:20] (03Merged) 10jenkins-bot: Create no-link-recommendation variant [extensions/GrowthExperiments] (wmf/1.44.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1092300 (https://phabricator.wikimedia.org/T377787) (owner: 10Urbanecm) [21:46:41] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092304|Use WAN cache for JsonConfig remote fetch cache (T374746)]], [[gerrit:1092300|Create no-link-recommendation variant (T377787 T380204)]], [[gerrit:1092295|[GrowthExperiments] testwiki: Enable no-link-recommendation experiment (T380204)]] [21:46:44] okay, now it goes through [21:46:47] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:46:48] T377787: Add Link (structured): Introduce the no-link-recommendation variant - https://phabricator.wikimedia.org/T377787 
[21:46:48] T380204: Deploy Add Link to a proportion of test.wikipedia.org users - https://phabricator.wikimedia.org/T380204 [21:48:27] !log upload prometheus-mcrouter-exporter_0.4.0+git20241118-1~wmf1 - T380212 [21:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:30] T380212: Package prometheus-mcrouter-exporter v0.4.0 - https://phabricator.wikimedia.org/T380212 [21:52:21] (03PS1) 10Gergő Tisza: Use 'auth' rather than 'sso' as cookie prefix on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092333 (https://phabricator.wikimedia.org/T379811) [21:52:47] !log urbanecm@deploy2002 urbanecm, bvibber: Backport for [[gerrit:1092304|Use WAN cache for JsonConfig remote fetch cache (T374746)]], [[gerrit:1092300|Create no-link-recommendation variant (T377787 T380204)]], [[gerrit:1092295|[GrowthExperiments] testwiki: Enable no-link-recommendation experiment (T380204)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:51] finally [21:52:54] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:52:54] T377787: Add Link (structured): Introduce the no-link-recommendation variant - https://phabricator.wikimedia.org/T377787 [21:52:54] T380204: Deploy Add Link to a proportion of test.wikipedia.org users - https://phabricator.wikimedia.org/T380204 [21:52:54] bvibber: can you test? [21:52:56] woot! testing [21:54:04] urbanecm: working :D [21:54:05] thx [21:54:10] yay! [21:54:11] good news [21:54:13] !log urbanecm@deploy2002 urbanecm, bvibber: Continuing with sync [21:54:15] proceeding [21:57:08] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for dbrant - https://phabricator.wikimedia.org/T379678#10333937 (10thcipriani) >>! In T379678#10318095, @herron wrote: > * @thcipriani could you please leave a comment of approval for deployment? Reason for access makes sense to me, approved! 
[21:58:51] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092304|Use WAN cache for JsonConfig remote fetch cache (T374746)]], [[gerrit:1092300|Create no-link-recommendation variant (T377787 T380204)]], [[gerrit:1092295|[GrowthExperiments] testwiki: Enable no-link-recommendation experiment (T380204)]] (duration: 12m 10s) [21:59:03] T374746: Cache invalidation based on usage tracking of Data: pages - https://phabricator.wikimedia.org/T374746 [21:59:03] T377787: Add Link (structured): Introduce the no-link-recommendation variant - https://phabricator.wikimedia.org/T377787 [21:59:03] T380204: Deploy Add Link to a proportion of test.wikipedia.org users - https://phabricator.wikimedia.org/T380204 [21:59:32] bvibber: okay, should be live [21:59:35] anything else? [22:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241118T2200). [22:00:20] (03CR) 10Bartosz Dziewoński: [C:03+1] Use 'auth' rather than 'sso' as cookie prefix on the auth domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092333 (https://phabricator.wikimedia.org/T379811) (owner: 10Gergő Tisza) [22:02:26] (03CR) 10Bking: [C:03+1] ryankemper: add timestamps to bash history [puppet] - 10https://gerrit.wikimedia.org/r/1083925 (owner: 10Ryan Kemper) [22:02:48] (03PS1) 10Gergő Tisza: Disable various extensions when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092334 (https://phabricator.wikimedia.org/T373737) [22:03:07] urbanecm: that's all from me [22:03:14] sounds good! [22:03:15] thanks! 
[22:03:35] (03PS3) 10Gergő Tisza: Add 'auth' wiki tag when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091922 (https://phabricator.wikimedia.org/T373737) [22:04:16] (03CR) 10CI reject: [V:04-1] Add 'auth' wiki tag when using the shared login domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091922 (https://phabricator.wikimedia.org/T373737) (owner: 10Gergő Tisza) [22:07:15] (03PS1) 10Urbanecm: [GrowthExperiments] testwiki: Only enable Add Link for new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092336 (https://phabricator.wikimedia.org/T380204) [22:08:02] (03CR) 10Urbanecm: [C:03+2] [GrowthExperiments] testwiki: Only enable Add Link for new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092336 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [22:08:46] (03Merged) 10jenkins-bot: [GrowthExperiments] testwiki: Only enable Add Link for new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092336 (https://phabricator.wikimedia.org/T380204) (owner: 10Urbanecm) [22:09:19] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1092336|[GrowthExperiments] testwiki: Only enable Add Link for new accounts (T380204)]] [22:09:22] T380204: Deploy Add Link to a proportion of test.wikipedia.org users - https://phabricator.wikimedia.org/T380204 [22:13:22] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1092336|[GrowthExperiments] testwiki: Only enable Add Link for new accounts (T380204)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:13:55] !log urbanecm@deploy2002 urbanecm: Continuing with sync [22:18:34] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1092336|[GrowthExperiments] testwiki: Only enable Add Link for new accounts (T380204)]] (duration: 09m 14s) [22:18:37] T380204: Deploy Add Link to a proportion of test.wikipedia.org users - https://phabricator.wikimedia.org/T380204 
[22:22:35] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bookworm [22:22:45] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10333992 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm ex... [22:29:28] (PS1) Effie Mouzeli: prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) [22:37:26] !log bking@deploy2002 Started deploy [wdqs/wdqs@9927a5a]: 0.3.150 [22:37:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2042.codfw.wmnet with OS bookworm [22:37:56] ops-codfw, SRE, Data-Persistence, Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334038 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2042.codfw.wmnet with OS bookworm executed...
[22:41:07] (PS1) Aleksandar Mastilovic: All the necessary changes and missing files to make helm linter happy [deployment-charts] - https://gerrit.wikimedia.org/r/1092339 [22:41:58] (CR) CI reject: [V:-1] All the necessary changes and missing files to make helm linter happy [deployment-charts] - https://gerrit.wikimedia.org/r/1092339 (owner: Aleksandar Mastilovic) [22:47:21] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bookworm [22:47:29] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334071 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm [22:49:01] (PS1) Aleksandar Mastilovic: Fixing an improper merge of values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1092340 [22:49:26] !log bking@deploy2002 Finished deploy [wdqs/wdqs@9927a5a]: 0.3.150 (duration: 11m 59s) [22:50:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:50:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [22:50:55] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10334085 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm compl...
[22:52:45] !log removing 10 files for legal compliance [22:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:16] (03CR) 10Ryan Kemper: [C:03+2] ryankemper: add timestamps to bash history [puppet] - 10https://gerrit.wikimedia.org/r/1083925 (owner: 10Ryan Kemper) [22:53:55] (03PS1) 10C. Scott Ananian: Enable experimental Parsoid fragment support on labs and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092341 (https://phabricator.wikimedia.org/T374661) [22:54:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm [22:54:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334089 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed... [22:54:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2043.codfw.wmnet with OS bookworm [22:54:59] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334090 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2043.codfw.wmnet with OS bookworm executed... [22:54:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2046.codfw.wmnet with OS bookworm [22:55:10] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2046.codfw.wmnet with OS bookworm executed... 
[22:55:15] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2044.codfw.wmnet with OS bookworm [22:55:21] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bookworm [22:55:26] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2044.codfw.wmnet with OS bookworm executed... [22:55:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm ex... [22:55:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2045.codfw.wmnet with OS bookworm [22:55:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10334094 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2045.codfw.wmnet with OS bookworm executed... [22:56:01] (03CR) 10Subramanya Sastry: [C:03+1] Enable experimental Parsoid fragment support on labs and test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092341 (https://phabricator.wikimedia.org/T374661) (owner: 10C. 
Scott Ananian) [22:57:13] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bookworm [22:57:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334100 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm [22:58:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1092296 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson) [22:59:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [22:59:12] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334105 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [23:00:26] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [23:00:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [23:00:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [23:00:42] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334109 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [23:00:46] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10334110 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [23:01:34] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-be2005.codfw.wmnet with OS bookworm [23:01:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm ex... [23:02:26] (03PS11) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [23:02:52] (03PS5) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [23:02:52] (03PS4) 10Ryan Kemper: wdqs-internal: add envoy config for graph split [puppet] - 10https://gerrit.wikimedia.org/r/1091340 (https://phabricator.wikimedia.org/T379333) [23:03:38] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) (owner: 10Ryan Kemper) [23:03:54] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bookworm [23:03:57] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — restbase2036 - eevans@cumin1002" [23:04:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2003.codfw.wmnet with OS bookworm [23:04:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334117 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bookworm [23:04:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — restbase2036 - eevans@cumin1002" [23:04:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:04:10] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334118 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2003.codfw.wmnet with OS bookworm [23:05:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [23:06:08] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [23:08:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [23:09:10] (03PS1) 10Cwhite: logstash: upgrade phatality version to [puppet] - 10https://gerrit.wikimedia.org/r/1092343 (https://phabricator.wikimedia.org/T342476) [23:09:33] !log eevans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — restbase2036 - eevans@cumin1002" [23:09:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Additional IPs for Cassandra — restbase2036 - eevans@cumin1002" [23:09:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:12:49] !log removing 2 files for legal compliance [23:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:51] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:14:45] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 537 bytes in 3.180 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:15:31] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [23:19:10] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage [23:20:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10334153 (10jhathaway) @elukey, unfortunately I observed the same double d-i installer issue with thanos-be2005. Grub's installer does not throw any errro... [23:25:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2004.codfw.wmnet with OS bookworm [23:25:40] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2004.codfw.wmnet with OS bookworm [23:26:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2003.codfw.wmnet with reason: host reimage [23:26:37] !log removing 1 file for legal compliance [23:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:28:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" 
[23:28:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2002.codfw.wmnet with OS bookworm [23:28:59] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334190 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm completed: - maps-test2... [23:29:57] (03PS4) 10Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) [23:31:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2003.codfw.wmnet with reason: host reimage [23:32:05] !log removing 1 file for legal compliance [23:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:11] (03PS1) 10Eevans: restbase: commission restbase203[6-8] [puppet] - 10https://gerrit.wikimedia.org/r/1092345 (https://phabricator.wikimedia.org/T380236) [23:46:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2005.codfw.wmnet with OS bookworm [23:46:13] 10ops-codfw, 06SRE, 06DC-Ops: Set up six decommissioned nodes as temporary maps-test cluster - https://phabricator.wikimedia.org/T380144#10334202 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2005.codfw.wmnet with OS bookworm [23:48:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2004.codfw.wmnet with reason: host reimage [23:50:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2004.codfw.wmnet with reason: host reimage [23:51:16] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"