[00:00:33] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119233|Lazy Load Images (T366402)]], [[gerrit:1119234|Lazy Load Images (T366402)]] (duration: 31m 40s)
[00:00:54] That's all for us today, thank you so much!
[00:02:34] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:02:51] (CR) Zabe: [C:+2] Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119207 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[00:03:49] (Merged) jenkins-bot: Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119207 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[00:04:43] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]]
[00:04:46] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:07:41] !log zabe@deploy2002 zabe: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:08:34] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:08:49] !log zabe@deploy2002 zabe: Continuing with sync
[00:10:56] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:22] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]] (duration: 10m 39s)
[00:15:26]
T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:20:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:48] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:32:40] (PS4) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861)
[00:38:33] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239
[00:38:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239 (owner: TrainBranchBot)
[00:39:34] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:39:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10547180 (phaultfinder)
[00:39:38] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:58] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:45:34] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:46:35] (CR) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: BryanDavis)
[00:47:50] (PS2) BryanDavis: admin_ng: Switch on enableJobSidecarController for toolhub [deployment-charts] - https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861)
[00:47:50] (PS5) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861)
[00:50:53] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239 (owner: TrainBranchBot)
[01:08:43] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242
[01:08:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242 (owner: TrainBranchBot)
[01:23:40] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242 (owner: TrainBranchBot)
[01:32:45] (PS1) Stang: Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243
[01:35:39] !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=OGPawlis --overwrite /tmp/uploads # T382976
[01:35:43] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:43] T382976: Server side upload for OGPawlis - https://phabricator.wikimedia.org/T382976
[01:35:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[01:42:42] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200799s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:45:18] (PS1) TChin: Eventstreams: Bump image to v0.14.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769)
[01:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:12] !log zabe@deploy2002:~$ mwscript-k8s --comment="T386292" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Sofia Baldelli' 'AnonymWikiuser 245'
[01:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:15] T386292: Unblock stuck global rename of AnonymWikiuser_245 and Renamed_user_9b7b870ac2b7d3f071232203ec1030d1 - https://phabricator.wikimedia.org/T386292
[01:48:21] !log zabe@deploy2002:~$ mwscript-k8s --comment="T386292" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Nebuls' 'Renamed user 9b7b870ac2b7d3f071232203ec1030d1'
[01:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:16] (PS2) Krinkle: tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872
(https://phabricator.wikimedia.org/T379661) (owner: Hokwelum)
[02:03:59] (PS3) Krinkle: tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: Hokwelum)
[02:04:06] (CR) Krinkle: [C:+1] tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: Hokwelum)
[02:17:56] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 16.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:07:19] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:39:09] !log tchin@deploy2002 Started deploy
[airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896
[03:39:12] T306896: Integrate Spark with DataHub with lineage - https://phabricator.wikimedia.org/T306896
[03:40:04] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896 (duration: 01m 07s)
[04:07:55] Deploying cxserver. Minor changes.
[04:08:24] (CR) KartikMistry: [C:+2] Update cxserver to 2025-02-12-075258-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119052 (https://phabricator.wikimedia.org/T381943) (owner: KartikMistry)
[04:09:35] (Merged) jenkins-bot: Update cxserver to 2025-02-12-075258-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119052 (https://phabricator.wikimedia.org/T381943) (owner: KartikMistry)
[04:28:04] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:28:29] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[04:37:42] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[04:38:23] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[04:40:58] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[04:41:31] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[04:45:46] !log Updated cxserver to 2025-02-12-075258-production (T381943)
[04:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:49] T381943: Swagger probe test for page fetch API is failing - https://phabricator.wikimedia.org/T381943
[04:54:05] (CR) KartikMistry: [C:+2] Update MinT to 2025-02-05-115716-production [deployment-charts] - https://gerrit.wikimedia.org/r/1115314 (https://phabricator.wikimedia.org/T383750) (owner: KartikMistry)
[04:55:20] (Merged) jenkins-bot: Update MinT to 2025-02-05-115716-production [deployment-charts] - https://gerrit.wikimedia.org/r/1115314
(https://phabricator.wikimedia.org/T383750) (owner: KartikMistry)
[05:24:27] (PS3) Anzx: Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126)
[05:26:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[05:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0700)
[07:00:05] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0700).
[07:07:19] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0800).
[08:00:05] gmodena, koi, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:14] o/
[08:00:42] o/
[08:00:46] o/
[08:02:07] o/
[08:02:45] I'll be doing my deployment together with dcausse
[08:04:12] koi, anzx we'll do your patches first
[08:04:40] ok
[08:05:13] (CR) DCausse: [C:+1] Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:07:58] ok, i'm here
[08:08:29] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:09:15] (Merged) jenkins-bot: Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:09:45] (CR) DCausse: [C:-1] Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:09:57] anzx: I think there's a small issue in your patch
[08:10:20] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]]
[08:11:50] anzx: my bad it's two different days
[08:12:26] dcausse: ok
[08:12:56] (CR) DCausse: [C:+1] Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:13:48] !log dcausse@deploy2002 stang, dcausse: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:17] koi: could you test your change?
[08:14:27] yeah
[08:14:56] looking
[08:15:57] dcausse, i see the logo back to the normal one, so LGTM
[08:17:00] koi: thanks, deploying
[08:17:03] !log dcausse@deploy2002 stang, dcausse: Continuing with sync
[08:18:11] (CR) Brouberol: [V:+1 C:+2] airflow-analytics: fix typo in config [puppet] - https://gerrit.wikimedia.org/r/1119185 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[08:23:45] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]] (duration: 13m 24s)
[08:23:57] koi: should be live
[08:24:40] anzx: shipping your patch now, I guess we can't really test it on debug servers?
[08:24:56] yeah no test needed
[08:25:36] ty
[08:25:57] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:26:42] (Merged) jenkins-bot: Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:27:09] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1118984|Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (T386126)]]
[08:27:12] T386126: Lift IP for a edit-a-thon in Jujuy, Argentina (2025-02-17 & 2025-03-10) - https://phabricator.wikimedia.org/T386126
[08:27:15] SRE, Observability-Metrics: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970#10547912 (Volker_E)
[08:36:54] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118984|Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (T386126)]] (duration: 09m 45s)
[08:36:57] T386126: Lift IP for a edit-a-thon in Jujuy, Argentina (2025-02-17 & 2025-03-10) - https://phabricator.wikimedia.org/T386126
[08:37:05] dcausse: thanks
[08:37:55] anzx: yw!
[08:38:17] gmodena: we're next
[08:38:25] dcausse ack
[08:40:40] (CR) DCausse: [C:+1] cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:41:04] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:41:51] (Merged) jenkins-bot: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:42:22] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]]
[08:42:25] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:44:50] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 2904 MB (3% inode=98%): /tmp 2904 MB (3% inode=98%): /var/tmp 2904 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:45:24] !log dcausse@deploy2002 dcausse, gmodena: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:50:58] (PS1) Brouberol: airflow-analytics: upgarde the airflow deb package to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092)
[08:51:19] (CR) CI reject: [V:-1] airflow-analytics: upgarde the airflow deb package to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[08:54:55] !log dcausse@deploy2002 dcausse, gmodena:
Continuing with sync
[08:59:31] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:00:04] andre and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0900).
[09:01:28] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] (duration: 19m 06s)
[09:01:31] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[09:02:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:03:35] !log closing UTC morning backport window
[09:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:31] (PS2) Brouberol: airflow-analytics: upgrade airflow to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092)
[09:25:36] (CR) Stevemunene: [C:+1] "lgtm!"
[puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[09:26:01] (CR) Brouberol: [C:+2] airflow-analytics: upgrade airflow to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[09:30:36] (PS1) TrainBranchBot: group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367)
[09:30:38] (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:31:23] (Merged) jenkins-bot: group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:40:39] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.16 refs T382367
[09:40:42] T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[09:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:28] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:58:28] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env
PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[10:11:22] (CR) Stevemunene: [C:+2] Change dse-k8s-worker1003 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119103 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[10:14:47] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[10:23:23] (PS1) Kosta Harlan: ApiPageTriageList: Check that $user is defined before using it [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332)
[10:35:11] (CR) Aklapper: [C:+2] ApiPageTriageList: Check that $user is defined before using it [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332) (owner: Kosta Harlan)
[10:36:25] (CR) Aklapper: [V:+2 C:+2] "+2/V this backport because https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/1119468 got merged" [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332) (owner: Kosta Harlan)
[10:38:25] !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]]
[10:38:28] T386332: Error: Call to a member function isTemp() on null - https://phabricator.wikimedia.org/T386332
[10:41:04] !log aklapper@deploy2002 kharlan, aklapper: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:42:32] !log aklapper@deploy2002 kharlan, aklapper: Continuing with sync
[10:44:02]
!log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2]: Analytics one-off deploy [analytics/refinery@08b2bd2e]
[10:46:10] !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2]: Analytics one-off deploy [analytics/refinery@08b2bd2e] (duration: 02m 07s)
[10:46:27] !log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2] (thin): Analytics one-off deploy -THIN [analytics/refinery@08b2bd2e]
[10:47:13] !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2] (thin): Analytics one-off deploy -THIN [analytics/refinery@08b2bd2e] (duration: 00m 46s)
[10:47:22] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[10:47:27] !log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2] (hadoop-test): Analytics one-off deploy - TEST [analytics/refinery@08b2bd2e]
[10:48:11] !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2] (hadoop-test): Analytics one-off deploy - TEST [analytics/refinery@08b2bd2e] (duration: 00m 44s)
[10:49:12] !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]] (duration: 10m 47s)
[10:49:15] T386332: Error: Call to a member function isTemp() on null - https://phabricator.wikimedia.org/T386332
[10:50:48] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1100)
[11:07:19] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:25] FIRING: [2x]
SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:54] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[11:12:27] (PS1) FNegri: toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885)
[11:13:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10548759 (phaultfinder)
[11:38:30] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database kncwiki (T385188)
[11:38:33] T385188: [wikireplicas] Create views for new wiki kncwiki - https://phabricator.wikimedia.org/T385188
[12:04:31] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database kncwiki (T385188)
[12:04:34] T385188: [wikireplicas] Create views for new wiki kncwiki - https://phabricator.wikimedia.org/T385188
[12:06:34] (PS1) KartikMistry: Update cxserver to 2025-02-13-102531-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943)
[12:13:53] (PS1) Arthur taylor: Add smartcard SSH key for arthurtaylor [puppet] - https://gerrit.wikimedia.org/r/1119482
[12:17:21] (CR) Ladsgroup: [C:+2] "Deploying this" [dumps] - https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:18:04] (Merged) jenkins-bot: Stop producing Yahoo!
abstract dumps [dumps] - https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:18:14] !log draining dse-k8s-worker1004 ready for reimage to bookworm and containerd for T377875
[12:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:17] T377875: Migrate dse-k8s cluster from docker to containerd - https://phabricator.wikimedia.org/T377875
[12:19:27] !log ladsgroup@deploy2002 Started deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069)
[12:19:30] T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069
[12:19:35] !log ladsgroup@deploy2002 Finished deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069) (duration: 00m 07s)
[12:20:31] SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349 (ArthurTaylor) NEW
[12:23:29] (CR) Lucas Werkmeister (WMDE): [C:+1] "FWIW, it should be possible to use Ed25519 keys with a YubiKey too (I did it two months ago when I had to replace my YubiKey, see [Matterm" [puppet] - https://gerrit.wikimedia.org/r/1119482 (owner: Arthur taylor)
[12:25:00] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm
[12:29:38] SRE, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10548938 (Krinkle) ## Status Update: Dec 2024 - Jan 2025 A few updates on this proposal: In December 2024, @SCherukuwada ment...
[12:29:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10548939 (phaultfinder)
[12:31:50] (CR) Stevemunene: [C:+2] Change dse-k8s-worker1004 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119104 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[12:37:30] (PS1) Ladsgroup: Remove abstract dumps infrastructure [dumps] - https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069)
[12:37:50] (CR) CI reject: [V:-1] Remove abstract dumps infrastructure [dumps] - https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:39:00] (PS2) Ladsgroup: Remove abstract dumps infrastructure [dumps] - https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069)
[12:41:36] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[12:45:10] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[12:52:01] Another quick cxserver deployment..
[12:52:26] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-02-13-102531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943) (owner: 10KartikMistry) [12:53:33] (03Merged) 10jenkins-bot: Update cxserver to 2025-02-13-102531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943) (owner: 10KartikMistry) [12:56:00] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:56:23] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1300) [13:01:23] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [13:01:52] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [13:02:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:02:29] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [13:02:41] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm [13:03:06] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [13:04:01] !log Updated Cxserver to 2025-02-13-102531-production (T381943, T386231) [13:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:05] T381943: Swagger probe test for page fetch API is failing - https://phabricator.wikimedia.org/T381943 [13:04:06] T386231: 
/v2/translate API fails to perform any translation when no provider is provided - https://phabricator.wikimedia.org/T386231 [13:15:46] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10549099 (10karapayneWMDE) as the EM of Wikidata at WMDE, I approve this request! [13:27:07] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm [13:32:19] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:36:51] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10549117 (10Krinkle) >>! In T214998#10548937, @Krinkle wrote: > Historically, Google Search has supported this divide natively. E... 
[13:37:19] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:42] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:31] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:37] (03CR) 10CDanis: [C:03+2] aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:42:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:45:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 414MiB (2% inode=33%): /tmp 414MiB (2% inode=33%): /var/tmp 414MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [13:45:32] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 
2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [13:46:07] (03Merged) 10jenkins-bot: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [13:46:52] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:47:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [13:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:02] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:48:03] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage [13:58:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549187 (10YLiou_WMF) 05Resolved→03Open [13:59:44] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:00:00] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:22] I probably can’t deploy today [14:00:34] I might do some backports later but that’s it [14:00:41] (so it’s good there’s nothing in the queue anyway ^^) [14:05:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [14:06:21] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm [14:16:16] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357 [14:16:19] T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357 [14:16:19] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357 [14:16:48] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549226 (10YLiou_WMF) Unfortunately, I'm reopening this as I'm experiencing an issue logging into JupyterHub. This app... 
[14:18:09] (03PS1) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) [14:18:59] (03PS2) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) [14:19:45] (03PS3) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) [14:19:52] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:19:53] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:21:46] (03CR) 10Brouberol: [C:03+1] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:22:07] (03CR) 10CDanis: [C:03+2] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:29:23] (03PS1) 10Lucas Werkmeister (WMDE): Add config option to make somevalue hashes use URI [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344) [14:29:31] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:40] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10549272 (10Andrew) p:05Triage→03Medium Update: gitlab is making daily snapshot builds and uploading them 
to quay.io -- the builds fail now and... [14:29:40] (03PS1) 10Lucas Werkmeister (WMDE): Make somevalue hashes use URI in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344) [14:29:57] (03PS1) 10Lucas Werkmeister (WMDE): Add config option to fix s:, ref:, v: namespace prefix [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344) [14:30:21] (03PS1) 10Lucas Werkmeister (WMDE): Fix s:, ref:, v: namespace prefix in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344) [14:30:43] I’ll probably backport these ^ soon but let’s see CI pass first [14:30:45] jouncebot: nowandnext [14:30:45] For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400) [14:30:45] In 1 hour(s) and 29 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600) [14:32:14] (03CR) 10Btullis: [C:03+1] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis) [14:35:24] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bookworm [14:35:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:13] 06SRE, 10SRE-Access-Requests, 
06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549344 (10BTullis) It looks like the cause of this is that the `yliou` account is not a member of either the `wmf` or... [14:41:20] (03PS1) 10Gergő Tisza: auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) [14:41:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: 10Gergő Tisza) [14:48:22] (03CR) 10Jforrester: "Should we wait for I86a2efdf7d24c1fe7573b047aa3b804a7f831af5 to have a month in prod and not break the world first?" [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [14:53:43] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549415 (10BTullis) I added the record from `mwmaint1002` with the following command: ` btullis@mwmaint2002:~$ sudo mo... 
[14:53:57] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [14:54:51] (03CR) 10Ladsgroup: "Yup, that's the plan" [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [14:57:02] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage [14:59:15] jouncebot: nowandnext [14:59:15] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400) [14:59:15] In 1 hour(s) and 0 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600) [14:59:47] I’ll go ahead with my backports then (they’re expected to have no effect, just preparing for a config change that we’ll probably do on Monday) [15:00:44] (03PS1) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) [15:01:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [15:02:38] FTR, scap backport warns me that one of the changes depends on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1118147 but that’s not on the same branch [15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:47] which should be fine because it was already merged in time for wmf.16 [15:03:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis 
set policy FORCE_RESTART [15:03:05] ditto for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1118120 [15:03:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:03:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:03:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:03:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:03:35] (I suppose I could’ve removed the Depends-On from the commit messages 🤔) [15:03:53] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357 [15:03:55] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357 [15:03:56] T386357: Replace current Relforge servers with repurposed Elastic hosts - 
https://phabricator.wikimedia.org/T386357 [15:05:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:06:34] (03CR) 10RLazarus: [C:03+1] admin_ng: Switch on enableJobSidecarController for toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [15:07:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1219', diff saved to https://phabricator.wikimedia.org/P73462 and previous config saved to /var/cache/conftool/dbconfig/20250213-150715-marostegui.json [15:07:32] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1219.eqiad.wmnet [15:10:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:11:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2146', diff saved to https://phabricator.wikimedia.org/P73463 and previous config saved to /var/cache/conftool/dbconfig/20250213-151117-marostegui.json [15:12:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2146.codfw.wmnet [15:12:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:13:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1219.eqiad.wmnet [15:15:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Index rebuild [15:15:49] (03Merged) 10jenkins-bot: Add config option to make somevalue hashes use URI [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:15:49] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
dse-k8s-worker1006.eqiad.wmnet with OS bookworm [15:17:04] (03Merged) 10jenkins-bot: Make somevalue hashes use URI in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:17:35] (03Merged) 10jenkins-bot: Add config option to fix s:, ref:, v: namespace prefix [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:17:37] (03Merged) 10jenkins-bot: Fix s:, ref:, v: namespace prefix in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [15:17:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1104-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:17:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:18:02] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]] [15:18:05] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [15:18:17] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.mysql.upgrade (exit_code=0) for db2146.codfw.wmnet [15:19:32] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1104-1106].eqiad.wmnet with reason: T386357 [15:19:36] T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357 [15:19:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: maintenance [15:20:03] (03CR) 10Andrew Bogott: [C:03+1] toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885) (owner: 10FNegri) [15:20:49] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]] synced to the testservers (https://wikitech. 
[15:20:49] wikimedia.org/wiki/Mwdebug) [15:21:03] testing that nothing changes [15:21:15] (03PS2) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) [15:21:53] yup, lgtm [15:22:01] (03PS3) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) [15:22:41] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [15:22:50] (03CR) 10Gmodena: [C:03+1] Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [15:22:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [15:23:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:23:31] (03CR) 10TChin: [C:03+2] Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [15:24:44] (03Merged) 10jenkins-bot: Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin) [15:27:44] (03PS1) 10Marostegui: s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529 [15:28:11] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:28:27] (03CR) 10Marostegui: [C:03+2] s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529 (owner: 10Marostegui) [15:29:11] (03Merged) 10jenkins-bot: s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529 (owner: 
10Marostegui) [15:29:21] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]] (duration: 11m 19s) [15:29:24] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [15:29:33] * Lucas_WMDE done deploying [15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10549646 (10phaultfinder) [15:29:38] (03PS4) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) [15:35:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:36:42] (03CR) 10Marostegui: "It would be interesting if you can add some description to what has been done in the commit message. As it is a very big change, it would " [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico Ceratto) [15:37:48] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885) (owner: 10FNegri) [15:39:51] (03CR) 10Btullis: [C:03+1] "LGTM. Do we need to depool the servers before removing them from conftool-data ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [15:40:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [15:41:07] (03CR) 10Bking: [V:03+2 C:03+2] "Indeed we do. I have already depooled them." [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [16:00:04] andre and jnuche: OwO what's this, a deployment window?? Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600). nyaa~ [16:01:53] hnowlan: could you deploy https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1115587 for me? (I think I do not have the permissions to do it myself) [16:08:09] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:08:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:14:32] (03PS1) 10Gergő Tisza: Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) [16:15:15] zabe: SRE is offsite at the moment so it may take a while for h.nowlan to respond. 
I can deploy if you want, but I don't have the domain knowledge to test the change afterwards [16:15:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: 10Gergő Tisza) [16:15:57] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1104 to relforge1005 [16:16:01] (03CR) 10CI reject: [V:04-1] Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: 10Gergő Tisza) [16:16:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:16:20] I can "test" it (testing would just be to check that https://knc.wikipedia.org/api/rest_v1/ works) [16:16:29] claime: ^ [16:19:36] zabe: do you have +2 rights? 
I can +1 it but I'm uncomfortable doing +1/+2 and merging alone [16:19:53] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1104 to relforge1005 - bking@cumin2002" [16:20:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1104 to relforge1005 - bking@cumin2002" [16:20:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:13] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1005 [16:20:34] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1005 [16:20:37] claime: yes, I am just missing restbase-admins (or what else is needed) to deploy it [16:20:44] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bookworm [16:20:51] hit +2 [16:20:52] zabe: ok cool [16:21:14] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1104 to relforge1005 [16:21:28] (03PS1) 10Gergő Tisza: Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) [16:22:14] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1105 to relforge1006 [16:22:28] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:23:00] !log cgoubert@deploy2002 Started deploy [restbase/deploy@511b3a4]: Add kncwiki (T385186) [16:23:03] T385186: Add kncwiki to RESTBase - https://phabricator.wikimedia.org/T385186 [16:24:02] (03CR) 10Gergő Tisza: "recheck" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: 10Gergő Tisza) [16:27:33] !log 
bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1105 to relforge1006 - bking@cumin2002" [16:27:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: 10Gergő Tisza) [16:27:49] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1105 to relforge1006 - bking@cumin2002" [16:27:49] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:50] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1006 [16:28:04] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1006 [16:28:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1105 to relforge1006 [16:29:13] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1106 to relforge1007 [16:29:26] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:29:26] zabe: it's deploying, slowly, but it is in progress [16:29:42] thanks :) [16:32:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1106 to relforge1007 - bking@cumin2002" [16:33:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1106 to relforge1007 - bking@cumin2002" [16:33:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:33:29] !log bking@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host relforge1007 [16:33:40] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1007 [16:34:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1106 to relforge1007 [16:35:22] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550138 (10YLiou_WMF) @BTullis the yliou account seems to work to login to jupyterhub now! Thank you for all your help! [16:35:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550139 (10YLiou_WMF) 05Open→03Resolved [16:35:59] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1005.eqiad.wmnet with OS bullseye [16:36:34] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1006.eqiad.wmnet with OS bullseye [16:37:01] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye [16:38:54] !log cgoubert@deploy2002 Finished deploy [restbase/deploy@511b3a4]: Add kncwiki (T385186) (duration: 15m 54s) [16:38:58] T385186: Add kncwiki to RESTBase - https://phabricator.wikimedia.org/T385186 [16:39:09] zabe: looks like it's done, and the api doc page loads for me [16:39:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [16:39:22] yep, thanks for your help [16:39:32] np :) [16:42:35] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage [16:43:58] I just noticed that stashbot recently exceeded 400,000 edits on Wikitech. Congratulations little bot. 
https://meta.wikimedia.org/wiki/Special:CentralAuth/Stashbot [16:44:19] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-03-14 16:43:55. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:44:47] ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-03-14 16:44:36. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:49:29] (03PS1) 10CDanis: add fault-tolerance namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119534 [16:49:29] (03PS1) 10CDanis: WIP: initial chart for fault-tolerance tool [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119535 [16:50:25] (03CR) 10CI reject: [V:04-1] WIP: initial chart for fault-tolerance tool [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119535 (owner: 10CDanis) [16:53:40] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad [16:53:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad [16:55:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1153.eqiad.wmnet with reason: maintenance [16:58:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:59:50] (03PS1) 10Sergio Gimeno: beta: A/B test setup for surfacing structured tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) [17:01:27] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
dse-k8s-worker1007.eqiad.wmnet with OS bookworm [17:02:19] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:03:06] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [17:03:51] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [17:04:08] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [17:04:50] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [17:13:08] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm [17:16:12] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390 (10RobH) 03NEW [17:16:41] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10550310 (10RobH) [17:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10550329 (10phaultfinder) [17:31:32] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage [17:34:40] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage [17:34:51] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1005.eqiad.wmnet with OS bullseye [17:34:58] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: 
apply [17:35:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1006.eqiad.wmnet with OS bullseye [17:35:27] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [17:35:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1007.eqiad.wmnet with OS bullseye [17:37:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10550376 (10BTullis) Thanks @RobH - I have been discussing with the team the approach that we would like to take around this and I th... [17:38:27] (03PS2) 10Subramanya Sastry: Turn on Parsoid Read Views for 33 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: 10C. Scott Ananian) [17:42:24] (03PS1) 10Andrew Bogott: wmcs-policy-tests: make all tests pass in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1119541 [17:42:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1005.eqiad.wmnet with OS bullseye [17:43:53] (03CR) 10Andrew Bogott: [C:03+2] wmcs-policy-tests: make all tests pass in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1119541 (owner: 10Andrew Bogott) [17:47:19] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:53:47] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm [17:56:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye [17:57:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host 
relforge1006.eqiad.wmnet with OS bullseye [17:58:34] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage [18:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1800) [18:03:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage [18:04:37] (03CR) 10Stevemunene: [C:03+2] Change dse-k8s-worker1009 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119105 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [18:04:53] !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [18:05:16] !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [18:05:33] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [18:07:14] (03PS1) 10NMW03: Allow sysops to add/remove "confirmed" on English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) [18:13:05] (03PS1) 10NMW03: Add "suppressredirect" to "editor" on Russian Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) [18:14:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) (owner: 10NMW03) [18:14:12] (03CR) 
10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) (owner: 10NMW03) [18:18:39] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [18:20:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1005.eqiad.wmnet with OS bullseye [18:20:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73465 and previous config saved to /var/cache/conftool/dbconfig/20250213-182026-root.json [18:20:30] RECOVERY - Host ms-be2075 is UP: PING WARNING - Packet loss = 75%, RTA = 30.33 ms [18:21:22] PROBLEM - SSH on ms-be2075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:21:38] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [18:22:11] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage [18:22:20] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [18:26:54] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:30] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1006.eqiad.wmnet with OS bullseye [18:28:31] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1007.eqiad.wmnet with OS bullseye [18:35:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73466 and previous config saved to 
/var/cache/conftool/dbconfig/20250213-183531-root.json [18:39:07] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye [18:39:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1006.eqiad.wmnet with OS bullseye [18:40:00] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm [18:40:53] (03PS1) 10Andrew Bogott: wmcs-policy-tests: make work in codfw1dev too [puppet] - 10https://gerrit.wikimedia.org/r/1119551 [18:42:40] (03CR) 10Andrew Bogott: [C:03+2] wmcs-policy-tests: make work in codfw1dev too [puppet] - 10https://gerrit.wikimedia.org/r/1119551 (owner: 10Andrew Bogott) [18:50:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73467 and previous config saved to /var/cache/conftool/dbconfig/20250213-185036-root.json [18:54:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73469 and previous config saved to /var/cache/conftool/dbconfig/20250213-185433-root.json [18:54:58] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1007.eqiad.wmnet with reason: host reimage [18:55:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1006.eqiad.wmnet with reason: host reimage [18:58:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1007.eqiad.wmnet with reason: host reimage [18:59:56] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:54] !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [19:00:56] !log bking@cumin2002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1006.eqiad.wmnet with reason: host reimage [19:01:40] !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [19:05:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73470 and previous config saved to /var/cache/conftool/dbconfig/20250213-190542-root.json [19:06:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:08:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73471 and previous config saved to /var/cache/conftool/dbconfig/20250213-190938-root.json [19:15:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1007.eqiad.wmnet with OS bullseye [19:18:05] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1006.eqiad.wmnet with OS bullseye [19:20:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73472 and previous config saved to /var/cache/conftool/dbconfig/20250213-192047-root.json [19:21:26] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:26] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:21:48] RECOVERY - mailman list info ssl expiry on 
lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:24:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73473 and previous config saved to /var/cache/conftool/dbconfig/20250213-192444-root.json [19:27:05] (03CR) 10RLazarus: [C:03+2] Reapply "Use new 'auth' docroot for the auth domain" [puppet] - 10https://gerrit.wikimedia.org/r/1117924 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [19:27:27] sneaking out an apache change [19:31:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:31:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:32:00] (03PS1) 10Jdlrobson: Fix name of ABTestEnrollment configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) [19:32:07] ^ expected, transient from version skew [19:32:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:34:31] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:49] !log rzl@deploy2002 Started scap sync-world: T383952, T384137 [19:35:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:35:55] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [19:35:56] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [19:36:58] !log rzl@deploy2002 rzl: T383952, T384137 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:37:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:38:19] !log rzl@deploy2002 rzl: Continuing with sync [19:39:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73474 and previous config saved to /var/cache/conftool/dbconfig/20250213-193949-root.json [19:44:03] !log rzl@deploy2002 Finished scap sync-world: T383952, T384137 (duration: 11m 16s) [19:44:07] T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952 [19:44:07] T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137 [19:49:49] (done, hanging out in case of problems but everything looks good) [19:54:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73475 and previous config saved to 
/var/cache/conftool/dbconfig/20250213-195454-root.json [19:55:23] (03CR) 10Bernard Wang: [C:03+1] Fix name of ABTestEnrollment configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: 10Jdlrobson) [19:55:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: 10Jdlrobson) [19:56:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:56:37] rzl: restarting the httpbb systemd timers to clear up alerts btw [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 
is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:58:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:58:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:59:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:59:17] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:59:29] claime: sure, thanks [19:59:31] RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:03:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:03:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:05:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:49] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:07:37] !log mwscript-k8s --attach extensions/Translate/scripts/moveTranslatableBundle.php -- --wiki=metawiki 'Wiki_Movement_Brazil_User_Group' 'Wikimedia Brasil' 'Martin Urbanec' --reason='per [[special:Permalink/28261149#Wikimedia_Brasil|request]] ([[:phab:T386402]])' # T386402 [20:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:40] T386402: Request to move translatable page: Wiki_Movement_Brazil_User_Group at Meta-Wiki - https://phabricator.wikimedia.org/T386402 [20:09:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:10:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:14:07] (03CR) 10Stoyofuku-wmf: [C:03+1] "Looks good 😭 thanks for catching this so fast" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: 10Jdlrobson) [20:23:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:23:36] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:23:40] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@092b9d3]: deploy latest DAGs to analytics 
Airflow instance. T386114. [20:23:44] T386114: DAG failing due to failure to acquire lock on wmf_data_ops.data_quality_metrics table - https://phabricator.wikimedia.org/T386114 [20:23:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:23:58] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:13] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@092b9d3]: deploy latest DAGs to analytics Airflow instance. T386114. (duration: 00m 33s) [20:24:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:24:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10550789 (10phaultfinder) [20:31:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:31:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:31:36] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:32:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:32:30] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:33:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:35:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:35:28] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:36:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:36:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:49:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:50:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:51:02] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550873 (10YLiou_WMF) 05Resolved→03Open [20:51:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [20:54:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550911 (10YLiou_WMF) Unfortunately I'm now experiencing a separate issue! I'm trying to install R and am receiving th... [20:56:13] !log bking@cephosd1001:~$ sudo radosgw-admin user create --uid=research --display-name="research" [20:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [20:58:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T2100). [21:00:05] tgr, subbu, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:37] o/ [21:01:01] o/ [21:01:18] !log bking@cephosd1001:~$ sudo radosgw-admin quota set --quota-scope=user --uid=research --max-size=4T [21:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:28] I can deploy [21:02:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:02:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: 10C. Scott Ananian) [21:03:42] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for 33 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: 10C. Scott Ananian) [21:03:44] oh you are deploying mine first. let me set up wikimedia-debug testing. [21:04:00] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]] [21:04:03] T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272 [21:06:25] (03PS2) 10C. Scott Ananian: Turn on Parsoid Read Views for mobile wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) [21:06:47] !log tgr@deploy2002 cscott, tgr: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:36] I just noticed the second patch cscott had uploaded. but, we can wait for it till your other patches are done. [21:07:37] subbu: ^ [21:07:59] ok. will test. 
[21:08:24] do you mean we should deploy another patch as well? [21:09:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:16] it's no problem, just make sure it's added to the wiki page [21:09:54] that first patch lgtm. [21:10:42] (i am here) [21:11:54] added the second one too. [21:12:01] tgr|away, ^ [21:12:29] ok, I'll deploy them together then to save some time [21:12:36] sounds good. [21:12:41] !log tgr@deploy2002 Sync cancelled. [21:12:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550958 (10BTullis) Strangely, that file isn't displaying for me. It's showing as restricted. It might be an idea to... [21:13:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) (owner: 10C. Scott Ananian) [21:13:50] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for mobile wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) (owner: 10C. 
Scott Ananian) [21:14:09] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]] [21:14:13] T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272 [21:14:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:48] !log tgr@deploy2002 tgr, cscott: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:25] tgr|away, tested pages on a few wiktionaries ... lgtm. 
[21:19:41] !log tgr@deploy2002 tgr, cscott: Continuing with sync [21:22:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:17] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]] (duration: 12m 08s) [21:26:21] T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272 [21:26:25] thanks tgr|away ! [21:27:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: 10Jdlrobson) [21:28:14] (03Merged) 10jenkins-bot: Fix name of ABTestEnrollment configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: 10Jdlrobson) [21:28:31] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]] [21:28:34] T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019 [21:31:15] !log tgr@deploy2002 jdlrobson, tgr: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:32:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:35:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T383383#10551013 (10phaultfinder) [21:37:15] FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:24] (03CR) 10Gergő Tisza: [C:03+2] auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: 10Gergő Tisza) [21:37:28] (03CR) 10Gergő Tisza: [C:03+2] Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: 10Gergő Tisza) [21:37:32] (03CR) 10Gergő Tisza: [C:03+2] Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: 10Gergő Tisza) [21:38:18] Jdlrobson: does it look good? [21:39:05] tgr|away: looking almost done [21:41:00] yes please sync tgr|away ! [21:41:16] !log tgr@deploy2002 jdlrobson, tgr: Continuing with sync [21:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:26] (03CR) 10Xcollazo: [C:03+1] "Thank you for taking the time (and pain) to clean this up Amir." 
[dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup) [21:47:55] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]] (duration: 19m 24s) [21:47:59] T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019 [21:49:21] (03Merged) 10jenkins-bot: auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: 10Gergő Tisza) [21:49:23] (03Merged) 10jenkins-bot: Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: 10Gergő Tisza) [21:49:25] (03Merged) 10jenkins-bot: Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: 10Gergő Tisza) [21:50:52] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]] [21:50:59] T385566: SUL3: Transaction profiler warnings when logging in - https://phabricator.wikimedia.org/T385566 [21:50:59] T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261 [21:50:59] T364866: Adapt to changes in post-login/signup hooks after switching to a central login wiki - https://phabricator.wikimedia.org/T364866 [21:51:59] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. 
subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10551073 (10Krinkle) >>! In T214998#10548937, @Krinkle wrote: > […] our mobile user agent regex may not be as good as Google's... [21:52:39] thanks tgr|away ! [21:52:50] sure [21:53:33] !log tgr@deploy2002 tgr: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:56:37] 10ops-eqiad, 06DC-Ops, 06Discovery-Search, 06Research, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Relabel Elastic hosts to Relforge hosts - https://phabricator.wikimedia.org/T386358#10551110 (10bking) a:05bking→03None [21:59:02] !log tgr@deploy2002 tgr: Continuing with sync [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T2200) [22:00:17] Amir1: migrateESRefToContentTable is producing a big spike of "PHP Warning: fwrite() expects parameter 1 to be resource, bool given" [22:00:53] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [22:00:56] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [22:00:59] or zabe [22:01:07] yes sorry [22:01:09] that was me [22:01:22] I forgot to "chmod 666" the dump file [22:01:23] np, just wasn't sure you are aware [22:01:58] !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php diqwiki --skip /home/zabe/text_table_cleanup/diqwiki --dump /home/zabe/text_table_dump/diqwiki --sleep 0.5 --start 318769 # T183490 [22:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:01] T183490: MCR schema migration stage 
4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [22:05:56] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]] (duration: 15m 03s) [22:06:02] T385566: SUL3: Transaction profiler warnings when logging in - https://phabricator.wikimedia.org/T385566 [22:06:03] T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261 [22:06:03] T364866: Adapt to changes in post-login/signup hooks after switching to a central login wiki - https://phabricator.wikimedia.org/T364866 [22:09:49] !log UTC late deploys done [22:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:09] !log rzl@idp2004:~$ sudo systemctl restart tomcat10 [22:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:34] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551219 (10Dzahn) Confirming Arthur is already on the NDA tracking sheet, checking this box off. 
[22:16:12] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551220 (10Dzahn) [22:17:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551228 (10Dzahn) [22:18:53] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551233 (10Dzahn) [22:19:16] (03PS1) 10Bking: relforge: Prepare newly-reimaged relforge hosts to join the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) [22:19:37] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551239 (10Dzahn) @thcipriani Hello, here is a request for deployment access for your consideration. [22:19:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: 10Bking) [22:21:59] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551246 (10thcipriani) Thanks for the ping @Dzahn Request looks good to me, approved. Thanks for volunteering. Please reach out if you need anything. [22:22:12] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10551248 (10Dzahn) @EPIC Could you please send an email to Katie Francis (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) and tell her your real name? 
[22:22:15] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:22:43] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551250 (10Dzahn) [22:23:24] (03PS1) 10Jdlrobson: Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) [22:24:09] (03CR) 10Jdlrobson: "Amir: this can be deployed as soon as we're sure the 1.44.0-wmf.16 train won't roll back!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [22:32:11] (03CR) 10Ryan Kemper: [C:03+1] "Looks good; my +1 is contingent upon pcc coming back good once we've fixed the puppet facts issue" [puppet] - 10https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: 10Bking) [22:32:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: 10Bking) [22:35:10] (03CR) 10Bking: [C:03+2] relforge: Prepare newly-reimaged relforge hosts to join the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: 10Bking) [22:38:52] !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php ttwiki --skip /home/zabe/text_table_cleanup/ttwiki --dump /home/zabe/text_table_dump/ttwiki --sleep 0.5 --start 867501 # T183490 [22:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:56] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - 
https://phabricator.wikimedia.org/T183490 [22:44:39] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on relforge[1003-1007].eqiad.wmnet with reason: T386357 [22:44:42] T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357 [22:50:56] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 8, active_shards: 15, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight [22:50:56] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.75 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:19:05] (03CR) 10Bartosz Dziewoński: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1117924 (https://phabricator.wikimedia.org/T383952) (owner: 10Bartosz Dziewoński) [23:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551319 (10phaultfinder) [23:58:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state