[00:00:33] <logmsgbot>	 !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119233|Lazy Load Images (T366402)]], [[gerrit:1119234|Lazy Load Images (T366402)]] (duration: 31m 40s)
[00:00:54] <toyofuku>	 That's all for us today, thank you so much!
[00:02:34] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:02:51] <wikibugs>	 (CR) Zabe: [C:+2] Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119207 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[00:03:49] <wikibugs>	 (Merged) jenkins-bot: Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1119207 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[00:04:43] <logmsgbot>	 !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]]
[00:04:46] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:07:41] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:08:34] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:08:49] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[00:10:56] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:22] <logmsgbot>	 !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119207|Reduce revision-slots cache expiry to 60s on diqwiki and ttwiki (T183490)]] (duration: 10m 39s)
[00:15:26] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:20:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:08] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:32:40] <wikibugs>	 (PS4) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861)
[00:38:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239
[00:38:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239 (owner: TrainBranchBot)
[00:39:34] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:39:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10547180 (phaultfinder)
[00:39:38] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:39:58] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:45:34] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:46:35] <wikibugs>	 (CR) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: BryanDavis)
[00:47:50] <wikibugs>	 (PS2) BryanDavis: admin_ng: Switch on enableJobSidecarController for toolhub [deployment-charts] - https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861)
[00:47:50] <wikibugs>	 (PS5) BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861)
[00:50:53] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1119239 (owner: TrainBranchBot)
[01:08:43] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242
[01:08:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242 (owner: TrainBranchBot)
[01:23:40] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1119242 (owner: TrainBranchBot)
[01:32:45] <wikibugs>	 (PS1) Stang: Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243
[01:35:39] <zabe>	 !log zabe@mwmaint2002:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=OGPawlis --overwrite /tmp/uploads # T382976
[01:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:43] <stashbot>	 T382976: Server side upload for OGPawlis - https://phabricator.wikimedia.org/T382976
[01:35:53] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[01:42:42] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200799s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:45:18] <wikibugs>	 (PS1) TChin: Eventstreams: Bump image to v0.14.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769)
[01:47:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:12] <zabe>	 !log zabe@deploy2002:~$ mwscript-k8s --comment="T386292" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Sofia Baldelli' 'AnonymWikiuser 245'
[01:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:15] <stashbot>	 T386292: Unblock stuck global rename of AnonymWikiuser_245 and Renamed_user_9b7b870ac2b7d3f071232203ec1030d1 - https://phabricator.wikimedia.org/T386292
[01:48:21] <zabe>	 !log zabe@deploy2002:~$ mwscript-k8s --comment="T386292" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Nebuls' 'Renamed user 9b7b870ac2b7d3f071232203ec1030d1'
[01:48:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:16] <wikibugs>	 (PS2) Krinkle: tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T379661) (owner: Hokwelum)
[02:03:59] <wikibugs>	 (PS3) Krinkle: tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: Hokwelum)
[02:04:06] <wikibugs>	 (CR) Krinkle: [C:+1] tables-catalog: remove module_deps table [puppet] - https://gerrit.wikimedia.org/r/1118872 (https://phabricator.wikimedia.org/T385997) (owner: Hokwelum)
[02:17:56] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 16.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:07:19] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:19] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:39:09] <logmsgbot>	 !log tchin@deploy2002 Started deploy [airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896
[03:39:12] <stashbot>	 T306896: Integrate Spark with DataHub with lineage - https://phabricator.wikimedia.org/T306896
[03:40:04] <logmsgbot>	 !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896 (duration: 01m 07s)
[04:07:55] <kart_>	 Deploying cxserver. Minor changes.
[04:08:24] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-02-12-075258-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119052 (https://phabricator.wikimedia.org/T381943) (owner: KartikMistry)
[04:09:35] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-02-12-075258-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119052 (https://phabricator.wikimedia.org/T381943) (owner: KartikMistry)
[04:28:04] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[04:28:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[04:37:42] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[04:38:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[04:40:58] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[04:41:31] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[04:45:46] <kart_>	 !log Updated cxserver to 2025-02-12-075258-production (T381943)
[04:45:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:49] <stashbot>	 T381943: Swagger probe test for page fetch API is failing - https://phabricator.wikimedia.org/T381943
[04:54:05] <wikibugs>	 (CR) KartikMistry: [C:+2] Update MinT to 2025-02-05-115716-production [deployment-charts] - https://gerrit.wikimedia.org/r/1115314 (https://phabricator.wikimedia.org/T383750) (owner: KartikMistry)
[04:55:20] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2025-02-05-115716-production [deployment-charts] - https://gerrit.wikimedia.org/r/1115314 (https://phabricator.wikimedia.org/T383750) (owner: KartikMistry)
[05:24:27] <wikibugs>	 (PS3) Anzx: Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126)
[05:26:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[05:47:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0700).
[07:07:19] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:19] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0800).
[08:00:05] <jouncebot>	 gmodena, koi, and anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:14] <dcausse>	 o/
[08:00:42] <koi>	 o/
[08:00:46] <anzx>	 o/
[08:02:07] <gmodena>	 o/
[08:02:45] <gmodena>	 I'll be doing my deployment together with dcausse 
[08:04:12] <dcausse>	 koi, anzx we'll do your patches first
[08:04:40] <anzx>	 ok
[08:05:13] <wikibugs>	 (CR) DCausse: [C:+1] Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:07:58] <koi>	 ok, i'm here
[08:08:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:09:15] <wikibugs>	 (Merged) jenkins-bot: Revert "zhwiki: Add 2025 CNY celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119243 (owner: Stang)
[08:09:45] <wikibugs>	 (CR) DCausse: [C:-1] Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:09:57] <dcausse>	 anzx: I think there's a small issue in your patch
[08:10:20] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]]
[08:11:50] <dcausse>	 anzx: my bad it's two different days
[08:12:26] <anzx>	 dcausse: ok
[08:12:56] <wikibugs>	 (CR) DCausse: [C:+1] Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:13:48] <logmsgbot>	 !log dcausse@deploy2002 stang, dcausse: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:17] <dcausse>	 koi: could you test your change?
[08:14:27] <koi>	 yeah
[08:14:56] <koi>	 looking
[08:15:57] <koi>	 dcausse, i see the logo back to the normal one, so LGTM
[08:17:00] <dcausse>	 koi: thanks, deploying
[08:17:03] <logmsgbot>	 !log dcausse@deploy2002 stang, dcausse: Continuing with sync
[08:18:11] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] airflow-analytics: fix typo in config [puppet] - https://gerrit.wikimedia.org/r/1119185 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[08:23:45] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119243|Revert "zhwiki: Add 2025 CNY celebration logos"]] (duration: 13m 24s)
[08:23:57] <dcausse>	 koi: should be live
[08:24:40] <dcausse>	 anzx: shipping your patch now, I guess we can't really test it on debug servers?
[08:24:56] <anzx>	 yeah no test needed
[08:25:36] <koi>	 ty
[08:25:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:26:42] <wikibugs>	 (Merged) jenkins-bot: Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1118984 (https://phabricator.wikimedia.org/T386126) (owner: Anzx)
[08:27:09] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1118984|Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (T386126)]]
[08:27:12] <stashbot>	 T386126: Lift IP for a edit-a-thon in Jujuy, Argentina (2025-02-17 & 2025-03-10) - https://phabricator.wikimedia.org/T386126
[08:27:15] <wikibugs>	 SRE, Observability-Metrics: Use white version of Wikimedia logo for grafana - https://phabricator.wikimedia.org/T226970#10547912 (Volker_E)
[08:36:54] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118984|Lift IP cap for edit-a-thon on 2025-02-17 & 2025-03-10 (T386126)]] (duration: 09m 45s)
[08:36:57] <stashbot>	 T386126: Lift IP for a edit-a-thon in Jujuy, Argentina (2025-02-17 & 2025-03-10) - https://phabricator.wikimedia.org/T386126
[08:37:05] <anzx>	 dcausse: thanks 
[08:37:55] <dcausse>	 anzx: yw!
[08:38:17] <dcausse>	 gmodena: we're next
[08:38:25] <gmodena>	 dcausse ack
[08:40:40] <wikibugs>	 (CR) DCausse: [C:+1] cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:41:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:41:51] <wikibugs>	 (Merged) jenkins-bot: cirrus: enable mlr-2025 for select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1118785 (https://phabricator.wikimedia.org/T385972) (owner: Gmodena)
[08:42:22] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]]
[08:42:25] <stashbot>	 T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[08:44:50] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 2904 MB (3% inode=98%): /tmp 2904 MB (3% inode=98%): /var/tmp 2904 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:45:24] <logmsgbot>	 !log dcausse@deploy2002 dcausse, gmodena: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:50:58] <wikibugs>	 (PS1) Brouberol: airflow-analytics: upgarde the airflow deb package to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092)
[08:51:19] <wikibugs>	 (CR) CI reject: [V:-1] airflow-analytics: upgarde the airflow deb package to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[08:54:55] <logmsgbot>	 !log dcausse@deploy2002 dcausse, gmodena: Continuing with sync
[08:59:31] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:00:04] <jouncebot>	 andre and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T0900).
[09:01:28] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118785|cirrus: enable mlr-2025 for select wikis (T385972)]] (duration: 19m 06s)
[09:01:31] <stashbot>	 T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[09:02:19] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:03:35] <dcausse>	 !log closing UTC morning backport window
[09:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:31] <wikibugs>	 (PS2) Brouberol: airflow-analytics: upgrade airflow to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092)
[09:25:36] <wikibugs>	 (CR) Stevemunene: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[09:26:01] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-analytics: upgrade airflow to get a new confluent-kafka version [puppet] - https://gerrit.wikimedia.org/r/1119456 (https://phabricator.wikimedia.org/T386092) (owner: Brouberol)
[09:30:36] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367)
[09:30:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:31:23] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/1119461 (https://phabricator.wikimedia.org/T382367) (owner: TrainBranchBot)
[09:40:39] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.16  refs T382367
[09:40:42] <stashbot>	 T382367: 1.44.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T382367
[09:47:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:28] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:58:28] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[10:11:22] <wikibugs>	 (CR) Stevemunene: [C:+2] Change dse-k8s-worker1003 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119103 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[10:14:47] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[10:23:23] <wikibugs>	 (PS1) Kosta Harlan: ApiPageTriageList: Check that $user is defined before using it [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332)
[10:35:11] <wikibugs>	 (CR) Aklapper: [C:+2] ApiPageTriageList: Check that $user is defined before using it [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332) (owner: Kosta Harlan)
[10:36:25] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] "+2/V this backport because https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/1119468 got merged" [extensions/PageTriage] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119471 (https://phabricator.wikimedia.org/T386332) (owner: Kosta Harlan)
[10:38:25] <logmsgbot>	 !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]]
[10:38:28] <stashbot>	 T386332: Error: Call to a member function isTemp() on null - https://phabricator.wikimedia.org/T386332
[10:41:04] <logmsgbot>	 !log aklapper@deploy2002 kharlan, aklapper: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:42:32] <logmsgbot>	 !log aklapper@deploy2002 kharlan, aklapper: Continuing with sync
[10:44:02] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2]: Analytics one-off deploy [analytics/refinery@08b2bd2e]
[10:46:10] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2]: Analytics one-off deploy [analytics/refinery@08b2bd2e] (duration: 02m 07s)
[10:46:27] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2] (thin): Analytics one-off deploy -THIN [analytics/refinery@08b2bd2e]
[10:47:13] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2] (thin): Analytics one-off deploy -THIN [analytics/refinery@08b2bd2e] (duration: 00m 46s)
[10:47:22] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[10:47:27] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@08b2bd2] (hadoop-test): Analytics one-off deploy - TEST [analytics/refinery@08b2bd2e]
[10:48:11] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@08b2bd2] (hadoop-test): Analytics one-off deploy - TEST [analytics/refinery@08b2bd2e] (duration: 00m 44s)
[10:49:12] <logmsgbot>	 !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119471|ApiPageTriageList: Check that $user is defined before using it (T386332)]] (duration: 10m 47s)
[10:49:15] <stashbot>	 T386332: Error: Call to a member function isTemp() on null - https://phabricator.wikimedia.org/T386332
[10:50:48] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1003.eqiad.wmnet with reason: host reimage
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1100)
[11:07:19] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:54] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet with OS bookworm
[11:12:27] <wikibugs>	 (PS1) FNegri: toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885)
[11:13:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: docker.service on ml-lab1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10548759 (phaultfinder)
[11:38:30] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database kncwiki (T385188)
[11:38:33] <stashbot>	 T385188: [wikireplicas] Create views for new wiki kncwiki - https://phabricator.wikimedia.org/T385188
[12:04:31] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database kncwiki (T385188)
[12:04:34] <stashbot>	 T385188: [wikireplicas] Create views for new wiki kncwiki - https://phabricator.wikimedia.org/T385188
[12:06:34] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-02-13-102531-production [deployment-charts] - https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943)
[12:13:53] <wikibugs>	 (PS1) Arthur taylor: Add smartcard SSH key for arthurtaylor [puppet] - https://gerrit.wikimedia.org/r/1119482
[12:17:21] <wikibugs>	 (CR) Ladsgroup: [C:+2] "Deploying this" [dumps] - https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:18:04] <wikibugs>	 (Merged) jenkins-bot: Stop producing Yahoo! abstract dumps [dumps] - https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[12:18:14] <stevemunene>	 !log draining dse-k8s-worker1004 ready for reimage to bookworm and containerd for T377875
[12:18:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:17] <stashbot>	 T377875: Migrate dse-k8s cluster from docker to containerd - https://phabricator.wikimedia.org/T377875
[12:19:27] <logmsgbot>	 !log ladsgroup@deploy2002 Started deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069)
[12:19:30] <stashbot>	 T382069: Undeploy and archive ActiveAbstract - https://phabricator.wikimedia.org/T382069
[12:19:35] <logmsgbot>	 !log ladsgroup@deploy2002 Finished deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069) (duration: 00m 07s)
[12:20:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349 (10ArthurTaylor) 03NEW
[12:23:29] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "FWIW, it should be possible to use Ed25519 keys with a YubiKey too (I did it two months ago when I had to replace my YubiKey, see [Matterm" [puppet] - 10https://gerrit.wikimedia.org/r/1119482 (owner: 10Arthur taylor)
[12:25:00] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm
[12:29:38] <wikibugs>	 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10548938 (10Krinkle) ## Status Update: Dec 2024 - Jan 2025  A few updates on this proposal:  In December 2024, @SCherukuwada ment...
[12:29:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10548939 (10phaultfinder)
[12:31:50] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Change dse-k8s-worker1004 to use containerd [puppet] - 10https://gerrit.wikimedia.org/r/1119104 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene)
[12:37:30] <wikibugs>	 (03PS1) 10Ladsgroup: Remove abstract dumps infrastructure [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069)
[12:37:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove abstract dumps infrastructure [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[12:39:00] <wikibugs>	 (03PS2) 10Ladsgroup: Remove abstract dumps infrastructure [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069)
[12:41:36] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[12:45:10] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1004.eqiad.wmnet with reason: host reimage
[12:52:01] <kart_>	 Another quick cxserver deployment..
[12:52:26] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-02-13-102531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943) (owner: 10KartikMistry)
[12:53:33] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2025-02-13-102531-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119478 (https://phabricator.wikimedia.org/T381943) (owner: 10KartikMistry)
[12:56:00] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:56:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1300)
[13:01:23] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[13:01:52] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[13:02:19] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:02:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[13:02:41] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet with OS bookworm
[13:03:06] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[13:04:01] <kart_>	 !log Updated Cxserver to 2025-02-13-102531-production (T381943, T386231)
[13:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:05] <stashbot>	 T381943: Swagger probe test for page fetch API is failing - https://phabricator.wikimedia.org/T381943
[13:04:06] <stashbot>	 T386231: /v2/translate API fails to perform any translation when no provider is provided - https://phabricator.wikimedia.org/T386231
[13:15:46] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10549099 (10karapayneWMDE) as the EM of Wikidata at WMDE, I approve this request!
[13:27:07] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm
[13:32:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:35:34] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:36:51] <wikibugs>	 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10549117 (10Krinkle) >>! In T214998#10548937, @Krinkle wrote: > Historically, Google Search has supported this divide natively. E...
[13:37:19] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:38:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:39:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:41:37] <wikibugs>	 (03CR) 10CDanis: [C:03+2] aux-k8s-eqiad: add RBD-backed persistence (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[13:42:06] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:45:20] <icinga-wm>	 PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 414MiB (2% inode=33%): /tmp 414MiB (2% inode=33%): /var/tmp 414MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[13:45:32] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[13:46:07] <wikibugs>	 (03Merged) 10jenkins-bot: aux-k8s-eqiad: add RBD-backed persistence [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118796 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[13:46:52] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:47:06] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_nologinattempt) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[13:47:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:48:02] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:48:03] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1005.eqiad.wmnet with reason: host reimage
[13:58:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549187 (10YLiou_WMF) 05Resolved→03Open
[13:59:44] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:00:00] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:22] <Lucas_WMDE>	 I probably can’t deploy today
[14:00:34] <Lucas_WMDE>	 I might do some backports later but that’s it
[14:00:41] <Lucas_WMDE>	 (so it’s good there’s nothing in the queue anyway ^^)
[14:05:20] <icinga-wm>	 RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[14:06:21] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1005.eqiad.wmnet with OS bookworm
[14:16:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357
[14:16:19] <stashbot>	 T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357
[14:16:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357
[14:16:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549226 (10YLiou_WMF) Unfortunately, I'm reopening this as I'm experiencing an issue logging into JupyterHub. This app...
[14:18:09] <wikibugs>	 (03PS1) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541)
[14:18:59] <wikibugs>	 (03PS2) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541)
[14:19:45] <wikibugs>	 (03PS3) 10CDanis: ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541)
[14:19:52] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[14:19:53] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[14:21:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[14:22:07] <wikibugs>	 (03CR) 10CDanis: [C:03+2] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[14:29:23] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Add config option to make somevalue hashes use URI [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344)
[14:29:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:40] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10549272 (10Andrew) p:05Triage→03Medium Update: gitlab is making daily snapshot builds and uploading them to quay.io -- the builds fail now and...
[14:29:40] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Make somevalue hashes use URI in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344)
[14:29:57] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Add config option to fix s:, ref:, v: namespace prefix [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344)
[14:30:21] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Fix s:, ref:, v: namespace prefix in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344)
[14:30:43] <Lucas_WMDE>	 I’ll probably backport these ^ soon but let’s see CI pass first
[14:30:45] <Lucas_WMDE>	 jouncebot: nowandnext
[14:30:45] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400)
[14:30:45] <jouncebot>	 In 1 hour(s) and 29 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600)
[14:32:14] <wikibugs>	 (03CR) 10Btullis: [C:03+1] ceph: add firewall rules for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1119504 (https://phabricator.wikimedia.org/T380541) (owner: 10CDanis)
[14:35:24] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1006.eqiad.wmnet with OS bookworm
[14:35:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:13] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549344 (10BTullis) It looks like the cause of this is that the `yliou` account is not a member of either the `wmf` or...
[14:41:20] <wikibugs>	 (03PS1) 10Gergő Tisza: auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566)
[14:41:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: 10Gergő Tisza)
[14:48:22] <wikibugs>	 (03CR) 10Jforrester: "Should we wait for I86a2efdf7d24c1fe7573b047aa3b804a7f831af5 to have a month in prod and not break the world first?" [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[14:53:43] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10549415 (10BTullis) I added the record from `mwmaint1002` with the following command: ` btullis@mwmaint2002:~$ sudo mo...
[14:53:57] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[14:54:51] <wikibugs>	 (03CR) 10Ladsgroup: "Yup, that's the plan" [dumps] - 10https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[14:57:02] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1006.eqiad.wmnet with reason: host reimage
[14:59:15] <Lucas_WMDE>	 jouncebot: nowandnext
[14:59:15] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1400)
[14:59:15] <jouncebot>	 In 1 hour(s) and 0 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600)
[14:59:47] <Lucas_WMDE>	 I’ll go ahead with my backports then (they’re expected to have no effect, just preparing for a config change that we’ll probably do on Monday)
[15:00:44] <wikibugs>	 (03PS1) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752)
[15:01:50] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[15:02:38] <Lucas_WMDE>	 FTR, scap backport warns me that one of the changes depends on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1118147 but that’s not on the same branch
[15:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:47] <Lucas_WMDE>	 which should be fine because it was already merged in time for wmf.16
[15:03:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:03:05] <Lucas_WMDE>	 ditto for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1118120
[15:03:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:03:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:03:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:03:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:03:23] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:03:35] <Lucas_WMDE>	 (I suppose I could’ve removed the Depends-On from the commit messages 🤔)
[15:03:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357
[15:03:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357
[15:03:56] <stashbot>	 T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357
[15:05:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:06:34] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] admin_ng: Switch on enableJobSidecarController for toolhub [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119231 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis)
[15:07:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1219', diff saved to https://phabricator.wikimedia.org/P73462 and previous config saved to /var/cache/conftool/dbconfig/20250213-150715-marostegui.json
[15:07:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1219.eqiad.wmnet
[15:10:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:11:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2146', diff saved to https://phabricator.wikimedia.org/P73463 and previous config saved to /var/cache/conftool/dbconfig/20250213-151117-marostegui.json
[15:12:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2146.codfw.wmnet
[15:12:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:13:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1219.eqiad.wmnet
[15:15:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Index rebuild
[15:15:49] <wikibugs>	 (03Merged) 10jenkins-bot: Add config option to make somevalue hashes use URI [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119508 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:15:49] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1006.eqiad.wmnet with OS bookworm
[15:17:04] <wikibugs>	 (03Merged) 10jenkins-bot: Make somevalue hashes use URI in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119509 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:17:35] <wikibugs>	 (03Merged) 10jenkins-bot: Add config option to fix s:, ref:, v: namespace prefix [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119510 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:17:37] <wikibugs>	 (03Merged) 10jenkins-bot: Fix s:, ref:, v: namespace prefix in tests [extensions/Wikibase] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1119511 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[15:17:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1104-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:17:57] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:18:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]]
[15:18:05] <stashbot>	 T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344
[15:18:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2146.codfw.wmnet
[15:19:32] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1104-1106].eqiad.wmnet with reason: T386357
[15:19:36] <stashbot>	 T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357
[15:19:38] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: maintenance
[15:20:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885) (owner: 10FNegri)
[15:20:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:21:03] <Lucas_WMDE>	 testing that nothing changes
[15:21:15] <wikibugs>	 (03PS2) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752)
[15:21:53] <Lucas_WMDE>	 yup, lgtm
[15:22:01] <wikibugs>	 (03PS3) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752)
[15:22:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[15:22:50] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin)
[15:22:55] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[15:23:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:23:31] <wikibugs>	 (03CR) 10TChin: [C:03+2] Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin)
[15:24:44] <wikibugs>	 (03Merged) 10jenkins-bot: Eventstreams: Bump image to v0.14.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119245 (https://phabricator.wikimedia.org/T361769) (owner: 10TChin)
[15:27:44] <wikibugs>	 (03PS1) 10Marostegui: s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529
[15:28:11] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:28:27] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529 (owner: 10Marostegui)
[15:29:11] <wikibugs>	 (03Merged) 10jenkins-bot: s1-pager.sql: Remove file [software] - 10https://gerrit.wikimedia.org/r/1119529 (owner: 10Marostegui)
[15:29:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119508|Add config option to make somevalue hashes use URI (T384344)]], [[gerrit:1119509|Make somevalue hashes use URI in tests (T384344)]], [[gerrit:1119510|Add config option to fix s:, ref:, v: namespace prefix (T384344)]], [[gerrit:1119511|Fix s:, ref:, v: namespace prefix in tests (T384344)]] (duration: 11m 19s)
[15:29:24] <stashbot>	 T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344
[15:29:33] * Lucas_WMDE done deploying
[15:29:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10549646 (10phaultfinder)
[15:29:38] <wikibugs>	 (03PS4) 10Bking: relforge/elastic: repurpose elastic hosts for relforge [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752)
[15:35:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:36:42] <wikibugs>	 (03CR) 10Marostegui: "It would be interesting if you can add some description to what has been done in the commit message. As it is a very big change, it would " [cookbooks] - 10https://gerrit.wikimedia.org/r/1118099 (owner: 10Federico Ceratto)
[15:37:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] toolsdb_apt_pinning: enable manual 10.6 upgrades [puppet] - 10https://gerrit.wikimedia.org/r/1119473 (https://phabricator.wikimedia.org/T385885) (owner: 10FNegri)
[15:39:51] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "LGTM. Do we need to depool the servers before removing them from conftool-data ?" [puppet] - 10https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking)
[15:40:53] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:41:07] <wikibugs>	 (CR) Bking: [V:+2 C:+2] "Indeed we do. I have already depooled them." [puppet] - https://gerrit.wikimedia.org/r/1119520 (https://phabricator.wikimedia.org/T380752) (owner: Bking)
[16:00:04] <jouncebot>	 andre and jnuche: OwO what's this, a deployment window?? Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1600). nyaa~
[16:01:53] <zabe>	 hnowlan: could you deploy https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1115587 for me? (I think I do not have the permissions to do it myself)
[16:08:09] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:08:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:14:32] <wikibugs>	 (PS1) Gergő Tisza: Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261)
[16:15:15] <claime>	 zabe: SRE is offsite at the moment so it may take a while for h.nowlan to respond. I can deploy if you want, but I don't have the domain knowledge to test the change afterwards
[16:15:35] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[16:15:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1104 to relforge1005
[16:16:01] <wikibugs>	 (CR) CI reject: [V:-1] Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[16:16:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:16:20] <zabe>	 I can "test" it (testing would just be to check that https://knc.wikipedia.org/api/rest_v1/ works)
[16:16:29] <zabe>	 claime: ^
[16:19:36] <claime>	 zabe: do you have +2 rights? I can +1 it but I'm uncomfortable doing +1/+2 and merging alone
[16:19:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1104 to relforge1005 - bking@cumin2002"
[16:20:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1104 to relforge1005 - bking@cumin2002"
[16:20:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1005
[16:20:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1005
[16:20:37] <zabe>	 claime: yes, I am just missing restbase-admins (or whatever else is needed) to deploy it
[16:20:44] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1007.eqiad.wmnet with OS bookworm
[16:20:51] <zabe>	 hit +2
[16:20:52] <claime>	 zabe: ok cool
[16:21:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1104 to relforge1005
[16:21:28] <wikibugs>	 (PS1) Gergő Tisza: Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866)
[16:22:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1105 to relforge1006
[16:22:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:23:00] <logmsgbot>	 !log cgoubert@deploy2002 Started deploy [restbase/deploy@511b3a4]: Add kncwiki (T385186)
[16:23:03] <stashbot>	 T385186: Add kncwiki to RESTBase - https://phabricator.wikimedia.org/T385186
[16:24:02] <wikibugs>	 (CR) Gergő Tisza: "recheck" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[16:27:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1105 to relforge1006 - bking@cumin2002"
[16:27:47] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: Gergő Tisza)
[16:27:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1105 to relforge1006 - bking@cumin2002"
[16:27:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1006
[16:28:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1006
[16:28:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1105 to relforge1006
[16:29:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1106 to relforge1007
[16:29:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[16:29:26] <claime>	 zabe: it's deploying, slowly, but it is in progress
[16:29:42] <zabe>	 thanks :)
[16:32:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1106 to relforge1007 - bking@cumin2002"
[16:33:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1106 to relforge1007 - bking@cumin2002"
[16:33:28] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:33:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host relforge1007
[16:33:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host relforge1007
[16:34:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1106 to relforge1007
[16:35:22] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550138 (YLiou_WMF) @BTullis the yliou account seems to work to login to jupyterhub now! Thank you for all your help!
[16:35:38] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550139 (YLiou_WMF) Open→Resolved
[16:35:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1005.eqiad.wmnet with OS bullseye
[16:36:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1006.eqiad.wmnet with OS bullseye
[16:37:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye
[16:38:54] <logmsgbot>	 !log cgoubert@deploy2002 Finished deploy [restbase/deploy@511b3a4]: Add kncwiki (T385186) (duration: 15m 54s)
[16:38:58] <stashbot>	 T385186: Add kncwiki to RESTBase - https://phabricator.wikimedia.org/T385186
[16:39:09] <claime>	 zabe: looks like it's done, and the api doc page loads for me
[16:39:12] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[16:39:22] <zabe>	 yep, thanks for your help
[16:39:32] <claime>	 np :)
[16:42:35] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1007.eqiad.wmnet with reason: host reimage
[16:43:58] <bd808>	 I just noticed that stashbot recently exceeded 400,000 edits on Wikitech. Congratulations little bot. https://meta.wikimedia.org/wiki/Special:CentralAuth/Stashbot
[16:44:19] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-03-14 16:43:55. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:44:47] <icinga-wm>	 ACKNOWLEDGEMENT - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly Clément Goubert T383032 - The acknowledgement expires at: 2025-03-14 16:44:36. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:49:29] <wikibugs>	 (PS1) CDanis: add fault-tolerance namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1119534
[16:49:29] <wikibugs>	 (PS1) CDanis: WIP: initial chart for fault-tolerance tool [deployment-charts] - https://gerrit.wikimedia.org/r/1119535
[16:50:25] <wikibugs>	 (CR) CI reject: [V:-1] WIP: initial chart for fault-tolerance tool [deployment-charts] - https://gerrit.wikimedia.org/r/1119535 (owner: CDanis)
[16:53:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad
[16:53:58] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad
[16:55:16] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1153.eqiad.wmnet with reason: maintenance
[16:58:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2243.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:59:50] <wikibugs>	 (PS1) Sergio Gimeno: beta: A/B test setup for surfacing structured tasks [mediawiki-config] - https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903)
[17:01:27] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1007.eqiad.wmnet with OS bookworm
[17:02:19] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:03:06] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[17:03:51] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[17:04:08] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[17:04:50] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[17:13:08] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm
[17:16:12] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390 (RobH) NEW
[17:16:41] <wikibugs>	 ops-eqiad, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10550310 (RobH)
[17:25:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10550329 (phaultfinder)
[17:31:32] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage
[17:34:40] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1008.eqiad.wmnet with reason: host reimage
[17:34:51] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1005.eqiad.wmnet with OS bullseye
[17:34:58] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[17:35:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1006.eqiad.wmnet with OS bullseye
[17:35:27] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[17:35:40] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host relforge1007.eqiad.wmnet with OS bullseye
[17:37:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10550376 (BTullis) Thanks @RobH - I have been discussing with the team the approach that we would like to take around this and I th...
[17:38:27] <wikibugs>	 (PS2) Subramanya Sastry: Turn on Parsoid Read Views for 33 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: C. Scott Ananian)
[17:42:24] <wikibugs>	 (PS1) Andrew Bogott: wmcs-policy-tests: make all tests pass in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1119541
[17:42:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1005.eqiad.wmnet with OS bullseye
[17:43:53] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmcs-policy-tests: make all tests pass in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1119541 (owner: Andrew Bogott)
[17:47:19] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:53:47] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1008.eqiad.wmnet with OS bookworm
[17:56:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye
[17:57:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1006.eqiad.wmnet with OS bullseye
[17:58:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage
[18:00:05] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T1800)
[18:03:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1005.eqiad.wmnet with reason: host reimage
[18:04:37] <wikibugs>	 (CR) Stevemunene: [C:+2] Change dse-k8s-worker1009 to use containerd [puppet] - https://gerrit.wikimedia.org/r/1119105 (https://phabricator.wikimedia.org/T377875) (owner: Stevemunene)
[18:04:53] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[18:05:16] <logmsgbot>	 !log tchin@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[18:05:33] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[18:07:14] <wikibugs>	 (PS1) NMW03: Allow sysops to add/remove "confirmed" on English Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313)
[18:13:05] <wikibugs>	 (PS1) NMW03: Add "suppressredirect" to "editor" on Russian Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367)
[18:14:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119546 (https://phabricator.wikimedia.org/T386313) (owner: NMW03)
[18:14:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119548 (https://phabricator.wikimedia.org/T386367) (owner: NMW03)
[18:18:39] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage
[18:20:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1005.eqiad.wmnet with OS bullseye
[18:20:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73465 and previous config saved to /var/cache/conftool/dbconfig/20250213-182026-root.json
[18:20:30] <icinga-wm>	 RECOVERY - Host ms-be2075 is UP: PING WARNING - Packet loss = 75%, RTA = 30.33 ms
[18:21:22] <icinga-wm>	 PROBLEM - SSH on ms-be2075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:21:38] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[18:22:11] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1009.eqiad.wmnet with reason: host reimage
[18:22:20] <logmsgbot>	 !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[18:26:54] <icinga-wm>	 PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[18:28:30] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1006.eqiad.wmnet with OS bullseye
[18:28:31] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host relforge1007.eqiad.wmnet with OS bullseye
[18:35:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73466 and previous config saved to /var/cache/conftool/dbconfig/20250213-183531-root.json
[18:39:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1007.eqiad.wmnet with OS bullseye
[18:39:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host relforge1006.eqiad.wmnet with OS bullseye
[18:40:00] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1009.eqiad.wmnet with OS bookworm
[18:40:53] <wikibugs>	 (PS1) Andrew Bogott: wmcs-policy-tests: make work in codfw1dev too [puppet] - https://gerrit.wikimedia.org/r/1119551
[18:42:40] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmcs-policy-tests: make work in codfw1dev too [puppet] - https://gerrit.wikimedia.org/r/1119551 (owner: Andrew Bogott)
[18:50:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73467 and previous config saved to /var/cache/conftool/dbconfig/20250213-185036-root.json
[18:54:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73469 and previous config saved to /var/cache/conftool/dbconfig/20250213-185433-root.json
[18:54:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1007.eqiad.wmnet with reason: host reimage
[18:55:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on relforge1006.eqiad.wmnet with reason: host reimage
[18:58:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1007.eqiad.wmnet with reason: host reimage
[18:59:56] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:00:54] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[19:00:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on relforge1006.eqiad.wmnet with reason: host reimage
[19:01:40] <logmsgbot>	 !log tchin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[19:05:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73470 and previous config saved to /var/cache/conftool/dbconfig/20250213-190542-root.json
[19:06:34] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:08:36] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73471 and previous config saved to /var/cache/conftool/dbconfig/20250213-190938-root.json
[19:15:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1007.eqiad.wmnet with OS bullseye
[19:18:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host relforge1006.eqiad.wmnet with OS bullseye
[19:20:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73472 and previous config saved to /var/cache/conftool/dbconfig/20250213-192047-root.json
[19:21:26] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:21:26] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:21:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:24:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73473 and previous config saved to /var/cache/conftool/dbconfig/20250213-192444-root.json
[19:27:05] <wikibugs>	 (CR) RLazarus: [C:+2] Reapply "Use new 'auth' docroot for the auth domain" [puppet] - https://gerrit.wikimedia.org/r/1117924 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[19:27:27] <rzl>	 sneaking out an apache change
[19:31:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:31:26] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:32:00] <wikibugs>	 (PS1) Jdlrobson: Fix name of ABTestEnrollment configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019)
[19:32:07] <rzl>	 ^ expected, transient from version skew
[19:32:10] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:34:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:34:31] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:49] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: T383952, T384137
[19:35:52] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:35:55] <stashbot>	 T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952
[19:35:56] <stashbot>	 T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137
[19:36:58] <logmsgbot>	 !log rzl@deploy2002 rzl: T383952, T384137 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:37:44] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:38:19] <logmsgbot>	 !log rzl@deploy2002 rzl: Continuing with sync
[19:39:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73474 and previous config saved to /var/cache/conftool/dbconfig/20250213-193949-root.json
[19:44:03] <logmsgbot>	 !log rzl@deploy2002 Finished scap sync-world: T383952, T384137 (duration: 11m 16s)
[19:44:07] <stashbot>	 T383952: Auth.wikimedia.org circular errors - https://phabricator.wikimedia.org/T383952
[19:44:07] <stashbot>	 T384137: Set up robots.txt in auth.wikimedia.org - https://phabricator.wikimedia.org/T384137
[19:49:49] <rzl>	 (done, hanging out in case of problems but everything looks good)
[19:54:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73475 and previous config saved to /var/cache/conftool/dbconfig/20250213-195454-root.json
[19:55:23] <wikibugs>	 (CR) Bernard Wang: [C:+1] Fix name of ABTestEnrollment configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: Jdlrobson)
[19:55:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: Jdlrobson)
[19:56:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:56:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:56:37] <claime>	 rzl: restarting the httpbb systemd timers to clear up alerts btw
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:58:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:58:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:59:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:59:17] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[19:59:29] <rzl>	 claime: sure, thanks
[19:59:31] <jinxer-wm>	 RESOLVED: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:03:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:03:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:05:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:05:49] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:07:37] <urbanecm>	 !log mwscript-k8s --attach extensions/Translate/scripts/moveTranslatableBundle.php -- --wiki=metawiki 'Wiki_Movement_Brazil_User_Group' 'Wikimedia Brasil' 'Martin Urbanec' --reason='per [[special:Permalink/28261149#Wikimedia_Brasil|request]] ([[:phab:T386402]])' # T386402
[20:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:40] <stashbot>	 T386402: Request to move translatable page: Wiki_Movement_Brazil_User_Group at Meta-Wiki - https://phabricator.wikimedia.org/T386402
[20:09:03] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:10:48] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:14:07] <wikibugs>	 (CR) Stoyofuku-wmf: [C:+1] "Looks good 😭 thanks for catching this so fast" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: Jdlrobson)
[20:23:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:23:36] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:23:40] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@092b9d3]: deploy latest DAGs to analytics Airflow instance. T386114.
[20:23:44] <stashbot>	 T386114: DAG failing due to failure to acquire lock on wmf_data_ops.data_quality_metrics table - https://phabricator.wikimedia.org/T386114
[20:23:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:23:58] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:24:13] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@092b9d3]: deploy latest DAGs to analytics Airflow instance. T386114. (duration: 00m 33s)
[20:24:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:24:16] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:29:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10550789 (phaultfinder)
[20:31:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:31:32] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:31:36] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:31:36] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:32:22] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:32:30] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:33:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:33:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:35:28] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:35:28] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:36:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:36:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:49:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:49:42] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:50:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:51:02] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550873 (YLiou_WMF) Resolved→Open
[20:51:18] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:54:40] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550911 (YLiou_WMF) Unfortunately I'm now experiencing a separate issue! I'm trying to install R and am receiving th...
[20:56:13] <inflatador>	 !log bking@cephosd1001:~$ sudo radosgw-admin user create --uid=research --display-name="research"
[20:56:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[20:58:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T2100).
[21:00:05] <jouncebot>	 tgr, subbu, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:37] <subbu>	 o/
[21:01:01] <tgr|away>	 o/
[21:01:18] <inflatador>	 !log bking@cephosd1001:~$ sudo radosgw-admin quota set --quota-scope=user --uid=research --max-size=4T
[21:01:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:28] <tgr|away>	 I can deploy
[21:02:20] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:02:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: C. Scott Ananian)
[21:03:42] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid Read Views for 33 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1119215 (https://phabricator.wikimedia.org/T386272) (owner: C. Scott Ananian)
[21:03:44] <subbu>	 oh you are deploying mine first. let me set up wikimedia-debug testing.
[21:04:00] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]]
[21:04:03] <stashbot>	 T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272
[21:06:25] <wikibugs>	 (PS2) C. Scott Ananian: Turn on Parsoid Read Views for mobile wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272)
[21:06:47] <logmsgbot>	 !log tgr@deploy2002 cscott, tgr: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:07:36] <subbu>	 I just noticed the second patch cscott had uploaded. but, we can wait for it till your other patches are done.
[21:07:37] <tgr|away>	 subbu: ^
[21:07:59] <subbu>	 ok. will test.
[21:08:24] <tgr|away>	 do you mean we should deploy another patch as well?
[21:09:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:09:16] <tgr|away>	 it's no problem, just make sure it's added to the wiki page
[21:09:54] <subbu>	 that first patch lgtm.
[21:10:42] <Jdlrobson>	 (i am here)
[21:11:54] <subbu>	 added the second one too.
[21:12:01] <subbu>	 tgr|away, ^
[21:12:29] <tgr|away>	 ok, I'll deploy them together then to save some time
[21:12:36] <subbu>	 sounds good.
[21:12:41] <logmsgbot>	 !log tgr@deploy2002 Sync cancelled.
[21:12:45] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10550958 (BTullis) Strangely, that file isn't displaying for me. It's showing as restricted.  It might be an idea to...
[21:13:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) (owner: C. Scott Ananian)
[21:13:50] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid Read Views for mobile wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1119216 (https://phabricator.wikimedia.org/T386272) (owner: C. Scott Ananian)
[21:14:09] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]]
[21:14:13] <stashbot>	 T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272
[21:14:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:16:48] <logmsgbot>	 !log tgr@deploy2002 tgr, cscott: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:17:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:19:25] <subbu>	 tgr|away, tested pages on a few wiktionaries ... lgtm.
[21:19:41] <logmsgbot>	 !log tgr@deploy2002 tgr, cscott: Continuing with sync
[21:22:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:26:17] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119215|Turn on Parsoid Read Views for 33 wiktionaries (T386272)]], [[gerrit:1119216|Turn on Parsoid Read Views for mobile wiktionary (T386272)]] (duration: 12m 08s)
[21:26:21] <stashbot>	 T386272: Wiktionary deploy ~2024-02-13 - https://phabricator.wikimedia.org/T386272
[21:26:25] <subbu>	 thanks tgr|away !
[21:27:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: Jdlrobson)
[21:28:14] <wikibugs>	 (Merged) jenkins-bot: Fix name of ABTestEnrollment configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/1119555 (https://phabricator.wikimedia.org/T384019) (owner: Jdlrobson)
[21:28:31] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]]
[21:28:34] <stashbot>	 T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019
[21:31:15] <logmsgbot>	 !log tgr@deploy2002 jdlrobson, tgr: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:32:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:35:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551013 (phaultfinder)
[21:37:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:37:24] <wikibugs>	 (CR) Gergő Tisza: [C:+2] auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: Gergő Tisza)
[21:37:28] <wikibugs>	 (CR) Gergő Tisza: [C:+2] Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[21:37:32] <wikibugs>	 (CR) Gergő Tisza: [C:+2] Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: Gergő Tisza)
[21:38:18] <tgr|away>	 Jdlrobson: does it look good?
[21:39:05] <Jdlrobson>	 tgr|away: looking almost done
[21:41:00] <Jdlrobson>	 yes please sync tgr|away !
[21:41:16] <logmsgbot>	 !log tgr@deploy2002 jdlrobson, tgr: Continuing with sync
[21:47:20] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:47:26] <wikibugs>	 (CR) Xcollazo: [C:+1] "Thank you for taking the time (and pain) to clean this up Amir." [dumps] - https://gerrit.wikimedia.org/r/1119486 (https://phabricator.wikimedia.org/T382069) (owner: Ladsgroup)
[21:47:55] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119555|Fix name of ABTestEnrollment configuration (T384019)]] (duration: 19m 24s)
[21:47:59] <stashbot>	 T384019: Deploy Empty Search A/B test - https://phabricator.wikimedia.org/T384019
[21:49:21] <wikibugs>	 (Merged) jenkins-bot: auth: Use POST trxProfiler expectations during return/reauth [core] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119516 (https://phabricator.wikimedia.org/T385566) (owner: Gergő Tisza)
[21:49:23] <wikibugs>	 (Merged) jenkins-bot: Track the number of started / finished SUL3 flows [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119530 (https://phabricator.wikimedia.org/T377261) (owner: Gergő Tisza)
[21:49:25] <wikibugs>	 (Merged) jenkins-bot: Do not preserve 'sul3-action' when restarting authentication [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1119531 (https://phabricator.wikimedia.org/T364866) (owner: Gergő Tisza)
[21:50:52] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]]
[21:50:59] <stashbot>	 T385566: SUL3: Transaction profiler warnings when logging in - https://phabricator.wikimedia.org/T385566
[21:50:59] <stashbot>	 T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261
[21:50:59] <stashbot>	 T364866: Adapt to changes in post-login/signup hooks after switching to a central login wiki - https://phabricator.wikimedia.org/T364866
[21:51:59] <wikibugs>	 SRE, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10551073 (Krinkle) >>! In T214998#10548937, @Krinkle wrote: > […] our mobile user agent regex may not be as good as as Google's...
[21:52:39] <Jdlrobson>	 thanks tgr|away !
[21:52:50] <tgr|away>	 sure
[21:53:33] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:56:37] <wikibugs>	 ops-eqiad, DC-Ops, Discovery-Search, Research, Data-Platform-SRE (2025.02.10 - 2025.02.28): Relabel Elastic hosts to Relforge hosts - https://phabricator.wikimedia.org/T386358#10551110 (bking) a:bking→None
[21:59:02] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[22:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250213T2200)
[22:00:17] <tgr|away>	 Amir1: migrateESRefToContentTable is producing a big spike of "PHP Warning: fwrite() expects parameter 1 to be resource, bool given"
[22:00:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[22:00:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[22:00:59] <tgr|away>	 or zabe 
[22:01:07] <zabe>	 yes sorry
[22:01:09] <zabe>	 that was me
[22:01:22] <zabe>	 I forgot to "chmod 666" the dump file
[22:01:23] <tgr|away>	 np, just wasn't sure you are aware
[22:01:58] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php diqwiki --skip /home/zabe/text_table_cleanup/diqwiki --dump /home/zabe/text_table_dump/diqwiki --sleep 0.5 --start 318769 # T183490
[22:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:01] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[22:05:56] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119516|auth: Use POST trxProfiler expectations during return/reauth (T385566)]], [[gerrit:1119530|Track the number of started / finished SUL3 flows (T377261)]], [[gerrit:1119531|Do not preserve 'sul3-action' when restarting authentication (T364866)]] (duration: 15m 03s)
[22:06:02] <stashbot>	 T385566: SUL3: Transaction profiler warnings when logging in - https://phabricator.wikimedia.org/T385566
[22:06:03] <stashbot>	 T377261: Track the number of interrupted SUL3 logins / signups - https://phabricator.wikimedia.org/T377261
[22:06:03] <stashbot>	 T364866: Adapt to changes in post-login/signup hooks after switching to a central login wiki - https://phabricator.wikimedia.org/T364866
[22:09:49] <tgr|away>	 !log UTC late deploys done
[22:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:09] <rzl>	 !log rzl@idp2004:~$ sudo systemctl restart tomcat10
[22:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:34] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551219 (Dzahn) Confirming Arthur is already on the NDA tracking sheet, checking this box off.
[22:16:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551220 (Dzahn)
[22:17:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551228 (Dzahn)
[22:18:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551233 (Dzahn)
[22:19:16] <wikibugs>	 (PS1) Bking: relforge: Prepare newly-reimaged relforge hosts to join the cluster [puppet] - https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357)
[22:19:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551239 (Dzahn) @thcipriani Hello, here is a request for deployment access for your consideration.
[22:19:46] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: Bking)
[22:21:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551246 (thcipriani) Thanks for the ping @Dzahn   Request looks good to me, approved.  Thanks for volunteering. Please reach out if you need anything.
[22:22:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10551248 (Dzahn) @EPIC Could you please send an email to Katie Francis (https://meta.wikimedia.org/wiki/User:KFrancis_(WMF)) and tell her your real name?
[22:22:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:22:43] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10551250 (Dzahn)
[22:23:24] <wikibugs>	 (PS1) Jdlrobson: Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619)
[22:24:09] <wikibugs>	 (CR) Jdlrobson: "Amir: this can be deployed as soon as we're sure the 1.44.0-wmf.16 train won't roll back!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: Jdlrobson)
[22:32:11] <wikibugs>	 (CR) Ryan Kemper: [C:+1] "Looks good; my +1 is contingent upon pcc coming back good once we've fixed the puppet facts issue" [puppet] - https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: Bking)
[22:32:34] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: Bking)
[22:35:10] <wikibugs>	 (CR) Bking: [C:+2] relforge: Prepare newly-reimaged relforge hosts to join the cluster [puppet] - https://gerrit.wikimedia.org/r/1119576 (https://phabricator.wikimedia.org/T386357) (owner: Bking)
[22:38:52] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTable.php ttwiki --skip /home/zabe/text_table_cleanup/ttwiki --dump /home/zabe/text_table_dump/ttwiki --sleep 0.5 --start 867501 # T183490
[22:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:56] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[22:44:39] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on relforge[1003-1007].eqiad.wmnet with reason: T386357
[22:44:42] <stashbot>	 T386357: Replace current Relforge servers with repurposed Elastic hosts - https://phabricator.wikimedia.org/T386357
[22:50:56] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 8, active_shards: 15, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight
[22:50:56] <icinga-wm>	 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.75 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:19:05] <wikibugs>	 (CR) Bartosz Dziewoński: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1117924 (https://phabricator.wikimedia.org/T383952) (owner: Bartosz Dziewoński)
[23:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551319 (phaultfinder)
[23:58:28] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state