[00:08:51] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10506944 (Scott_French) **shellbox-constraints**: Thanks to @jijiki, this has been running on 8.1 since ~ 11:30 UTC today, with no issues detected thus far. In the event that an issu...
[00:11:08] serviceops, Patch-For-Review: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038#10506945 (Scott_French)
[00:20:27] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10506980 (Scott_French) Alright, we've reached 5% of client sessions as of ~ 16:20 UTC Wednesday, which is where we intend to pause until next week. No major issues have been encount...
[00:21:11] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10506985 (Scott_French)
[00:45:03] serviceops, Patch-For-Review: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10507023 (Scott_French) Ah, right! Just to confirm, how //recent// does the version of `php-uuid` need to be in the 7.4 case? As I mentioned in T373752#10136087, the tricky...
[01:04:23] serviceops, Patch-For-Review: Build php-uuid package, and add to WMF production and CI - https://phabricator.wikimedia.org/T373752#10507037 (Reedy) Naively, in practice... I think it probably doesn't matter, too much. As long as works. Famous last words. https://github.com/php/pecl-networking-uuid doesn...
[08:45:47] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10507417 (ops-monitoring-bot) pool host wikikube-worker[2095,2175,2186].codfw.wmnet by jayme@cumin1002 wit...
[08:45:49] serviceops, collaboration-services, Data-Persistence, DC-Ops, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10507421 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by jayme@cumin1002 pool...
[08:51:18] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian), Patch-For-Review: Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10507440 (Jgiannelos) Latest difftesting after fixing localization | quantile | ssim | | 0.25 | 0.99175 | | 0.5 | 0.9985...
[09:15:50] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian), Patch-For-Review: Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10507497 (Jgiannelos) @elukey I double checked the results and the issue was fixed. I am taking a look at the diffs that show some...
[09:20:02] hey folks!
[09:20:18] there are again some pending cert requests for hosts that may have been renamed: http://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPuppetPendingCertificateRequest
[09:24:45] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: Retire use of scap proxies - https://phabricator.wikimedia.org/T384196#10507506 (hashar) I got the issue when running the train which was very confusing: ` 09:16:19 deploy-promote failed: Command '['/usr/bin/scap',...
[10:19:26] elukey: there is an effort of renaming tons of hosts
[10:20:54] yep I know, but those certs are probably leftovers
[10:21:01] IIRC Kamila cleaned them up the last time
[10:24:29] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian): Staging error: Snapshots with overlay map failed to render - https://phabricator.wikimedia.org/T384023#10507691 (Jgiannelos) Open→Resolved This is fixed in staging
[10:25:05] serviceops, Content-Transform-Team, WMDE-TechWish-Maintenance, Maps (Kartotherian), Patch-For-Review: Staging error: SVG attribute is not supported - https://phabricator.wikimedia.org/T384823#10507695 (Jgiannelos) Open→Resolved a:Jgiannelos This looks like a warning that is no...
[10:41:58] elukey: alright
[10:48:59] serviceops, Datacenter-Switchover: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 (hnowlan) NEW
[10:55:43] serviceops, Datacenter-Switchover: MoveComms support for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T385157 (hnowlan) NEW
[10:58:27] serviceops, Datacenter-Switchover: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155#10507861 (hnowlan)
[10:59:05] serviceops, Datacenter-Switchover: SRE comms for March 2025 Datacentre switchover - https://phabricator.wikimedia.org/T385157#10507863 (hnowlan)
[11:04:05] serviceops, Datacenter-Switchover: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155#10507874 (hnowlan)
[11:16:09] serviceops, MW-on-K8s, Patch-For-Review: Allow members of restricted to run maintenance scripts - https://phabricator.wikimedia.org/T378429#10507919 (JMeybohm) kubeconfig files (and certificates) have been created ` -rw-r----- 1 mwdeploy restricted 490 Jan 30 10:23 /etc/kubernetes/mw-cron-restrict...
[11:16:33] serviceops, MW-on-K8s, Patch-For-Review: Allow members of restricted to run maintenance scripts - https://phabricator.wikimedia.org/T378429#10507921 (JMeybohm) a:JMeybohm→RLazarus
[11:17:03] the other question that I have is if it is ok to set "4 cpus" (I know it is not exactly like that but you got it) for Kartotherian: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1115340
[11:17:28] for prod I am thinking about 6 pods for each DC, to reflect the current maps nodes
[11:30:04] elukey: checking capacity
[11:30:36] elukey: yeah we're good
[11:31:06] I'll go ahead and clean up the puppet certs
[11:33:59] I can do that
[11:34:08] too late
[11:34:10] :p
[11:34:12] (sorry, missed the message)
[11:34:23] elukey: puppet certs cleaned
[11:34:26] oh well :-D
[11:34:27] alert should go away
[11:34:32] claime: thanksssssss
[11:34:42] np
[13:04:36] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: Retire use of scap proxies - https://phabricator.wikimedia.org/T384196#10508205 (hashar) That was confusing at first, but I am happy it was "expected" :) Thank you for having tested scap with an empty list of proxies!
[13:49:21] serviceops, PoolCounter, MediaWiki-Platform-Team (Radar): poolcounter-exporter upgrade - https://phabricator.wikimedia.org/T333947#10508291 (fgiunchedi) As part of {T321808} I have imported 0.1.2 in gerrit and revamped the debian bits, next week I'll be uploading the debian package and upgrade \o/
[14:08:05] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian), Patch-For-Review: Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10508357 (Jgiannelos) Using prod deployment: | quantile | ssim | | 0.1 | 0.990876 | | 0.2 | 0.994956 | | 0.25 | 0.995939...
[14:08:45] serviceops, PoolCounter, MediaWiki-Platform-Team (Radar): poolcounter-exporter upgrade - https://phabricator.wikimedia.org/T333947#10508360 (fgiunchedi) a:akosiaris→fgiunchedi
[14:26:00] serviceops, Observability-Alerting, Kubernetes, SRE Observability (FY2024/2025-Q3): Alert on unscrapable pods - https://phabricator.wikimedia.org/T372242#10508472 (fgiunchedi)
[14:35:14] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian), Patch-For-Review: Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10508564 (Jgiannelos) | quantile | Percentage diff of latency between A and B % | | 0.1 | -29.5699 | | 0.2 | -20...
[15:05:07] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: Retire use of scap proxies - https://phabricator.wikimedia.org/T384196#10508833 (hnowlan) Open→Resolved a:hnowlan We've seen a few scaps run successfully, I think everything is working as expected.
[15:30:52] akosiaris: sorry for not replying before.. IPv6 connections between liberica and confd nodes are already happening, so apparently it's working, I just want to be sure that's not an unsupported use case
[15:31:51] vgutierrez: definitely supported.
[15:39:49] serviceops, Content-Transform-Team-WIP, Maps (Kartotherian), Patch-For-Review: Difftesting between staging and production - https://phabricator.wikimedia.org/T384530#10509105 (Jgiannelos) After some back and forth with @elukey and increasing the cpu resources in kartotherian deployment charts her...
[15:40:29] hello serviceops friends I'm seeing a cr on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1115436
[15:54:59] cdanis: where do you see this breaking on bare-metal?
[15:55:08] as in, we have almost 0 bare metal now
[15:55:17] akosiaris: specifically mwdebug* where it is causing lots of log spam
[15:55:19] mwdebug, mwmaint and that's it
[15:55:26] T385037
[15:55:31] ah, ok that's the reason then.
[15:55:32] sorry, it's one of the 3 linked bugs, I was too liberal
[15:55:36] with adding them
[15:58:45] cdanis: commit message amended and +1ed. Want to deploy
[15:59:03] ?
[15:59:20] yeah I'll deploy now I think, thanks!
[16:00:07] yw.
[16:08:09] serviceops, Arc-Lamp: Gather PHP8.1 profiling data - https://phabricator.wikimedia.org/T385199 (jijiki) NEW
[16:09:45] serviceops, Arc-Lamp: Gather PHP8.1 profiling data - https://phabricator.wikimedia.org/T385199#10509238 (jijiki)
[16:09:50] serviceops, Patch-For-Review: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845#10509239 (jijiki)
[17:14:36] serviceops, MW-on-K8s, TimedMediaHandler, Video: Log filename in shellbox-video httpd - https://phabricator.wikimedia.org/T368619#10509531 (kamila) The easiest way to do this would be to add a header with the filename to the shellbox client. However, in the videoscalers case, the shellbox client...
[17:19:39] serviceops, collaboration-services, Prod-Kubernetes, Data-Platform-SRE (2025.01.11 - 2025.01.31), Kubernetes: Update wikikube-staging-codfw to kubernetes 1.31 - https://phabricator.wikimedia.org/T384450#10509553 (JMeybohm) Open→In progress a:JMeybohm
[17:29:34] serviceops, MW-on-K8s, Observability-Metrics: Update Benthos chart for k8s deployments - https://phabricator.wikimedia.org/T385210 (kamila) NEW
[17:29:50] serviceops, MW-on-K8s, Observability-Metrics: Update Benthos chart for k8s deployments - https://phabricator.wikimedia.org/T385210#10509613 (kamila) p:Triage→Medium
[17:32:01] serviceops, MW-on-K8s, TimedMediaHandler, Video: Log filename in shellbox-video httpd - https://phabricator.wikimedia.org/T368619#10509644 (hnowlan) Open→Declined Given the amount of work required to do this and the relative stability of mercurius and mw-videoscaler I think we can jus...
[17:55:45] serviceops, MW-on-K8s, Observability-Metrics: Update Benthos chart and image for k8s deployments - https://phabricator.wikimedia.org/T385210#10509806 (kamila)
[18:15:27] serviceops, DC-Ops, ops-codfw, Prod-Kubernetes, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T385078#10509886 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[19:26:53] serviceops: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225 (Scott_French) NEW
[20:02:10] serviceops, MediaWiki-Platform-Team (Radar): Please review MediaWiki Apache config changes adding new docroot for the auth domain - https://phabricator.wikimedia.org/T385228 (matmarex) NEW
[20:02:48] serviceops, MediaWiki-Platform-Team (Radar): Please review MediaWiki Apache config changes adding new docroot for the auth domain - https://phabricator.wikimedia.org/T385228#10510250 (matmarex)
[20:20:12] serviceops: Mercurius does not retry failed transcodes beyond 15m - https://phabricator.wikimedia.org/T385225#10510297 (Scott_French) p:Triage→Medium a:Scott_French Given that this has been the case from day one without significant issue, I suspect this is fairly rare to trigger in practice, whic...
[23:26:17] serviceops, MW-on-K8s, SRE: mwscript-k8s does not support short maintenance script names - https://phabricator.wikimedia.org/T385238 (Urbanecm_WMF) NEW
[23:27:21] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10510697 (Urbanecm_WMF) Question: How would I run a maintenance script within a debug pod (within k8s-mwdebug)? Occasionally, I need to do that, for example, when deploying T374348 (to run the migrat...
[23:51:17] serviceops, MW-on-K8s: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553#10510726 (Urbanecm_WMF) >>! In T341553#10510697, @Urbanecm_WMF wrote: > Would running `mwscript-k8s` while scap is waiting on me to confirm the deployment have the desired effect (of running with the...