[01:11:14] serviceops, MediaWiki-Platform-Team (Radar): Please review MediaWiki Apache config changes adding new docroot for the auth domain - https://phabricator.wikimedia.org/T385228#10562051 (RLazarus) >>! In T385228#10551304, @matmarex wrote: > I saw that it was deployed today. Thank you. Thanks for your p...
[02:27:49] serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10562112 (Scott_French) Thanks for opening this, @RLazarus. So, now that I've had a chance to look more closely, I'm pretty sure this is actually getting caught...
[02:36:55] serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10562119 (RLazarus) As a `--k8s-confirm-diff` user that'd meet my needs just fine -- I just don't know who consumes those scap logs and whether they'd miss havin...
[04:48:57] serviceops, Content-Transform-Team, Maps (Kartotherian): Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562190 (Jgiannelos) For debugging purposes, this is a URL that requests a snapshot with an overlay map from en.wikipedia.org: ` https://maps.wikimedia.or...
[05:02:38] serviceops, Content-Transform-Team, Maps (Kartotherian): Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562193 (Jgiannelos) From my local env when I try this request this shows up in the trace logs so indeed it should eventually request en.wikipedia.org: `...
[08:07:37] hey folks, filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120893 for wikikube
[08:07:49] of course I forgot to add the lvs loopback to wikikube workers
[08:07:54] I focused only on bare metal sigh
[08:08:04] if anybody could check/review I'd be grateful
[08:08:17] (credits to Scott for the finding)
[08:32:20] serviceops, decommission-hardware: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10562424 (MoritzMuehlenhoff)
[08:41:56] elukey: +1'd
[08:52:30] <3
[10:02:16] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10562723 (cmooney) Agreed. If the POD (both ends of the veth linking it to the main netns) has a lower MTU, but the K8s host physic...
[10:06:19] serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10562736 (Vgutierrez) Thanks @cmooney, do you have any idea of when could you proceed with this @JMeybohm?
[10:40:02] serviceops, Content-Transform-Team, Maps (Kartotherian), Patch-For-Review: Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562809 (elukey) >>! In T386648#10560543, @Scott_French wrote: > I took a quick look this morning after seeing the discussion in IRC...
[11:01:34] serviceops, Content-Transform-Team, Maps (Kartotherian), Patch-For-Review: Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562885 (elukey) I just depooled two bare metal nodes from each DC (like I did on Monday) and everything seems stable, so I think we...
[11:02:28] serviceops, ChangeProp, WMF-JobQueue: Scope migrating jobs from changeprop-jobqueue to Mercurius - https://phabricator.wikimedia.org/T386799 (hnowlan) NEW
[11:02:44] serviceops, ChangeProp, WMF-JobQueue: Scope migrating jobs from changeprop-jobqueue to Mercurius - https://phabricator.wikimedia.org/T386799#10562904 (hnowlan) p:Triage→Low
[11:49:27] FYI, I'll temporarily switch kubestagemaster1002 to DRBD for draining a Ganeti node (and then back to plain disk storage)
[11:50:15] ack
[12:12:25] and it's back to plain storage
[12:43:59] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563189 (Clement_Goubert) >>! In T385782#10559681, @Urbanecm_WMF wrote: >> jobs that are low criticality and could be mig...
[12:49:26] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563201 (Clement_Goubert) >>! In T385782#10559397, @Urbanecm_WMF wrote: > Question: How would this work impact beta? Some...
[12:58:43] serviceops, SRE Observability: chartmuseum prometheus metrics cardinality spam - https://phabricator.wikimedia.org/T386808 (fgiunchedi) NEW
[12:59:21] serviceops, SRE Observability: chartmuseum prometheus metrics cardinality spam - https://phabricator.wikimedia.org/T386808#10563217 (fgiunchedi)
[13:12:43] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563236 (Urbanecm_WMF) >>! In T385782#10563189, @Clement_Goubert wrote: > Thanks for the cleanup! I have another question...
[14:05:01] serviceops, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Dont propagate server error details to end users - https://phabricator.wikimedia.org/T385821#10563462 (Jgiannelos) Verified on staging
[14:26:43] serviceops, Content-Transform-Team, MediaWiki-Engineering, OKR-Work, Web Team Essential Work 2025: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246#10563552 (MSantos)
[15:33:51] serviceops, Page Content Service, Content-Transform-Team (Work In Progress): Dont propagate server error details to end users - https://phabricator.wikimedia.org/T385821#10563982 (Jgiannelos) Open→Resolved
[16:30:09] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10564283 (jijiki) p:Triage→Medium
[16:30:25] serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10564286 (jijiki) p:Medium→Triage
[16:35:14] serviceops, SRE, Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564303 (jijiki) Open→Stalled
[16:37:45] serviceops, SRE, Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564314 (jijiki) p:Medium→Low
[16:52:21] serviceops, Content-Transform-Team, MediaWiki-Engineering, OKR-Work, Web Team Essential Work 2025: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246#10564381 (jijiki) p:Triage→Medium
[17:19:42] rzl (or anyone really): is there a way for me to see if the sidecar controller is active in my namespace (toolhub) and what it might be doing if so?
Context is that I deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1119198 but I am still seeing the toolhub-main-crawler-* pod spawned by the cronjob linger.
[17:20:45] bd808: hm, I can take a look in a few minutes
[17:25:18] thanks. I do know that the "toolhub-main-crawler" container in the pod exits with a non-zero status currently, but it looks like the controller logic works the same with reason: Complete and reason: Error exits.
[17:30:32] hm, the sidecar controller logs show it went to kill toolhub-main-crawler-28999740-j7f6p at 2025-02-19T17:01:31Z
[17:30:51] and then yeah it shows the toolhub-main-tls-proxy container going from phase Running to Failed
[17:33:02] er, the pod going to Failed rather, that log line is unclear
[17:33:57] afaict this is working? the pod is stopped, you're only seeing it in status Error because toolhub-main-crawler exited with nonzero status
[17:35:24] if you're seeing something else you didn't expect, let me know what
[17:35:45] rzl: ah, that actually makes some sense. I need to make a follow up change to stop the error exit of the toolhub-main-crawler container in the pod. I will move forward to do that as things won't be any worse even if that turns out not to be the needed magic.
[17:38:12] so, for clarity, if you're surprised to still see the pods in `kubectl get pod` after the job is finished, you still will :) but after making that fix, you'll see them in phase Succeeded rather than Failed
[17:38:42] if you want to also delete the pods, so that you can't see them in `kubectl get pod` or view their logs or anything, you can do that too but it's a different change
[17:41:54] (see https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits -- the implication you care about is that deleting the job will also garbage-collect the associated pods)
[17:42:10] Thanks for that reminder.
It has been a minute since I had the details of how Kubernetes treats the cronjob.batch -> job.batch -> pod objects in my head.
[17:43:34] for sure, there's a lot going on there
[17:58:21] serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10564838 (MatthewVernon) Which PCRE version(s) do you want? Our CI is currently building unstable-for-{bookworm,bullseye}, so currently 10.45-1 (which become...
[18:09:11] serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10564895 (Scott_French) Thanks, @MatthewVernon. So, while 10.39 or newer will work, I'd propose we adopt the version with the most "miles" on it in terms of...
[18:43:28] serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565138 (Scott_French) Great, thanks @RLazarus! Alright, I have a patch that I'll post later on that makes a couple of hardening changes, but most importantly...
[19:54:32] serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565435 (dancy) >>! In T386759#10565138, @Scott_French wrote: > Of course, this is on the assumption that it would be possible(?) for someone to enable the `--k...
[19:55:52] serviceops, Deployments, Shellbox, Wikibase-Quality-Constraints, and 3 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10565441 (dancy) p:Triage→Low
[21:03:09] serviceops, Release-Engineering-Team, Scap, Patch-For-Review: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565610 (Scott_French) Thank you very much, @dancy.
I've gone the `output_line` route in https://gitlab.wikimedia.org/repos/releng/scap/-/...
[23:05:31] serviceops, collaboration-services, Prod-Kubernetes, Kubernetes: Replace k8s-controller-sidecars with built in Sidecar containers on k8s 1.31 - https://phabricator.wikimedia.org/T386694#10565984 (RLazarus) Yep 100% agree. The controller we're using is just a stopgap until https://kubernetes.io/do...