[01:11:14] <wikibugs>	 serviceops, MediaWiki-Platform-Team (Radar): Please review MediaWiki Apache config changes adding new docroot for the auth domain - https://phabricator.wikimedia.org/T385228#10562051 (RLazarus) >>! In T385228#10551304, @matmarex wrote: > I saw that it was deployed today. Thank you.  Thanks for your p...
[02:27:49] <wikibugs>	 serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10562112 (Scott_French) Thanks for opening this, @RLazarus.  So, now that I've had a chance to look more closely, I'm pretty sure this is actually getting caught...
[02:36:55] <wikibugs>	 serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10562119 (RLazarus) As a `--k8s-confirm-diff` user that'd meet my needs just fine -- I just don't know who consumes those scap logs and whether they'd miss havin...
[04:48:57] <wikibugs>	 serviceops, Content-Transform-Team, Maps (Kartotherian): Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562190 (Jgiannelos) For debugging purposes, this is a URL that requests a snapshot with an overlay map from en.wikipedia.org: ` https://maps.wikimedia.or...
[05:02:38] <wikibugs>	 serviceops, Content-Transform-Team, Maps (Kartotherian): Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562193 (Jgiannelos) From my local env when I try this request this shows up in the trace logs so indeed it should eventually request en.wikipedia.org: `...
[08:07:37] <elukey>	 hey folks, filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1120893 for wikikube
[08:07:49] <elukey>	 of course I forgot to add the lvs loopback to wikikube workers
[08:07:54] <elukey>	 I focused only on bare metal sigh
[08:08:04] <elukey>	 if anybody could check/review I'd be grateful
[08:08:17] <elukey>	 (credits to Scott for the finding)
[08:32:20] <wikibugs>	 serviceops, decommission-hardware: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10562424 (MoritzMuehlenhoff)
[08:41:56] <kamila_>	 elukey: +1'd
[08:52:30] <elukey>	 <3
[10:02:16] <wikibugs>	 serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10562723 (cmooney) Agreed.  If the POD (both ends of the veth linking it to the main netns) has a lower MTU, but the K8s host physic...
[10:06:19] <wikibugs>	 serviceops, Prod-Kubernetes, Traffic, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10562736 (Vgutierrez) Thanks @cmooney, do you have any idea of when you could proceed with this, @JMeybohm?
[10:40:02] <wikibugs>	 serviceops, Content-Transform-Team, Maps (Kartotherian), Patch-For-Review: Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562809 (elukey) >>! In T386648#10560543, @Scott_French wrote: > I took a quick look this morning after seeing the discussion in IRC...
[11:01:34] <wikibugs>	 serviceops, Content-Transform-Team, Maps (Kartotherian), Patch-For-Review: Review maps outage happened on Feb 17th 2025 - https://phabricator.wikimedia.org/T386648#10562885 (elukey) I just depooled two bare metal nodes from each DC (like I did on Monday) and everything seems stable, so I think we...
[11:02:28] <wikibugs>	 serviceops, ChangeProp, WMF-JobQueue: Scope migrating jobs from changeprop-jobqueue to Mercurius - https://phabricator.wikimedia.org/T386799 (hnowlan) NEW
[11:02:44] <wikibugs>	 serviceops, ChangeProp, WMF-JobQueue: Scope migrating jobs from changeprop-jobqueue to Mercurius - https://phabricator.wikimedia.org/T386799#10562904 (hnowlan) p:Triage→Low
[11:49:27] <moritzm>	 FYI, I'll temporarily switch kubestagemaster1002 to DRBD for draining a Ganeti node (and then back to plain disk storage)
[11:50:15] <hnowlan>	 ack
[12:12:25] <moritzm>	 and it's back to plain storage
[12:43:59] <wikibugs>	 serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563189 (Clement_Goubert) >>! In T385782#10559681, @Urbanecm_WMF wrote: >> jobs that are low criticality and could be mig...
[12:49:26] <wikibugs>	 serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563201 (Clement_Goubert) >>! In T385782#10559397, @Urbanecm_WMF wrote: > Question: How would this work impact beta? Some...
[12:58:43] <wikibugs>	 serviceops, SRE Observability: chartmuseum prometheus metrics cardinality spam - https://phabricator.wikimedia.org/T386808 (fgiunchedi) NEW
[12:59:21] <wikibugs>	 serviceops, SRE Observability: chartmuseum prometheus metrics cardinality spam - https://phabricator.wikimedia.org/T386808#10563217 (fgiunchedi)
[13:12:43] <wikibugs>	 serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10563236 (Urbanecm_WMF) >>! In T385782#10563189, @Clement_Goubert wrote: > Thanks for the cleanup! I have another question...
[14:05:01] <wikibugs>	 serviceops, Page Content Service, Content-Transform-Team (Work In Progress), Patch-For-Review: Dont propagate server error details to end users - https://phabricator.wikimedia.org/T385821#10563462 (Jgiannelos) Verified on staging
[14:26:43] <wikibugs>	 serviceops, Content-Transform-Team, MediaWiki-Engineering, OKR-Work, Web Team Essential Work 2025: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246#10563552 (MSantos)
[15:33:51] <wikibugs>	 serviceops, Page Content Service, Content-Transform-Team (Work In Progress): Dont propagate server error details to end users - https://phabricator.wikimedia.org/T385821#10563982 (Jgiannelos) Open→Resolved
[16:30:09] <wikibugs>	 serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10564283 (jijiki) p:Triage→Medium
[16:30:25] <wikibugs>	 serviceops, Growth-Team, GrowthExperiments, MW-on-K8s, Patch-For-Review: Migrate GrowthExperiments maintenance jobs to mw-cron - https://phabricator.wikimedia.org/T385782#10564286 (jijiki) p:Medium→Triage
[16:35:14] <wikibugs>	 serviceops, SRE, Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564303 (jijiki) Open→Stalled
[16:37:45] <wikibugs>	 serviceops, SRE, Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564314 (jijiki) p:Medium→Low
[16:52:21] <wikibugs>	 serviceops, Content-Transform-Team, MediaWiki-Engineering, OKR-Work, Web Team Essential Work 2025: Migrate parsoidtest functionality to kubernetes - https://phabricator.wikimedia.org/T386246#10564381 (jijiki) p:Triage→Medium
[17:19:42] <bd808>	 rzl (or anyone really): is there a way for me to see if the sidecar controller is active in my namespace (toolhub) and what it might be doing if so? Context is that I deployed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1119198 but I am still seeing the toolhub-main-crawler-* pod spawned by the cronjob linger.
[17:20:45] <rzl>	 bd808: hm, I can take a look in a few minutes
[17:25:18] <bd808>	 thanks. I do know that the "toolhub-main-crawler" container in the pod exits with a non-zero status currently, but it looks like the controller logic works the same with reason: Complete and reason: Error exits.
[17:30:32] <rzl>	 hm, the sidecar controller logs show it went to kill toolhub-main-crawler-28999740-j7f6p at 2025-02-19T17:01:31Z
[17:30:51] <rzl>	 and then yeah it shows the toolhub-main-tls-proxy container going from phase Running to Failed
[17:33:02] <rzl>	 er, the pod going to Failed rather, that log line is unclear
[17:33:57] <rzl>	 afaict this is working? the pod is stopped, you're only seeing it in status Error because toolhub-main-crawler exited with nonzero status
[17:35:24] <rzl>	 if you're seeing something else you didn't expect, let me know what
[17:35:45] <bd808>	 rzl: ah, that actually makes some sense. I need to make a follow up change to stop the error exit of the toolhub-main-crawler container in the pod. I will move forward to do that as things won't be any worse even if that turns out not to be the needed magic.
[17:38:12] <rzl>	 so, for clarity, if you're surprised to still see the pods in `kubectl get pod` after the job is finished, you still will :) but after making that fix, you'll see them in phase Succeeded rather than Failed
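The Succeeded/Failed distinction rzl describes can be illustrated with a minimal Python sketch of the pod phase rule for a pod whose containers are not restarted. This is a simplified model, not Kubernetes code: the `ContainerState` type, the `pod_phase` helper, and the exit codes are hypothetical, and it assumes the toolhub-main-tls-proxy sidecar exits with status 0 when the controller stops it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerState:
    """Hypothetical, simplified view of one container in a pod."""
    name: str
    exit_code: Optional[int]  # None while the container is still running


def pod_phase(containers: List[ContainerState]) -> str:
    """Simplified phase rule for a pod whose containers are never restarted."""
    # The pod stays Running until every container has terminated.
    if any(c.exit_code is None for c in containers):
        return "Running"
    # All containers terminated successfully.
    if all(c.exit_code == 0 for c in containers):
        return "Succeeded"
    # All terminated, and at least one terminated in failure.
    return "Failed"


# Exit codes below are illustrative assumptions, not taken from the logs.
before_fix = [
    ContainerState("toolhub-main-crawler", 1),
    ContainerState("toolhub-main-tls-proxy", 0),
]
after_fix = [
    ContainerState("toolhub-main-crawler", 0),
    ContainerState("toolhub-main-tls-proxy", 0),
]
```

Under these assumptions `pod_phase(before_fix)` evaluates to "Failed" and `pod_phase(after_fix)` to "Succeeded"; in both cases the pod object still exists and remains visible in `kubectl get pod`.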
[17:38:42] <rzl>	 if you want to also delete the pods, so that you can't see them in `kubectl get pod` or view their logs or anything, you can do that too but it's a different change
[17:41:54] <rzl>	 (see https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits -- the implication you care about is that deleting the job will also garbage-collect the associated pods)
[17:42:10] <bd808>	 Thanks for that reminder. It has been a minute since I had the details of how Kubernetes treats the cronjob.batch -> job.batch -> pod objects in my head.
[17:43:34] <rzl>	 for sure, there's a lot going on there
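The cronjob.batch -> job.batch -> pod relationship discussed above can be sketched as a small Python model of the jobs history limits. This is an illustration only, not the Kubernetes implementation: the `Job` type, the `prune_history` helper, and all job and pod names are hypothetical. The default limits used here (3 successful, 1 failed) match the documented defaults for `successfulJobsHistoryLimit` and `failedJobsHistoryLimit`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Job:
    """Hypothetical, simplified finished Job owned by a CronJob."""
    name: str
    finished_at: int  # illustrative timestamp; larger means newer
    succeeded: bool
    pods: List[str] = field(default_factory=list)


def prune_history(
    jobs: List[Job], successful_limit: int = 3, failed_limit: int = 1
) -> Tuple[List[Job], List[str]]:
    """Return (jobs kept, pods garbage-collected along with deleted jobs)."""
    kept: List[Job] = []
    deleted_pods: List[str] = []
    for outcome, limit in ((True, successful_limit), (False, failed_limit)):
        # Newest first within each outcome group.
        group = sorted(
            (j for j in jobs if j.succeeded == outcome),
            key=lambda j: j.finished_at,
            reverse=True,
        )
        kept.extend(group[:limit])
        # Deleting a job also removes the pods it owns.
        for job in group[limit:]:
            deleted_pods.extend(job.pods)
    return kept, deleted_pods


# Two failed runs with made-up names; only the newest failed job is retained.
history = [
    Job("crawler-1", 1, False, ["crawler-1-aaaaa"]),
    Job("crawler-2", 2, False, ["crawler-2-bbbbb"]),
]
kept_jobs, removed_pods = prune_history(history)
```

With the default limits, the older failed job is pruned and its pod goes with it, which is why finished pods linger in `kubectl get pod` only as long as their job is kept in history.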
[17:58:21] <wikibugs>	 serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10564838 (MatthewVernon) Which PCRE version(s) do you want? Our CI is currently building unstable-for-{bookworm,bullseye}, so currently 10.45-1 (which become...
[18:09:11] <wikibugs>	 serviceops, Patch-For-Review, PHP 8.1 support: Update PCRE in PHP 8.1 images to PCRE 10.39 or newer - https://phabricator.wikimedia.org/T386006#10564895 (Scott_French) Thanks, @MatthewVernon. So, while 10.39 or newer will work, I'd propose we adopt the version with the most "miles" on it in terms of...
[18:43:28] <wikibugs>	 serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565138 (Scott_French) Great, thanks @RLazarus!  Alright, I have a patch that I'll post later on that makes a couple of hardening changes, but most importantly...
[19:54:32] <wikibugs>	 serviceops, Release-Engineering-Team, Scap: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565435 (dancy) >>! In T386759#10565138, @Scott_French wrote: > Of course, this is on the assumption that it would be possible(?) for someone to enable the `--k...
[19:55:52] <wikibugs>	 serviceops, Deployments, Shellbox, Wikibase-Quality-Constraints, and 3 others: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker - https://phabricator.wikimedia.org/T371633#10565441 (dancy) p:Triage→Low
[21:03:09] <wikibugs>	 serviceops, Release-Engineering-Team, Scap, Patch-For-Review: OSError "Message too long" from scap helmfile diffs - https://phabricator.wikimedia.org/T386759#10565610 (Scott_French) Thank you very much, @dancy. I've gone the `output_line` route in https://gitlab.wikimedia.org/repos/releng/scap/-/...
[23:05:31] <wikibugs>	 serviceops, collaboration-services, Prod-Kubernetes, Kubernetes: Replace k8s-controller-sidecars with built in Sidecar containers on k8s 1.31 - https://phabricator.wikimedia.org/T386694#10565984 (RLazarus) Yep 100% agree. The controller we're using is just a stopgap until https://kubernetes.io/do...