[07:58:53] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907 (10hashar) 03NEW
[08:57:02] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10151941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm
[09:05:03] hello folks
[09:05:26] during the mw infra window I'd like to deploy the mw-config change to introduce poolcounter2005
[09:05:42] I checked on a pod in mw-web that the IP:7531 combination can be reached
[09:06:10] if you are ok with me going forward, what is the best command to use?
[09:21:23] 06serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10152027 (10JMeybohm) >>! In T332016#10148380, @elukey wrote: > @JMeybohm I can take care of the new VMs, but I have a doubt - if I create `registry2005`, will it be read/write? Just to understand your comment abo...
[09:25:43] 06serviceops, 06Infrastructure-Foundations, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Race condition in iptables rules during puppet runs on k8s nodes - https://phabricator.wikimedia.org/T374366#10152054 (10JMeybohm) With [[ https://gerrit.wikimedia.org/r/1073233 | 1073233 ]] merged, ferm is c...
[09:26:42] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm completed: - chartmuseum1001 (**PASS**)...
[09:29:20] elukey: AIUI an mw-config change requires an image rebuild, so a regular scap run should be the way to go
[10:05:42] jayme: ack, so basically merge + scap sync-world --k8s-only --k8s-confirm-diff -D full_image_build:true
[10:12:22] elukey: scap backport
[10:12:53] claime: yeah too late, I've already started that sigh
[10:12:56] heh
[10:13:07] TIL for the next time
[10:13:12] If you do a scap k8s-only, videoscalers won't have the config
[10:13:45] And you don't need a full build, that's for changes to the base images
[10:14:29] I mistakenly assumed that mw-config would have been added to base images, but it doesn't make much sense yes
[10:14:55] I asked around and decided to do it anyway, and failed
[10:15:19] I think you'll need another scap run to deploy to the videoscalers and maintenance hosts etc.
[10:15:19] at this point I'll wait for it to finish, then I'll do a scap backport
[10:15:24] yeah
[10:15:47] sorry for the extra deployment :(
[10:16:10] No worries
[10:16:28] Do you think the doc should be clarified?
[10:17:40] if you mean to be Luca-proof, I'd add a reference to scap backport in there (like "you are probably looking for this one instead of a full rebuild, don't use it")
[10:17:53] Fair
[10:18:24] but probably only for the far left / right of the Bell curve, where I am sitting at the moment
[10:23:41] https://wikitech.wikimedia.org/w/index.php?title=MediaWiki_On_Kubernetes&diff=2227697&oldid=2203916
[10:24:19] <3
[10:24:33] yes I think at this point I may be able to avoid stupid deployments in the future
[10:24:36] "may"
[10:24:52] Hehe
[11:06:47] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152511 (10elukey) Last step https://gerrit.wikimedia.org/r/c/integration/config/+/1073426
[11:10:42] 06serviceops, 06Infrastructure-Foundations: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928 (10elukey) 03NEW
[11:14:36] 06serviceops, 06Infrastructure-Foundations, 13Patch-For-Review: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928#10152552 (10elukey) ` sudo cookbook sre.ganeti.makevm --os bookworm --network private -p 7 --cluster codfw --group B --memory 6 --vcpus 2 --disk 20 registry2005 `
[11:34:17] 06serviceops, 10MW-on-K8s, 10wikitech.wikimedia.org: Communication for Wikitech/Wikimedia Developer Account migration - https://phabricator.wikimedia.org/T373615#10152624 (10jijiki)
[12:17:06] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10152792 (10akosiaris) Looking [back in time a bit](https://logstash.wikimedia.org/goto/65381bdae8cad296c81a7ccd9b453e46) this isn...
[13:11:53] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10153052 (10hashar) I filed the task after the first deploy of the day, so maybe something was invalidated over night such as the...
[13:28:22] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153112 (10JMeybohm) 05Open→03Resolved a:03JMeybo...
[13:52:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153208 (10dcausse) 05Resolved→03Open Thanks for looki...
[15:05:47] 06serviceops, 06Content-Transform-Team-WIP, 10Page Content Service, 10RESTBase Sunsetting, and 2 others: hewiki: Use backing node service instead of RESTBase on pregeneration changeprop rules - https://phabricator.wikimedia.org/T372749#10153657 (10Jgiannelos)
[15:34:52] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10153807 (10akosiaris) >>! In T374907#10153052, @hashar wrote: > For the OCI image being pulled, scap has a step described as //d...
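For reference, a minimal sketch of the two deployment paths discussed above for the poolcounter2005 mediawiki-config change. The commands and flags are the ones quoted in the conversation; the exact behaviour and the form of the scap backport argument are assumptions that should be checked against the scap documentation and the MediaWiki On Kubernetes page linked above.

```
# Recommended path for a mediawiki-config change: per the discussion above,
# scap backport merges the change and syncs it everywhere, including the
# non-k8s targets (videoscalers, maintenance hosts) that a k8s-only sync skips.
# The change-number argument form is an assumption for illustration.
scap backport 1234567

# What was run here instead: a k8s-only sync with a forced full image rebuild.
# -D full_image_build:true is only needed for changes to the base images, and
# --k8s-only leaves videoscalers and maintenance hosts on the old config, so a
# follow-up scap run is required afterwards.
scap sync-world --k8s-only --k8s-confirm-diff -D full_image_build:true
```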
[15:38:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153826 (10JMeybohm) It does not feel right to bake the ex...
[15:54:14] hey folks, sorry for the late ping on this but was anyone able to depool the service-ops hosts in codfw racks D3 and D4?
[15:54:15] https://phabricator.wikimedia.org/T373103
[15:59:14] I am around but I have a meeting right now
[15:59:52] let me see if I can find someone
[16:01:11] akosiaris: thanks would be great if you can
[16:01:20] but if not all good we won't start anything just yet anyway
[16:02:46] topranks: here, let me take a look
[16:03:06] swfrench-wmf: thanks
[16:06:48] checking the backscroll, looks like we've previously concluded that the kafka-main hosts require no action, so not touching that
[16:09:08] yep, I think the main ones are the k8s hosts
[16:09:42] hnowlan: did you manage to depool maps2008?
[16:10:40] yep
[16:10:56] great, thanks : )
[16:11:35] hnowlan or others, this is like 10% of nodes in the cluster ... can we do that safely, and if so, how long should we space that out?
[16:13:47] that's quite a lot, but I'd say we can
[16:13:48] one sec
[16:14:24] we could stagger it perhaps guys
[16:15:55] I think we should be fine
[16:16:06] but if we could easily stagger it it would be great
[16:16:26] but we should be okay to do it in one go, there's spare capacity
[16:17:32] yeah, given the spare capacity in the cluster, it looks like we could lose 10% and not hit the ceiling during deployments
[16:18:03] it would be kinda close though
[16:18:19] topranks: how challenging would it be to split into, say, 2 groups?
[16:19:22] say, first group in the sheet up through mw2279 (inclusive)
[16:19:45] swfrench-wmf: my thoughts had been to split them based on what rack they are in, but they seem concentrated in rack d3 anyway
[16:20:22] alas, yeah that only shaves off 1 node :)
[16:20:41] but yes we could do mw2271 - mw2279 first, then you can repool those, depool the others, and we proceed?
[16:22:06] cool, so group 1 would be:
[16:22:06] kubernetes2022 (only D4 host) +
[16:22:06] kubernetes2046,kubernetes2047,kubernetes2056,mw2271,mw2272,mw2273,mw2274,mw2275,mw2276,mw2277,mw2278,mw2279
[16:23:12] topranks: if that sounds good to you, then I can get that started momentarily
[16:23:44] swfrench-wmf: fire ahead yeah, when they are depooled let me know we'll move them over
[16:25:47] great, starting
[16:26:14] swfrench-wmf: Jenn says mw2271-2277 were decommissioned already?
[16:26:29] (this may be my bad for not being aware and updating the list)
[16:26:41] ha, alright - let me check site.pp
[16:28:27] and yeah, a bunch of these no longer exist there
[16:28:38] ok
[16:28:50] let me sort out a way to cross-check these real quick
[16:30:14] also some of these are not k8s nodes
[16:34:00] swfrench-wmf: btw mw2282 is not in rack D4 at all - so we won't touch it today
[16:34:04] there was an error in netbox
[16:34:09] ack, thanks!
[16:38:25] okay, I think I have this sorted
[16:38:40] we now have 15 active k8s worker nodes, plus 2 jobrunners
[16:38:43] cool
[16:38:52] I'm happy to do all those at once
[16:39:08] cool that makes it simpler <3
[16:39:37] I'll get started on that now for real real this time
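As a rough illustration of the "spare capacity" estimate being made above (what share of the cluster the batch represents, and how much headroom the remaining nodes have), here is a minimal sketch. It assumes kubectl access to the codfw wikikube cluster and a working metrics-server for kubectl top; the batch size of 15 is simply the worker-node count mentioned above.

```
# How big a slice of the cluster is the batch?
total=$(kubectl get nodes --no-headers | wc -l)
batch=15   # the 15 active k8s worker nodes identified above (assumption)
echo "depooling ${batch}/${total} nodes (~$(( 100 * batch / total ))%)"

# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes

# Requests already allocated per node, i.e. the "ceiling" that transient
# excursions during deployments push against
kubectl describe nodes | grep -A 8 "Allocated resources:"
```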
[16:50:15] still running ... this takes a little while
[16:58:45] topranks: we're good to go - let me know when it's all-clear :)
[16:59:27] oh shit crossed wires we already started :(
[16:59:34] we were only slightly ahead of you
[17:00:06] sry I must have somehow misread your last msg
[17:01:55] swfrench-wmf: ok for better or worse all done now anyway
[17:02:26] ah, heh, it happens - yeah, sorry this is kind of a slow process to wait for pods to get rescheduled
[17:02:51] depending on which way you were working across the switch ports, it may very well have been totally covered :)
[17:03:09] topranks: just to confirm, you're all done and we're good to repool these hosts?
[17:03:36] swfrench-wmf: yep you're good to repool thanks
[17:03:48] awesome, thanks!
[17:03:55] thanks for all the help !
[17:04:24] hnowlan: if you're still around maps2008 can be repooled
[17:04:29] thanks!
[17:04:36] no problem! thanks for your patience with it taking a little bit to get the bookkeeping down
[17:15:48] swfrench-wmf: it'd be cool if there was a cookbook to depool a whole group of k8s nodes at once, or tell you that your batch is too big
[17:18:01] cdanis: yeah, while it's fun to come up with an ad-hoc estimate and all, it would be nice to make it clearer whether "this is fine" or not (and ideally an automated determination)
[17:30:19] for visibility, what I was doing was a very rough "this is x% of nodes in the cluster, and during deployments we have transient excursions to y% (CPU) allocated" kind of thing, but is probably fine as a zeroth order kinda thing
[17:38:04] yeah I did the same napkin math :)
[17:38:27] I thiiiiink kubectl drain --dry-run=server might even take into account pod disruption budgets?
[17:39:58] 06serviceops, 06Infrastructure-Foundations, 13Patch-For-Review: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928#10154433 (10elukey) 05Open→03Resolved a:03elukey VM is up and running :)
[17:40:39] 06serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10154449 (10elukey) registry2005 is now running Bookworm, up and running: ` elukey@puppetmaster1001:~$ sudo -i confctl select name=registry2005.codfw.wmnet get {"registry2005.codfw.wmnet": {"weight": 0, "pooled":...
[18:07:30] 06serviceops, 10CirrusSearch, 03Discovery-Search (Current work), 10MediaWiki-Platform-Team (Radar): PHP web requests running for multiple hours - https://phabricator.wikimedia.org/T374662#10154543 (10Krinkle) When the excimer timeout exception is thrown, MediaWiki replaces the response with an error page....
[18:39:47] 06serviceops, 10MW-on-K8s, 10wikitech.wikimedia.org, 13Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10154673 (10dancy) The updated Firefox add-on is available at https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/. The up...
[19:45:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10154909 (10dcausse) In the [[https://datatracker.ietf.org/...
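On the kubectl drain --dry-run=server idea from 17:38 above, a hedged sketch of what a pre-flight check for a batch depool could look like. The flags are standard kubectl ones; the node names are illustrative (taken from the host list earlier in the discussion), and whether a server-side dry run fully accounts for PodDisruptionBudgets is the open question raised in the chat.

```
# Nodes in the batch (illustrative names from the discussion above)
NODES="kubernetes2022 kubernetes2046 kubernetes2047 kubernetes2056"

for node in ${NODES}; do
  # Server-side dry run: eviction requests are evaluated by the API server
  # without actually evicting anything, so PDB conflicts may show up here.
  kubectl drain "${node}" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --dry-run=server
done

# If the dry run is clean: cordon the batch so nothing new is scheduled on it,
# drain for real, and depool/repool via the usual conftool workflow.
```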