[07:58:53] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907 (10hashar) 03NEW
[08:57:02] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10151941 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm
[09:05:03] hello folks
[09:05:26] during the mw infra window I'd like to deploy the mw-config change to introduce poolcounter2005
[09:05:42] I checked on a pod in mw-web that the IP:7531 combination can be reached
[09:06:10] if you are ok with me going forward, what is the best command to use?
[09:21:23] 06serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10152027 (10JMeybohm) >>! In T332016#10148380, @elukey wrote: > @JMeybohm I can take care of the new VMs, but I have a doubt - if I create `registry2005`, will it be read/write? Just to understand your comment abo...
[09:25:43] 06serviceops, 06Infrastructure-Foundations, 10Prod-Kubernetes, 07Kubernetes, 13Patch-For-Review: Race condition in iptables rules during puppet runs on k8s nodes - https://phabricator.wikimedia.org/T374366#10152054 (10JMeybohm) With [[ https://gerrit.wikimedia.org/r/1073233 | 1073233 ]] merged, ferm is c...
[09:26:42] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152056 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host chartmuseum1001.eqiad.wmnet with OS bookworm completed: - chartmuseum1001 (**PASS**)...
[09:29:20] elukey: AIUI an mw-config change requires an image rebuild, so a regular scap run should be the way to go
[10:05:42] jayme: ack, so basically merge + scap sync-world --k8s-only --k8s-confirm-diff -D full_image_build:true
[10:12:22] elukey: scap backport
[10:12:53] claime: yeah too late, I've already started that sigh
[10:12:56] heh
[10:13:07] TIL for the next time
[10:13:12] If you do a scap k8s-only, videoscalers won't have the config
[10:13:45] And you don't need a full build, that's for changes to the base images
[10:14:29] I mistakenly assumed that mw-config would have been added to base images, but it doesn't make much sense yes
[10:14:55] I asked around and decided to do it anyway, and failed
[10:15:19] I think you'll need another scap run to deploy to the videoscalers and maintenance hosts etc.
[10:15:19] at this point I'll wait for it to finish, then I'll do a scap backport
[10:15:24] yeah
[10:15:47] sorry for the extra deployment :(
[10:16:10] No worries
[10:16:28] Do you think the doc should be clarified?
[10:17:40] if you mean to be Luca-proof, I'd add a reference to scap backport in there (like "you are probably looking for this one instead of a full rebuild, don't use it")
[10:17:53] Fair
[10:18:24] but probably only for the far left / right of the Bell curve, where I am sitting at the moment
[10:23:41] https://wikitech.wikimedia.org/w/index.php?title=MediaWiki_On_Kubernetes&diff=2227697&oldid=2203916
[10:24:19] <3
[10:24:33] yes I think at this point I may be able to avoid stupid deployments in the future
[10:24:36] "may"
[10:24:52] Hehe
[11:06:47] 06serviceops, 06SRE, 13Patch-For-Review: Migrate chartmuseum to Bookworm - https://phabricator.wikimedia.org/T331969#10152511 (10elukey) Last step https://gerrit.wikimedia.org/r/c/integration/config/+/1073426
[11:10:42] 06serviceops, 06Infrastructure-Foundations: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928 (10elukey) 03NEW
[11:14:36] 06serviceops, 06Infrastructure-Foundations, 13Patch-For-Review: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928#10152552 (10elukey) ` sudo cookbook sre.ganeti.makevm --os bookworm --network private -p 7 --cluster codfw --group B --memory 6 --vcpus 2 --disk 20 registry2005 `
[11:34:17] 06serviceops, 10MW-on-K8s, 10wikitech.wikimedia.org: Communication for Wikitech/Wikimedia Developer Account migration - https://phabricator.wikimedia.org/T373615#10152624 (10jijiki)
[12:17:06] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10152792 (10akosiaris) Looking [back in time a bit](https://logstash.wikimedia.org/goto/65381bdae8cad296c81a7ccd9b453e46) this isn...
[13:11:53] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10153052 (10hashar) I filed the task after the first deploy of the day, so maybe something was invalidated over night such as the...
[13:28:22] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153112 (10JMeybohm) 05Open→03Resolved a:03JMeybo...
[13:52:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153208 (10dcausse) 05Resolved→03Open Thanks for looki...
[15:05:47] 06serviceops, 06Content-Transform-Team-WIP, 10Page Content Service, 10RESTBase Sunsetting, and 2 others: hewiki: Use backing node service instead of RESTBase on pregeneration changeprop rules - https://phabricator.wikimedia.org/T372749#10153657 (10Jgiannelos)
[15:34:52] 06serviceops, 10Deployments, 06Release-Engineering-Team: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907#10153807 (10akosiaris) >>! In T374907#10153052, @hashar wrote: > For the OCI image being pulled, scap has a step described as //d...
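For reference, a minimal sketch of the two deployment paths discussed above for the poolcounter2005 mediawiki-config change. The commands and flags are the ones quoted in the conversation; the exact behaviour and the form of the scap backport argument are assumptions that should be checked against the scap documentation and the MediaWiki On Kubernetes page linked above.

```
# Recommended path for a mediawiki-config change: per the discussion above,
# scap backport merges the change and syncs it everywhere, including the
# non-k8s targets (videoscalers, maintenance hosts) that a k8s-only sync skips.
# The change-number argument form is an assumption for illustration.
scap backport 1234567

# What was run here instead: a k8s-only sync with a forced full image rebuild.
# -D full_image_build:true is only needed for changes to the base images, and
# --k8s-only leaves videoscalers and maintenance hosts on the old config, so a
# follow-up scap run is required afterwards.
scap sync-world --k8s-only --k8s-confirm-diff -D full_image_build:true
```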
[15:38:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10153826 (10JMeybohm) It does not feel right to bake the ex...
[15:54:14] hey folks, sorry for the late ping on this but was anyone able to depool the service-ops hosts in codfw racks D3 and D4?
[15:54:15] https://phabricator.wikimedia.org/T373103
[15:59:14] I am around but I have a meeting right now
[15:59:52] let me see if I can find someone
[16:01:11] akosiaris: thanks would be great if you can
[16:01:20] but if not all good we won't start anything just yet anyway
[16:02:46] topranks: here, let me take a look
[16:03:06] swfrench-wmf: thanks
[16:06:48] checking the backscroll, looks like we've previously concluded that the kafka-main hosts require no action, so not touching that
[16:09:08] yep, I think the main ones are the k8s hosts
[16:09:42] hnowlan: did you manage to depool maps2008?
[16:10:40] yep
[16:10:56] great, thanks : )
[16:11:35] hnowlan or others, this is like 10% of nodes in the cluster ... can we do that safely, and if so, how long should we space that out?
[16:13:47] that's quite a lot, but I'd say we can
[16:13:48] one sec
[16:14:24] we could stagger it perhaps guys
[16:15:55] I think we should be fine
[16:16:06] but if we could easily stagger it it would be great
[16:16:26] but we should be okay to do it in one go, there's spare capacity
[16:17:32] yeah, given the spare capacity in the cluster, it looks like we could lose 10% and not hit the ceiling during deployments
[16:18:03] it would be kinda close though
[16:18:19] topranks: how challenging would it be to split into, say, 2 groups?
[16:19:22] say, first group in the sheet up through mw2279 (inclusive)
[16:19:45] swfrench-wmf: my thoughts had been to split them based on what rack they are in, but they seem concentrated in rack d3 anyway
[16:20:22] alas, yeah that only shaves off 1 node :)
[16:20:41] but yes we could do mw2271 - mw2279 first, then you can repool those, depool the others, and we proceed?
[16:22:06] cool, so group 1 would be:
[16:22:06] kubernetes2022 (only D4 host) +
[16:22:06] kubernetes2046,kubernetes2047,kubernetes2056,mw2271,mw2272,mw2273,mw2274,mw2275,mw2276,mw2277,mw2278,mw2279
[16:23:12] topranks: if that sounds good to you, then I can get that started momentarily
[16:23:44] swfrench-wmf: fire ahead yeah, when they are depooled let me know we'll move them over
[16:25:47] great, starting
[16:26:14] swfrench-wmf: Jenn says mw2271-2277 were decommissioned already?
[16:26:29] (this may be my bad for not being aware and updating the list)
[16:26:41] ha, alright - let me check site.pp
[16:28:27] and yeah, a bunch of these no longer exist there
[16:28:38] ok
[16:28:50] let me sort out a way to cross-check these real quick
[16:30:14] also some of these are not k8s nodes
[16:34:00] swfrench-wmf: btw mw2282 is not in rack D4 at all - so we won't touch it today
[16:34:04] there was an error in netbox
[16:34:09] ack, thanks!
[16:38:25] okay, I think I have this sorted
[16:38:40] we now have 15 active k8s worker nodes, plus 2 jobrunners
[16:38:43] cool
[16:38:52] I'm happy to do all those at once
[16:39:08] cool that makes it simpler <3
[16:39:37] I'll get started on that now for real real this time
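As a rough illustration of the "spare capacity" estimate being made above (what share of the cluster the batch represents, and how much headroom the remaining nodes have), here is a minimal sketch. It assumes kubectl access to the codfw wikikube cluster and a working metrics-server for kubectl top; the batch size of 15 is simply the worker-node count mentioned above.

```
# How big a slice of the cluster is the batch?
total=$(kubectl get nodes --no-headers | wc -l)
batch=15   # the 15 active k8s worker nodes identified above (assumption)
echo "depooling ${batch}/${total} nodes (~$(( 100 * batch / total ))%)"

# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes

# Requests already allocated per node, i.e. the "ceiling" that transient
# excursions during deployments push against
kubectl describe nodes | grep -A 8 "Allocated resources:"
```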
[16:50:15] still running ... this takes a little while
[16:58:45] topranks: we're good to go - let me know when it's all-clear :)
[16:59:27] oh shit crossed wires we already started :(
[16:59:34] we were only slightly ahead of you
[17:00:06] sry I must have somehow misread your last msg
[17:01:55] swfrench-wmf: ok for better or worse all done now anyway
[17:02:26] ah, heh, it happens - yeah, sorry this is kind of a slow process to wait for pods to get rescheduled
[17:02:51] depending on which way you were working across the switch ports, it may very well have been totally covered :)
[17:03:09] topranks: just to confirm, you're all done and we're good to repool these hosts?
[17:03:36] swfrench-wmf: yep you're good to repool thanks
[17:03:48] awesome, thanks!
[17:03:55] thanks for all the help !
[17:04:24] hnowlan: if you're still around maps2008 can be repooled
[17:04:29] thanks!
[17:04:36] no problem! thanks for your patience with it taking a little bit to get the bookkeeping down
[17:15:48] swfrench-wmf: it'd be cool if there was a cookbook to depool a whole group of k8s nodes at once, or tell you that your batch is too big
[17:18:01] cdanis: yeah, while it's fun to come up with an ad-hoc estimate and all, it would be nice to make it clearer whether "this is fine" or not (and ideally an automated determination)
[17:30:19] for visibility, what I was doing was a very rough "this is x% of nodes in the cluster, and during deployments we have transient excursions to y% (CPU) allocated" kind of thing, but is probably fine as a zeroth order kinda thing
[17:38:04] yeah I did the same napkin math :)
[17:38:27] I thiiiiink kubectl drain --dry-run=server might even take into account pod disruption budgets?
[17:39:58] 06serviceops, 06Infrastructure-Foundations, 13Patch-For-Review: codfw: 1 new VM for docker-registry - https://phabricator.wikimedia.org/T374928#10154433 (10elukey) 05Open→03Resolved a:03elukey VM is up and running :)
[17:40:39] 06serviceops: Migrate docker registry hosts to bookworm - https://phabricator.wikimedia.org/T332016#10154449 (10elukey) registry2005 is now running Bookworm, up and running: ` elukey@puppetmaster1001:~$ sudo -i confctl select name=registry2005.codfw.wmnet get {"registry2005.codfw.wmnet": {"weight": 0, "pooled":...
[18:07:30] 06serviceops, 10CirrusSearch, 03Discovery-Search (Current work), 10MediaWiki-Platform-Team (Radar): PHP web requests running for multiple hours - https://phabricator.wikimedia.org/T374662#10154543 (10Krinkle) When the excimer timeout exception is thrown, MediaWiki replaces the response with an error page....
[18:39:47] 06serviceops, 10MW-on-K8s, 10wikitech.wikimedia.org, 13Patch-For-Review: MVP: Privately serve wikitech via mwdebug1001 - https://phabricator.wikimedia.org/T371537#10154673 (10dancy) The updated Firefox add-on is available at https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/. The up...
[19:45:01] 06serviceops, 10Prod-Kubernetes, 03Discovery-Search (Current work), 07Kubernetes: Use kafka-main-[eqiad|codfw].external-services.svc.cluster.local to discover kafka brokers in kafka client running in k8s - https://phabricator.wikimedia.org/T374729#10154909 (10dcausse) In the [[https://datatracker.ietf.org/...
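On the kubectl drain --dry-run=server idea from 17:38 above, a hedged sketch of what a pre-flight check for a batch depool could look like. The flags are standard kubectl ones; the node names are illustrative (taken from the host list earlier in the discussion), and whether a server-side dry run fully accounts for PodDisruptionBudgets is the open question raised in the chat.

```
# Nodes in the batch (illustrative names from the discussion above)
NODES="kubernetes2022 kubernetes2046 kubernetes2047 kubernetes2056"

for node in ${NODES}; do
  # Server-side dry run: eviction requests are evaluated by the API server
  # without actually evicting anything, so PDB conflicts may show up here.
  kubectl drain "${node}" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --dry-run=server
done

# If the dry run is clean: cordon the batch so nothing new is scheduled on it,
# drain for real, and depool/repool via the usual conftool workflow.
```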