[01:28:11] https://www.facebook.com/160982553987911/posts/4786081928144594/?mibextid=rS40aB7S9Ucbxw6v
[01:33:51] https://www.facebook.com/WWW.TECHYBLAZE.C0M?mibextid=rS40aB7S9Ucbxw6v
[01:54:32] Elon Musk reacts to 1952 German manuscript connecting his name to an uncanny prediction about Mars
[01:54:34] https://www.uniladtech.com/science/space/elon-musk-reacts-to-1952-german-manuscript-connecting-him-to-mars-517196-20250127
[13:16:46] !log admin manual failover of cloudgw1004 to cloudgw1003 T382356
[13:16:56] !log admin test log
[13:26:40] !log admin manual failover of cloudgw1004 to cloudgw1003 T382356
[13:26:48] !log admin manual failover of cloudgw1004 to cloudgw1003 T382356
[13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:26:55] T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev - https://phabricator.wikimedia.org/T382356
[15:54:59] !log catalyst-dev Promoted jnuche from reader to member
[15:55:01] bd808: Unknown project "catalyst-dev"
[15:56:00] ugh. I bet that is the UUID bug that I haven't merged the patch for yet...
[15:56:29] yup
[15:56:50] !log 7209100e0e744a4fbdf447534d4eb825 Promoted jnuche from reader to member
[15:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:7209100e0e744a4fbdf447534d4eb825/SAL
[15:57:09] !log 7209100e0e744a4fbdf447534d4eb825 Promoted thcipriani from reader to member
[15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:7209100e0e744a4fbdf447534d4eb825/SAL
[15:57:19] :-S
[17:18:05] Hello cloud folks! I have a fuzzy question. I see that there's a DNS name `puppet` accessible from any project which maps to 172.16.7.124. What service is listening on that address and what does it do? Does it have magic that redirects to the project-local puppetmaster? If so, what happens if there are two VMs with the role::puppetserver::cloud_vps_project role (as is the case in `devtools` at the moment)?
[17:20:06] dancy: we run a cloud-wide puppet server for most projects that don't need a bespoke server. That DNS entry will always refer to that server.
[17:20:48] So, nothing magic or complicated.
[17:27:40] Ah ok.. so for a project w/ a custom puppetmaster, using "puppet" as the name of the puppet server is incorrect.
[17:28:34] something like {project}-puppetserver01 would be recommended
[17:29:11] If you put the FQDN of your puppetserver in the `puppetmaster` hiera setting that should work too though
[17:29:33] https://wikitech.wikimedia.org/wiki/Help:Project_puppetserver#Step_2:_Setup_a_puppet_client
[17:29:38] the "puppet" thing used to work (in prod) but then things changed with the puppetmaster-to-puppetserver migration, afaict
[17:29:45] this might have affected cloud too
[17:29:59] dancy: hm, if you call it 'puppet' I'm not sure what happens -- I suspect that it'll pick your local one by preference but I wouldn't recommend counting on it.
[17:30:16] dancy: i think this is "used to be a thing but not anymore", roughly
[17:31:05] even if you did have a local server named 'puppet', the first run of the VM will still use the central puppetserver since resolv.conf is likely not set up until after that first run
[17:31:20] Gotcha. Thanks everyone. I was debugging failed puppet run emails from devtools and I found that puppetmaster-1003 had `puppetmaster: puppet` in its puppet config
[17:31:30] it's probably there because production had that
[17:31:47] puppetmaster-1004 also has the bad setting.
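A quick illustration of the behaviour described above, assuming a generic Cloud VPS client VM; the devtools FQDN below is a made-up example, and where exactly the `puppetmaster` Hiera key lives depends on how the project is configured in Horizon:

    # Confirm what the shared alias resolves to from inside any project VM:
    dig +short puppet
    # -> 172.16.7.124, the cloud-wide puppetserver described above
    # A project running its own puppetserver should point clients at it by FQDN,
    # e.g. via project/prefix Hiera (hostname is hypothetical):
    #   puppetmaster: devtools-puppetserver-01.devtools.eqiad1.wikimedia.cloud
    # Check which server an agent is actually configured to talk to:
    sudo puppet config print server --section agent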
[17:32:03] dancy: that would be typical of a project puppetserver that itself gets config from the central puppetservers
[17:32:14] dancy: that's on purpose though. You don't want a puppetserver to manage itself in most cases.
[17:32:14] I fixed puppetmaster-1003 and refreshed the CA and re-signed all client certs.
[17:32:35] So typically there's a cascade of control. The central puppetserver manages the project-local puppetserver, which manages the other VMs in the project.
[17:32:51] ah, interesting. I may need to make some tweaks then.
[17:33:01] If you have a project-wide override of puppetmaster: then you need to override /that/ for the puppetserver itself in VM-specific config
[17:33:33] dancy: oh, you are trying to fix the puppet runs in devtools.. yea.. that is a deep rabbit hole
[17:33:39] I have tried that before
[17:33:50] and it also had to do with rebuilding the puppetmaster
[17:33:54] and expired certs
[17:34:12] once I gave up on that I created a new local puppetmaster and ran into another issue
[17:34:41] dancy: if you REALLY want to.. first let's see which instance is even using the local puppetmaster and which is not
[17:35:11] then let's question if there is a reason to use the local master at all for this specific one
[17:35:26] https://www.irccloud.com/pastebin/18yfXCbR/
[17:35:26] since the easiest fix is to not use a local one
[17:35:40] I have no idea why a local puppetmaster is used in this project.
[17:35:46] yea, every single instance is a bit of a different story
[17:36:03] rm -r /devtools?!
[17:36:12] presumably to have local secrets
[17:36:20] ah, that makes sense.
[17:36:26] are you interested in a specific instance more than another?
[17:36:39] gitlab-runners?
[17:37:01] how about you just try to fix the gitlab-runner test instance and ignore the rest and leave those for me, heh
[17:37:05] My goal was to get rid of the emails that I was getting every day about failed puppet runs. I think I've now eliminated all of them except for puppetmaster-1004.
[17:37:26] you actively fixed some puppet runs?
[17:37:32] or just deleted the mails, heh
[17:37:40] I also get those and yea, thanks for looking!
[17:37:54] I have an open ticket for that and thought nobody else cared
[17:38:11] I fixed the CA state and re-signed the client certs.
[17:38:15] and that puppetmaster setup was a bit frustrating
[17:38:22] oooh, hell yea :)
[17:38:36] is it still working now? because at some point I also had it fixed
[17:38:39] and later it was broken again
[17:38:44] checks :) yay
[17:38:52] I just got it cleaned up today so we'll see tomorrow, I think.
[17:38:59] thanks a lot!
[17:39:08] run-puppet-agent works twice in a row on the previously offending nodes
[17:39:17] ok, I will keep an eye on those mails.
[17:39:20] I appreciate this
[17:39:43] Glad someone other than me will benefit from it!
[17:40:07] to clarify, the fix was all about the CA on the local master/server, right
[17:40:12] thx dancy, sorry for the confusing setup
[17:40:22] no "change which master an instance or the project uses"
[17:40:37] feel free to update the docs if you feel adequately enlightened :)
[17:40:51] because then I will still discuss with others if we can stop using the local master
[17:41:06] we already reduced the number of instances using it at some point
[17:41:33] mutante: Correct. All on the CA/puppetserver side
[17:41:33] one thing is secrets for gitlab-runner but we could do that in a different way; already chatted with J.elto
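The log doesn't show the exact commands behind "fixed the CA state and re-signed the client certs", so this is only a hedged sketch of that kind of cleanup on a Puppet 6+ project puppetserver; the client hostname is a made-up example and the ssldir path varies by packaging:

    # On the project-local puppetserver: inspect and clear out the stale client cert
    sudo puppetserver ca list --all
    sudo puppetserver ca clean --certname gitlab-runner-test.devtools.eqiad1.wikimedia.cloud
    # On the client: wipe its old certs so the next run requests a fresh one
    sudo rm -rf "$(sudo puppet agent --configprint ssldir)"
    sudo run-puppet-agent    # the wrapper mentioned above; plain `puppet agent --test` also works
    # Back on the puppetserver: sign the new request (unless autosigning is enabled)
    sudo puppetserver ca sign --certname gitlab-runner-test.devtools.eqiad1.wikimedia.cloud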
[17:41:49] Awesome
[17:41:52] so if we all agree we don't really need the local master.. we will just remove it
[17:42:16] ok, it all sounds like progress. tbc :)
[17:42:38] Thanks again everyone for the info. That was helpful
[17:43:04] dancy: you basically resolved https://phabricator.wikimedia.org/T382960
[17:44:30] (not just rm -rf devtools, btw. having test instances is still good and there are some use cases, even if not that many)
[17:44:49] Nod. I was just being silly.
[17:45:00] ;) thanks again, and cu later
[19:57:36] is there an easy way to track down the horizon project for https://apt-browser.toolforge.org/ ? It's been down a couple of days. I pinged legoktm but he hasn't responded yet. Thought I might peek in and see if there was something obvious to fix
[19:59:31] if it’s toolforge (rather than cloud vps) then https://toolsadmin.wikimedia.org/tools/id/apt-browser should be correct
[20:02:20] I don’t know what’s wrong with it, the webservice seems to be running
[20:02:55] the kubectl logs show some /healthz responses, but no new ones when I attach with --follow, so who knows how old those /healthz entries in the logs are
[20:03:36] aha, `kubectl logs --timestamps` is what I want. the /healthz responses are from two days ago
[20:03:46] ah, the pod is in “killing” status
[20:03:55] but I guess something on the k8s side is stuck
[20:04:05] inflatador: if it's a toolforge tool, it should appear in https://hay.toolforge.org/directory/ | if it's a cloud vps project, it should appear in https://openstack-browser.toolforge.org/project/
[20:04:21] the toolforge.org URL strongly implies it's the first
[20:04:36] /me peeks at tools-k8s-worker-nfs-37
[20:04:37] though I guess technically you could click together a proxy for that in cloud vps
[20:05:21] also: the regular wikitech wiki search. if it's a project then there are always some auto-generated pages about it
[20:05:45] I see a handful of processes in D state, might be the usual NFS stuff?
[20:05:49] though not *that* many processes tbh
[20:06:19] but one of the processes is from apt-browser (`tee /data/project/apt-browser/logs/rocket.log`)
[20:07:25] mutante ah, thanks for that directory link! And lucaswerkmeister thanks for checking k8s
[20:07:41] I tried sudo kill -9’ing it but it doesn’t seem to have done anything (which is not surprising if it is indeed stuck in NFS)
[20:07:53] let’s see if I can find the docs for what to do with a worker in this state
[20:08:00] though IIRC I don’t have enough permissions to do the needful
[20:09:01] yeah, I can’t run cookbooks https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses
[20:09:25] (though an `ls` still worked, so the server isn’t totally dead… not sure what the right thing to do then is)
[20:09:45] How'd you figure out which worker the service 'lives' on? Or even that it was hosted in toolforge k8s?
[20:09:56] (sorry, new to this whole thing)
[20:10:07] I ran `sudo become apt-browser` with my toolforge root powers ;)
[20:10:15] and then kubectl get pods, kubectl get deployments, etc.
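To make those steps concrete, a rough sketch of the triage commands described above, run after `become`-ing the tool; the pod name is the one from this incident and will differ for other tools:

    kubectl get pods                                        # overall pod status in the tool's namespace
    kubectl get deployments
    kubectl logs --timestamps apt-browser-85d46f5946-c6ptd | tail -n 20   # how stale is the newest log line?
    kubectl get pod apt-browser-85d46f5946-c6ptd -o wide    # the NODE column names the worker it runs on
    kubectl describe pod apt-browser-85d46f5946-c6ptd       # recent events and restart counts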
[20:10:28] the worker is buried somewhere in the output of `kubectl get pod apt-browser-85d46f5946-c6ptd -o yaml | less`
[20:10:57] though you can also see that particular detail without privileges at https://k8s-status.toolforge.org/namespaces/tool-apt-browser/pods/apt-browser-85d46f5946-c6ptd/
[20:11:21] and I think any toolforge member can SSH into the toolforge nodes in theory (they just can’t do very much in them, except stare at the “no user-serviceable parts inside” banner that gets printed on login ^^)
[20:11:54] `tools-k8s-worker-nfs-37.tools.eqiad1.wikimedia.cloud` is the full hostname for SSH purposes
[20:11:58] can I help?
[20:12:08] I'm briefly online
[20:12:28] you probably can
[20:12:34] *some* processes in tools-k8s-worker-nfs-37 are stuck in D state
[20:12:41] though others still seem to be running happily
[20:13:08] `apt-browser-85d46f5946-c6ptd` pod in the apt-browser tool if you want to look at it from the k8s side
[20:13:28] or PID 3029 on that worker
[20:13:47] ok, thanks for the pointer, let me see
[20:14:19] thanks
[20:14:47] it won't let me ssh
[20:15:20] did y'all poke anything? It appears to be back up
[20:15:37] huh, I’m inside… load average 7/9/10, which is a bit high but shouldn’t prevent ssh
[20:15:47] but `ps aux | awk '$8=="D"'` output suddenly became empty
[20:16:25] I didn’t do anything that should have an effect since that kill -9 I mentioned earlier
[20:16:51] I guess it went through after all, the process seems to be gone now
[20:17:00] at which point k8s restarted the pod
[20:17:04] so… yay?
[20:17:32] {◕ ◡ ◕}
[20:17:43] You get a “yay” from me anyway ;)
[20:17:46] wondering if I should !log any of this ^^
[20:18:38] I did nothing
[20:18:45] I think the VM had high memory usage
[20:18:55] and the OOM killer triggered
[20:19:03] ah, okay
[20:19:05] at the same time, the VM network is flapping
[20:19:28] https://www.irccloud.com/pastebin/XaXcn253/
[20:19:52] Flapping network + NFS? Sounds “fun” ;P
[20:20:24] yeah, I'll reboot the VM anyway, just to clear the ugly kernel errors
[20:20:38] load avg is back to 1/4/7 now btw
[20:20:45] but yeah ok ^^
[20:21:00] (and 11Gi available according to free -h)
[20:21:04] @lucaswerkmeister load avg is expected to go high if IO is somehow blocked
[20:27:44] * arturo offline again
[20:27:50] thanks!
[21:26:29] inflatador, mutante: Hay's Directory is a partial listing of some tools that have very explicitly opted in to being listed there. https://toolhub.wikimedia.org/ has everything that Hay's has plus much more.
[21:27:05] https://toolsadmin.wikimedia.org/tools/ is the canonical Toolforge resource because it includes all Toolforge tools even if they have never had a toolinfo.json record created.
[21:34:35] * inflatador adds to notes
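For reference, a hedged sketch of the node-level checks that came up in this incident, run directly on the worker; the dmesg/free/uptime checks are common follow-ups rather than commands shown in the log:

    ps aux | awk '$8=="D"'       # processes stuck in uninterruptible (often NFS) I/O
    uptime                       # load average climbs when I/O is blocked, even with idle CPUs
    free -h                      # memory pressure that could have woken the OOM killer
    sudo dmesg -T | tail -n 50   # look for oom-killer and NFS "server not responding" messages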