[07:34:18] greetings
[08:13:14] I'm looking at outstanding alerts and the 'wikilabels' project seems to be gone, there are two puppet host certs expiring, I'll nuke all wikilabels puppet certs
[08:23:29] lol, the project has been gone since 2024 https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[09:03:35] thanks! yep, sometimes old certs stick around
[09:04:03] I think there was some cleanup cron or something? (not sure, just ringing bells)
[09:06:46] there is a script that can be run manually to clean those up
[09:06:55] (they should be getting deleted automatically as instances are deleted)
[09:13:50] hmmm... maybe very old instances did not autodelete certs?
[09:13:58] (I found a couple of very old certs last week too)
[09:17:31] morning
[09:18:26] could be, yeah, no autodeletion for some reason
[09:25:50] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/81
[09:26:16] let's keep an eye on it; if we see any 'newer' certs not being cleaned up we should double-check the deletion process
[09:26:41] agreed, MR LGTM
[09:33:24] hmm... I think I might have borked the toolforge bastion (ran a kubectl | ... command and it got unresponsive)
[09:33:45] feels like it should not have, though
[09:36:11] both bastions still work fine for me
[09:36:42] there's a per-user CPU limiter that can be pretty aggressive, so it probably only affects your user
[09:39:04] I think that might have been it, yep
[10:14:35] did anything change on tools-static-15 at 2026-02-23 22:00?
since then infra-tracing-nfs is failing to send data to loki, getting 400s
[10:14:50] all the other nfs tracing is working fine
[10:17:52] I think I might have found the issue in loki's logs
[10:18:20] "couldn't parse push request: couldn't parse labels: 1:36: parse error: invalid UTF-8 rune"
[10:18:23] weird
[10:19:00] yeah, seems like the script is just not giving up on trying to store that line
[10:19:23] it should probably print out the errors it gets to the local logs as well
[10:19:31] no, on the client side it does buffer, and so it's trying to send the same buffer
[10:19:35] I get the 400 from requests
[10:20:07] it must be a directory name or a username that causes this
[10:20:20] weird that it's not already proper utf8 when read by python, I'll dig
[10:26:49] pulling images from toolsbeta-harbor into lima-kilo seems slower than usual... can somebody double-check from their machine?
[10:26:56] e.g. "docker pull toolsbeta-harbor.wmcloud.org/toolforge/maintain-kubeusers:image-0.0.190-20260202141700-a8551eb7"
[10:40:47] I get > 10Mb/s
[10:41:14] dhinus: is it still happening?
[10:43:54] I started the pull 10 minutes ago and it hasn't finished yet :)
[10:45:17] hmm
[10:45:55] it seems more like 1kb/s on my end, I also tried outside of lima-kilo
[10:57:55] hmmm
[10:58:06] do you have any other place/internet connection?
[11:00:56] I'm trying now from a mobile hotspot
[11:06:55] yes, the mobile hotspot is fast... it's just my home connection :/
[11:09:11] :/
[11:09:14] not cool
[11:09:42] * dcaro lunch
[11:59:31] I managed to dump the streams variable and then restarted the unit, it's working fine now.
I'll try to add some checks in the code to skip this issue, but the most obvious culprit (the path) is already decoded in a safe way
[12:00:10] I'll add better logging on the post error and check the code more for any obvious place to look
[12:02:59] I'm curious about that, utf8 was very painful some years ago, but it should be nicer now xd
[12:03:30] yeah, I'm already doing path = Path(path_bytes.decode("utf-8", "replace"))
[12:03:44] and that's inside a try/except Exception anyway that logs/discards
[12:03:58] so I really don't see how the path itself can be the culprit
[12:04:49] the next option would be the username, but we resolve that in python via pwd.getpwuid(uid).pw_name, and on the static host there's just one, www-data :D
[12:06:02] also, the loki error is about labels, and that can only be the first or second level in the path, so a very limited number of possibilities
[12:43:36] hmm, it seems tools-puppetdb-2.tools.eqiad1.wikimedia.cloud is down?
[12:43:52] (lots of alerts about empty puppet resources, with the message that it failed to contact tools-puppetdb-2.tools.eqiad1.wikimedia.cloud in the logs)
[12:44:45] it is up, but puppetdb is logging database connection errors
[12:44:48] the puppetdb process has some kind of error
[12:44:49] yep
[12:44:55] trying to restart just in case
[12:45:37] the postgres service is running (postgres@15-main)
[12:46:18] it seems puppetdb was able to restart
[12:46:28] (no errors so far)
[12:46:49] (no successful runs either, might time out)
[12:47:37] toolsbeta also has the issue :/
[12:48:05] yep, puppet run timed out
[12:48:42] toolsbeta might be different (no error in puppetdb logs, but the log rotated, so no logs at all)
[12:49:49] Feb 24 11:47:47 toolsbeta-puppetdb-03 puppet-agent[4084872]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_puppetdb]/ensure) removed
[12:49:56] probably related
[12:50:09] Feb 24 11:47:42 toolsbeta-puppetdb-03 puppet-agent[4084872]: Applying configuration version '(c9b4bfc8b8) Muehlenhoff - puppetdb:
Drop firewall rule for access to Puppet 5 servers'
[12:50:11] xd
[12:51:43] this reverts the commit https://gerrit.wikimedia.org/r/c/operations/puppet/+/1243106
[12:52:00] but there might be a nicer solution, moritzm ^ maybe you have a better idea?
[12:52:49] (besides upgrading to the next puppet, though we probably should)
[12:54:48] there shouldn't be any Puppet 5 servers left in cloud?
[12:55:17] let's revert to unbreak this, but it sounds like some incorrect leftover cruft in the config
[12:55:27] the problem is `wmflib::role::hosts('puppetserver')`, the role name is slightly different in cloud
[12:55:49] yep, it's puppet 7
[12:56:33] good catch
[12:56:41] what is the role name for cloud instances? we can also fix up the access for this role instead
[12:57:43] role::puppetserver::cloud_vps_project
[12:58:02] +1 for the nicer fix
[12:58:18] taavi: got a meeting in 2 min, are you sending the fix?
[12:58:26] (I cherry-picked the revert in tools for now)
[12:58:45] hmm, that might not be enough, we need puppetdb up to restore puppetdb xd
[12:58:58] I'm making a patch
[12:59:03] ack, thanks
[13:14:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1243113 should fix it, currently running PCC over it
[13:29:21] PCC failed for cloud, but that's kinda expected I guess. I'm merging this now
[13:34:46] patch is merged, sorry for the disruption, it had escaped me that cloud VPS uses a different role name than prod
[14:04:47] thanks!
[14:36:03] to unbreak you could "systemctl stop ferm" on the tools puppetdb host, then run Puppet and then re-start ferm
[14:40:09] oh awesome, I'll take notes for next time, I was not sure how to allow the port (did not want to manually mess with nftables xd)
[14:47:04] I manually fixed them
[15:07:52] thanks!
[15:08:06] did you do the ferm stop/ferm start? or did you use a different process?
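[editor's note] The unbreak procedure described at 14:36:03 above, spelled out as an ops sketch. This is an assumption-laden illustration, not a documented runbook: the `run-puppet-agent` wrapper name is an assumption, and a plain `puppet agent --test` should do the same.

```shell
# On the affected host (e.g. tools-puppetdb-2):
# 1. Stop ferm so the stale firewall rules no longer block the agent's
#    connection to the Puppet server.
sudo systemctl stop ferm

# 2. Run the agent; with the firewall out of the way it can fetch and
#    apply the catalog that contains the corrected ferm rules.
sudo run-puppet-agent   # assumption: wrapper name; `puppet agent --test` also works

# 3. Bring ferm back up, now enforcing the regenerated, correct ruleset.
sudo systemctl start ferm
```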
[15:08:10] * dcaro curious
[15:15:09] I just manually edited one of the other ferm files for that port to also include the puppetserver's address
[15:19:11] ack, thanks!
[15:20:22] fyi: they are going to cut the electricity in my building in a few minutes for ~30min max, so I'll disconnect suddenly at some point xd, and will be back a while after
[15:49:23] I've sent https://gerrit.wikimedia.org/r/c/operations/puppet/+/1243151 for the infra-tracing client errors, for better visibility and to avoid it getting stuck in the future
[15:49:51] I can't backfill the data because it's past the currently allowed timeframe for backfilling
[15:50:38] out of curiosity, when I have time I will try to replay the same data that I dumped on toolsbeta with fake timestamps, to see if I can pinpoint something (but I'm afraid the encoding issue might not be present in the dumped data)
[15:57:28] I'm back :)
[16:52:44] Raymond_Ndibe: hmm, I found that in tools-harbor, the logs for individual projects (e.g. pull logs) do not seem to be getting cleaned up (e.g. https://tools-harbor.wmcloud.org/harbor/projects/2918/logs)
[16:52:47] fyi
[17:00:06] does anyone have opinions on how to install the Gateway API CRDs in Toolforge? upstream distributes them just as a yaml file; should I throw these in a Helm chart like we do for Calico, or just install them directly in the clusters and document some processes around that?
[17:04:07] for most things we do it by hand I think; I'm open to doing something nicer if you have better ideas, otherwise by hand + documentation seems good enough
[17:04:45] I have not checked whether helm has "fixed" being able to install CRDs in a reliable way, I hope it gets done at some point
[18:19:38] * dcaro off
[18:19:41] cya tomorrow!
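[editor's note] As a footnote to the infra-tracing-nfs thread above, a minimal sketch of the kind of defensive decoding being discussed. All function and label names here are hypothetical, not the actual infra-tracing code; the point is that `decode(..., "replace")` always yields valid UTF-8 (invalid bytes become U+FFFD), so an "invalid UTF-8 rune" error from Loki must come from somewhere else, and that stripping non-printable characters on top removes the other obvious candidates a label parser could choke on.

```python
import pwd
from pathlib import Path


def safe_label(raw: bytes) -> str:
    """Decode bytes defensively for use as a Loki label value.

    'replace' swaps invalid byte sequences for U+FFFD, so the result
    is always valid UTF-8; control characters are dropped on top of
    that, since a stray NUL or escape can still upset a parser.
    """
    text = raw.decode("utf-8", errors="replace")
    return "".join(ch for ch in text if ch.isprintable())


def labels_for(path_bytes: bytes, uid: int) -> dict[str, str]:
    """Build a hypothetical label set, loosely following the chat:
    the first path component and the resolved username become labels."""
    path = Path(safe_label(path_bytes))
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        user = str(uid)  # fall back to the numeric uid if unresolvable
    top = path.parts[0] if path.parts else "unknown"
    return {"top_dir": safe_label(top.encode()), "user": safe_label(user.encode())}
```

Replaying the dumped streams through a helper like this (with fake timestamps, as suggested at 15:50:38) would show whether the offending value survives the dump at all.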