[07:56:50] hello! arturo is it expected that 208.80.153.41 (cloudweb) has port 8080 open to the world? (seems like a java process)
[07:57:09] from a diffscan email
[08:01:42] XioNoX: I don't know. I don't usually work with that server and I have not changed anything lately. I can double check though
[08:02:02] morning!
[08:02:51] I think that port is striker
[08:03:12] https://www.irccloud.com/pastebin/KQNXTgsk/
[08:08:31] going through my email backlog, that was sent 13 days ago, about cloudweb2002-dev
[08:10:01] oh, in cloudweb2002-dev it's a tomcat running there it seems
[08:10:08] it's CAS, the idp
[08:10:27] probably slyngs and andrewbogot.t doing tests?
[08:11:13] they would have needed external connectivity to test the idp flows for horizon, not sure what's the status (if it's still needed, or can be stopped), maybe slyngs knows better ^
[08:11:47] that rings a bell. I think they deployed a separate idp for testing the horizon integration?
[08:13:22] cool thx for your quick answer, let's see what slyngs says, nothing to do if it's expected, just keeping an eye on new ports opened to the world :)
[08:13:40] 👍
[08:45:43] Ah, probably shouldn't be open to the world, just the load balancer
[08:52:37] brb
[09:16:32] There were a lot of emails about fourohfour being down on tools since last week, anyone looked into it?
[09:16:36] (flapping essentially)
[09:18:01] I didn't
[09:20:20] XioNoX: when you have a moment, could you please ACK the plans on T187929 so I can move forward?
[09:20:21] T187929: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929
[09:27:21] arturo: yeah that looks good to me! left a comment, I don't think it was waiting for my approval as it was "my" suggestion, but thanks for making progress on v6!
[09:28:20] XioNoX: thanks for double checking. I'll then move on to work on the actual allocations on netbox T374712, I may ping you or topranks later
[09:28:21] T374712: netbox: create IPv6 entries for Cloud VPS - https://phabricator.wikimedia.org/T374712
[09:28:35] sure, anytime
[10:10:02] dhinus: I created this a few days ago but forgot to send it your way https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/45 wdyt? I haven't even tested it, wanted to collect your opinion first
[10:10:18] arturo: I will have a look!
[10:10:33] thanks
[10:11:41] hmmm... I have the suspicion that logrotate and uwsgi on tool fourohfour don't mix well: it rotates the log, but the new log has the same size, just filled up with 0s, and uwsgi just puts more logs after
[10:11:51] anyhow, will continue after lunch
[10:12:13] maybe logrotate should be restarting uwsgi?
[11:27:12] uwsgi is running inside webservice, so that would mean logrotate needs webservice access (for which we have no API yet xd)
[11:27:48] ah, I see
[12:06:52] hmm... maintain-kubeusers does not refresh the user certs inside lima-kilo?
[12:07:21] aaahh, the ldap population thingie
[13:03:55] arturo: the MR looks good, I tested it locally and it installs the new provider version
[13:04:23] I never remember why we don't just pull it from the internet?
[13:06:28] I'd say the main reason is in case the internet stops working (ex. they remove the old version)
[13:06:53] wouldn't it be true for any python dependency etc.?
[13:06:55] then local caching + security
[13:07:04] it would, and it kinda is
[13:07:24] (ex. we usually use debian packages when deploying stuff on hardware)
[13:07:28] I have a vague memory there was an _additional_ reason in this case, but I'm not sure
[13:09:53] maybe
[13:14:02] I'm not a big fan of having a 7MB file in the Git repo, especially because it's a file that needs regular updates
[13:15:15] yep, the repo might not be the best place for it
[13:34:22] I found the previous discussion about this: https://wm-bot.wmcloud.org/logs/%23wikimedia-cloud-admin/20240723.txt
[13:36:31] not sure it's much clearer?
[13:37:32] not really, no :)
[13:39:16] "maybe taavi wanted to make sure we did not use any non-free provider" < I think that's what I was vaguely remembering, not just caching but controlling that we don't install non-free deps without noticing
[13:46:15] sounds reasonable, yep
[13:56:00] anyone playing with cloudlb? haproxy seems to have restarted
[13:57:37] that's me
[13:57:47] pooling/depooling clouddbs
[13:58:04] causes haproxy to restart... that alert is a bit too sensitive maybe?
[13:58:18] it seems to trigger every time you change the config
[14:07:25] yep, it's ok, just wondering
[14:07:38] maybe the cookbook should add a silence with a note
[14:08:12] I'm not using a cookbook atm, it's just confctl
[14:08:24] ah, okok, scrap that then
[14:08:41] but yes, this should all be automated in a cookbook, and that cookbook could add the silence
[14:08:55] it's still annoying that a normal usage of confctl causes an alert
[16:36:27] puppet was failing on cloudcontrol2006-dev:9100
[16:36:48] the error was in the /srv/tofu-infra checkout
[16:37:37] "git checkout main" fixed it
[16:38:11] the error was: "fatal: ambiguous object name: 'remotes/origin/HEAD'"
[16:39:19] weird, maybe another race collision between the cookbook and the cron?
[16:44:58] not sure
[16:45:27] the alert is not clearing, but I think it's just the prom exporter which has a lag
[16:45:36] maybe it updates only on the systemd timer?
[16:47:49] yep, running "systemctl start puppet-agent-timer.service" updated the prom stat
[16:49:05] or maybe it was just a coincidence, because the systemd unit does not contain anything special
[16:51:14] ha, it's "prometheus-puppet-agent-stats.service", which contains "After=puppet-agent-timer.service"
[16:51:44] so yes, the prometheus stats do not update if you run puppet manually, but only if you run the systemd unit
[16:51:52] or wait for the next scheduled run
[16:52:41] the cause of the alert remains unclear, I will not investigate now
[16:52:45] * dhinus offline
[23:54:25] cteam: does anyone know why there are 13 tools-k8s-worker-nfs-* nodes currently marked as SchedulingDisabled?
[23:54:40] Stashbot cannot find a node to run from
[23:59:01] Raymond_Ndibe: are you still running cookbooks against the toolforge k8s cluster? It looks like too many nodes are currently depooled for things to work correctly.
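
A quick way to confirm what Stashbot is running into: list the cordoned workers and, once it is clear no cookbook is still draining them, repool them. This is a hedged sketch assuming kubectl access to the Toolforge cluster; the node name in the uncordon line is a placeholder.

    # list nodes currently cordoned (SchedulingDisabled)
    kubectl get nodes | grep SchedulingDisabled
    # repool a node once it is confirmed nothing is still draining it
    # ("tools-k8s-worker-nfs-NN" is a placeholder name)
    kubectl uncordon tools-k8s-worker-nfs-NN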
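
On the fourohfour logrotate/uwsgi suspicion above (10:11): a log whose head is filled with 0s after rotation is consistent with uwsgi keeping its original file descriptor, and therefore its write offset, open across the rotation. A hedged way to check, assuming the uwsgi process is visible from the worker node; the pgrep pattern is illustrative.

    # find the tool's uwsgi process and list the log file descriptors it holds open;
    # /proc/<pid>/fdinfo/<fd> then shows the "pos:" write offset kept across the rotation
    UWSGI_PID=$(pgrep -f 'uwsgi.*fourohfour' | head -n 1)
    ls -l /proc/"$UWSGI_PID"/fd | grep -i log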
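
On the tofu-infra provider discussion above (13:04 onwards): one hedged alternative to committing the 7MB binary is to pre-seed a local filesystem mirror that tofu can install providers from without network access. The mirror path is illustrative, and this assumes the "tofu providers mirror" subcommand is available in the version in use.

    # from the tofu-infra checkout, download the providers the configuration needs
    # into a local directory; a filesystem_mirror block in the tofu CLI configuration
    # can then point at that directory instead of the registry
    cd /srv/tofu-infra && tofu providers mirror /srv/tofu-provider-mirror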
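
On the haproxy restart alerts (13:56 to 14:08): a hedged sketch of what a future cookbook could wrap around a confctl pool/depool. The clouddb host name and the alert name are placeholders, and it assumes confctl (conftool) and amtool are available on the host where this runs.

    # depool / repool a clouddb backend ("clouddb1234.eqiad.wmnet" is a placeholder)
    confctl select 'name=clouddb1234.eqiad.wmnet' set/pooled=no
    confctl select 'name=clouddb1234.eqiad.wmnet' set/pooled=yes
    # a cookbook could pre-silence the restart alert around the config change
    # ("HAProxyRestarted" is a placeholder alert name)
    amtool silence add alertname=HAProxyRestarted --duration=30m --comment="clouddb repool"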
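
On the /srv/tofu-infra checkout error (16:38, "fatal: ambiguous object name: 'remotes/origin/HEAD'"): a sketch of the repair. "git checkout main" as done above works; recreating the remote HEAD symref is another option.

    # recreate the missing origin/HEAD symref from the remote's default branch
    git -C /srv/tofu-infra remote set-head origin --auto
    # or, as was done above, just point the checkout back at main
    git -C /srv/tofu-infra checkout main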
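
On the lagging puppet exporter stats (16:45 to 16:52): the unit ordering can be inspected directly with standard systemctl subcommands; per the observation above, running the agent through the timer's service also refreshes the exporter.

    # show the exporter unit, including its After=puppet-agent-timer.service ordering
    systemctl cat prometheus-puppet-agent-stats.service
    # run the agent the same way the timer does, so the exporter stats refresh too
    systemctl start puppet-agent-timer.service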