[07:25:29] greetings
[09:25:39] morning
[09:36:33] hello
[09:38:54] hmm, the stats about the running pods for paws don't seem correct https://grafana-rw.wmcloud.org/d/eV0M3UyVk/paws-usage-statistics
[09:40:47] dcaro: I've updated the secret in the toolsbeta puppetserver, can I freely re-deploy infra-tracing there to see if that fixes the auth issue?
[09:43:06] yep, that's ok
[09:43:34] got an email alert about maintain-dbusers, looking (don't see it in karma though)
[09:44:15] it's having errors
[09:45:08] https://usercontent.irccloud-cdn.com/file/po7c4I9D/image.png
[09:45:35] Nov 20 08:39:32 cloudcontrol1007 maintain-dbusers[269845]: DEBUG [urllib3.connectionpool._make_request:544] https://nfs.svc.toolforge.org:443 "POST /v1/write-replica-cnf HTTP/1.1" 500 265
[09:45:44] from before, not happening anymore though
[09:52:05] hmm.... I think there's a bug somewhere
[09:52:06] Nov 18 16:12:26 tools-nfs-3 uwsgi-toolsdb-replica-cnf-web[1014516]: RuntimeError: Unable to get kubeconfig for user 'tools.commonsconferenceuploader':
[09:52:24] from replica-cnf, it should not have the `tools.` prefix
[10:13:04] what alerts dashboard(s) are you looking at for toolforge alerts? I'd guess https://prometheus-alerts.wmcloud.org/?q=project%3D~%5Etools.%2A ? for tools + toolsbeta that is
[10:13:20] plus https://alerts.wikimedia.org/?q=team%3Dwmcs of course for cloud vps alerts
[10:13:28] I use that last one yep
[10:13:59] ack thank you, me too
[10:14:44] I think everything from the first link should also show on the second link, which is also what I use
[10:15:11] but alerts for projects other than tools/toolsbeta will only be visible on prometheus-alerts.wmcloud.org
[10:15:15] yep, silences included
[10:16:23] true ok, alerts.wikimedia.org is indeed hooked up with prometheus-alerts.wmcloud.org via @cluster=wmcloud.org
[10:17:31] however toolforge alerts don't have team=wmcs, so something like https://alerts.wikimedia.org/?q=%40cluster%3Dwmcloud.org&q=project%3D~%5Etools.%2A
[10:19:17] I think most (all?) of them should have team=wmcs, but none is firing atm
[10:20:25] ohhhh... we have two paws clusters running side by side
[10:20:51] two months old, that's why the stats don't match
[10:21:02] dhinus: you are right, my bad, toolforge alerts do have team=wmcs, nevermind
[10:21:26] anyone deployed a paws cluster not long ago?
[10:21:31] (well, ~2 months xd)
[10:21:51] https://github.com/toolforge/paws/pull/498
[10:21:54] godog: np, it's actually a good question if it's "most" or "all", I don't remember if the team tag is manually added to each alert, or if there's some auto-tagging based on project
[10:22:15] it's added per project iirc
[10:22:51] I think so yes, in metricsinfra iirc
[10:29:02] taavi: dcaro: good catch about the double PAWS, I think a.ndrew followed https://wikitech.wikimedia.org/wiki/PAWS/Admin#Blue_Green_Deployment,_creating_a_new_cluster but forgot the last item "Removing the old cluster"
[10:29:27] admittedly the docs say "perhaps a few days later", which makes it very easy to forget about
[10:29:39] is the double cluster causing any issues? all traffic should be going to the new one
[10:31:57] prometheus was pointing to the old cluster, so I was looking at some stats and getting very confused
[10:32:10] changed the ip, let me see if that's in the doc (changing the prometheus endpoint)
[10:33:03] changed it where?
[10:34:42] webproxy in horizon
[10:34:55] I'll update the wiki (it does not mention it, and it's not in the screenshot there)
[10:35:10] ah that makes sense!
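For the replica-cnf error at 09:52: the account name arrives with the Toolforge project prefix, while the kubeconfig lookup apparently expects the bare tool name. A minimal sketch of that normalization, with hypothetical function and constant names; the real fix could just as well live in maintain-dbusers before the request is sent.

```python
TOOLS_PREFIX = "tools."

def to_tool_name(account: str) -> str:
    """Drop the project prefix so lookups use the bare tool name,
    e.g. 'tools.commonsconferenceuploader' -> 'commonsconferenceuploader'."""
    return account[len(TOOLS_PREFIX):] if account.startswith(TOOLS_PREFIX) else account

print(to_tool_name("tools.commonsconferenceuploader"))  # commonsconferenceuploader
```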
[10:35:20] hmm, is there a reason not to manage where those proxies point to via tofu?
[10:35:27] time?
[10:35:34] not sure, I don't think so
[10:35:42] probably they were added before tofu, and never moved there
[10:45:46] hmm... can't upload a new version of the screenshot
[10:45:55] https://usercontent.irccloud-cdn.com/file/nFlUZbwo/image.png
[10:46:27] dhinus: ^ can you update https://commons.wikimedia.org/wiki/File:Paws-web-proxies.png ? it only lets the same user that uploaded it update it
[10:48:55] I think I created that one in commons by mistake, you should probably create a new image in wikitech
[10:49:25] iirc the wikitech interface is slightly confusing and lets you upload files directly, but also has a button to upload in commons
[10:54:54] let me try
[10:55:11] this should put it in wikitech without publishing to commons https://wikitech.wikimedia.org/wiki/Special:Upload
[10:56:50] there's a button when editing too, used that
[10:56:55] 👍
[11:05:18] I suspect that there's some race condition between maintain-dbusers and maintain-kubeusers, as the errors I saw were about replica_cnf_api not being able to read the kubeconfig cert to create the secrets in envvars, but from my testing, it's working
[11:31:06] ok, now that the toolsbeta deployment of infra-tracing-loki from !1040 works I think I'm ready to deploy everything to tools. Is there any concern?
[11:39:59] this instead is from a chat with david yesterday, to speed up the deploy of the logging component:
[11:40:02] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1086
[11:40:22] if I understood the logic correctly (I might as well have misunderstood everything :D )
[13:25:16] no comments == no concerns ? :-P
[13:50:14] volans: LGTM, in a meeting
[13:50:37] ack thanks!
[14:03:52] +1d
[14:04:37] <3 thanks, I'll test it by deploying to toolsbeta and then merge it, as it doesn't affect prod in any way, right?
[14:05:07] yep correct
[14:39:18] * volans starting to deploy registry-admission first to tools
[14:41:31] 👍 break a leg! xd
[14:42:02] :D
[14:48:15] dcaro: the leg did break, one test failed
[14:48:15] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1040#note_176106
[14:51:48] volans: looks like a flake :/
[14:52:12] can you retry? if it happens again I'll investigate, but given that the other tests passed the registry is probably doing its work
[14:52:28] not nice though
[14:52:51] ok, the fact that it says
[14:52:51] Test failed, creating skip file /tmp/tools.automated-toolforge-tests.bats.skip
[14:52:52] (not nice if it's flaky I mean, you did all good)
[14:52:59] means that it will skip it next time?
[14:53:05] I see: 36 tests, 1 failure, 17 skipped in 335 seconds
[14:53:11] no, it flags the run as failed for the next tests to skip running
[14:53:16] yep
[15:03:05] timed out again
[15:03:06] ERROR: timed out 120 seconds waiting for job 'check-test-1157' to complete:
[15:03:21] hm... that's not good, let's check
[15:04:28] kubectl get all shows the pod there
[15:06:09] mmmh sure?
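On the suspected race at 11:05 between maintain-dbusers and maintain-kubeusers: one way to tolerate it is to retry the kubeconfig read briefly instead of failing the write-replica-cnf request with a 500. A minimal sketch under that assumption; the path and helper names are hypothetical, not the real replica-cnf code.

```python
import time
from pathlib import Path

class KubeconfigNotReady(Exception):
    """The per-tool kubeconfig has not been provisioned yet."""

def load_kubeconfig(tool: str) -> str:
    # Hypothetical location: maintain-kubeusers writes a kubeconfig into the
    # tool's home; the real replica-cnf API may read it differently.
    path = Path(f"/data/project/{tool}/.kube/config")
    if not path.exists():
        raise KubeconfigNotReady(tool)
    return path.read_text()

def load_kubeconfig_with_retry(tool: str, attempts: int = 5, delay: float = 2.0) -> str:
    """Retry a few times with a linear backoff before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return load_kubeconfig(tool)
        except KubeconfigNotReady:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)
    raise KubeconfigNotReady(tool)  # unreachable, keeps the return type honest
```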
[15:06:59] I logged in as the tool that's used for the tests, `sudo become automated-toolforge-tests`
[15:07:15] (from my user account in login.toolforge.org), and there just `kubectl get all`
[15:07:16] ah we run the tests as a tool, got it
[15:07:35] it sshs as my account, then for the "/tool" suites, it becomes that tool yep
[15:07:48] got it
[15:07:48] (it's a different tool name in toolsbeta :/)
[15:08:12] ok so it's pod/test-1157-75d46f8849-6h6qn
[15:08:17] and it's still there because of the failed test
[15:08:20] anyhow, the check is trying to make sure that when you run a job with `--port ....` it does actually expose it, by running another job and trying to curl the one exposing the port
[15:08:23] yep
[15:08:44] most tests don't clean up right away, so if there's a failure you can debug a bit
[15:08:55] (they clean up on start)
[15:10:17] $ curl localhost:1234/status
[15:10:17] OK
[15:10:24] so it's listening
[15:10:49] from another job it seems it fails
[15:10:51] tools.automated-toolforge-tests@tools-bastion-15:~$ toolforge jobs run --image python3.13 --command 'curl -v http://test-1157.tool-automated-toolforge-tests.svc.tools.local:1234/status' checkport
[15:11:08] https://www.irccloud.com/pastebin/dbAmZk5n/
[15:11:12] 48s and counting
[15:13:26] network policy?
[15:13:36] * volans is going around with kubectl describe
[15:15:31] did you deploy any network policies?
[15:15:39] nope
[15:15:42] (that would be in the infra-tracing component, no?)
[15:15:49] exactly
[15:16:00] I can re-run with main
[15:16:01] just in case
[15:16:15] sure, weird though
[15:16:35] right now or you want to debug anything more?
[15:16:50] feel free to deploy, I'll look around a bit
[15:17:07] https://sample-complex-app.toolforge.org/ still has access to the backend (so things that were running are not broken, let me restart the pods there, see if they break)
[15:17:42] * volans deploying
[15:18:05] 👍
[15:19:04] recreated both the webservice and the backend and they still work
[15:19:35] so probably not a widespread issue, might be some nodes?
[15:20:02] let's see if main passes or not
[15:22:48] culprit test running now
[15:23:40] seeing it, seems stuck
[15:23:41] :/
[15:23:48] same failure
[15:23:53] so at least it's not me :D
[15:24:25] both pods were running in tools-k8s-worker-nfs-34, looking
[15:26:14] can you try again? I've cordoned that node, just to make sure it's not the node
[15:26:33] dcaro: is there a shortcut to run only those tests?
[15:26:51] ah --filter-tags
[15:27:16] the tag would be 'jobs-api' in this case?
[15:28:02] yep
[15:28:22] * volans doing
[15:28:35] I forced it to recreate the pods on a different node and it seems to be working
[15:28:45] https://www.irccloud.com/pastebin/hnX8oPPF/
[15:28:47] :/
[15:29:17] so bad node?
[15:29:28] probably, it's still there cordoned for debugging
[15:29:33] k
[15:30:05] no idea what the issue is though, networking errors make me nervous xd
[15:30:39] always blame netops ;)
[15:31:23] damn cloudflare :old_man_yells_at_cloud:
[15:31:29] lol
[15:31:53] ✓ run a continuous job with a port exposes port [30207]
[15:31:55] passed
[15:31:58] \o/
[15:32:19] now we have a mystery on our hands :), anyone want to debug network stuff?
[15:32:25] I'll re-deploy my branch next once finished
[15:32:34] 👍
[15:32:39] dcaro: should I get a diff for registry-admission with the cookbook?
[15:34:17] maybe, it changed the deployment object right?
[15:37:02] it did say
[15:37:02] Release "registry-admission" has been upgraded. Happy Helming!
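The check debugged between 15:08 and 15:31 runs one job with --port and a second job that curls it over the cluster-internal service name. A minimal sketch of that polling logic, using the service URL from the failed run above as an example; the real test is a bats suite driving `toolforge jobs run`, and a misbehaving node can make it hit the 120 s timeout exactly as seen here.

```python
import time
import urllib.error
import urllib.request

# Example URL taken from the failed run above; any job exposing a port gets a
# cluster-internal name of this shape.
URL = "http://test-1157.tool-automated-toolforge-tests.svc.tools.local:1234/status"

def wait_for_http(url: str, timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll the exposed port until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet, or a node/network problem as in this log
        time.sleep(interval)
    return False

if __name__ == "__main__":
    print("reachable" if wait_for_http(URL) else "timed out")
```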
[15:37:13] if that's what you meant :)
[15:37:44] ok finished main with tests passed, deploying loki-tracing again
[15:38:07] ack
[15:39:54] ahh, no, the registry had no diff in prod, only in lima-kilo/local
[15:40:02] k
[15:40:27] https://www.irccloud.com/pastebin/XTrJosNb/
[15:48:52] k
[15:49:07] ok, moving to logging then
[15:52:22] dcaro: got a diff, weird: https://phabricator.wikimedia.org/P85420
[15:52:39] totally unrelated to my changes, was something deployed and not merged?
[15:53:12] yep, that looks like the limit change I did some time ago
[15:53:16] give me one sec
[15:53:52] yep, I might have made the change by hand and forgot to deploy
[15:53:59] the value in prod is actually that same one
[15:54:17] was...
[15:54:18] :D
[15:54:23] https://www.irccloud.com/pastebin/iu6s6yAe/
[15:54:32] xd
[15:54:45] ack, it might want to restart all the alloy pods though, so it might take a bit
[15:55:13] but which limit is the good one?
[15:55:41] yep
[15:55:45] 400/4000
[15:56:08] helm keeps a copy of what it deployed, so if you manually change any resource, it does not know
[15:56:34] when deploying, helm does the diff against the copy it has, not the real resource
[15:56:34] Error: UPGRADE FAILED: context deadline exceeded
[15:56:38] yep :/
[15:56:42] that was expected
[15:56:44] because of the timeout?
[15:56:50] too many alloys to deploy
[15:56:51] got it
[15:57:00] yep
[15:57:06] root@tools-k8s-control-9:~# kubectl get pods -n alloy --sort-by=.metadata.creationTimestamp
[15:57:20] that will show you more or less the list of the ones it's restarting
[15:57:29] (~11/40 or so)
[15:57:32] shift+A in k9s :D
[15:57:50] once they are done, if I re-deploy logging will it not restart them again?
[15:57:51] yep :), way easier to remember!
[15:58:03] it should not, it uses the sha of the config to restart them
[15:58:05] because of the same checksum I presume
[15:58:06] (it was in the diff)
[15:58:08] yep
[15:58:21] ok, will wait and re-deploy so the tests run
[15:58:26] 👍
[16:42:18] FYI deploying logging again on tools now that alloy was restarted, running the tests now
[16:45:47] all good!
[16:55:18] * volans starting the deploy of the infra-tracing
[16:56:45] taavi: here's an example:
[16:56:48] root@cloudcontrol2005-dev:~# openstack server migrate --live-migration 8a547552-3fee-4f23-9e6e-51de607da5c3
[16:57:26] 2025-11-20 16:56:09.175 605928 ERROR nova.virt.libvirt.driver [None req-4aee42c8-d83a-41b2-9255-08372900bd6e novaadmin admin - - default default] [instance: 8a547552-3fee-4f23-9e6e-51de607da5c3] Live Migration failure: unsupported configuration: Target network card MTU 1500 does not match source 1450: libvirt.libvirtError: unsupported configuration: Target network card MTU 1500 does not match source 1450
[16:57:27] 2025-11-20 16:56:09.612 605928 WARNING nova.compute.manager [req-a834c8fc-3c7e-4674-a3ff-814ec44bc25a req-0f3c58d1-d247-4222-a544-b7850ad9877d novaadmin admin - - default default] [instance: 8a547552-3fee-4f23-9e6e-51de607da5c3] Received unexpected event network-vif-plugged-692d0305-1fd7-4766-b8c5-caa2c43b5d05 for instance with vm_state active and task_state migrating.
[16:57:27] 2025-11-20 16:56:09.617 605928 ERROR nova.virt.libvirt.driver [None req-4aee42c8-d83a-41b2-9255-08372900bd6e novaadmin admin - - default default] [instance: 8a547552-3fee-4f23-9e6e-51de607da5c3] Migration operation has aborted
[16:57:40] Do you want more info than that, or want me to paste that in the task?
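The libvirt error above aborts the live migration because the guest NIC MTU on the source (1450) does not match what the target would give it (1500). A minimal pre-flight sketch that compares the MTU a hypervisor reports for a given interface against the expected value; the interface name is hypothetical and depends on how neutron wires the guest vNIC on each cloudvirt.

```python
from pathlib import Path

def interface_mtu(iface: str) -> int:
    """Read the MTU the kernel reports for a local interface."""
    return int(Path(f"/sys/class/net/{iface}/mtu").read_text().strip())

SOURCE_MTU = 1450   # value reported for the source in the error above
IFACE = "br-int"    # hypothetical: whichever bridge the guest vNIC attaches to

target_mtu = interface_mtu(IFACE)
if target_mtu != SOURCE_MTU:
    print(f"MTU mismatch: source {SOURCE_MTU} vs target {target_mtu}")
```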
[16:59:28] that's enough
[16:59:50] I think that every VM on cloudvirt2005-dev will exhibit the same behavior.
[17:00:11] (not because 2005-dev is special, but because I drained everything that didn't have the issue)
[17:01:13] deploy seems good, can push and get logs from loki, waiting for the tests to finish
[17:01:36] dcaro: after that the last step is to merge !1040?
[17:04:56] volans: yep :), if tests pass and such :shipit:
[17:05:07] yay!!!
[17:05:26] thanks a lot for all the time spent
[17:05:28] andrewbogott: argh, that is on codfw1dev where horizon is still broken :/
[17:05:42] I can find you one on eqiad1 if you'd rather!
[17:08:55] 97059d6f-4922-4747-8dea-bb5166a397b4 on cloudvirt1040
[17:09:27] andrewbogott: I think I bricked this codfw1dev VM (schedtest.testlabs.codfw1dev.wikimedia.cloud), I was trying to see if detaching and then re-attaching the neutron port would help, but apparently you can't detach that interface without deleting the port entirely
[17:09:42] do you have something important there that needs rescuing or can I just delete it entirely?
[17:09:52] totally fine to delete it.
[17:11:10] That other one (97059d6f-4922-4747-8dea-bb5166a397b4) is 'networktests-vxlan-ipv4only' so it can also probably be recreated if it doesn't survive the surgery
[17:12:10] Hey, someone other than me just asked the same question about Debian Ceph packages on slack and got an answer! "We are sorting out an issue with the debian packages. Hoping for it to be available soon (maybe a day or so? Would be my guess)"
[17:17:21] with my luck, the pipeline on main got a 503 while downloading shellcheck
[17:17:46] I've already tried to re-run it, I'll wait a bit for them to fix the issue upstream I guess
[17:23:20] andrewbogott: none of my immediate theories seem to work :(
[17:39:10] :(
[18:02:09] taavi: I haven't figured out what the pattern is... it's almost like the VMs get one free migration and then get stuck after that. Some of those I've definitely moved in the last few days but now they won't move again.
[18:22:53] hmmm... restarting that k8s worker was not enough
[18:22:59] now it can't reach the api server :/
[18:23:02] Nov 20 18:22:31 tools-k8s-worker-nfs-34 kubelet[869]: E1120 18:22:31.182599 869 kubelet_node_status.go:96] "Unable to register node with API server" err="Post \"https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/nodes\": dial tcp 172.16.18.169:6443: i/o timeout" node="tools-k8s-worker-nfs-34"
[18:23:23] I can ping it
[18:23:27] https://www.irccloud.com/pastebin/R6yfqzTo/
[18:27:08] it ended up getting rebooted again, and now it's up and running -\(o.o)/-
[18:27:58] okok, running stuff, now I'm clocking off
[18:28:03] cya!
[18:28:03] * dcaro off
[18:30:50] * dhinus off
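On the kubelet error at 18:23: ping answering while the API call times out points at the TCP path rather than basic reachability, so a quick probe of the actual port tells more than ICMP. A minimal sketch using the host and port from that log line; in the log the symptom went away after another reboot, so the root cause stays open.

```python
import socket

# Host and port taken from the kubelet error at 18:23.
API_HOST = "k8s.tools.eqiad1.wikimedia.cloud"
API_PORT = 6443

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try a plain TCP connect; ICMP ping can succeed while this still times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tcp_reachable(API_HOST, API_PORT))
```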