[06:28:25] aputhin, cciufo: FYI it seems there is some interest in improving the documentation of file management on toolforge ( https://phabricator.wikimedia.org/T347753#12016072 ), it could be a good occasion to steer the documentation towards a saner use of NFS as we discussed ;)
[06:30:46] [clinic duty] a new project just to isolate a vm seems a bit too drastic ( https://phabricator.wikimedia.org/T429032 ) should I just suggest to use the existing project and isolate the VM either in openstack or firewall? do we have a suggested way?
[07:37:22] I think historically a few projects created a separate staging project on cloudvps (e.g. catalyst-dev, gitlab-runners-staging), but I'm not sure if it was suggested by us or not
[07:56:51] I don't remember suggesting it, though it would make sense specially if they use terraform or similar to manage the infra
[07:57:17] wb dcaro
[07:58:10] thanks :)
[09:46:36] I'm about to nuke the checker instance, did I get it right I should merge/deploy this first then nuke the instance?
[09:46:39] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/105
[09:46:48] sgtm
[09:47:37] ack thank you
[09:58:32] reminder that I will be starting the k8s upgrade in about an hour
[10:02:23] good luck
[10:16:25] taavi: do you need any support/help? (besides just being around)
[10:25:28] I think I'll manage copy-pasting the cookbook commands :P
[11:01:35] starting prepare_upgrade which starts with the slooow test suite
[11:19:48] tests finally done, the cookbook is moving on to the next steps
[11:22:39] upgrading first control node
[11:27:19] ack \o/
[11:33:11] T429157
[11:33:11] T429157: toolforge worker upgrade cookbook sometimes fails when uncordoning - https://phabricator.wikimedia.org/T429157
[11:33:58] did it output anything useful in the console?
[11:34:59] nothing more useful than 'Cumit execution failed'
[11:35:04] s/Cumit/Cumin/
[11:36:51] ah print_output=False :/
[11:37:27] also,
[11:37:28] raise KubernetesNodeNotFound("Unable to find node {node_hostname} in the cluster.")
[11:37:31] that's not an f-string
[11:37:59] is the get nodes that fails
[11:38:08] "kubectl", "get", "nodes", "--output=json", selector_cli
[11:47:35] error when evicting pods/"infra-tracing-loki-backend-2" -n "infra-tracing-loki": global timeout reached: 1m0s
[11:47:37] hmh
[11:48:48] :'( something I did?
[11:49:06] not sure, may be just bad luck and the cookbook not being able to handle that
[11:49:27] did you pass --good-luck to it
[11:49:28] ?
[11:49:48] without the risk of bad lucky is exponentially higher :D
[11:49:50] ah, does it default to bad luck?
[11:49:57] ofc :D
[12:00:19] lol
[12:01:38] running good luck by default costs too many tokens, hence the random luck default :-P
[12:18:14] worker upgrades are all done
[12:20:35] 🎉
[12:24:04] taavi: is that the end of your upgrade window? I have ceph upgrades to do but do not want to overlap
[12:25:06] andrewbogott: still a bit of cleanup, give me 15'
[12:25:32] Ok! No rush
[12:42:24] hm, webservice logs tests fail with:
[12:42:25] requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/tool-automated-toolforge-tests/pods/automated-toolforge-tests-6dc7c5545b-kfzhq/log?container=webservice&follow=False&pretty=True&timestamps=True
[12:42:56] which I cannot repro, running 'toolforge webservice logs' manually works fine
[12:43:48] ah, seems to already be filed as T413874
[12:43:49] T413874: toolforge jobs logs: requests.exceptions.HTTPError: 400 Client Error: Bad Request for url - https://phabricator.wikimedia.org/T413874
[12:43:59] andrewbogott: I'm all done
[12:44:06] ok! ty
[12:46:57] upgrading mons...
[12:51:09] taavi: I got that error locally but a rerun passed, I can check
[12:51:51] might be a delay of the logs reaching loki? (and loki returning with bad request instead of empty 200)
[12:56:55] it's doing the request directly to k8s, not logs api x
[13:45:01] godog: it can't be just me really annoyed at your alerts.git patches having inconsistent 'wmcs'/'team-wmcs' commit prefixes :/
[13:45:20] taavi: hahaha! fair enough, fixing
[13:45:53] {{done}}
[13:46:34] lol
[13:47:14] I didn't mean to trigger anyone, FWIW
[13:48:19] hey I probably wouldn't notice if gerrit wasn't showing both of them on top of each other in the stack, so it's your own making really
[13:50:45] 100% is
[13:51:14] I'm just tempted to remove the 'key' label, not sure why it is there tbh
[13:51:27] remove it from the metric altogether
[13:52:07] +1
[17:32:50] * dcaro off, cya tomorrow!