[06:28:25] <volans>	 aputhin, cciufo: FYI it seems there is some interest in improving the documentation of file management on toolforge ( https://phabricator.wikimedia.org/T347753#12016072 ), it could be a good occasion to steer the documentation towards a saner use of NFS as we discussed ;)
[06:30:46] <volans>	 [clinic duty] a new project just to isolate a VM seems a bit too drastic ( https://phabricator.wikimedia.org/T429032 ), should I just suggest using the existing project and isolating the VM either in openstack or with a firewall? do we have a suggested way?
[07:37:22] <dhinus>	 I think historically a few projects created a separate staging project on cloudvps (e.g. catalyst-dev, gitlab-runners-staging), but I'm not sure if it was suggested by us or not
[07:56:51] <dcaro>	 I don't remember suggesting it, though it would make sense, especially if they use terraform or similar to manage the infra
[07:57:17] <godog>	 wb dcaro 
[07:58:10] <dcaro>	 thanks :)
[09:46:36] <godog>	 I'm about to nuke the checker instance, did I get it right that I should merge/deploy this first and then nuke the instance?
[09:46:39] <godog>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/105
[09:46:48] <taavi>	 sgtm
[09:47:37] <godog>	 ack thank you
[09:58:32] <taavi>	 reminder that I will be starting the k8s upgrade in about an hour
[10:02:23] <godog>	 good luck
[10:16:25] <dcaro>	 taavi: do you need any support/help? (besides just being around)
[10:25:28] <taavi>	 I think I'll manage copy-pasting the cookbook commands :P
[11:01:35] <taavi>	 starting prepare_upgrade which starts with the slooow test suite
[11:19:48] <taavi>	 tests finally done, the cookbook is moving on to the next steps
[11:22:39] <taavi>	 upgrading first control node
[11:27:19] <volans>	 ack \o/
[11:33:11] <taavi>	 T429157
[11:33:11] <stashbot>	 T429157: toolforge worker upgrade cookbook sometimes fails when uncordoning - https://phabricator.wikimedia.org/T429157
[11:33:58] <volans>	 did it output anything useful in the console?
[11:34:59] <taavi>	 nothing more useful than 'Cumit execution failed'
[11:35:04] <taavi>	 s/Cumit/Cumin/
[11:36:51] <volans>	 ah print_output=False :/
[11:37:27] <taavi>	 also,
[11:37:28] <taavi>	 raise KubernetesNodeNotFound("Unable to find node {node_hostname} in the cluster.")
[11:37:31] <taavi>	 that's not an f-string
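(editor's note) A minimal sketch of the f-string problem taavi points out above. Only the message text and the exception name come from the log; the exception class body, the two helper functions, and the hostname in the usage note are hypothetical stand-ins, not the cookbook's actual code.

```python
class KubernetesNodeNotFound(Exception):
    """Hypothetical stand-in for the exception raised by the cookbook."""


def message_without_f_prefix(node_hostname: str) -> str:
    # Plain string literal: the braces are kept verbatim, so the
    # hostname never makes it into the error message.
    return "Unable to find node {node_hostname} in the cluster."


def message_with_f_prefix(node_hostname: str) -> str:
    # f-string: {node_hostname} is replaced by the argument's value.
    return f"Unable to find node {node_hostname} in the cluster."


def raise_node_not_found(node_hostname: str) -> None:
    # What the raise could look like once the f prefix is added.
    raise KubernetesNodeNotFound(message_with_f_prefix(node_hostname))
```

With a made-up hostname such as "example-worker-1", the first helper returns the text with the literal "{node_hostname}" placeholder, while the second returns "Unable to find node example-worker-1 in the cluster."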
[11:37:59] <volans>	 it's the get nodes call that fails
[11:38:08] <volans>	 "kubectl", "get", "nodes", "--output=json", selector_cli
[11:47:35] <taavi>	 error when evicting pods/"infra-tracing-loki-backend-2" -n "infra-tracing-loki": global timeout reached: 1m0s
[11:47:37] <taavi>	 hmh
[11:48:48] <volans>	 :'( something I did?
[11:49:06] <taavi>	 not sure, may be just bad luck and the cookbook not being able to handle that
[11:49:27] <volans>	 did you pass --good-luck to it
[11:49:28] <volans>	 ?
[11:49:48] <volans>	 without it the risk of bad luck is exponentially higher :D
[11:49:50] <taavi>	 ah, does it default to bad luck?
[11:49:57] <volans>	 ofc :D
[12:00:19] <dcaro>	 lol
[12:01:38] <volans>	 running good luck by default costs too many tokens, hence the random luck default :-P
[12:18:14] <taavi>	 worker upgrades are all done
[12:20:35] <dcaro>	 🎉
[12:24:04] <andrewbogott>	 taavi: is that the end of your upgrade window? I have ceph upgrades to do but do not want to overlap
[12:25:06] <taavi>	 andrewbogott: still a bit of cleanup, give me 15'
[12:25:32] <andrewbogott>	 Ok! No rush
[12:42:24] <taavi>	 hm, webservice logs tests fail with:
[12:42:25] <taavi>	    requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/tool-automated-toolforge-tests/pods/automated-toolforge-tests-6dc7c5545b-kfzhq/log?container=webservice&follow=False&pretty=True&timestamps=True
[12:42:56] <taavi>	 which I cannot repro, running 'toolforge webservice logs' manually works fine
[12:43:48] <taavi>	 ah, seems to already be filed as T413874
[12:43:49] <stashbot>	 T413874: toolforge jobs logs: requests.exceptions.HTTPError: 400 Client Error: Bad Request for url - https://phabricator.wikimedia.org/T413874
[12:43:59] <taavi>	 andrewbogott: I'm all done
[12:44:06] <andrewbogott>	 ok! ty
[12:46:57] <andrewbogott>	 upgrading mons...
[12:51:09] <dcaro>	 taavi: I got that error locally but a rerun passed, I can check
[12:51:51] <dcaro>	 might be a delay of the logs reaching loki? (and loki returning with bad request instead of empty 200)
[12:56:55] <dcaro>	 it's doing the request directly to k8s, not to the logs api
[13:45:01] <taavi>	 godog: it can't be just me really annoyed at your alerts.git patches having inconsistent 'wmcs'/'team-wmcs' commit prefixes :/
[13:45:20] <godog>	 taavi: hahaha! fair enough, fixing
[13:45:53] <godog>	 {{done}} 
[13:46:34] <volans>	 lol
[13:47:14] <godog>	 I didn't mean to trigger anyone, FWIW
[13:48:19] <taavi>	 hey I probably wouldn't notice if gerrit wasn't showing both of them on top of each other in the stack, so it's your own making really
[13:50:45] <godog>	 100% is
[13:51:14] <godog>	 I'm just tempted to remove the 'key' label, not sure why it is there tbh
[13:51:27] <godog>	 remove it from the metric altogether
[13:52:07] <taavi>	 +1
[17:32:50] * dcaro off, cya tomorrow!