[10:25:05] <dhinus>	 toolsbeta is still returning 503 for all tools, I opened T426304 and I'm investigating
[10:25:05] <stashbot>	 T426304: toolsbeta tools are not reachable - https://phabricator.wikimedia.org/T426304
[13:36:21] <andrewbogott>	 ty dhinus! I got as far as 'istio is clearly broken' but assumed that was some kind of istio test in progress. Given that it was just a scheduling thing, would it have worked just as well for me to kill the istio pods and let them reschedule?
[13:36:35] * andrewbogott also wonders why the test tool is still in crashloopbackoff and that isn't producing alerts
[13:39:49] <dhinus>	 andrewbogott: yes that's my understanding
[13:40:04] <dhinus>	 (re: restarting the pods)
[13:40:12] <dhinus>	 test tool, will have a look
[13:40:40] <dhinus>	 I'm also opening a follow-up task to prevent this from repeating in the future
[13:40:43] <andrewbogott>	 ok. That's something I should be bolder about (just killing pods and seeing if things heal properly)
[13:40:59] <dhinus>	 I also did not think about it, tbh
[13:41:40] <dhinus>	 the alert was there, pointing to the runbook, but there were 5 or 6 toolsbeta alerts firing at the same time
[13:42:24] <taavi>	 there was an alert firing, with a runbook saying 'please kill the pods', I'm not sure how I can make it more clear that it's fine to do that :P
[13:43:11] <andrewbogott>	 taavi, my mistake was fixating on the lower-level 'this tool is failing' alert rather than actually reading the istio runbook :)
[13:43:37] <andrewbogott>	 obviously in retrospect all the other alerts were just secondary effects of the istio thing
[13:44:25] <taavi>	 hmm, do we need some way to prioritize "cause" alerts to be more visible over "symptom" alerts?
[13:44:48] <dhinus>	 my mistake was not clicking the runbook link for all the firing alerts, and yes I thought of prioritizing that alert instead of the symptom ones...
[13:44:57] <dhinus>	 but I think prioritizing symptoms is usually the best practice
[13:45:08] <andrewbogott>	 maybe that way might be "hey, when troubleshooting don't forget to prioritize cause alerts over symptom alerts!"
[13:45:58] <dhinus>	 we could link from one runbook to another...
[13:46:44] <taavi>	 trying to list all the possible causes of each symptom seems like it's doomed to get incomplete or outdated
[13:46:51] <dhinus>	 I'm also thinking if this is something we expect to happen every time we deploy istio-gateway, we should fix the deployment process :)
[13:47:19] <dhinus>	 either adding some logic to the cookbook... or maybe increasing the size of the worker?
[13:48:11] <taavi>	 well, the strategy config I added (and Raymond_Ndibe fixed) was supposed to do that
[13:48:34] <taavi>	 but I don't know the exact sequence of what got deployed, and so how we got into this state
[13:48:35] <dhinus>	 ah I see! I didn't connect the two things, but it makes sense now!
[13:49:21] <taavi>	 anyway, the current pod sizing is based on what ingress-nginx had, but istio is significantly more efficient, we could probably tune it much smaller so that a single worker could house two pods during replacements
[13:51:00] <dhinus>	 I'm gonna open a task anyway (I've already half written it) and hopefully we can close it soon if we see that we can now deploy without causing misplaced pods
[13:51:36] <andrewbogott>	 if you're patient, i have some very basic questions about this. There needs to be an istio pod on every worker, right? to allow network access for pods on that worker?
[13:51:46] <taavi>	 what?
[13:51:56] <andrewbogott>	 see, this is why it's a basic question
[13:52:15] <andrewbogott>	 can you tell me how the issue of istio pods being scheduled in the wrong places caused everything to fail?
[13:52:28] <andrewbogott>	 or is even ^ a wrong assumption?
[13:54:07] <taavi>	 istio listens on a k8s nodeport service, and haproxy has the tools-k8s-gateway workers as its targets
[13:55:08] <taavi>	 if the pods run on some other workers than on the tools-k8s-gateway workers, that traffic won't make it to pods because for various reasons the istio service is in 'local' traffic mode, so the nodeport service will either forward traffic to a pod on the local worker or fail entirely
[13:55:35] <taavi>	 so if the istio pods get scheduled to any other node than those tools-k8s-gateway ones, things will fail
[13:57:13] <dhinus>	 I created T426321
[13:57:13] <stashbot>	 T426321: [istio-gateway] Deploying the component can cause an outage - https://phabricator.wikimedia.org/T426321
[13:57:49] <taavi>	 so running in that 'local' mode is new to istio, we could run ingress-gateway in the 'external' mode which meant that kube-proxy would've forwarded the traffic to some other node if the service did not run on that node
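The routing behaviour taavi describes above can be modelled in a few lines. This is a minimal sketch, assuming the 'local' and 'external' modes mentioned correspond to the Kubernetes Service setting `externalTrafficPolicy` (whose values are `Local` and `Cluster`). The node names and the `candidate_backends` function are hypothetical, and kube-proxy's actual load balancing is not modelled.

```python
def candidate_backends(entry_node: str, pod_nodes: set, policy: str) -> set:
    """Return the nodes whose pods may serve a request arriving at entry_node.

    entry_node: the worker haproxy sent the request to (a NodePort on that node).
    pod_nodes:  the workers that currently run a gateway pod.
    policy:     "Local" or "Cluster" (assumed to map to externalTrafficPolicy).
    """
    if policy == "Local":
        # Only a pod on the receiving node may answer; otherwise nothing can.
        return {entry_node} & set(pod_nodes)
    if policy == "Cluster":
        # Traffic may be forwarded to a pod on any node.
        return set(pod_nodes)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # Hypothetical layout: haproxy targets gateway-1, but the pod landed elsewhere.
    misplaced = {"worker-5"}
    for policy in ("Local", "Cluster"):
        backends = candidate_backends("gateway-1", misplaced, policy)
        print(policy, "->", sorted(backends) or "no backend, request fails")
```

With the pod on a non-gateway worker, `Local` leaves haproxy's target with no backend at all, while `Cluster` would still reach the misplaced pod.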
[13:57:50] <andrewbogott>	 ok, makes sense! So now an even more naive question: isn't the idea of daemonsets that you can say "always make sure there's 1 pod X on each worker Y"?
[13:58:06] <taavi>	 istio does not run as a daemonset
[13:58:14] <taavi>	 i'm not sure if we could do that
[13:58:20] <andrewbogott>	 huh
[13:58:48] <taavi>	 because 'usually' you'd just run it on any worker and the cloud-provided load balancer plane would take care of routing the traffic to the right worker
[13:59:02] <taavi>	 our special workers are a hack since we need to specify the target workers statically in the haproxy config
[13:59:10] <andrewbogott>	 right, the thing about having gateway nodes isn't... yes, as you say
[14:00:05] <taavi>	 but the other thing worth doing pretty much instantly is to change the 'run on these workers only' from a 'do this if possible' to 'refuse scheduling otherwise' condition, since there is no point in running the pods on any others in local mode (unlike ingress-nginx where it was a nice fallback)
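The difference between 'do this if possible' and 'refuse scheduling otherwise' can be illustrated with a toy scheduler. This is a sketch under assumptions, not the actual Toolforge configuration: it assumes the rule is expressed as something like Kubernetes node affinity, where `preferredDuringSchedulingIgnoredDuringExecution` is the soft form and `requiredDuringSchedulingIgnoredDuringExecution` the hard one. The `schedule` function, node names and CPU figures are hypothetical, and the rollout shown is only one plausible sequence, since the log does not establish what actually happened.

```python
from typing import Optional


def schedule(pod_cpu: float, nodes: dict, mode: str) -> Optional[str]:
    """Pick a node for a pod, or return None if the pod must stay pending.

    nodes: name -> {"gateway": bool, "free_cpu": float}
    mode:  "preferred" (soft rule) or "required" (hard rule)
    """
    if mode not in ("preferred", "required"):
        raise ValueError(f"unknown mode: {mode}")
    # Nodes with enough free CPU, in a stable order.
    fits = [name for name in sorted(nodes) if nodes[name]["free_cpu"] >= pod_cpu]
    on_gateway = [name for name in fits if nodes[name]["gateway"]]
    if on_gateway:
        return on_gateway[0]
    if mode == "preferred" and fits:
        # Soft rule: fall back to any node that fits.
        return fits[0]
    # Hard rule, or nothing fits anywhere: the pod is not scheduled.
    return None


if __name__ == "__main__":
    # During a replacement the old pod still occupies most of the gateway worker.
    nodes = {
        "gateway-1": {"gateway": True, "free_cpu": 0.5},
        "worker-1": {"gateway": False, "free_cpu": 4.0},
    }
    print(schedule(2.0, nodes, "preferred"))  # lands on the non-gateway worker
    print(schedule(2.0, nodes, "required"))   # stays pending instead
    print(schedule(0.25, nodes, "required"))  # a smaller pod fits next to the old one
```

The last call reflects the sizing point raised earlier in the conversation: a pod small enough to share the gateway worker with its predecessor never needs the fallback.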
[14:00:46] <taavi>	 anyway, /me brb to eat something before the meeting
[14:01:10] <andrewbogott>	 OK, I think you have explained my main question, thank you!
[16:11:29] <taavi>	 could someone double check that https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1258 matches with the latest on the task?
[16:12:47] <volans>	 checking
[16:15:26] <volans>	 all good
[16:19:53] <taavi>	 godog: one thing I'm not a huge fan of in the new quota increase tool is how easy it would be to replace the quota increase command it spits out with something harmful
[16:24:02] <dhinus>	 the web-scale way would of course be: /bin/bash -c "$(curl -fsSL https://cloudvps-quota.toolforge.org/bump_my_quota.sh)"
[16:25:57] <volans>	 missing sudo there
[16:26:23] <taavi>	 why bother when the script can sudo itself when needed
[16:27:03] <godog>	 taavi: I'm not sure I understand, you mean the user edits the command before submitting the task ?
[16:32:13] <godog>	 anyways I got to run, will read backlog tomorrow