[10:25:05] toolsbeta is still returning 503 for all tools, I opened T426304 and I'm investigating
[10:25:05] T426304: toolsbeta tools are not reachable - https://phabricator.wikimedia.org/T426304
[13:36:21] ty dhinus! I got as far as 'istio is clearly broken' but assumed that was some kind of istio test in progress. Given that it was just a scheduling thing, would it have worked just as well for me to kill the istio pods and let them reschedule?
[13:36:35] * andrewbogott also wonders why the test tool is still in crashloopbackoff and that isn't producing alerts
[13:39:49] andrewbogott: yes that's my understanding
[13:40:04] (re: restarting the pods)
[13:40:12] test tool, will have a look
[13:40:40] I'm also opening a follow-up task to prevent this from repeating in the future
[13:40:43] ok. That's something I should be bolder about (just killing pods and seeing if things heal properly)
[13:40:59] I also did not think about it, tbh
[13:41:40] the alert was there, pointing to the runbook, but there were 5 or 6 toolsbeta alerts firing at the same time
[13:42:24] there was an alert firing, with a runbook saying 'please kill the pods', I'm not sure how I can make it more clear that it's fine to do that :P
[13:43:11] taavi, my mistake was fixating on the lower-level 'this tool is failing' alert rather than actually reading the istio runbook :)
[13:43:37] obviously in retrospect all the other alerts were just secondary effects of the istio thing
[13:44:25] hmm, do we need some way to prioritize "cause" alerts to be more visible over "symptom" alerts?
[13:44:48] my mistake was not clicking the runbook link for all the firing alerts, and yes I thought of prioritizing that alert instead of the symptom ones...
[13:44:57] but I think prioritizing symptoms is usually the best practice
[13:45:08] maybe, that that way might be "hey, when troubleshooting don't forget to prioritize cause alerts over symptom alerts!"
[13:45:58] we could link from one runbook to another...
[13:46:44] trying to list all the possible causes of each symptom seems like it's doomed to get incomplete or outdated
[13:46:51] I'm also thinking if this is something we expect to happen every time we deploy istio-gateway, we should fix the deployment process :)
[13:47:19] either adding some logic to the cookbook... or maybe increasing the size of the worker?
[13:48:11] well, the strategy config I added (and Raymond_Ndibe fixed) was supposed to do that
[13:48:34] but I don't know the exact sequence on what got deployed and so how we got in this state
[13:48:35] ah I see! I didn't connect the two things, but it makes sense now!
[13:49:21] anyway, the current pod sizing is based on what ingress-nginx had, but istio is significantly more efficient, we could probably tune it much smaller so that a single worker could house two pods during replacements
[13:51:00] I'm gonna open a task anyway (I've already half written it) and hopefully we can close it soon if we see that we can now deploy without causing misplaced pods
[13:51:36] if you're patient, i have some very basic questions about this. There needs to be an istio pod on every worker, right? to allow network access for pods on that worker?
[13:51:46] what?
[13:51:56] see, this is why it's a basic question
[13:52:15] can you tell me how the issue of istio pods being scheduled in the wrong places caused everything to fail?
[13:52:28] or is even ^ a wrong assumption?
[13:54:07] istio listens on a k8s nodeport service, and haproxy has the tools-k8s-gateway workers as its targets
[13:55:08] if the pods run on some other workers than on the tools-k8s-gateway workers, that traffic won't make it to pods because for various reasons the istio service is in 'local' traffic mode, so the nodeport service will either forward traffic to a pod on the local worker or fail entirely
[13:55:35] so if the istio pods get scheduled to any other node than those tools-k8s-gateway ones, things will fail
[13:57:13] I created T426321
[13:57:13] T426321: [istio-gateway] Deploying the component can cause an outage - https://phabricator.wikimedia.org/T426321
[13:57:49] so running in that 'local' mode is new to istio, we could run ingress-gateway in the 'external' mode which meant that kube-proxy would've forwarded the traffic to some other node if the service did not run on that node
[13:57:50] ok, makes sense! So now an even more naive question: isn't the idea of daemonsets that you can say "always make sure there's 1 pod X on each worker Y"?
[13:58:06] istio does not run as a daemonset
[13:58:14] i'm not sure if we could do that
[13:58:20] huh
[13:58:48] because 'usually' you'd just run it on any worker and the cloud-provided load balancer plane would take care of routing the traffic to the right worker
[13:59:02] our special workers are a hack since we need to specify the target workers statically in the haproxy config
[13:59:10] right, the thing about having gateway nodes isn't... yes, as you say
[14:00:05] but the other thing worth doing pretty much instantly is to change the 'run on these workers only' from a 'do this if possible' to 'refuse scheduling otherwise' condition, since there is no point in running the pods on any others in local mode (unlike ingress-nginx where it was a nice fallback)
[14:00:46] anyway, /me brb to eat something before the meeting
[14:01:10] OK, I think you have explained my main question, thank you!
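
The failure mode explained above can be sketched as a small Python model. This is a toy illustration, not the actual Toolforge or Kubernetes code. It assumes that the 'local' mode refers to the Kubernetes Service setting `externalTrafficPolicy: Local` (the alternative value being `Cluster`), and that 'do this if possible' versus 'refuse scheduling otherwise' corresponds to preferred versus required node affinity (`preferredDuringSchedulingIgnoredDuringExecution` versus `requiredDuringSchedulingIgnoredDuringExecution`); the chat does not name either mechanism. The node names and the `route` and `schedule` helpers are hypothetical.

```python
# Toy model only: node names and helper functions are hypothetical.
GATEWAY_NODES = {"gateway-1", "gateway-2"}  # stand-ins for the haproxy targets


def route(entry_node, pod_nodes, policy):
    """Return the node that serves a request arriving on entry_node's NodePort.

    Returns None when the request is dropped.
    """
    if entry_node in pod_nodes:
        return entry_node  # a pod on the receiving node always works
    if policy == "Cluster" and pod_nodes:
        # In Cluster mode the request may be forwarded to a pod on another node.
        return sorted(pod_nodes)[0]
    return None  # Local mode with no pod on the receiving node: traffic fails


def schedule(free_nodes, wanted_nodes, required):
    """Pick a node for one pod, or None if the pod has to stay pending."""
    for node in free_nodes:
        if node in wanted_nodes:
            return node
    if required:
        return None  # 'refuse scheduling otherwise'
    return free_nodes[0] if free_nodes else None  # 'do this if possible'


if __name__ == "__main__":
    # Gateway nodes are full during a deploy, only an ordinary worker has room.
    soft = schedule(["worker-5"], GATEWAY_NODES, required=False)
    hard = schedule(["worker-5"], GATEWAY_NODES, required=True)
    print("soft affinity places the pod on:", soft)
    print("hard affinity places the pod on:", hard)
    # haproxy only ever sends traffic to the gateway nodes.
    print("Local:", route("gateway-1", {soft}, "Local"))
    print("Cluster:", route("gateway-1", {soft}, "Cluster"))
```

In this model the soft rule lets the pod land on an ordinary worker, where Local-mode traffic sent to a gateway node cannot reach it. The hard rule leaves the pod pending instead, which is visible and does not silently break routing.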
[16:11:29] could someone double check that https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1258 matches with the latest on the task?
[16:12:47] checking
[16:15:26] all good
[16:19:53] godog: one thing I'm not a huge fan of in the new quota increase tool is how easy it would be to replace the quota increase command it spits out with something harmful
[16:24:02] the web-scale way would of course be: /bin/bash -c "$(curl -fsSL https://cloudvps-quota.toolforge.org/bump_my_quota.sh)"
[16:25:57] missing sudo there
[16:26:23] why bother when the script can sudo itself when needed
[16:27:03] taavi: I'm not sure I understand, you mean the user edits the command before submitting the task?
[16:32:13] anyways I got to run, will read backlog tomorrow