[00:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[00:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[04:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:54:47] good morning
[07:58:32] morning! :)
[08:01:29] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538)
[08:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[08:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:58:37] regarding the storage access part, the egress seems to be enabled towards thanos
[09:01:54] o/
[09:02:56] for the storage-initializer there is a caveat: since it is an init-container, when it starts the istio gateway settings are not ready and without special settings, it blackholes all the traffic.
[09:03:31] we use special annotations for the storage initializer in Istio to allow TCP connections to the thanos-swift IPs
[09:04:16] IIUC with new versions of Kserve/Istio this is not needed, since the storage-initializer runs with a user that is automatically allowed to perform TCP conns bypassing the istio proxy
[09:04:52] in theory it should work on trixie too, but there may be some extra caveats
[09:09:07] the annotation is
[09:09:08] traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32
[09:10:57] it is likely calico not working in my opinion, so no connectivity for the pod
[09:12:23] I tried to kill/restart the calico pod, and the same for the aya pod
[09:12:34] I am wondering if the settings applied yesterday need a clean start
[09:15:04] 0/14 nodes are available: 1 Insufficient amd.com/gpu, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
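For readers following along: the excludeOutboundIPRanges annotation discussed above has to end up on the predictor pod so the init container can bypass the istio-proxy. Below is a minimal sketch of how that can look on an InferenceService; the isvc name is illustrative, and the assumption that metadata annotations propagate down to the predictor pods (rather than the exact wiring in our deployment-charts) is mine — only the annotation key and the thanos-swift VIPs come from the conversation.

```yaml
# Sketch only: annotation key and IPs are from the log above; the isvc name and
# the annotation propagation path are assumptions, not copied from the actual chart.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
  annotations:
    # The storage-initializer is an init container, so it starts before the
    # istio-proxy is ready; excluding the thanos-swift VIPs from sidecar
    # redirection lets it reach object storage directly.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32
```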
[09:18:37] I am wondering if the new gpu node labeller removed the amd.com/gpu label for the host, since it adds more specific ones
[09:19:24] so probably it needs something like amd.com/gpu.vram=64G in limits
[09:21:38] https://github.com/ROCm/gpu-operator/issues/151
[09:21:40] uff
[09:22:46] so amd.com/gpu was added by the gpu device plugin, IIUC the node labeller wipes it and adds only the custom ones
[09:24:26] amd.com/gpu.vram is probably safer, to avoid pods requiring a little bit of vram being scheduled on big GPUs
[10:01:33] good morning! :)
[10:02:10] hi
[10:04:43] hey hey
[10:08:51] (PS1) Kevin Bazira: test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179)
[10:12:45] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11362304 (kevinbazira) I have added unit tests for critical components of the model-server to make sure future changes do n...
[10:13:36] (CR) Kevin Bazira: "This patch has been tested locally as shown in: https://phabricator.wikimedia.org/T406179#11362304" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:30:57] (CR) Gkyziridis: [C: +1] "Thnx for adding unit testing Kevin!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:31:08] kevinbazira: Thnx for adding unit testing mate!
[10:31:47] thanks for the review :)
[10:32:05] (CR) Kevin Bazira: [C: +2] test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:33:27] (Merged) jenkins-bot: test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:51:24] elukey: isn't the label amd.com/gpu.vram: 24G wrong tho? there's way more than that on the node even after partitioning
[10:52:38] actually no, i'm wrong, it's 24g per partition
[10:52:58] exactly yes
[10:53:08] atm I think the GPUs are partitioned in two
[10:53:27] 8
[10:53:38] https://www.irccloud.com/pastebin/0lyxc9E3/
[10:53:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:53:49] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[10:53:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
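On the amd.com/gpu vs amd.com/gpu.vram point above: if the node labeller really does replace the plain amd.com/gpu resource with vram-based ones, the pod's resource block would have to request the partitioned resource instead. A hedged sketch, assuming amd.com/gpu.vram is advertised as a schedulable extended resource; the 24G-per-partition figure is the label value from the discussion above.

```yaml
# Sketch only: resource name and value are from the chat; exact placement inside
# the isvc/chart is assumed.
resources:
  limits:
    amd.com/gpu.vram: 24G    # one GPU partition; the node labeller reports 24G per partition
  requests:
    amd.com/gpu.vram: 24G    # extended resources must have requests == limits
```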
[10:55:18] although it's represented in a weird way, where the first entry is the first partition but also the physical unit as a whole it seems
[10:55:27] right sorry I checked 1011 before
[10:57:45] Warning FailedScheduling 5m59s (x1 over 7m9s) default-scheduler 0/14 nodes are available: 1 Insufficient amd.com/gpu.vram, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
[11:13:48] dpogorzelski: looks like you tagged the wrong Phab ticket for gpu testing patches lol
[11:14:02] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203787
[11:14:17] sorry :D
[11:18:01] aiko: o/ how much memory is aya need? Do you recall?
[11:18:13] s/is/does
[11:23:16] https://www.irccloud.com/pastebin/bEWRnPMd/
[11:43:51] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11362665 (Samwalton9-WMF) >>! In T409438#11360643, @Kgraessle wrote: > @Samwalton9-WMF > > There's a few t...
[11:45:39] elukey: I don't recall it, but in theory it needs 8*2*1.2 = ~19.2 GB of memory for aya-8B
[11:45:56] dpogorzelski: --^
[11:46:05] thanks :)
[11:46:44] maybe on the GPU it will be different, we won't need as much memory on the container
[11:50:45] ah the number I calculated is the GPU VRAM needed for loading the model weights and inference
[11:51:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:51:49] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:51:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:52:26] I don't recall how much system ram it needs
[12:21:46] probably not much
[12:40:24] it's still stuck on `botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=llm%2Faya-expanse-8B%2F&encoding-type=url"`
[12:40:50] elukey: not sure if I missed some changes to be applied there on my side?
[12:41:33] where is `traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32` applied?
[12:41:40] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11362789 (achou) >>! In T409414#11359529, @Eevans wrote: >>>! In T409414#11354310, @DPogorzelski-WMF wrote: >> @Eevans i guess we can just start with a set of shared credentials and split la...
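For context on the timeout above: that URL is the KServe storage-initializer listing the model bucket, so it maps back to an S3-style storage URI roughly like the sketch below. Only the endpoint, bucket and prefix are taken from the error message; the field name and its placement in the chart are assumptions. The timeout itself points at pod connectivity rather than a wrong URI.

```yaml
# Sketch: how the failing ListObjects call relates to the model storage config.
#   endpoint: https://thanos-swift.discovery.wmnet   (S3-compatible Swift)
#   bucket:   wmf-ml-models
#   prefix:   llm/aya-expanse-8B/
storageUri: s3://wmf-ml-models/llm/aya-expanse-8B/   # field placement in the chart assumed
```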
[12:58:07] aha nvm i see it in the pod
[13:26:52] when the aya pod starts you can see calico pod logs like:
[13:26:53] `2025-11-11 11:20:53.403 [INFO][103] felix/int_dataplane.go 1585: Received *proto.ActivePolicyUpdate update from calculation graph msg=id: policy: dst_ports: dst_ports: dst_ports: dst_ports:
[13:26:53] rule_id:"RPSf-RXLGcYBu4O7" > outbound_rules: dst_net:"10.2.1.54/32" dst_ports: rule_id:"n5eND-JlkWR0v4_P" > outbound_rules: dst_net:"10.2.2.54/32" dst_ports: rule_id:"1Yf5YgzNl_lDoNoc" > >`
[13:27:19] which is correct, it should allow outbound traffic
[13:38:31] i don't see anything weird there honestly, will keep looking around
[13:39:26] but having to spend this much time to get a simple pod running is starting to make me question how much k8s is helping here and how much of a problem source it actually is :)
[13:40:10] oh yes I have been wondering this for a long time
[13:40:25] I think everybody wonders the same every now and then
[13:50:31] so once a pod is working how does it become reachable across the load balancer chains?
[13:51:00] i guess an entry is provided somewhere and it's not discovered fully automatically
[13:56:19] answering that in a second, but first I think I found the problem
[13:56:37] elukey@ml-serve1012:~$ sudo calicoctl node status
[13:56:37] Calico process is running.
[13:56:37] IPv4 BGP status
[13:56:37] No IPv4 peers found.
[13:57:18] is calico and istio really needed though?
[13:57:42] calico yes, it is the overlay network for pods
[13:57:47] we use it everywhere
[13:58:25] istio was required by kserve when we started, it may not be needed now but I am not sure if it would be better or not
[13:59:33] so k8s hosts need to BGP peer either with the core DC routers (old rows) or with L3 switches, and I think ml-serve1012 may be connected to one that is not allowed to BGP peer yet
[13:59:36] lemme check
[13:59:44] i'm more thinking that it's just a few nodes with a few services, we could execute them just via systemd with no orchestrator outside of a CI pipeline :)
[13:59:46] kk
[14:00:49] i'll be back in 1 hr
[14:00:56] about your earlier question - when an InferenceService resource is up, istio and knative set up all the configs to reach the new pods. Usually we just need to use the right URI after https://inference.discovery.wmnet
[14:07:30] dpogorzelski: this should be the fix https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203820
[14:08:12] ml-serve1012 is connected to an L3 switch that is not configured to BGP peer, so calico cannot really do much
[15:41:57] hmmm but why bgp? do we ever have situations where pods talk to each other? unless i'm mistaken the flow is: clients-->ingress-->pod, or pod-->some other external service, pod-to-pod doesn't seem to be a thing, or?
[15:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[15:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:54:51] dpogorzelski: there are various use cases, the first one that comes to mind is Prometheus scraping metrics from pods directly. Those IPs need to be reachable from the prod network.. When a pod needs to reach a discovery VIP/IP etc..
[15:55:12] we don't have an egress gateway or similar
[15:55:46] so each k8s host announces via BGP the IP subnets that are allocated to it by the control plane
[15:55:57] and those pods become reachable
[15:56:19] but there could be a prom daemonset on each k8s node which would be much simpler :)
[15:56:31] or even a sidecar to each deployment
[15:57:16] but then you'd have a special case for everything, rather than just plain connectivity
[15:57:31] I am not sure that your solution would be much simpler to be honest
[16:02:41] in this specific case we are a little bit more unlucky since ml-serve1012 is connected to a brand new Nokia switch, and I don't believe it is ready to act as an L3 switch yet
[16:04:29] my general experience is that relying on network setup to deliver high level features yields complexity that becomes immovable once established, simple foundations always win :)
[16:11:10] my point is that I am not sure that simple foundations always win, since we are talking about a basic overlay network between pods. I'd agree with you that things like Istio are a huge complexity that we don't need, especially as sidecars
[16:11:31] but even relying on sidecars for simple things like implementing a mesh is not trivial
[16:11:44] for example, if a pod needs to contact an external service
[16:12:05] In the ML use case, we do have extra complexity since we have calico and istio sidecars
[16:12:16] and that is not great I know, but it was mandatory
[16:16:13] getting back to the BGP issue - there is another bit of info that I didn't share, since I always forget
[16:16:45] in order to allow a host to BGP peer we need to turn on a flag in https://netbox.wikimedia.org/dcim/devices/6344/ and run a tool called "homer", which propagates the network config where needed
[16:17:09] In this case I am not sure if we can do it because of the nokia switch, so I'll wait for Cathal to come back to me
[16:23:33] one simplification here is to perhaps use simple deployments and deprecate knative+kserve, after all, these services are just rest endpoints
[16:23:46] i'm not sure we really need the serverless experience
[16:27:25] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850 (Eevans) NEW
[16:29:13] we do use autoscaling in some use cases, that is nice to have
[16:29:58] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363705 (Eevans) p:Triage→Medium
[16:30:12] kserve is deeply integrated with the ML standards that the community is using, so it may become a real burden to bypass it to create a custom service
[16:31:05] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11363715 (Eevans) >>! In T409414#11362789, @achou wrote: >>>! In T409414#11359529, @Eevans wrote: >>>>! In T409414#11354310, @DPogorzelski-WMF wrote: >>> @Eevans i guess we can just start wi...
[16:31:07] autoscaling: agree, but generally that is a luxury problem so we can always solve it later and take the easy path now without it
[16:31:07] kserve: are we actually using specific things out of that box?
[16:32:44] I agree for autoscaling we can probably simply tune the number of replicas and be done with it; getting it back once knative is removed may be harder. I see that kserve offers a "raw deployment" option now, so that is something worth testing
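To make the "tune the number of replicas" idea above concrete: KServe exposes minReplicas/maxReplicas on the predictor, so pinning them is one way to drop the autoscaling requirement. A minimal sketch with illustrative values, not the actual aya config:

```yaml
# Sketch only: pin replicas instead of relying on scale-to-zero/autoscaling.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
spec:
  predictor:
    minReplicas: 1     # keep one pod always up (no scale-to-zero)
    maxReplicas: 1     # effectively disables autoscaling for this isvc
```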
[16:33:43] kserve is used in two places: 1) as the skeleton for all the inference services that we have (the python code I mean) 2) to create a standard URI path at which to query models when they get deployed
[16:34:10] plus other things, like integrating with various other open source projects
[16:34:28] (hugging face, vllm, etc.. writing custom images only for kserve for example)
[16:34:52] and it is a CNCF incubating project atm IIUC, so a ton of community and support
[16:35:01] maybe 1) is just to fit in the inferenceservice crd? 2) is probably not specific to kserve
[16:35:35] at the end of the day the ML services load a model from "a place" and consume a gpu device
[16:36:05] no, 1) is a little more: you have a framework where you implement preprocess/process/postprocess in python and the whole http service is created for you, metrics included
[16:36:38] 2) is getting more standard, but again why do you want to replicate things that the ML community is converging on?
[16:36:55] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363741 (Eevans)
[16:37:08] we'd need to redo and adapt a ton of work that we've done so far, not sure with what gains
[16:37:29] I get the knative simplification, probably a really nice exploratory task to do
[16:37:58] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363751 (Eevans) With respect to `GRANT`s, is it safe to assume that `MODIFY` is sufficient? There is no requirement to do reads here, is there?
[16:46:14] yeah kserve now has a standard deployment (without knative) https://kserve.github.io/website/docs/concepts/architecture#deployment-modes that wasn't an option before
[16:46:17] we should explore this for LLM
[16:46:42] and it says it's highly recommended for LLM Serving..
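For reference on the "raw deployment" mode linked above: upstream KServe selects it per InferenceService with an annotation (and, IIUC, the cluster-wide default lives in the inferenceservice-config ConfigMap). A minimal sketch, isvc name illustrative:

```yaml
# Sketch: opt a single isvc out of Knative; KServe then manages a plain
# Deployment/Service (plus an HPA) instead of a Knative Revision.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor: {}        # predictor spec elided; unchanged from the existing isvc
```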
[16:54:50] it could be something to explore before the upgrade to k8s 1.31
[16:55:07] it would surely simplify the process if knative was removed from the picture
[16:56:26] 🤸
[18:05:21] (PS1) Reedy: build: Update MediaWiki requirement to 1.46 [extensions/ORES] - https://gerrit.wikimedia.org/r/1203988 (https://phabricator.wikimedia.org/T409239)
[18:33:05] Machine-Learning-Team: Q2 FY2025-26 Goal: - https://phabricator.wikimedia.org/T409863 (Sucheta-Salgaonkar-WMF) NEW
[18:37:39] Machine-Learning-Team: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364345 (Sucheta-Salgaonkar-WMF)
[18:49:59] Machine-Learning-Team: Iterate on Annotool functionality to support more use cases - https://phabricator.wikimedia.org/T409866 (Sucheta-Salgaonkar-WMF) NEW
[18:58:52] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364421 (Sucheta-Salgaonkar-WMF)
[18:59:19] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364425 (Sucheta-Salgaonkar-WMF)
[18:59:31] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11364430 (Sucheta-Salgaonkar-WMF)
[18:59:35] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11364431 (Sucheta-Salgaonkar-WMF)
[19:00:51] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11364438 (Sucheta-Salgaonkar-WMF)
[19:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[19:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[19:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:49:33] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11364771 (Eevans)
[23:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[23:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas