[00:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[00:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[04:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[07:54:47] good morning
[07:58:32] morning! :)
[08:01:29] (PS1) Bartosz Wójtowicz: revise-tone-task-generator: Add cache to the model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203757 (https://phabricator.wikimedia.org/T408538)
[08:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[08:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:58:37] regarding the storage access part, the egress seems to be enabled towards thanos
[09:01:54] o/
[09:02:56] for the storage-initializer there is a caveat: since it is an init-container, when it starts the istio gateway settings are not ready and without special settings, it blackholes all the traffic.
[09:03:31] we use special annotations for the storage initializer in Istio to allow TCP connections to the thanos-swift IPs
[09:04:16] IIUC with new versions of Kserve/Istio this is not needed, since the storage-initializer runs with a user that is automatically allowed to perform TCP conns bypassing the istio proxy
[09:04:52] in theory it should work on trixie too, but there may be some extra caveats
[09:09:07] the annotation is
[09:09:08] traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32
[09:10:57] it is likely calico not working in my opinion, so no connectivity for the pod
[09:12:23] I tried to kill/restart the calico pod, and the same for the aya pod
[09:12:34] I am wondering if the settings applied yesterday need a clean start
[09:15:04] 0/14 nodes are available: 1 Insufficient amd.com/gpu, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
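For readers following along: the excludeOutboundIPRanges annotation discussed above has to end up on the predictor pod so the init container can bypass the istio-proxy. Below is a minimal sketch of how that can look on an InferenceService; the isvc name is illustrative, and the assumption that metadata annotations propagate down to the predictor pods (rather than the exact wiring in our deployment-charts) is mine — only the annotation key and the thanos-swift VIPs come from the conversation.

```yaml
# Sketch only: annotation key and IPs are from the log above; the isvc name and
# the annotation propagation path are assumptions, not copied from the actual chart.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
  annotations:
    # The storage-initializer is an init container, so it starts before the
    # istio-proxy is ready; excluding the thanos-swift VIPs from sidecar
    # redirection lets it reach object storage directly.
    traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32
```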
[09:18:37] I am wondering if the new gpu node labeller removed the amd.com/gpu label for the host, since it adds more specific ones
[09:19:24] so probably it needs something like amd.com/gpu.vram=64G in limits
[09:21:38] https://github.com/ROCm/gpu-operator/issues/151
[09:21:40] uff
[09:22:46] so amd.com/gpu was added by the gpu device plugin, IIUC the node labeller wipes it and adds only the custom ones
[09:24:26] amd.com/gpu.vram is probably safer, to avoid pods requiring a little bit of vram being scheduled on big GPUs
[10:01:33] good morning! :)
[10:02:10] hi
[10:04:43] hey hey
[10:08:51] (PS1) Kevin Bazira: test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179)
[10:12:45] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11362304 (kevinbazira) I have added unit tests for critical components of the model-server to make sure future changes do n...
[10:13:36] (CR) Kevin Bazira: "This patch has been tested locally as shown in: https://phabricator.wikimedia.org/T406179#11362304" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:30:57] (CR) Gkyziridis: [C: +1] "Thnx for adding unit testing Kevin!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:31:08] kevinbazira: Thnx for adding unit testing mate!
[10:31:47] thanks for the review :)
[10:32:05] (CR) Kevin Bazira: [C: +2] test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:33:27] (Merged) jenkins-bot: test: add unit tests for revertrisk-wikidata [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203781 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:51:24] elukey: isn't the label amd.com/gpu.vram: 24G wrong tho? there's way more than that on the node even after partitioning
[10:52:38] actually no, i'm wrong, it's 24g per partition
[10:52:58] exactly yes
[10:53:08] atm I think the GPUs are partitioned in two
[10:53:27] 8
[10:53:38] https://www.irccloud.com/pastebin/0lyxc9E3/
[10:53:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[10:53:49] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[10:53:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
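On the amd.com/gpu vs amd.com/gpu.vram point above: if the node labeller really does replace the plain amd.com/gpu resource with vram-based ones, the pod's resource block would have to request the partitioned resource instead. A hedged sketch, assuming amd.com/gpu.vram is advertised as a schedulable extended resource; the 24G-per-partition figure is the label value from the discussion above.

```yaml
# Sketch only: resource name and value are from the chat; exact placement inside
# the isvc/chart is assumed.
resources:
  limits:
    amd.com/gpu.vram: 24G    # one GPU partition; the node labeller reports 24G per partition
  requests:
    amd.com/gpu.vram: 24G    # extended resources must have requests == limits
```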
[10:55:18] although it's represented in a weird way, where the first entry is the first partition but also the physical unit as a whole it seems
[10:55:27] right sorry I checked 1011 before
[10:57:45] Warning FailedScheduling 5m59s (x1 over 7m9s) default-scheduler 0/14 nodes are available: 1 Insufficient amd.com/gpu.vram, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
[11:13:48] dpogorzelski: looks like you tagged the wrong Phab ticket for gpu testing patches lol
[11:14:02] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203787
[11:14:17] sorry :D
[11:18:01] aiko: o/ how much memory is aya need? Do you recall?
[11:18:13] s/is/does
[11:23:16] https://www.irccloud.com/pastebin/bEWRnPMd/
[11:43:51] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11362665 (Samwalton9-WMF) >>! In T409438#11360643, @Kgraessle wrote: > @Samwalton9-WMF > > There's a few t...
[11:45:39] elukey: I don't recall it, but in theory it needs 8*2*1.2 = ~19.2 GB of memory for aya-8B
[11:45:56] dpogorzelski: --^
[11:46:05] thanks :)
[11:46:44] maybe on the GPU it will be different, we won't need as much memory on the container
[11:50:45] ah the number I calculated is the GPU VRAM needed for loading the model weights and inference
[11:51:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:51:49] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:51:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:52:26] I don't recall how much system ram it needs
[12:21:46] probably not much
[12:40:24] it's still stuck on `botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=llm%2Faya-expanse-8B%2F&encoding-type=url"`
[12:40:50] elukey: not sure if I missed some changes to be applied there on my side?
[12:41:33] where is `traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32` applied?
[12:41:40] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11362789 (achou) >>! In T409414#11359529, @Eevans wrote: >>>! In T409414#11354310, @DPogorzelski-WMF wrote: >> @Eevans i guess we can just start with a set of shared credentials and split la...
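For context on the timeout above: that URL is the KServe storage-initializer listing the model bucket, so it maps back to an S3-style storage URI roughly like the sketch below. Only the endpoint, bucket and prefix are taken from the error message; the field name and its placement in the chart are assumptions. The timeout itself points at pod connectivity rather than a wrong URI.

```yaml
# Sketch: how the failing ListObjects call relates to the model storage config.
#   endpoint: https://thanos-swift.discovery.wmnet   (S3-compatible Swift)
#   bucket:   wmf-ml-models
#   prefix:   llm/aya-expanse-8B/
storageUri: s3://wmf-ml-models/llm/aya-expanse-8B/   # field placement in the chart assumed
```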
[12:58:07] aha nvm i see it in the pod
[13:26:52] when the aya pod starts you can see calico pod logs like:
[13:26:53] `2025-11-11 11:20:53.403 [INFO][103] felix/int_dataplane.go 1585: Received *proto.ActivePolicyUpdate update from calculation graph msg=id: policy: dst_ports: dst_ports: dst_ports: dst_ports:
[13:26:53] rule_id:"RPSf-RXLGcYBu4O7" > outbound_rules: dst_net:"10.2.1.54/32" dst_ports: rule_id:"n5eND-JlkWR0v4_P" > outbound_rules: dst_net:"10.2.2.54/32" dst_ports: rule_id:"1Yf5YgzNl_lDoNoc" > >`
[13:27:19] which is correct, it should allow outbound traffic
[13:38:31] i don't see anything weird there honestly, will keep looking around
[13:39:26] but having to spend this much time to get a simple pod running is starting to make me question how much k8s is helping here and how much of a problem source it actually is :)
[13:40:10] oh yes I have been wondering this for a long time
[13:40:25] I think everybody wonders the same every now and then
[13:50:31] so once a pod is working how does it become reachable across the load balancer chains?
[13:51:00] i guess an entry is provided somewhere and it's not discovered fully automatically
[13:56:19] answering that in a second, but first I think I found the problem
[13:56:37] elukey@ml-serve1012:~$ sudo calicoctl node status
[13:56:37] Calico process is running.
[13:56:37] IPv4 BGP status
[13:56:37] No IPv4 peers found.
[13:57:18] is calico and istio really needed though?
[13:57:42] calico yes, it is the overlay network for pods
[13:57:47] we use it everywhere
[13:58:25] istio was required by kserve when we started, it may not be needed now but I am not sure if it would be better or not
[13:59:33] so k8s hosts need to BGP peer either with the core DC routers (old rows) or with L3 switches, and I think ml-serve1012 may be connected to one that is not allowed to BGP peer yet
[13:59:36] lemme check
[13:59:44] i'm more thinking that it's just a few nodes with a few services, we could execute them just via systemd with no orchestrator outside of a CI pipeline :)
[13:59:46] kk
[14:00:49] i'll be back in 1 hr
[14:00:56] about your earlier question - when an InferenceService resource is up, istio and knative set up all the configs to reach the new pods. Usually we just need to use the right URI after https://inference.discovery.wmnet
[14:07:30] dpogorzelski: this should be the fix https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203820
[14:08:12] ml-serve1012 is connected to an L3 switch that is not configured to BGP peer, so calico cannot really do much
[15:41:57] hmmm but why bgp? do we ever have situations where pods talk to each other? unless i'm mistaken the flow is: clients-->ingress-->pod, or pod-->some other external service, pod-to-pod doesn't seem to be a thing, or?
[15:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[15:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:54:51] dpogorzelski: there are various use cases, the first one that comes to mind is Prometheus scraping metrics from pods directly. Those IPs need to be reachable from the prod network.. When a pod needs to reach a discovery VIP/IP etc..
[15:55:12] we don't have an egress gateway or similar
[15:55:46] so each k8s host announces via BGP the IP subnets that are allocated to it by the control plane
[15:55:57] and those pods become reachable
[15:56:19] but there could be a prom daemonset on each k8s node which would be much simpler :)
[15:56:31] or even a sidecar to each deployment
[15:57:16] but then you'd have a special case for everything, rather than just plain connectivity
[15:57:31] I am not sure that your solution would be much simpler to be honest
[16:02:41] in this specific case we are a little bit more unlucky since ml-serve1012 is connected to a brand new Nokia switch, and I don't believe it is ready to act as an L3 switch yet
[16:04:29] my general experience is that relying on network setup to deliver high level features yields complexity that becomes immovable once established, simple foundations always win :)
[16:11:10] my point is that I am not sure that simple foundations always win, since we are talking about a basic overlay network between pods. I'd agree with you that things like Istio are a huge complexity that we don't need, especially as sidecars
[16:11:31] but even relying on sidecars for simple things like implementing a mesh is not trivial
[16:11:44] for example, if a pod needs to contact an external service
[16:12:05] In the ML use case, we do have extra complexity since we have calico and istio sidecars
[16:12:16] and that is not great I know, but it was mandatory
[16:16:13] getting back to the BGP issue - there is another bit of info that I didn't share, since I always forget
[16:16:45] in order to allow a host to BGP peer we need to turn on a flag in https://netbox.wikimedia.org/dcim/devices/6344/ and run a tool called "homer", which propagates the network config where needed
[16:17:09] In this case I am not sure if we can do it because of the nokia switch, so I'll wait for Cathal to come back to me
[16:23:33] one simplification here is to perhaps use simple deployments and deprecate knative+kserve, after all, these services are just rest endpoints
[16:23:46] i'm not sure we really need the serverless experience
[16:27:25] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850 (Eevans) NEW
[16:29:13] we do use autoscaling in some use cases, that is nice to have
[16:29:58] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363705 (Eevans) p:Triage→Medium
[16:30:12] kserve is deeply integrated with the ML standards that the community is using, so it may become a real burden to bypass it to create a custom service
[16:31:05] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11363715 (Eevans) >>! In T409414#11362789, @achou wrote: >>>! In T409414#11359529, @Eevans wrote: >>>>! In T409414#11354310, @DPogorzelski-WMF wrote: >>> @Eevans i guess we can just start wi...
[16:31:07] autoscaling: agree, but generally that is a luxury problem so we can always solve it later and take the easy path now without it
[16:31:07] kserve: are we actually using specific things out of that box?
[16:32:44] I agree for autoscaling we can probably simply tune the number of replicas and be done with it; getting it back once knative is removed may be harder. I see that kserve offers a "raw deployment" option now, so that is something worth testing
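To make the "tune the number of replicas" idea above concrete: KServe exposes minReplicas/maxReplicas on the predictor, so pinning them is one way to drop the autoscaling requirement. A minimal sketch with illustrative values, not the actual aya config:

```yaml
# Sketch only: pin replicas instead of relying on scale-to-zero/autoscaling.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
spec:
  predictor:
    minReplicas: 1     # keep one pod always up (no scale-to-zero)
    maxReplicas: 1     # effectively disables autoscaling for this isvc
```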
[16:33:43] kserve is used in two places: 1) as the skeleton for all the inference services that we have (the python code I mean) 2) to create a standard URI path at which to query models when they get deployed
[16:34:10] plus other things, like integrating with various other open source projects
[16:34:28] (hugging face, vllm, etc.. writing custom images only for kserve for example)
[16:34:52] and it is a CNCF incubating project atm IIUC, so a ton of community and support
[16:35:01] maybe 1) is just to fit in the inferenceservice crd? 2) is probably not specific to kserve
[16:35:35] at the end of the day the ML services load a model from "a place" and consume a gpu device
[16:36:05] no, 1) is a little more: you have a framework where you implement preprocess/process/postprocess in python and the whole http service is created for you, metrics included
[16:36:38] 2) is getting more standard, but again why do you want to replicate things that the ML community is converging on?
[16:36:55] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363741 (Eevans)
[16:37:08] we'd need to redo and adapt a ton of work that we've done so far, not sure with what gains
[16:37:29] I get the knative simplification, probably a really nice exploratory task to do
[16:37:58] Machine-Learning-Team, Data-Persistence: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11363751 (Eevans) With respect to `GRANT`s, is it safe to assume that `MODIFY` is sufficient? There is no requirement to do reads here, is there?
[16:46:14] yeah kserve now has a standard deployment (without knative) https://kserve.github.io/website/docs/concepts/architecture#deployment-modes that wasn't an option before
[16:46:17] we should explore this for LLM
[16:46:42] and it says it's highly recommended for LLM Serving..
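For reference on the "raw deployment" mode linked above: upstream KServe selects it per InferenceService with an annotation (and, IIUC, the cluster-wide default lives in the inferenceservice-config ConfigMap). A minimal sketch, isvc name illustrative:

```yaml
# Sketch: opt a single isvc out of Knative; KServe then manages a plain
# Deployment/Service (plus an HPA) instead of a Knative Revision.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: aya            # illustrative
  namespace: llm
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor: {}        # predictor spec elided; unchanged from the existing isvc
```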
[16:54:50] it could be something to explore before the upgrade to k8s 1.31
[16:55:07] it would surely simplify the process if knative was removed from the picture
[16:56:26] 🤸
[18:05:21] (PS1) Reedy: build: Update MediaWiki requirement to 1.46 [extensions/ORES] - https://gerrit.wikimedia.org/r/1203988 (https://phabricator.wikimedia.org/T409239)
[18:33:05] Machine-Learning-Team: Q2 FY2025-26 Goal: - https://phabricator.wikimedia.org/T409863 (Sucheta-Salgaonkar-WMF) NEW
[18:37:39] Machine-Learning-Team: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364345 (Sucheta-Salgaonkar-WMF)
[18:49:59] Machine-Learning-Team: Iterate on Annotool functionality to support more use cases - https://phabricator.wikimedia.org/T409866 (Sucheta-Salgaonkar-WMF) NEW
[18:58:52] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364421 (Sucheta-Salgaonkar-WMF)
[18:59:19] Machine-Learning-Team, Goal, OKR-Work: Q2 FY2025-26 Goal: Generate a list of edit suggestions using machine learning - https://phabricator.wikimedia.org/T409863#11364425 (Sucheta-Salgaonkar-WMF)
[18:59:31] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Task generation engine for Revise Tone task - https://phabricator.wikimedia.org/T408341#11364430 (Sucheta-Salgaonkar-WMF)
[18:59:35] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages - https://phabricator.wikimedia.org/T400423#11364431 (Sucheta-Salgaonkar-WMF)
[19:00:51] Machine-Learning-Team, Goal, OKR-Work: Q1 FY2025-26 Goal: Make article topic data available at scale and within SLOs for Year in Review - https://phabricator.wikimedia.org/T392833#11364438 (Sucheta-Salgaonkar-WMF)
[19:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[19:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[19:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:49:33] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11364771 (Eevans)
[23:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[23:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas