[02:59:43] (PS3) Tim Starling: Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087)
[08:20:27] morning, i have updated knative, will deploy the test llm model now
[08:21:48] deployed, let me see if it works
[08:27:14] (PS1) Kevin Bazira: docker-compose: add revertrisk-wikidata config [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203376 (https://phabricator.wikimedia.org/T406179)
[08:52:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[08:52:49] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[08:52:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:00:18] hmmm seems like the inference service was taken in but there's no pod to be seen anywhere
[09:22:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:23:12] dpogorzelski: o/
[09:24:32] so afaics in kubectl get events -n llm I can see
[09:24:33] Error creating: pods "aya-llm-predictor-00002-deployment-5f7bd55b65-fzffl" is forbidden: [maximum memory usage per Container is 8Gi, but limit is 35Gi, maximum cpu usage per Pod is 10, but limit is 11, maximum memory usage per Pod is 10Gi, but limit is 38811992064]
[09:26:15] so I suspect that where aya is deployed in staging it has higher limitranges
[09:34:33] ah ok we don't deploy it in staging atm
[09:34:57] anyway, on ml-serve-eqiad you can check the limits via `kubectl get limitranges -n llm -oyaml`
[09:36:18] if you go in deployment-charts' helmfile.d/admin_ng/ml-staging-codfw/values.yaml you'll see the limitranges settings that we apply for each namespace (if there is the need to override, we have a default as well)
[09:37:01] (PS2) Nik Gkountas: collection recs: fix lead section size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203199 (https://phabricator.wikimedia.org/T403730)
[09:37:01] (PS2) Nik Gkountas: search recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203200 (https://phabricator.wikimedia.org/T403730)
[09:37:01] (PS2) Nik Gkountas: popular recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203201 (https://phabricator.wikimedia.org/T403730)
[09:38:07] (CR) Nik Gkountas: collection recs: fix lead section size filtering (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203199 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[09:38:11] for ml-serve-eqiad, you can probably just override helmfile.d/admin_ng/ml-serve.yaml, that will be applied to both eqiad and codfw (you can also override ml-serve-eqiad.yaml if you prefer, but it seems easier/more-consistent otherwise)
[09:39:27] in case of emergency, or if you need to be quick, you can also simply do `kubectl edit limitranges -n llm` for ml-serve-eqiad (from the admin/root account), modify/save and see if the pod comes up
[09:39:38] (and then of course follow up with a patch etc..)
[09:39:56] last but not least, please apply the knative settings to staging and codfw when you are done :)
[09:59:49] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203376 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[10:07:04] is there anybody on-call this week?
[10:07:50] yesterday (I was oncall for SRE) I noticed https://phabricator.wikimedia.org/T409657
[10:08:01] the access is restricted to subscribers, I didn't add all the team members
[10:08:05] cc: aiko --^
[10:12:17] elukey: This week I am on MLOps rotation
[10:12:49] georgekyz: o/ thanksss - lemme know if you have time to review the above task (you should be able to see it)
[10:13:29] elukey: Yeap I can see the ticket, I will report it in the MLOps incidents doc
[10:13:46] thnx for reporting it @elukey
[10:13:48] maybe I was over-cautious with restricting access, but better safe than sorry :D
[10:14:07] feel free to move it to public when you feel ok
[10:14:11] ✊
[10:14:38] 👍
[10:24:52] elukey: Is it ok if I add the "essential work" to it since it seems that this task falls into the MLOps rotation work ??
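The LimitRange rejection quoted at 09:24:33 lists three separate violations. A minimal Python sketch, not part of the original conversation, that parses such a message and reports by how much each requested limit exceeds the namespace maximum. The helper names `parse_quantity` and `limit_violations` are hypothetical, and only the plain-integer and binary-suffix (Ki/Mi/Gi/Ti) quantity forms seen in the log are handled:

```python
import re

# Binary suffixes used by Kubernetes quantities; only the forms seen in the log are handled.
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

# One violation clause, e.g. "maximum cpu usage per Pod is 10, but limit is 11".
_VIOLATION = re.compile(
    r"maximum (\w+) usage per (\w+) is (\w+), but limit is (\w+)"
)


def parse_quantity(text: str) -> int:
    """Convert '8Gi' or '38811992064' to an integer (bytes for memory, cores for cpu)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", text)
    if match is None:
        raise ValueError(f"unsupported quantity: {text!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIXES.get(suffix, 1)


def limit_violations(message: str) -> list[tuple[str, str, int, int]]:
    """Return (resource, scope, allowed_maximum, requested_limit) for each violation."""
    return [
        (resource, scope, parse_quantity(maximum), parse_quantity(limit))
        for resource, scope, maximum, limit in _VIOLATION.findall(message)
    ]


# The error text copied from the 09:24:33 message above.
EVENT = (
    "maximum memory usage per Container is 8Gi, but limit is 35Gi, "
    "maximum cpu usage per Pod is 10, but limit is 11, "
    "maximum memory usage per Pod is 10Gi, but limit is 38811992064"
)

for resource, scope, maximum, limit in limit_violations(EVENT):
    print(f"{resource} per {scope}: limit {limit} exceeds maximum {maximum} by {limit - maximum}")
```

Comparing normalized integers makes it easier to see which LimitRange values would need raising before the pod can be admitted.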
[10:27:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:29:03] georgekyz: I think so but I have no idea how ML triages these tasks :(
[11:34:00] (CR) Kevin Bazira: [C:+2] docker-compose: add revertrisk-wikidata config [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203376 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[11:34:35] (Merged) jenkins-bot: docker-compose: add revertrisk-wikidata config [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1203376 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)
[12:22:29] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11358283 (achou) > regarding 2. would flipping egress to true here be sufficient? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/kserv...
[13:02:41] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11358384 (achou) > Option A would require some talk with SRE but given the size of the topic and the current /srv usage in main-eqiad / codfw I don't see any big opposition in havi...
[13:07:13] Machine-Learning-Team, LPL Hypothesis, Recommendation-API, LPL Projects (Other): Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11358411 (Nikerabbit)
[13:08:46] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11358428 (gkyziridis)
[13:37:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:47:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:47:50] I've merged the patch but the diff is not showing up on the deployment server. What can be wrong?
[13:47:50] `kartik@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw diff`
[13:47:50] `skipping missing values file matching "values-main.yaml"`
[13:47:50] `Comparing release=main, chart=wmf-stable/python-webapp, namespace=recommendation-api-ng`
[13:54:29] Seems helmfile.d/ml-services/llm/values-ml-serve-eqiad.yaml file is modified directly?
[13:57:33] Can anyone take a look at `helmfile.d/ml-services/llm/values-ml-serve-eqiad.yaml`? git shows it is changed, and it is probably blocking sync of the repository.
[13:58:45] elukey: ^^
[14:05:08] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11358699 (DPogorzelski-WMF) @achou which services should be able to connect to cassandra? to know where to enable egress @Eevans I would need to know the cassandra endpoint and possible a se...
[14:05:40] let me see i might have left it over
[14:05:46] while testing
[14:05:59] but if you need to proceed you can discard
[14:06:04] it's not important
[14:06:13] kart_:
[14:06:27] I'm not sure if I've access to pull git. But let me try.
[14:06:34] i can fix
[14:06:35] sec
[14:07:01] now
[14:07:25] cool. Working. Thanks a lot dpogorzelski
[14:07:31] sorry about it
[14:09:32] No problem
[14:29:56] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11358786 (gkyziridis) ==== Tested Locally ==== I downloaded the same model binary (pickle) that we are using currently on staging and build the model server locally using `m...
[14:33:29] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11358844 (elukey) @gkyziridis I am not 100% sure if the rev-id in the task's description is the problematic one, I thought it was when checking the logs but you may need to...
[14:49:35] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11358954 (gkyziridis) >>! In T409657#11358844, @elukey wrote: > @gkyziridis I am not 100% sure if the rev-id in the task's description is the problematic one, I thought it w...
[15:44:25] (CR) Sbisson: [C:+2] collection recs: fix lead section size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203199 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:44:30] (CR) Sbisson: [C:+2] search recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203200 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:44:33] (CR) Sbisson: [C:+2] popular recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203201 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:46:07] (Merged) jenkins-bot: collection recs: fix lead section size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203199 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:46:08] (Merged) jenkins-bot: search recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203200 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:46:15] (Merged) jenkins-bot: popular recs: do not add lead_section_size when no lead_section URL param [research/recommendation-api] - https://gerrit.wikimedia.org/r/1203201 (https://phabricator.wikimedia.org/T403730) (owner: Nik Gkountas)
[15:54:04] Machine-Learning-Team, Patch-For-Review: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11359315 (achou) @DPogorzelski-WMF The service to connect to Cassandra is the revise-tone-task-generator that @BWojtowicz-WMF is working on in T408538. Currently, it is...
[15:56:27] elukey: how do you setup calico? is there a puppet automation? seems to be missing on 1012
[15:56:39] 5m31s Warning FailedCreatePodSandBox pod/aya-llm-predictor-00002-deployment-66cd8fd9f-6kgdb (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "87f62b30083d0ab6e2f734f85181e2fce0313cf751c0b6ebf6df3aec7688627c": plugin type="calico" failed (add): failed to find plugin "calico" in path [/usr/lib/cni]
[15:59:39] we do yes, lemme check the packages
[15:59:48] Machine-Learning-Team, Research-engineering, Research (FY2025-26-Research-October-December): Share code between Research & ML teams - https://phabricator.wikimedia.org/T398974#11359332 (fkaelin) Weekly updates - started implementation of commons-utils as a project in ml-pipeline. Initial focus is on...
[16:01:45] istio-cni and calico-cni are installed on ml-serve1012
[16:01:56] is that from the kubelet?
[16:02:56] I see it also in describe pod ol
[16:02:57] *ok
[16:03:18] ml-serve1012 is the only one running trixie
[16:03:22] so I guess something is missing
[16:04:06] yeah it smells like a config issue, /usr/lib/cni is not there on ml-serve1011
[16:06:12] the /etc/default/kubelet seems good on 1012
[16:15:04] and /etc/cni/net.d/10-calico.conflist looks legit
[16:15:14] "cni_bin_dir": "/opt/cni/bin",
[16:17:21] hmmm
[16:18:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[16:18:49] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[16:18:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[16:21:07] the only diff that I can see in /etc/kubernetes/kubelet-config.yaml is the custom registerWithTaints that I added, in theory with this version is not picked up unless the kubelet is started the first time with that option (and I added it afterwards)
[16:21:23] lemme try to remove it manually to see if the kubelet is misbehaving for that
[16:21:27] unlikely but..
[16:21:48] as FYI `sudo disable-puppet "elukey - testing"`
[16:21:59] kk
[16:22:30] ok didn't really work as expected
[16:22:30] could the kubelet have been started after the change to calico.conflist ?
[16:22:37] so maybe just a restart
[16:22:49] tried but nothing
[16:23:22] it is weird that it uses /usr/lib/cni
[16:23:26] where does it come from?
[16:24:14] no idea been looking around but didn't find it
[16:25:20] even the process has correct args
[16:35:41] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11359529 (Eevans) >>! In T409414#11354310, @DPogorzelski-WMF wrote: > @Eevans i guess we can just start with a set of shared credentials and split later if needed These clusters are managed...
[16:40:21] hmmm seems like those errors were already present on oct24
[16:40:27] based on journalctl
[16:42:30] same kubelet version
[16:44:29] I killed the calico pod running on ml-serve1012, got restored and I restarted the kubelet. Same issue.
[16:44:41] the funny thing is that when you start it it even says `Nov 10 16:43:17 ml-serve1012 kubelet[440544]: Flag --cni-bin-dir has been deprecated, will be removed along with dockershim.`
[16:44:41] which means it sucked in the correct config
[16:46:29] ok I think I may have found what changed, containerd. We use the debian one: on bookworm we have 1.6.20~ds1-1+deb12u1 and on trixie 1.7.24
[16:46:50] right
[16:54:47] elukey@ml-serve1012:~$ containerd config default | grep /usr/lib
[16:54:47] bin_dir = "/usr/lib/cni"
[16:54:56] there you go
[16:56:25] | yep on 1011 we have
[16:56:28] [plugins."io.containerd.grpc.v1.cri".cni]
[16:56:28] bin_dir = "/opt/cni/bin"
[16:58:43] ok so the containerd toml template needs to have an extra setting for the bin_dir in trixie
[17:05:17] dpogorzelski: I made a change to the containerd toml, I think it may have fixed it
[17:06:05] Normal Created 2m25s kubelet Created container storage-initializer
[17:09:33] ok I see that now the storage initializer fails
[17:09:34] botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=llm%2Faya-expanse-8B%2F&encoding-type=url"
[17:10:38] nice, thx, :) i'll followup tomorrow
[18:11:07] dpogorzelski: opened https://gerrit.wikimedia.org/r/c/operations/puppet/+/1203500
[20:19:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[20:19:04] Deployment aya-llm-predictor-00002-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00002-deployment - ...
[20:19:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:24:45] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11360643 (Kgraessle) Open→Stalled @Samwalton9-WMF There's a few translations outstanding before we...
[22:43:47] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11361284 (Eevans)
[22:45:17] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11361288 (Eevans)