[07:22:43] Good morning.
[07:33:24] morning!
[08:06:52] Machine-Learning-Team, Goal: Q2 FY2025-26 Goal: Deploy Add-a-link v2 models to production - https://phabricator.wikimedia.org/T408790#11343521 (OKarakaya-WMF)
[08:18:22] morning
[08:27:46] do we have miro or other similar products ?
[08:37:36] we do have miro, but we need to ask for a license via OIT IIRC
[08:38:28] dpogorzelski: are the root credentials working fine etc..?
[08:38:51] yep, just poking around the new gpu node :)
[08:39:02] kubctl etc also works from deployment
[08:39:17] so i should be good
[08:39:49] perfect :)
[08:45:00] what's the k8s env where ml-serve1012 was added?
[08:47:45] basically what is the kube_env namespace/env combo to poke the right cluster :)
[08:50:42] ml-serve-eqiad
[08:51:48] you can use `kube-env admin ml-serve-eqiad` from root on deploy2002 to have a broader access (careful, it grants you access to all namespaces etc..)
[08:55:53] so kube-env can only load admin via sudo but sudo doesn't have kube-env.sh loaded via /etc/profile.d
[08:56:04] what am i missing ?
[09:01:54] you need to sudo -i before
[09:02:07] in a root session, you can use kube-env admin
[09:02:29] this is what I usually d
[09:02:34] *do
[09:07:39] I'm seeing that some of our staging deployments are setting `monitoring.enabled: false` in their values file and some are keeping it true. If set to true, it adds `prometheus.io/scrape: true` annotation in pods/inferenceservices
[09:07:54] However, I also see that we use `prometheus.kserve.io/scrape` annotation in our services. Some staging deployments have `monitoring.enabled: false`, but do inherit the `prometheus.kserve.io/scrape: true` from the production values file and I can see their metrics in Grafana :D
[09:08:06] So I’m wondering - do we actually utilize the `prometheus.io/...` annotations?
[09:17:55] aiko: does the llm image have a chart I could use?
[09:19:38] aha i see one
[09:20:35] somethig weird gets into the root session after sudo -i since my backspace becomes space :)
[09:20:50] (ghostty/zsh)
[09:23:31] Machine-Learning-Team, Discovery-Search (2025.10.20 - 2025.11.07): Initial task generation and ingestion to Cassandra and Search weight tags - https://phabricator.wikimedia.org/T408533#11343742 (achou) **Update** I've collected articles in English (en), French (fr), Arabic (ar), and Japanese (ja), then...
[09:42:47] bartosz: In theory they are used, have you checked if when we use `monitoring.enabled: false` we do see metrics from kserve?
[09:42:58] terminfo gets lost after sudo -i, that's the reason. in general i'm not a super fan of shared admin jump hosts. having the capability to act independently on the "owned" (sub)domain of resources would remove some of the unnecessary friction imo. a PKI via Vault that could dispatch on demand, short lived credentials, like a signed cert for k8s, based on the user's identity would be perfect
[09:44:14] dpogorzelski: definitely, but we haven't done it so far. The root admin kube-env is used only by SREs when needed, otherwise you can use the per-user kube-env that is not shared
[09:44:39] we have plans to add Vault in the future but at the moment we need to balance our needs vs the complexity that it will bring
[09:45:15] 👍
[09:45:20] to summarize - I didn't mean that you need to always use kube-env admin, it was just a suggestion for when you need to debug cluster-level things
[09:46:26] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11343856 (achou) @dcausse Thanks a lot! I found it was also missing $schema. (...
[09:48:04] yep :)
[09:54:32] elukey: Hmm it seems to me that we see (or don't see :D) the same number of kserve metrics on staging, regardless of `monitoring.enabled: true/false`. I've checked the Kserve dashboard (https://grafana.wikimedia.org/goto/rP-WTczDR?orgId=1) and Kserve Inference Services dashboard (https://grafana.wikimedia.org/goto/dIinTckvR?orgId=1)
[09:59:42] what is the stat1004 host?
[10:05:46] bartosz: then it is probably not being used by the inference-services chart, so we can probably remove it. I can try to check later on to double check, but you noticed a diff right?
[10:06:05] dpogorzelski: old host that has been decommed, it was part of the stat10xx series
[10:06:17] kk
[10:12:45] aiko: when `❯ pip install -r src/models/llm/requirements.txt` i get `ERROR: bitsandbytes-1.0.0-py3-none-manylinux_2_24_x86_64.whl is not a supported wheel on this platform.` how do you workaround this on mac?
[10:13:16] elukey: thank you, I'm happy to double check later as well, it's not too urgent. I don't think I noticed any diff, I stumbled onto it as I'm setting up a new inference service on staging and was wondering what should I set in the `monitoring.enabled` value. So I started investigating it..
[10:22:59] dpogorzelski: I think we didn't test it on mac. bitsandbytes is for llm quantization, we were testing it on ml-lab
[10:28:38] (CR) Gkyziridis: [C:+1] "LGTM! THNX!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1201558 (https://phabricator.wikimedia.org/T406179) (owner: Kevin Bazira)