[07:40:03] good morning o/
[08:10:10] going to deploy the seccompProfile change to ml-staging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1117939
[08:17:23] Good morning
[08:22:22] \o
[08:41:53] o/
[08:42:32] georgekyz: I saw Daniel has responded on the blubber MR. do you have all you need or do you need any help?
[08:43:35] isaranto: Yeap he did his first review on Friday and now the last part. I think I am ok, if I need anything I will ping you
[08:43:43] ack
[08:44:05] if I can help that is :P
[08:44:20] but if you need a brain to pick or a debugging duck I'm here
[08:54:19] isaranto: thnx much appreciated!
[09:20:52] 06Machine-Learning-Team, 07Kubernetes, 13Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10556220 (10isarantopoulos) I have deployed the above change to all the services in ml-staging-codfw. The following was successfully added t...
[09:50:06] (03CR) 10Thiemo Kreuz (WMDE): Replace isset() with null checks on global (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1119829 (owner: 10Umherirrender)
[10:53:02] Morning!
[10:53:52] morning Tobias!
[11:47:37] isaranto: o/
[11:47:53] thanks for the ml-staging update! It all seems good now, I am going to apply again https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1120156
[11:48:22] that basically enforces the "restricted" label/config for most of the namespaces
[11:52:06] \o ok, thanks!
[12:00:00] deployed, all good this time!
[12:00:14] I am going to summarize staging's status for everybody:
[12:00:50] - We are trying to move away from the Pod Security Policy (PSP) configs because they will be removed in the new k8s version, in favor of Pod Security Standards (PSS).
[12:01:49] - PSS offers 3 profiles that correspond to various "classes" of security restrictions. For example, most of our kserve workloads are in the "restricted" profile.
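(For context on the "restricted" profile and the namespace enforcement label discussed above, here is a rough sketch of what they look like in manifests. This is illustrative only, assuming upstream Pod Security Standards conventions; the namespace, pod, and image names are made up and are not the actual deployment-charts values.)

```yaml
# Hypothetical namespace opting in to "restricted" enforcement
# via the Pod Security Admission label.
apiVersion: v1
kind: Namespace
metadata:
  name: example-isvc-ns
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# Hypothetical pod satisfying the "restricted" profile: seccomp
# RuntimeDefault, non-root, no privilege escalation, all caps dropped.
apiVersion: v1
kind: Pod
metadata:
  name: example-isvc-pod
  namespace: example-isvc-ns
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
  containers:
    - name: kserve-container
      image: example/image:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```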
[12:03:02] - The migration is a little complicated in our case, since a kserve pod is essentially composed of multiple containers (from various layers: istio, knative, etc.) and all of them need to have the same security restrictions applied (for example, the seccomp profile).
[12:03:34] - Why didn't we need it before? Our PSP config auto-injected those settings when the pod was created; with PSS we can't do that anymore, so we need to be explicit.
[12:04:16] - So, a kserve pod: 2 istio containers, knative queue, kserve-inference, storage-initializer.
[12:05:13] - ml-staging-codfw is running with a patched knative-serving control plane that automatically injects "restricted" settings when pods are created, and we use some defaults for seccomp at the pod level as well (for example, for the istio containers).
[12:06:06] Lemme know if the above is not clear or missing anything..
[12:06:34] the idea is to let it soak in staging for a bit, you do deployments etc.. and verify that all is stable
[12:06:44] then when we are confident we move to prod
[12:07:02] to complete the staging migration I'd need to merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1115323 as well
[12:07:11] to disable PSP, but it should be fine
[12:08:37] what could go wrong, or things to look for:
[12:09:03] - failed deployments and/or "events" reporting issues with security restrictions (like you deploy and the pods don't come up)
[14:56:14] klausman: o/ could you help with docker on ml-lab? it seems that there is no room for containers. I tried pruning (docker container prune) but it is somehow stuck. I can't even run docker system df now (it says it is already running)
[14:56:51] I got the following msg when starting or building containers
[14:56:51] ```
[14:56:51] docker: Error response from daemon: devmapper: Thin Pool has 0 free data blocks which is less than minimum required 163840 free data blocks.
Create more free space in thin pool or use dm.min_free_space option to change behavior.
[14:56:51] See 'docker run --help'
[14:56:51] ```
[14:58:53] I was getting the same when trying to build the huge rocm image https://phabricator.wikimedia.org/P73482#294603
[14:59:08] tried to fix it but probably made things worse :(
[15:17:06] Having a look
[15:18:20] isaranto: ahem, docker on ml-lab? :D
[15:18:31] yes, we made an executive decision
[15:18:49] can you elaborate a bit more? :)
[15:19:11] these kinds of use cases should be vetted by SRE, or maybe the k8s sig
[15:19:22] We need to try and build rocm images that aren't just upstream monsters. None of our private machines (laptops) are suitable
[15:19:59] Since the lab machines have no production traffic (or services), and to unblock Ilias' work, I decided to install docker on ml-lab1002 and add him to the docker group
[15:21:10] It's by no means meant to be a permanent solution or state of affairs.
[15:21:25] sure, but it is still production. I am not saying it was a bad decision, but it is better to inform other teams for security reasons
[15:22:10] even if you are able to build an image there, you'll need to push it to the registry, and atm we don't allow anything other than gitlab and the build nodes
[15:22:41] elukey: we needed a place to evaluate the docker images from amd upstream so that we can tackle https://phabricator.wikimedia.org/T385173
[15:22:56] [5952365.272923] XFS (dm-4): Failing async write on buffer block 0x9c204e0. Retrying async write.
[15:23:05] this machine has a problem beyond a full disk
[15:23:11] I mean that we needed to validate that there is value there
[15:23:22] isaranto: I recall the task, I am just saying please follow up with SRE before testing things like these, ml-lab nodes are still prod hosts
[15:25:14] ok, you are right. I should have, and will definitely do so in the future
[15:27:54] I'll have to reboot the machine.
Docker has wedged something on a DM device, but I don't think it's a hardware problem. It still makes everything unresponsive, however
[15:45:38] ack, thanks!
[16:01:29] isaranto: okay, things should be back in working order. I gave the sandbox size another bump
[16:01:53] thanks, will give it a try again later or tomorrow morning
[16:02:02] ack
[16:03:27] actually I'll try to rebuild the image I tried in the first place. Hope it doesn't break it again
[16:07:46] 06Machine-Learning-Team: [LLM] ML-lab benchmarking - https://phabricator.wikimedia.org/T382343#10557403 (10gkyziridis) **Performance Benchmark** **Aya-expanse-32B** quantization via **GPTQModel**. It is easily observable that the latency of the quantized model is almost half that of the pre-trained one. {F...
[16:16:16] 06Machine-Learning-Team: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645 (10achou) 03NEW
[16:28:12] * isaranto afk bbl
[17:58:47] new ROCm blog https://rocm.blogs.amd.com/artificial-intelligence/k8s-orchestration-part2/README.html
[17:59:10] now officially going afk for the evening o/
[19:50:33] (03PS2) 10Umherirrender: Replace isset() with null coalesce on global [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1119829
[19:51:37] (03CR) 10Umherirrender: Replace isset() with null coalesce on global (031 comment) [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1119829 (owner: 10Umherirrender)
[20:57:30] 10Lift-Wing, 06Machine-Learning-Team, 07OKR-Work, 13Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10557853 (10dcausse) @Isaac thanks! I started the backfill at 60 articles/sec (on a per-wiki basis, from smallest to biggest). At...
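(For reference on the devmapper "Thin Pool has 0 free data blocks" error pasted earlier: the dm.min_free_space knob the error message mentions is a devicemapper storage-driver option in the Docker daemon config. The fragment below is a sketch only, assuming the devicemapper storage driver is in use; the thin-pool device path is hypothetical and not necessarily what ml-lab1002 runs.)

```json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.min_free_space=5%"
  ]
}
```

Lowering dm.min_free_space only changes when the daemon refuses new writes; the pool still needs actual free blocks, e.g. from pruning unused images and containers or growing the underlying volume.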