[04:40:37] o/
[05:29:59] Machine-Learning-Team, Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10347209 (kevinbazira)
[05:51:15] (PS1) Kevin Bazira: test: update article-descriptions test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094168 (https://phabricator.wikimedia.org/T360120)
[06:58:05] Guten tag!
[07:51:00] Lift-Wing, Machine-Learning-Team: Log and export preprocess size in inference services as a prometheus metric - https://phabricator.wikimedia.org/T374034#10347300 (isarantopoulos) We will have to add new vizualizations to the kserve inference services dashboard. I'd add 1 more row with 6 graphs (3 for e...
[08:20:17] (PS1) Kevin Bazira: test: update articlequality test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094286 (https://phabricator.wikimedia.org/T360120)
[08:26:17] (CR) Ilias Sarantopoulos: "Thanks for working on this! I have a question wether we can simplify the blubber file" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092759 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[08:32:08] (CR) Ilias Sarantopoulos: [C:+1] "Nice, thanks!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094168 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[08:41:17] (CR) Kevin Bazira: [C:+2] test: update article-descriptions test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094168 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[08:41:59] (Merged) jenkins-bot: test: update article-descriptions test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094168 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[09:35:13] (CR) Kevin Bazira: test: update outlink transformer test image to support latest ci tests (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092759 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[10:11:40] isaranto: q - where is the config for visualizations for the kserve isvc dashboard? which repo?
[10:13:44] Hmm I don't remember if these are configured in any repo (I may be wrong though) only the SLO dashboards are
[10:13:58] I'm afk now will be back in 45' and check again
[10:25:36] but where you saw the PromQL query from?
[10:26:43] ok thanks :)
[10:30:18] On grafana if you sign in you can edit each chart
[10:30:34] Will ping you again in a bit if you need help
[11:01:27] ohh I see it
[11:27:29] aiko: do you need help? I'm available
[11:42:07] isaranto: so far no. I'll ping u when I need help!
[11:42:15] ack!
[12:41:11] * aiko afk ~1h
[14:16:46] Machine-Learning-Team: Give Mikhail access to ml-labs - https://phabricator.wikimedia.org/T380593 (isarantopoulos) NEW
[14:53:25] Machine-Learning-Team, Patch-For-Review: Give Mikhail access to ml-labs - https://phabricator.wikimedia.org/T380593#10348353 (mpopov)
[14:54:42] Machine-Learning-Team, Patch-For-Review: Give Mikhail access to ml-labs - https://phabricator.wikimedia.org/T380593#10348375 (mpopov) Thank you!
[14:56:34] Machine-Learning-Team, Patch-For-Review: Give Mikhail access to ml-labs - https://phabricator.wikimedia.org/T380593#10348354 (klausman) Open→Resolved Merged and Mikhail has confirmed he can log in.
[15:03:53] https://grafana-rw.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?forceLogin&from=now-1h&orgId=1&to=now&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-component=All&var-model_name=articlequality&var-namespace=article-models
[15:04:25] ---^ added new graphs for data size metrics!
[15:05:13] Neat!
[15:19:32] Lift-Wing, Machine-Learning-Team: Log and export preprocess size in inference services as a prometheus metric - https://phabricator.wikimedia.org/T374034#10348539 (achou) A new row "Data size" with 6 graphs has been added to the kserve inference services dashboard. https://grafana-rw.wikimedia.org/d/n3LJ...
[15:21:46] that seems nice aiko , nice work!
[15:23:43] a short description that explains the metric in the `Description` field would be awesome (and would avoid future confusion +questions)
[15:58:26] sounds good, I added description
[16:09:58] 🙌
[16:10:42] looks nice!
[17:23:12] Machine-Learning-Team: Test the feasibility of deployment of Aya-23 model in LiftWing - https://phabricator.wikimedia.org/T379052#10349254 (isarantopoulos) Just pasting an update. I've loaded the 32B on ml-lab using accelerate and it used ~54GB GPU VRAM and only 5-7GB CPU memory. Previous attempts to "just l...
[17:23:27] going afk for the weekend folks. take care!
[20:39:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[20:39:49] Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=recommendation-api-ng&var-deployment=recommendation-api-ng-main - ...
[20:39:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:44:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[20:44:49] Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s-mlserve&var-namespace=recommendation-api-ng&var-deployment=recommendation-api-ng-main - ...
[20:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:15:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[21:15:49] Deployment recommendation-api-ng-main in recommendation-api-ng at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=recommendation-api-ng&var-deployment=recommendation-api-ng-main - ...
[21:15:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:45:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:55:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas