[00:26:16] (CR) Divec: [C:+1] "Looks good to me" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[07:37:48] good morning o/
[07:43:48] Good orning
[07:44:30] *good morning
[07:44:45] (CR) AikoChou: "I also added the plot for the first example to help explain the SHAP values at https://phabricator.wikimedia.org/P74618#300269" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[07:45:47] morning folks :)
[08:35:22] morning morning o/
[08:35:22] backport deployment done: https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_April_8
[08:35:22] the rrla prediction_change stream config has been deployed in WMF wikis.
[08:35:22] it can be seen as `mediawiki.page_revert_risk_prediction_change.v1` on both:
[08:35:22] https://meta.wikimedia.org/w/api.php?action=streamconfigs and
[08:35:22] https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs
[08:36:47] hurray!
[08:36:55] great work Kevin!
[08:38:32] \o/
[08:57:04] Morning!
[08:57:12] and yes, well done!
[09:51:40] Hey @klausman, we are deploying the following in edit-check with Aiko: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134989
[09:52:11] Roger!
[09:53:05] We got the following error during the staging deployment: Warning FailedScheduling 110s (x1 over 3m20s) default-scheduler 0/5 nodes are available: 1 Insufficient cpu, 2 Insufficient amd.com/gpu, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate. Therefore, our deployment was stuck in the pending stage.
[09:53:40] ah, that's probably because the GPUs are used by something else
[09:53:51] yeah seems all 3 gpus are used
[09:54:57] isaranto: maybe we could stop/remove some of the aya/bert predictor pods running in the experimental NS in staging?
[09:55:00] I think it is time we remove some deployments from the experimental ns.
[09:55:10] *high five*
[09:55:14] heh
[09:55:28] aya and aya-llm
[09:55:51] yeah I think we can remove the 2 aya deployments + the bert one + logo-detection
[09:55:57] on it
[10:00:09] (CR) Ilias Sarantopoulos: [C:+2] edit-check: ensure CORS headers allow all origins [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[10:06:25] Alright, cleanup done live. Patch to make it permanent here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134991
[10:09:00] ozge_: editcheck has scheduled now, enjoy! :)
[10:09:14] (Merged) jenkins-bot: edit-check: ensure CORS headers allow all origins [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[10:11:55] * klausman lunch
[10:22:30] Great! Thank you all. Looking into it.
[10:40:14] * isaranto lunch!
[12:54:03] Lift-Wing, Machine-Learning-Team, EditCheck: Local peacock model server doesn't send CORS headers allowing all origins - https://phabricator.wikimedia.org/T391229#10721314 (isarantopoulos) Open→Resolved a:isarantopoulos
[12:56:00] (CR) Kevin Bazira: "thank you for working on the updates Aiko. I ran into an error when I make a request with multiple samples where only the second sample se" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[13:49:49] hey folks, has anybody modified helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml on deploy1003 manually?
[13:50:38] I just reverted it, there is a timer that pulls the new commits every minute and it can't work with unstaged changes :)
[16:57:10] this was probably done by us earlier before making the change permanent. Sorry about that!
[16:57:25] going afk folks, have a nice evening o/
[16:57:43] elukey: if it's removal of services, my bad
[16:57:49] I thought it had been merged
[16:59:27] what were the changes you saw?
[18:39:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:39:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[18:39:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:39:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:39:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[22:39:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
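
Note on the FailedScheduling event quoted at 09:53:05: the scheduler summary packs one count per rejection reason into a single line, which is how the channel concluded that the GPUs were the blocker. The following is a minimal sketch of splitting that summary into per-reason counts. The function name `parse_failed_scheduling` is hypothetical, and the regular expressions are assumptions that cover only the reasons seen in that one message ("Insufficient ..." and "node(s) had taint {...}"), not every reason the scheduler can emit.

```python
import re

# Event text copied from the 09:53:05 message in the log above.
MESSAGE = (
    "0/5 nodes are available: 1 Insufficient cpu, "
    "2 Insufficient amd.com/gpu, "
    "2 node(s) had taint {node-role.kubernetes.io/control-plane: }, "
    "that the pod didn't tolerate."
)

# Assumption: only these two reason shapes are handled.
REASON_RE = re.compile(r"(\d+) (Insufficient [^,]+|node\(s\) had taint \{[^}]*\})")


def parse_failed_scheduling(message):
    """Return (available_nodes, total_nodes, {reason: node_count})."""
    header = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if header is None:
        raise ValueError("not a FailedScheduling summary")
    available = int(header.group(1))
    total = int(header.group(2))
    reasons = {}
    for count, reason in REASON_RE.findall(header.group(3)):
        reasons[reason.strip()] = int(count)
    return available, total, reasons


if __name__ == "__main__":
    available, total, reasons = parse_failed_scheduling(MESSAGE)
    print(f"{available}/{total} nodes available")
    for reason, count in reasons.items():
        print(f"{count} node(s): {reason}")
```

For this message the per-reason counts add up to the 5 nodes in the cluster, and the two nodes rejected for `amd.com/gpu` are the ones that freeing the experimental-namespace pods was meant to unblock.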