[00:26:16] <wikibugs>	 (CR) Divec: [C:+1] "Looks good to me" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[07:37:48] <isaranto>	 good morning o/
[07:43:48] <ozge_>	 Good  orning
[07:44:30] <ozge_>	 *good morning
[07:44:45] <wikibugs>	 (CR) AikoChou: "I also added the plot for the first example to help explain the SHAP values at https://phabricator.wikimedia.org/P74618#300269" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[07:45:47] <aiko>	 morning folks :)
[08:35:22] <kevinbazira>	 morning morning o/
[08:35:22] <kevinbazira>	 backport deployment done: https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_April_8
[08:35:22] <kevinbazira>	 the rrla prediction_change stream config has been deployed in WMF wikis.
[08:35:22] <kevinbazira>	 it can be seen as `mediawiki.page_revert_risk_prediction_change.v1` on both: 
[08:35:22] <kevinbazira>	 https://meta.wikimedia.org/w/api.php?action=streamconfigs and
[08:35:22] <kevinbazira>	 https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs 
[08:36:47] <isaranto>	 hurray!
[08:36:55] <isaranto>	 great work Kevin!
[08:38:32] <kevinbazira>	 \o/
[08:57:04] <klausman>	 Morning!
[08:57:12] <klausman>	 and yes, well done!
[09:51:40] <ozge_>	 Hey @klausman, we are deploying the following in edit-check with Aiko: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134989
[09:52:11] <klausman>	 Roger!
[09:53:05] <ozge_>	 We got the following error during the staging deployment:   Warning  FailedScheduling  110s (x1 over 3m20s)  default-scheduler  0/5 nodes are available: 1 Insufficient cpu, 2 Insufficient amd.com/gpu, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate. Therefore, our deployment was stuck in the pending stage.
[09:53:40] <klausman>	 ah, that's probably because the GPUs are used by something else
[09:53:51] <aiko>	 yeah seems all 3 gpus are used
[09:54:57] <klausman>	 isaranto: maybe we could stop/remove some of the aya/bert predictor pods running in the experimental NS in staging?
[09:55:00] <isaranto>	 I think it is time we remove some deployments from the experimental ns. 
[09:55:10] <klausman>	 *high five*
[09:55:14] <isaranto>	 heh
[09:55:28] <aiko>	  aya and aya-llm
[09:55:51] <isaranto>	 yeah I think we can remove the 2 aya deployments + the bert one + logo-detection
[09:55:57] <klausman>	 on it
[10:00:09] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+2] edit-check: ensure CORS headers allow all origins [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[10:06:25] <klausman>	 Alright, cleanup done live. Patch to make it permanent here: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1134991
[10:09:00] <klausman>	 ozge_: editcheck has scheduled now, enjoy! :)
[10:09:14] <wikibugs>	 (Merged) jenkins-bot: edit-check: ensure CORS headers allow all origins [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1134401 (https://phabricator.wikimedia.org/T391229) (owner: Divec)
[10:11:55] * klausman lunch
[10:22:30] <ozge_>	 Great! Thank you all. Looking into it.
[10:40:14] * isaranto lunch!
[12:54:03] <wikibugs>	 Lift-Wing, Machine-Learning-Team, EditCheck: Local peacock model server doesn't send CORS headers allowing all origins - https://phabricator.wikimedia.org/T391229#10721314 (isarantopoulos) Open→Resolved a:isarantopoulos
[12:56:00] <wikibugs>	 (CR) Kevin Bazira: "thank you for working on the updates Aiko. I ran into an error when I make a request with multiple samples where only the second sample se" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[13:49:49] <elukey>	 hey folks, has anybody modified helmfile.d/ml-services/experimental/values-ml-staging-codfw.yaml on deploy1003 manually?
[13:50:38] <elukey>	 I just reverted it, there is a timer that pulls the new commits every minute and it can't work with unstaged changes :)
[16:57:10] <isaranto>	 this was probably done by us earlier before making the change permanent. Sorry about that!
[16:57:25] <isaranto>	 going afk folks, have a nice evening o/
[16:57:43] <klausman>	 elukey: if it's removal of services, my bad
[16:57:49] <klausman>	 I thought it had been merged
[16:59:27] <klausman>	 what were the changes you saw?
[18:39:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:39:49] <jinxer-wm>	 Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[18:39:49] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:39:49] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:39:49] <jinxer-wm>	 Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[22:39:49] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas