[07:47:33] morning!
[08:23:12] good morning :)
[08:26:09] \o
[08:29:03] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10681663 (kevinbazira) A model-server that supports publishing revert-risk language-agnostic (RRLA) scores int...
[08:29:14] o/ the rrla model-server can now produce scores to the event stream on LW staging --^
[08:47:51] \o/
[08:48:13] morning folks
[09:03:02] isaranto: o/ as I was looking at https://gerrit.wikimedia.org/r/1131383
[09:03:02] I noticed the edit-check model doesn't exist in the public repo: https://analytics.wikimedia.org/published/wmf-ml-models/
[09:03:02] but it exists on swift:
[09:03:02] ```
[09:03:02] $ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/edit-check/peacock/
[09:03:02] ```
[09:03:02] should I push it to the public repo, as this would enable us to test this docker-compose patch?
[09:31:30] hey folks
[09:31:38] going to migrate ml-serve-ctrl2002 to containerd: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131649
[09:46:48] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[09:48:24] (PS2) Ilias Sarantopoulos: locust: change time between requests for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817)
[09:48:47] (CR) Ilias Sarantopoulos: [V:+2 C:+2] locust: change time between requests for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
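(The patch merged above only says "change time between requests"; for context, that knob in locust is the user class's wait_time. A minimal sketch follows; the wait range, host, endpoint path, and payload are hypothetical illustrations, not the values from the actual change.)

```python
# Minimal locust sketch: the pause between consecutive requests is set
# via wait_time. All concrete values below are hypothetical.
from locust import HttpUser, task, between


class EditCheckUser(HttpUser):
    host = "http://localhost:8080"  # hypothetical local model-server address
    # Each simulated user waits 1-3 seconds between requests (assumed range).
    wait_time = between(1, 3)

    @task
    def predict(self):
        # Hypothetical endpoint and payload for the edit-check model server.
        self.client.post(
            "/v1/models/edit-check:predict",
            json={"instances": [{"text": "This truly magnificent article."}]},
        )
```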
[09:49:52] o/
[09:50:54] kevinbazira: I'm fine with that, but I want to ask aiko if that is ok. Shall we publish it, aiko?
[09:51:58] klausman: let me know when you deploy the new limitranges to the revision-models ns. I have merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[09:52:15] elukey: ack! ty
[09:57:14] okok... will wait for Aiko's take.
[09:57:34] o/ I think that's fine
[09:59:24] and Editing would need it if they want to test it locally
[10:04:58] isaranto: will be in a couple of minutes, I need to reboot my work machine
[10:05:31] thanks, no hurry, just wanted to sync
[10:11:58] isaranto: mh, do we really want to increase the limits for staging as well? I only now realized that the change covers all clusters.
[10:13:43] hmm. tbh I would want to increase the limits for the experimental ns since we are testing things there. If it stresses staging too much, we don't need the change in the revision-models ns
[10:15:48] ack, will push as-is
[10:18:38] staging done
[10:20:26] ml-serve-ctrl2002 done
[10:20:49] So all of ml-serve-codfw is done, right?
[10:21:13] well, etcd is still Bullseye, but everything else is Bookworm+containerd.
[10:21:20] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10682018 (elukey)
[10:21:33] ml-serve-codfw has the new limits
[10:24:15] and eqiad done as well
[10:28:44] thanks! I'll deploy right away
[10:40:15] klausman: yep, all done
[10:40:35] I think that we can do the ctrl nodes in eqiad without any special handling, they are quick and easy
[10:40:46] Aye, agreed
[10:41:14] vlan-move needing pybal restarts is a bit of a bummer
[10:41:38] But for VMs we obvs don't need that
[11:07:36] hmm, the new revision is not schedulable in ml-serve-eqiad & ml-staging-codfw (all well in ml-serve-codfw)
[11:08:04] `0/13 nodes are available: 11 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.`
[11:10:30] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10682228 (achou) We need to decide on the name for this stream. I think `mediawiki.revision_revertrisk_languag...
[11:11:25] Insufficient CPU points at the namespace needing a CPU allowance bump?
[11:12:26] mmm it smells like the k8s scheduler not finding any host with the capacity to run the pod
[11:12:32] isaranto: how big is it?
[11:12:55] 34 cpus
[11:13:14] lol
[11:13:22] jumbo pod
[11:14:04] it is the experiment of increasing cpu cores / decreasing # of replicas
[11:18:44] yeah, maybe atm we don't have a good target node
[11:20:14] * isaranto commuting back home, will rejoin in ~45'
[11:21:36] weird though, there should be space
[11:21:51] klausman: maybe it is something also related to the namespace, no idea atm
[11:22:22] if it was related to the ns I'd have expected the scheduler not to complain, and some errors in `get events` related to quotas etc.
[11:22:32] mh, good point
[11:23:44] what is the change?
[11:24:30] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[11:24:30] this is ml-serve2001
[11:24:33] Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.)
             Resource  Requests       Limits
             --------  --------       ------
             cpu       50450m (76%)   93 (141%)
             memory    56414Mi (47%)  75742Mi (63%)
[11:24:40] uff, horrible paste
[11:24:58] https://phabricator.wikimedia.org/P74456
[11:25:24] so I am wondering if the ml-serve clusters are a bit loaded right now
[11:26:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:26:49] Deployment reference-need-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00008-deployment - ...
[11:26:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:26:50] even though from https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s-mlserve&viewPanel=3 it doesn't seem so
[11:27:07] gtg, will check later if needed!
[11:30:02] (CR) Kevin Bazira: [C:+1] "I've tested this patch and it LGTM: https://phabricator.wikimedia.org/P74457" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[11:30:19] aiko: o/ the edit-check model is now available on the public repo: https://analytics.wikimedia.org/published/wmf-ml-models/edit-check
[11:30:19] isaranto: I've +1'd the docker-compose patch. It LGTM: https://phabricator.wikimedia.org/P74457
[11:30:41] Yeah, there is no node that can satisfy 34 CPUs
[11:30:49] cf.
[11:30:51] https://thanos.wikimedia.org/graph?g0.expr=(sum%20by%20(node%2C%20resource)%20(kube_node_status_allocatable%7Bsite%3D%22eqiad%22%2C%20prometheus%3D%22k8s-mlserve%22%2C%20resource%3D%22cpu%22%7D))%20-%20(sum%20by%20(node%2C%20resource)%20(kube_pod_container_resource_requests%7B%7D))&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[11:31:29] Biggest available chunk is 28.7, and most are 25 or lower
[11:32:25] We could drain one host and compact the fragmented usage, but it would decidedly not be a permanent solution
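(For readability, the PromQL inside the thanos link above decodes to per-node allocatable CPU minus per-node requested CPU, i.e. the largest schedulable "chunk" left on each node. A minimal Python sketch that runs the same query follows; the /api/v1/query path is the standard Prometheus HTTP API and is an assumption here, not something shown in the log.)

```python
# Decoded form of the query from the thanos graph link above: free
# (unrequested) CPU per node. A 34-CPU pod fits only if some value >= 34.
import requests

QUERY = (
    '(sum by (node, resource) (kube_node_status_allocatable{'
    'site="eqiad", prometheus="k8s-mlserve", resource="cpu"}))'
    ' - '
    '(sum by (node, resource) (kube_pod_container_resource_requests{}))'
)

# Assumption: Thanos exposes the standard Prometheus query endpoint here.
resp = requests.get(
    "https://thanos.wikimedia.org/api/v1/query",
    params={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    node = result["metric"].get("node", "?")
    free_cpu = float(result["value"][1])
    print(f"{node}: {free_cpu:.1f} CPUs unrequested")
```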
[11:42:48] thanks Kevin!
[12:05:05] np! :)
[12:31:26] back
[12:32:31] hmm, so we have to rethink the strategy for reference-need
[12:32:48] thank you Kevin for pushing the model and for the review!
[12:36:47] (PS2) Ilias Sarantopoulos: edit-check: add docker compose file for local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100)
[12:37:35] (CR) Ilias Sarantopoulos: "thanks for catching this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:41:44] I reduced the limits/requests for the ref-need deployment: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131697
[12:42:29] hopefully this will be scheduled
[12:43:54] LGTM!
[12:50:20] (CR) Kevin Bazira: [C:+1] "great! thank you for adding the dummy path. it works like a charm!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:51:21] (CR) Ilias Sarantopoulos: [V:+2 C:+2] edit-check: add docker compose file for local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:58:37] ok, the latest revision was successfully deployed
[13:02:04] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:02:04] Deployment reference-need-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00008-deployment - ...
[13:02:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:33:22] klausman: nice!
[14:34:23] I've also made a copy of the Grafana dashboard for Kube resources (should be easy to find), which includes a right-y-axis graph of the _percentage_ of remaining CPU/Mem resources on a cluster
[14:35:48] you can also add a new/separate graph/panel, instead of forking
[14:35:56] so it will be available to everybody :)
[14:36:22] The existing dashboard is already pretty dense, so I made a copy to see if what I add is actually useful
[14:38:53] good morning all
[15:22:49] (PS6) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[15:39:00] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10683626 (gkyziridis) **Kserve Batcher for Edit-check** - Update Pydantic schema for input request to accept "instances" list. - Add validations...
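(A minimal sketch of what the "Pydantic schema for input request to accept 'instances' list" described in the T386100 comment above could look like. The inner field name and the validation rule are hypothetical, not taken from the actual patch.)

```python
# Hypothetical request schema for the edit-check batcher: the top-level
# "instances" list is from the task comment; everything else is assumed.
from pydantic import BaseModel, field_validator


class EditCheckInstance(BaseModel):
    text: str  # hypothetical field name for the text to classify


class EditCheckRequest(BaseModel):
    instances: list[EditCheckInstance]

    @field_validator("instances")
    @classmethod
    def non_empty(cls, v: list[EditCheckInstance]) -> list[EditCheckInstance]:
        # Reject empty batches so the model server never runs a no-op predict.
        if not v:
            raise ValueError("instances must contain at least one item")
        return v
```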
[17:04:56] logging off folks, have a nice evening o/
[19:18:05] Machine-Learning-Team, EditCheck: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10684503 (achou) When we use SHAP values to explain our peacock detection model, for each instance, the SHAP explainer returns a tuple with 3 elements: `values` (SHAP val...
[23:56:33] Machine-Learning-Team, Add-Link, Growth-Team: Make airflow-dag for addalink training pipeline output compatible with deployed model - https://phabricator.wikimedia.org/T388258#10685279 (leila)
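(Returning to the SHAP note in T387984 above: a minimal sketch of reading the per-instance `values` / `base_values` / `data` fields of a shap Explanation for a text classifier, to highlight high-contribution tokens. The model checkpoint, class index, and threshold below are hypothetical stand-ins, not the actual peacock model.)

```python
# Token-level SHAP values for a text classifier, highlighting tokens that
# push the prediction toward an assumed "peacock" class.
import shap
from transformers import pipeline

# Hypothetical public checkpoint; the real peacock model lives on Lift Wing.
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for all classes, as shap expects
)

explainer = shap.Explainer(clf)
explanation = explainer(["This truly magnificent, world-renowned article"])

inst = explanation[0]       # Explanation for the first (only) instance
tokens = inst.data          # the tokenized input segments
scores = inst.values[:, 1]  # SHAP values toward class index 1 (assumed positive/"peacock")

# Flag tokens whose contribution exceeds an arbitrary threshold.
for token, score in zip(tokens, scores):
    marker = " <-- contributes to peacock" if score > 0.05 else ""
    print(f"{token!r}: {score:+.3f}{marker}")
```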