[07:47:33] morning!
[08:23:12] good morning :)
[08:26:09] \o
[08:29:03] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10681663 (kevinbazira) A model-server that supports publishing revert-risk language-agnostic (RRLA) scores int...
[08:29:14] o/ the rrla model-server can now produce scores to the event stream on LW staging --^
[08:47:51] \o/
[08:48:13] morning folks
[09:03:02] isaranto: o/ as I was looking at https://gerrit.wikimedia.org/r/1131383
[09:03:02] I noticed the edit-check model doesn't exist in the public repo: https://analytics.wikimedia.org/published/wmf-ml-models/
[09:03:02] but it exists on swift:
[09:03:02] ```
[09:03:02] $ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/edit-check/peacock/
[09:03:02] ```
[09:03:02] should I push it to the public repo, as this would enable us to test this docker-compose patch?
[09:31:30] hey folks
[09:31:38] going to migrate ml-serve-ctrl2002 to containerd: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131649
[09:46:48] (CR) Gkyziridis: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[09:48:24] (PS2) Ilias Sarantopoulos: locust: change time between requests for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817)
[09:48:47] (CR) Ilias Sarantopoulos: [V:+2 C:+2] locust: change time between requests for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
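(The patch merged above only says "change time between requests"; for context, that knob in locust is the user class's wait_time. A minimal sketch follows; the wait range, host, endpoint path, and payload are hypothetical illustrations, not the values from the actual change.)

```python
# Minimal locust sketch: the pause between consecutive requests is set
# via wait_time. All concrete values below are hypothetical.
from locust import HttpUser, task, between


class EditCheckUser(HttpUser):
    host = "http://localhost:8080"  # hypothetical local model-server address
    # Each simulated user waits 1-3 seconds between requests (assumed range).
    wait_time = between(1, 3)

    @task
    def predict(self):
        # Hypothetical endpoint and payload for the edit-check model server.
        self.client.post(
            "/v1/models/edit-check:predict",
            json={"instances": [{"text": "This truly magnificent article."}]},
        )
```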
[09:49:52] o/
[09:50:54] kevinbazira: I'm fine with that, but I want to ask aiko if that is ok. Shall we publish it, aiko?
[09:51:58] klausman: let me know when you deploy the new limitranges to the revision-models ns. I have merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[09:52:15] elukey: ack! ty
[09:57:14] okok... will wait for Aiko's take.
[09:57:34] o/ I think that's fine
[09:59:24] and Editing would need it if they want to test it locally
[10:04:58] isaranto: will be in a couple of minutes, I need to reboot my work machine
[10:05:31] thanks, no hurry, just wanted to sync
[10:11:58] isaranto: mh, do we really want to increase the limits for staging as well? I only now realized that the change covers all clusters.
[10:13:43] hmm. tbh I would want to increase the limits for the experimental ns since we are testing things there. If it stresses staging too much, we don't need the change in the revision-models ns
[10:15:48] ack, will push as-is
[10:18:38] staging done
[10:20:26] ml-serve-ctrl2002 done
[10:20:49] So all of ml-serve-codfw is done, right?
[10:21:13] well, etcd is still Bullseye, but everything else is Bookworm+containerd.
[10:21:20] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10682018 (elukey)
[10:21:33] ml-serve-codfw has the new limits
[10:24:15] and eqiad done as well
[10:28:44] thanks! I'll deploy right away
[10:40:15] klausman: yep, all done
[10:40:35] I think that we can do the ctrl nodes in eqiad without any special handling, they are quick and easy
[10:40:46] Aye, agreed
[10:41:14] vlan-move needing pybal restarts is a bit of a bummer
[10:41:38] But for VMs we obvs don't need that
[11:07:36] hmm, the new revision is not schedulable in ml-serve-eqiad & ml-staging-codfw (all well in ml-serve-codfw)
[11:08:04] `0/13 nodes are available: 11 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.`
[11:10:30] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10682228 (achou) We need to decide on the name for this stream. I think `mediawiki.revision_revertrisk_languag...
[11:11:25] Insufficient CPU points at the namespace needing a CPU allowance bump?
[11:12:26] mmm it smells like the k8s scheduler not finding any host with the capacity to run the pod
[11:12:32] isaranto: how big is it?
[11:12:55] 34 cpus
[11:13:14] lol
[11:13:22] jumbo pod
[11:14:04] it is the experiment of increasing cpu cores / decreasing # of replicas
[11:18:44] yeah, maybe atm we don't have a good target node
[11:20:14] * isaranto commuting back home, will rejoin in ~45'
[11:21:36] weird though, there should be space
[11:21:51] klausman: maybe it is something also related to the namespace, no idea atm
[11:22:22] if it was related to the ns I'd have expected the scheduler not to complain, and some errors in `get events` related to quotas etc.
[11:22:32] mh, good point
[11:23:44] what is the change?
[11:24:30] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131327
[11:24:30] this is ml-serve2001
[11:24:33] Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.)
             Resource  Requests       Limits
             --------  --------       ------
             cpu       50450m (76%)   93 (141%)
             memory    56414Mi (47%)  75742Mi (63%)
[11:24:40] uff, horrible paste
[11:24:58] https://phabricator.wikimedia.org/P74456
[11:25:24] so I am wondering if the ml-serve clusters are a bit loaded right now
[11:26:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:26:49] Deployment reference-need-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00008-deployment - ...
[11:26:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:26:50] even though from https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1&var-ds=thanos&var-site=codfw&var-prometheus=k8s-mlserve&viewPanel=3 it doesn't seem so
[11:27:07] gtg, will check later if needed!
[11:30:02] (CR) Kevin Bazira: [C:+1] "I've tested this patch and it LGTM: https://phabricator.wikimedia.org/P74457" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[11:30:19] aiko: o/ the edit-check model is now available on the public repo: https://analytics.wikimedia.org/published/wmf-ml-models/edit-check
[11:30:19] isaranto: I've +1'd the docker-compose patch. It LGTM: https://phabricator.wikimedia.org/P74457
[11:30:41] Yeah, there is no node that can satisfy 34 CPUs
[11:30:49] cf.
[11:30:51] https://thanos.wikimedia.org/graph?g0.expr=(sum%20by%20(node%2C%20resource)%20(kube_node_status_allocatable%7Bsite%3D%22eqiad%22%2C%20prometheus%3D%22k8s-mlserve%22%2C%20resource%3D%22cpu%22%7D))%20-%20(sum%20by%20(node%2C%20resource)%20(kube_pod_container_resource_requests%7B%7D))&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[11:31:29] Biggest available chunk is 28.7, and most are 25 or lower
[11:32:25] We could drain one host and compact the fragmented usage, but it would decidedly not be a permanent solution
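(For readability, the PromQL inside the thanos link above decodes to per-node allocatable CPU minus per-node requested CPU, i.e. the largest schedulable "chunk" left on each node. A minimal Python sketch that runs the same query follows; the /api/v1/query path is the standard Prometheus HTTP API and is an assumption here, not something shown in the log.)

```python
# Decoded form of the query from the thanos graph link above: free
# (unrequested) CPU per node. A 34-CPU pod fits only if some value >= 34.
import requests

QUERY = (
    '(sum by (node, resource) (kube_node_status_allocatable{'
    'site="eqiad", prometheus="k8s-mlserve", resource="cpu"}))'
    ' - '
    '(sum by (node, resource) (kube_pod_container_resource_requests{}))'
)

# Assumption: Thanos exposes the standard Prometheus query endpoint here.
resp = requests.get(
    "https://thanos.wikimedia.org/api/v1/query",
    params={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    node = result["metric"].get("node", "?")
    free_cpu = float(result["value"][1])
    print(f"{node}: {free_cpu:.1f} CPUs unrequested")
```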
[11:42:48] thanks Kevin!
[12:05:05] np! :)
[12:31:26] back
[12:32:31] hmm, so we have to rethink the strategy for reference-need
[12:32:48] thank you Kevin for pushing the model and for the review!
[12:36:47] (PS2) Ilias Sarantopoulos: edit-check: add docker compose file for local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100)
[12:37:35] (CR) Ilias Sarantopoulos: "thanks for catching this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:41:44] I reduced the limits/requests for the ref-need deployment: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131697
[12:42:29] hopefully this will be scheduled
[12:43:54] LGTM!
[12:50:20] (CR) Kevin Bazira: [C:+1] "great! thank you for adding the dummy path. it works like a charm!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:51:21] (CR) Ilias Sarantopoulos: [V:+2 C:+2] edit-check: add docker compose file for local run [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131383 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[12:58:37] ok, the latest revision was successfully deployed
[13:02:04] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:02:04] Deployment reference-need-predictor-00008-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00008-deployment - ...
[13:02:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:33:22] klausman: nice!
[14:34:23] I've also made a copy of the Grafana dashboard for Kube resources (should be easy to find), which includes a right-y-axis graph of the _percentage_ of remaining CPU/Mem resources on a cluster
[14:35:48] you can also add a new/separate graph/panel, instead of forking
[14:35:56] so it will be available to everybody :)
[14:36:22] The existing dashboard is already pretty dense, so I made a copy to see if what I add is actually useful
[14:38:53] good morning all
[15:22:49] (PS6) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[15:39:00] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10683626 (gkyziridis) **Kserve Batcher for Edit-check** - Update Pydantic schema for input request to accept "instances" list. - Add validations...
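(A minimal sketch of what the "Pydantic schema for input request to accept 'instances' list" described in the T386100 comment above could look like. The inner field name and the validation rule are hypothetical, not taken from the actual patch.)

```python
# Hypothetical request schema for the edit-check batcher: the top-level
# "instances" list is from the task comment; everything else is assumed.
from pydantic import BaseModel, field_validator


class EditCheckInstance(BaseModel):
    text: str  # hypothetical field name for the text to classify


class EditCheckRequest(BaseModel):
    instances: list[EditCheckInstance]

    @field_validator("instances")
    @classmethod
    def non_empty(cls, v: list[EditCheckInstance]) -> list[EditCheckInstance]:
        # Reject empty batches so the model server never runs a no-op predict.
        if not v:
            raise ValueError("instances must contain at least one item")
        return v
```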
[17:04:56] logging off folks, have a nice evening o/
[19:18:05] Machine-Learning-Team, EditCheck: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10684503 (achou) When we use SHAP values to explain our peacock detection model, for each instance, the SHAP explainer returns a tuple with 3 elements: `values` (SHAP val...
[23:56:33] Machine-Learning-Team, Add-Link, Growth-Team: Make airflow-dag for addalink training pipeline output compatible with deployed model - https://phabricator.wikimedia.org/T388258#10685279 (leila)
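(Returning to the SHAP note in T387984 above: a minimal sketch of reading the per-instance `values` / `base_values` / `data` fields of a shap Explanation for a text classifier, to highlight high-contribution tokens. The model checkpoint, class index, and threshold below are hypothetical stand-ins, not the actual peacock model.)

```python
# Token-level SHAP values for a text classifier, highlighting tokens that
# push the prediction toward an assumed "peacock" class.
import shap
from transformers import pipeline

# Hypothetical public checkpoint; the real peacock model lives on Lift Wing.
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for all classes, as shap expects
)

explainer = shap.Explainer(clf)
explanation = explainer(["This truly magnificent, world-renowned article"])

inst = explanation[0]       # Explanation for the first (only) instance
tokens = inst.data          # the tokenized input segments
scores = inst.values[:, 1]  # SHAP values toward class index 1 (assumed positive/"peacock")

# Flag tokens whose contribution exceeds an arbitrary threshold.
for token, score in zip(tokens, scores):
    marker = " <-- contributes to peacock" if score > 0.05 else ""
    print(f"{token!r}: {score:+.3f}{marker}")
```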