[04:40:38] (CR) Santhosh: "retest" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[04:45:02] (CR) Santhosh: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[07:53:53] good morning!
[07:56:49] (CR) Ilias Sarantopoulos: [C: +2] reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[07:57:37] sorry for not leaving more time to review the above patch, I'm merging to run load tests on ml-staging
[08:04:05] good morning folks
[08:06:36] (Merged) jenkins-bot: reference-need: multiprocessing in predict [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128414 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
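The reference-need patch merged above is about keeping the async predict() path responsive by pushing CPU-bound scoring into a process pool. A minimal sketch of that pattern with plain asyncio and concurrent.futures (the class and function names are illustrative, not the actual inference-services code):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def score_batch(sentences: list[str]) -> list[float]:
    # Stand-in for CPU-bound scoring work; it must be a module-level function
    # so it can be pickled and sent to the worker processes.
    return [float(len(s)) for s in sentences]

class RefNeedLikeModel:
    """Illustrative model wrapper, not the real reference-need class."""

    def __init__(self, workers: int = 2):
        self.pool = ProcessPoolExecutor(max_workers=workers)

    async def predict(self, sentences: list[str]) -> list[float]:
        loop = asyncio.get_running_loop()
        # The event loop stays free to serve other requests while the
        # scoring runs in a separate process.
        return await loop.run_in_executor(self.pool, score_batch, sentences)

if __name__ == "__main__":
    model = RefNeedLikeModel(workers=2)
    print(asyncio.run(model.predict(["a sentence", "another one"])))
```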
[08:23:04] (CR) Ilias Sarantopoulos: inference-services: edit-check service on GPU. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[08:24:09] o/ georgekyz I've +1'd your patch. It is up to you if you want to update blubber to 1.1.0
[08:24:18] I'm fine either way!
[08:24:59] yeap I am on it
[08:25:08] it needs some changes in the instructions
[08:25:23] I think I'd try upgrading: if it works as is, great; otherwise, if there is some non-trivial thing to be fixed, don't worry about it, we'll do it later and focus on load testing etc.
[08:25:37] ok
[08:26:55] ah yes, I remember some instructions needed changing. I see only the llm image has a buildkit version >1.0
[08:27:56] yeap, I am using that one as reference
[08:28:46] I will not spend much time on that, I will try some things for blubber 1.0.1 and if not, then I will use the older one
[08:29:14] ack! lemme know if I can help
[08:32:51] thnx
[08:53:44] FIRING: LiftWingServiceErrorRate: ...
[08:53:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:01:10] the silence expired!
[09:01:31] no I'm wrong, it is a different alert for itwiki-damaging
[09:08:58] (PS4) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:09:16] (PS5) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:11:07] georgekyz: why not try 1.1.0 directly, which is the latest version?
[09:11:22] just curious!
[09:12:04] oh I thought v1.0.1 was the latest, didn't check it
[09:12:14] I will try the latest one
[09:12:57] thnx for checking it out
[09:14:49] (PS6) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:15:07] (PS7) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:20:16] I see itwiki-damaging latencies going down, and the service is recovering...
[09:20:28] https://grafana.wikimedia.org/goto/JKqWky2Hg?orgId=1
[09:21:08] (CR) CI reject: [V: -1] inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[09:23:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:23:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:28:46] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10645583 (isarantopoulos) pasting some raw load test results. sorry for the awful format, I'm running some more tests on ml-staging and will report back ## Specific r...
[09:32:43] Machine-Learning-Team, MediaWiki-extensions-ORES, Edit-Review-Improvements-RC-Page, MediaWiki-Recent-changes, Moderator-Tools-Team: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906#10645605 (Samwalton9-WMF) Noting that this is now stalled on decisions a...
[09:33:15] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#10645607 (Samwalton9-WMF) Noting that this is now stalled on decision...
[09:36:19] (PS8) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:37:00] (PS9) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:45:33] (PS10) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100)
[09:45:59] patch is ready, merging it
[09:47:55] (CR) Gkyziridis: [C: +2] inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[09:48:40] (Merged) jenkins-bot: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[09:58:10] (CR) Ilias Sarantopoulos: inference-services: edit-check service on GPU. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128444 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[10:24:46] edit-check peacock model gpu version on staging is ready, please review whenever you have time, folks: --> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1128820
[10:34:27] (PS1) Ilias Sarantopoulos: reference-quality: multiprocessing - do not use process pool for workers=1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019)
[10:48:06] georgekyz: I reviewed! We'll need to add the gpu resources to the pod. But there is another thing: because other deployments use the gpus we'll not be able to use them
[10:48:11] so we'd have to remove them
[10:48:53] but if you want to test you can edit the experimental ns directly without the commit in deployment-charts. do u remember how to do that?
[10:49:48] I'm trying to see if we have documentation about it. Otherwise we can jump on a quick call to sync
[10:51:23] just exporting the KUBECONFIG would do https://phabricator.wikimedia.org/T354516#9481893
[10:51:49] unfortunately we don't have any wikitech docs about this, we'll need to add it
[10:52:41] isaranto: alright, I can try that one. Just to be sure, should I merge the patch or not?
[10:53:18] since you didn't get a +1 you shouldn't :P
[10:53:38] but for this one it won't work, as it won't be able to "see" the gpu if you don't add the suggestions I mentioned
[10:56:03] yeah I meant after pushing the changes :P
[10:57:39] the rule is that you should get a +1 review before you merge (all rules come with exceptions)
[10:58:10] so I'll review right away. especially for deployment-charts we might affect other services (if not careful)
[11:05:20] (PS1) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[11:06:01] (PS2) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[11:06:52] I added a docker compose file which we can gradually expand to run the services locally more easily. I hope you like it
[11:12:41] * isaranto afk lunch
[12:31:40] (PS2) Ilias Sarantopoulos: reference-quality: multiprocessing - do not use process pool for workers=1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019)
[13:01:12] georgekyz: I reviewed the dep-charts patch!
[13:01:46] thnk youuu
[13:01:53] when you have some time please take a look at the patch above - I disabled the process pool for workers<=2 so that we don't change any existing functionality for reference-risk
[13:02:31] I'm on it
[13:04:45] thank youu
[13:05:13] I'll run some more load tests on experimental ml-staging after my meetings
[13:14:12] (CR) Gkyziridis: [C: +1] "Thank you for working on that one!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[13:14:57] isaranto: I reviewed it with +1; two comments on pythonic style, don't stress too much about them
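A minimal sketch of the guard described at 13:01:53: create the process pool only when enough workers are configured, otherwise call the function inline so existing single-worker deployments (e.g. reference-risk) behave exactly as before. The cutoff used here (workers > 1) and all names are illustrative; the actual patch gates on its own configured worker count.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Optional

def heavy(n: int) -> int:
    # Stand-in for CPU-bound preprocessing/scoring.
    return sum(i * i for i in range(n))

class PoolRunner:
    """Run CPU-bound work in a process pool only when it can actually help."""

    def __init__(self, workers: int):
        # With a single worker the pool would only add pickling/IPC overhead,
        # so keep the plain in-process call in that case.
        self.pool: Optional[ProcessPoolExecutor] = (
            ProcessPoolExecutor(max_workers=workers) if workers > 1 else None
        )

    async def run(self, fn: Callable, *args):
        if self.pool is None:
            return fn(*args)  # unchanged, in-process behaviour
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.pool, fn, *args)

if __name__ == "__main__":
    async def demo():
        for workers in (1, 4):
            print(workers, await PoolRunner(workers).run(heavy, 100_000))
    asyncio.run(demo())
```

Whatever the exact threshold ends up being, the point is the same: the pool is only paid for when it can actually parallelise work.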
[13:26:06] hey folks! Is it ok if I reimage the ml-staging-ctrl nodes (one at a time) after the switchover? Cc: klausman
[13:26:20] move to bookworm + containerd
[13:26:26] +yes, thank you!
[13:26:36] I'd help, but I'm out sick today
[13:27:08] ah snap, sorry, please rest! Didn't mean to ping you
[13:28:00] (CR) Kevin Bazira: "Thank you for working on this Ilias!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
[13:30:52] (CR) Ilias Sarantopoulos: reference-quality: multiprocessing - do not use process pool for workers=1 (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[13:55:28] (CR) Gkyziridis: [C: +1] reference-quality: multiprocessing - do not use process pool for workers=1 (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[14:43:14] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10646959 (kevinbazira) Open→Resolved
[15:21:15] (CR) Ilias Sarantopoulos: [C: +2] reference-quality: multiprocessing - do not use process pool for workers=1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019) (owner: Ilias Sarantopoulos)
[15:22:00] (Merged) jenkins-bot: reference-quality: multiprocessing - do not use process pool for workers=1 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128824 (https://phabricator.wikimedia.org/T387019)
[16:02:09] * isaranto afk - bbl
[16:07:18] Lift-Wing, Machine-Learning-Team, EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10647440 (gkyziridis) Edit-check service with peacock model new [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1128444 | gpu image ]]...
[17:09:29] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10647796 (achou) Update on the second approach: The template "Peacock_inline" exists in five non-English targeted lang...
[17:43:57] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10647994 (achou) Based on our initial analysis of data for the top 19 non-English languages: - Languages that have enou...
[18:21:02] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10648312 (isarantopoulos) I tried running some load tests on ml-staging experimental using 2 workers and multiprocessing and I saw really high cpu throttling. I assume...
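For context on the load tests mentioned throughout the day, a bare-bones, concurrency-limited load-test sketch. The endpoint URL, the payload and the use of aiohttp are assumptions for illustration, not the actual ml-staging endpoint or the tooling the team used.

```python
import asyncio
import statistics
import time

import aiohttp  # assumed available; any async HTTP client works

# Placeholders: point these at the model endpoint and payload you are testing.
URL = "https://inference-staging.example/v1/models/reference-need:predict"
PAYLOAD = {"rev_id": 12345, "lang": "en"}
CONCURRENCY = 8
REQUESTS = 100

async def one_call(session: aiohttp.ClientSession) -> float:
    # Time a single POST, including reading the full response body.
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()
    return time.perf_counter() - start

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(session: aiohttp.ClientSession) -> float:
        async with sem:
            return await one_call(session)

    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*(bounded(session) for _ in range(REQUESTS)))
    print(f"p50={statistics.median(latencies):.3f}s "
          f"p95={statistics.quantiles(latencies, n=20)[18]:.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```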