[06:58:56] o/ good morning!
[07:09:17] o/ kalimera
[07:09:29] thanks for the review, Ilias!
[07:09:55] o/ kevin
[07:10:01] I am going to deploy the model-servers that rely on the updated events module one-by-one
[07:10:12] np I have a patch for you as well https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1131323
[07:11:02] I'll submit another one for the api gateway afterwards but for that one we'll need Tobias to deploy
[07:22:16] right! I've +1'd the patch.
[07:22:17] are there tests currently running on the edit-check endpoint? if so, will both the `edit-check-staging` patch and the APIGW one be deployed at the same time?
[07:29:23] I'll deploy the change for the service now and later today we can deploy the one I just opened for API GW https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1132534
[07:29:39] + I'm opening one now to fix the load tests to match the staging name
[07:33:07] (PS1) Ilias Sarantopoulos: locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817)
[07:33:10] done!
[07:36:09] (CR) Kevin Bazira: [C:+1] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[07:38:27] (CR) Ilias Sarantopoulos: [C:+2] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[07:49:24] article-country deployed. outlink predictor next: https://gerrit.wikimedia.org/r/1132537
[08:13:30] I've +1'd. shall we also update the transformer image to have an up-to-date deployment?
[08:44:01] sure sure ... I've updated the patch with the transformer image too
[08:52:08] thanks!
[08:53:18] (CR) Ilias Sarantopoulos: [V:+2 C:+2] locust: fix model name for edit check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1132535 (https://phabricator.wikimedia.org/T388817) (owner: Ilias Sarantopoulos)
[08:53:40] (PS14) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[10:19:59] * isaranto lunch!
[10:20:33] Lift-Wing, Machine-Learning-Team, Patch-For-Review: LiftWing model-servers log improper JSON in stderr - https://phabricator.wikimedia.org/T389768#10693087 (kevinbazira)
[10:24:42] ditto :)
[10:26:42] outlink deployed. will deploy RRLA once the event stream is in prod.
[11:39:33] (PS15) Ilias Sarantopoulos: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:41:20] (CR) Ilias Sarantopoulos: "Resolving the previous comments as all have been implemented" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:44:03] (PS16) Ilias Sarantopoulos: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:44:28] aiko: the above patch is now ready for review. I have tested it as well locally
[12:10:12] (PS17) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:10:47] klausman: let me know if you can deploy the api gw patch sometime today https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1132534
[12:10:49] thanks!
[12:12:46] yeah, I was about to do that :)
[12:13:18] isaranto: alright! I'll review it
[12:13:37] great, thank you both!
[12:14:55] I'm following up on an alert we got on saturday for reference-need and I am seeing this chart for a pod that worries me https://grafana.wikimedia.org/goto/4yxH_8THR?orgId=1
[12:15:35] memory usage is increasing which likely indicates that there is a memory leak. this seems consistent in all pods
[12:16:04] It seems it did something similar before (go to "2 days") yesterday 9am-noon
[12:16:49] I increased memory limits/requests on saturday as I saw the same thing happening
[12:18:08] Think it might be a memory leak?
[12:18:59] this would be my guess. Something we missed when adding multiprocessing to the service
[12:26:49] my assumption is that the process pool isn't managed properly and a process that has died isn't shut down properly so it still occupies memory - which means that we load the model once more in the new process that is spawned
[12:27:03] taking a quick look and opening up a task
[12:37:27] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10693471 (isarantopoulos) We are no longer getting 500s as before so the stability has improved BUT the overall latency of the service is stil...
[12:56:09] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10693506 (isarantopoulos) There is an increasing memory consumption which ends up in pods getting killed because they get out of memory (OOMKi...
[13:01:57] isaranto: APIGW change has been pushed everywhere
[13:02:05] awesome thank you!
[13:04:24] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10693521 (isarantopoulos) **request**: ` curl https://api.wikimedia.org/service/lw/inference/v1/models/edit-check-staging:pre...
[13:04:40] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10693522 (isarantopoulos) Open→Resolved a:isarantopoulos
[14:28:59] (CR) AikoChou: [C:+1] "LGTM! Only a few minor issues." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:41:37] (PS18) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:42:18] aiko: thanks for the review, I updated it, lemme know if it is ok!
[14:42:21] (CR) CI reject: [V:-1] edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:42:51] (PS19) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:49:33] (CR) AikoChou: [C:+1] edit-check: implement for batch prediction (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:57:17] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10694215 (isarantopoulos) I have verified the above by looking at a specific pod: 1. Found some BrokenProcessPool [[ https://logstash.wikimed...
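Editor's note: the hypothesis in this log is that a worker process dies, the process pool is left in a broken state, and memory grows because the model gets loaded again in a newly spawned process. The sketch below is not the reference-need service's code; it only illustrates one way to react to `BrokenProcessPool` with Python's standard `concurrent.futures`, by explicitly shutting down the broken executor before creating a replacement. `ResilientPool`, `executor_factory`, and `restarts` are hypothetical names introduced for this illustration.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from typing import Any, Callable


class ResilientPool:
    """Hypothetical wrapper that replaces a broken process pool explicitly."""

    def __init__(
        self, executor_factory: Callable[[], Executor] = ProcessPoolExecutor
    ) -> None:
        # The factory is an assumption of this sketch; it lets callers
        # control worker count, initializers (e.g. model loading), etc.
        self._factory = executor_factory
        self._executor = executor_factory()
        self.restarts = 0

    def run(self, fn: Callable[..., Any], *args: Any) -> Any:
        """Run fn(*args) in the pool, retrying once if the pool is broken."""
        try:
            return self._executor.submit(fn, *args).result()
        except BrokenProcessPool:
            # A worker terminated abruptly and the pool can no longer be
            # used. Shut the old executor down before building a new one,
            # so the old pool is not simply abandoned.
            self._executor.shutdown(wait=False, cancel_futures=True)
            self._executor = self._factory()
            self.restarts += 1
            return self._executor.submit(fn, *args).result()

    def close(self) -> None:
        """Shut down the current executor and wait for its workers."""
        self._executor.shutdown(wait=True)
```

Counting restarts (here via the `restarts` attribute) is one cheap way to correlate pool breakage with the memory growth seen on the dashboards; whether this addresses the actual leak in the service is something only the linked task can confirm.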
[15:12:39] (CR) Ilias Sarantopoulos: [C:+2] edit-check: implement for batch prediction (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:12] (CR) Ilias Sarantopoulos: edit-check: implement for batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:23] (PS20) Ilias Sarantopoulos: edit-check: implement batch prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:37] (PS21) Ilias Sarantopoulos: edit-check: implement batch requests/prediction [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:42] (PS22) Ilias Sarantopoulos: edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:13:47] (CR) Ilias Sarantopoulos: [C:+2] edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:14:38] thanks for the review Aiko! I fixed the commit msg and merged!
[15:19:34] (CR) DCausse: "I think this should be ready to go" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:19:40] (PS2) DCausse: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821)
[15:22:40] (Merged) jenkins-bot: edit-check: implement batch requests/predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[15:36:13] (CR) Kevin Bazira: "Thank you for working on this, David. LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:36:50] (PS3) Kevin Bazira: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[15:44:03] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10694418 (isarantopoulos) Updated request after batch prediction implementation ` curl https://api.wikimedia.org/servic...
[15:50:48] (CR) Kevin Bazira: [C:+2] search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:00:50] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10694499 (Samwalton9-WMF)
[16:01:34] (Merged) jenkins-bot: search weighted_tags: drop BC for rc0 weighted_tag stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1130530 (https://phabricator.wikimedia.org/T375821) (owner: DCausse)
[16:04:19] going afk folks, have a nice evening/rest of day!
[17:23:06] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10694988 (Kgraessle) Adding the thresholds we arrived at from the analysis that was complete...
[20:36:03] Machine-Learning-Team, MediaWiki-extensions-ORES, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10695756 (kostajh) >>! In T348298#10694988, @Kgraessle wrote: > Adding the thresholds we arr...