[08:04:54] good morning
[08:22:44] morning folks
[08:31:57] hey folks!
[08:33:26] if you are ok I am going to depool and reimage the two ml-staging ctrl vms (one at a time)
[08:33:33] it shouldn't impact any tests
[08:33:43] lemme know if you are ok
[08:48:37] proceeding with ml-staging-ctrl2001 :)
[09:05:10] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10650718 (kevinbazira) Stalled→In progress a:kevinbazira
[09:09:30] o/ Luca thanks
[09:09:46] we are working on ml-staging-codfw so if we see anything odd we'll notify you
[09:11:37] in theory it should be transparent to you, hopefully :)
[09:12:10] after the staging control plane, Tobias and I are planning to slowly reimage all k8s prod nodes
[09:12:16] to bookworm and containerd
[09:12:24] one at a time, etc.
[09:14:34] and after this long upgrade, we'll be able to move to PSS policies
[09:14:48] and the migration to k8s 1.31 will be unblocked :D
[09:22:42] ack!
[09:32:26] staging-ctrl2001 is up and all good, proceeding with 2002
[09:49:01] (PS1) Kevin Bazira: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179)
[10:07:41] (PS1) Santhosh: Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508)
[10:09:04] (CR) Santhosh: "Testing notes:" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:11:02] (CR) Santhosh: "Note that https://api.wikimedia.org/service/lw/recommendation/api/v1/translation?source=en&target=no&count=24&search_algorithm=mostpopular" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:11:40] (CR) CI reject: [V:-1] Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:23:52] both nodes reimaged!
[10:25:05] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10651079 (elukey)
[10:27:44] FIRING: LiftWingServiceErrorRate: ...
[10:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:30:18] (CR) Jon Harald Søby: Consider special language codes while checking for article existence (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:55:39] * isaranto sighs
[10:58:09] (PS1) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
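For context on the one-node-at-a-time reimage plan discussed this morning, a rough sketch of the kind of sequence involved; the cookbook invocation, flags, hostname, and task id are assumptions based on standard tooling, not commands taken from this log.

```bash
# Hedged sketch only: drain one worker, reimage it to Bookworm, let it rejoin.
# Hostname, task id, and cookbook flags are placeholders/assumptions.
kubectl cordon ml-serve2001.codfw.wmnet
kubectl drain ml-serve2001.codfw.wmnet --ignore-daemonsets --delete-emptydir-data

# Assumed reimage cookbook invocation targeting Bookworm.
sudo cookbook sre.hosts.reimage --os bookworm -t T387854 ml-serve2001

# Once the node is back up and healthy, let it take pods again.
kubectl uncordon ml-serve2001.codfw.wmnet
```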
[10:58:56] (PS2) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:00:03] (PS3) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:17:59] georgekyz: can you update the commit message to explain why we use device only in the pipeline?
[11:18:30] I mean to just record the behavior we experienced so that we have a reference
[11:21:43] yeap you are right
[11:24:07] (CR) Ilias Sarantopoulos: inference-services: edit-check service on GPU. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:24:21] I added the comments on the patch, thanks!
[11:26:24] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10651306 (elukey) @isarantopoulos the CPU throttling can be a bit misleading sometimes (see [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limi...
[11:27:36] (PS4) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:31:14] (PS5) Gkyziridis: inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:33:32] message updated, I am going for a fast lunch and then I will merge it, please review when you have time folks ~~~~~^
[11:37:49] (CR) Ilias Sarantopoulos: [C:+1] "LGTM! Thanks for the fix!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:41:26] Don't have a fast lunch, please have a normal one :D
[12:19:57] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10651609 (isarantopoulos) @elukey thanks for chiming in, this is very useful! I noticed 2 things I miscalculated: # above I mentioned `ONM_NUM_THREADS` instead of `...
[12:32:04] * isaranto lunch!
[12:53:43] (PS6) Gkyziridis: inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[12:56:26] (CR) Gkyziridis: [C:+2] inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:02:07] (PS1) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[13:11:57] (CR) Ilias Sarantopoulos: inference-services: edit-check fix of PYTHONPATH and device specification. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
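Re the "device only in the pipeline" behaviour recorded in the commit message above (11:17): a minimal sketch of what that looks like with the Hugging Face pipeline API; the task and model path are placeholders and this is not the actual patch.

```python
# Hedged sketch (not the actual patch): pass the device to pipeline() only,
# rather than also moving the model with .to() at load time, and let the
# pipeline place the model and inputs on the GPU in one spot.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # GPU index, or -1 for CPU
classifier = pipeline(
    "text-classification",
    model="/path/to/peacock-model",  # placeholder model path
    device=device,
)
print(classifier("This groundbreaking, world-renowned article is truly exceptional."))
```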
[13:41:30] (PS2) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:03:55] (CR) Elukey: [C:+1] "To be sure - you want to have an env variable called NUM_THREADS that will override CPU_COUNT and get assigned to OMP_NUM_THREADS right?" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:04:01] isaranto: o/
[14:04:40] the easiest way to check the number of threads is to create the pod, then check via `kubectl get pods -n $something -o wide` to see what worker host it is running on
[14:05:07] then ssh to the host, find the process id and then via `ps -eLF | grep $pid` you'll see the threads
[14:05:40] totally forgot about common_settings.sh
[14:12:31] (CR) Ilias Sarantopoulos: "Yes, exactly!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:26:51] (CR) Ilias Sarantopoulos: [C:+2] override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:27:05] (CR) Sbisson: [C:+2] Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[14:27:13] ack, thanks Luca, will try again and ping if I need help
[14:28:06] (Merged) jenkins-bot: Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[14:34:33] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10652198 (achou) a:achou
[14:34:43] Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10652199 (achou) a:achou
[14:34:56] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10652200 (achou) a:achou
[14:42:09] (CR) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:45:07] (PS3) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:45:22] (PS4) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:46:05] (CR) Ilias Sarantopoulos: [C:+2] "Added a dummy change so that the docker image build gets triggered" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:50:11] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730#10652344 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has not seen updates for two years.
[14:50:37] (Merged) jenkins-bot: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:52:09] isaranto: is it ok if I reimage ml-serve2001 to bookworm/containerd? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128463
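Spelled out, the thread-count check Luca described above (14:04) looks roughly like this; the namespace, worker host, and process name are placeholders.

```bash
# Hedged sketch of the check described above; names are placeholders.
# 1. Find which worker host the pod was scheduled on.
kubectl get pods -n revision-models -o wide

# 2. SSH to that worker and find the model-server process id.
ssh ml-serve2002.codfw.wmnet
pgrep -f model_server          # note the PID

# 3. List its threads: ps -eLF prints one row per thread (LWP column).
ps -eLF | grep <pid>
```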
[14:52:24] the codfw dc is now depooled
[15:05:24] yes go ahead!
[15:38:07] started the reimage :)
[16:03:47] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10652956 (gkyziridis) New version of [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1129212 | edit-check servic...
[16:08:33] Lift-Wing, Machine-Learning-Team, EditCheck: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10652971 (gkyziridis) **Edit-Check Service - Peacock model GPU version TESTS** Test sample: ` num_words = random.randint(5, 600) original = " ".join(["What is Wikipedia"]...
[16:10:38] Folks, the new version of the edit-check service is deployed on staging and running smoothly. The memory spiking issue went away. Locust results available ~~~^^
[16:10:57] really nice numbers: 82 Avg
[16:12:18] great work George \o/
[16:14:06] georgekyz: do you mind running a load test for 5 minutes (300s) with 50 users? not necessarily now, it can also happen tomorrow
[16:15:42] yeah sure
[16:17:33] it is running right now
[16:18:53] the memory is stable
[16:22:17] oh wow... the average remained the same omg
[16:24:13] 87 Avg
[16:24:33] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:25:10] locust results available: https://phabricator.wikimedia.org/T388817#10652971
[16:26:31] (PS2) Kevin Bazira: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179)
[16:28:36] that still has 2 users though :D
[16:28:53] let's sync tomorrow morning, great progress!
[16:31:01] isaranto: scroll down in the paste
[16:31:16] MY BAD
[16:31:30] haha no worries
[16:31:50] the avg is still low
[16:32:05] that's awesome
[16:33:02] median is also 89, awesooome
[16:33:14] (CR) Kevin Bazira: [C:+2] "Thanks for the review Aiko!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:37:15] (Merged) jenkins-bot: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:53:09] (PS3) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:53:22] (PS4) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:55:37] (CR) Ilias Sarantopoulos: "I updated the README.md file:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
[16:56:09] (PS5) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:56:28] going afk folks, have a nice evening/rest of day!
[16:58:15] (CR) Kevin Bazira: [C:+1] "Thanks! LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
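For reference, a hedged sketch of how the 5-minute, 50-user run requested above (16:14) could be launched with locust in headless mode; the locustfile path and target host are assumptions, not taken from the repo.

```bash
# Hedged sketch; the locustfile and host are assumptions, the flags are standard locust options.
locust -f locust/edit_check.py --headless \
       --users 50 --spawn-rate 5 --run-time 300s \
       --host https://inference-staging.svc.codfw.wmnet:30443 \
       --csv edit_check_gpu_50users
```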
[17:26:47] ml-serve2001 up and running with containerd!
[17:28:45] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10653451 (elukey) Moved ml-serve2001 today, with the `--move-vlan` reimage flag. We need to run homer on cr1-{eqiad,codfw} (depending on the host, in this case...
[17:28:58] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10653456 (elukey)
[17:38:42] as always, please ping me if you see anything weird
[17:38:54] ml-staging is already running containerd and nothing popped up
[17:38:59] but let's keep an extra eye
[17:39:11] also please remember that for a week inference.discovery.wmnet is pooled only in eqiad
[17:39:14] and not in codfw
[17:39:22] for the MW switchover
[18:03:51] (PS3) AikoChou: locust: add util for fetching recent change revisions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755
[18:08:51] (CR) AikoChou: locust: add util for fetching recent change revisions (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
[23:35:59] (PS1) Jforrester: build: Update MediaWiki requirement to 1.44 [extensions/ORES] - https://gerrit.wikimedia.org/r/1129487
[23:46:36] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10655076 (ppelberg) >>! In T388215#10647994, @achou wrote: > Based on feedback from @jhsoby, @Strainu, and @matej_sucha...
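Re the pooling reminder above (17:39), a quick sanity check of which DC inference.discovery.wmnet currently resolves to; the per-DC record names are assumptions about the usual svc naming, so treat this as a sketch.

```bash
# Hedged sketch: compare the discovery record against the (assumed) per-DC
# service records to confirm that only eqiad is pooled during the switchover.
dig +short inference.discovery.wmnet
dig +short inference.svc.eqiad.wmnet
dig +short inference.svc.codfw.wmnet
```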