[08:04:54] good morning
[08:22:44] morning folks
[08:31:57] hey folks!
[08:33:26] if you are ok I am going to depool and reimage the two ml-staging ctrl vms (one at a time)
[08:33:33] it shouldn't impact any tests
[08:33:43] lemme know if you are ok
[08:48:37] proceeding with ml-staging-ctrl2001 :)
[09:05:10] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10650718 (kevinbazira) Stalled→In progress a:kevinbazira
[09:09:30] o/ Luca thanks
[09:09:46] we are working on ml-staging-codfw so if we see anything odd we'll notify you
[09:11:37] in theory it should be transparent to you, hopefully :)
[09:12:10] after the staging control plane, Tobias and I are planning to slowly reimage all k8s prod nodes
[09:12:16] to bookworm and containerd
[09:12:24] one at a time, etc.
[09:14:34] and after this long upgrade, we'll be able to move to PSS policies
[09:14:48] and the migration to k8s 1.31 will be unblocked :D
[09:22:42] ack!
[09:32:26] staging-ctrl2001 is up and all good, proceeding with 2002
[09:49:01] (PS1) Kevin Bazira: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179)
[10:07:41] (PS1) Santhosh: Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508)
[10:09:04] (CR) Santhosh: "Testing notes:" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:11:02] (CR) Santhosh: "Note that https://api.wikimedia.org/service/lw/recommendation/api/v1/translation?source=en&target=no&count=24&search_algorithm=mostpopular" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:11:40] (CR) CI reject: [V:-1] Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:23:52] both nodes reimaged!
[10:25:05] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10651079 (elukey)
[10:27:44] FIRING: LiftWingServiceErrorRate: ...
[10:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:30:18] (CR) Jon Harald Søby: Consider special language codes while checking for article existence (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[10:55:39] * isaranto sighs
[10:58:09] (PS1) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
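For context on the one-node-at-a-time reimage plan discussed this morning, a rough sketch of the kind of sequence involved; the cookbook invocation, flags, hostname, and task id are assumptions based on standard tooling, not commands taken from this log.

```bash
# Hedged sketch only: drain one worker, reimage it to Bookworm, let it rejoin.
# Hostname, task id, and cookbook flags are placeholders/assumptions.
kubectl cordon ml-serve2001.codfw.wmnet
kubectl drain ml-serve2001.codfw.wmnet --ignore-daemonsets --delete-emptydir-data

# Assumed reimage cookbook invocation targeting Bookworm.
sudo cookbook sre.hosts.reimage --os bookworm -t T387854 ml-serve2001

# Once the node is back up and healthy, let it take pods again.
kubectl uncordon ml-serve2001.codfw.wmnet
```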
[10:58:56] (PS2) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:00:03] (PS3) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:17:59] georgekyz: can you update the commit message to explain why we use device only in the pipeline?
[11:18:30] I mean to just record the behavior we experienced so that we have a reference
[11:21:43] yeap you are right
[11:24:07] (CR) Ilias Sarantopoulos: inference-services: edit-check service on GPU. (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:24:21] I added the comments on the patch, thanks!
[11:26:24] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10651306 (elukey) @isarantopoulos the CPU throttling can be a bit misleading sometimes (see [[ https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limi...
[11:27:36] (PS4) Gkyziridis: inference-services: edit-check service on GPU. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:31:14] (PS5) Gkyziridis: inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[11:33:32] message updated, I am going for a fast lunch and then I will merge it, please review when you have time folks ~~~~~^
[11:37:49] (CR) Ilias Sarantopoulos: [C:+1] "LGTM! Thanks for the fix!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:41:26] Don't have a fast lunch, please have a normal one :D
[12:19:57] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10651609 (isarantopoulos) @elukey thanks for chiming in, this is very useful! I noticed 2 things I miscalculated: # above I mentioned `ONM_NUM_THREADS` instead of `...
[12:32:04] * isaranto lunch!
[12:53:43] (PS6) Gkyziridis: inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100)
[12:56:26] (CR) Gkyziridis: [C:+2] inference-services: edit-check fix of PYTHONPATH and device specification. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:02:07] (PS1) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[13:11:57] (CR) Ilias Sarantopoulos: inference-services: edit-check fix of PYTHONPATH and device specification. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129212 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
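Re the "device only in the pipeline" behaviour recorded in the commit message above (11:17): a minimal sketch of what that looks like with the Hugging Face pipeline API; the task and model path are placeholders and this is not the actual patch.

```python
# Hedged sketch (not the actual patch): pass the device to pipeline() only,
# rather than also moving the model with .to() at load time, and let the
# pipeline place the model and inputs on the GPU in one spot.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # GPU index, or -1 for CPU
classifier = pipeline(
    "text-classification",
    model="/path/to/peacock-model",  # placeholder model path
    device=device,
)
print(classifier("This groundbreaking, world-renowned article is truly exceptional."))
```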
[13:41:30] (PS2) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:03:55] (CR) Elukey: [C:+1] "To be sure - you want to have an env variable called NUM_THREADS that will override CPU_COUNT and get assigned to OMP_NUM_THREADS right?" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:04:01] isaranto: o/
[14:04:40] the easiest way to check the number of threads is to create the pod, then check via `kubectl get pods -n $something -o wide` to see what worker host it is running on
[14:05:07] then ssh to the host, find the process id and then via `ps -eLF | grep $pid` you'll see the threads
[14:05:40] totally forgot about common_settings.sh
[14:12:31] (CR) Ilias Sarantopoulos: "Yes, exactly!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:26:51] (CR) Ilias Sarantopoulos: [C:+2] override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:27:05] (CR) Sbisson: [C:+2] Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[14:27:13] ack, thanks Luca, will try again and ping if I need help
[14:28:06] (Merged) jenkins-bot: Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1126504 (owner: Santhosh)
[14:34:33] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10652198 (achou) a:achou
[14:34:43] Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10652199 (achou) a:achou
[14:34:56] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10652200 (achou) a:achou
[14:42:09] (CR) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:45:07] (PS3) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:45:22] (PS4) Ilias Sarantopoulos: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245
[14:46:05] (CR) Ilias Sarantopoulos: [C:+2] "Added a dummy change so that the docker image build gets triggered" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:50:11] Machine-Learning-Team: Update editquality demo jupyter notebook - https://phabricator.wikimedia.org/T300730#10652344 (Aklapper) In progress→Open Resetting task status from "In Progress" to "Open" as this task has not seen updates for two years.
[14:50:37] (Merged) jenkins-bot: override OMP_NUM_THREADS [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129245 (owner: Ilias Sarantopoulos)
[14:52:09] isaranto: is it ok if I reimage ml-serve2001 to bookworm/containerd? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128463
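Spelled out, the thread-count check Luca described above (14:04) looks roughly like this; the namespace, worker host, and process name are placeholders.

```bash
# Hedged sketch of the check described above; names are placeholders.
# 1. Find which worker host the pod was scheduled on.
kubectl get pods -n revision-models -o wide

# 2. SSH to that worker and find the model-server process id.
ssh ml-serve2002.codfw.wmnet
pgrep -f model_server          # note the PID

# 3. List its threads: ps -eLF prints one row per thread (LWP column).
ps -eLF | grep <pid>
```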
[14:52:24] the codfw dc is now depooled
[15:05:24] yes go ahead!
[15:38:07] started the reimage :)
[16:03:47] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10652956 (gkyziridis) New version of [[ https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1129212 | edit-check servic...
[16:08:33] Lift-Wing, Machine-Learning-Team, EditCheck: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10652971 (gkyziridis) **Edit-Check Service - Peacock model GPU version TESTS** Test sample: ` num_words = random.randint(5, 600) original = " ".join(["What is Wikipedia"]...
[16:10:38] Folks, the new version of the edit-check service is deployed on staging and running smoothly. The memory spiking issue went away. Locust results available ~~~^^
[16:10:57] really nice numbers: 82 Avg
[16:12:18] great work George \o/
[16:14:06] georgekyz: do you mind running a load test for 5 minutes (300s) with 50 users? not necessarily now, it can also happen tomorrow
[16:15:42] yeah sure
[16:17:33] it is running right now
[16:18:53] the memory is stable
[16:22:17] oh wow... the average remained the same omg
[16:24:13] 87 Avg
[16:24:33] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:25:10] locust results available: https://phabricator.wikimedia.org/T388817#10652971
[16:26:31] (PS2) Kevin Bazira: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179)
[16:28:36] that still has 2 users though :D
[16:28:53] let's sync tomorrow morning, great progress!
[16:31:01] isaranto: scroll down in the paste
[16:31:16] MY BAD
[16:31:30] haha no worries
[16:31:50] the avg is still low
[16:32:05] that's awesome
[16:33:02] median is also 89, awesooome
[16:33:14] (CR) Kevin Bazira: [C:+2] "Thanks for the review Aiko!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:37:15] (Merged) jenkins-bot: RRLA: process inputs from source event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129201 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[16:53:09] (PS3) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:53:22] (PS4) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:55:37] (CR) Ilias Sarantopoulos: "I updated the README.md file:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
[16:56:09] (PS5) Ilias Sarantopoulos: reference-quality: build/run services with docker compose [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832
[16:56:28] going afk folks, have a nice evening/rest of day!
[16:58:15] (CR) Kevin Bazira: [C:+1] "Thanks! LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1128832 (owner: Ilias Sarantopoulos)
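For reference, a hedged sketch of how the 5-minute, 50-user run requested above (16:14) could be launched with locust in headless mode; the locustfile path and target host are assumptions, not taken from the repo.

```bash
# Hedged sketch; the locustfile and host are assumptions, the flags are standard locust options.
locust -f locust/edit_check.py --headless \
       --users 50 --spawn-rate 5 --run-time 300s \
       --host https://inference-staging.svc.codfw.wmnet:30443 \
       --csv edit_check_gpu_50users
```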
[17:26:47] ml-serve2001 up and running with containerd!
[17:28:45] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10653451 (elukey) Moved ml-serve2001 today, with the `--move-vlan` reimage flag. We need to run homer on cr1-{eqiad,codfw} (depending on the host, in this case...
[17:28:58] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10653456 (elukey)
[17:38:42] as always, please ping me if you see anything weird
[17:38:54] ml-staging is already running containerd and nothing popped up
[17:38:59] but let's keep an extra eye
[17:39:11] also please remember that for a week inference.discovery.wmnet is pooled only in eqiad
[17:39:14] and not in codfw
[17:39:22] for the MW switchover
[18:03:51] (PS3) AikoChou: locust: add util for fetching recent change revisions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755
[18:08:51] (CR) AikoChou: locust: add util for fetching recent change revisions (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
[23:35:59] (PS1) Jforrester: build: Update MediaWiki requirement to 1.44 [extensions/ORES] - https://gerrit.wikimedia.org/r/1129487
[23:46:36] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10655076 (ppelberg) >>! In T388215#10647994, @achou wrote: > Based on feedback from @jhsoby, @Strainu, and @matej_sucha...
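Re the pooling reminder above (17:39), a quick sanity check of which DC inference.discovery.wmnet currently resolves to; the per-DC record names are assumptions about the usual svc naming, so treat this as a sketch.

```bash
# Hedged sketch: compare the discovery record against the (assumed) per-DC
# service records to confirm that only eqiad is pooled during the switchover.
dig +short inference.discovery.wmnet
dig +short inference.svc.eqiad.wmnet
dig +short inference.svc.codfw.wmnet
```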