[04:57:57] (CR) Santhosh: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[05:07:35] (PS2) Santhosh: Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508)
[05:08:15] (CR) CI reject: [V:-1] Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[05:19:28] (PS3) Santhosh: Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508)
[05:21:48] (CR) Santhosh: Consider special language codes while checking for article existence (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[07:25:51] (PS1) Kevin Bazira: RRLA: send prediction results to output event stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179)
[07:35:27] (CR) Thiemo Kreuz (WMDE): [C:+2] build: Update MediaWiki requirement to 1.44 [extensions/ORES] - https://gerrit.wikimedia.org/r/1129487 (owner: Jforrester)
[07:46:03] hello folks!
[07:46:14] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129762 to move ml-serve2002 to containerd
[08:04:35] howdy!
[08:13:54] (CR) Ilias Sarantopoulos: [C:+1] "Thanks for updating this Aiko. LGTM!"
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
[08:32:48] (Merged) jenkins-bot: build: Update MediaWiki requirement to 1.44 [extensions/ORES] - https://gerrit.wikimedia.org/r/1129487 (owner: Jforrester)
[08:52:40] reimaging 2002 now
[08:56:51] (CR) Ilias Sarantopoulos: "Thanks for the work on this. I'd like to suggest the following:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[08:57:54] kevinbazira: o/ I added a comment about publishing events. lemme know what you think, happy to chat about it more!
[09:04:16] (PS1) Ilias Sarantopoulos: locust: change time between requests for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129773 (https://phabricator.wikimedia.org/T388817)
[09:05:52] georgekyz: o/ Using the patch above, could you run an additional load test for edit-check for {100, 150, 200} users for 5 minutes each?
[09:06:27] I'd like to see what happens when we go close to 100 rps
[09:07:05] actually, also include 50 users because of the difference between requests, so we have {50, 100, 150, 200} users for 5 minutes each (300 s)
[09:07:30] lemme know if you want any help
[09:21:00] I'm on it
[09:25:04] Bedankt! ("Thanks!")
[09:53:32] Graag gedaan ("You're welcome")
[10:03:36] that is next level Dutch for me :P
[10:06:05] hahaha for me as well! it is a very formal version of "my pleasure" 🤣
[10:09:23] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10656297 (gkyziridis) **Multiple Locust tests edit-check on GPU** Test specifications: ` wait_time = between(0.0, 0.1) # random number betw...
[10:09:35] locust tests ~~~^^^
[10:17:50] thanks! so that puts some stress on the service.
I'm wondering what the cutoff point in rps/users is, above which latency starts to increase
[10:20:23] we should look into setting up a load test that runs for different user counts so that we can run it in one go (instead of running it for 50 users, collecting results, then running for 100, etc.)
[10:20:37] this could be done using a LoadTestShape https://docs.locust.io/en/stable/custom-load-shape.html
[10:21:43] I'm not sure if the output is broken down per stage by default or if it only gives the aggregate stats though
[10:22:31] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10656332 (elukey) Got this for ml-serve2002: ` UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B2 is disabled because of initialization erro...
[10:22:46] (CR) Kevin Bazira: "Thanks for the suggestion, Ilias." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[10:23:51] isaranto: I will have a look at that one
[10:23:51] georgekyz: I see 3 replicas for edit-check in experimental. The service scaled horizontally due to the increased traffic. I think we should set minreplicas to 1 for now to see what 1 pod can do.
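The staged schedule proposed above ({50, 100, 150, 200} users for 300 s each) maps naturally onto the linked LoadTestShape feature. A minimal sketch of the scheduling logic follows; in locust this would be the body of a `LoadTestShape` subclass's `tick()` method, which locust polls about once per second and which stops the run when it returns `None`. It is written here as a plain function so the schedule is easy to verify; the spawn rate of 10 users/s is an arbitrary assumption, not something from the discussion.

```python
# Sketch of a staged load schedule in the spirit of locust's LoadTestShape
# (https://docs.locust.io/en/stable/custom-load-shape.html). In a real
# locustfile this logic would live in `class StagesShape(LoadTestShape)`
# as the `tick()` method; returning None ends the test.

STAGES = [
    # (stage end time in seconds since test start, target user count)
    (300, 50),
    (600, 100),
    (900, 150),
    (1200, 200),
]
SPAWN_RATE = 10  # users started per second; arbitrary assumption

def tick(run_time):
    """Return (user_count, spawn_rate) for the current run_time, or None to stop."""
    for end_time, users in STAGES:
        if run_time < end_time:
            return (users, SPAWN_RATE)
    return None  # all stages done: locust would stop the run here
```

One caveat from the discussion still applies: locust's default summary aggregates the whole run, so per-stage numbers may need to be pulled from the CSV/timeline output rather than the final table.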
[10:25:12] I'm changing it now on the fly in the experimental ns; we should also add the change in deployment-charts
[10:25:33] Machine-Learning-Team, DC-Ops, ops-codfw: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472 (elukey) NEW
[10:25:58] I meant maxreplicas = 1
[10:26:03] Machine-Learning-Team, DC-Ops, ops-codfw: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10656369 (elukey) The host is completely depooled, please take any action that you need to do :)
[10:26:35] isaranto: o/ I responded to your comment: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1129755/comments/c0290bd0_9849a062
[10:28:02] ml-serve2002 needs to stay down due to T389472, sigh
[10:28:03] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10656373 (elukey)
[10:29:29] isaranto: Did you change it already in the isvc?
[10:30:10] doing it now
[10:32:30] georgekyz: done
[10:32:46] I see 1 pod now
[10:36:13] are you running the locust tests now?
[10:36:28] (CR) Ilias Sarantopoulos: "Got it! So only the changeprop requests have some additional latency. Let's keep in mind the background tasks for other cases then!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[10:36:54] no, go ahead, sorry for interfering!
[10:38:05] I was just curious about pod resource utilization during the tests, so I was checking grafana and saw 3 pods
[10:58:29] Do we have any examples using LoadTestShape in our locust tests? Just for reference
[10:59:23] no, we haven't used it yet
[12:01:43] * isaranto afk lunch!
[12:27:44] FIRING: LiftWingServiceErrorRate: ...
[12:27:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:58:38] Machine-Learning-Team, ORES, MediaWiki-Core-Tests, Testing Support, and 3 others: Audit tests/selenium/LocalSettings.php file aiming at possibly deprecating the feature - https://phabricator.wikimedia.org/T199939#10656932 (zeljkofilipin)
[13:19:12] (CR) Sbisson: [C:+2] Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[13:19:56] (Merged) jenkins-bot: Consider special language codes while checking for article existence [research/recommendation-api] - https://gerrit.wikimedia.org/r/1129205 (https://phabricator.wikimedia.org/T306508) (owner: Santhosh)
[14:29:44] Locust tests for edit-check on a single pod are crashing the pod
[15:03:48] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657360 (Jhancock.wm) okay, since this has happened before I pulled DIMM_B1 to see if it would boot without it. Got the same error on DIMM_B2. Moved it to DIMM_B1. Error move...
[15:04:09] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10657361 (Jhancock.wm) a:Jhancock.wm
[17:12:36] * isaranto afk!
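The replica cap discussed earlier (maxreplicas = 1, applied on the fly in the experimental namespace and still to be added to deployment-charts) would look roughly like the following on the KServe side. The `minReplicas`/`maxReplicas` fields are from the KServe `InferenceService` predictor spec, but the service name and the exact layout inside the deployment-charts templates are assumptions for illustration, not the actual chart.

```yaml
# Hypothetical sketch: capping the edit-check service at one pod so load
# tests measure a single replica. Autoscaling bounds sit on the predictor
# spec in KServe; how deployment-charts templates render this may differ.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: edit-check        # assumed name, for illustration only
  namespace: experimental
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1        # prevent horizontal scaling during the load test
```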
[18:07:56] Machine-Learning-Team, ORES, MediaWiki-Core-Tests, Testing Support, and 2 others: Audit tests/selenium/LocalSettings.php file aiming at possibly deprecating the feature - https://phabricator.wikimedia.org/T199939#10658603 (zeljkofilipin)
[18:08:26] (PS4) AikoChou: locust: add util for fetching recent change revisions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755
[18:10:25] (CR) AikoChou: "Instructions added!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
[18:10:43] (CR) AikoChou: [C:+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
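A utility like the one in the merged patch above ("fetching recent change revisions") would typically go through the MediaWiki Action API's `list=recentchanges` module. This is a hedged sketch of that approach, not the patch itself: the function names are invented for illustration, and request building and response parsing are split out so they can be checked without network access.

```python
# Hypothetical sketch of fetching recent-change revision IDs from the
# MediaWiki Action API (list=recentchanges). build_params() and
# extract_rev_ids() are illustrative names, not taken from the patch.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_params(limit=50):
    """Query parameters for the recentchanges module of the Action API."""
    return {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "ids",       # include revid/old_revid in each entry
        "rclimit": str(limit),
        "format": "json",
    }

def extract_rev_ids(response):
    """Pull revision IDs out of a decoded recentchanges response."""
    changes = response.get("query", {}).get("recentchanges", [])
    # log-only entries carry revid 0; skip those
    return [rc["revid"] for rc in changes if rc.get("revid")]

def fetch_recent_rev_ids(api_url="https://en.wikipedia.org/w/api.php", limit=50):
    """Fetch up to `limit` recent revision IDs (requires network access)."""
    with urlopen(api_url + "?" + urlencode(build_params(limit))) as resp:
        return extract_rev_ids(json.load(resp))
```

Splitting parsing from I/O also makes the helper easy to reuse from a locust task, which is presumably why the patch adds it as a standalone util.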