[06:10:01] good morning folks
[06:17:18] good morning
[06:49:14] hello!
[07:07:16] good morning :)
[08:28:23] Machine-Learning-Team: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11228021 (OKarakaya-WMF) Results for both gpt-oss:20b and aya-expanse:32b are available in the [spreadsheet](https://docs.google.com/spreadsheets/d/1IBVBisx2Ojg0PJvxvOzlYJW4Y_5f2Wp1dOyimpvGEPc/edit?gid=9702...
[08:33:38] o/ georgekyz is there anything else missing from https://phabricator.wikimedia.org/T404717?
[08:37:23] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11228032 (BWojtowicz-WMF) @Ottomata @isarantopoulos Thank you for the suggestion and discussion about using the `wiki_id`....
[08:47:27] o/ elukey: do you think we could merge the model-upload cleanup? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1190577
[08:57:30] bartosz: sorry totally forgot to reply, I added a comment but we should be close to merge
[08:59:42] elukey: no worries, thank you! responding now
[09:29:03] isaranto: CI/CD looks fine! We can close the task
[09:31:25] Machine-Learning-Team: Fix CI/CD on ml-pipelines repository - https://phabricator.wikimedia.org/T404717#11228298 (isarantopoulos) Open→Resolved
[11:42:12] ozge_: merged!
[11:51:56] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11228990 (kevinbazira) a:kevinbazira
[11:52:26] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11228997 (kevinbazira) Open→Resolved
[13:03:48] elukey: thank you for merging! <3
[13:08:04] bartosz: wrong ping earlier on, sorry!
[13:08:10] thanks a lot for the work!
[13:35:03] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[13:38:01] elukey: I presume after the install of the device plugin, the kubelet also needs to restart?
[13:41:04] klausman: good question, I am not 100% sure
[13:41:24] in theory the amd plugin contacts the kubelet saying "hey this is my unix socket"
[13:41:39] so it should be ok without a restart
[13:42:40] (CR) Bartosz Wójtowicz: [C:+2] "Merging, thank you for the review @achou@wikimedia.org!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[13:43:03] roger
[13:43:47] I've merged the patch, ran puppet agent on all GPU machines in eqiad and installed the plugin. I'll do codfw if nothing explodes in the next hour or so
[13:47:47] (Merged) jenkins-bot: articletopic: Add `page_id` parameter to the articletopic model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1191038 (https://phabricator.wikimedia.org/T371021) (owner: Bartosz Wójtowicz)
[13:48:22] ack
[13:48:33] checked logs on ml-serve1010, everything looks good
[13:49:01] as FYI there are some errors like 'Failed to read 'current_compute_partition' file at' but it is due to the fact that we don't have hw like mi300x on that host
[13:49:17] at the end it states the two GPUs etc.
[13:49:53] Ack, I was about to dig into that error
[13:54:41] I am not sure if we have any pod requesting a GPU in eqiad, if so it would be great to kill them and see if they are correctly re-created
[13:54:51] (basically the final proof that everything works)
[13:55:05] probably edit check?
[13:55:21] yep
[13:55:23] on it
[14:01:30] Looks good
[14:09:29] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11229556 (Eevans) >>! In T402984#11228032, @BWojtowicz-WMF wrote: > > [ ... ] > >> I think these are all quite reasonable....
[14:20:46] niceee
[15:01:32] Will do the updates/install in codfw
[15:11:35] Also looking good.
[15:12:35] Machine-Learning-Team, Essential-Work, Patch-For-Review: Upgrade the AMD GPU plugin for k8s to support MI300 GPUs - https://phabricator.wikimedia.org/T398600#11229795 (klausman) Open→Resolved This has been rolled out to both eqiad and codfw GPU machines and I restarted our one prod pod th...
[15:24:23] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11229862 (achou) @Eevans Yep, it's fine :)
[21:18:17] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11231304 (JTannerWMF)