[00:21:24] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11257604 (Ahoelzl)
[01:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[05:06:43] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[06:06:16] Machine-Learning-Team, Essential-Work: Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator - https://phabricator.wikimedia.org/T406302#11258029 (kevinbazira) After enabling deferrable execution of the training operator to handle GPU resource contention, the following warning appear...
[06:46:54] good morning!
[06:47:09] good morning
[06:49:57] I have silenced the adminNG alerts for 1d, until we deploy the changes to these clusters
[07:49:27] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11258317 (OKarakaya-WMF)
[07:58:08] Machine-Learning-Team, Semantic Search: Semantic Search POC - In article QA - https://phabricator.wikimedia.org/T405359#11258345 (OKarakaya-WMF) hello @santhosh, Thank you for the comments. We can run the experiments on larger LLMs. I've checked that we can use some larger models (tested with gpt-oss:1...
[09:40:40] I've applied the pending changes to admin-ng. The alert should clear in 1-2h
[09:48:51] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11258859 (isarantopoulos) @Eevans Aiko has suggested a way to query for page_i...
[09:48:59] o/ lemme remove the silence then
[09:51:52] I deleted the silence
[09:56:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:01:28] FIRING: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:06:28] RESOLVED: [6x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:08:20] there we go
[10:54:06] klausman: o/ I am running the labeller binary on ml-staging2003 to test, no errors but no labels pop up in kubernetes too :D
[10:54:24] so I guess something is wrong, I'll check later
[10:54:28] lemme know if it is a problem
[10:54:44] ack, thank you!
[10:58:33] klausman: \o/
[10:58:35] ml-staging2003.codfw.wmnet Ready 476d v1.23.14 beta.amd.com/gpu.vram.64G=2,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ml-staging2003.codfw.wmnet,kubernetes.io/os=linux,node.kubernetes.io/disk-type=ssd,topology.kubernetes.io/region=codfw,topology.kubernetes.io/zone=row-b7
[10:58:40] this is only adding vram
[10:58:53] but the first label looks really nice!
[10:59:17] yes, indeed
[11:00:22] they don't advertise this https://github.com/ROCm/k8s-device-plugin/blob/master/cmd/k8s-node-labeller/main.go#L432C25-L432C37
[11:02:00] I mean it makes sense to only reply to queries about the node the labeler knows about, no?
[11:03:26] sure sure
[11:03:47] but maybe a note in the docs wouldn't have hurt
[11:03:56] I guess that we are the only ones running it in this way
[11:04:24] oh, now I see what you mean.
[11:04:34] yeah, it being undocumented is not great
[11:04:49] anyway, I'll add the binary to the debian package + systemd unit etc..
[11:04:59] excellent
[13:16:24] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11259736 (Ottomata) > Keying on it like this will require that you supply the...
[14:31:21] (PS1) Sbisson: Common log prefix for cache update code paths [research/recommendation-api] - https://gerrit.wikimedia.org/r/1194958 (https://phabricator.wikimedia.org/T406854)
[14:32:51] (CR) CI reject: [V:-1] Common log prefix for cache update code paths [research/recommendation-api] - https://gerrit.wikimedia.org/r/1194958 (https://phabricator.wikimedia.org/T406854) (owner: Sbisson)
[14:40:47] (PS2) Sbisson: Common log prefix for cache update code paths [research/recommendation-api] - https://gerrit.wikimedia.org/r/1194958 (https://phabricator.wikimedia.org/T406854)
[14:57:03] (CR) KartikMistry: [C:+2] Common log prefix for cache update code paths [research/recommendation-api] - https://gerrit.wikimedia.org/r/1194958 (https://phabricator.wikimedia.org/T406854) (owner: Sbisson)
[14:57:42] (Merged) jenkins-bot: Common log prefix for cache update code paths [research/recommendation-api] - https://gerrit.wikimedia.org/r/1194958 (https://phabricator.wikimedia.org/T406854) (owner: Sbisson)
[15:01:01] Machine-Learning-Team, Data-Persistence, Data-Persistence-Design-Review, Growth-Team, and 3 others: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task - https://phabricator.wikimedia.org/T401021#11260349 (Eevans) >>! In T401021#11258859, @isarantopoulos wrote: > @Eevans Ai...
[16:10:41] Machine-Learning-Team, LPL Hypothesis, Recommendation-API: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854#11260935 (SBisson) Tagging #machine-learning-team in case they can provide insight into what's happening with the `localhost:6500` proxy