[06:25:21] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11200120 (10kevinbazira) As mentioned in T404722#11188124, the tone-check training data generation job completed after running for >9.5hrs. At that time, it wa...
[06:44:40] good morning!
[07:03:42] good morning!
[07:09:59] o/
[07:10:46] isaranto: kalimera :) I have an explanation about why Pyrra shows that error budget graph for tone check's latency, I think the data is correct
[07:11:49] \o buon giorno! curious to hear about it!
[07:13:58] I wanted to ask you for a quick chat on meet but probably if I try to explain everything in here it may reach a broader audience
[07:15:17] so let's start from https://slo.wikimedia.org/objectives?expr={__name__=%22tonecheck-latency-v1%22,%20revision=%221%22,%20service=%22tonecheck%22,%20team=%22ml%22}&grouping={}&from=now-4w&to=now
[07:15:39] the SLO requires HTTP 200 responses to be served in under 1s 90% of the time
[07:17:11] good morning
[07:17:35] (2 mins and I'll continue)
[07:22:23] all right
[07:22:57] so I expected a window to start from a 100% remaining error budget, and decrease over time until it reaches its final value for the window
[07:24:55] and we have two kinds of windows: rolling and calendar
[07:25:26] the rolling is dynamic, and it is what Pyrra offers: you have your error budget calculated dynamically in a timespan from one month ago to now
[07:25:52] the calendar is fixed and corresponds to our 3-month quarter (shifted earlier by one month etc..)
[07:26:48] Due to how Prometheus etc.. work, it is easier to establish relative time frames rather than precise ones
[07:27:26] so the above graph needs to be read as "every datapoint shows the remaining error budget at the end of the window, calculated from one month before that datapoint to it"
[07:28:11] tone check was failing the SLO promises very often until it went on a GPU, and this is why the error budget is slowly trending up to 100%
[07:28:46] and the grafana view is the same, since it is the same data over 3 months
[07:29:40] at some point the days where tone check was failing the SLO will be older than a month, and at that point probably the error budget will be at 100%
[07:32:20] roger that
[07:33:47] thanks for the explanation. I understand the 4w rolling window metrics and why they are improving but what about the calendar ones? does this mean that we can't have metrics for a predefined period of time?
[07:34:09] yeah this is the current limitation
[07:34:26] Pyrra doesn't support calendar windows natively
[07:34:55] it seems easy but the calculation is not very straightforward, you need to do some Grafana tricks to make it happen
[07:35:04] and use different ranges in prometheus etc..
[07:35:36] We are going to explore https://sloth.dev/ that may give us a better calendar view
[07:39:01] now I'd really love to have your honest feedback on the tool
[07:39:19] it will surely drive me/us to have a better solution by end of Q2
[07:43:41] It definitely does its job as we have metrics on how the service is performing. What we are lacking at the moment is to be able to extract metrics for a fixed time window, which we wanted to do for the A/B test that is running now
[07:44:10] bartosz: o/ can you drive the deployment of this patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189871?
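Editor's note: the rolling-window reading given at 07:27:26 (each datapoint is the budget left for the month ending at that datapoint) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Pyrra computes it: the function name, the per-day tuples, and the traffic numbers are all hypothetical; only the 90% target and the 4w window come from the chat.

```python
# Illustrative sketch only: names and numbers below are assumptions, not Pyrra code.

SLO_TARGET = 0.90   # 90% of HTTP 200 responses under 1s (from the chat)
WINDOW_DAYS = 28    # the 4w rolling window discussed above


def remaining_budget(daily, end_index, window_days=WINDOW_DAYS, target=SLO_TARGET):
    """Remaining error budget (1.0 = untouched, below 0 = overspent) for the
    window of `window_days` days ending at `end_index` (inclusive).

    `daily` is a list of (total_requests, good_requests) tuples, one per day.
    """
    start = max(0, end_index - window_days + 1)
    window = daily[start:end_index + 1]
    total = sum(t for t, _ in window)
    good = sum(g for _, g in window)
    if total == 0:
        return 1.0  # no traffic, nothing burned
    allowed_bad = (1.0 - target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Hypothetical history: 20 days missing the SLO, then 40 healthy days.
history = [(1000, 800)] * 20 + [(1000, 990)] * 40
series = [remaining_budget(history, day)
          for day in range(WINDOW_DAYS - 1, len(history))]
print(f"first full window: {series[0]:.2f}, latest window: {series[-1]:.2f}")
```

With this made-up history the series starts below zero and climbs as the failing days age out of the window, which matches the upward trend described in the chat. It levels off just under 100% because the healthy days still burn a little budget.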
[07:45:22] isaranto: one possible way to progress the A/B testing could be to use the fixed interval in grafana, taking into account the initial value of the error budget
[07:45:37] bartosz: you could do this in staging first and then on codfw since it doesn't have traffic. This is temporary and will change after the DC switchover
[07:45:52] for example, if you start at 20% and you keep a positive trend, it means that you are not burning anything
[07:46:28] even with sloth, doing arbitrary fixed windows will not be possible
[07:47:27] or better, we'd need to spend a lot of time on it and I am not 100% sure if it will give you the guarantee that you want (I guess the A/B test should demonstrate that you're not degrading the user experience)
[07:47:37] what we needed for the A/B test is to answer the question: "Proportion of all requests that return a response within 1000 milliseconds"
[07:48:23] so this is something that cannot be achieved this way, so we should either change the metrics we will report or try to get that information using specific queries on thanos
[07:50:23] isaranto: o/ happy to drive it! however, we've already deployed it on staging, right? So now it's codfw and then eqiad, but first I'll run load-tests against staging
[07:51:33] isaranto: if the goal is to demonstrate that the user experience is not degraded, the SLO should be ok to be used (and maybe a nice experiment). Coming up with a proportion in grafana using native metrics should be easy, and we could compare the two results and see how/if they differ and what's easier to read
[07:52:35] if you like the idea I am available, it ties in very well with what I am working on
[07:53:18] bartosz: ok, let's also do the sync on staging though after the merge to see if the transformer will go away
[07:54:48] isaranto: I agree, let's do it this way!
[07:55:46] bartosz: shall I merge it then?
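Editor's note: the A/B question quoted at 07:47:37 ("Proportion of all requests that return a response within 1000 milliseconds") over a fixed interval reduces to a ratio of counter differences. The Python sketch below assumes the latency data is a cumulative histogram sampled at the start and end of the test. The bucket bounds, the counts, and both function names are hypothetical, and counter resets are ignored for simplicity.

```python
# Illustrative sketch only: bucket bounds and counts are made-up assumptions.

INF = float("inf")


def window_counts(start_snapshot, end_snapshot):
    """Per-bucket request counts accumulated between two snapshots of a
    cumulative latency histogram (counter resets are ignored here)."""
    return {le: end_snapshot[le] - start_snapshot[le] for le in end_snapshot}


def proportion_within(buckets, threshold_s):
    """Share of requests answered within `threshold_s` seconds.

    `buckets` maps each upper bound in seconds to the number of requests at
    or below it; the INF bucket holds the total. Returns None with no traffic.
    """
    if threshold_s not in buckets:
        raise ValueError(f"no bucket with upper bound {threshold_s}s")
    total = buckets[INF]
    if total == 0:
        return None
    return buckets[threshold_s] / total


# Hypothetical snapshots taken at the start and end of the A/B test.
at_start = {0.5: 100, 1.0: 150, INF: 200}
at_end = {0.5: 900, 1.0: 1050, INF: 1200}
share = proportion_within(window_counts(at_start, at_end), 1.0)
print(f"{share:.1%} of requests returned within 1000 ms")
```

The answer is only exact when a bucket boundary sits at the threshold, so a 1s bucket has to exist in the real histogram. Without one, the proportion would have to be interpolated between neighbouring buckets.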
[07:55:53] +1'd the patch, I can deploy once we merge
[07:56:28] isaranto: yes please <3
[07:58:29] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11200338 (10isarantopoulos) 05Open→03Resolved
[07:59:43] elukey: I do like the idea!
[08:01:28] FIRING: HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-staging-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://grafana.wikimedia.org/d/d15d3135-ff1c-4c6f-bebe-ee57b136df70/helmfile-admin-ng-pending-changes?orgId=1&var-kubernetes_cluster=ml-staging-codfw - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[08:02:48] isaranto: helm shows empty diff after merging this change, running sync also doesn't change anything with the deployments, we might need to terminate transformer manually after all :(
[08:03:08] ack!
[08:04:01] ahh we still have the `transformer: null` in the staging config. let's also remove this first, creating a patch
[08:07:37] isaranto: If you'd have a free second https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1190185 🥺
[08:15:23] isaranto: thank you for the quick review! it still did not clean up the transformer :( klausman would you have some free time to terminate the `outlink-topic-model-transformer-default-00030-deployment` from `articletopic-outlink` ns on `ml-staging-codfw`?
[08:20:27] morning! and yes, will do
[08:21:30] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11200463 (10OKarakaya-WMF)
[08:21:57] klausman: thank you!
[08:23:27] I see that the deployment got re-created automatically :( So the isvc probably still manages and re-creates it
[08:34:40] yeah, I had to completely destroy the release and re-make it.
[08:35:00] It seems the transformer was correctly removed from `spec`, but still existed within `status.components` in the isvc, possibly this made the kserve recreate it?
[08:35:12] klausman: thanks a lot!
[08:35:34] There probably would have been a more elegant way of doing it, but this was quick and it's only staging
[08:50:12] 06Machine-Learning-Team, 05Goal, 13Patch-For-Review: Merge articletopic outlink model transformer and predictor pods - https://phabricator.wikimedia.org/T404294#11200576 (10BWojtowicz-WMF) In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the `transform...
[08:52:23] klausman: we'd also like to deploy the same change on production today, which probably will result in similar behavior. Is it reasonable to also re-create the isvc on prod to fix this?
[08:52:39] probably, yes
[08:53:26] but let me see if there's a better way to do the cleanup
[08:55:52] oki, I can be ready to deploy anytime, would do codfw first
[08:56:37] ack, gimme like 5-10 to see if there's something else to try, and at worst, we'll have to drain the service
[08:57:04] (as in, update codfw, drain eqiad, update eqiad, undrain)
[08:57:59] sure, let me know when ready 🙌
[08:59:19] there are also some changes in admin_ng pending for ml-staging-codfw. https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[08:59:51] are these the same as last week's -- experimental ns limitranges?
[09:00:55] No, those are just external services IP range changes
[09:00:59] I'll deploy it
[09:01:10] ok thanks!
[09:01:19] oh and happy equinox!
[09:08:09] bartosz: ok, I got a plan™, let's go
[09:09:12] klausman: sweet, starting with codfw now
[09:09:52] synced
[09:12:54] It looks like the new release still contains a transformer. Are we missing a chart change?
[09:13:48] hmm the diff showed that it removes the transformer correctly
[09:14:00] I also don't see it inside `spec` field of the isvc anymore
[09:14:18] and yet, release 30 has a transformer
[09:15:52] er release 29
[09:16:07] I think we had release 29 of transformer and release 28 of predictor before my sync
[09:16:18] And now they are both 29, because only predictor got bumped
[09:16:27] ah yes.
[09:16:28] 🙈
[09:19:35] Ok, new deployment running correctly now
[09:20:21] oki, I can see it. Will confirm that I can query it
[09:21:04] responses look good
[09:21:54] should we do eqiad now?
[09:22:24] Let me drain it first. Queries will go to codfw while we work
[09:22:54] sounds perfect
[09:27:59] Alright, drain complete
[09:28:34] thank you! starting the sync on eqiad
[09:29:07] synced
[09:31:19] ok, deployment looks good
[09:31:44] I see it and responses look good when querying directly from within vpc
[09:32:39] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11200848 (10OKarakaya-WMF) About the release of new wikis that are above the release threshold in v2 an...
[09:33:10] vpc?
[09:33:52] I meant querying the service directly from statbox and not through API GW
[09:33:58] ah, righto.
[09:34:02] Will undrain eqiad
[09:35:34] perfect, thank you a lot! <3
[09:35:49] will send a couple of requests through api gw once we undrain to make sure everything works
[09:36:12] currently waiting for DNS changes to propagate, should take <5m
[09:39:21] and done
[09:40:54] oki, getting the responses and can see them in the logs on eqiad
[09:41:00] \o/
[09:41:36] klausman: thank you again!
[09:41:43] np :)
[09:49:05] ROCm 7.0 is here :D https://www.amd.com/en/products/software/rocm/whats-new.html
[10:06:28] RESOLVED: HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-staging-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://grafana.wikimedia.org/d/d15d3135-ff1c-4c6f-bebe-ee57b136df70/helmfile-admin-ng-pending-changes?orgId=1&var-kubernetes_cluster=ml-staging-codfw - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[10:24:46] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 10SRE-SLO, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#11200929 (10elukey) To keep archives happy, I added a more detailed explanation of the current limits that Pyrra sho...
[10:33:08] (03PS2) 10Nikerabbit: Downgrade unknown prefix from error to debug [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[10:33:39] (03CR) 10Nikerabbit: Downgrade unknown prefix from error to debug (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[10:34:38] (03CR) 10Nikerabbit: [C:03+1] Warn when a page collection contains no valid links [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189872 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[11:32:24] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11201072 (10Michael) >>! In T404460#11200848, @OKarakaya-WMF wrote: > About the release of new wikis th...
[11:48:07] (03CR) 10AikoChou: [C:03+1] nsfw: remove blubber images and code [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1189846 (https://phabricator.wikimedia.org/T405083) (owner: 10Ilias Sarantopoulos)
[12:33:33] 06Machine-Learning-Team, 10Wikimedia-GitHub: Add ML team members to WMF GitHub organization - https://phabricator.wikimedia.org/T405222 (10isarantopoulos) 03NEW
[12:44:49] 06Machine-Learning-Team, 10Wikimedia-GitHub: Add ML team members to WMF GitHub organization - https://phabricator.wikimedia.org/T405222#11201260 (10Jdforrester-WMF) I've invited all three of them to the Wikimedia org, plus added to the ML team. Please shout if you need anything more!
[12:45:11] 06Machine-Learning-Team, 10Wikimedia-GitHub: Add ML team members to WMF GitHub organization - https://phabricator.wikimedia.org/T405222#11201261 (10Jdforrester-WMF) 05Open→03Resolved a:03Jdforrester-WMF
[12:52:59] (03CR) 10Eamedina: [C:03+1] Downgrade unknown prefix from error to debug [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[12:54:23] (03CR) 10Eamedina: [C:03+2] Warn when a page collection contains no valid links [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189872 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[12:55:06] (03Merged) 10jenkins-bot: Warn when a page collection contains no valid links [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189872 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[12:58:03] 06Machine-Learning-Team, 07Essential-Work: Incorporate notebook into Tone-Check data generation ml-pipeline - https://phabricator.wikimedia.org/T404722#11201309 (10kevinbazira) As part of investigating why the DAG in T404722#11200120 failed to complete the data splitting step, I wanted to confirm whether reduc...
[14:20:17] 06Machine-Learning-Team, 06Data-Persistence, 10Data-Persistence-Design-Review: Data Persistence Design Review: Article topic model caching - https://phabricator.wikimedia.org/T402984#11201671 (10Ottomata) > So that would be enwiki instead of en For your data storage, yes! For your API/UI parameters, whateve...
[14:50:17] 10Lift-Wing, 06Machine-Learning-Team: prediction_classification_change stream schema change causes model server failures - https://phabricator.wikimedia.org/T405067#11201781 (10Ottomata) Thank you! And I am sorry about this! I wonder if in the future this could be avoided by either fully constructing t...
[15:04:16] (03PS3) 10Sbisson: Downgrade unknown prefix from error to debug [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976)
[15:05:08] (03CR) 10Sbisson: Downgrade unknown prefix from error to debug (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[15:15:51] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#11201947 (10Samwalton9)...
[15:16:53] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Growth-Team, 06Wikipedia-Android-App-Backlog: CRS: Community rollout plan and discussion about adding revertrisk to RecentChanges filters - https://phabricator.wikimedia.org/T352217#11201956 (10Samwalton9-WMF) 05Open→03Declined We deployed to...
[15:22:25] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11201993 (10Samwalton9-WMF)
[15:27:34] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11202022 (10Samwalton9-WMF)
[15:36:01] (03PS1) 10Sbisson: Decrease max number of connections from 20 to 10 [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1190299 (https://phabricator.wikimedia.org/T405004)
[15:49:16] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11202160 (10Samwalton9-WMF)
[15:49:57] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11202167 (10Samwalton9-WMF) @Ladsgroup @tstarling We'd like to mov...
[16:10:02] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202367 (10KStoller-WMF)
[16:12:34] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202372 (10Sgs)
[16:17:25] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202386 (10KStoller-WMF)
[16:19:11] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202392 (10KStoller-WMF)
[16:19:20] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202393 (10KStoller-WMF)
[16:31:41] 06Machine-Learning-Team, 10MediaWiki-extensions-ORES, 06Moderator-Tools-Team, 06Wikipedia-Android-App-Backlog, 05WE4.2 Anti-abuse: Deploy Revert Risk (language agnostic) filter to all Wikipedias - https://phabricator.wikimedia.org/T348298#11202425 (10Ladsgroup) From my side: as long as the other rc model...
[17:17:29] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Introduce case sensitivity to machine learning model for Add a Link - https://phabricator.wikimedia.org/T405185#11202614 (10KStoller-WMF)
[18:51:16] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202839 (10KStoller-WMF)
[19:01:28] FIRING: [2x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[19:03:06] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202867 (10KStoller-WMF)
[19:05:55] 06Machine-Learning-Team, 10Add-Link-Structured-Task, 06Growth-Team: Add a Link: Rollout "Add a Link" Structured Task to Wikipedias that are supported by V2 model - https://phabricator.wikimedia.org/T404460#11202877 (10KStoller-WMF) @OKarakaya-WMF Sorry, I had some conflicting details in the task description....
[20:06:28] FIRING: [4x] HelmfileAdminNGPendingChangesLiftWing: Pending admin_ng changes on ml-serve-codfw - https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d%2Fadmin_ng - https://alerts.wikimedia.org/?q=alertname%3DHelmfileAdminNGPendingChangesLiftWing
[20:07:06] (03CR) 10Nikerabbit: [C:03+2] Downgrade unknown prefix from error to debug (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)
[20:07:45] (03Merged) 10jenkins-bot: Downgrade unknown prefix from error to debug [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1189865 (https://phabricator.wikimedia.org/T404976) (owner: 10Sbisson)