[00:16:44] FIRING: LiftWingServiceErrorRate: ...
[00:16:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[00:21:44] RESOLVED: LiftWingServiceErrorRate: ...
[00:21:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:03:31] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687 (kevinbazira) NEW
[06:16:10] good morning
[06:21:36] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146520 (kevinbazira) I have worked on an [example DAG](https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/kevinbazira/cont...
[06:24:28] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11146527 (achou) > 4. Articles with relatively few pageviews (WIP) > - Ide...
[07:08:45] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146603 (brouberol) Ok, so the error is `touch: cannot touch '/mnt/model-training/test_write.txt': Read-only file system`. The runuser has a `ui...
[08:01:53] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146668 (kevinbazira) Thanks for the pointer, I have added the pod `security_context` argument, but still getting the same error reported in T4...
[08:05:23] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11146671 (MGerlach) >>! In T401968#11146527, @achou wrote: > @diego, do you kn...
[08:16:20] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146700 (brouberol) Oh right, the output clearly states that the directory is group-owned by runuser: `Access: (2775/drwxrwsr-x) Uid: ( 0/...
[08:21:05] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146727 (brouberol) I'm seeing the volume mounted in `readOnly` mode: `lang=yaml - mountPath: /mnt/model-training name: airflow-ml-mod...
[08:35:16] o/
[08:36:09] re: ml-serve1012, I've run provisioning again, the first time without --uefi (by mistake) so some bios options were set, then I've run it again with --uefi to correct. Some reboots happened, and now I see the gpus
[08:36:12] * elukey cries in a corner
[08:37:32] amd-smi works fine, I see the 8 gpus
[08:44:06] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697 (elukey) NEW
[09:34:11] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11147040 (BWojtowicz-WMF) > Why is it more versatile? @Eevans I'll write down an example of request parameters and prediction we are generating: Requ...
[09:41:46] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11147052 (kevinbazira) @brouberol thanks a lot for the bug fix in: T403687#11146727. The issue we were experiencing in T403687#11146520 has been...
[09:43:43] Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11147056 (brouberol) Nice, that's good to hear!
[09:46:48] hello team, I think the patch adding the cache mechanism to the articletopics model could be ready for a 1st look by some brave reviewers https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1176448
[09:47:10] it’s quite a big one tho! In the commit message, I wrote the main changes and I’d recommend going over the changes in the same order as in the commit message to have a better understanding
[09:47:28] If someone is interested, I’d also be happy to do a pair-review session, where I could do a small introduction to the changes and we could discuss everything live :D
[09:53:48] hello, I have two small MRs. I've added descriptions in the MRs. https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/37 https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1651 Can you take a look when you have time? @kevinbazira
[09:54:25] ozge_: o/ ack...looking
[10:05:37] hi folks!
[10:11:16] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11147141 (elukey) I tried to follow [[ https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/quick-start-guide.html | this guide ]] on ml-serve1012, where Debi...
[10:17:27] Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11147159 (elukey) From the amd-smi's [[ https://github.com/ROCm/amdsmi/blob/amd-mainline/CHANGELOG.md#amd_smi_lib-for-rocm-612 | changelog ]], we have 6.1.2 in Debian and there is a ton of n...
[10:23:22] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11147176 (diego) Thanks @MGerlach , I agree that is the main source of pagevie...
[10:36:16] Machine-Learning-Team: Revscoring editquality damaging - https://phabricator.wikimedia.org/T403709 (gkyziridis) NEW
[11:04:00] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11147306 (isarantopoulos)
[11:25:24] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147351 (isarantopoulos) >Our theory of Istio being the only culprit seems not right, because I can see kserve's predict_ms values up to 5 seconds in some cases @elukey After George deployed the...
[11:31:18] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147367 (isarantopoulos) Apart from configuration differences we'd also need to check what else has changed in the service -- if anything at all -- that would justify these numbers.
[11:32:30] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11147369 (gkyziridis)
[11:58:12] Hello,
[11:58:12] I want to share some updates about the ml-pipelines repo:
[11:58:12] - I see all pipelines are triggered when a sub-gitlab trigger has an invalid path in the main gitlab ci, even if only one sub-gitlab trigger is incorrect. This should be fixed in the last MR.
[11:58:12] - I've updated the default settings of MRs to encourage-squash-commits before merging. This should help to have a clearer commit history in the main branch.
[11:58:12] Thank you!
[12:02:30] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147458 (elukey) >>! In T403378#11147351, @isarantopoulos wrote: >>Our theory of Istio being the only culprit seems not right, because I can see kserve's predict_ms values up to 5 seconds in some...
[12:23:59] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147516 (isarantopoulos) Yes yes. Thanks for pasting the kserve logs. It is now clear that istio has nothing to do with this -- quite the opposite -- the istio dashboards report the real latencie...
[12:27:23] Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11147531 (gkyziridis)
[12:36:37] Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147585 (elukey) A simple and effective debug strategy could be to add logging about the payload received from the client, so that coupling high latency predict_ms with its client request becomes...
[13:05:30] Machine-Learning-Team, Growth-Team, Revise-Tone-Structured-Task, Goal, OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11147725 (fkaelin) The knowledge gap pipeline (which is snapshot based) aggreg...
[14:10:07] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11148081 (Eevans) >>! In T401778#11147040, @BWojtowicz-WMF wrote: >> Why is it more versatile? > @Eevans > > I'll write down an example of request par...
[14:33:54] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11148221 (Eevans) So to (try to) make this a bit more concrete: If you had... `lang=sql CREATE TABLE table ( page_title text, wiki tex...
[17:20:58] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149206 (Ottomata) > But again... 64 results isn't a lot, so if you want to elide such indexing in favor of late-filtering that's OK too. If this is o...
[18:04:47] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149393 (Eevans) >>! In T401778#11149206, @Ottomata wrote: >> But again... 64 results isn't a lot, so if you want to elide such indexing in favor of la...
[18:38:03] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149545 (Ottomata) > I think my confusion stems from the idea of having a threshold value that is only ever 0.5. I don't think it is expected to always...
[18:58:16] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149627 (Eevans) >>! In T401778#11149545, @Ottomata wrote: > > [ ... ] > >> Storing all of the predictions and their corresponding score could be don...
[20:06:09] Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149872 (Ottomata) I see yah, then either of those options is good. From a usage perspective it is the same: I can get the full prediction either way...