[00:16:44] <jinxer-wm>	 FIRING: LiftWingServiceErrorRate: ...
[00:16:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[00:21:44] <jinxer-wm>	 RESOLVED: LiftWingServiceErrorRate: ...
[00:21:44] <jinxer-wm>	 LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=itwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:03:31] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687 (10kevinbazira) 03NEW
[06:16:10] <ozge_>	 good morning
[06:21:36] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146520 (10kevinbazira) I have worked on an [example DAG](https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/kevinbazira/cont...
[06:24:28] <wikibugs>	 06Machine-Learning-Team, 06Growth-Team, 10Revise-Tone-Structured-Task, 05Goal, 07OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11146527 (10achou) > 4. Articles with relatively few pageviews (WIP) >     - Ide...
[07:08:45] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146603 (10brouberol) Ok, so the error is `touch: cannot touch '/mnt/model-training/test_write.txt': Read-only file system`. The runuser has a `ui...
[08:01:53] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146668 (10kevinbazira) Thanks for the pointer, I have added  the pod `security_context` argument, but still getting the same error reported in T4...
[08:05:23] <wikibugs>	 06Machine-Learning-Team, 06Growth-Team, 10Revise-Tone-Structured-Task, 05Goal, 07OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11146671 (10MGerlach) >>! In T401968#11146527, @achou wrote: > @diego, do you kn...
[08:16:20] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146700 (10brouberol) Oh right, the output clearly states  that the directory is group-owned by runuser: `Access: (2775/drwxrwsr-x)  Uid: (    0/...
[08:21:05] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11146727 (10brouberol) I'm seeing the volume mounted in `readOnly` mode:  `lang=yaml     - mountPath: /mnt/model-training       name: airflow-ml-mod...
[08:35:16] <elukey>	 o/
[08:36:09] <elukey>	 re: ml-serve1012, I've run provisioning again, the first time without --uefi (by mistake) so some BIOS options were set, then I've run it again with --uefi to correct. Some reboots happened, and now I see the GPUs
[08:36:12] * elukey cries in a corner
[08:37:32] <elukey>	 amd-smi works fine, I see the 8 gpus
[08:44:06] <wikibugs>	 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697 (10elukey) 03NEW
[09:34:11] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11147040 (10BWojtowicz-WMF) > Why is it more versatile? @Eevans   I'll write down an example of request parameters and prediction we are generating:  Requ...
[09:41:46] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11147052 (10kevinbazira) @brouberol thanks a lot for the bug fix in: T403687#11146727. The issue we were experiencing in T403687#11146520 has been...
[09:43:43] <wikibugs>	 06Machine-Learning-Team: Automate copying of model training data files from Swift or HDFS to PVC for Airflow ML pipelines - https://phabricator.wikimedia.org/T403687#11147056 (10brouberol) Nice, that's good to hear!
[09:46:48] <bartosz>	 hello team, I think the patch adding the Cache mechanism to articletopics model could be ready for a 1st look by some brave reviewers https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1176448
[09:47:10] <bartosz>	 it’s quite a big one tho! In the commit message, I wrote the main changes and I’d recommend going over the changes with the same order as in the commit message to have a better understanding
[09:47:28] <bartosz>	 If someone would be interested, I’d also be happy to do a pair-review session, where I could do a small introduction to the changes and we could discuss everything live :D 
[09:53:48] <ozge_>	 hello, I have two small MRs. I've added descriptions in the MRs. https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/37 https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1651 Can you take a look when you have time? @kevinbazira
[09:54:25] <kevinbazira>	 ozge_: o/ ack...looking
[10:05:37] <isaranto>	 hi folks!
[10:11:16] <wikibugs>	 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11147141 (10elukey) I tried to follow [[ https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/quick-start-guide.html | this guide ]] on ml-serve1012, where Debi...
[10:17:27] <wikibugs>	 06Machine-Learning-Team: Experiment with amd-smi and the new AMD GPUs MI300x - https://phabricator.wikimedia.org/T403697#11147159 (10elukey) From the amd-smi's [[ https://github.com/ROCm/amdsmi/blob/amd-mainline/CHANGELOG.md#amd_smi_lib-for-rocm-612 | changelog ]], we have 6.1.2 in Debian and there is a ton of n...
[10:23:22] <wikibugs>	 06Machine-Learning-Team, 06Growth-Team, 10Revise-Tone-Structured-Task, 05Goal, 07OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11147176 (10diego) Thanks @MGerlach , I agree that is the main source of pagevie...
[10:36:16] <wikibugs>	 06Machine-Learning-Team: Revscoring editquality damaging - https://phabricator.wikimedia.org/T403709 (10gkyziridis) 03NEW
[11:04:00] <wikibugs>	 06Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging  - https://phabricator.wikimedia.org/T403709#11147306 (10isarantopoulos)
[11:25:24] <wikibugs>	 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147351 (10isarantopoulos) >Our theory of Istio being the only culprit seems not right, because I can see kserve's predict_ms values up to 5 seconds in some cases @elukey After George deployed the...
[11:31:18] <wikibugs>	 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147367 (10isarantopoulos) Apart from configuration differences we'd also need to check what else has changed in the service -- if anything at all -- that would justify these numbers.
[11:32:30] <wikibugs>	 06Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11147369 (10gkyziridis)
[11:58:12] <ozge_>	 Hello, 
[11:58:12] <ozge_>	 I want to share some updates about ml-pipelines repo:
[11:58:12] <ozge_>	 - I see that all pipelines are triggered when even one sub-gitlab trigger has an invalid path in the main gitlab ci. This should be fixed in the last MR.
[11:58:12] <ozge_>	 - I've updated the default settings of MRs to encourage squashing commits before merging. This should help to have a clearer commit history in the main branch.
[11:58:12] <ozge_>	 Thank you!
[12:02:30] <wikibugs>	 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147458 (10elukey) >>! In T403378#11147351, @isarantopoulos wrote: >>Our theory of Istio being the only culprit seems not right, because I can see kserve's predict_ms values up to 5 seconds in some...
[12:23:59] <wikibugs>	 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147516 (10isarantopoulos) Yes yes. Thanks for pasting the kserve logs. It is now clear that istio has nothing to do with this -- quite the opposite -- the istio dashboards report the real latencie...
[12:27:23] <wikibugs>	 06Machine-Learning-Team: Increased 5xx error rate in revscoring itwiki damaging - https://phabricator.wikimedia.org/T403709#11147531 (10gkyziridis)
[12:36:37] <wikibugs>	 06Machine-Learning-Team: Review Tone Check Latency SLO and its targets - https://phabricator.wikimedia.org/T403378#11147585 (10elukey) A simple and effective debug strategy could be to add logging about the payload received from the client, so that coupling high latency predict_ms with its client request becomes...
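(editor's sketch) The debug strategy in the T403378 comment above, logging the client payload next to a slow predict_ms, could look roughly like the following. This is a minimal illustration and not the Tone Check service code: `timed_predict`, the `slow_ms` threshold, and the logger name are hypothetical, and a real kserve model would hook this into its own predict path.

```python
import json
import logging
import time

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("predict_latency")


def timed_predict(predict_fn, payload, slow_ms=1000.0, max_logged_chars=500):
    """Run predict_fn(payload); log the payload if the call took >= slow_ms."""
    start = time.perf_counter()
    result = predict_fn(payload)
    predict_ms = (time.perf_counter() - start) * 1000.0
    if predict_ms >= slow_ms:
        # Truncate so a large request body does not flood the logs.
        body = json.dumps(payload, default=str)[:max_logged_chars]
        logger.warning("slow predict: predict_ms=%.1f payload=%s", predict_ms, body)
    return result, predict_ms
```

Logging only above a threshold, and truncating the body, keeps log volume bounded while still tying each high-latency value to the request that caused it.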
[13:05:30] <wikibugs>	 06Machine-Learning-Team, 06Growth-Team, 10Revise-Tone-Structured-Task, 05Goal, 07OKR-Work: Analyze samples of articles to see how many structured tasks we might be able to generate - https://phabricator.wikimedia.org/T401968#11147725 (10fkaelin) The knowledge gap pipeline (which is snapshot based) aggreg...
[14:10:07] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11148081 (10Eevans) >>! In T401778#11147040, @BWojtowicz-WMF wrote: >> Why is it more versatile? > @Eevans  >  > I'll write down an example of request par...
[14:33:54] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11148221 (10Eevans) So to (try to) make this a bit more concrete:  If you had...  `lang=sql CREATE TABLE table (   page_title    text,   wiki          tex...
[17:20:58] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149206 (10Ottomata) > But again... 64 results isn't a lot, so if you want to elide such indexing in favor of late-filtering that's OK too.  If this is o...
[18:04:47] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149393 (10Eevans) >>! In T401778#11149206, @Ottomata wrote: >> But again... 64 results isn't a lot, so if you want to elide such indexing in favor of la...
[18:38:03] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149545 (10Ottomata) > I think my confusion stems from the idea of having a threshold value that is only ever 0.5. I don't think it is expected to always...
[18:58:16] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149627 (10Eevans) >>! In T401778#11149545, @Ottomata wrote: >  > [ ... ] >  >> Storing all of the predictions and their corresponding score could be don...
[20:06:09] <wikibugs>	 06Machine-Learning-Team: Evaluate adding caching mechanism for article topic model to make data available at scale - https://phabricator.wikimedia.org/T401778#11149872 (10Ottomata) I see yah, then either of those options is good.  From a usage perspective it is the same: I can get the full prediction either way...