[00:00:21] Machine-Learning-Team, Add-Link, Growth-Team: Make airflow-dag for addalink training pipeline output compatible with deployed model - https://phabricator.wikimedia.org/T388258#10685283 (leila) Moved this task from the Research board to the Machine Learning Team's board given that per our updated work...
[06:57:25] Lift-Wing, Machine-Learning-Team: LiftWing articlecountry model logs improper json in stderr - https://phabricator.wikimedia.org/T389768#10685548 (kevinbazira) @dcausse thank you for confirming in P74316#298792, that event logs are now produced as valid JSON as expected by Logstash. thank you for also l...
[07:20:42] (PS1) Kevin Bazira: events: remove newline from event log to prevent log splitting [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131887 (https://phabricator.wikimedia.org/T389768)
[07:40:05] (PS2) Kevin Bazira: events: remove newline from event log to prevent log splitting [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131887 (https://phabricator.wikimedia.org/T389768)
[07:46:35] (CR) DCausse: [C:+1] events: remove newline from event log to prevent log splitting [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131887 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:48:32] (CR) Kevin Bazira: [C:+2] "Thanks for the review, David :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131887 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:54:46] (Merged) jenkins-bot: events: remove newline from event log to prevent log splitting [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131887 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:55:33] o/ log splitting patch merged, but will deploy it on Monday as it affects multiple model-servers --^
[08:03:29] o/
[08:03:38] ack, you're doin the right thing :D
[10:11:14] o/
[10:11:21] quick one for knative - https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1131945
[10:17:38] \o
[10:36:47] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10685922 (achou) > * revision is for any revision of any page For the revertrisk model, revision is for any re...
[11:03:16] (PS8) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[11:03:27] (PS9) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[11:40:06] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298#10686084 (Samwalton9-WMF)
[11:40:36] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Wikipedia-Android-App-Backlog: Enable Revert Risk RecentChanges filter on id.wiki - https://phabricator.wikimedia.org/T365701#10686097 (Samwalton9-WMF) Stalled→Declined Per the new sc...
[12:19:41] (CR) Ilias Sarantopoulos: "Great work adding batching + pydantic validation!! I have some comments some of them are for discussion and one of them I think we should " [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:26:29] * isaranto afk lunch
[15:59:42] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10686779 (isarantopoulos) @klausman @hnowlan Do we know what caused the increased rate limit errors on t...
[16:04:00] isaranto: about --^ Hugh has fixed the issue
[16:04:10] so it should be ok to retry the patch
[16:06:42] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10686808 (hnowlan) >>! In T388269#10686778, @isarantopoulos wrote: > @klausman @hnowlan Do we know what...
[16:19:21] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10686852 (Ottomata) > vs. they can only get a score for the latest revision of a page (article country). Just...
[16:20:23] Thank you will try it Monday then!
[16:23:28] Good to know our change wasn't the actual culprit :)
[16:24:44] me too! that would have been even more inexplicable :D
[16:24:54] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10686873 (Ottomata) Another Q about revertrisk. Are visibilty settings relevant to possible revert risk? E.g...
[16:26:01] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10686875 (gkyziridis) **Edit-check Service with Kserve Batcher Update** - Pydantic post validation for each field - `Config.py` file for env var...
[16:26:28] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10686880 (gkyziridis) **Edit-check Service with Kserve Batcher Update** - Pydantic post validation for each field - `Config.py` file for env var...
[16:31:06] (PS10) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:32:14] (PS11) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[16:33:44] FIRING: LiftWingServiceErrorRate: ...
[16:33:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[16:34:49] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10686919 (Ottomata) Re naming thoughts: We currently have a `mediawiki.page_change.v1` stream, in which the e...
[16:35:09] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10686920 (Ottomata) cc also @gmodena for the (undefined) mediawiki entity stream naming convention discussion
[16:35:31] Folks I just pushed a new patch for review for edit-check batcher. It is a logic for handling wrong requests in the batch. --> https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1131045 . You can see the results in the phab task --> https://phabricator.wikimedia.org/T386100#10686880
[16:36:16] I will be OOO for the next 2 weeks.
[16:38:44] RESOLVED: LiftWingServiceErrorRate: ...
[16:38:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=plwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:32:53] ack George, dont worry we'll take care of this
[17:34:11] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10687151 (isarantopoulos) Awesome, thank you for taking care of this and thanks for the detailed response
[17:36:36] I've synced with George a couple of times over google meet today for the edit-check service --- I'll provide any additional info on the task but we're ok to proceed
[17:48:28] (PS12) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[17:48:39] (PS13) Gkyziridis: inference-services: edit-check GPU version for batch prediction. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1131045 (https://phabricator.wikimedia.org/T386100)
[21:27:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[21:27:49] Deployment reference-need-predictor-00009-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00009-deployment - ...
[21:27:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:33:55] Machine-Learning-Team, EditCheck: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10687934 (ppelberg)
[22:34:16] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10687935 (ppelberg)
[22:57:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[22:57:49] Deployment reference-need-predictor-00009-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00009-deployment - ...
[22:57:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:37:58] Machine-Learning-Team, EditCheck, VisualEditor, Editing-team (Tracking): Evaluate efficacy of Peacock Check model output (internal review) - https://phabricator.wikimedia.org/T384651#10688055 (SSalgaonkar-WMF) Open→Resolved