[02:43:19] (CR) Kevin Bazira: [C:+1] "Thank you for working on this, Ilias. I tested the patch and it LGTM: https://phabricator.wikimedia.org/P74553" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133183 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[03:35:45] Machine-Learning-Team, ContentTranslation: Content Translation Recommendations API - https://phabricator.wikimedia.org/T293648#10702152 (Pppery)
[06:38:44] FIRING: LiftWingServiceErrorRate: ...
[06:38:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:43:44] RESOLVED: LiftWingServiceErrorRate: ...
[06:43:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=ptwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:25:29] good morning!
[07:28:22] Good morning
[07:37:11] isaranto: o/
[07:37:26] I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133315 to verify that we should now be good
[07:37:58] I targeted a low-traffic isvc, so in case of issues the blast radius should be very limited
[07:38:05] but I am confident it will work
[08:04:12] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10702417 (achou) Sounds great! Thanks for all the input. We agreed that for this new stream, only `mediawiki.p...
[08:20:49] \o hope it works!
[09:20:50] all right, trying to deploy :)
[09:23:30] ok, so as before, the pods are not refreshed
[09:23:39] so I explicitly deleted bnwiki
[09:25:45] the pod doesn't come up, sigh
[09:26:59] fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
[09:27:19] this is unexpected
[09:31:21] ok, all reverted; now I get why we needed to kill the pod, knative didn't like the new setting
[09:31:53] ok, so I think we need to couple the PSS deployment for knative and kserve
[09:36:19] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10702832 (elukey) I tried to deploy https://gerrit.wikimedia.org/r/1133315 to a single NS on ml-serve-codfw, and ended up with pods not ge...
[09:44:07] ack, thanks Luca. So if all were ok, the pod should pick up the changes and restart, right?
[10:10:44] exactly, yes
[10:26:24] I'm running some load tests on a new service and hitting the kserve rate limits \o/
[10:26:56] istio starts blocking some requests and sends 429.
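[Editor's note] For context on the webhook denial above: Knative's validation webhook rejects pod-level `securityContext` fields when they appear on the revision template of an existing Knative Service. A hypothetical sketch of the kind of PSS-driven default that would trip it (the field paths come from the error message; the surrounding manifest and values are illustrative, not the actual chart output):

```yaml
# Sketch only: this pod-level securityContext, if injected into the
# Knative revision template, is rejected by
# validation.webhook.serving.knative.dev with
# "must not set the field(s): spec.template.spec.securityContext, ...seccompProfile"
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: bnwiki-damaging   # hypothetical isvc-backed service name
spec:
  template:
    spec:
      securityContext:           # rejected field
        seccompProfile:          # rejected field
          type: RuntimeDefault
```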
I'm just noting this as something positive cause we are able to serve close to 100rps! [10:27:46] ok I just found that the rate limit value is 100 rps! [10:28:56] is this the point where we set it? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/ml-serve.yaml#744 [10:43:13] aiko: is it ok if I merge this https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1133183 [10:43:13] just mentioning it because it has a couple of changes for the edit check service so you could use them while working on the shap values for the service [11:24:30] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 13Patch-For-Review: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10703212 (10isarantopoulos) I ran a load test with kserve batcher in ml-staging with the following batching configuration ` predictor: batcher:... [11:26:39] isaranto: yes I think so (re: rps) [11:34:44] isaranto: thanks for mentioning it! LGTM [11:35:01] ok merging! [11:35:08] (03CR) 10Ilias Sarantopoulos: [C:03+2] edit_check: add unit tests [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1133183 (https://phabricator.wikimedia.org/T386100) (owner: 10Ilias Sarantopoulos) [11:35:36] I wanna highlight this from the above load tests for edit-check `The service successfully serves 96 rps with latency being 74 ms @ the 99th percentile` [11:35:40] :tada [11:35:44] 🎉 [11:37:47] woohoo \o/ [11:39:14] very nice! 
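[Editor's note] The batching configuration quoted in the load-test comment above is truncated in the log. For illustration, a KServe `InferenceService` batcher stanza generally has the shape below; the service name and values here are hypothetical, not the ones used in ml-staging:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: edit-check            # hypothetical name
spec:
  predictor:
    batcher:
      maxBatchSize: 16        # hypothetical: max requests folded into one batch
      maxLatency: 500         # hypothetical: max ms to wait while filling a batch
    containers:
      - name: kserve-container
        image: example/edit-check:latest   # placeholder image
```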
[11:44:03] (Merged) jenkins-bot: edit_check: add unit tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133183 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[11:45:06] I've already enabled batching on the service in experimental, and I sent a follow-up patch to reflect those changes in the repo https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133364
[12:12:34] I can confirm the source of the 100 rps limit, we added it at the time as a local "shield" for every pod - it is an envoy filter, nothing more
[12:12:44] and it acts separately for every pod
[12:12:50] we can remove it if not needed
[12:13:19] yeah, IIRC, we basically thought we should have _some_ last line of defense, and 100 qps seemed like a reasonable zeroth approximation
[12:13:34] I'd not remove it entirely, but maybe bump it up to 500
[12:13:53] it is like not having it though with 500 rps :D
[12:14:00] There's that
[12:14:22] it is also easy to just disable and re-enable if needed, as a last line of defense
[12:14:54] yeah. And I guess more external (API GW, CDN-level) ratelimits would still protect us well enough
[12:15:16] isaranto: want me to make a patch for the rps limit?
[12:19:54] ack, thanks! klausman we definitely don't need to serve so many requests at the moment. Perhaps we could temporarily set it to 500 rps in staging to run a load test and then turn it back to 100. wdyt?
[12:20:30] all three options (100, 500, ∞) work for me :)
[12:24:18] my 2c - 500 is not worth it since the pod will be long gone before envoy helps
[12:31:03] Removing the filter entirely would let us find out where the tipover point is, even if it happens to be >500 rps
[12:31:57] (plus it deletes more YAML, which is always good ;) )
[12:55:51] wouldn't it be worth keeping the rate limit as an extra layer of security for internal requests?
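[Editor's note] The per-pod "shield" discussed above is a local rate limit, which conceptually behaves like a token bucket refilled at the configured rps. A minimal Python sketch of that mechanism (an illustration of the idea, not Envoy's actual implementation):

```python
class TokenBucket:
    """Toy model of a per-pod local rate limit (e.g. 100 requests/second)."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s       # tokens added per second
        self.capacity = burst        # maximum burst size
        self.tokens = float(burst)   # bucket starts full
        self.last = 0.0              # timestamp of the last request

    def allow(self, now: float) -> bool:
        """Return True if a request at time `now` is admitted, else it gets a 429."""
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# With a 100 rps limit and a burst of 100, the first 100 instantaneous
# requests are admitted and the 101st is rejected.
bucket = TokenBucket(rate_per_s=100, burst=100)
results = [bucket.allow(0.0) for _ in range(101)]
print(results.count(True))   # → 100
print(results[-1])           # → False
```

This also illustrates the point made at [12:58:11]: the bucket is per pod, so a limit tuned for the fastest pod leaves slower pods unprotected.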
[12:57:43] my point is - either we keep it tuned at a reasonable level, or we don't :D
[12:58:11] if you think about it, if you tune it for the highest-performing pod, then you can freely hammer the other ones and not get a 429
[12:58:25] so it could make sense to see if we can tune it per namespace
[13:05:14] Machine-Learning-Team, LDAP-Access-Requests, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855 (isarantopoulos) NEW
[13:59:31] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10703866 (isarantopoulos) As a follow-up, I've run two load tests for the public API to see what happens in two cases: from my local machine & from a...
[14:00:08] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10703868 (Ottomata) I personally prefer the general purpose `page` name as well. Article is a special case. A...
[14:00:59] Machine-Learning-Team, LDAP-Access-Requests, SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10703895 (isarantopoulos)
[14:05:24] (PS1) Kevin Bazira: Makefile: add support for edit-check [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133416 (https://phabricator.wikimedia.org/T386100)
[14:06:14] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10703937 (kostajh) >>!
In T326179#10703868, @Ottomata wrote: > I personally prefer the general purpose `page`...
[14:11:10] (CR) Kevin Bazira: "you can test this using:" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133416 (https://phabricator.wikimedia.org/T386100) (owner: Kevin Bazira)
[14:24:04] (PS1) AikoChou: edit-check: add shap values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984)
[14:24:51] (CR) CI reject: [V:-1] edit-check: add shap values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984) (owner: AikoChou)
[14:28:57] (PS2) AikoChou: edit-check: add shap values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984)
[14:46:33] (PS3) AikoChou: edit-check: add shap values [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1133426 (https://phabricator.wikimedia.org/T387984)
[14:56:37] elukey: I was in meetings and never replied. We'll just remove the rate limit to run the load tests and then add it back. After that, we're going to look into adding per-namespace rate limits in the future (since it is not something that we need at the moment)
[14:59:47] super
[16:15:15] going afk, folks, have a nice evening!
[16:26:09] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10704864 (achou) > Does RevertRisk work for non-main-namespace revisions? No, it only works for Wikipedia main...
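[Editor's note] For reference on the per-pod limit and the per-namespace idea discussed above: a local rate limit of this kind is typically expressed as an Envoy HTTP filter wrapping a token bucket. A hedged sketch of what a 100 rps version could look like; the structure follows Envoy's `LocalRateLimit` extension, but the names and values here are illustrative and not the actual contents of ml-serve.yaml:

```yaml
# Illustrative only: an Envoy local rate limit admitting ~100 rps per pod,
# returning 429 once the bucket is empty.
http_filters:
  - name: envoy.filters.http.local_ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
      stat_prefix: http_local_rate_limiter
      token_bucket:
        max_tokens: 100        # burst size
        tokens_per_fill: 100   # tokens restored each interval
        fill_interval: 1s      # => ~100 requests/second sustained
```

Because the filter runs in each pod's sidecar, the effective cluster-wide ceiling scales with replica count; a per-namespace limit would instead need enforcement at a shared layer.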
[16:36:26] logging off today o/
[21:34:00] Machine-Learning-Team, Add-Link, Growth-Team: Make airflow-dag for addalink training pipeline output compatible with deployed model - https://phabricator.wikimedia.org/T388258#10706132 (VirginiaPoundstone) Catching up on this task and the slack thread linked in the comment above. It seems like there are fou...
[22:46:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:46:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[22:46:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:31:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[23:31:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[23:31:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas