[05:21:21] o/ going to deploy latest article-country model-server that supports wikilink-related predictions in LW staging ...
[06:50:05] deployment couldn't proceed because of an incorrect image tag. I've fixed this in: https://gerrit.wikimedia.org/r/1127410
[06:50:05] please review whenever you get a minute. thanks!
[07:59:15] good morning folks
[08:19:34] hello!
[08:24:02] kevinbazira: o/ If possible please change the latest image for the articlequality model as well.
[08:24:19] we might run load testing one of these days on it
[08:25:36] isaranto: thanks for the review I'd already +2'd that patch
[08:25:36] I'll push another to update articlequality specifically
[08:26:20] kevinbazira: no need to do it then. go ahead and work on your deployment and we'll take care of it when we need it
[08:26:36] okok ...
[08:30:53] georgekyz: o/ you can go ahead and deploy as well, I +1
[08:31:59] for the model in s3 we should move it to either a timestamped dir or another that would allow us to know which one it is (e.g. mbert_XXX)
[08:32:49] no need to do it now, we can come up with names later
[08:41:42] isaranto: alrighty
[08:42:08] I cannot merge https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127059
[08:42:15] I do not have option for +2
[08:42:30] IDK why
[08:47:27] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10631780 (kevinbazira) @Isaac, the model-server we worked on in P73436 has been deployed in LiftWing staging. Please test the...
[08:47:58] deployment succeeded --^
[08:54:17] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10631786 (isarantopoulos) Update: We have also applied batching and haven't gotten any improvement. We have verified that the issue is caused by t...
[09:00:12] georgekyz: o/ you can check https://gerrit.wikimedia.org/r/admin/repos/operations/deployment-charts,access
[09:02:10] you may need to be in https://gerrit.wikimedia.org/r/admin/groups/3fdcf8fd0d569e90a3e9b39788a29f2c50d33be9,members
[09:04:14] so you are in the deployment POSIX group, so I guess I can add you
[09:04:31] all right done
[09:04:38] georgekyz: can you refresh and retry?
[09:04:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:05:09] my alerts arrived :)
[09:06:27] ok this is a different one
[09:06:41] elukey: yeap I have the option now! THNX a lot
[09:07:01] thanks Luca!
[09:08:26] np :)
[09:09:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:09:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=hewiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:36:44] Lift-Wing, Machine-Learning-Team, EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10631874 (gkyziridis) Edit-check service with peacock model at `s3://wmf-ml-models/edit-check/peacock/` deployed on staging. These are the results: {P74218}
[09:38:29] Edit-check service with peacock model is deployed on staging! 🥳
[09:38:45] nice work!
[10:27:49] Morning!
[10:28:40] o/
[10:28:47] isaranto: if-when you want to deploy the apigw change, lmk
[10:29:44] klausman: I was ready to say that :) You can deploy it whenever you can. Thanks!
[10:29:56] ack, will do in a moment
[10:41:59] isaranto: So I pushed the change to APIGW's staging, but it doesn't work:
[10:42:08] $ curl -k https://staging.svc.eqiad.wmnet:8087/service/lw/inference/v1/models/edit-check:predict -X POST -d '{"rev_id": 123456, "lang": "en"}'; echo
[10:42:10] {"httpCode":503,"httpReason":"no healthy upstream"}
[10:43:35] ah, I think I spotted the error, sec
[10:45:28] lemme check
[10:47:09] I think `internal_host` should point to inference-staging.svc.codfw.wmnet
[10:47:23] Since there is no cross-dc discovery endpoint, like we have for prod.
[10:47:30] the input is also wrong, I don't recall if the service is handling input properly (if it would throw a 400) but this should work '{"original_text": "asdasdad", "modified_text": "asdasdadadfsafsadfsen", "check_type" : "peacock" ,"lang": "en"}'
[10:47:36] ouch, yes that is correct
[10:47:56] I missed that
[10:48:02] So did I :)
[10:52:30] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10632046 (MunizaA) >>! In T387019#10631786, @isarantopoulos wrote: > @MunizaA that sounds like a great approach! thanks for sharing. I'll explore...
[10:57:17] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127487
[10:57:59] Once helm-lint is happy, I'll push it and test again
[11:02:01] ack, thanks!
[11:11:47] there are still issues, but I think they may be internal to the apigw or its k8s setup, I'll poke Hugh
[11:34:17] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10632122 (isarantopoulos) Thanks @MunizaA I will take a look and try it I started looking a bit into where time is spent during the predict func...
[11:42:22] re: reference-need I'll apply batching to see if it improves a bit.
I don't expect a significant change, but sth is better than nothing :)
[11:42:22] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127494
[11:44:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:44:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[11:44:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:49:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[11:49:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[11:49:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:05:51] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10632204 (achou) Open→In progress
[12:09:31] isaranto: do you have an example curl commandline for edit-check (not using apigw) that works? I can't get it to work here for some reason
[12:09:56] Yep
[12:10:09] Gimme a sec
[12:11:33] here it is https://phabricator.wikimedia.org/T386100#10631874
[12:12:18] ```
[12:12:18] curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/edit-check:predict" \
[12:12:18] -X POST -d '{"lang": "en", "check_type": "peacock", "original_text": "blah blah blah", "modified_text": "blah blah blah"}' \
[12:12:18] -i -H "Host: edit-check.experimental.wikimedia.org"
[12:12:18] ```
[12:12:44] yeah, I keep making the same mistake of thinking -H already implies the Host: bit
[12:15:40] Ok, I've given Hugh the details, but he has to deal with an outage first, so it may take a bit
[12:16:02] * klausman lunch
[12:21:21] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10632280 (achou)
[13:01:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:01:49] Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00005-deployment - ...
[13:01:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:13:52] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10632622 (achou) On English Wikipedia, there are two places that discuss peacock language: 1. https://en.wikipedia.or...
[13:23:53] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10632691 (jhsoby) >>! In T388215#10632622, @achou wrote: > The page exists in the 19 non-English targeted languages. Ho...
[13:56:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:26:53] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10633019 (Isaac) One small thing that I've noticed -- empty string countries can show up in the response which we want to filt...
[14:58:43] just a heads-up, we're seeing quite a lot of timeouts for "reference_need_cluster" on the gateway and it's causing some indirect pages
[15:05:19] can we do something to reduce those or temporarily disable it?
[15:05:22] the above alerts are probably related, I see that autoscaling cannot bring up pods in that namespace
[15:05:31] lemme try to give more room for cpus/memory
[15:05:43] thanks!
[15:07:31] I bumped the resourcequotas to 200/200G in the revision-models ns in eqiad
[15:08:39] thanks, Luca!
[15:09:52] np!
I bumped cpus' limit to 250 to be safer
[15:10:04] thank you elukey <3
[15:11:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment reference-need-predictor-00005-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:12:10] klausman: I didn't follow up with the admin_ng patch though, FYI if you want to make the new limit permanent
[15:12:18] ack!
[15:14:20] isaranto: edit-check via staging-apigw now works, will deploy the apigw changes for prod as well
[15:14:39] Danke!
[15:21:03] Lift-Wing, Machine-Learning-Team: Load test the language agnostic article-quality model - https://phabricator.wikimedia.org/T388805 (isarantopoulos) NEW
[15:25:39] isaranto: it works! https://phabricator.wikimedia.org/P74223 --- including a ratelimit of 0 for non-token requests resulting in a 429.
[15:26:02] (those curl calls were made from my workstation here)
[15:37:20] Nice!
[15:37:40] klausman: I'd like you to review and deploy (if you think it is ok) this patch for the resourcequota https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127530
[15:38:24] and if we also apply this patch it will give the namespace some breathing room as ref-risk doesn't need an autoscaling rule that is so low https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127541
[15:38:33] Does that take into account Luca's changes above?
[15:40:55] (the resource change, that is)
[15:41:14] ok, sorry I totally missed the above conversation
[15:41:45] That said, we should make the changes permanent, lest we forget about the hot edit.
[15:42:21] The autoscaling one I've +2's
[15:42:26] *'d
[15:42:42] I updated the resourcequota patch to match the 200 cpus that exist now https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127530
[15:42:51] memory is not an issue
[15:43:02] ack
[15:44:16] Lift-Wing, Machine-Learning-Team, Wikimedia Enterprise - Content Integrity: Load test the language agnostic article-quality model - https://phabricator.wikimedia.org/T388805#10633509 (FNavas-foundation)
[15:48:55] autoscaling change is pushed in both prod clusters
[15:58:10] quota also "pushed" (as in a noop in eqiad, but there were some ip-range updates for external services, as usual, so I pushed those; diff is now empty for both eqiad and codfw)
[16:02:24] ack. I also pushed the change in eqiad :)
[16:05:57] did I forget one?
[16:19:23] Lift-Wing, Machine-Learning-Team, EditCheck: Investigate options for providing beta cluster / patchdemo access to liftwing staging - https://phabricator.wikimedia.org/T388269#10633709 (isarantopoulos) @dchan The service on staging can now be accessed with ` curl https://api.wikimedia.org/service/lw...
[16:23:34] I see reference-risk scaled down and it is serving traffic as usual. The issue with reference-need persists though
[16:24:36] no you didn't forget it, I think I just synced seconds before you did :P
[16:30:07] Lift-Wing, Machine-Learning-Team, EditCheck: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817 (isarantopoulos) NEW
[17:09:02] going afk folks, have a nice evening o/
[17:17:58] \o
[19:13:23] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Verify cost of gathering peacock training/evaluation data for top 20 languages - https://phabricator.wikimedia.org/T388215#10634471 (Strainu) In Romanian we have: * WP:EJV - peacock words, value judgments * WP:FE - weasel words (peacock are i...