[00:27:17] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10569787 (ppelberg)
[03:08:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=istio-system&var-backend=istio-ingressgateway.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[03:43:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=istio-system&var-backend=istio-ingressgateway.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:08:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=istio-system&var-backend=istio-ingressgateway.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:54:48] Lift-Wing, Machine-Learning-Team, OKR-Work, Patch-For-Review: Update the article-country isvc to use Wikilinks for predictions - https://phabricator.wikimedia.org/T385970#10570192 (kevinbazira)
[08:04:22] hii
[08:04:30] taking a look at the alerts :(
[08:04:40] Morning!
[08:05:03] I've been digging a bit already, and so far my best hypothesis is that we're getting a lot of 500s from MWAPI
[08:08:44] RESOLVED: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=istio-system&var-backend=istio-ingressgateway.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:11:18] iiuc the issue is on istio. There are many requests (20 reqs/s) made to reference-quality https://grafana.wikimedia.org/goto/3my7JPcHR?orgId=1
[08:11:53] but they never reach the pod - hence we have no logs
[08:12:03] I mean no logs in the service
[08:14:49] Yeah, it's puzzling.
[08:15:17] Have we checked that manual (curl) requests to the service work?
[08:15:41] I'm doing that now, I assume it will be unreachable
[08:22:14] yes, I'm getting a 503
[08:23:37] So the outermost part of the service is the istio-proxy container in the pod
[08:24:14] Ignoring the requests to /metrics (Prometheus), the only messages I see are about a deprecated envoy field
[08:24:27] But that's a non-fatal warning
[08:28:34] Did this service work correctly at any point? What changes have we made since?
[08:30:35] nothing changed. we started having increased traffic, observed cpu throttling and 5xx errors from mwapi https://grafana.wikimedia.org/goto/pfXDxP5Hg?orgId=1
[08:30:57] Should we try bouncing the service?
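[Editor's note: a minimal triage sketch for this "ingress logs 503s but the service logs nothing" situation. These are not the exact commands used during the incident; the namespace, service, model and container names come from later in this log, and the pod name is a placeholder.]

```
# Does the request ever reach the pod? Compare the istio-proxy sidecar logs
# with the model-server (kserve-container) logs for the same time window.
kubectl -n revision-models get pods
kubectl -n revision-models logs <reference-quality-pod> -c istio-proxy --tail=100
kubectl -n revision-models logs <reference-quality-pod> -c kserve-container --tail=100

# Reproduce from the client side against the eqiad discovery endpoint,
# bypassing the API gateway (model name follows the correction further down).
curl -i "https://inference.svc.eqiad.wmnet:30443/v1/models/reference-need:predict" \
  -X POST -d '{"rev_id": 1242378206, "lang": "en"}' \
  -H "Content-Type: application/json" \
  -H "Host: reference-quality.revision-models.wikimedia.org"
```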
[08:32:09] o/ lmk if I can help at all, the gateway paged
[08:33:02] the istio-proxy container doesn't seem to have any issue, as you mentioned. Could there be an issue with the istio system ns? I don't have access to that ns, although the other services work fine so it wouldn't make sense, but the last alert was related to that one
[08:33:10] Probably not an API GW issue, as we can reproduce without it (right, Ilias, you were hitting the Discovery endpoint?) and are not making BE requests through it
[08:33:41] The system-ns istio works fine, though of course it logs a lot of 503s
[08:34:04] o/ hnowlan yes I can confirm it isn't an api gw issue. I tried both internal (discovery) and external (api gw) endpoints and get the same results. service is unreachable
[08:34:36] https://phabricator.wikimedia.org/P73500 here's an istio log entry
[08:36:06] here is an internal request
[08:36:06] ```
[08:36:06] curl "https://inference.svc.eqiad.wmnet:30443/v1/models/reference-quality:predict" -X POST -d '{"rev_id": 1242378206, "lang": "en"}' -H "Content-Type: application/json" -H "Host: reference-quality.revision-models.wikimedia.org"
[08:36:06] ```
[08:37:40] I switched from discovery to eqiad to target the specific deployment, but with a quick test it seems that codfw has the same issue
[08:37:43] I bounced one of the Ingress gws a bit back and have been looking at its logs, no change, it still works fine in the sense that it doesn't seem to log its own errors, only the upstream ones
[08:39:34] hmmm. did this coincide with the Pod Security Policy changes Luca deployed?
[08:39:52] Not that it would make any sense to only affect this service...
[08:39:57] 🤦‍♂️
[08:40:13] yes. It was first deployed yesterday in production by me
[08:40:29] the changes were there so this was the first deployment
[08:40:50] Let's try bouncing the service itself, it won't break anything and may fix split-brain issues regarding the PSS
[08:40:53] PSP*
[08:41:29] wdyt?
[08:43:02] yes
[08:43:18] ok, deleting old pod
[08:43:21] I bumped the max replicas in this patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121362
[08:43:33] and with that I also deployed the PSS change as it is in the chart
[08:43:44] FIRING: LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=istio-system&var-backend=istio-ingressgateway.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:44:35] new pod is initializing
[08:46:28] I am now almost certain that it is the PSS change that caused this. The service in codfw didn't have any traffic at all but ended up the same
[08:46:49] hmmm. pod init taking over 3m seems long
[08:48:03] no it is "normal"
[08:48:10] it downloads a 2GB file locally
[08:48:28] https://phabricator.wikimedia.org/P73501 This is where it's sitting
[08:48:38] botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=reference-quality%2F20250127142109%2F&encoding-type=url"
[08:49:26] hmm
[08:49:30] So the storage initializer can't talk to Thanos-Swift
[08:49:39] Which might be another PSP problem.
[08:49:55] can you try to remove this from the isvc?
[08:49:55] + securityContext:
[08:49:55] +   seccompProfile:
[08:49:55] +     type: RuntimeDefault
[08:50:13] and then we can see how to revert it from the config
[08:51:05] That's a non-live-editable attribute, so I'll have to do a fs-based one, sec
[08:51:35] you could try to edit the isvc and not the pod, that should work
[08:52:40] if the attribute is there ofc
[08:53:29] done
[08:54:04] waiting to see if this helped
[08:54:49] quoting me from above "nothing changed"
[08:54:50] haha
[08:54:56] famous last words
[08:56:10] the restarted s-i is already taking >1m, I don't think the isvc edit did the trick
[08:56:31] but let's wait to be sure
[09:02:04] yeah, still failing.
[09:02:15] hey folks
[09:02:19] anything that I can help with?
[09:02:36] We think the PSP update broke at least one of our services
[09:03:09] It was unreachable from the istio-system ingressgw's, so we bounced it, and now it can't even get stuff from Thanos-Swift anymore
[09:03:21] did you deploy anything?
[09:03:33] because the PSP upgrade was only in staging
[09:03:42] The PSP update was pushed yesterday along with another update
[09:03:46] and the only thing that may have gone out is the seccomp stuff
[09:03:58] (I think, Ilias has more detail)
[09:04:20] yes but if the pods came up correctly after the deployment I don't think it should be an issue
[09:04:36] The pod we were investigating was 2wks old
[09:04:36] the seccomp defaults are automatically injected
[09:04:47] (since PSP is still used)
[09:04:58] And we only restarted it today to try and clear whatever had gone wrong with it.
[09:05:17] what is the current status? Any specific ns/model causing the issue? Both eqiad/codfw?
[09:05:46] The one we were looking at is in NS revision-models in eqiad (though codfw is also broken), there is only one pod
[09:06:14] I deleted the seccompProfile: type: RuntimeDefault bit from the isvc manually, but it made no diff
[09:06:43] yep yep that bit is already injected by PSP so I'd have been really surprised if it made a difference
[09:06:57] also seccomp limitations block syscalls etc.. we should have seen clear impact
[09:07:03] in the sense of errors etc..
[09:07:07] Ack.
[09:07:26] I deployed this change yesterday https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121362
[09:07:26] and with it deployed the psp update, which added:
[09:07:26] + securityContext:
[09:07:26] +   seccompProfile:
[09:07:26] +     type: RuntimeDefault
[09:07:36] ack
[09:07:52] _something_ made the pod network-isolated it seems. The system ingressgw basically says "I can't talk to that service, here's a 503" on every request
[09:08:17] and that's revision-models only right?
[09:08:21] checking eqiad
[09:08:48] yes, as far as we know, only revision-models is affected
[09:08:58] ah it can't even start the pod
[09:09:10] yes. revision-models ns, the service is reference-quality and it hosts 2 models from the same pod (reference-need and reference-risk)
[09:09:24] But then again, we didn't deploy any other stuff in prod as far as I am aware, so only it saw the seccomp update (which led to our hypothesis of it being at fault)
[09:09:51] Yeah, the Storage initializer times out when trying to fetch from Thanos-Swift
[09:10:43] so we shouldn't rule out seccomp but if you already removed it from the isvc, it shouldn't be an issue.. and it is a connect timeout, very weird
[09:10:56] Is there a list of what specific syscalls seccomp forbids/allows in this context?
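[Editor's note: a hypothetical sketch of the kind of isvc edit described above. The jsonpath and patch paths are assumptions about where the pod-level securityContext ends up in this chart's InferenceService spec, not commands verified during the incident.]

```
# Inspect the (assumed) pod-level securityContext on the InferenceService.
kubectl -n revision-models get inferenceservice reference-quality \
  -o jsonpath='{.spec.predictor.securityContext}'

# Drop the injected seccompProfile from the isvc spec; the JSON-patch path is
# an assumption and may differ depending on where the chart places the field.
kubectl -n revision-models patch inferenceservice reference-quality --type=json \
  -p '[{"op":"remove","path":"/spec/predictor/securityContext/seccompProfile"}]'
```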
[09:11:15] not that I know, and we use the same profile everywhere
[09:12:15] so the bit that isaranto mentioned is still in the isvc, at least in eqiad
[09:12:35] there were two places in which we added it, container level and pod level
[09:12:38] lemme remove it too
[09:14:23] it is up now
[09:14:25] what the...
[09:14:31] and I see it scaling up as well
[09:14:48] I suspect a bunch of pending requests
[09:14:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:14:49] Deployment reference-quality-predictor-00006-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00006-deployment ...
[09:14:49] - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:15:44] And ingress-gw has started to log 200s
[09:16:26] the api gw request works now
[09:16:54] klausman: something is off though, in codfw I see the same isvc with seccomp but everything works
[09:16:59] actually no..
[09:17:46] elukey: in codfw the pod is old. This is the behavior we had in eqiad too, and when we deleted the pod it couldn't start
[09:18:28] maybe a restart will clear it being wedged, since it would reapply (de-apply?) the PSP
[09:18:44] FIRING: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:18:50] I removed the seccomp bit from codfw as well
[09:18:54] from the isvc I mean
[09:19:01] Now I got a request from the api gw
[09:19:17] ack
[09:19:29] codfw replacement pod is up
[09:19:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[09:19:49] Deployment reference-quality-predictor-00006-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-quality-predictor-00006-deployment ...
[09:19:49] - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:19:55] isaranto: the curl cmdline you shared doesn't work for me, btw: "Model with name reference-quality does not exist."
[09:20:20] at this point I am going to delete the pod in staging
[09:20:28] this is the correct one - time curl -i https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict -X POST -d '{"rev_id": 123456, "lang": "en"}'
[09:20:29] because it must repro in there too
[09:20:50] the models are reference-risk and reference-need
[09:21:58] I need to deploy in ml-staging because not all ns got the seccomp update, and in this way the istio proxy doesn't get the seccomp stuff
[09:22:26] codfw (prod) working correctly
[09:22:47] isaranto: can you test if staging works?
[09:22:50] elukey: I think this is all my fault. I deployed the changes in staging and ran the httpbb tests and assumed it was working, but it actually didn't start any new pods
[09:23:04] which ns didn't get the change in staging?
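[Editor's note: as an aside, a quick sketch for smoke-testing both models served by the reference-quality isvc once a pod is back up. It is based on the corrected curl shared above; the model names and payload come from this discussion, and the rev_id is an arbitrary example.]

```
# Smoke-test both models behind the reference-quality isvc via the API gateway.
for model in reference-need reference-risk; do
  curl -s -o /dev/null -w "${model}: %{http_code}\n" \
    "https://api.wikimedia.org/service/lw/inference/v1/models/${model}:predict" \
    -X POST -d '{"rev_id": 123456, "lang": "en"}' \
    -H "Content-Type: application/json"
done
```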
[09:23:10] codfw-staging works as well
[09:23:28] Soo… how’s it going?
[09:23:33] revision-models, but in theory a deploy to the isvc should trigger a re-creation of the pods
[09:23:42] chrisalbon: swimmingly :)
[09:23:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:23:49] like it did in staging :D
[09:24:01] revision-models in staging works fine
[09:24:12] ok then there must be some race condition
[09:24:54] what this seccomp injection does is to apply the default profile (that we should already have everywhere, but injected by psp) to all the containers of the pod
[09:26:01] with the new config, in staging knative's containers + kserve should get it automatically from the knative control plane (storage-init included), but for istio containers (validation + proxy) the bit that we are investigating takes care of it
[09:26:20] otherwise with PSS we don't respect the restricted profile, and the pod is forbidden
[09:26:55] o/ chrisalbon are you partying?
[09:26:57] so maybe there is some race condition that causes the istio-proxy container to be blocked somehow
[09:28:14] So when you said "we should already have everywhere, but injected by psp", does that mean that all other clusters are already using the injection approach that we're trying to use?
[09:28:41] klausman: all clusters are now using PSP, and we do the seccomp injection when the pod is created
[09:29:04] with the move to PSS this can't happen anymore and we need to be explicit
[09:29:06] Do you think this might be due to the way we use Istio (IIRC, other clusters don't use istio the way we do)
[09:29:40] in staging it works fine, so the only thing that I can think of is that it works because there are also the other changes (knative etc..)
[09:29:50] but it still feels very weird
[09:30:00] so lemme do some quick checks in other ns
[09:30:25] isaranto: you deployed only to that ns right? If I check another one, like revscoring-editquality-goodfaith, I'll find pristine setups right?
[09:31:37] also, if you folks could explain to me more or less what happened..
[09:31:40] elukey: yes I deployed only to revision-models, the others don't have the PSS changes
[09:31:52] so Ilias deployed the patch yesterday
[09:32:05] but the pod didn't get refreshed, for some reason
[09:32:34] but when did it start to misbehave?
[09:33:25] right away probably, but we didn't figure it out
[09:33:35] because of not enough traffic etc..
[09:33:36] lemme explain the steps/problems
[09:36:47] - we have high latencies and cpu throttling on this service (the problem still persists as I'm checking the dashboards) https://grafana.wikimedia.org/goto/-R0eLEcNg?orgId=1
[09:36:47] - increased maxreplicas to 5
[09:36:47] - deployed the change and deployed the security context change as well
[09:36:47] - pod creation didn't get triggered as the max replicas change is a no-op if no traffic exists (iiuc)
[09:36:47] - the pod became unreachable.
we could see traffic on the istio dashboard https://grafana.wikimedia.org/goto/zP5iEPcHg?orgId=1 but when we looked at the pod it was as if there were no requests at all and the service was just unreachable (503 errors)
[09:37:06] then we deleted the pod and the new pod wouldn't start
[09:37:22] klausman: https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile
[09:38:15] merci!
[09:38:20] isaranto: re: no-op, true, but there was also the seccomp profile change, that should've triggered a refresh
[09:38:25] at least, in staging it did
[09:39:22] true true. All the deployments I did on staging did trigger a deployment
[09:40:32] and, weirdest of the weird things, in staging it works perfectly
[09:40:45] but we have other changes there, and I am not 100% sure why they should matter
[09:40:52] elukey: yeah, that filter list looks sane. I don't see anything there that any of our pods would need.
[09:53:11] (Still checking some stuff sorry)
[09:55:08] all right, for example
[09:55:10] kubectl get pod arwiki-damaging-predictor-default-00025-deployment-7df86d9dll57 -n revscoring-editquality-damaging -o jsonpath='{.spec.securityContext}' | jq '.'
[09:55:19] contains
[09:55:20] "seccompProfile": {
[09:55:20]   "type": "RuntimeDefault"
[09:55:20] },
[09:55:26] and it is at the pod level
[09:55:44] so it gets applied to all containers, istio included
[09:56:29] now istio-proxy has to do some horrors via iptables to redirect "transparently" all traffic to the local envoy
[09:56:52] as klausman mentioned, this is a peculiarity of the kserve stuff since in other places we don't have it
[10:00:05] I am going to file a patch to guard the seccomp stuff with an explicit setting, so you'll be free to deploy
[10:02:00] ack, thanks
[10:02:44] to circle back to the issue that started this: I see increased latencies in this service even with the 5 replicas https://grafana.wikimedia.org/goto/KItbyP5Ng?orgId=1
[10:03:11] and there is some cpu throttling on all 5 pods https://grafana.wikimedia.org/goto/rVpasP5Ng?orgId=1
[10:03:53] I'm thinking that this service would benefit from both horizontal (more replicas) and vertical (more cpu resources) scaling, especially since it serves 2 models. wdyt?
[10:06:16] sgtm!
[10:16:50] hm we're already using 75/90 cpus according to resource quotas
[10:23:31] as in the whole cluster? 90 seems too low for that
[10:27:19] for the revision-models ns
[10:28:06] so between 5 replicas at 14 CPUs each?
[10:30:10] current config is 5 replicas at 12 cpus each (don't know how this results in a 75 limit and 65 requests)
[10:30:48] Ah, extra CPU for the other containers in the same pods? Or is the 12 CPU per-pod?
[10:31:23] ah yes, the 12 cpus are for the kserve-container, so a couple more for the other containers
[10:36:14] We should see if the service actually uses 12 CPUs under load. It seems a bit high
[10:37:02] all right https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121591 should be ready
[10:37:15] it is bigger than expected but the diff looks sane
[10:37:16] namely:
[10:37:24] - in staging no seccomp is removed
[10:37:39] - in prod eqiad/codfw all seccomps are removed (at the pod level)
[10:38:02] Ack, taking a look
[10:38:03] I also added in staging seccomp to the transformer, I checked and I missed it in the first pass
[10:40:54] klausman: they are indeed used. I filed a patch for it.
lemme know if it would work https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121595
[10:41:55] any idea what the negative values for throttling mean in the pod resources dashboard? https://grafana.wikimedia.org/goto/zaolQE5HR?orgId=1
[10:42:31] I just assume that it is indeed throttling and the negative values are caused by the window-based calculations of the prom metrics
[10:44:01] Nah, the graph definition makes them negative (e.g. `- sum(irate(container_cpu_cfs_throttled_seconds_total{namespace="$namespace", pod="$pod", container="$container"}[5m]))`)
[10:44:44] ack
[10:45:31] Probably for UI/visualization reasons
[10:51:45] ok, I merged it. klausman could you deploy the change to admin_ng please? I will deploy it to revision-models afterwards
[10:56:40] ack
[10:59:50] eqiad done
[11:00:45] thanks!
[11:01:18] all right, checked all namespaces, the seccomp diff is gone
[11:01:24] I now see 8 pods running in eqiad
[11:01:29] the only fix is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121599 since I am stupid
[11:02:48] elukey: btw, for these admin-ng deploys I am doing, I am using -l namespace!=knative-serving to avoid pulling that in.
[11:03:39] isaranto: codfw also done
[11:03:42] it is only the docker images right?
[11:03:55] Yes.
[11:04:29] 1.7.2-{1,2} -> 1.7.2-6
[11:09:01] I've deployed to eqiad and codfw. We now have 8 replicas, they seem to help (latencies went down) but it is far from perfect. I still see throttling in some of the pods (not all), which likely means that increasing cpu instead of replicas would help. We'll have to dig deeper into the code a bit to understand what is causing this and fix it if we can before deciding to increase resources
[11:09:22] thank you both for the fixes and the deployments <3
[11:12:39] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1121602
[11:14:03] <3
[11:14:09] I was halfway into a similar change
[11:14:43] also good catch on the webhook replicas
[11:18:54] ok, verified that we do have a no-op now for knative (in admin_ng)
[11:19:13] at this point I want to kill more pods in staging etc..
[11:19:20] but I am very puzzled
[11:19:20] :+1:
[11:20:04] Same. The seccomp stuff really shouldn't break talking to the 'net
[11:20:15] (CR) Ilias Sarantopoulos: "Thanks for working on this George! I added a couple of comments." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:21:38] the only explanation that I can give is that somehow the istio-proxy container didn't get to modify the iptables rules, so things like the storage initializer etc.. also failed
[11:21:43] why, no idea
[11:21:47] in staging all works
[11:22:10] I wonder what the tie-in with the updated knative/serve might be
[11:58:09] (PS3) Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[12:00:03] (CR) Gkyziridis: "Developed all comments."
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:17:54] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10570761 (elukey) Something really weird happened today, after a deployment of a kserve isvc in production. The pod-level change to force...
[13:26:58] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10570956 (isarantopoulos) a: gkyziridis
[13:30:32] (PS4) Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[13:41:00] (PS5) Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[13:47:24] Machine-Learning-Team, collaboration-services, Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10570995 (Seddon)
[13:48:34] Lift-Wing, Machine-Learning-Team: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019 (isarantopoulos) NEW
[13:49:16] I have put in writing the current status of the reference-quality issues https://phabricator.wikimedia.org/T387019
[13:49:22] (CR) CI reject: [V: -1] inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:50:05] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:50:24] georgekyz: this is how you can trigger a CI rerun
[14:01:58] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10571062 (elukey) @isarantopoulos if you have time could you add in here how you build/test the image on ml-lab? I can try to create another version of the image playing with the Dockerfile whil...
[14:33:18] Machine-Learning-Team, ContentTranslation, Research, Epic: Verify if the Python recommendation API can support the use cases of the nodejs one - https://phabricator.wikimedia.org/T340854#10571199 (Isaac) Stalled→Declined Being bold and declining this -- it was determined that a LiftWi...
[14:34:46] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10571204 (Isaac) It's now been over six months with the deprecation notice without any additional questions/concerns being raised so I am officially taking down the API endp...
[14:34:50] Machine-Learning-Team: the error message from gapfinder service refers to a deleted rev - https://phabricator.wikimedia.org/T377331#10571205 (Isaac) Open→Resolved a: Isaac
[15:39:46] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10571405 (isarantopoulos) **Note: I haven't gotten to successfully test it yet -- there are error logs at the bottom of this message attached but the procedure is the following on ml-lab1002** I...
[16:01:38] * isaranto afk bbl
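[Editor's note: following up on the throttling and resource-quota discussion earlier in the log, a minimal sketch for checking actual per-container CPU usage against the 12-CPU kserve-container limit and the namespace quota that showed 75/90 CPUs in use. It assumes metrics for `kubectl top` are available on the cluster and is not a command that appears in the incident itself.]

```
# Compare actual kserve-container CPU usage with its 12-CPU limit, to decide
# between adding replicas and growing the pods (per klausman's suggestion above).
kubectl -n revision-models top pods --containers | grep kserve-container

# Check the namespace ResourceQuota that capped the deployment at 75/90 CPUs.
kubectl -n revision-models describe resourcequota
```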