[07:58:10] Lift-Wing, Machine-Learning-Team: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547 (kevinbazira) NEW
[08:40:09] (PS1) Kevin Bazira: article-country: fix error handling and omission of geographic data in wikidata-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547)
[08:45:44] FIRING: LiftWingServiceErrorRate: ...
[08:45:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:55:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:55:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:11:15] Good morning!
[09:11:34] I have deployed the new knative version in ml-staging, all good afaics
[09:11:45] https://github.com/istio/istio/issues/35894#issuecomment-1511634924 is also interesting for the istio bits
[09:12:11] because the solution that people suggest is the same that we are using (setting seccomp to the pod level, so it gets applied to all containers)
[09:17:30] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10589995 (elukey) New knative version deployed to staging, tested the removal of the kserve's container-securitycontext (since it is now a...
[09:17:34] Added all the thoughts to the task
[09:21:49] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10590065 (kevinbazira) Below are the production example responses after a fix has been applied: 1.**Error Handling for Cl...
[09:24:00] (CR) Kevin Bazira: "this patch has been tested as shown in: https://phabricator.wikimedia.org/T387547#10590065" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[10:13:44] FIRING: LiftWingServiceErrorRate: ...
[10:13:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:23:44] RESOLVED: LiftWingServiceErrorRate: ...
[10:23:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:12:35] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10590410 (elukey) @klausman @isarantopoulos @achou The only thing that I can think of is the following: 1) depool eqiad or codfw from inf...
[11:20:52] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10590417 (klausman) >>! In T369493#10590409, @elukey wrote: > @klausman @isarantopoulos @achou The only thing that I can think of is the f...
[11:59:58] (CR) AikoChou: "Nice work! I have one comment" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[12:32:08] (PS2) Kevin Bazira: article-country: fix error handling and omission of geographic data in wikidata-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547)
[12:34:55] (CR) Kevin Bazira: article-country: fix error handling and omission of geographic data in wikidata-related predictions (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[13:22:05] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[13:25:14] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[13:25:58] (Merged) jenkins-bot: article-country: fix error handling and omission of geographic data in wikidata-related predictions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123585 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[13:28:13] Machine-Learning-Team, Data-Platform-SRE (2025.03.01 - 2025.03.21): Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10590806 (Gehel)
[13:47:04] Lift-Wing, Machine-Learning-Team: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10591068 (Isaac) Looks good to me - thanks!
[14:29:13] klausman: o/
[14:29:31] I think that all the knative metrics stopped being collected after https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060452/3/charts/knative-serving/templates/networkpolicy-prometheus.yaml
[14:29:41] see https://prometheus-eqiad.wikimedia.org/k8s-mlserve/targets?search=knative
[14:29:54] I wanted to check the dashboard but it is empty..
[14:30:11] I suspect it may be due to the absence of a selector or similar, but not super sure
[14:32:21] we can probably quickly check in staging
[14:41:11] ah, interesting. yes, we should fix that. I wonder whether it's the PSPS now allowing it or the kserve patches just not supporting Prom discovery anymore. Probably the former.
[14:44:19] nono it is broken in prod as well
[14:44:22] no metrics at all
[14:44:32] I think it is the calico policy itself
[14:44:44] oooh.
[14:44:52] I thought as well that without a selector it would have picked everything
[14:45:00] but apparently it doesn't target a single pod :D
[14:45:20] so port 9090 is blocked for prometheus
[14:46:09] the other weird bit that I found is that only on ml-serve-codfw we have knative-serving-knative-serving-activator (old k8s networkpolicy)
[14:46:12] maybe a leftover?
[14:46:25] That seems the most likely
[14:50:21] tried to add a selector to the prometheus policy in staging
[14:54:40] nope doesn't work
[14:55:49] So you mentioned that the other (complete) failure we saw was on pod- vs. container-level policies. Do you think there might be a similar discrepancy here?
[14:56:13] what do you mean?
[14:58:04] so if I understood right, the "pod becomes entirely unreachable" problem was caused by having knative adding policies on top of the ones we "manually" added, right?
[14:58:58] ah in prod you mean, no no it was simpler - the knative changes to inject values are only in staging, in prod we tried to rollout a specific securityContext at the pod level setting seccomp's default profile
[14:59:09] (that applied to all containers)
[14:59:17] aaah, right, now I get it.
[14:59:57] I want to test it because I recall that we didn't see traffic, but I don't know if it was because of the storage-initializer not coming up or the istio-proxy not responding
[15:09:40] I am going to open a task for the metrics later
[15:09:44] very weird
[15:56:29] ahhh ok wait I may get why the traffic is blocked
[15:57:05] we have a policy for each knative pod-kind basically, and in there we have ingress policies
[15:57:27] so if we set "allow only port XXXX" in there, we may not be able to open the prometheus port in other places
[16:01:41] tried to modify the controller's policy but no luck
[16:19:22] For a moment I thought that maybe because we explicitly mention port 9090, everything gets auto-denied, but that is the reverse of the symptoms we see
[17:17:36] Machine-Learning-Team: Knative Serving's metrics don't work on all ML k8s clusters - https://phabricator.wikimedia.org/T387580 (elukey) NEW
[17:17:47] Machine-Learning-Team: Knative Serving's metrics don't work on all ML k8s clusters - https://phabricator.wikimedia.org/T387580#10591770 (elukey)
[17:17:47] created https://phabricator.wikimedia.org/T387580
[17:26:34] Machine-Learning-Team: Knative Serving's metrics don't work on all ML k8s clusters - https://phabricator.wikimedia.org/T387580#10591792 (elukey) Same for the KServe controller, see https://grafana-rw.wikimedia.org/d/Rvs1p4K7k/kserve (but the calico policies for it don't allow its Prometheus port to be fetched).
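
The "are the knative targets being scraped at all?" check discussed above could be scripted roughly as follows. This is only a sketch: it assumes the Prometheus instance exposes the standard HTTP API endpoint `/api/v1/targets` under the same base URL as the targets page linked in the log, and the names `fetch_targets`, `summarize_targets` and the `SAMPLE` payload are hypothetical, made up for illustration rather than taken from the real clusters.

```python
import json
import urllib.request
from collections import Counter


def fetch_targets(base_url: str) -> dict:
    """GET <base_url>/api/v1/targets and return the decoded JSON payload.

    base_url is an assumption, e.g. the prefix of the targets page in the log.
    """
    url = f"{base_url.rstrip('/')}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_targets(payload: dict, needle: str) -> Counter:
    """Count active targets mentioning `needle`, grouped by health (up/down/unknown)."""
    counts = Counter()
    for target in payload.get("data", {}).get("activeTargets", []):
        # Search the scrape pool name and every label value for the substring.
        haystack = [target.get("scrapePool", "")]
        haystack.extend(target.get("labels", {}).values())
        if any(needle in value for value in haystack):
            counts[target.get("health", "unknown")] += 1
    return counts


# Hypothetical payload in the shape returned by /api/v1/targets (not real data).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapePool": "k8s-pods",
                "labels": {"job": "k8s-pods", "kubernetes_namespace": "knative-serving"},
                "health": "down",
                "lastError": "context deadline exceeded",
            },
            {
                "scrapePool": "k8s-pods",
                "labels": {"job": "k8s-pods", "kubernetes_namespace": "revertrisk"},
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

if __name__ == "__main__":
    print(dict(summarize_targets(SAMPLE, "knative")))
```

An empty result would suggest that no matching targets are discovered at all, while targets reported as `down` (with a `lastError`) would be more consistent with the scrape port being blocked by a network policy.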