[10:28:55] hey folks!
[10:29:11] I left some code reviews for the knative/kserve missing metrics https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1123692
[10:29:14] (and nexts)
[10:29:22] lemme know what you think :)
[10:31:07] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10595428 (elukey) >>! In T369493#10590417, @klausman wrote: >>>! In T369493#10590409, @elukey wrote: >> @klausman @isarantopoulos @achou T...
[10:39:38] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10595486 (klausman) >>! In T369493#10595428, @elukey wrote: > @klausman we can check what inference DC takes the majority of the traffic a...
[10:40:00] elukey: I'll do some reviewing now, and have commented on the bug above regarding which DC to drain
[10:41:03] ack thanks!
[10:41:39] I assume that the eqiad spikes are probably Enterprise-related
[10:41:58] thanks!
[10:42:04] merging and deploying
[10:50:37] yeah, WME is my suspicion as well.
[11:06:34] I am doing some tests for the knative netpolicies in staging, very weird
[11:17:14] (kserve fixed in staging and prod though)
[11:52:02] o/ thanks for the review Aiko!
[11:52:02] going to deploy the latest article-country model-server on LW ...
[12:01:21] Machine-Learning-Team: Knative Serving's metrics don't work on all ML k8s clusters - https://phabricator.wikimedia.org/T387580#10595644 (elukey) Something is definitely weird, but I still can't get what. I tried multiple things: - Changing manually in staging the network policies. - Change the prometheus por...
[12:05:22] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10595658 (kevinbazira) >>! In T387547#10591068, @Isaac wrote: > Looks good to me - thanks! np! we have deployed this fix i...
[12:06:36] latest article-country image is up and running in staging --^
[13:38:19] Lift-Wing, Machine-Learning-Team: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10595952 (Isaac) Tests are working for me too and didn't see anything other issues. Thanks!
[14:05:45] Lift-Wing, Machine-Learning-Team: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10596036 (kevinbazira) Thank you for the confirmation, @Isaac! The fix has now been deployed in LiftWing production. ` # pod running in eqiad $...
[15:27:35] Machine-Learning-Team: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10596450 (achou) **Evaluate the existing peacock detection model** 1. We first created balanced datasets from the paragraph/sentence labeled data we've collected where, for each revision, we have...
[16:09:46] good morning all
[17:30:07] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124155 - I am feeling very sad inside
[17:31:09] what an embarrassing bug
[17:31:13] I think that your calico patch just restarted all the knative daemons, that were upset about the config-observability being there
[17:31:51] we previously removed it because it contained only the "_example" bit, hoping for defaults, but apparently the code wants something to be there
[17:32:01] I tested in staging and I saw metrics flowing
[17:32:05] sigh
[17:32:07] might be a bug in their yaml parser, too
[17:33:16] nono I restarted the controller pod and a log mentioned that an empty config-observability map was unacceptable, and just it (no exporter created)
[17:33:46] I still think the map being empty should not stop it from starting
[17:34:37] yep me too, probably there is a fix in the commits from knative 1.8+
[17:34:57] it goes on my list of things to hunt for in the changelogs :)
[17:44:03] metrics are visible again in staging, will do prod tomorrow :)
[17:50:40] ty!
[17:52:15] np! This was a sneaky one