[05:24:28] Machine-Learning-Team: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10599291 (ppelberg)
[05:37:28] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10599296 (ppelberg)
[08:03:06] good morning folks, I'm back! :D
[08:13:21] morning morning, welcome back Ilias
[08:13:32] \o
[08:26:15] Morning!
[08:37:39] hi Tobias!
[08:49:55] (PS3) AikoChou: reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019)
[08:51:20] (CR) Ilias Sarantopoulos: [C:+1] reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019) (owner: AikoChou)
[09:05:48] hello folks!
[09:06:00] the knative metrics for staging are working now (also the kserve ones)
[09:06:05] I'll deploy the fix to prod later on
[09:06:19] ty!
[09:33:45] thanks Luca!
[09:41:59] Machine-Learning-Team: Knative Serving's metrics don't work on all ML k8s clusters - https://phabricator.wikimedia.org/T387580#10599965 (elukey) Open→Resolved a:elukey https://gerrit.wikimedia.org/r/1124155 fixed the problem, when we emptied the config-observability configmap (it contained on...
[09:42:15] deployed and closed --^
[09:44:09] isaranto: o/ welcome back :) - by any chance do you recall what happened when you deployed the seccomp stuff in prod?
[09:44:52] namely - were the pods up and running but without networking, or was the storage-initializer not working? Or both :D ?
[09:47:15] iirc pods were up but networking was failing so the storage initializer was failing
[09:54:22] yeah, basically, already-running pods looked fine mostly log-wise, but were network-isolated. restarting pods never made it to the "up" state since the SI failed to fetch stuff
[09:57:19] okok so it was definitely the istio-proxy container
[09:58:04] Yeah, I agree, symptoms and the discrepancy you found definitely fit together
[10:04:31] (CR) Ilias Sarantopoulos: reference-quality: fix reference models from getting unnecessary data from mwapi (2 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019) (owner: AikoChou)
[10:27:44] Lift-Wing, Machine-Learning-Team, EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10600080 (gkyziridis)
[10:57:01] mmm I have some doubts about what we discussed earlier
[10:57:26] this particular setting, for some reason, doesn't trigger a recreation of the pod since the isvc doesn't get a new revision
[10:57:43] so when Ilias deployed, in theory the pre-existing pods didn't get affected
[10:58:30] only new pods would have been affected, but at that stage the storage initializer would have failed before any network test could have been done
[10:58:47] I am a bit confused
[11:05:09] I wonder if the istio-proxy got affected and restarted (and changed ipt rules), but the pod didn't restart since it wasn't "directly" affected?
[11:06:30] i.e. the machine/worker-level ipt rules changed (and broke the pods because packets never made it there), but since from k8s pov, the pod config didn't change, they didn't restart
[11:09:24] the pre-existing pods didn't get affected as no re-deployment happened. Only when we changed sth else (I think we just deleted a pod) we realized the issue
[11:09:58] isaranto: so the other pods kept serving traffic fine?
[11:10:45] yes as far as I remember. At the moment I regret not capturing all this on phabricator.. I'm going through IRC logs to check
[11:11:13] okok that would make sense
[11:11:58] klausman: not sure, in theory the pod is one unit, the containers inside should work with the same config (either all or nothing)
[11:12:10] one thing that I found is https://github.com/istio/istio/issues/26882#issuecomment-683900991
[11:12:28] that could be relevant, namely Istio doing something at the pod security context level
[11:12:41] pretty sure it was a bug that got fixed
[11:12:54] but, it doesn't explain why it works in staging :D
[11:14:20] There's still the kserve and Linux kernel differences. The latter might have changed a security default?
[11:14:49] could be yes, I haven't found a way to compare the differences though
[11:15:10] IIUC the seccomp stuff is set as a BPF filter, and the kernel enforces it
[11:15:32] docker has its own filter, but I didn't find a way to compare bullseye vs bookworm
[11:16:36] My hopes are slim it's a kernel thing. Defaults like these very rarely change
[11:16:50] Lift-Wing, Machine-Learning-Team, EditCheck: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10600197 (gkyziridis) The `edit-check` dummy service is deployed on staging under the `experimental` namespace. The dummy service receives an API request like the follo...
[11:23:58] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10600215 (elukey) Difference for the Docker's default seccomp profile between Bullseye (docker.io v20.10.5) and Bookworm (docker.io v20.10...
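A syscall-list comparison like the one in the task comment above can be approximated with a short script. This is only a minimal sketch, assuming both profiles follow Docker's seccomp JSON layout (a top-level `syscalls` list whose entries carry `names` and `action`). The two inline profiles are invented toy data, not the real Bullseye or Bookworm defaults, and the helper names `allowed_syscalls` and `diff_profiles` are hypothetical. Conditional rules (argument filters, capability includes/excludes) are ignored here, so a real comparison would need more care.

```python
from __future__ import annotations

import json


def allowed_syscalls(profile: dict) -> set[str]:
    """Collect syscall names from every rule whose action is SCMP_ACT_ALLOW."""
    names: set[str] = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names


def diff_profiles(old: dict, new: dict) -> tuple[list[str], list[str]]:
    """Return (newly allowed, no longer allowed) syscalls, each sorted."""
    old_set, new_set = allowed_syscalls(old), allowed_syscalls(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


# Toy profiles for illustration only (NOT the real distro defaults).
OLD = json.loads("""
{"defaultAction": "SCMP_ACT_ERRNO",
 "syscalls": [{"names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW"}]}
""")
NEW = json.loads("""
{"defaultAction": "SCMP_ACT_ERRNO",
 "syscalls": [{"names": ["read", "write", "openat", "statx"], "action": "SCMP_ACT_ALLOW"},
              {"names": ["ptrace"], "action": "SCMP_ACT_ERRNO"}]}
""")

added, removed = diff_profiles(OLD, NEW)
print("newly allowed:", added)
print("no longer allowed:", removed)
```

With real profiles, the two dictionaries would instead be loaded from the JSON files shipped with each Docker version; a syscall that shows up only in the newer list is a candidate for "works on one OS, denied on the other" behaviour.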
[11:24:02] klausman: the diff for the seccomp syscall list should be https://phabricator.wikimedia.org/T369493#10600215
[11:24:10] but I can't find something that stands out
[11:25:53] The landlock calls can do network limits, but AIUI, you can't use them to unlock stuff
[11:26:20] The rest of the calls don't pertain to network
[11:31:37] klausman: https://github.com/istio/istio/issues/44244
[11:32:02] and clone3 is mentioned
[11:32:07] I think this is the problem
[11:35:56] aaah, yeah, that makes some sense
[11:36:32] Though I haven't seen the errors the original poster of the bug mentions
[11:36:37] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10600273 (elukey) The problem should be https://github.com/istio/istio/issues/44244. In Bullseye's docker version the seccomp default prof...
[11:37:13] maybe they were buried somewhere
[11:37:14] At least I now have an excuse to do All The Re-imaging
[11:38:07] I can help as well
[11:39:20] seems that we have 16 nodes to reimage
[11:39:35] I think I'll start reimaging the eqiad workers this week, see how well that goes.
[11:39:51] there is another complication though, namely the move to containerd
[11:40:18] that is advised to be done via full reimage
[11:40:29] (a partition name changes, etc..)
[11:40:59] so to pack everything in one go, we could move staging to containerd via reimages
[11:41:02] test etc..
[11:41:20] and then proceed with one prod worker, test etc..
[11:41:20] https://wikitech.wikimedia.org/wiki/Kubernetes/Administration/containerd_migration I'd be following this
[11:41:43] exactly
[11:42:04] but better to scope this with isaranto first, it is a sizeable amount of work and it can be done later on
[11:42:07] I can probably do one of the staging workers today, after lunch
[11:42:22] I am happy now that staging is ok now, PSP/PSS config wise
[11:42:30] so the rest can be done even next quarter
[11:42:38] there is no rush
[11:42:49] yeah, at least we definitely have a working setup, we "just" need to make it happen everywhere
[11:42:55] yep yep
[11:43:03] and that will unblock the upgrade to k8s 1.31
[11:43:55] going afk, ttl! o/
[11:44:04] thanks again for all your help!
[11:44:10] * klausman lunch
[13:29:35] (PS1) Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:30:53] (CR) CI reject: [V:-1] inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[13:32:08] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854 (klausman) NEW
[13:32:08] (PS2) Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:40:09] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10600697 (klausman) One additional note: we used to use our own Partman recipe (`partman/custom/kubernetes-node-overlay-large-kubelet.cfg`). Since the larger kubelet partition is al...
[13:48:01] (CR) Gkyziridis: inference-services: Add PydanticModel for requests. (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[14:30:06] (PS1) Kevin Bazira: Makefile: download SQLite db used by article-country [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124443 (https://phabricator.wikimedia.org/T385970)
[14:46:10] (CR) Gkyziridis: [C:+1] "Thnx for working on this." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124443 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[14:47:07] I'll be re-imaging ml-staging2002 to use containerd (instead of Docker) in a moment. Aside from some pod shuffling, it should Just Work™
[14:50:55] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601024 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2002.codfw.wmnet with OS bookworm
[15:01:09] (CR) Kevin Bazira: [V:+2 C:+2] Makefile: download SQLite db used by article-country [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124443 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[15:03:32] (CR) Kevin Bazira: [V:+2 C:+2] "thank you for testing it, George! :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1124443 (https://phabricator.wikimedia.org/T385970) (owner: Kevin Bazira)
[15:15:44] Machine-Learning-Team, Automoderator, Moderator-Tools-Team: Use multilingual revert risk model in Automoderator on supported wikis - https://phabricator.wikimedia.org/T365581#10601125 (Samwalton9-WMF)
[15:15:52] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10601128 (isarantopoulos) p:Triage→Medium
[15:33:43] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601256 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2002.codfw.wmnet with OS bookworm executed with errors: - ml-stag...
[15:58:01] Lift-Wing, Machine-Learning-Team: Fix duplicate wikidata-related predictions and omitted category-related predictions - https://phabricator.wikimedia.org/T387275#10601345 (kevinbazira) Open→Resolved
[15:58:53] Lift-Wing, Machine-Learning-Team: Fix error handling and omission of geographic data in wikidata-related predictions - https://phabricator.wikimedia.org/T387547#10601349 (kevinbazira) Open→Resolved
[16:01:07] (PS4) AikoChou: reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019)
[16:02:37] (PS5) AikoChou: reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019)
[16:03:47] (CR) AikoChou: reference-quality: fix reference models from getting unnecessary data from mwapi (3 comments) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019) (owner: AikoChou)
[16:07:11] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601448 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2002.codfw.wmnet with OS bookworm
[16:31:26] (CR) AikoChou: [C:+2] reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019) (owner: AikoChou)
[16:32:10] (Merged) jenkins-bot: reference-quality: fix reference models from getting unnecessary data from mwapi [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1123304 (https://phabricator.wikimedia.org/T387019) (owner: AikoChou)
[16:48:24] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601797 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2002.codfw.wmnet with OS bookworm completed...
[16:49:50] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601800 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2001.codfw.wmnet with OS bookworm
[16:56:48] klausman: important bit for containerd - you don't have dockerctl anymore on the nodes, it gets replaced by nerdctl
[16:57:00] that supports most of the same formats etc...
[16:57:13] yeah, I saw
[16:57:16] super
[16:57:25] I got confused initially because I didn't remember :D
[16:57:53] `alias dockerctl='echo Use nerdctl'` ;)
[17:24:53] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2001.codfw.wmnet with OS bookworm executed...
[17:25:25] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10601978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2001.codfw.wmnet with OS bookworm
[18:01:41] * isaranto afk
[18:05:44] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10602153 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2001.codfw.wmnet with OS bookworm completed...
[18:09:59] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10602161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm
[18:42:07] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10602258 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm completed...
[19:13:59] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Expose revision revert risk scores in EventStreams - https://phabricator.wikimedia.org/T326179#10602482 (Ottomata) FWIW, I believe that if this task had been done, investigatory work for tasks like {T374440} would be much easier.
[19:46:17] Machine-Learning-Team, EditCheck, Editing-team, VisualEditor: Evaluate efficacy of Peacock Check model output - https://phabricator.wikimedia.org/T384651#10602678 (ppelberg)
[19:50:30] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Expose revision revert risk scores in EventStreams - https://phabricator.wikimedia.org/T326179#10602692 (diego) I'm confused, I think in T374440 they are working just with dumps, nothing like Eventstreams. >>! In T326179#106024...
[20:19:49] Machine-Learning-Team, Data-Engineering, Research, Event-Platform: Expose revision revert risk scores in EventStreams - https://phabricator.wikimedia.org/T326179#10602807 (Ottomata) If revert risk scores were in event streams (lower case, not necessarily stream.wikimedia.org EventStreams service)...