[07:19:40] (03CR) 10Kevin Bazira: inference-services: Add PydanticModel for requests. (033 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[08:07:57] hello!
[08:31:52] klausman: o/ I tried to check the status in staging, most of the pods are in crashloop, I am afk this morning but I'll check later on
[08:32:15] it seems as if there was an unclean state during the upgrade, but hopefully we should be able to fix all of them
[08:33:58] I have taken a look: it's the storage-initializers being unable to fetch models
[08:34:32] So the same bug we saw in prod originally
[08:34:39] I'll do some digging
[08:49:25] morning morning
[09:12:00] klausman: lemme know if there is anything I can help with? even just test etc
[09:38:38] (03PS3) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[09:44:00] (03PS4) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[09:46:27] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. (033 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[09:46:40] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[09:56:26] isaranto: ack
[09:56:59] isaranto: atm, my hypothesis is that the security policy is still wonky, but it's a bit of a thicket to untable
[09:57:07] untangle*
[10:19:46] klausman: this morning half of the knative pods were down, I'd start from those
[10:20:05] qq - did you depool all nodes from pods etc.. before reimaging?
[10:20:36] no, I forgot, but depooled 2001 and 2003 before I started reimaging them
[10:21:15] draining pods etc.. or depooled from LVS?
[10:21:26] drained pods with kubectl only
[10:21:26] (trying to get what could have contributed to the instability, no blame)
[10:21:31] okok perfect
[10:22:11] Some knative pods seem to be running, but I don't know yet if they're reachable for other pods
[10:22:48] 2025/03/05 10:22:13 Failed to get k8s version Get "https://10.194.62.1:443/version": dial tcp 10.194.62.1:443: i/o timeout
[10:22:59] ^^^ Stuff like this makes me think we still have a network/policy issue
[10:23:23] I am deleting pods, they are starting up correctly, I think they failed to contact the kube api for some reason
[10:23:48] i.e. pods in the knative-serv NS?
[10:24:12] yep
[10:24:31] if those are not 100% up and running there may be issues creating the isvc ones
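(A rough sketch of the inspect-then-delete loop being described above, assuming the knative pods live in a `knative-serving` namespace; the pod name is illustrative, not taken from the cluster.)
```
# list pods in the knative namespace and spot the crashlooping / not-Ready ones
kubectl get pods -n knative-serving

# check why the previous run of a given pod died (pod name is an example)
kubectl logs -n knative-serving activator-5d4b8c7f9d-abcde --previous

# delete it so the ReplicaSet recreates it and it retries the kube API
kubectl delete pod -n knative-serving activator-5d4b8c7f9d-abcde
```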
[10:24:31] just by kubectl delete?
[10:24:44] yeah, some of them still fail for the kubesvc
[10:24:46] that is weird
[10:26:20] I now see all pods running in that NS, except the activators
[10:26:36] and one webhook restart ~1m ago
[10:28:51] the ones failing are due to failing to contact the kubernetes svc
[10:29:21] some certmanager and kserve NS pods are also crashlooping
[10:30:29] ooh, and I completely missed some pods that were Running, but 0/1 Ready
[10:32:34] I am going to roll restart the kube-apiserver daemons on ctrl nodes
[10:33:20] ack
[10:45:52] (03PS1) 10Kevin Bazira: Makefile: rename reference-need config to reference-quality [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902)
[10:46:28] klausman: so there are multiple core things that don't work, but the primary one is that on 200[12] the calico pod is not up
[10:46:35] that could explain connectivity issues
[10:47:42] it may be related to the move-vlan?
[10:48:09] yeah, that's my suspicion as well
[10:48:21] also note, from the calico-node logs: bird: Node_2620_0_860_11b__1: Error: Bad peer AS: 64811
[10:49:07] it's also a bit odd that the calico-node on 2003 seems fine
[10:50:02] The thing is: the 2003 calico-node pod is 9d old.
[10:50:59] mmmh, let me check something
[10:52:20] yeah, my hypothesis that 2001 and 2002 differ in their physical ToR switch setup from 2003 was not true.
[10:52:30] So that ain't it, either.
[10:53:08] maybe time to poke the networking people? I doubt it's just going to be a calico/BGP issue, but we'll have to start somewhere
[11:16:58] (03CR) 10Kevin Bazira: [C:03+1] inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:39:50] Tried completely draining a node and deleting even the pods that usually remain, then rebooted it, no change
[11:42:12] I'm going for lunch and a walk, maybe the fresh air will jog my brain
[11:44:26] ack
[12:16:48] 06Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984 (10achou) 03NEW
[12:17:50] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on this." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[12:25:53] elukey: turns out, move-vlan requires a home commit to the relevant ToR switch. Cathal helped me figure it out. AFAICT, staging is now fine again.
[12:25:59] s/home/homer
[12:28:05] \o/
[12:30:49] super! I was busy with another maintenance window sorry :)
[12:31:03] are all the other pods up now?
[12:31:36] yep all up, nice, so containerd seems to work fine as well
[12:32:16] klausman: we can let it bake until next week and then think about reimaging the prod workers at a slow pace
[12:32:28] we can share eqiad/codfw if you want
[12:32:33] ack.
[12:33:01] there are also the ctrl nodes that need to be done but they shouldn't be that problematic
[12:33:12] (they just run calico and they are vms, so no move-vlan etc..)
[12:33:24] One important thing: the change I made for staging (preseed and pulling containerd instead of docker) broke the not-yet-imaged hosts, in that pods were failing etc. I think for a proper rollout to a whole prod cluster, we'd need to make the change staggered
[12:36:11] For staging I could/should have probably just disabled puppet during the update, but I don't think that'd be a good idea if a prod cluster takes 2 weeks
[12:43:16] oh yes for sure
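(For the prod reimages planned above, a minimal sketch of draining a worker before pulling it; the hostname is just an example, and any LVS/conftool depooling would be a separate step on top of this.)
```
# stop new pods landing on the node, then evict the running ones
kubectl cordon ml-serve2001.codfw.wmnet
kubectl drain ml-serve2001.codfw.wmnet --ignore-daemonsets --delete-emptydir-data --timeout=600s

# ... reimage the host ...

# once the node is back and healthy, allow scheduling again
kubectl uncordon ml-serve2001.codfw.wmnet
```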
[12:45:23] (03CR) 10Kevin Bazira: [C:03+2] "thanks for the revewiew :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[12:46:08] (03Merged) 10jenkins-bot: Makefile: rename reference-need config to reference-quality [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[13:00:22] staging pods are now running but the cluster seems unreachable
[13:01:01] example
[13:01:01] ```
[13:01:01] curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-risk:predict" -X POST -d '{"rev_id": 1242378206, "lang": "en"}' -H "Content-Type: application/json" -H "Host: reference-quality.revision-models.wikimedia.org"
[13:01:01] ```
[13:01:33] I can verify that this is the same for all ml-staging services -- I ran httpbb tests and they all failed with the same message
[13:02:03] " Failed to connect to inference-staging.svc.codfw.wmnet port 30443: No route to host"
[13:10:20] nice :D
[13:12:48] and it seems that the docker registry can't be contacted. I tried updating an isvc in the experimental namespace and it can't fetch the new image (updating other things works fine)
[13:14:21] and there is more from the isvc description
[13:14:21] Warning InternalError 17m (x3 over 17m) v1beta1Controllers fails to reconcile predictor: fails to update knative service: Put "https://10.194.62.1:443/apis/serving.knative.dev/v1/namespaces/experimental/services/reference-quality-predictor": dial tcp 10.194.62.1:443: connect: connection refused
[13:15:12] isaranto: that could be a missing network policy though, for knative
[13:15:43] in theory knative needs to be able to contact the registry to fetch the SHA of the images, so it has a more predictable way to create revisions etc..
[13:15:46] (at least IIRC)
[13:16:16] the service was running fine (although not reachable) and had this knative error. When I changed the image it couldn't start
[13:17:04] finishing up a maintenance window, I'll check later on if anything weird is ongoing
[13:22:40] (03CR) 10Gkyziridis: [C:03+2] "Thnx for reviewing." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[13:22:54] (03PS5) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:24:40] Having a look at discovery, I see:
[13:24:43] # confctl --quiet --object-type discovery select 'dnsdisc=inference-staging' get
[13:24:45] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=inference-staging"}
[13:25:24] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:26:08] (03Merged) 10jenkins-bot: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
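(Side note: one way to tell an LVS/routing problem apart from a broken cluster is to send the same request to a worker node directly instead of the service name, which is essentially the check done a bit further down; the node hostname is the one mentioned later in the log, everything else mirrors the example above.)
```
# via the LVS service name (what fails here with "No route to host")
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-risk:predict" \
  -X POST -d '{"rev_id": 1242378206, "lang": "en"}' \
  -H "Content-Type: application/json" \
  -H "Host: reference-quality.revision-models.wikimedia.org"

# same request straight at a worker node, bypassing LVS
curl "https://ml-staging2001.codfw.wmnet:30443/v1/models/reference-risk:predict" \
  -X POST -d '{"rev_id": 1242378206, "lang": "en"}' \
  -H "Content-Type: application/json" \
  -H "Host: reference-quality.revision-models.wikimedia.org"
```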
[13:29:55] I've restarted the calico pods for the ml-staging-ctrl2* nodes, there were weird issues, probably not the culprit but it may help
[13:30:14] isaranto: the IP that you posted is not the docker registry though, but the kubeapi svc
[13:30:28] ack
[13:31:09] the above confctl command still reports inference-staging as not pooled (I presume that's an automatic thing, I didn't depool anything, and I suspect Luca would have told me if he did)
[13:31:56] klausman: I think it is not an issue since we never used the staging discovery endpoint IIRC, Ilias used the svc endpoint above (that should work)
[13:32:08] ack
[13:34:13] now I suspect that move vlan caused some issue with LVS, since workers and LVS host need to share the same vlan to work properly
[13:35:06] ah, yeah. I had assumed that --move-vlan keeps the vlan tag, but maybe not?
[13:35:08] isaranto: I think it is lvs-related, if you swap inference.svc.codfw.wmnet with ml-staging2001.codfw.wmnet it works
[13:35:40] klausman: very ignorant about what it does, but I assume it caused an issue with LVS
[13:37:04] in the immediate bright new world, LVS is going to use maglev and IPIP encapsulation to avoid sharing any L2 thing
[13:37:31] I've poked Cathal about the VLAN thing, see if he has an idea
[13:37:36] super
[13:37:54] thank you both for looking into this <3
[13:39:43] np, I broke it, I fix it :)
[17:05:23] 06Machine-Learning-Team, 06Data-Engineering, 06Research, 10Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams - https://phabricator.wikimedia.org/T326179#10606508 (10Ottomata)
[17:05:26] 06Machine-Learning-Team, 06Data-Engineering, 06Research, 10Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10606512 (10Ottomata)
[17:17:11] * isaranto afk
[17:46:56] ditto \o
[22:10:07] 06Machine-Learning-Team, 06collaboration-services, 10Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10607693 (10HNordeenWMF)
[23:45:14] 06Machine-Learning-Team, 06collaboration-services, 10Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10608150 (10HNordeenWMF)