[07:19:40] (03CR) 10Kevin Bazira: inference-services: Add PydanticModel for requests. (033 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[08:07:57] hello!
[08:31:52] klausman: o/ I tried to check the status in staging, most of the pods are in crashloop, I am afk this morning but I'll check later on
[08:32:15] it seems as if there was an unclean state during the upgrade, but hopefully we should be able to fix all of them
[08:33:58] I have taken a look: it's the storage-initializers being unable to fetch models
[08:34:32] So the same bug we saw in prod originally
[08:34:39] I'll do some digging
[08:49:25] morning morning
[09:12:00] klausman: lemme know if there is anything I can help with? even just test etc
[09:38:38] (03PS3) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[09:44:00] (03PS4) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[09:46:27] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. (033 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[09:46:40] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[09:56:26] isaranto: ack
[09:56:59] isaranto: atm, my hypothesis is that the security policy is still wonky, but it's a bit of a thicket to untable
[09:57:07] untangle*
[10:19:46] klausman: this morning half of the knative pods were down, I'd start from those
[10:20:05] qq - did you depool all nodes from pods etc.. before reimaging?
[10:20:36] no, I forgot, but depooled 2001 and 2003 before I started reimaging them
[10:21:15] draining pods etc.. or depooled from LVS?
[10:21:26] drained pods with kubectl only
[10:21:26] (trying to get what could have contributed to the instability, no blame)
[10:21:31] okok perfect
[10:22:11] Some knative pods seem to be running, but I don't know yet if they're reachable for other pods
[10:22:48] 2025/03/05 10:22:13 Failed to get k8s version Get "https://10.194.62.1:443/version": dial tcp 10.194.62.1:443: i/o timeout
[10:22:59] ^^^ Stuff like this makes me think we still have a network/policy issue
[10:23:23] I am deleting pods, they are starting up correctly, I think they failed to contact the kube api for some reason
[10:23:48] i.e. pods in the knative-serv NS?
[10:24:12] yep
[10:24:31] if those are not 100% up and running there may be issues creating the isvc ones
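(A rough sketch of the inspect-then-delete loop being described above, assuming the knative pods live in a `knative-serving` namespace; the pod name is illustrative, not taken from the cluster.)
```
# list pods in the knative namespace and spot the crashlooping / not-Ready ones
kubectl get pods -n knative-serving

# check why the previous run of a given pod died (pod name is an example)
kubectl logs -n knative-serving activator-5d4b8c7f9d-abcde --previous

# delete it so the ReplicaSet recreates it and it retries the kube API
kubectl delete pod -n knative-serving activator-5d4b8c7f9d-abcde
```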
[10:24:31] just by kubectl delete?
[10:24:44] yeah, some of them still fail for the kubesvc
[10:24:46] that is weird
[10:26:20] I now see all pods running in that NS, except the activators
[10:26:36] and one webhook restart ~1m ago
[10:28:51] the ones failing are due to failing to contact the kubernetes svc
[10:29:21] some certmanager and kserve NS pods are also crashlooping
[10:30:29] ooh, and I completely missed some pods that were Running, but 0/1 Ready
[10:32:34] I am going to roll restart the kube-apiserver daemons on ctrl nodes
[10:33:20] ack
[10:45:52] (03PS1) 10Kevin Bazira: Makefile: rename reference-need config to reference-quality [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902)
[10:46:28] klausman: so there are multiple core things that don't work, but the primary one is that on 200[12] the calico pod is not up
[10:46:35] that could explain connectivity issues
[10:47:42] it may be related to the move-vlan?
[10:48:09] yeah, that's my suspicion as well
[10:48:21] also note, from the calico-node logs: bird: Node_2620_0_860_11b__1: Error: Bad peer AS: 64811
[10:49:07] it's also a bit odd that the calico-node on 2003 seems fine
[10:50:02] The thing is: the 2003 calico-node pod is 9d old.
[10:50:59] mmmh, let me check something
[10:52:20] yeah, my hypothesis that 2001 and 2002 differ in their physical ToR switch setup from 2003 was not true.
[10:52:30] So that ain't it, either.
[10:53:08] maybe time to poke the networking people? I doubt it's just going to be a calico/BGP issue, but we'll have to start somewhere
[11:16:58] (03CR) 10Kevin Bazira: [C:03+1] inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:39:50] Tried completely draining a node and deleting even the pods that usually remain, then rebooted it, no change
[11:42:12] I'm going for lunch and a walk, maybe the fresh air will jog my brain
[11:44:26] ack
[12:16:48] 06Machine-Learning-Team: Investigate using SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984 (10achou) 03NEW
[12:17:50] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on this." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[12:25:53] elukey: turns out, move-vlan requires a home commit to the relevant ToR switch. Cathal helped me figure it out. AFAICT, staging is now fine again.
[12:25:59] s/home/homer
[12:28:05] \o/
[12:30:49] super! I was busy with another maintenance window sorry :)
[12:31:03] are all the other pods up now?
[12:31:36] yep all up, nice, so containerd seems to work fine as well
[12:32:16] klausman: we can let it bake until next week and then think about reimaging the prod workers at a slow pace
[12:32:28] we can share eqiad/codfw if you want
[12:32:33] ack.
[12:33:01] there are also the ctrl nodes that need to be done but they shouldn't be that problematic
[12:33:12] (they just run calico and they are vms, so no move-vlan etc..)
[12:33:24] One important thing: the change I made for staging (preseed and pulling containerd instead of docker) broke the not-yet-imaged hosts, in that pods were failing etc. I think for a proper rollout to a whole prod cluster, we'd need to make the change staggered
[12:36:11] For staging I could/should have probably just disabled puppet during the update, but I don't think that'd be a good idea if a prod cluster takes 2 weeks
[12:43:16] oh yes for sure
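(For the prod reimages planned above, a minimal sketch of draining a worker before pulling it; the hostname is just an example, and any LVS/conftool depooling would be a separate step on top of this.)
```
# stop new pods landing on the node, then evict the running ones
kubectl cordon ml-serve2001.codfw.wmnet
kubectl drain ml-serve2001.codfw.wmnet --ignore-daemonsets --delete-emptydir-data --timeout=600s

# ... reimage the host ...

# once the node is back and healthy, allow scheduling again
kubectl uncordon ml-serve2001.codfw.wmnet
```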
[12:45:23] (03CR) 10Kevin Bazira: [C:03+2] "thanks for the revewiew :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[12:46:08] (03Merged) 10jenkins-bot: Makefile: rename reference-need config to reference-quality [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124748 (https://phabricator.wikimedia.org/T371902) (owner: 10Kevin Bazira)
[13:00:22] staging pods are now running but the cluster seems unreachable
[13:01:01] example
[13:01:01] ```
[13:01:01] curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-risk:predict" -X POST -d '{"rev_id": 1242378206, "lang": "en"}' -H "Content-Type: application/json" -H "Host: reference-quality.revision-models.wikimedia.org"
[13:01:01] ```
[13:01:33] I can verify that this is the same for all ml-staging services -- I ran httpbb tests and they all failed with the same message
[13:02:03] " Failed to connect to inference-staging.svc.codfw.wmnet port 30443: No route to host"
[13:10:20] nice :D
[13:12:48] and it seems that the docker registry can't be contacted. I tried updating an isvc in the experimental namespace and it can't fetch the new image (updating other things works fine)
[13:14:21] and there is more from the isvc description
[13:14:21] Warning InternalError 17m (x3 over 17m) v1beta1Controllers fails to reconcile predictor: fails to update knative service: Put "https://10.194.62.1:443/apis/serving.knative.dev/v1/namespaces/experimental/services/reference-quality-predictor": dial tcp 10.194.62.1:443: connect: connection refused
[13:15:12] isaranto: that could be a missing network policy though, for knative
[13:15:43] in theory knative needs to be able to contact the registry to fetch the SHA of the images, so it has a more predictable way to create revisions etc..
[13:15:46] (at least IIRC)
[13:16:16] the service was running fine (although not reachable) and had this knative error. When I changed the image it couldn't start
[13:17:04] finishing up a maintenance window, I'll check later on if anything weird is ongoing
[13:22:40] (03CR) 10Gkyziridis: [C:03+2] "Thnx for reviewing." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[13:22:54] (03PS5) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:24:40] Having a look at discovery, I see:
[13:24:43] # confctl --quiet --object-type discovery select 'dnsdisc=inference-staging' get
[13:24:45] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=inference-staging"}
[13:25:24] (03CR) 10Gkyziridis: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100)
[13:26:08] (03Merged) 10jenkins-bot: inference-services: Add PydanticModel for requests. [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1124434 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
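(Side note: one way to tell an LVS/routing problem apart from a broken cluster is to send the same request to a worker node directly instead of the service name, which is essentially the check done a bit further down; the node hostname is the one mentioned later in the log, everything else mirrors the example above.)
```
# via the LVS service name (what fails here with "No route to host")
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/reference-risk:predict" \
  -X POST -d '{"rev_id": 1242378206, "lang": "en"}' \
  -H "Content-Type: application/json" \
  -H "Host: reference-quality.revision-models.wikimedia.org"

# same request straight at a worker node, bypassing LVS
curl "https://ml-staging2001.codfw.wmnet:30443/v1/models/reference-risk:predict" \
  -X POST -d '{"rev_id": 1242378206, "lang": "en"}' \
  -H "Content-Type: application/json" \
  -H "Host: reference-quality.revision-models.wikimedia.org"
```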
[13:29:55] I've restarted the calico pods for the ml-staging-ctrl2* nodes, there were weird issues, probably not the culprit but it may help
[13:30:14] isaranto: the IP that you posted is not the docker registry though, but the kubeapi svc
[13:30:28] ack
[13:31:09] the above confctl command still reports inference-staging as not pooled (I presume that's an automatic thing, I didn't depool anything, and I suspect Luca would have told me if he did)
[13:31:56] klausman: I think it is not an issue since we never used the staging discovery endpoint IIRC, Ilias used the svc endpoint above (that should work)
[13:32:08] ack
[13:34:13] now I suspect that move vlan caused some issue with LVS, since workers and LVS host need to share the same vlan to work properly
[13:35:06] ah, yeah. I had assumed that --move-vlan keeps the vlan tag, but maybe not?
[13:35:08] isaranto: I think it is lvs-related, if you swap inference.svc.codfw.wmnet with ml-staging2001.codfw.wmnet it works
[13:35:40] klausman: very ignorant about what it does, but I assume it caused an issue with LVS
[13:37:04] in the immediate bright new world, LVS is going to use maglev and IPIP encapsulation to avoid sharing any L2 thing
[13:37:31] I've poked Cathal about the VLAN thing, see if he has an idea
[13:37:36] super
[13:37:54] thank you both for looking into this <3
[13:39:43] np, I broke it, I fix it :)
[17:05:23] 06Machine-Learning-Team, 06Data-Engineering, 06Research, 10Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams - https://phabricator.wikimedia.org/T326179#10606508 (10Ottomata)
[17:05:26] 06Machine-Learning-Team, 06Data-Engineering, 06Research, 10Event-Platform: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179#10606512 (10Ottomata)
[17:17:11] * isaranto afk
[17:46:56] ditto \o
[22:10:07] 06Machine-Learning-Team, 06collaboration-services, 10Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10607693 (10HNordeenWMF)
[23:45:14] 06Machine-Learning-Team, 06collaboration-services, 10Wikipedia-iOS-App-Backlog (iOS Release FY2024-25): [Spike] Fetch Topics for Articles in History on iOS app - https://phabricator.wikimedia.org/T379119#10608150 (10HNordeenWMF)