[03:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[03:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:26:02] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11365066 (kevinbazira) @Trokhymovych, following up on T406179#11353371, the revertrisk-wiki...
[06:57:03] <wikibugs>	 (CR) Samwilson: [C:+2] "All filtering functionality appears to be unchanged, in my testing." [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087) (owner: Tim Starling)
[07:09:19] <wikibugs>	 (Merged) jenkins-bot: Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087) (owner: Tim Starling)
[07:17:18] <bartosz>	 good morning :) 
[07:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[07:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:06:54] <ozge_>	 good morning
[08:07:22] <dpogorzelski>	 morning
[08:23:02] <dpogorzelski>	 once the pod is working i can give the standalone mode a try
[08:45:15] <georgekyz>	 good morning folks
[08:45:24] <georgekyz>	 thnx for deploying @kevinbazira 
[08:47:21] <elukey>	 good morning folks
[08:48:09] <elukey>	 dpogorzelski: I had a chat with Cathal and he merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203820, so the next steps should be to deploy that to ml-serve clusters (admin_ng) and then flip the BGP flag to on in https://netbox.wikimedia.org/dcim/devices/6344/
[08:48:33] <elukey>	 after that we'll need to run homer, a tool that we have on cumin hosts to deploy configs to the network devices
[08:49:05] <dpogorzelski>	 morning, yep i was looking into the diff just now :) 
[08:49:14] <elukey>	 we'll have to target the L3 switch lsw1-e8-eqiad, which is where ml-serve1012 is attached
[08:49:34] <elukey>	 after that, in theory, the BGP session should be established and Calico should work properly
[08:49:54] <elukey>	 Cathal told me that since ml-serve1012 is special, there may be more things to fix
[08:57:44] <elukey>	 dpogorzelski: do you want to do the k8s deploy part and then we do the network bits together?
[08:58:18] <dpogorzelski>	 so calico changes are synced 
[08:58:34] <dpogorzelski>	 sure thing
[08:59:23] <dpogorzelski>	 the bgp flag seems to be ON in netbox
[09:00:24] <elukey>	 yeah I am reading https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1204072
[09:00:37] <elukey>	 I think that Cathal already tried it, but we miss a homer config
[09:00:44] <elukey>	 so we'll have to wait for him to be online
[09:00:47] <elukey>	 I'll ping you later
[09:01:16] <dpogorzelski>	 kk
[09:01:59] <elukey>	 in the meantime, could you please check that the admin_ng sync status in the various clusters is ok? Just to avoid having different configs etc..
[09:02:06] <elukey>	 this one is eqiad-specific so not a big deal
[09:11:08] <dpogorzelski>	 sure thing, generally there are other changes pending that i'm not familiar with so i avoided applying them but would assume it's safe
[09:13:03] <dpogorzelski>	 the calico config is synced to other clusters as well now
[09:20:03] <elukey>	 yeah you are right, when they pop up we can review them together (or you can do it with Tobias)
[09:20:36] <elukey>	 usually there are external-services chart changes, that is a place where we try to keep IPs related to services so that adding network policies gets quicker and more DRY
[09:21:00] <elukey>	 but other things may need a question in the k8s SIG channel for confirmation, just to be sure
[09:21:13] <elukey>	 99% of the time it is all safe
[09:22:13] <dpogorzelski>	 kk
[09:23:35] <dpogorzelski>	 klausman: what do you think? safe to sync? https://www.irccloud.com/pastebin/jmJX69Ch/
[09:23:50] <klausman>	 looking
[09:24:45] <klausman>	 yeah, LGTM
[09:24:56] <klausman>	 just push eqiad first and let it soak like 10m
[09:29:13] <dpogorzelski>	 kk, massaged in
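The "push eqiad first and let it soak" advice above can be sketched roughly as follows. This is only an illustration in Python, not the tooling used in the channel: `staged_rollout`, `deploy`, `healthy` and the cluster name `other-cluster` are hypothetical names, and the 600 second default simply mirrors the "like 10m" suggestion.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    clusters: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Deploy to the first cluster, let it soak, then continue with the rest.

    Returns the clusters that were deployed, in order. Stops after the first
    cluster if it is not healthy once the soak period has elapsed.
    """
    if not clusters:
        return []
    canary, rest = clusters[0], list(clusters[1:])
    deploy(canary)
    done = [canary]
    sleep(soak_seconds)  # the "let it soak" step
    if not healthy(canary):
        return done  # leave the remaining clusters untouched
    for cluster in rest:
        deploy(cluster)
        done.append(cluster)
    return done


if __name__ == "__main__":
    # Hypothetical callbacks; a real check would look at alerts or dashboards.
    deployed = staged_rollout(
        ["eqiad", "other-cluster"],
        deploy=lambda c: print(f"deploying to {c}"),
        healthy=lambda c: True,
        soak_seconds=0,
    )
    print(deployed)
```

Passing `sleep` as a parameter keeps the soak step easy to exercise without actually waiting.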
[09:33:37] <elukey>	 klausman: I am chatting with Cathal and there is a surprise for ml-serve1012, namely it is in the analytics VLAN :O
[09:33:46] <elukey>	 I don't find traces of it in https://phabricator.wikimedia.org/T393948
[09:34:04] <elukey>	 was it a mistake/miscommunication between ML and dcops or do you recall a reason for it?
[09:34:08] <klausman>	 Huh. That's... "a little weird"
[09:34:28] <elukey>	 the only thing that I can think of is that we weren't sure if the host was for training or serving
[09:34:42] <klausman>	 My best guess is that it was a confusion arising from moving ml-lab1002 into that vlan maybe? I don't think I ever gave anyone a hint that 1012 should be there.
[09:35:04] <elukey>	 okok, so me and Cathal will fix netbox, and then we'll have to reimage ml-serve1012
[09:35:27] <elukey>	 it is probably a good use case to go through the reimage cookbook with dpogorzelski 
[09:35:33] <klausman>	 Yeah. Thanks for catching that. I presume it wreaked havoc with BGP?
[09:36:25] <elukey>	 correct :D
[09:36:32] <claime>	 elukey: there should be an option to the reimage to move vlan btw
[09:36:33] <dpogorzelski>	 would love to go through the reimage process
[09:36:44] <dpogorzelski>	 so i can get a complete picture 
[09:36:53] <claime>	 --move-vlan 'Call the sre.hosts.move-vlan cookbook to migrate the host to the new VLAN during the reimage.'
[09:37:21] <claime>	 (sorry for the drive-by I happened to be in the neighbourhood while cleaning my IRC unread)
[09:37:30] <elukey>	 claime: o/ in this specific case Cathal suggested to fix netbox, lemme ping him again, maybe it is easier this way
[09:37:39] <claime>	 elukey: I think you need to fix netbox first
[09:37:48] <claime>	 Then when you reimage, hit the move-vlan option
[09:38:02] <claime>	 But I don't remember exactly, so yeah check with Cathal
[09:38:36] <elukey>	 ideally we should decom the host and re-add it, but it is super cumbersome so we'll do it differently :D
[09:39:10] <claime>	 elukey: iirc using move-vlan during reimage is what we used for the wikikube vlan moves
[09:39:14] <claime>	 Let me check
[09:39:33] <elukey>	 claime: so we cannot use it since the move vlan cookbook will fail, the switch to which ml-serve1012 is attached is currently not configured for the analytics vlan etc..
[09:39:57] <elukey>	 IIUC it works when the host is already correctly configured in a VLAN, and you move it to another one
[09:40:03] <claime>	 aaaah
[09:40:04] <claime>	 yep yep
[09:40:06] <claime>	 makes sense
[09:58:52] <elukey>	 dpogorzelski, klausman - ready for a reimage
[09:59:22] <elukey>	 so in theory the host should just do the whole dance without issues, at least when I tried it the first time
[09:59:40] <elukey>	 but I'd keep a mgmt console open in a separate shell to confirm what's happening
[09:59:47] <elukey>	 ping me if you see anything weird during dhcp etc..
[10:26:27] <dpogorzelski>	 i'll have to jump out now, will be back after training, can try then
[10:29:55] <klausman>	 elukey: we could wait for Dawid to come back, maybe do a screenshare, so he can see what a reimage looks like?
[10:30:10] <klausman>	 though not sure how interesting that would be
[10:30:28] <klausman>	 He should be back sometime after 13:00
[10:38:40] <elukey>	 klausman: oh yes definitely please do it with him, it is not urgent and a good occasion to go through reimage etc.. I am oncall today so ping me if anything weird comes up!
[10:38:52] <klausman>	 roger that!
[10:39:42] <kevinbazira>	 o/
[10:39:42] <kevinbazira>	 thanks for the review George.
[10:39:42] <kevinbazira>	 going to deploy rrwikidata in LW staging ...
[10:40:33] <georgekyz>	 🤞
[10:40:40] <klausman>	 elukey: just to confirm: it'll be a normal reimage, and the netbox/decom/recom stuff is already done?
[10:40:52] <elukey>	 yes correct
[10:40:54] <klausman>	 ty!
[10:41:08] <elukey>	 we had to wipe the DNS records so hopefully it will not be an issue
[10:41:12] <elukey>	 but in case let me know
[10:41:16] <klausman>	 ack
[10:42:02] <elukey>	 interestingly enough, ml-serve1013 doesn't seem to be in the analytics vlan https://netbox.wikimedia.org/dcim/devices/6345/interfaces/
[10:45:18] <wikibugs>	 Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11365751 (kevinbazira) As we prepare to run load tests, the revertrisk-wikidata isvc has been deployed in LiftWing staging:...
[10:45:42] <kevinbazira>	 rrwikidata is up and running in staging: https://phabricator.wikimedia.org/T406179#11365751
[10:48:11] <georgekyz>	 🎉
[11:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:32:03] <wikibugs>	 Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366230 (achou) > With respect to GRANTs, is it safe to assume that MODIFY is sufficient? There is no requirement to do reads here, is there?...
[12:32:08] <dpogorzelski>	 back
[12:32:33] <dpogorzelski>	 klausman: if you want to show me around i'm available
[12:32:52] <klausman>	 alright. let me set up some stuff
[12:32:58] <dpogorzelski>	 cool
[13:04:45] <wikibugs>	 Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366393 (gkyziridis) ==== Update ==== The issue I am facing for reproducing the error is that we are logging the incoming request if it is successful (status code 200), but...
[13:27:39] <klausman>	 elukey: 1012 is reinstalled and back in the cluster. There was one speedbump (/usr/libexec/cpupower is a script that is installed without +x, and so puppet fails starting the systemd service. I've fixed it manually, will also make a phab task)
[13:29:17] <elukey>	 klausman: okok nice! No need for a phab task, we can just fix it in puppet
[13:30:11] <elukey>	 on trixie we don't have cpufrequtils
[13:30:36] <elukey>	 so I added an override to use linux-cpupower
[13:30:59] <elukey>	 the debian package doesn't provide /usr/libexec/cpupower yet, so we have it in puppet
[13:31:13] <elukey>	 it is a matter of adding exec perms in the file definition
[13:31:25] <elukey>	 do you want to send the code review? Otherwise I'll do it
[13:32:14] <elukey>	 dpogorzelski, klausman - keep in mind that https://github.com/ROCm/amdsmi/issues/132 still affects us, now that we have reimaged
[13:32:33] <elukey>	 I modified the python files manually to unblock us, until upstream decides to fix it properly
[13:32:50] <wikibugs>	 Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366521 (BWojtowicz-WMF) > When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BW...
[13:33:17] <klausman>	 elukey: will make patch, but I gotta leave for a doc appt soon
[13:33:30] <elukey>	 I'll file it don't worry
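The cpupower speedbump discussed above (a script installed without +x, so the systemd service cannot start it) can be illustrated with a small local check. This is a sketch in Python, not the Puppet change mentioned in the channel; `ensure_executable` is a hypothetical helper name.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Add execute bits wherever read bits are set; return True if changed."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Shift the read bits (0o444) down to the execute positions (0o111), so
    # each class (user/group/other) that can read the file can also run it.
    exec_bits = (mode & 0o444) >> 2
    new_mode = mode | exec_bits
    if new_mode == mode:
        return False
    os.chmod(path, new_mode)
    return True
```

For a file with mode 0644 this yields 0755, and calling it again is a no-op. The durable fix described in the conversation is to set the exec permission in the Puppet file definition, so the state survives the next reimage.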
[13:37:23] <klausman>	 oh, also:
[13:37:37] <klausman>	 [   53.555005] amdgpu 0000:46:00.0: probe with driver amdgpu failed with error -22
[13:37:41] <klausman>	 also more in dmesg
[13:37:51] <klausman>	 But I think that is a KI
[13:38:47] <elukey>	 the GPUs are not recognized
[13:39:29] <klausman>	 yeah, I just saw.
[13:39:36] <klausman>	 I gtg, be back in 1h+
[13:43:33] <elukey>	 going to try two things
[13:43:37] <elukey>	 1) regular reboot
[13:43:41] <elukey>	 2) powercycle via BMC
[13:43:55] <elukey>	 the latter has proven effective over time when GPUs were not recognized
[13:49:36] <elukey>	 reboot didn't work, trying powercycle
[13:54:07] <wikibugs>	 Machine-Learning-Team: Fix logging on Revertrisk - https://phabricator.wikimedia.org/T409931 (gkyziridis) NEW
[14:02:10] <elukey>	 exactly, the powercycle fixed it
[14:02:41] <elukey>	 so I am almost sure that those hosts have an interaction between the BMC and the GPUs
[14:02:48] <elukey>	 that sometimes ends up in this situation
[14:14:30] <wikibugs>	 Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366745 (elukey) @gkyziridis I am still puzzled by the main exception:   I checked [[ https://github.com/kserve/kserve/blob/release-0.13/python/kserve/kserve/model.py#L296...
[14:28:18] <wikibugs>	 Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366792 (Eevans) >>! In T409850#11366521, @BWojtowicz-WMF wrote: >> When the service starts, Lift Wing will validate whether the target table...
[14:29:20] <wikibugs>	 Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366793 (elukey) The client seems to be a single one from the [[ https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@219537e&_a=h@...
[14:29:59] <elukey>	 so the BGP config for this host is special https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1204587
[14:30:17] <elukey>	 I'll wait for Cathal to review but after this we should get the BGP session running
[15:17:54] <elukey>	 dpogorzelski: aya is now crashlooping but in the kserve container, there seems to be a stack trace related to the model
[15:17:58] <elukey>	 so calico works :)
[15:26:10] <klausman>	 A friend of mine recently observed that when debugging, we spend a lot more time _changing the error message_ than making it go away entirely ;)
[15:26:38] <elukey>	 :)
[15:42:54] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11367247 (Samwalton9-WMF)
[15:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[15:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:14:36] <wikibugs>	 Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11367732 (DMburugu)
[17:23:59] <wikibugs>	 Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11367768 (tchin) > what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-ch...
[19:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[19:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[19:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:23:46] <wikibugs>	 Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11368448 (RobH) IRC Update from chat with Tobias:  ml-cache1002 & ml-serve1004 have both been migrated to the new switch stacks. ml-serve1003 will be drained t...
[22:11:59] <wikibugs>	 Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11368806 (Eevans) Open→Resolved a:Eevans Ok, a new role has been created: `revise_tone_task_generator`, and it has been given `...
[22:12:15] <wikibugs>	 Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11368810 (Eevans)
[23:52:04] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:52:04] <jinxer-wm>	 Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[23:52:04] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas