[03:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[03:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[03:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:26:02] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, Wikimedia Enterprise - Content Integrity: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11365066 (kevinbazira) @Trokhymovych, following up on T406179#11353371, the revertrisk-wiki...
[06:57:03] (CR) Samwilson: [C:+2] "All filtering functionality appears to be unchanged, in my testing." [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087) (owner: Tim Starling)
[07:09:19] (Merged) jenkins-bot: Remove unused WatchedItemQueryService hooks [extensions/ORES] - https://gerrit.wikimedia.org/r/1200750 (https://phabricator.wikimedia.org/T407087) (owner: Tim Starling)
[07:17:18] good morning :)
[07:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[07:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:06:54] good morning
[08:07:22] morning
[08:23:02] once the pod is working i can give the standalone mode a try
[08:45:15] good morning folks
[08:45:24] thnx for deploying @kevinbazira
[08:47:21] good morning folks
[08:48:09] dpogorzelski: I had a chat with Cathal and he merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1203820, so the next steps should be to deploy that to ml-serve clusters (admin_ng) and then flip on the BGP flag in https://netbox.wikimedia.org/dcim/devices/6344/
[08:48:33] after that we'll need to run homer, a tool that we have on cumin hosts to deploy configs to the network devices
[08:49:05] morning, yep i was looking into the diff just now :)
[08:49:14] we'll have to target the L3 switch lsw1-e8-eqiad, that is where ml-serve1012 is attached
[08:49:34] after that, in theory, the BGP session should be established and Calico should work properly
[08:49:54] Cathal told me that since ml-serve1012 is special, there may be more things to fix
[08:57:44] dpogorzelski: do you want to do the k8s deploy part and then we do the network bits together?
[08:58:18] so calico changes are synced
[08:58:34] sure thing
[08:59:23] the bgp flag seems to be ON in netbox
[09:00:24] yeah I am reading https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1204072
[09:00:37] I think that Cathal already tried it, but we are missing a homer config
[09:00:44] so we'll have to wait for him to be online
[09:00:47] I'll ping you later
[09:01:16] kk
[09:01:59] in the meantime, could you please check that the admin_ng sync status in the various clusters is ok? Just to avoid having different configs etc..
[09:02:06] this one is eqiad-specific so not a big deal
[09:11:08] sure thing, generally there are other changes pending that i'm not familiar with so i avoided applying them but would assume it's safe
[09:13:03] the calico config is synced to other clusters as well now
[09:20:03] yeah you are right, when they pop up we can review them together (or you can do it with Tobias)
[09:20:36] usually there are external-services chart changes, that is a place where we try to keep IPs related to services so adding network policies gets quicker and more DYR
[09:20:38] *DRY
[09:21:00] but other things may need a question in the k8s SIG channel for confirmation, just to be sure
[09:21:13] 99% of the time it is all safe
[09:22:13] kk
[09:23:35] klausman: what do you think? safe to sync? https://www.irccloud.com/pastebin/jmJX69Ch/
[09:23:50] looking
[09:24:45] yeah, LGTM
[09:24:56] just push eqiad first and let it soak like 10m
[09:29:13] kk, massaged in
[09:33:37] klausman: I am chatting with Cathal and there is a surprise for ml-serve1012, namely it is in the analytics VLAN :O
[09:33:46] I don't find traces of it in https://phabricator.wikimedia.org/T393948
[09:34:04] was it a mistake/miscommunication between ML and dcops or do you recall a reason for it?
[09:34:08] Huh. That's... "a little weird"
[09:34:28] the only thing that I can think of is that we weren't sure if the host was for training or serving
[09:34:42] My best guess is that it was a confusion arising from moving ml-lab1002 into that vlan maybe? I don't think I ever gave anyone a hint that 1012 should be there.
[09:35:04] okok, so me and Cathal will fix netbox, and then we'll have to reimage ml-serve1012
[09:35:27] it is probably a good use case to go through the reimage cookbook with dpogorzelski
[09:35:33] Yeah. Thanks for catching that. I presume it wreaked havoc with BGP?
[09:36:25] correct :D
[09:36:32] elukey: there should be an option to the reimage to move vlan btw
[09:36:33] would love to go through the reimage process
[09:36:44] so i can get a complete picture
[09:36:53] --move-vlan 'Call the sre.hosts.move-vlan cookbook to migrate the host to the new VLAN during the reimage.'
[09:37:21] (sorry for the drive-by I happened to be in the neighbourhood while cleaning my IRC unread)
[09:37:30] claime: o/ in this specific case Cathal suggested to fix netbox, lemme ping him again, maybe it is easier this way
[09:37:39] elukey: I think you need to fix netbox first
[09:37:48] Then when you reimage, hit the move-vlan option
[09:38:02] But I don't remember exactly, so yeah check with Cathal
[09:38:36] ideally we should decom the host and re-add it, but it is super cumbersome so we'll do it differently :D
[09:39:10] elukey: iirc using move-vlan during reimage is what we used for the wikikube vlan moves
[09:39:14] Let me check
[09:39:33] claime: so we cannot use it since the move vlan cookbook will fail, the switch to which ml-serve1012 is attached is currently not configured for the analytics vlan etc..
[09:39:57] IIUC it works when the host is already correctly configured in a VLAN, and you move it to another one
[09:40:03] aaaah
[09:40:04] yep yep
[09:40:06] makes sense
[09:58:52] dpogorzelski, klausman - ready for a reimage
[09:59:22] so in theory the host should just do the whole dance without issues, at least when I tried it the first time
[09:59:40] but I'd keep a mgmt console open in a separate shell to confirm what's happening
[09:59:47] ping me if you see anything weird during dhcp etc..
[10:26:27] i'll have to jump out now, will be back after training, can try then
[10:29:55] elukey: we could wait for Dawid to come back, maybe do a screenshare, so he can see what a reimage looks like?
[10:30:10] though not sure how interesting that would be
[10:30:28] He should be back sometime after 13:00
[10:38:40] klausman: oh yes definitely please do it with him, it is not urgent and a good occasion to go through reimage etc.. I am oncall today so ping me if anything weird comes up!
[10:38:52] roger that!
[10:39:42] o/
[10:39:42] thanks for the review George.
[10:39:42] going to deploy rrwikidata in LW staging ...
[10:40:33] 🤞
[10:40:40] elukey: just to confirm: it'll be a normal reimage, the netbox/decom/recom stuff is already done?
[10:40:52] yes correct
[10:40:54] ty!
[10:41:08] we had to wipe the DNS records so hopefully it will not be an issue
[10:41:12] but in case let me know
[10:41:16] ack
[10:42:02] interestingly enough, ml-serve1013 doesn't seem to be in the analytics vlan https://netbox.wikimedia.org/dcim/devices/6345/interfaces/
[10:45:18] Lift-Wing, Machine-Learning-Team, Wikidata, Wikimedia Enterprise, and 2 others: Request to host Wikidata Revert Risk on Lift Wing - https://phabricator.wikimedia.org/T406179#11365751 (kevinbazira) As we prepare to run load tests, the revertrisk-wikidata isvc has been deployed in LiftWing staging:...
[10:45:42] rrwikidata is up and running in staging: https://phabricator.wikimedia.org/T406179#11365751
[10:48:11] 🎉
[11:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[11:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[11:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:32:03] Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366230 (achou) > With respect to GRANTs, is it safe to assume that MODIFY is sufficient? There is no requirement to do reads here, is there?...
[12:32:08] back
[12:32:33] klausman: if we want to show me around i'm available
[12:32:52] alright. let me set up some stuff
[12:32:58] cool
[13:04:45] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366393 (gkyziridis) ==== Update ==== The issue I am facing for reproducing the error is that we are logging the incoming request if it is successful (status code 200), but...
[13:27:39] elukey: 1012 is reinstalled and back in the cluster. There was one speedbump (/usr/libexec/cpupower is a script that is installed without +x, and so puppet fails starting the systemd service. I've fixed it manually, will also make a phab task)
[13:29:17] klausman: okok nice! No need for a phab task, we can just fix it in puppet
[13:30:11] on trixie we don't have cpufrequtils
[13:30:36] so I added an override to use linux-cpupower
[13:30:59] the debian package doesn't provide /usr/libexec/cpupower yet, so we have it in puppet
[13:31:13] it is a matter of adding exec perms in the file definition
[13:31:25] do you want to send the code review? Otherwise I'll do it
[13:32:14] dpogorzelski, klausman - keep in mind that https://github.com/ROCm/amdsmi/issues/132 still affects us, now that we have reimaged
[13:32:33] I modified the python files manually to unblock us, until upstream decides to fix it properly
[13:32:50] Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366521 (BWojtowicz-WMF) > When the service starts, Lift Wing will validate whether the target table exists, so we'll need SELECT as well. @BW...
[13:33:17] elukey: will make patch, but I gotta leave for a doc appt soon
[13:33:30] I'll file it don't worry
[13:37:23] oh, also:
[13:37:37] [ 53.555005] amdgpu 0000:46:00.0: probe with driver amdgpu failed with error -22
[13:37:41] also more in dmesg
[13:37:51] But I think that is a KI
[13:38:47] the GPUs are not recognized
[13:39:29] yeah, I just saw.
[13:39:36] I gtg, be back in 1h+
[13:43:33] going to try two things
[13:43:37] 1) regular reboot
[13:43:41] 2) powercycle via BMC
[13:43:55] the latter has proven effective over time when GPUs were not recognized
[13:49:36] reboot didn't work, trying powercycle
[13:54:07] Machine-Learning-Team: Fix logging on Revertrisk - https://phabricator.wikimedia.org/T409931 (gkyziridis) NEW
[14:02:10] exactly, the powercycle fixed it
[14:02:41] so I am almost sure that those hosts have an interaction between the BMC and the GPUs
[14:02:48] that sometimes ends up in this situation
[14:14:30] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366745 (elukey) @gkyziridis I am still puzzled by the main exception: I checked [[ https://github.com/kserve/kserve/blob/release-0.13/python/kserve/kserve/model.py#L296...
[14:28:18] Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11366792 (Eevans) >>! In T409850#11366521, @BWojtowicz-WMF wrote: >> When the service starts, Lift Wing will validate whether the target table...
[14:29:20] Machine-Learning-Team, Essential-Work: Revertrisk multilingual predictor returning 500s - https://phabricator.wikimedia.org/T409657#11366793 (elukey) The client seems to be a single one from the [[ https://logstash.wikimedia.org/app/dashboards#/view/138271f0-40ce-11ed-bb3e-0bc9ce387d88?_g=h@219537e&_a=h@...
[14:29:59] so the BGP config for this host is special https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1204587
[14:30:17] I'll wait for Cathal to review but after this we should get the BGP session running
[15:17:54] dpogorzelski: aya is now crashlooping but in the kserve container, there seems to be a stack trace related to the model
[15:17:58] so calico works :)
[15:26:10] A friend of mine recently observed that when debugging, we spend a lot more time _changing the error message_ than making it go away entirely ;)
[15:26:38] :)
[15:42:54] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 2 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11367247 (Samwalton9-WMF)
[15:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[15:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[15:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:14:36] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 2 others: WE 1.3.4 Roll out Revert Risk Filters to Wikis that don't have damaging/goodfaith Edit Models - https://phabricator.wikimedia.org/T408388#11367732 (DMburugu)
[17:23:59] Machine-Learning-Team: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11367768 (tchin) > what work is required to produce mediawiki.page_content_change.v1 to Kafka main? I'm expecting just some helmfile changes, for example, in the mw-page-content-ch...
[19:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[19:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[19:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[20:23:46] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11368448 (RobH) IRC Update from chat with Tobias: ml-cache1002 & ml-serve1004 have both been migrated to the new switch stacks. ml-serve1003 will be drained t...
[22:11:59] Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11368806 (Eevans) Open→Resolved a:Eevans Ok, a new role has been created: `revise_tone_task_generator`, and it has been given `...
[22:12:15] Machine-Learning-Team, Data-Persistence, Patch-For-Review: Cassandra role & grants for Lift Wing isvc integration - https://phabricator.wikimedia.org/T409850#11368810 (Eevans)
[23:52:04] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:52:04] Deployment aya-llm-predictor-00001-deployment in llm at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=llm&var-deployment=aya-llm-predictor-00001-deployment - ...
[23:52:04] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas