[08:44:04] good morning folks
[09:07:32] morning morning
[09:55:03] hey folks!
[09:55:17] I am going to delete every pod in ml-staging-codfw (one by one, with some delay)
[09:55:23] to see if anything weird pops up
[10:17:10] ack
[10:20:48] Folks o/
[10:20:48] I have a failing test in inference-services.
[10:20:58] https://www.irccloud.com/pastebin/sJmpKcco/
[10:21:27] Is there any way to just rerun the tests?
[10:28:30] georgekyz: I'm not sure if we can just rerun the test, but we can rerun the job here: https://integration.wikimedia.org/ci/job/inference-services-pipeline-llm/
[10:30:28] aiko: thank youuu
[10:39:16] Morning!
[10:42:08] (CR) Gkyziridis: [C:+1] "Rerun the tests." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[10:42:24] (CR) Gkyziridis: [C:+1] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[10:42:28] (CR) Gkyziridis: [C:+1] "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:22:33] (PS6) Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[11:34:21] (CR) Kevin Bazira: "thank you for working on this, George. most of it LGTM, I've left one comment." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:40:46] (PS7) Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[11:50:44] (CR) Gkyziridis: inference-services: add peacock dummy model service (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:54:38] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10574921 (gkyziridis)
[11:58:38] (CR) Kevin Bazira: [C:+1] "nice! feel free to merge :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[11:59:37] (CR) Gkyziridis: [C:+2] inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:00:13] (CR) Gkyziridis: [V:+2 C:+2] "Merging." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: Gkyziridis)
[12:01:31] merging the dummy peacock service. Going for lunch, then I'll check if everything went smoothly.
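The one-by-one pod deletion mentioned at 09:55 could be sketched roughly like this. This is a hypothetical helper, not the actual procedure used: the helper name, the delay value, and driving `kubectl` via `subprocess` are all assumptions.

```python
import subprocess
import time


def delete_pods_one_by_one(namespace, delay_s=30, run=subprocess.run):
    """Delete every pod in `namespace` one at a time, sleeping between
    deletions so any breakage surfaces before the whole namespace churns.
    `run` is injectable so the loop can be exercised without a cluster."""
    result = run(
        ["kubectl", "-n", namespace, "get", "pods",
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True,
    )
    pods = result.stdout.split()
    for pod in pods:
        run(["kubectl", "-n", namespace, "delete", "pod", pod], check=True)
        if delay_s:
            time.sleep(delay_s)  # give controllers time to recreate the pod
    return pods
```

The delay between deletions is the whole point of the exercise: it lets each replacement pod come up (or fail to) before the next one is removed.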
[12:47:31] ml-staging is currently having some problems with knative, so deployments may fail
[12:47:50] lemme know if you have to do anything on staging; I'll try to fix it sooner in that case
[12:59:15] fixed manually, this was the issue: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122129
[12:59:18] sigh
[13:04:18] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10575025 (elukey) I've killed all pods in ml-staging and I found a separate issue for knative (https://gerrit.wikimedia.org/r/1122129), th...
[14:02:53] elukey: that keyword (ALL) being case-sensitive is kinda silly, isn't it?
[14:03:25] klausman: o/ I died a bit inside when I discovered the issue, yes
[14:03:54] reminds me of the super-sensitive syntax of sudoers
[14:05:04] :)
[14:05:13] I didn't find any other issues though
[14:16:44] elukey: btw, is there a way to get the BIOS settings of a running SMC machine? I have a suspicion that the two lab machines have different configs, and that one of the results is reduced CPU<->GPU bandwidth on one of the GPUs of ml-lab1001
[14:18:04] klausman: yes there is; are you worried about ml-labXXXX?
[14:18:10] yes
[14:18:43] okok, lemme get it quickly. We have a spicerack shell for things like these, but it is internal to infra foundations atm; we are planning to release it soonish
[14:19:11] ack, no rush
[14:27:21] (CR) Daimona Eaytoy: [C:+2] Replace isset() with null coalesce on global [extensions/ORES] - https://gerrit.wikimedia.org/r/1119829 (owner: Umherirrender)
[14:40:35] klausman: https://phabricator.wikimedia.org/P73510
[14:41:12] ty! will have a dig
[14:41:43] (Merged) jenkins-bot: Replace isset() with null coalesce on global [extensions/ORES] - https://gerrit.wikimedia.org/r/1119829 (owner: Umherirrender)
[14:41:46] so the bottom of the three is the diff between 1001 and 1002?
[14:42:03] I added two quick checks via sets, but the results are strange: intersection shows something, but "difference" shows nothing
[14:42:04] wait, it says the difference set is empty...
[14:42:19] yeah, I need to check it in a better way; it may not be what I intended to do
[14:42:31] anyway, you have the dumps, you can compare them in a better way :D
[14:43:53] ack
[14:52:08] elukey: looks like they don't differ in anything relevant. Lemme paste what I've found
[14:52:44] https://phabricator.wikimedia.org/P73511
[14:56:13] yep, a loop was better :)
[14:57:33] ahhh I am stupid, set(dict) of course takes only the keys
[14:58:09] * elukey hides in shame
[14:58:38] lemme know if you need anything else; off the top of my head I can't think of any other setting that I could pull from the BMC
[14:58:51] but maybe in the BIOS settings (the manual way) there is something
[14:58:53] if it's any consolation, I had to look up how to range over a dict in Python because I am so out of practice
[14:59:12] ahahahah
[15:00:14] https://phabricator.wikimedia.org/P73512 Here's the discrepancy I am looking at
[15:00:57] Note how on 1001, the transfers between device 0 (CPU) and GPU 1 differ from those between 0 and GPU 2. On 1002, there is no such difference
[15:01:39] There is one difference: kernel version, and machine uptime (83 days for 1001). I think I'll reboot to see if that clears it
[15:21:15] aha! the reboot helped! But now the BMC is in a weird state /o\
[15:21:49] how come?
[15:21:49] elukey: the BMC on 1001 says it's in "FW update mode", but I don't recall doing anything like that. Then again, maybe that was lingering from 83 days ago?
[15:22:21] Only the webui works; the SSH one is in that weird state where e.g. cd /system1 just hangs
[15:39:34] Good morning
[15:41:04] hey Chris
[15:43:09] good morning Chris
[15:43:22] klausman: maybe we need to gently bounce the BMC; there is some doc on wikitech
[15:44:09] I tried to do that (bmc reboot), but the webui doesn't allow it. I'll see if wikitech has something I haven't tried.
[15:46:14] klausman: some info in https://wikitech.wikimedia.org/wiki/Management_Interfaces
[15:46:28] Yeah, currently trying `bmc-device --cold-reset; echo $?`
[15:46:35] returned with a 0, so there's hope
[16:17:47] elukey: ok, the cold reset from within helped. thanks for the hint!
[16:18:09] klausman: the best thing I have in my skillset: try turning it off and on
[16:18:23] also: blame DNS
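The set(dict) pitfall elukey ran into at 14:57 can be reproduced in a few lines. The BIOS-dump contents below are invented for illustration; the point is only the key/value behavior of Python dicts and sets.

```python
# Two hypothetical BIOS-settings dumps (setting names are made up):
lab1001 = {"BootMode": "UEFI", "SR-IOV": "Enabled", "NUMA": "Enabled"}
lab1002 = {"BootMode": "UEFI", "SR-IOV": "Disabled", "NUMA": "Enabled"}

# Pitfall: set(dict) builds a set from the KEYS only, so the
# "difference" of two dumps with identical keys is empty even
# though the values disagree.
assert set(lab1001) - set(lab1002) == set()
# ...while the intersection shows every shared key, which is why
# intersection "showed something" but difference showed nothing.
assert set(lab1001) & set(lab1002) == {"BootMode", "SR-IOV", "NUMA"}

# A plain loop over the items compares the values as well:
diffs = {k: (v, lab1002.get(k))
         for k, v in lab1001.items()
         if lab1002.get(k) != v}
assert diffs == {"SR-IOV": ("Enabled", "Disabled")}
```

This matches the conclusion in the chat: the set-based check was comparing key names, not settings, and "a loop was better".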