[08:44:04] <aiko>	 good morning folks
[09:07:32] <georgekyz>	 morning morning 
[09:55:03] <elukey>	 hey folks!
[09:55:17] <elukey>	 I am going to delete every pod in ml-staging-codfw (one by one with some delay)
[09:55:23] <elukey>	 to see if anything weird pops up
[10:17:10] <aiko>	 ack
[10:20:48] <georgekyz>	 Folks o/
[10:20:48] <georgekyz>	 I have a failing test in inference-services.
[10:20:58] <georgekyz>	 https://www.irccloud.com/pastebin/sJmpKcco/
[10:21:27] <georgekyz>	 Is there any way to just rerun the tests?
[10:28:30] <aiko>	 georgekyz: I'm not sure if we can just rerun the test, but we can rerun the job here https://integration.wikimedia.org/ci/job/inference-services-pipeline-llm/
[10:30:28] <georgekyz>	 aiko: thank youuu
[10:39:16] <klausman>	 Morning!
[10:42:08] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "Rerun the tests." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[10:42:24] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[10:42:28] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "recheck" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:22:33] <wikibugs>	 (03PS6) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[11:34:21] <wikibugs>	 (03CR) 10Kevin Bazira: "thank you for working on this, George. most of it LGTM, I've left one comment." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:40:46] <wikibugs>	 (03PS7) 10Gkyziridis: inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100)
[11:50:44] <wikibugs>	 (03CR) 10Gkyziridis: inference-services: add peacock dummy model service (031 comment) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:54:38] <wikibugs>	 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 13Patch-For-Review: Create dummy peacock detection model server - https://phabricator.wikimedia.org/T386100#10574921 (10gkyziridis)
[11:58:38] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] "nice! feel free to merge :)" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[11:59:37] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+2] inference-services: add peacock dummy model service [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[12:00:13] <wikibugs>	 (03CR) 10Gkyziridis: [V:03+2 C:03+2] "Merging." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1121398 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis)
[12:01:31] <georgekyz>	 merging dummy peacock service. Going for lunch and then check if everything went smooth.
[12:47:31] <elukey>	 ml-staging is currently having some problems with knative, so deployments may fail
[12:47:50] <elukey>	 lemme know if you have to do anything on staging, I'll try to fix sooner in case
[12:59:15] <elukey>	 fixed manually, this was the issue https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122129
[12:59:18] <elukey>	 sigh
[13:04:18] <wikibugs>	 06Machine-Learning-Team, 07Kubernetes, 13Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10575025 (10elukey) I've killed all pods in ml-staging and I found a separate issue for knative (https://gerrit.wikimedia.org/r/1122129), th...
[14:02:53] <klausman>	 elukey: that keyword (ALL) being case-sensitive is kinda silly, isn't it?
[14:03:25] <elukey>	 klausman: o/ I died a bit inside when I discovered the issue, yes
[14:03:54] <klausman>	 reminds me of the of the super-sensitive syntax of sudoers
[14:05:04] <elukey>	 :)
[14:05:13] <elukey>	 I didn't find any other issue though
[14:16:44] <klausman>	 elukey: bte, is there a way to get the BIOS settings of a running SMC machine? I have a suspicion that the two lab machines have different configs, and that one of the results is reduced CPU<->GPU bandwidth on one of the GPUs of ml-lab1001
[14:18:04] <elukey>	 klausman: yes there is, are you worried about ml-labXXXX? 
[14:18:10] <klausman>	 yes
[14:18:43] <elukey>	 okok, lemme get it quickly, we have a spicerack shell for things like  these but it is internal to infra foundations atm, we are planning to release it soonish
[14:19:11] <klausman>	 ack, no rush
[14:27:21] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:03+2] Replace isset() with null coalesce on global [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1119829 (owner: 10Umherirrender)
[14:40:35] <elukey>	 klausman: https://phabricator.wikimedia.org/P73510
[14:41:12] <klausman>	 ty! will have a digf
[14:41:15] <klausman>	 dig*
[14:41:43] <wikibugs>	 (03Merged) 10jenkins-bot: Replace isset() with null coalesce on global [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1119829 (owner: 10Umherirrender)
[14:41:46] <klausman>	 so the bottom of the three is the diff between 1001 and 1002?
[14:42:03] <elukey>	 I added two quick checks via sets but the results are strange, intersection shows something but "difference" nothing
[14:42:04] <klausman>	 weit, it says the difference set is empty...
[14:42:19] <elukey>	 yeah I need to check it in a better way, may not be what I intended to do
[14:42:31] <elukey>	 anyway, you have the dumps, you can compare in a better way :D
[14:43:53] <klausman>	 qck
[14:52:08] <klausman>	 elukey: looks like they don't differ in anything relevant. Lemme paste what I've found
[14:52:44] <klausman>	 https://phabricator.wikimedia.org/P73511
[14:56:13] <elukey>	 yep a loop was better :)
[14:57:33] <elukey>	 ahhh I am stupid set(dict) of course takes only the keys
[14:58:09] * elukey hides in shame
[14:58:38] <elukey>	 lemme know if you need anything else, on the top of my head I can think about any other setting that I could pull from the BMC
[14:58:51] <elukey>	 but maybe in the BIOS settings (manual way) there is something 
[14:58:53] <klausman>	 if it's any consolation, I had to look up how to range over a dict in Python because I am so out of practice
[14:59:12] <elukey>	 ahahahah
[15:00:14] <klausman>	 https://phabricator.wikimedia.org/P73512 Here's the discrepancy I am looking at
[15:00:57] <klausman>	 Note how on 1001, the transfers between device 0 (CPU) and GPU 1 differ from those between 0 and GPU 2. On 1002, no such difference
[15:01:39] <klausman>	 There is one difference: kernel version, and machine uptime (83 days for 1001). I think I'll reboot to see if that clears it
[15:21:15] <klausman>	 aha! the reboot helped! But now the BMC is in a weird state /o\
[15:21:49] <elukey>	 how come?
[15:21:49] <klausman>	 elukey: the BMC on 1001 says it's in "FW update mode", but I don't recall doing anything like that. Then again, maybe that was lingering from 83 days ago?
[15:22:21] <klausman>	 Only the webui works, the SSH one is in that weird state that e.g. cd /system1 just hangs
[15:39:34] <chrisalbon>	 Good morning
[15:41:04] <klausman>	 hey Chris
[15:43:09] <georgekyz>	 good morning Chris
[15:43:22] <elukey>	 klausman: maybe we need to gently bounce the BMC, there is some doc on wikitech
[15:44:09] <klausman>	 I tried to do that (bmc reboot), but they webui doesn't allow it. I'll see if w'tech has something I haven't tried.
[15:46:14] <elukey>	 klausman: some info in https://wikitech.wikimedia.org/wiki/Management_Interfaces
[15:46:28] <klausman>	 Yeah, currently trying `bmc-device --cold-reset; echo $?`
[15:46:35] <klausman>	 returned with a 0, so there's hope
[16:17:47] <klausman>	 elukey: ok, the cold reset from within helped. thanks for the hint!
[16:18:09] <elukey>	 klausman: the best thing that I have in my skillset, try to turn off/on
[16:18:23] <klausman>	 als: blame DNS
[16:18:26] <klausman>	 also*