[15:28:42] We just dropped our memory usage by 1/3rd and our CPU usage by 1/2 in ORES without reducing capacity!
[15:28:45] Woooo!
[15:28:58] Performance refactor is performant!
[15:32:36] congrats!
[15:35:25] Thanks apergos
[17:48:09] halfak, good presentation. I agree with the story about human review of models, but I was also wondering about some technical solutions. It seems like the anon issue is caused by the model focusing on one signal and forsaking several other correlated signals, since they are largely represented by the one it focuses on.
[17:48:21] It seems like you should be able to automate the detection of these sorts of features using methods similar to what you were already doing. Perhaps building the model without the feature included and then comparing it to the original model with intentionally bad signals coming from that feature. A surprisingly large difference in results between the two models would be a red flag for excessive dependence on the field being fiddled with.
[19:06:28] kjschiroo, Good notes. That's actually the first thing that I did.
[19:06:51] I found that we lost a substantial amount of signal without is_anon, but not all of it.
[19:07:13] See table here: http://socio-technologist.blogspot.com/2015/12/disparate-impact-of-damage-detection-on.html
[19:07:59] To choose to do that, we need a mechanism to identify what the tradeoff is between classifier performance and the cost of unfairness.
[19:08:10] See also some work here: http://dl.acm.org/citation.cfm?id=2783311
[19:08:19] I haven't worked through the methods enough to really grok them yet
[19:08:37] But it seems to me that this is excessively complicated.
[19:20:08] I think I am talking about something slightly different, at least from that particular table. Instead of comparing the current version to one without the signal, I would be interested in comparing the current version + randomized feature of interest against a model that doesn't have the feature at all (which is what I was thinking you had done in your presentation). If we observed excessively bad drops from randomizing that feature compared to not using it at all, we would have an indicator that we are over-utilizing that specific feature, when we would probably prefer a model that makes its decisions based on a broader set of features that we know are available. I was thinking that this could be automated as a way of evaluating model quality.
[19:20:17] ^ halfak
[19:20:53] kjschiroo, maybe that feature is just very useful
[19:21:06] It seems like you are advocating removing the most useful feature
[19:21:28] Also, what if the bias manifests through an intersection of features?
[19:26:23] What if that feature is "adds a bunch of curse words"?
[19:28:19] I'm not advocating for removing powerful features. I am suggesting that I would prefer a model that does not entirely abandon good features for better ones. This appeared to be what was going on when you were comparing svc and gb. As I recall, gb maintained some predictive power even when you fed it bad signal for anons, whereas svc remained focused just on anons and was not utilizing other features. Am I remembering this correctly? I do acknowledge that the intersection of several features would still pose a problem.
[19:29:28] Ahh yes, but what if that feature really is that useful?
[19:29:56] It seems that if your model abandons good features for better ones, that's an issue with the modeling strategy, isn't it?
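A minimal sketch of the check kjschiroo proposes at 19:20:08, assuming scikit-learn and a generic feature matrix rather than the actual ORES/revscoring pipeline; the classifier choice, data arrays, and feature index are placeholders, not ORES code. It compares (a) the full model scored with the feature of interest shuffled, i.e. fed intentionally bad signal, against (b) a model retrained without that feature; a much larger drop in (a) than in (b) would flag over-reliance on that single feature.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def over_reliance_gap(X, y, feature_idx, seed=0):
    """Return AUC(no feature) - AUC(feature randomized); large positive
    values suggest the model depends too heavily on this one feature."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)

    # (a) Full model, but the feature is shuffled at evaluation time,
    # i.e. it carries intentionally bad signal.
    full = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    X_bad = X_test.copy()
    rng = np.random.default_rng(seed)
    X_bad[:, feature_idx] = rng.permutation(X_bad[:, feature_idx])
    auc_randomized = roc_auc_score(y_test, full.predict_proba(X_bad)[:, 1])

    # (b) Model retrained without the feature at all.
    keep = [i for i in range(X.shape[1]) if i != feature_idx]
    without = GradientBoostingClassifier(random_state=seed).fit(
        X_train[:, keep], y_train)
    auc_without = roc_auc_score(
        y_test, without.predict_proba(X_test[:, keep])[:, 1])

    return auc_without - auc_randomized
```

Under this reading of the svc-vs-gb comparison recalled above, gb would be expected to show a smaller gap than svc on the is_anon feature, since it retained some predictive power when fed bad anon signal.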
[19:30:16] As it would necessarily lose fitness.
[19:30:35] it should go to the gym more often
[19:30:37] * yuvipanda slinks away
[19:30:53] How much worse is gb than svc with all features included?
[19:30:57] Nice one, dad
[19:30:59] yuvipanda, ^
[19:31:18] yw son
[19:31:19] kjschiroo, they score comparably with ROC-AUC
[19:33:45] But then with this evaluation wouldn't we consider gb the better model, since it draws support from more features and is less likely to make bad predictions when there is substantial evidence from other signals?
[19:41:59] kjschiroo, oh I see. I thought you were looking to use this strategy to select features, but you were looking at this strategy to choose a model.
[19:44:00] halfak, yes. It could also be used to flag features for human evaluation to determine if it really is okay to rely heavily on it.
[19:44:51] In the context of larger social concerns, it is the sort of thing that would at least tell you to think about whether your machine learning algorithm is being racist.
[19:44:53] kjschiroo, personally, I'd rather look at how the output affects protected classes. This strategy seems secondary and beside the point in comparison. Is there something you think it adds beyond reviewing how the model treats different classes of people?
[19:45:22] E.g. we could compare false-positive rates for anons between the two models
[19:52:50] I think I'd agree that this is beside the point of how it affects protected classes. It would focus more on building resilient models, and this would have consequences for protected classes, but that wouldn't be the main focus of it.
[20:02:06] Maybe we could instead look for features that have a large effect on false-positive rates for protected classes.
[20:02:14] Rather than just looking at fitness loss.
[20:03:18] J-Mo & abbey___: I'll be skipping the start of documentation teatime today. I've got a headache coming on, so I'm going to stop staring at the screen for a bit and see how that goes.
[20:04:27] cool, halfak. hope you feel better!
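A similarly hedged sketch of the check halfak suggests at 19:45:22 and 20:02:06: rather than looking only at fitness loss, compare false-positive rates for a protected class (here, anon editors) across candidate models. The array names and the boolean is_anon mask are hypothetical and not part of ORES.

```python
import numpy as np

def false_positive_rate(y_true, y_pred, mask):
    """FPR restricted to the rows selected by `mask` (e.g. anon edits)."""
    negatives = (y_true == 0) & mask          # good-faith edits in the group
    false_positives = (y_pred == 1) & negatives
    return false_positives.sum() / max(negatives.sum(), 1)

def anon_fpr_comparison(y_true, y_pred_gb, y_pred_svc, is_anon):
    """How often each candidate model wrongly flags anon edits as damaging."""
    return {
        "gb": false_positive_rate(y_true, y_pred_gb, is_anon),
        "svc": false_positive_rate(y_true, y_pred_svc, is_anon),
    }
```

The same per-group FPR could be recomputed after randomizing or dropping a feature, which would combine the two ideas: identifying features whose influence shows up specifically as extra false positives for a protected class rather than as overall fitness loss.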