[14:05:41] o/ guillom
[14:05:43] around?
[14:07:49] (as an update on me learning about machine learning, I now feel I understand 'features' well enough to reason about them)
[14:08:34] yay!
[14:08:40] Predictive statistics :)
[14:10:26] I can also see how this directly has benefitted from the corporate surveillance state
[18:03:53] back
[18:04:04] halfak: Sorry, I was away earlier
[18:04:24] I'm at ICWSM this week.
[18:04:31] (with crappy wifi, when there's wifi at all.)
[20:34:15] halfak: btw, AWS now has an instance type with ~1.9TB of RAM. If you can think of interesting use cases for that, I'm curious
[20:34:50] * halfak wants a distributed computing framework that uses RAM flexibly.
[20:34:58] But having 20GB per mapper would work too.
[20:35:15] right, but not having to deal with network overheads can be liberating for some things as well :)
[20:35:26] this one has 128 vCPUs and 1.9TB of RAM and a 10G network
[20:40:18] YuviPanda, could be better than a size-able hadoop cluster at that size.
[20:40:40] yeah
[20:40:51] multiprocessing for all its warts is probably going to be faster maybe?
[20:40:59] or maybe not - maybe we'll be bottlenecked by IO
[20:41:01] someone will come up with a cutting edge use case related to serving pr0n I'm sure
[20:41:14] debbie does AWS
[20:44:08] I think I might be bottlenecked mostly on processing the last 50 dump files.
[20:44:16] (there's about 178 per dump)
[20:44:19] For english Wikipedia.
[20:44:33] One of those files is *HUGE* because it contains the Administrator's Noticeboard.
[20:45:27] ah that one
[20:45:42] I wonder if it'll be useful to produce a dump dataset that's all histories but only ns 0
[20:46:05] Na. Usually, we can just scan past that one giant page in ~5 minutes anyway
[20:46:19] The problem is when we need to actually process it :S
[20:46:37] ah
[20:46:39] I se
[20:46:41] ok
[21:24:07] hey halfak, is there any research out there that specifically measures the negative impact of talk page 'warning' messages on new editor retention? Or even research that quantifies the volume of such messages that newcomers receive?
[21:24:37] other than Steven and Maryana's "Rise of warnings" blog post from 2011
[21:26:12] J-Mo, http://files.grouplens.org/papers/defense-mechanism-icwsm.pdf
[21:27:52] saw that, halfak (and am already citing it), but this particular finding didn't leap out at me. Figure 1, maybe?
[21:28:03] Yeah. Figure 1
[21:28:11] Regression analysis was mostly inconclusive.
[21:28:26] got it. Thanks
[23:34:50] milimetric: I'm getting some really confusing data out of Wikimetrics. Are you still the person to poke about that?
[23:36:28] For example: https://metrics.wmflabs.org/static/public/5224520.json
[23:36:57] This lists a sum of 80k bytes added by User:SiSreach
[23:37:18] But their contrib history is more like 12k: https://en.wikipedia.org/w/index.php?limit=500&title=Special%3AContributions&contribs=user&target=SiSreach&namespace=0&tagfilter=&year=2016&month=-1
[23:38:08] er, even less since 2016, as the report is scoped.
[23:38:38] The user also doesn't seem to have any deleted revisions or anything else that I can see that might account for that error.
[23:39:57] Possibly related: https://phabricator.wikimedia.org/T133715
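
(Note on the bytes-added discrepancy above: one way to sanity-check a reported total by hand is to recompute it from per-revision sizes. This is a minimal sketch, not how Wikimetrics computes the metric. It assumes "bytes added" means the sum of positive size changes between each revision and its parent; the function name `bytes_added` and the sample sizes are hypothetical and purely illustrative.)

```python
from typing import Iterable, Tuple


def bytes_added(revisions: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (positive_sum, net_sum) of size changes.

    Each item is a hypothetical (revision_size, parent_size) pair in bytes.
    A page-creating revision would use a parent_size of 0.
    """
    positive = 0
    net = 0
    for rev_size, parent_size in revisions:
        delta = rev_size - parent_size
        net += delta
        if delta > 0:
            # Only growth counts toward the "bytes added" total (assumed definition).
            positive += delta
    return positive, net


# Made-up sizes, for illustration only: two additions and one removal.
sample = [(100, 0), (150, 100), (120, 150)]
print(bytes_added(sample))  # (150, 120)
```

(If the recomputed positive sum and the net sum both come out far below the reported figure, the gap is more likely in the report than in the choice of definition.)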