[00:36:53] * halfak has submitted 4 pull requests today.
[00:37:03] 3 of them are coordinated.
[00:37:23] So I suppose they should count as one.
[00:37:25] But they are BIG!
[00:40:45] Waits and watches his queries run.
[00:40:57] If you're on analytics-store and waiting for something, I'm sorry. :(
[00:41:44] halfak: what on?
[00:41:49] (re: queries)
[00:42:11] I've got an urgent request (yup) to update some monthly active editor stats.
[00:42:34] Right now, I'm joining the "are you active" part with the "did you register your account for the first time" part.
[00:42:50] Next comes the "how many of you showed up this month" part
[00:42:57] For all the projects.
[00:43:06] Which is kind of a fun dataset to be building.
[00:43:18] I really should just schedule a job to keep it curated.
[00:43:20] cool!
[00:43:31] I've only written 3 lines of code in the last week and a half
[00:43:43] but my head is exploding with ideas, and the R Foundation has asked me to build them a geolocation service
[00:43:46] (which I'm writing in Python)
[00:43:56] In Python for R?
[00:44:09] I could write a web service in R, and have
[00:44:12] Or in Python for Python because we have converted you to the light side?
[00:44:36] but the vectorised nature makes "1 million bits of info for 1 person" easy and "1 bit of info for 100000 people" hard
[00:44:41] and it's the latter we need to have
[00:45:12] in Python, for the R Foundation; basically when you install.packages("foo"), it'll first ping off your IP and package name to a centralised server which will attempt to geolocate you and pipe back the closest download mirror
[00:45:40] that way you don't have to go through that 100-option menu every bloody time (you still might if it can't identify your nearest mirror, of course)
[00:45:43] (which never works and I always have to choose a mirror from a huge list)
[00:45:55] yeah, it doesn't exist right now, and that's why!
[00:46:09] building that centralised hub has been stuck on my plate by a couple of board members
[00:46:09] Here's my location! Recognize me and stop asking me questions!
[00:46:12] it's gonna be fun on a bun!
[00:46:15] :D
[00:46:34] so yeah, doing it in Python, probably just as a simple Flask app hooked up to an nginx server to provide more robustness than Flask's native server
[00:46:56] simulate the workload, memoise it if the answer is "too much", and then we're done!
[00:47:11] They better be funding these digressions.... wait.. what's the opposite of a digression? Re-alignments? I'm going to go with 'alignments'.
[00:47:37] Ironholds, indeed, and you can probably host it in Tool Labs and get amazing uptime.
[00:47:44] But we'll need YuviPanda|zzzz
[00:47:51] TO make sure we don't touch NFS.
[00:48:04] Because it turns out that labs is AMAZING except for NFS uptime
[00:48:20] and EVERYTHING touches NFS
[00:48:27] halfak: well, I imagine they'll host it themselves
[00:48:36] Hence why the YuviStress level is a bit high recently.
[00:48:40] this is a volunteering thing to do
[00:48:48] Ironholds, fair enough. But the first thing could be on WMFlabs.
[00:48:52] * Ironholds hugs YuviPanda|zzzz
[00:49:01] Indeed. It doesn't sound that hard either.
[00:49:02] halfak: makes sense! We've already got the geo binaries synced there, too
[00:49:09] Cool!
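A minimal sketch of the mirror-redirect service described above, assuming the Flask and geoip2 packages; the endpoint name, mirror table, and database path are hypothetical:

```python
# Hypothetical sketch only: the endpoint name, mirror table, and database
# path are made up, not the R Foundation's actual design.
import geoip2.database
import geoip2.errors
from flask import Flask, jsonify, request

app = Flask(__name__)
reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # assumed local DB

# Country ISO code -> nearest mirror (illustrative entries only).
MIRRORS = {
    "US": "https://cran.example.org/us/",
    "GB": "https://cran.example.org/gb/",
}
DEFAULT_MIRROR = "https://cran.r-project.org/"  # fall back to the main site

@app.route("/mirror")
def mirror():
    package = request.args.get("package", "")
    try:
        country = reader.country(request.remote_addr).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        country = None  # can't geolocate; the client falls back to the menu
    # Log only package name and country, never the IP (see below).
    return jsonify({"package": package,
                    "mirror": MIRRORS.get(country, DEFAULT_MIRROR)})
```

In production this would sit behind nginx rather than Flask's development server, as Ironholds notes.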
[00:49:14] yeah, it's trivial; just need to run a TOOON of testing
[00:49:30] like: maybe we use closest lat/long, but maybe lat/long is harder to come by than, say, country
[00:49:39] and so we use country as a fallback
[00:49:43] many possibilities
[00:49:59] Do you need to do the anonymity dance with this?
[00:50:03] but simple Python Flask app with all the config and permutations stored in tuples
[00:50:14] shouldn't need to - the logs will just automatically be voided and not contain IPs
[00:50:15] It seems like, with this service, I'm fine with the R Foundation pinpointing my coords.
[00:50:36] what I'm probably going to do is set it up to, from the get-go, log the package and the country of the person downloading it
[00:50:38] and nothing else
[00:50:54] this isn't a feature request, but it's a thing I'm advocating it having, because R doesn't currently have download logs for measuring package popularity
[00:51:14] (we sort of do, but it's off /one/ CRAN mirror and it doesn't distinguish "I asked to download scales" from "I asked to download ggplot2 and it downloaded scales")
[00:53:04] Makes sense
[00:57:26] So, the big PRs I was talking about are for a distributed processing framework for ORES
[00:58:04] You can now have a very computationally intensive feature without worrying about backing up the liveness of the system
[00:58:26] And when it still gets overloaded (because with more models, it will), you can just add more hardware.
[00:58:44] The system is horizontally scalable for both cache and CPU.
[00:59:34] Once I deploy this, looking up scores for all of the revisions in a wiki's RC feed is going to take about 1 second.
[00:59:45] Because it will almost entirely be in the cache.
[01:00:44] If you want to generate scores historically, we're going to have an elastic compute cluster capable of helping speed up the processing.
[01:03:41] DarTar, FYI: ^
[01:03:54] So, when we talk about 'productization', that's what I mean.
[01:04:41] I know someone who'll be excited about it :)
[01:04:42] halfak: that's incredible
[01:05:17] This is the fruit of all of the time I have been spending working with YuviPanda|zzzz, milimetric and joal.
[01:09:43] good timing, I just sent out a celebratory email ;)
[01:12:41] What are we celebrating?
[01:13:24] That was really nice DarTar, thanks. :)
[01:13:36] I love how you pulled in my biology metaphors :D
[01:14:59] halfak: that was the most compelling moment of the year ;)
[01:15:17] HareJ: a bunch of good news which will be announced publicly tomorrow
[01:15:38] Does it have to do with the Spike '15?
[02:01:49] congrats on your ORES fanciness halfak
[02:03:06] Woot! Thanks milimetric
[02:05:20] Celery is working out pretty well. It helps that I'm doing something relatively simple with it.
[02:05:33] No DB connection. All the workers just need to manage a little bit of local memory.
[02:11:16] OK. Last pull request in. Have a good night folks!
[02:16:58] HareJ: for active editor trends, keep an eye on this page https://meta.wikimedia.org/wiki/Research:Active_editor_spike_2015_(July_update)
[17:18:22] halfak, leila: we should prepare 1-2 slides for Lila + the board for revscoring and article recs, I'll get something started but I'll need your input
[17:18:40] glad to see strong support for both projects
[17:19:09] DarTar: I need two slides, one for data extraction work, one for article recommendation. I can make them. When is the deadline?
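A minimal sketch of the Celery pattern halfak describes above, assuming a Redis broker and result backend; the task body is a stand-in, not ORES's actual feature extraction or scoring code:

```python
# Hypothetical sketch of the distributed-scoring pattern described above.
# Broker/backend URLs and the task body are illustrative assumptions.
from celery import Celery

app = Celery("scorer",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task
def score_revision(rev_id):
    # Stand-in for computationally intensive feature extraction + scoring.
    # Workers keep only a little local memory and hold no DB connection,
    # so capacity scales horizontally by adding worker hosts.
    return {"rev_id": rev_id, "damaging": 0.03}  # dummy score

# Client side: enqueue work without blocking the live request path,
# then collect results as they complete.
if __name__ == "__main__":
    pending = [score_revision.delay(r) for r in [123, 456, 789]]
    print([p.get(timeout=10) for p in pending])
```

A worker would be started with something like `celery -A scorer worker`; adding more worker hosts is the "just add more hardware" step.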
[17:20:31] I don't know when the board meeting is
[17:20:34] I'll ask
[17:20:46] you'll get one slide but you can use small fonts
[17:20:47] :p
[17:22:44] heh
[17:23:38] General +1 on communicating to the board. If I have to do an impact study, that's going to take quite a lot of time.
[17:23:53] If we're just saying what revscoring *is* and what you do with it, sure.
[17:24:03] you'll have to justify the time spent on doing the impact study first, halfak
[17:24:17] Impact analysis of my work from January to July 2015 will probably take 2+ years
[17:27:00] YuviPanda & hare +1 These things take time. Is it the right time to do an impact study? Well, would the time spent doing the study be more worthwhile than time spent doing further development and expansion?
[17:29:59] The thing I am worried about is answering important questions.
[17:30:21] Is this worth our time? Well, we better have some good signal suggesting it is.
[17:30:44] E.g. people talking to us & developing tools with ORES support.
[17:35:04] I have the same issue.
[17:35:15] For you it's worth it because I have incorporated your product into my product
[17:35:18] Product product product
[17:35:29] The question becomes, is *my* product worth it.
[17:47:08] I think it will be but I can't prove it quite yet. "The best is yet to come" as they say.
[18:45:48] halfak: btw, there seems to be interest in having access to SuggestBot's task predictions as a tool, so I'll try to schedule some time to get that into WikiClass so it can go into ORES
[18:46:06] in other words: more interest in tools built on top of ORES
[18:46:20] (can send you a link to talk page discussions if needed)
[18:51:21] Nettrom :D
[18:53:13] hare: :) WikiProject support comes first, though, gotta get that wrapped up soon
[18:53:26] Do you want my output formatting? :D
[18:53:49] hare: I saw WP Ghana's format, is that what you'd like it to be?
[18:54:01] Yes. (They're all the same)
[18:54:24] hare: I'm thinking I could do some Lua scripting to do that, SuggestBot needs to go the Lua route anyways
[18:54:25] You can cohere the way things are laid out, right? The presentation is a little dense. I think I was a Perl programmer in a past life.
[18:54:29] Ooooh
[18:54:50] with some Lua, all SuggestBot would have to do is replace a template invocation
[18:54:53] In any case, I believe in separating content from presentation.
[18:55:00] hare: exactly!
[18:55:04] Content should just be {{template|x|y|z}}
[18:55:09] the appearance being governed by {{template}}
[18:55:47] And I use a baffling mix of s and s to make everything different depending on whether you're looking at the main portal or the list itself.
[18:56:20] I have to say, taking your table output and converting it to WPX UI is extremely tedious. It takes about an hour.
[18:56:45] hare: yeah, that's why I was thinking about going the template route
[18:57:11] allows for on-wiki editing of presentation, and simpler methods for updating with new suggestions
[18:58:21] (although with mwparserfromhell, deleting the table is easy, haha)
[18:59:05] Rather than update the table, wouldn't it make sense to just generate it de novo each time?
[18:59:56] hare: updating it is technically "delete and insert a subst-call"
[19:00:46] but long-term I've been planning to have SuggestBot's suggestions also be a template transclusion, so there's less wikitext on people's talk pages
[19:01:12] with some Lua to handle images and text
[19:01:54] Lua is amazing. I have a module that renders entire WikiProjects through it.
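As an aside on the template route discussed above, a minimal sketch of the "replace a template invocation" update using mwparserfromhell; the {{Suggestions}} template name, parameters, and helper are hypothetical, and the on-wiki rendering would be left to the template (or its Lua module):

```python
# Hypothetical sketch: refresh a talk-page suggestion list by rewriting
# the parameters of a template invocation, keeping presentation in the
# template itself. "Suggestions" is a made-up template name.
import mwparserfromhell

def update_suggestions(page_text, new_titles):
    code = mwparserfromhell.parse(page_text)
    for template in code.filter_templates():
        if template.name.matches("Suggestions"):
            # Set positional parameters to the new suggestions; existing
            # parameters with the same index are overwritten.
            for i, title in enumerate(new_titles, start=1):
                template.add(str(i), title)
            break
    return str(code)

print(update_suggestions("{{Suggestions|Old article}}", ["Bee", "Honey"]))
# -> {{Suggestions|Bee|Honey}}
```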
[19:02:07] o/ lzia
[19:02:13] Did the dev/ops checkpoint happen?
[19:02:37] halfak: no. joal couldn't make it and Andrew is OoO
[19:03:04] Gotcha. I don't feel so bad for missing it then.
[19:03:07] Thanks lzia
[19:03:24] I'm happy the Mexico trip is coming up. We can talk dev/ops for three days. :-)
[19:03:32] np, halfak.
[19:19:45] hruh
[19:19:51] halfak, do I want to submit to an Elsevier publication? :/
[19:20:15] NOoooO! Maybe
[19:20:43] What publication?
[19:20:57] SoftwareX - it's a journal they've just kicked off to provide credit for scientific software
[19:21:00] actually fairly interesting, imo
[19:21:04] just run by elsevier
[19:21:18] That sounds pretty cool.
[19:21:25] I have an R Q if you can spare the cycles. I might have asked this one before.
[19:21:40] what is it? I was just gonna submit reconstructr
[19:21:51] I want to do a group-by operation, but I don't want to actually group up the values
[19:22:24] so, say I have this: data.table(cat=c("foo", "foo", "bar", "bar"), val=c(1,2,3,4))
[19:23:18] I want to perform an operation that will allow me to arrive at: data.table(cat=c("foo", "foo", "bar", "bar"), val=c(1,2,3,4), sum=c(3, 3, 7, 7))
[19:23:40] So, I'm performing the sum() operation on the whole vector, but returning a new vector rather than just a single value.
[19:23:48] The method I really want to use is called "scale"
[19:23:58] It will z-score normalize a vector of numerics.
[19:25:22] huh
[19:25:45] scale(c(1,2,3,4,5)) == c(-1.2649111, -0.6324555, 0.0000000, 0.6324555, 1.2649111)
[19:26:05] halfak, so you want just: for each cat, put z-score normalised elements in a new vector?
[19:26:11] Yes
[19:26:54] data.table[,j={temp <- copy(.SD); temp$z_score <- scale(temp$value); temp}, by = "variable"]
[19:27:00] ugly because it's only on one line, hence the semicolons
[19:27:15] but basically you can manipulate .SD, the group itself, within data.table expressions
[19:27:28] so you can return more than manually specified vectors. You can return the original data.table with a new column, say.
[19:27:44] (you have to do it indirectly; .SD is read-only, hence the copy() operation)
[19:28:18] alternately, if it's literally only two columns, an easier way of doing it would be datatable[,j=list(value=value, z_score=scale(value)), by = "variable"]
[19:28:36] expressions messing around with .SD are for scenarios where you have a ton of columns and manually specifying them is fool's business
[19:28:49] What does the "j=" mean?
[19:28:56] Is it different from the default second arg?
[19:29:23] is the default second arg "specify what you want to do to each subset"? If not, no
[19:29:29] I just like manually specifying it for some reason
[19:29:43] Will get you a paste of that not working
[19:29:48] relying on default parameter order has always felt like a good way of confusing people not familiar with a particular function who are reading my code
[19:29:49] cool!
[19:29:55] https://gist.github.com/halfak/d87c7b3280a7bf629c06
[19:30:49] Oh! Maybe it's the fact that scale returns a weird list thing.
[19:31:00] ohh I see the problem
[19:31:01] yep, that
[19:31:15] it looks like a one-dimensional matrix
[19:31:31] replace that call with as.numeric(scale(val))
[19:31:33] * halfak punches himself in the face
[19:31:35] wait, no
[19:31:41] it's a vector with...extraneous attributes?
[19:31:42] c(scale(val))
[19:31:45] goddammit I hate the R devs
[19:31:48] lol
[19:31:52] they get clever with custom object types far too much
[19:32:00] NOT EVERYTHING NEEDS TO BE A POORLY SUPPORTED TYPE
[19:32:09] PARTICULARLY WHEN YOU FUCKERS CAN'T WRAP YOUR HEADS AROUND DEQUES. FIX THAT FIRST.
[19:32:58] You guys have deques?
[19:33:03] I <3 deques
[19:33:22] SO beautiful. Thank you Ironholds
[19:33:31] halfak, no problem!
[19:33:44] and actually yeah, my friend Drew - him of the supercomputers - implemented deques and queues
[19:33:54] https://github.com/wrathematics/dequer
[19:34:04] written in native C and seamlessly integrates with R types
[19:34:23] BTW, scale is amazing for comparing values that otherwise occur at different scales.
[19:34:49] scale = function(x){(x-mean(x))/sd(x)}
[19:35:08] So it centers the values at zero (x-mean(x))
[19:35:21] And then controls for the standard deviation (/sd(x))
[19:40:04] So, the "active editor spike"... it looks like it is due to a massive increase in the number of newly registered users.
[19:40:14] Or at least that's where it originated.
[19:42:10] But the shift didn't start when we think it did. It started in July 2014.
[19:49:36] Interesting.
[20:01:32] * halfak runs off to beat Nettrom at squash :P
[20:05:17] cool!
[21:12:25] The revision table is such a mess.
[21:13:09] I recommend not using it where possible.
[21:13:18] At first, you think you can go through the revisions of a page by rev_id, because they seem incremental. But imports screw those up.
[21:13:42] So then you turn to parent_id, but the same problem happens.
[21:14:02] I see on some pages they just arbitrarily import old revisions.
[21:14:11] So it'll jump from 2013 to, like, 2002.
[21:14:13] And then back to 2013.
[21:14:15] And so on.
[21:14:15] So then you turn to timestamps, but then you can't meaningfully compare revisions of articles that were history-merged.
[21:14:29] I CAN'T COPE.
[21:14:33] Is there a bug to fix the ordering of those revisions?
[21:15:03] There are a few bugs at https://en.wikipedia.org/wiki/User:Graham87/Import but I don't think anyone's going to fix them.
[21:15:31] So basically, even just identifying reverts is a huge pain.
[21:15:56] Because you can't trust the revision table to tell you which edit happened right before and right after a given edit.
[21:17:32] Unless…
[21:18:07] I guess I could try to detect if the log of the page shows imports and/or deletions (for history merges).
[21:19:37] But this makes me sad.
[21:20:35] And of course the page I'm interested in has imported edits.
[21:30:26] guillom what exactly are you trying to do?
[21:31:15] why not sort by edit date?
[21:31:26] ToAruShiroiNeko: I've got a list of revisions for a given page (taken from the revision table) and I want to process that list to add new metadata to the revisions, like whether the revision was a revert, or whether it was reverted, etc.
[21:31:53] ToAruShiroiNeko: Because if histories were merged, then the edit date mixes two different histories.
[21:32:23] sure, but a revert normally comes immediately after the edit
[21:32:31] also tools do mark reverts
[21:32:32] guillom, my strategy is to sort by timestamp, rev_id and let the history merge break things the way that it has to.
[21:33:39] guillom take a look at http://pastebin.com/QYfQRu2e
[21:33:59] (c) halfak inc.
[21:34:12] Rrrrrrr
[21:34:46] halfak: Yeah, that's where I was slowly going. But it's unsatisfying!
[21:35:14] ToAruShiroiNeko: I'm not sure what that is supposed to do?
[21:35:26] I'm with you. It would be great to have a universal ordering we could all agree on.
[21:35:42] guillom that's the last 20k reverts from pt wiki
[21:35:54] you can remove said limit and restrict it to an article id
[21:35:58] What's worse is that the order that appears in the XML dump is neither consistently timestamp-ordered nor rev_id-ordered.
[21:36:47] how can it be consistent though
[21:37:00] people import revisions from a decade ago at times
[21:37:10] ToAruShiroiNeko: Hmm. But the RC table probably doesn't go back to 2001.
[21:37:21] So fixing the order myself a posteriori is probably best.
[21:37:25] guillom why would you want to go back to 2001?
[21:37:41] I left 2001 behind some 14 years ago
[21:37:44] ... WOW
[21:37:52] ToAruShiroiNeko: Because https://wikimania2015.wikimedia.org/wiki/Submissions/Let%27s_talk_about_bees:_What_a_single_Wikipedia_article_can_tell_us_about_Wikipedians
[21:38:04] guillom you will put me in a mid-life crisis at my tender age :p
[21:38:46] ToAruShiroiNeko: Good! Better to have that crisis as early as possible, so you can enjoy what comes after it :)
[21:39:04] I am more likely to assume a fetal position and stay like that
[21:39:18] which would make my presentation at Wikimania awkward
[21:39:38] We have that proverb in French: "Si jeunesse savait, si vieillesse pouvait", which roughly translates to "If the young knew, and if the elderly could."
[21:39:55] guillom the solution is cybernetics.
[21:40:08] Everyone lives forever!
[21:40:13] Old people in mechs, what could possibly go wrong?
[21:41:32] Also, a great example of Wikipedian idiocy: https://en.wikipedia.org/wiki/User_talk:Graham87/Import#Please_explain_what_authority_do_you_have_to_do_this.3F
[21:42:17] "I don't understand what you're doing so I don't like it and who even put you in charge"
[21:42:26] (paraphrased.)
[21:42:30] guillom ya
[21:42:38] so what is the actual complaint?
[21:43:45] The guy complained that Graham imported edits via the import feature instead of making edits the "conventional way". Except you can't make "old" edits the "conventional way".
[21:43:52] Anyway.
[21:44:11] guillom ah, the classic absence of a DeLorean :3
[21:44:23] to be honest, import makes the server kitties cry
[23:28:23] halfak, congrats :)
[23:28:58] Thanks, dude.
[23:30:15] I get to be part of the promotionpocalypse
[23:31:39] congrats halfak
[23:31:45] Thanks subbu :)
[23:32:19] and ewulczyn too
[23:34:03] congrats!
[23:34:07] new fancy titles too?
[23:34:35] Yup. I am now Señor Research Scientist just in time for Mexico
[23:35:38] halfak: cool, congrats Sir!
[23:35:59] Did they just decide to upgrade a bunch of people's titles?
[23:39:01] now that I have code that appears to work, I can go have something that appears to be beer, seeyas!
[23:40:10] hare, budget
[23:41:16] what about it
[23:42:43] Ironholds, "they" decided we were going to do it at a specific time of the year this year.
[23:42:48] It's right after yearly reviews.
[23:42:59] aha
[23:43:05] I got, uh. A rais
[23:43:07] e?
[23:43:15] Woops. Was supposed to ping hare with that one
[23:43:31] Ah.
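For reference, a minimal sketch of the sort-by-timestamp-then-rev_id strategy halfak describes above, with identity reverts detected by matching content checksums within a small window; the field names, window size, and the checksum rule are illustrative assumptions, not the schema or algorithm of any particular tool:

```python
# Hypothetical sketch of the ordering strategy described above. Revisions
# are dicts with made-up field names; history merges will still "break
# things the way that they have to".
def detect_identity_reverts(revisions, radius=15):
    """Yield (reverting, reverted_to) pairs of revision dicts."""
    ordered = sorted(revisions, key=lambda r: (r["timestamp"], r["rev_id"]))
    recent = []  # sliding window of the last `radius` revisions
    for rev in ordered:
        for past in recent:
            if past["sha1"] == rev["sha1"]:
                # rev restored past's exact content: an identity revert
                yield rev, past
                break
        recent.append(rev)
        recent = recent[-radius:]

revs = [
    {"rev_id": 1, "timestamp": "2015-07-01T00:00:00Z", "sha1": "aaa"},
    {"rev_id": 2, "timestamp": "2015-07-01T00:05:00Z", "sha1": "bbb"},
    {"rev_id": 3, "timestamp": "2015-07-01T00:06:00Z", "sha1": "aaa"},
]
for reverting, reverted_to in detect_identity_reverts(revs):
    print(reverting["rev_id"], "reverts back to", reverted_to["rev_id"])
```

ISO 8601 timestamps sort correctly as strings, and rev_id breaks ties, so this gives a stable ordering even when imports interleave old and new revisions, though history-merged pages remain fundamentally ambiguous.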