[06:57:20] YuviPanda, feel like handling a fun problem? [06:57:30] heh, already am atm :D [06:57:35] aww [06:57:38] (betalabs puppetmaster is dead, am sleuthing) [06:57:41] to see why it happened [06:58:06] I'm trying to find the least algorithmically complex solution to a python problem [06:58:13] I know how I'd do this in R! [06:58:20] ah [06:58:27] write me an email / poke me in another 2-3h? [06:58:36] sure! [06:59:17] sweet [07:06:11] YuviPanda, you have a CS question in yer inbox [07:06:46] \o/ sweet [07:17:05] oh, YuviPanda: that datastore you were suggesting I mess around with waaass... [07:17:18] cassandra? [07:17:29] for some definition of 'mess around' of course [07:17:34] RESTBase is very read heavy as well [07:18:59] * Ironholds nods [07:19:02] RESTBase would make sense [07:19:35] although it appears to be implemented in JavaScript and exclusively maintained by Gabriel [07:20:08] Ironholds: haha [07:20:36] Ironholds: I meant in the case that RESTbase uses cassandra, restbase is very read heavy, page views will be read heavy, and so if restbase is a good fit for that, and so if cassandra maybe a good fit for that... [07:20:49] that makes more sense ;p [07:21:08] YuviPanda, actually, you know what? :P [07:21:30] I could just use hadoop, partitioning on a hash of {project,page} [07:22:27] might be overkill, though [07:22:30] * Ironholds digs into Cassandra [07:23:10] idk if hadoop is good for anything interactive (vs batched) [07:23:17] I think stats.grok.se actually uses cassandra [07:23:20] so it's evidently sustainable [07:24:41] really? [07:24:48] yerp [07:25:00] you know the stats.grok.se source code is freely released, right? [07:25:07] like, if we wanted to have an us-maintained instance...we could.
[07:25:25] indeed [07:25:26] oh, wait, I lie [07:25:29] it's an outdated version [07:25:31] of COURSE it is [08:08:47] YuviPanda, ignore the email, my friend rohit and I worked out how to get it down to O(n) [08:09:01] Ironholds: \o/ do let me know, btw [08:09:18] yay! [08:25:38] YuviPanda, test completed! [08:25:42] want the details? [08:26:20] Ironholds: sue [08:26:21] err [08:26:22] tsuer [08:26:24] gah [08:26:24] sure [08:26:25] ... [08:26:27] heh [08:26:34] okay, so if you give me a 500mb dump file? [08:26:48] try 350mb, and that's with actual TSV formatting and full project names and URLs [08:26:57] * Ironholds thumbs up [08:27:16] Ironholds: nice! [08:27:20] so that’s a 30% reduction [08:27:28] is that compressed? [08:27:28] or? [08:28:27] YuviPanda, uncompressed [08:28:41] nice [08:28:43] and the advantage is, this is also the format hive will spit out when I set up the relevant oozie job :D [08:28:47] aaaaah [08:28:48] nice nice [08:28:57] YuviPanda, I can actually reduce it more, I think, because the {project,url} is getting md5 hashed [08:29:06] (it's appropriate for use as a key in a partitioned store) [08:29:57] ahoy qchris :) [08:30:17] Hi Ironholds. Just read your email. [08:30:35] cool! [08:30:47] YuviPanda, okay, I lied, it's not detecting mobile. bah. [08:31:17] still, almost there [08:32:18] ro.wikipedia.org Mihai_Stelescu"_\t_"_blank 1 0 [08:32:24] single anonymous reader I will FIND YOU. [08:32:44] Boooooo! :-( [08:32:50] Napster\t_blank? you too! [08:32:53] HAHAHAAHA [08:33:08] a more sensible human being would just remove \t and ASCII nuls immediately after the URL decoding step [08:33:13] but I am not a sensible person, so I will grouse instead [08:35:01] ...oh wait [08:35:07] YuviPanda, worked out where the mobile bug was coming from [08:36:56] can we actually have quotes in article titles? [08:37:18] I won’t be surprised [08:37:21] quiddity: probably knows [08:37:32] quiddity? [08:37:33] * quiddity denies everything. 
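A minimal sketch of the two ideas discussed above: stripping tabs and ASCII NULs right after URL decoding (so titles like `Mihai_Stelescu"_\t_"_blank` can't break TSV output), and md5-hashing {project, page} for use as a key in a partitioned store. Function names here are illustrative, not taken from the actual `import_dataset.py`.

```python
# Hedged sketch: sanitise a URL-encoded page title, then derive a
# partition key. Names and exact key format are assumptions.
import hashlib
from urllib.parse import unquote


def clean_title(raw_title: str) -> str:
    """URL-decode a page title, then strip tab and NUL characters."""
    decoded = unquote(raw_title)
    return decoded.replace("\t", "").replace("\x00", "")


def partition_key(project: str, page: str) -> str:
    """md5 of project/page, suitable as a key in a partitioned store."""
    return hashlib.md5(f"{project}/{page}".encode("utf-8")).hexdigest()


# The %09 (tab) smuggled into this title disappears after decoding:
print(clean_title("Mihai_Stelescu%22_%09_%22_blank"))
print(partition_key("ro.wikipedia.org", "Mihai_Stelescu"))
```

Doing the strip immediately after decoding, as suggested in the log, means downstream TSV writers never see embedded separators.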
[08:37:45] * quiddity reads scrollback. [08:38:27] oh. Hmm. [08:38:41] wait [08:38:49] why don't I just search for quote marks in the page_title field [08:38:50] ;p [08:39:14] https://en.wikipedia.org/wiki/%22Crocodile%22_Dundee [08:39:42] https://en.wikipedia.org/wiki/%22Heroes%22 [08:40:00] (used as examples in https://en.wikipedia.org/wiki/Wikipedia:Article_titles#Article_title_format ) [08:40:20] ok. I sleep nao. [08:42:45] ta! [08:46:57] YuviPanda, https://github.com/Ironholds/thaumiel/blob/master/import_dataset.py#L102-L121 is the simplified solution [08:46:59] I sleep now too [08:47:00] * Ironholds waves [08:47:06] Ironholds: sleep well [13:30:08] o/ FaFlo1 [13:30:33] I just saw your wiki-research-l email bounce. Can you try again without the attachment? [13:57:43] o/ FaFlo [14:07:00] * halfak grumbles at FaFlo and his lack of ping response [14:07:02] :P [14:20:44] * Emufarmers hugs halfak. [14:20:56] halfak, sorry, yes [14:21:08] aha, I summoned him [14:21:10] o/ Emufarmers. :) [14:21:13] didn't think about that attachment… [14:21:15] Thanks Emufarmers ;) [14:21:28] FaFlo, normally OK, but we have a big mailing list these days :D [14:21:30] the hugs made me listen up [14:21:38] ok [14:21:42] let me resend it [14:22:58] Thanks. Also, really cool stuff. [14:23:03] done [14:23:05] and thanks [14:23:07] :) [14:23:09] I pinged right away because I wanted to make sure others saw :) [14:23:22] thx [14:23:49] btw, while writing that email, why did pine put that IEG on hold/withdrew it= [14:23:53] ? [14:24:12] just saw it when I wrote the email [14:24:24] Is this re. the editor interaction stuff? [14:24:31] ah sorry, yes [14:25:01] It was Siko's suggestion given that Pine had a lot of other commitments and a lot of skills to pick up. [14:25:20] ok, makes sense [14:25:57] I will update the page anyway with the stuff from the email, just in case this is getting picked up later on [14:25:57] Yeah. I'm totally down for trying to pick it up again though.
[14:26:24] You've done a ton of work already :) [14:27:15] yeah, the interaction dataset could be done already with the code we have, I just have no time/manpower right now [14:27:52] +1 I know what you are saying. [14:28:21] We can always plan a bit farther out or get some funding for an undergrad/grad to be able to work with you. [14:28:49] what are you currently busy with? this revscoring? [14:29:09] Revscoring and the WikiCredit project. [14:29:39] The problem I am solving is very similar to WikiWho, but I need to be able to do it at scale and FAST. [14:30:15] BTW, I have been hard at work making the whole wikiwho-like segmentation faster. [14:30:23] ah ok [14:30:29] The deltas library is alive and well. [14:30:42] so wikiwho was not fast enough for your purposes? [14:30:46] I still haven't implemented a version that tracks segments historically though. [14:31:01] Oh! yeah. Tokenization and Segmentation are surprisingly slow. [14:31:12] An LCS diff tends to be a bit faster than segmentation. [14:31:25] But of course the diff after segmentation is ridiculously fast. [14:31:37] i see [14:32:00] fair enough, the tokenization was just done very haphazardly [14:32:33] Oh! I figured out a way to save a ton on memory usage last night too. [14:32:45] I still have to get the last bits of that deployed. [14:33:03] I re-use tokens with exactly matching strings. [14:33:12] Dropped the memory footprint by an order of magnitude. [14:33:14] +q [14:33:15] :) [14:33:17] *+q [14:33:21] sorry.. +1 [14:33:23] heh [14:33:37] is that code available already? [14:33:40] Yup. [14:33:53] See https://github.com/halfak/Deltas [14:33:57] thanks [14:34:28] I'll have a branch with the token duplication strategy as soon as I can work out a good enough implementation of a "trie tree". [14:34:32] and authorship attribution of words is something you want to do on top of that, right?
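The token re-use trick halfak describes ("I re-use tokens with exactly matching strings") is essentially interning: keep one shared object per distinct token text, so a word that appears in thousands of revisions is stored once instead of thousands of times. This is an illustration of the technique, not the actual Deltas implementation.

```python
# Hedged sketch of token interning: class and method names are
# assumptions, not Deltas API.
class TokenCache:
    """Return one shared object per distinct token string."""

    def __init__(self):
        self._cache = {}

    def intern(self, token_str):
        # First sighting stores the object; later equal strings reuse it.
        return self._cache.setdefault(token_str, token_str)


cache = TokenCache()
# Two equal-but-distinct string objects, as a tokenizer would produce:
a = cache.intern("".join(["tok", "en"]))
b = cache.intern("".join(["to", "ken"]))
assert a is b  # one object now serves every occurrence of "token"
```

Because natural-language token streams are dominated by a small vocabulary of repeated words, replacing per-occurrence copies with shared references is how an order-of-magnitude memory drop becomes plausible.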
[14:34:49] ok [14:35:01] Right now, the one I am experimenting with is a C module and it is not compiling nicely on a few of the machines I need to run this on. [14:35:26] FaFlo, to an extent. I think I might skip authorship to go right for value measurements. [14:35:35] Since it takes a bunch of disk to store authorship [14:35:36] ok [14:35:41] yes [14:35:56] BUT, I'd like to work out how I can help generate data for WikiWho. [14:36:11] yes, I was thinking that [14:36:26] how to use the more efficient methods [14:36:28] in wikiwho [14:36:48] but do you think the two things are actually complementary? [14:37:02] The biggest problem I see is matching segments historically. [14:37:09] This won't be that hard to implement on top of deltas. [14:38:30] eww.. readme example is out of date. [14:38:36] * halfak fixes that right now [14:40:13] {{done}} [14:42:40] and you're going to offer this as service/API right? [14:42:48] or do you already have one? [14:43:54] FaFlo, regretfully, I'm still fighting with map reduce to fully process the dataset in a reasonable amount of time. [14:44:25] I've got lots of people who specialize in hadoop looking at the problem. [14:44:35] It's surprisingly difficult. :( [14:44:56] Really, the biggest problem is that some pages take forever to process. [14:45:12] I bet that WP:Sandbox is one of them, but hadoop's logging sucks so I haven't confirmed yet. [14:45:18] yup, especially if they have a lot of changes and a lot of stuff to diff [14:56:45] FaFlo: these graphs are super cool [14:57:06] Emufarmers: I agree :) [14:57:20] I don't understand what the reciprocity metric indicates, though [14:58:01] It's in the paper: it measures how much disagreements go back and forth.
and for the line graph it is normalized for each article to put it between 0 and 1 [14:58:30] roughly: it shows how many red edges there are in the graph at that revision [14:59:02] while the normalized disagreement shows how many grey edges (unidirectional) there are [15:08:00] Timezone-appropriate greetings, scientists. [15:18:38] Ironholds: Can you ping me when you're around? (Nothing urgent) [15:23:22] guillom, TAG to you too! [15:23:54] hey halfak [15:26:11] hello computer users [15:45:11] FaFlo: so a non-reciprocal disagreement is if someone reverts someone, and then that person doesn't revert back? [15:46:49] Emufarmers: (kinda) yes: we don't call it "revert", because that word is already occupied by the meaning of "doing a full revert on the whole edit". but the rest of what you said is exactly correct. [15:47:02] I delete a word of yours and you don't put it back, for example [15:47:59] or I put back a word of X and you delete it again, then only you would have a "disagreement" edge of weight one to me [15:49:49] of course one of these disagreements *could* also be a full (going-back-to-previous-revision) revert, but many are not [15:50:43] So is the strength of the red line a reflection of the amount of disputed content, or the number of times you interact over it, or both? [15:51:09] (the formula in the paper is a bit above my level of mathematical proficiency) [15:51:22] +1 good question [15:51:42] the strength of the tie is only corresponding to the weight, i.e. to the words that were disagreed on [15:51:51] in both directions [15:52:05] the gradient of red indicates the level of reciprocity [15:53:07] by "strength of the tie" you mean the thickness of the line?
[15:53:11] yes [15:53:19] okay [15:54:30] the reciprocity takes into account (i) the difference in amount of the disagreements from A to B and B to A, (ii) the disagreement focus in each of the edits that included disagreement between them and (iii) how often mutual disagreements occurred [15:54:43] that's why the formula is so complicated :P [15:55:14] in the end, it is an arbitrary measure we implemented and if someone comes up with a better one, we are happy to implement it [15:55:42] it's hard to measure what is the best way to represent these things [15:57:14] but while I was checking it manually, it gave a decent picture of what was going on, so we decided on that metric [16:06:50] I see [16:07:30] Well, I'll let you know if there's anything else I can't figure out [16:07:38] sure :) [16:07:42] and I can't wait to be able to use it on arbitrary articles :> [16:07:53] hehe, yes, we are working on that [16:08:34] also, I would really like to make the edge context a bit better, so you can see the actual replacements of deleted words [16:10:31] Is anyone building a WikiTrust-like frontend for the API? [16:11:42] I actually have a student working on a tamper/greasemonkey script (and a service that is called by that) to color the front-end article with the authors [16:11:53] that calls the api [16:12:00] *wikiwho-API [16:12:55] excellent! [16:13:08] no working version I can test yet, I suppose? [16:13:45] not yet, still extremely early, but I will let you know when I have something usable [16:13:55] Gret :) [16:13:57] Great [16:15:06] halfak: https://phabricator.wikimedia.org/T92506 [16:31:25] bleeh [16:31:27] morning! [16:31:57] halfak, guess what I have? [16:48:55] * Ironholds hugs quiddity [16:49:03] I understand that today is hard for you, and I empathise deeply. [16:49:07] * quiddity hugs back.
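The disagreement-graph idea in the discussion above can be sketched as directed, weighted edges between editors. This is NOT the paper's formula (which, as FaFlo says, is more complicated, factoring in disagreement focus and how often mutual disagreements occurred); the reciprocity function here is a deliberately simple stand-in, just to make the edge/weight vocabulary concrete.

```python
# Hedged illustration: each time editor `actor` undoes words that editor
# `target` (re)introduced, the word count is added to a directed edge
# actor -> target. Reciprocity is approximated as the ratio of the smaller
# to the larger directed weight: 0 = one-sided, 1 = fully mutual.
from collections import defaultdict

weights = defaultdict(int)  # (from_editor, to_editor) -> words disagreed on


def record_disagreement(actor, target, n_words):
    """`actor` deleted or re-deleted n_words that `target` had added."""
    weights[(actor, target)] += n_words


def reciprocity(a, b):
    """Toy stand-in for the paper's metric, normalized to [0, 1]."""
    ab, ba = weights[(a, b)], weights[(b, a)]
    if max(ab, ba) == 0:
        return 0.0
    return min(ab, ba) / max(ab, ba)


record_disagreement("A", "B", 10)  # A removed 10 of B's words
record_disagreement("B", "A", 5)   # B removed 5 of A's words
print(reciprocity("A", "B"))  # 0.5 -- partially mutual
```

In the visualisation described in the log, the edge thickness would come from the summed weights in both directions, while a value like this one would drive the red gradient.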
[16:49:08] You have my condolences :( [16:49:25] also, please distribute to http://www.lspace.org/books/reading-order-guides/the-discworld-reading-order-guide-20.jpg [16:49:43] He changed the world, drastically for the better. That's all anyone can ask. [16:51:33] Terry Pratchett died!? :( [16:51:51] yep #sadface [16:52:39] quiddity: sorry for your loss, my condolences as well [16:53:34] (justincase, for clarity: he's just my favourite author, no closer connection) [16:56:12] ah, still a bummer of a day, sorry [16:57:25] augh [16:57:29] and on the /very same day/ we get http://www.theguardian.com/society/2015/mar/12/alzheimers-breakthrough-as-ultrasound-successfully-treats-disease-in-mice [17:34:24] ottomata, would you be up to batcave in a bit? I've got something I wanna pitch to you formally ;) [17:37:00] Ironholds: yt? [17:37:11] kevinator_, yep [17:37:35] in the mobile app session reports you ran ad-hoc [17:37:43] did you look back 30 days? [17:37:52] yes. [17:37:59] https://phabricator.wikimedia.org/T86535#1016026 [17:38:03] ok [17:38:25] we're going to run it weekly with a 30-day window [17:38:44] cool [17:38:46] have fun! [17:38:52] :-) [17:39:24] sure Ironholds [17:39:25] can now [17:39:29] ottomata, neat! [17:39:31] It's still not ready for development... just getting the remaining questions answered [17:39:43] kevinator_, to answer the geometric mean question: because maths [17:40:04] specifically, the distribution of session length and several other variables is, at best, log-normla [17:40:05] *normal [17:40:16] a mean would point us to a place on the density curve where nobody lives [17:40:49] you get 10 400-second sessions and 1 40,000-second one? great! [17:40:51] the mean is 4,000 [17:40:59] except nobody, zero people, had a mean session length of 4,000 [17:41:07] *had a session length of [17:41:43] ah, I get it... I'll add these comments to the ticket [18:02:21] oh it is in batcave?
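Ironholds' geometric-mean argument, worked through with his own numbers: with log-normal-ish session lengths, the arithmetic mean lands on a part of the density curve "where nobody lives", while the geometric mean stays near the typical session.

```python
# Ten 400-second sessions plus one 40,000-second outlier, as in the log.
import math

sessions = [400] * 10 + [40_000]

arith_mean = sum(sessions) / len(sessions)
# Geometric mean = exp(mean of logs); robust to log-normal skew.
geo_mean = math.exp(sum(math.log(s) for s in sessions) / len(sessions))

print(arith_mean)          # 4000.0 -- a length no session actually had
print(round(geo_mean))     # ~608 -- close to the typical 400 s session
```

The same holds for any heavily right-skewed metric: averaging in log space reports the typical case rather than letting one outlier dominate.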
[18:02:59] Ironholds: we are in batcave if you want to join [18:03:05] halfak: you might want to as well, we gonna talk about some spark! [18:03:53] ottomata, sure! [18:04:31] https://spark.apache.org/docs/latest/index.html [18:05:48] https://gist.github.com/ottomata/adcb200b99ac1c9d5941 [18:20:34] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hardware [18:38:40] DarTar, tnegrin, final pageview QA done [18:38:48] that appears to be my last task for y'all, so letting you know. [18:49:22] hey Ironholds, thanks, can you send a note to analytics or analytics internal? [18:49:28] DarTar, already have [18:49:35] you’re the man [18:51:39] well, self-evidently not, because see "last task" ;p [18:52:26] okay, off to phab I switch [18:58:11] DarTar, would you have time today to talk about the new dataset format, if you have thoughts on it? [18:59:10] Ironholds: it’s going to be tight today, I have a few short windows with no meetings and I need to draft the blog post [19:00:03] okie-dokes [19:00:21] FWIW, I pitched it to otto and he agrees it sounds fun and is willing to bear the CR load [19:01:59] let’s try tomorrow [19:03:03] sure [19:03:05] I'll find a space [19:04:29] invite sent [20:46:20] Alright, so using http://wikistics.falsikon.de/long/wikipedia/en/ to get a list of consistently-most-visited articles doesn't seem to be wise. There isn't a "last updated on" marker, but I was hoping it was still updated. But the maintainer hasn't made any edit on Wikimedia wikis since 2008. [20:48:17] Make that 2010. But still. [21:45:58] hey Ironholds [21:46:15] can I get your blurbs for the data release blog post now? [21:46:25] I have to edit it with Fabrice at the end of the day [21:52:03] halfak: ISBNs? [21:52:13] when did that happen? [21:52:52] DarTar: sure. Do you have a link? [21:53:10] hmm, no – sorry for the confusion, I want you to fill out this etherpad: [21:53:15] http://etherpad.wikimedia.org/p/DataReleases [21:53:30] yes, I know.
I meant a link to the etherpad ;p [21:54:13] got it :) Can you also add some top level statistics to impress our audience? [21:54:26] Like how many requests this data was extracted from? [21:55:24] Ironholds: ^ [21:55:43] who's the audience? [21:57:37] DarTar ^ [21:58:14] it’s going on the blog, audience should be: CSCW attendees, anyone with an interest in our data [21:58:47] top level stats, I mean not so much a documentation of the dataset but figures that may get people’s attention [21:59:49] I don't really have that to hand, I'm afraid [21:59:58] they've got exploratory apps with visualisations. If that's not enough ;p [22:02:38] Ironholds: sure [22:02:57] thanks! I might still do some final editing, this will go live tomorrow morning [22:03:09] halfak: sorry, we’re looking for a room [22:07:30] hey Ironholds, quick question regarding the annual report stats, I thought you said you could do this, which is why I haven’t moved it to our backlog [22:08:17] if you can’t, you said you knew about potential problems getting these requests from the varnishes, do you mind giving some context on the card so we can address it? [22:11:18] DarTar: I repeatedly asked in the doc and elsewhere whether you would take it over and reassign it, so I'm not sure where you got the idea I was going to do it [22:11:37] the further context is twofold; first, the annual report dedicated site is served from the misc cluster, and we only recently started streaming that into HDFS [22:11:46] (I'm not sure precisely when - it's in the analytics mailing list archive) [22:12:17] and, second, it's not a wiki which means it has god knows what structure, and so we have no pageviews definition. For that I'd just go with LIKE('text/html%') and only include 200 and 304 codes. [22:15:47] Ironholds: great, I’m in a meeting now, but can you add that to the card on Trello so it doesn’t get lost? [22:16:12] sure [22:16:50] and having done that I am now no longer on the R&D team!
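The ad-hoc filter Ironholds proposes for the annual-report site (no pageview definition exists for the misc cluster, so: content type matching `text/html%`, HTTP status 200 or 304) can be expressed as a simple predicate. The field names below are illustrative stand-ins, not the actual webrequest schema.

```python
# Hedged sketch of the proposed filter; in practice this would be a
# WHERE clause in Hive, not Python, but the predicate is the same.
def is_countable(request: dict) -> bool:
    """True if a request would count under the proposed ad-hoc rule."""
    return (
        request.get("content_type", "").startswith("text/html")
        and request.get("http_status") in ("200", "304")
    )


requests = [
    {"content_type": "text/html; charset=utf-8", "http_status": "200"},
    {"content_type": "image/png", "http_status": "200"},   # asset, skip
    {"content_type": "text/html", "http_status": "404"},   # error, skip
]
print(sum(is_countable(r) for r in requests))  # 1
```

Note this is deliberately crude: without a wiki's URL structure there is no way to distinguish articles from other HTML, which is exactly the caveat raised in the log.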
[22:16:56] this means people can stop giving me work [22:17:07] particularly at 6:17pm ;p [22:18:38] Ironholds: talking about time: how early is too early for you? I need to move our chat to another time at least for next week to get a room, and earlier means fewer people in the office. [22:20:40] guillom: I'm usually synced to SF; earlier works but poke me, like, the day before, the first couple of times? [22:20:49] because, synced to SF. So I might oversleep ;p [22:21:26] Ironholds: Fair enough :) I'll move it to the latest of the early slots where I can find a room. [22:23:21] which apparently means "earlier than 9:30." #OnlyInTech [22:24:00] hah [22:24:45] Ironholds: Is 9am too early? ('yes' is a totally acceptable answer) [22:27:32] guillom: isn't that 12:00? [22:27:38] yeah, that's when I clock in. Perfect :) [22:27:47] alrighty [22:27:48] thanks [22:27:49] ! [22:29:45] cool! [22:31:05] I created a new event because I'm useless with Google calendar and couldn't figure out how to change the other one. [22:32:21] And tbh I expect many of those meetings to be "(on IRC) Hey, have anything to talk about?" "Nope, not really" "Me either" "Ok, have a good day, you just got 30 minutes of it back" [22:34:19] hahah