[01:55:08] YuviPanda, diffing code? Yes. Didn't complete the diff run on enwiki though. I need space and there's a weird intermittent locking issue I'm working out.
[01:55:20] oh, hmm
[01:55:36] * YuviPanda wonders if he could write something in Rust/Clojure that could just output diffs in an XML format
[01:55:50] space needn't be an issue, since I can just put them on NFS
[01:55:56] unsure if it'll be useful, tho
[01:56:35] :P you shamed me for considering NFS before.
[01:56:45] * Ironholds marvels at hadoop
[01:56:50] halfak: only because you wanted to put a *db* on NFS
[01:56:54] files on NFS is ok :)
[01:57:00] :P Append writes are append writes
[01:57:15] not fully. if I put them on NFS, I'll write out one file per page
[01:57:38] so reading isn't that much of a problem
[01:58:10] I'd recommend one file per XML dump file.
[01:58:12] I can also set up a simple http server that caches them heavily (with invalidation) on a local /srv system
[01:58:17] halfak: hmm, why so?
[01:58:38] You can run out of inodes when dumping out 30 million files.
[01:58:41] halfak: and by 'xml dump file' do you mean the full dump file, or the split up ones?
[01:58:46] aaaah, that.
[01:59:00] Enwiki comes in as about a hundred split files.
[01:59:03] yeah
[01:59:11] although labs has the unsplit ones, I think?
[01:59:30] I could use https://code.google.com/p/google-diff-match-patch/ and write it in C#!
[01:59:36] and that way I can get trivial parallelization as well
[01:59:59] or I could use the Java version and write it in Clojure
[02:00:27] or... bite and use C++ + threads, and regret it
[02:00:32] halfak: only question is, will it be useful?
[02:01:55] diff-match-patch isn't a very robust strategy compared to what I'm working with.
[02:02:29] I think we're gonna hadoop this one.
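The inode concern raised above (one output file per page means on the order of 30 million files) points toward grouping diffs by dump file instead. The following is only a minimal sketch of that layout, not the diff code discussed in the log: `difflib` stands in for the actual diff algorithm, which is not described here, and the function names, the `.diffs.jsonl` suffix, and the record fields are all hypothetical.

```python
import difflib
import json
import os
import tempfile


def diff_revisions(old_text, new_text):
    """Stand-in diff (assumption): unified diff lines between two revision texts."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""))


def write_diffs_for_dump(dump_name, pages, out_dir):
    """Append the diffs of every page in one dump file to ONE output file.

    `pages` maps page_id -> list of revision texts in chronological order.
    One JSON record is written per revision, so the number of files
    (and inodes) grows with the number of dump files, not pages.
    """
    out_path = os.path.join(out_dir, dump_name + ".diffs.jsonl")
    with open(out_path, "a", encoding="utf-8") as out:
        for page_id, revisions in pages.items():
            previous = ""  # the first revision is diffed against empty text
            for rev_index, text in enumerate(revisions):
                record = {
                    "page_id": page_id,
                    "rev_index": rev_index,
                    "diff": diff_revisions(previous, text),
                }
                out.write(json.dumps(record) + "\n")
                previous = text
    return out_path


if __name__ == "__main__":
    # Hypothetical toy input: two pages that came from the same dump file.
    with tempfile.TemporaryDirectory() as out_dir:
        write_diffs_for_dump("dump-part-1", {1: ["a", "a\nb"], 2: ["x"]}, out_dir)
        print(os.listdir(out_dir))
```

Because each dump file is processed independently, the same unit of work could also be handed to separate workers, which fits the parallelization point made in the log.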
[02:02:37] hmm, ok
[02:02:41] * YuviPanda will stay out of it then
[02:02:49] * YuviPanda goes back to learning docker
[02:02:55] halfak: btw, you might like docker as well :)
[02:03:00] :(
[02:03:47] I don't mean that. Just that I sort of specifically developed these difference algorithms for a particular type of robustness. Plenty of other problems. :)
[02:04:42] yeah :)
[02:04:45] I would really recommend against using Hadoop for this.
[02:04:56] Why is that Ironholds?
[02:04:59] I'm going to put quarry into docker, so people can set up a working test install quickly
[02:05:01] * Ironholds thinks
[02:05:07] What is docker?
[02:05:10] so, to turn instinct into rationality, or at least a stab at it..
[02:05:13] * halfak googles
[02:05:29] I tend to find that Hadoop scales very well. In one direction.
[02:05:37] If you want it to take one big task, and fork that into many small tasks, awesome.
[02:05:45] docker == vagrant?
[02:05:56] If you have one small task....it doesn't have a linear time-to-completion relationship with the big task.
[02:06:21] Diff is perfect for splitting up. I've done a lot of map reduce stuff. :)
[02:06:55] Hadoop sucks at the things that btrees are awesome at.
[02:07:27] halfak: docker == super-lightweight-VMs
[02:07:35] So, I want to do the diff algorithm in hadoop and then the querying with btrees.
[02:08:55] The thing that surprises me about hadoop is the lack of stables.
[02:09:14] s-table
[02:10:00] it's all free-range data
[02:18:31] all your flat files are belong to us.
[03:10:39] Got the locking issue worked out.
[04:01:49] \o/
[13:26:27] G'morning Ironholds
[13:26:34] heya
[14:07:20] hello everyone!
[14:09:19] Hey Helder
[14:09:20] :)
[14:32:16] o/ all
[14:34:27] Hey Nettrom :)
[14:34:54] * Ironholds giggles
[14:34:59] I just introduced halfak to beantown slang.
[14:35:08] He refuses to accept it. His reaction is pissah.
[14:36:06] * Nettrom coffee
[14:42:53] wicked pissah
[14:44:19] If you'd put some jimmies and a hoodsie on my frappe, that would be wicked pisser.
[14:44:48] Ironholds, we also call turn signals "blinkers" in MN
[14:45:00] huh
[14:45:06] * Ironholds tries to think of what we call them
[14:45:07] oh!
[14:45:08] indicators
[14:49:50] Indicators would be acceptable in MN.
[14:50:14] Whereas, if you said "this ice cream is pissah", I'd probably tell you to fuck off.
[14:53:52] * Ironholds snorts
[14:53:58] I can't imagine you telling anyone to fuck off, for anything
[14:54:04] every time you swear I am a tiny bit shocked
[14:56:09] Heh. If you insult my icecream. We'
[14:56:43] re going to engage in northern MN style pretend-I'd-be-aggressive-when-I'd-actually-probably-be-agreeable
[14:57:57] ahh
[14:58:03] I've never had MN ice crea-wait, hell
[14:58:09] I've only had ice cream once since I came to this country
[14:58:42] We have nice ice cream.
[14:58:52] You should have some the next time you are in town.
[14:58:57] I should! Whenever that is.
[14:59:08] I'm going to organize a visit in January and feed you ice cream.
[14:59:26] Mwahahaha
[14:59:37] January == deathly cold MN
[15:34:40] leila, I have failed.
[15:34:54] I really don't know what's the block here
[15:35:10] :-(
[15:35:13] Thanks for trying.
[15:35:23] I'm sitting in the hangout in the hope that people might actually show up for my scary 8:30 AM meeting
[15:35:24] (do you think DarTar may have blocked us? :D)
[15:35:52] heh. :P
[16:35:44] ping leila
[16:35:47] Standup
[16:37:18] I'm going to write a R&D version of Miley Cyrus's "Party in the U.S.A."
[16:37:29] "I've got a standup/they're calling my name/the butterflies fly away"
[16:37:35] "casting the strings like yeah"
[16:37:44] "grouping the fields like yeah"
[16:37:51] I know the chords, it's on
[16:37:54] deal
[16:38:41] I never got around to going to Guitar Center to play it when it was popular
[16:38:53] pretty sure I'd freak everyone out by not playing the usual barrage of Metallica and fast runs
[16:41:11] oh gosh, now I just remembered the StrongBad Guitar episode
[16:41:14] gotta go watch that again
[16:42:45] http://youtu.be/Bloiue3mSuA btw
[16:45:54] Nettrom, I just realised. Something that needs to exist, if we're talking research jokes.
[16:46:10] A teacup with χ on it.
[17:11:30] hmn. Can anyone think of a word equivalent to "okay" that ends in an "i" sound? preferably two syllables.
[17:16:38] aye
[17:17:02] or you can go Spanish with "si"
[17:17:13] or French with "oui"
[17:17:13] Mais oui
[17:17:27] two syllables
[17:18:30] or "Hihi" if you want to go pirate-talk
[17:18:35] or dutch, even
[17:18:47] thankee all
[17:18:50] this is for work, I promise
[17:19:03] oopps. soworry
[17:19:10] ?
[17:19:15] no, it's cool!
[17:19:19] never mind the hihi :)
[17:24:21] dammit this is strangely hard
[17:24:27] Jessie J deserves more credit than I gave her.
[17:29:25] okay, something equivalent that ends in a "D" sound.
[17:33:46] Nettrom, you on the R&D internal list?
[18:00:07] Ironholds, "sounds good"
[18:01:23] halfak, hmn?
[18:01:44] Two syllables. Means "okay". Has a "D" sound at the end.
[18:11:25] aha
[18:11:31] too late, already sent out Party in the R&D
[18:35:40] Ironholds: not on the R&D list as far as I know
[18:35:46] darnit
[18:35:47] :(
[20:22:41] Hey Ironholds.
[20:22:46] hey halfak :)
[20:22:54] leila: you appear offline to my client
[20:22:55] Was looking at Page view stuffs for App traffic
[20:23:06] I see leila as online
[20:23:09] weird
[20:23:13] this is weird.
[20:23:22] So app traffic has application/json as MIME
[20:23:25] tnegrin, do you get my messages to you?
[20:23:26] but so can API calls
[20:23:36] Am I reading it right?
[20:23:48] yes -- but I can't respond
[20:23:54] (once I know I am reading right, future feedback will go to talk page)
[20:24:02] this is weird. let me sign out and come back in. not sure what's going on
[20:24:19] halfak, yes
[20:24:27] hence the filtering by the content of the call and the user agent
[20:24:50] leila: not better
[20:25:07] not sure what's going on. lemme do a few tests
[20:25:24] Ironholds, I don't see the specification about the user agent for apps in R:Page view. Should it appear there or only in the sub-pages?
[20:25:46] hmn; point
[20:25:55] I'm not sure how to elegantly put them in the main page pseudocode.
[20:26:12] I could actually go in the other direction and restrict it "where the request carries an appropriate MIME type..."
[20:26:28] OK. I'm playing with that. I read better when I'm editing. I'll have something nice.
[20:26:47] heh
[20:27:31] The sunlight is filtering into my work cave and hurting my eyes.
[20:27:39] /hisss/
[20:27:59] * YuviPanda watches halfak dissolve
[20:28:38] halfak, can you message me privately to test my IRC?
[20:29:00] I can do back and forths with ewulczyn_ but not sure if that's because we're on the same network
[20:29:20] Done. PM worked
[20:29:31] * YuviPanda PM'd leila as well
[20:29:33] tnegrin, it seems it's something with your IRC.
[20:30:01] I'm the manager
[20:30:06] ;P
[20:30:11] let me restart
[20:31:16] now I don't see leila at all
[20:31:25] lol
[20:31:28] awww
[20:31:29] What IRC client?
[20:31:35] colloquy
[20:31:49] mmm
[20:31:52] LimeChatttt!
[20:32:31] ok -- better now -- thanks for the help
[20:32:48] Woot
[20:48:07] Ironholds, bold edits to R:Page view OK?
[20:48:23] I read better when I edit and I think I can make it easier to read.
[20:51:21] totally!
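The filtering described above (app traffic and API calls both carry `application/json`, so the MIME type alone is not enough and the user agent and the content of the call are checked as well) can be sketched roughly as follows. This is an illustration only, not the definition on R:Page view: the user-agent marker `WikipediaApp` and the `sections=0` query parameter are assumptions chosen for the example, and `is_app_pageview` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlparse

APP_MIME_TYPE = "application/json"
# Assumption: substring that identifies the official apps in the user agent.
APP_USER_AGENT_MARKER = "WikipediaApp"
# Assumption: query parameter that marks a request for page content.
APP_CONTENT_PARAM = ("sections", "0")


def is_app_pageview(mime_type, user_agent, url):
    """Return True only if all three conditions hold for a request."""
    # 1. MIME type: necessary, but shared with ordinary API calls.
    if mime_type.split(";")[0].strip().lower() != APP_MIME_TYPE:
        return False
    # 2. User agent: separates app requests from other API clients.
    if APP_USER_AGENT_MARKER not in user_agent:
        return False
    # 3. Content of the call: separates page loads from other app requests.
    key, value = APP_CONTENT_PARAM
    query = parse_qs(urlparse(url).query)
    return value in query.get(key, [])
```

Checking the cheap string conditions first means the URL is only parsed for requests that already look like app traffic.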
[20:51:26] I actually do the same thing
[20:51:28] woot
[20:51:39] I'll go to fix a typo and...fuck, this sentence isn't structured right, and...gah...okay, one more tweak..
[21:01:02] This is kind of a big change Ironholds.
[21:01:03] https://meta.wikimedia.org/w/index.php?title=Research%3APage_view&diff=10088928&oldid=10087159
[21:01:09] But I didn't remove anything.
[21:01:26] halfak, yeah, looks a lot better!
[21:01:30] Woot
[21:28:57] hey halfak, I may need a few more minutes, there’s no room available
[21:29:06] Bah! Forgot to get a room
[21:29:07] Derp
[21:29:10] sorry dude
[21:29:18] Maybe a phone call room on 6th?
[21:29:22] 6th floor yeah
[21:29:23] brb
[21:46:41] * Ironholds grumps
[21:46:45] one of my friends' coworkers is dissing R
[21:46:55] "I think James knows R" "James, are you just gonna stand there and TAKE THAT?!"
[21:51:44] halfak, Ironholds, do you know if we already have a dataset with the following information? user_id, timestamp, edit, namespace, revert?
[21:51:56] definitely a halfak question
[21:52:01] also, toby was looking for ya earlier(?)
[21:52:02] brb meeting
[21:52:08] I have some datasets where timestamp is month, and I see halfak's revert datasets that have timestamp
[21:52:15] thought to check before making a new one
[22:26:47] halfak, please ping when you're free.
[22:26:55] Woops. Just got out.
[22:27:07] OK. So the "revert" field. What would be inside of it?
[22:27:28] Also, what would be in the "edit" field?
[22:27:44] it will be in {0,1}
[22:27:54] again, {0,1}
[22:28:38] something like this also works:
[22:28:39] user_id, timestamp, event
[22:28:54] where event can be edit_main, edit_talk, or revert
[22:29:24] I have this data for month, not timestamp.
[22:30:31] OK. I don't have that kind of data. But I am imagining how you could generate it.
[22:30:47] okay, that's helpful, too. :-)
[22:30:50] Ironholds: did I get it right by skimming over that thread? 1B fake PVs?
[22:31:11] if you point me to a few tables, I'll get it
[22:31:16] Are you looking to have the entire history of the wiki?
[22:31:47] let's say enwiki for now. entire history would be nice, but as a first step I can live with smaller subsets
[22:32:54] Leila, it will take a while to produce a dataset that joins revision and page (to get namespace), but if you do produce that, I could make use of it too.
[22:33:18] DarTar, yeeep
[22:33:25] I mean. 1B in the last month.
[22:33:35] but Christian noticed this thing in...hmn.
[22:33:35] yes
[22:33:47] I can build that, it seems it's something we both need. :-)
[22:33:49] June
[22:33:55] * DarTar shakes his head, why are we even looking at PVs
[22:34:08] leila, That would be great. Flat file or DB table. Either way is fine with me.
[22:34:11] :)
[22:34:23] DarTar, because everything else is too hard.
[22:34:25] * Ironholds shrugs
[22:34:36] sounds good. just point me to tables: revision?
[22:34:37] I'mma take that as "my line manager is telling me to deprioritise pageviews as a metric"
[22:34:51] if anyone wants me, I'll be at a blackjack table with a large drink and the R&D credit card.
[22:34:54] I'm imagining something like this. SELECT * FROM revision INNER JOIN page ON rev_page = page_id UNION ALL SELECT * FROM archive
[22:35:02] Ironholds: yeah, we seem to have every now and then this “oops” moment, whether it’s Platform or FR or whatever, where our traffic suddenly doubles (or drops)…
[22:35:05] We'll have to do some field renaming between revision and archive.
[22:35:25] okay. got it. will look into it and will let you know. thanks!
[22:35:25] DarTar, this is, I think, a good argument for having it explicitly built into our software as a metric and relying on x_non_shittily_formatted_analytics
[22:36:15] yes it’s not a bad idea
[22:36:54] leila, I can help with the first draft of a query. Writing them in etherpad is terrible. I'm looking for an alternative
[22:37:44] leila, http://pairjam.com/#qwhw2g
[22:43:09] (this is cool, halfak)
[22:43:18] +1 I like it :)
[22:45:33] OK. I think I have got the bits in there. Look good to you?
[22:45:52] I'm parsing it. give me a few min please.
[22:47:18] kk
[22:47:28] brb
[22:51:24] halfak, this looks good. thanks! let me play with it and get back to you
[22:52:45] Sounds good. :)
[22:55:29] halfak: how do you tell if the revision was a revert in the table your query generates?
[22:55:53] You can't ewulczyn___. I hope to join that table with revert datasets I have generated afterward.
[22:56:28] generating revert data is a bit of a trick. It's not explicitly recorded in the database.
[22:56:33] ah so revert is a revision?
[22:57:07] https://meta.wikimedia.org/wiki/Research:Revert
[22:57:09] (bam)
[22:57:28] halfak: do you think this would be faster in sqoop?
[22:57:56] For the page/revision join. Yes, quite possibly.
[23:01:52] Hey leila, we were going to look at something else now. I forgot what it was.
[23:01:58] Analytics meetup stuff?
[23:02:05] haha. :D
[23:02:09] yeah, let's do it.
[23:02:25] we need to plan Monday, though it's already late for Europeans to be informed. :-(
[23:02:36] let me have a look at votes
[23:03:17] Woo! We have a little bit of stacking on hadoop streaming.
[23:03:21] Otherwise, no stacking
[23:03:46] yeah
[23:04:30] okay. how about hadoop streaming and logging preferences for Monday?
[23:04:47] it may be too much, especially as logging preferences will need more time, but we can spend half a day on it and then continue?
[23:05:32] we can spend 2-3 hours Monday morning (realistically, there will be 2 hours) to kick-off logging preferences. Christian can also join
[23:05:44] we do hadoop streaming in the afternoon for 3 hours?
[23:05:49] what do you think halfak?
[23:06:07] Leila, I think we should not schedule the actual topics for a time block.
[23:06:18] Rather just block off hackathon time and we'll fill as needed.
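For the revision/page/archive query floated around 22:34, the field renaming mentioned there can be sketched as below. This is a hedged illustration rather than the query written in the shared editor: it runs against an in-memory SQLite database with a heavily reduced, assumed subset of the MediaWiki columns, and the output column names (`user_id`, `timestamp`, `namespace`, `archived`) as well as the helper names are hypothetical. As noted in the log, the result says nothing about reverts; those would have to be joined in from a separate dataset.

```python
import sqlite3

# Reduced stand-in schema (assumption): only the columns the query needs.
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                       rev_user INTEGER, rev_timestamp TEXT);
CREATE TABLE archive (ar_rev_id INTEGER, ar_namespace INTEGER,
                      ar_user INTEGER, ar_timestamp TEXT);
"""

# Live revisions take their namespace from page; deleted revisions in
# archive already carry one. Explicit column lists replace SELECT * so
# both halves of the UNION ALL line up under the same names.
QUERY = """
SELECT rev_id, rev_user AS user_id, rev_timestamp AS timestamp,
       page_namespace AS namespace, 0 AS archived
FROM revision INNER JOIN page ON rev_page = page_id
UNION ALL
SELECT ar_rev_id, ar_user, ar_timestamp, ar_namespace, 1
FROM archive
ORDER BY timestamp
"""


def all_revisions(conn):
    """Return live and archived revisions as one chronologically sorted list."""
    return conn.execute(QUERY).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?)", [(1, 0), (2, 1)])
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?, ?)",
        [(10, 1, 100, "20140101000000"), (12, 2, 101, "20140103000000")])
    conn.executemany(
        "INSERT INTO archive VALUES (?, ?, ?, ?)",
        [(11, 0, 100, "20140102000000")])
    return conn
```

Calling `all_revisions(demo_connection())` returns one tuple per revision; mapping namespace 0 and 1 to `edit_main` and `edit_talk` would give the `user_id, timestamp, event` layout requested earlier.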
[23:06:29] we've already blocked that
[23:06:37] but we need to know who to pull to each hour
[23:06:51] If we want to have Christian there Monday morning, we should tell him.
[23:06:54] Will we not all be working from the same room all day?
[23:07:03] Ahh yes. That's a good point.
[23:07:20] yeah, so we don't have to do pre-planning for most of them, but for some of them we should
[23:07:39] The original plan with the sorting of hackathon topics was so that milimetric could help us sort when we should aim for Christian's involvement.
[23:07:43] check analytics calendar. I've marked the mini-hackathon times.
[23:08:33] okay. so how about this: we all meet Monday morning. If Christian is there, great, and we can do whichever we choose. If he is not, we plan things that need him for Tuesday
[23:08:35] ?
[23:08:47] +1
[23:08:57] I think that hadoop streaming will be one of those things.
[23:08:57] okay. I'll send an email to give a heads up to everyone
[23:09:02] though it's quite late. :-\
[23:09:08] yeah!
[23:09:20] I think that's OK. We'll be all over email over the weekend for the traveling
[23:09:28] uhun
[23:10:34] How about the IEG meeting?
[23:10:49] Were there other teams that wanted to sit down with us?
[23:25:14] Hey leila. I just saw the email go.
[23:25:23] We have some meetings to schedule too, right?
[23:25:29] party time!
[23:25:31] * Ironholds waves
[23:25:37] anyone wants me I'll be drinking port and eating octopus.
[23:25:47] neither of these are euphemisms, it's just how we roll in beantown.
[23:25:48] meetings of what sort you mean, halfak?
[23:25:54] IEG
[23:25:56] (for now I sent the mini-hacks as place-holders)
[23:26:07] yeah, those I put 4-5pm.
[23:26:09] ?
[23:26:23] * halfak refreshes his calendar
[23:26:24] I don't want to take it from people's hack time
[23:26:25] :-(
[23:26:50] I don't see them on the calendar.
[23:27:57] they haven't been scheduled yet
[23:28:06] pinging tnegrin now as this was from him
[23:28:23] Gotcha. We could reach out to IEG folks ourselves too. I wouldn't mind.
[23:32:47] * halfak failed to effectively ping siko
[23:37:00] j/k
[23:37:04] * halfak succeeded
[23:46:33] Have a good weekend folks!
[23:46:34] o/