[03:06:59] puppy: http://imgur.com/i9bOpnD
[03:21:43] puppy????
[03:38:27] :>
[03:38:37] halfak: I just saw https://www.mediawiki.org/wiki/Lyon_Hackathon_2015/Buddies You were trying to absorb my non-stafferness, weren't you? :o
[03:41:12] I'd be very interested in working with halfak, but I don't really develop software. It's also vanishingly unlikely I will be in Lyon since I will be in Berlin the week before.
[03:41:49] Emufarmers, I was.
[03:41:59] harej: that ought to make it easy, logistically.
[03:42:08] It's the awkward gap week.
[03:42:16] The awesome travel/work week?
[03:42:28] Surely there are Europeans you can stay with
[03:42:36] I mean, like, I have a job or something.
[03:42:49] I don't know if my overlords will approve me working remotely from Europe for that long.
[03:43:17] ...depending on if they actually get a contract to me. They said they wanted to bring me on by early March. :-(
[03:43:50] Ahh. Yeah. This is one of the benefits of working remotely full time.
[03:44:09] If I can work out a place to work from every day, I could probably work from just about anywhere.
[03:44:18] I will be a remote worker too, but I don't entirely understand the terms of my work yet, plus I'll be new.
[03:44:26] Some people at the WMF travel a lot.
[03:44:34] "Remote
[03:44:47] harej, gotcha. Makes sense.
[03:44:49] Remote yet local at the same time.
[03:45:02] My core competency is living in DC, as opposed to the other half of my full time equivalency.
[03:45:48] This means I can show up in their office however much they want me to and train their staff (which will be a big part of my job!)
[03:46:22] It's part time, so I can bill all my hours one week and take the following week off, but I don't want to stretch that system too thin (I will already be using it for Berlin and Mexico City)
[03:48:15] It's hard to work from a conference/hackathon too.
[03:48:23] Well, the plan is to not work ;]
[03:48:28] :)
[04:58:27] halfak: heh, I’m back working from beaches again :)
[05:10:40] Hey YuviPanda|zzz! Have a good vacation?
[05:10:45] yes yes
[05:10:54] still at beach tho :D
[05:20:51] halfak: Honestly I wish I could travel more. :-(
[05:21:29] halfak: I used to travel a lot in my time in academia and in my 1.5 years at the WMF I've only travelled for work once and it was to my bloody home country.
[05:39:45] Deskana: I'm on a beacccchhhhh
[05:40:09] YuviPanda|zzz: Sad to be leaving that life behind? :-p
[05:41:00] Deskana: who says I am... ;)
[05:41:04] Hah
[05:41:10] There are beaches in SF, true.
[05:41:16] They'll probably be arctic by your standards though
[05:45:24] Deskana: I could still travel around inside the US
[05:45:27] Deskana: in some form or way
[05:45:29] unsure how yet
[05:45:34] Deskana: first priority is to fix my hand
[09:32:17] <_Fremen_> hello everyone
[09:32:18] <_Fremen_> I want to ask about colwiz and other research management software, is this relevant to here?
[12:35:39] halfak: you might like: https://phabricator.wikimedia.org/T90534
[19:57:43] Ironholds: yo!
[19:58:00] did you know...that page_id is now in x_analytics?
[19:58:20] yep
[19:58:31] I saw; it's a good change :). Hope y'all have fun with it.
[20:02:08] haha
[20:02:13] don't you CARE?!
[20:02:23] does it make pageview def easier?
[20:06:14] ottomata, yes, but I'm not working on that in the long-term, sooo... ;p
[20:06:23] like, it will become your problem or ellery's problem or aaron's problem or whoevers.
[20:06:30] dawwwww
[20:06:33] :)
[21:09:35] halfak, leila, if you get a chance today could you do a 30s check on the dataset I emailed out about?
[21:09:53] I'd like to get that out sooner rather than later (and then spend the weekend building a dynamic visualisation system around it. Booya.)
[21:13:25] wb, DarTar
[21:13:37] your connection died; the only other thing I had was, if you have 30s can you check out the dataset I emailed about?
[21:13:50] Ironholds: I'm looking at the dataset
[21:14:07] I'm not sure if it's fine to share rows with very few pageviews
[21:14:17] Ironholds: sure
[21:14:20] leila, example?
[21:14:29] first line of the table
[21:15:07] I understand that this table is based on sampled logs, so that may help you with entries with only few pageviews.
[21:15:26] oh! it's a percentage!
[21:15:47] 1 == 1%
[21:15:55] sorry, I should've made that clear in the email/table format
[21:16:14] I see. so it says: 1% of the pageviews from country x are to url/project y?
[21:17:01] yup
[21:17:07] or, sorry;
[21:17:14] 1% of pageviews to project y are from country x
[21:18:24] I see.
[21:18:28] I'll respond to the thread.
[21:18:43] what’s the subject of that thread?
[21:20:42] Nettrom, did you email me the location of that wp1.0 labeled dataset?
[21:20:47] thamls!
[21:20:51] I can't seem to find the message.
[21:20:52] ....thanks, even!
[21:21:12] DarTar, "Research-Internal] Country/language pageviews informatio"
[21:21:14] DarTar, "Country/language pageviews information"
[21:21:16] snap
[21:21:17] I win
[21:21:19] :)
[21:21:20] nope
[21:21:24] yours missed the trailing character
[21:21:32] boo
[21:21:32] ok thanks
[21:21:41] FAILS THE CHECKSUM. REJECTED PACKET. RESEND AT A PSEUDORANDOM INTERVAL IN LINE WITH THE VAN JACOBSON PROTOCOL
[21:21:48] says my internal TCP/IP nerd.
[21:21:54] Nettrom, nevermind. I found it in my other inbox.
[21:21:58] (yes, I have an internal TCP/IP nerd)
[21:23:59] Ironholds: I checked out the table data briefly. I'm not sure if it's safe to release it or not given that I don't know of all the other data sets we're releasing.
[21:24:10] leila, fair; thanks!
[21:25:22] my concern is for projects that are small, and then the percentage of pageviews can be large per country for those projects but the actual numbers can be small
[21:26:44] Ironholds: what’s the time period during which the data was aggregated?
[21:26:55] DarTar, all of 2014
[21:30:48] Ironholds, leila: I really can’t think of any privacy concern if this is data aggregated over a year and given that these are percentages
[21:30:57] And sampled!
[21:31:04] totally
[21:31:09] and <1% == "aggregated as 'Other'"
[21:31:15] :)
[21:31:29] I might suck at science but I'm a paranoid bastard around data
[21:31:32] the only edge case is the one in which this would disclose the country of an editor
[21:31:54] I mean, for that to happen you'd need...
[21:31:55] *thinks*
[21:32:05] I really can’t see how you would do that even combining all public data we release as of today
[21:32:26] brb
[21:32:40] if you had a project with only 10 pageviews and a single editor from a specific country which has no readers, and you rolled a 1,000 sided die and landed on 0, that could happen.
[21:33:24] Wouldn't we need to know the true pageview count too?
[21:34:24] Did you guys see the public job application posted to analytics-l?
[21:35:40] halfak: no?
[21:35:46] * DarTar checking
[21:36:24] halfak: oh god
[21:36:37] I wish I could undo that :(
[21:36:57] It looks like he isn't looking to get paid.
[21:37:19] This looks like a "hey, where do I go to volunteer" email.
[21:37:38] regardless, I think there’s a mistake about using a public mailing list for that
[21:37:43] +1
[21:37:48] Totally
[21:37:59] halfak, yep! re true pageview count
[21:38:06] I'm considering responding to the email and welcoming him to some of our public-facing projects.
[21:38:09] so, I suggest we reach out to him and at least I remove the entry from the archives
[21:38:11] DarTar, awesome! Could you note as such in email? That way I can get on to doing what I do best; geo data.
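The "<1% == aggregated as 'Other'" scheme Ironholds describes above can be sketched as follows. This is an illustrative reconstruction, not the actual release script; the function name and the 1% default threshold are assumptions based only on the discussion:

```python
def country_shares(pageviews, min_pct=1.0):
    """Turn raw per-country pageview counts for one project into
    percentage shares, folding countries under min_pct into 'Other'.
    Hypothetical sketch of the released table's aggregation."""
    total = sum(pageviews.values())
    shares = {}
    other = 0.0
    for country, views in pageviews.items():
        pct = 100.0 * views / total
        if pct < min_pct:
            other += pct  # too small to list individually
        else:
            shares[country] = pct
    if other:
        shares["Other"] = other
    return shares
```

Note that on a project with very few total pageviews, one reader can still dominate a row, which is exactly the small-project edge case leila raises.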
[21:38:19] halfak: that would work
[21:38:25] I'm impressed he got mailman to accept attachments
[21:38:27] Ironholds: sure
[21:38:28] DarTar, do you think I should reach out to him publicly or privately?
[21:38:30] ta!
[21:38:35] halfak: privately please
[21:38:38] Sure.
[21:38:42] thanks
[21:40:34] DarTar: why should we remove the entry from archive?
[21:40:52] leila: because there is personal information included in that email
[21:41:01] I interpreted it as what halfak says, offering to volunteer.
[21:41:09] including an endorsement letter
[21:41:23] ow! I just saw the resume, DarTar.
[21:41:24] I think the intention was to reach out to us, not to the planet
[21:41:25] makes sense
[21:41:37] leila, should I CC you on my response?
[21:41:38] haha! fair.
[21:41:53] nono, halfak. go ahead. I just wanted to learn why we're removing it from the archive
[21:41:57] He might be more interested in your work than mine after all.
[21:42:00] Oh sure :)
[21:45:45] halfak: if you want the dataset split into training/test-sets, I've got that for you too
[21:45:50] just let me know
[21:46:09] No worries. My training scripts do all that for me. I'm gathering features now :)
[21:46:23] It's really fast to gather features when you only need the text from the current revision.
[21:46:40] cool! I'm retraining my model, forgot to set the RF parameters correctly
[21:46:49] When I'm scoring for vandalism, I need diffs, site information, the first revision to the page and the most recent revision from the user!!
[21:47:08] What are the "correct" params?
[21:47:47] Man... at this rate, we're going to have a working wikiclass model in revscores by the end of the day. :)
[21:48:23] Just so long as I can get my pull requests merged.
[21:48:38] I suppose I can always just create a branch, merge all the pulls to that and install it.
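For reference, the random-forest parameter set Nettrom quotes in answer to halfak's question, gathered in one place. The dict form and the commented scikit-learn call are assumptions about how Wiki-Class consumes these values; only the parameter names and numbers come from the log itself:

```python
# Nettrom's quoted parameters, as keyword arguments for
# scikit-learn's RandomForestClassifier (the usage is assumed;
# the actual Wiki-Class training script is not shown here):
RF_PARAMS = {
    "criterion": "entropy",   # Wiki-Class already sets this one
    "n_estimators": 501,      # an odd count avoids tied majority votes
    "min_samples_leaf": 8,
    "random_state": 42,       # fixed seed -> repeatable test runs
}

# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(**RF_PARAMS)
```

As the discussion notes, a fixed random_state mainly matters for repeatability; it is arguably better left unset outside of testing and cross-validation.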
[21:48:48] params (besides criterion='entropy', which Wiki-Class already sets): n_estimators=501, min_samples_leaf=8, random_state=42
[21:49:16] Anything meaningful to that random_state=42?
[21:49:18] (although you can of course argue I shouldn't set the random seed)
[21:49:23] :P
[21:49:38] Only if you're planning on cross-folding
[21:49:40] helps to get repeatable results when you test, though
[21:49:44] +1
[21:50:09] halfak: I don’t even think I can moderate mailman archives :-/
[21:50:17] post hoc, that is
[21:50:36] Bah!
[21:50:42] There's got to be a way.
[21:51:17] bwa-ha-ha: http://wiki.list.org/DOC/How%20can%20I%20remove%20a%20post%20from%20the%20list%20archive%20or%20remove%20an%20entire%20archive%3F
[21:51:30] “You need to edit the raw archive "mbox" directly,”
[21:51:32] right
[21:51:55] yeah.. probably not.
[21:52:10] There should be someone at the WMF who has dealt with this before.
[21:54:02] I’ll file an RT ticket if ops considers this worth the effort
[22:03:07] For the sake of the curious, we asked around and it turns out that it is near impossible to remove a message from a list. However we might have luck with removing attachments, so we'll reach out to ops to try that.
[22:04:07] hi halfak, thanks for the mediawiki-utils library
[22:04:20] :) I'm glad it is helpful.
[22:05:13] I'd like to know any suggestions or pain points you have as you work with it.
[22:05:22] I started streaming over the dump, but it looks like it'll take much longer than 24 hours on 16 cores. I think I'm doing something wrong. Just wanted to check in about it.
[22:05:37] Is it running on one of the stat machines?
[22:05:42] Yes 1002
[22:06:12] Can you point me to your script?
[22:06:59] sure, give me a minute.
[22:07:44] hey ashwinpp, welcome to da channel
[22:07:57] thanks :)
[22:08:19] ashwinpp: I guess I owe an email to introduce you. DarTar is our research lead.
[22:08:28] halfak you know by now.
;-)
[22:08:40] o/
[22:08:44] :D
[22:09:07] Yes
[22:10:10] Code looks good.
[22:10:37] I'm looking at /home/ashwinpp/code/wikimedia/src/main/python/extract_link_diffs.py
[22:12:12] ashwinpp, I'd be happy to take a pass on the code and run a couple of tests.
[22:12:22] ahh!
[22:12:39] Oh, I thought I would have to push it on git :P
[22:13:01] That would be great! I'd just fork it and submit a pull request :D
[22:13:19] So, I see one potential problem.
[22:13:30] I think that you are processing all of the bz2 and 7z files.
[22:13:42] This line: files = [x for x in os.listdir(head_path) if 'enwiki-'+date+'-pages-meta-history' in x]
[22:13:55] Doesn't seem to specify an extension.
[22:13:59] Oh okay. That was stupid
[22:14:12] Na. I do that all the time. That's why I thought to look for it.
[22:14:15] :S
[22:14:39] I'll correct that.
[22:15:08] you can also check log.txt in the same folder
[22:15:51] if you grep for "finished" files, it's at 46 right now after 12 hours
[22:16:27] Yeah.... We should definitely be running much faster than that -- especially since you are skipping everything before last month!
[22:16:49] There are a total of 176 files I have to stream (I had mistakenly taken it to be twice the number)
[22:17:04] +1
[22:17:05] So it looks like it should take ~50-55 hours
[22:17:21] Some files are much larger than others too.
[22:17:30] yes, that is also true
[22:17:39] Those files contain pages like WP:Sandbox and WP:AIV
[22:19:12] ashwinpp: the bot has received some comments/questions. did you get a chance to go over them?
[22:19:18] Oh, I see
[22:20:08] leila, yes I went over them and have the answers more-or-less ready. Bob will put up the comments after we discuss them today
[22:20:49] There are some suggestions which we should incorporate, like filtering out countries etc. so that we don't over-link.
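The bug halfak spots above — the listdir filter matching both the .bz2 and .7z copies of each history dump, doubling the apparent workload — comes down to adding an extension check. A sketch of the corrected filter (the function name is hypothetical; only the prefix expression comes from the quoted line):

```python
import os

def history_dump_files(head_path, date):
    """List only the .bz2 history dump files, skipping the .7z
    duplicates that the original substring-only filter also matched."""
    prefix = 'enwiki-' + date + '-pages-meta-history'
    return sorted(
        f for f in os.listdir(head_path)
        if prefix in f and f.endswith('.bz2')
    )
```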
[22:22:29] And there are some heuristics which we are using to find the correct anchor text for creating a link, but they can be tweaked and are up for discussion. However complicated heuristics are hard to justify.
[22:23:53] halfak, the CPU usage also doesn't seem to be fully utilized. Could disk I/O be the bottleneck?
[22:24:25] It could. CPU could also be a bottleneck on the decompressors.
[22:24:56] You end up using twice as many cores as you specify because it will load a decompressor separate from the multiprocessing.Process.
[22:25:09] ashwinpp: makes sense. the comments are very thorough, which is nice, since they can help you improve it.
[22:25:56] I'm not sure where the bottleneck would be here though. It looks like the decompressors are not topping out either.
[22:26:10] It could be the log writes.
[22:26:22] That seems unlikely, but that's the best hypothesis that I have now.
[22:27:15] leila: yes. I was pleasantly surprised to see such effort and detailed analysis. It actually set me thinking in different directions :)
[22:27:40] :-)
[22:28:45] log writes are 1 line per page at the maximum. The link diff writes are almost always more than that.
[22:36:59] ashwinpp, yeah, I'm wondering if you are getting poor performance from multiprocessing.Queue. I have had that trouble before.
[22:37:19] It's that datastructure that collects all the results from each processor.
[22:39:13] I see, but there is not a lot being handed over to it by yield
[22:39:55] I think one improvement, which I should do is to not store the last diff before the last month and work around it. This way I will reduce the overhead of reading the text before last month.
[22:40:43] Sorry, by last diff, I meant the last revision text
[22:49:24] halfak, so should I run it with 8 threads?
[22:49:33] cpu_count()/2?
[22:52:21] Actually, I might be wrong about the log writes. I'll remove the unnecessary lines.
[22:57:09] ashwinpp, I'd run it with the full cpu_count. Just make sure to NICE it.
[22:58:28] sure, 20
[23:43:10] Ironholds: Got a moment?
[23:43:25] (not urgent)
[23:43:39] guillom, totally!
[23:43:46] cool
[23:56:42] hey halfak, I got a ping from the folks behind m:Research:Impact of Wikipedia on Academic Science, did we decide who would reach out to them in a volunteer capacity? I haven’t and I forgot if I was supposed to
[23:57:08] Ack! That was my bad. I had planned to reach out to them.
[23:57:12] I'll get on that right away.
[23:57:46] thx dude
[23:58:20] halfak: make sure you reply to both, I got the ping from Shane
[23:59:06] * Ironholds grumblegrumbles
[23:59:10] I hate writing dynamic web apps