[01:06:44] Ironholds, re. variable naming, do we now have wmfuuid?
[01:06:56] Oh wait, I think I misread the message.
[01:06:57] I've asked for AppInstallID so they match
[01:07:02] +1 for that
[03:57:46] leila, go relax!
[03:58:14] :P Same for you.
[03:58:22] It's like a billion o'clock there
[03:58:24] I am busy following some of Leila's advice
[03:58:33] it only took me four months to realise she was right!
[03:58:42] so I have an excuse
[03:59:39] Is this the kind of wisdom that's appropriate to convey via IRC?
[04:00:34] I hope so, that's how she conveyed it
[09:28:05] halfak: hah, looks like the first thing I’m going to do as soon as christmas starts is meet you and dario :D
[14:27:41] YuviPanda, ho ho ho!
[14:27:48] halfak: :D
[14:27:56] like literally first thing :D
[14:28:02] meeting is at midnight-1AM 25/12
[14:28:19] Boo. That's a rough meeting. Want to make it earlier?
[14:28:42] Or, you know, you could say "No. Leave me alone."
[14:28:51] It is a holiday after all.
[14:29:52] halfak: nah, ’tis ok :)
[14:36:35] morning.
[14:43:31] Hey Ironholds
[14:43:59] I cannot find my keys. I'm terribly worried I might've lost them :/
[14:44:06] I can't actually leave the apartment without them
[14:46:15] Well, it would be hard to get in without them, right?
[14:46:28] yeah, so in theory they're in here somewhere
[14:46:34] unless I somehow managed to take them out with the trash
[14:46:40] or, unless I left them in the door and someone took them
[14:47:02] GorillaWarfare has my spare set, which would be useful if only she wasn't currently in Maine.
[14:47:21] Welp, you can order toilet paper and groceries on amazon.
[14:48:38] yeah, but I need a way to get to the post office some time by the 26th
[14:48:50] and amazon does not do fresh stuff (womp womp)
[14:49:49] did you know amazon sells LAB EQUIPMENT?
[14:49:55] Yup.
[14:50:00] :P
[14:50:59] I want me a lab coat with sergeant's stripes
[14:51:01] * Ironholds nods firmly
[14:52:13] Why sergeant stripes?
[14:52:39] so I can open every sentence with "LISTEN UP, SCIENCE MAGGOTS"
[14:52:46] which is really all I've wanted for the last 24 years.
[14:52:47] :P
[14:52:56] even unrelated sentences
[14:53:02] "LISTEN UP, SCIENCE MAGGOTS! How was your flight?"
[15:22:24] say, halfak
[15:22:29] Wussup?
[15:22:47] you wouldn't have any datasets around with a union of the revision and archive tables, wouldja? Or any ideas on how best to identify a subset of users by longevity?
[15:23:01] I want to do that session-analysis-over-user-life thing
[15:23:28] Sure. It'll be a little old -- like 6-12 months.
[15:23:30] * halfak digs.
[15:23:55] sweet! Thankee :)
[15:24:08] related, I'm opening the Phab ticket to get ops to create a 'datasets' db on analytics-store.*
[15:25:07] Why not use staging and prod?
[15:25:38] prod isn't duplicated to analytics-store, I thought?
[15:26:22] duplicated?
[15:26:42] Oh I see. We didn't carry over the DB.
[15:26:48] Yeah. It appears that way.
[15:27:02] How about I give you all of the sessions and labeled revisions up to Nov. 2013?
[15:27:12] So, one dataset contains user sessions.
[15:27:23] The other contains all revisions labeled with session stuffs.
[15:28:11] Also, this dataset has the time of the last user edit already baked in.
[15:28:50] ooh
[15:28:56] wait, but if you've done this analysis, why do I need to? :p
[15:29:06] I actually haven't yet.
[15:29:15] I made this dataset for nettrom when he was working on something similar.
[15:29:23] aha
[15:29:26] cool!
[15:29:33] I just did a lot of the ground work :)
[15:29:53] OK. See stat1003:/home/halfak/projects/Archive/nettrom_sessions/datasets
[15:30:38] session_index is the Nth session for this user
[15:30:53] session_ordinal is the Nth edit in a session.
[15:31:13] When last_timestamp is NULL, that means there was a session boundary there.
[15:31:18] thankee!
[15:32:43] I have so much stuff I could work on this week and I genuinely can't decide what
[15:33:07] heh.
Do the thing that sounds fun at the time.
[15:33:17] Let your muse decide :)
[15:33:21] in that case, I am going to study automata
[15:33:32] I am going to do a study on behavioural patterns we find with non-human traffic
[15:33:38] the difficulty is working out how not to make it circular
[15:33:58] I guess I can say "here are the heuristics I used, here is the traffic I identified, here are the traits"
[15:34:49] I just need to work out if I can get away with titling the writeup "You know I'm all about that bots"
[15:37:25] Have you done any atoll simulations?
[15:37:55] troll simulators?
[15:38:08] atoll
[15:38:09] I..don't know what those are, so probably not!
[15:38:20] https://en.wikipedia.org/wiki/Atoll
[15:38:34] I mean, all I know about atoll simulations is: when an american uses the word "testing" and "atoll" in the same sentence, try not to be fishing nearby
[15:38:51] ahh, I thought it had a special HCI meaning ;p
[15:39:05] You construct your cellular automata as though they are standing on the shore of an atoll.
[15:39:26] It's a very simple cell structure that lets you play with chaotic patterns.
[15:39:33] * halfak looks for his old code.
[15:39:34] aha
[15:39:46] http://cran.r-project.org/web/packages/CellularAutomaton/CellularAutomaton.pdf
[15:40:05] oh god, what a hideous package
[15:40:09] it's dependent on R.oo?!
[15:40:18] last published AUGUST 2013?
[15:40:22] okay, ew. No.
[15:41:01] heh.
[15:41:34] Looks like I can't find my python simulations easily, but building these types of simulations is really easy.
[15:41:50] * halfak tries to find a nice writeup.
[15:43:47] wtf
[15:44:04] I've read about this in books, but never seen it online. It looks like it hasn't made it there!
[15:44:38] Here's where I first came across these simulations: http://www.amazon.com/Complex-Adaptive-Systems-Introduction-Computational/dp/0691127026
[15:44:51] cool!
[15:44:51] I did a lot of programming while I was reading this book.
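The kind of simulation described above is short enough to sketch in Python. This is a minimal sketch, not halfak's lost code: it assumes the "atoll" arrangement means a one-dimensional ring of cells whose neighbourhoods wrap around at the ends, and it uses Wolfram's elementary rule numbering. The names `step`, `run` and `render` are made up for illustration.

```python
def step(cells, rule):
    """Advance a ring of 0/1 cells by one generation under an elementary rule (0-255)."""
    n = len(cells)
    new_cells = []
    for i in range(n):
        # Assumption: the cells sit on a ring, so the neighbours wrap around.
        left = cells[i - 1]
        centre = cells[i]
        right = cells[(i + 1) % n]
        # The three neighbours form a 3-bit number that selects one bit of the rule.
        index = (left << 2) | (centre << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells


def run(rule, width, steps):
    """Start from a single live cell in the middle and return every generation."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history


def render(history):
    """Draw each generation as a row of '#' (live) and '.' (dead)."""
    return "\n".join("".join("#" if c else "." for c in row) for row in history)


if __name__ == "__main__":
    # Rule 30 is a common example of a chaotic-looking elementary rule.
    print(render(run(rule=30, width=31, steps=15)))
```

Changing `rule` to another value between 0 and 255 is enough to explore the other elementary rules; nothing else in the sketch depends on the rule chosen.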
[15:44:53] I'm reading http://natureofcode.com/book/chapter-7-cellular-automata/ now
[15:45:17] This is a better book: http://www.amazon.com/Complexity-Guided-Tour-Melanie-Mitchell/dp/0199798109
[15:45:31] yay!
[15:45:39] That's exactly what I was looking for Ironholds
[15:45:42] There's no R library for this. There should be.
[15:45:46] * Ironholds adds to to-do
[15:45:53] I bet I can implement Wolfram's systems in C++ hella-fast.
[15:46:33] actually the examples in this are in C++. Perfect!
[15:47:13] Hmm... Speed isn't really the point here.
[15:47:25] It's still open to question whether what is computed is actually useful.
[15:47:27] I just like writing C++. It's evocative as a language
[15:47:35] Fair enough
[15:47:35] yeah, I'll read and retrieve data and think
[16:07:17] halfak, amusingly, you know one of the articles I strongly suspect of being the target of a bot attack?
[16:07:21] the one on additive gaussian noise.
[16:07:30] if that's not deeply funny in a very sciency way, I don't know what is
[16:07:51] Seems apt
[16:16:28] halfak: any suggestions on where to share a (revscores) dataset in csv?
[16:17:09] turn it into a tsv!
[16:17:14] * Ironholds spits at csvs. SPITS, I SAY.
[16:17:30] Helder, if it is small, then I like datahub.io
[16:17:45] Otherwise, I could put it up on datasets.wikimedia.org
[16:17:46] It is 1,2 MB
[16:17:56] Small enough for datahub.io
[16:18:08] You can upload and describe datasets there for free.
[16:18:23] it has the scores for ~5000 recent revisions
[16:18:27] Nice.
[16:18:34] *scores -> features
[16:18:56] * halfak is poking at reading and writing model files.
[16:19:00] brb
[16:22:47] goddamn this album is incredible
[16:22:52] halfak, how do you feel about cinema?
[16:24:56] Ironholds: what is up with csv?
[16:25:12] I just don't like them
[16:25:19] commas are far too common in user-inputted text
[16:25:21] morning leila!
[16:25:45] morning Ironholds. morning halfak.
[16:27:47] Morning leila!
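The worry above about commas in user-entered text can be made concrete. Blindly swapping every comma for a tab only works when no field contains a comma; going through a CSV parser avoids the problem. A minimal sketch using Python's standard `csv` module; the function name `csv_to_tsv` and the sample rows are illustrative and are not taken from the revscores dataset.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Parse CSV text properly and re-emit it with tab-separated fields."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # The writer still quotes any field that itself contains a tab.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Made-up sample: the second row has a comma inside a quoted field.
sample = 'rev_id,comment\n1,"fixed typo, added ref"\n2,plain\n'

# Naive approach: every comma becomes a tab, including the one inside the comment.
naive = sample.replace(",", "\t")

if __name__ == "__main__":
    print(csv_to_tsv(sample))
    print(naive)
```

With the naive replacement the first data row ends up with three fields instead of two and keeps its stray quote characters, which is the failure mode the parser-based version avoids.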
[16:29:04] Helder, I share this overly concerned stance with regards to CSVs. Personally, I see the TAB character as specifically designed to delineate tabular data, so it's a shame we don't just use it for that.
[16:30:18] It seems that the devs of MySQL and Postgres agree since their default input/output formats are Tab separated.
[16:30:35] tabs are good
[16:33:11] Tabs are right, tabs work. tabs clarifies, cuts through, and captures the essence of the evolutionary spirit. tabs, in all of their forms; tabs for life, for money, for love, knowledge has marked the upward surge of mankind.
[16:35:42] I got lucky: a simple search and replace worked for my data and now I have a tsv file
[16:37:14] yay!
[16:39:30] hmm... looks like I need to create an organization on datahub first =/
[16:39:38] http://help.datahub.io/kb/general/creating-a-dataset-on-the-datahub-december-2013
[16:40:12] Wat. K. Hmm.
[16:42:11] Helder, how about I upload to datasets.wikimedia.org for now and we talk to DarTar about adding external collaborators to our R&D account.
[16:42:29] *account-->org
[16:42:32] I'm ok with that
[16:42:41] If you send me an email I'll upload it.
[16:48:51] done
[16:51:48] goddamn IETF
[16:51:58] Their HTML is incredibly inconsistent and some of their RFCs don't /actually exist/
[16:55:48] halfak: uh fair warning. I've been locked out of the house since everyone else went to a Christmas eve party. Should be inside before midnight but if not won't make the meeting. I'll keep you posted
[16:56:03] Gotcha.
[16:56:17] Will relay to DarTar if we don't see you for some reason.
[16:56:45] what's the meeting about and why does nobody invite me to fun things?
[16:56:53] hey quiddity what are you doing for christmas?
[16:57:37] Ironholds, about the events stuff I talked about at the last group meeting.
[16:57:41] cool!
[16:57:48] DarTar somehow had no idea I was working on this stuff.
[16:57:56] I demoed it at Wikimania too.
[16:58:09] * Ironholds blinks
[16:58:13] but you did that great presentation!
[16:58:17] <3
[16:58:43] Here's my work document: https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource
[16:58:48] Seriously, that presentation had indirect effects you didn't even see
[16:59:10] namely, it made me spend 3 months thinking really hard about the nature of importance
[16:59:25] Oh! Different presentation, but still. :D
[16:59:56] Almost done with the writeup on importance measures BTW.
[17:00:02] aha
[17:00:03] cool!
[17:00:14] I hope to use the X-mas holiday as an opportunity to get this background work done.
[17:00:23] re. importance, productivity and value.
[17:02:12] oh yeah, people get holidays
[17:07:57] halfak: after changing SVC parameters to use gamma=1e-16 and C=3, the recall increased a little (from 0.00 to ~0.3):
[17:07:58] https://gist.github.com/he7d3r/7f2aebb00e18b4963d07/413c2f92e743b057755d7d4487a8c65469b0e837#file-classification_report-txt-L5
[17:08:04] yay
[17:08:24] this is more reasonable than zero
[17:08:38] Helder. That's great. I'm just about to push a change that will add a group of tests to LinearSVCModel
[17:09:06] I'm stuck on a pickling error with features.
[17:09:16] I should just push the change and come back to this.
[17:11:38] holy crap
[17:11:46] I just produced an incredible R error I've genuinely never seen before
[17:13:19] Helder https://github.com/halfak/Revision-Scoring/pull/18
[17:14:33] Note the test function now returns a dict of "mean.accurace", "roc" and "auc".
[17:19:04] I'll be back later
[17:31:53] good morning DarTar. I'm in the Hangout whenever you'd like to join.
[17:32:04] hey leila, connecting
[17:36:27] hey DarTar, tnegrin :)
[17:36:37] I wasn't aware anyone else was working today
[17:37:00] I think a lot of us are working -- grace is online
[17:37:13] so is Kevin
[17:37:22] next week less
[17:37:31] :-)
[17:37:47] I’m editing Phabricator tasks :-)
[17:38:00] yay!
[17:38:04] trying to get as much stuff organized before I shut off for a week
[17:38:13] I'm gonna come to the office and wreck up the place
[17:38:41] steal all the mugs and watch crazed engineers kill each other for the privilege of sticking their head under the coffee machine's spout
[17:40:16] Hopefully the carnage would be over by the time I’m back in the office on Tuesday 6th
[17:41:26] reason #37 for being grateful the analytics engineers are all remote
[17:45:39] morning ggellerman_!
[17:46:04] tnegrin, so does that mean we have our 1:1 today, or? I'm not sure how to apply the meetings calendar to christmas eve
[17:46:12] yes
[17:46:25] yay! people!
[17:48:34] halfak, we should get R&D t-shirts made or something. Every other team has a t-shirt
[17:48:55] Or lab coats.
[17:49:00] deal!
[17:49:05] Can you get screen printed lab coats?
[17:49:07] R&D in big letters, with "regression and derivation" underneath
[17:49:11] I don't know. I'll ask the internet
[17:49:28] http://www.imageoutfitterstampa.com/scrubsandlabcoats.php boom
[17:51:42] Now we just need to get the wiki research logo on there and we're set.
[17:53:56] can we have nicknames too?
[17:54:09] Science Master, Count Logula...
[17:54:19] we don't have good nicknames for dartar or leila :(
[17:54:38] I figure for ellery we could just print a string of 20 random consonants
[17:54:42] it probably spells something
[18:10:40] halfak: as an update I'm still sitting by the road with no laptop and very little power to my phone surrounded by mosquitoes
[18:10:52] Let's see if they show up in the next 20mins
[18:10:53] DarTar, ^
[18:11:01] we might not be able to get Yuvi|LockedOut
[18:11:14] And also see if me murdering them one by one will take less than 20mins if they do show up...
[18:12:50] hey guys
[18:12:56] oh noes
[18:13:49] talking to leila, hope to see you both in 15
[18:15:31] Ironholds, I am cat-sitting for a neighbour (the most anxious cat I've ever met, but I shall befriend him, damnit!), and going to a 95%-probability-of-high-awkwardness friend's parent's house for dinner, and catching up on sleep, and playing this which I bought last night for some reason (75% off! Keegan recommended months ago, iirc) http://store.steampowered.com/app/227300/ and that's about it.
[18:15:37] neat!
[18:16:27] yourself?
[18:17:34] (and was there an ulterior motive for asking, or was it just random followup to yr grumble about not being invited to meetings? :)
[18:30:10] halfak, Yuvi|LockedOut: I’ll be back in 2
[18:30:19] kk DarTar
[18:34:01] halfak, Yuvi|LockedOut: I’m in the hangout
[19:12:00] quiddity, naw, just wondering
[20:34:01] halfak, I've been trying to finish a todo list since yesterday morning. I've captured some items but more items are added to it. :-\
[20:34:07] it's kind of funny at this point.
[20:34:07] :D
[20:34:46] I think we should stop talking to each other for a month so we can catch up. ;-)
[20:34:59] :P did I give you todo items?
[20:35:08] not yet. do you want to? :D
[20:35:33] I take my question back, halfak. ;-)
[20:35:36] Hmm... I mean, if you had time, but from the sounds of it, your list is pretty big.
[20:45:17] I skipped lunch and I’m so hungry… it’s already 3:45pm wow
[20:45:26] I’m signing off for the year
[20:45:33] Happy Holidays everyone!
[20:45:35] kevinator, o/
[20:45:52] ciao
[20:46:05] halfak: I just realized...
[20:46:16] halfak: that we can just use XMPP and get our ‘can not resume’ problem fixed
[20:46:20] XMPP has ‘store and forward’
[20:46:26] you don’t even have to specify an identifier
[20:46:32] it just resends you the things you missed
[20:46:41] so everyone gets their own buffer...
[20:46:42] serverside
[20:46:44] That's not quite what we want though.
[20:46:57] well... yeah it is.
[20:47:11] Snuggle runs based off the last 30 days of recentchanges.
[20:47:29] So it would not be able to pick up with the stream.
[20:48:05] isn’t it a one time operation to first do the last 30 days and then just keep up with a more ‘specialized’ stream?
[20:48:27] In this case, yes.
[20:49:04] I would have figured that storm had something like this built in.
[20:56:48] halfak: ah, hmm. Maybe. Haven’t looked at storm at all.
[21:41:51] hey guys, I’m about to push the button and announce the AFT corpus on the lists, any last minute change you want me to make?
[21:41:58] halfak, Ironholds ^
[21:42:22] more images
[21:42:24] 1 star
[21:42:25] negative. Looks good to me. Did you ever get a DOI associated with it?
[21:44:21] halfak: yes, via figshare
[21:44:25] it’s a small dataset
[21:44:40] so I could upload the 3 dumps and cross-link with the datahub
[21:45:10] really looking forward to the “pure registry” version Mark announced
[21:45:20] +1
[21:59:19] halfak: grabbing a cup of coffee, brb
[21:59:30] kk
[22:05:15] halfak: I’m in the hangout now
[22:06:30] leila: is it ok if I give your bike to my daughter for xmas?
[22:06:43] hahaha! it's too big even for me tnegrin
[22:07:01] please don't recycle it. I'll bring it home (the derailleur is causing problems again)
[22:07:05] she's grown 5 inches this year
[22:07:16] j/k -- I left my bike here for all of november
[22:07:20] It will be a good gift for next year.
[22:07:58] For my career development, I should attend a workshop for fixing bikes, tnegrin. ;-)
[22:13:46] halfak: I get to plot the ROC curve: http://tools.wmflabs.org/ptwikis/static/data/roc_svc1.png
[22:19:03] what are you predicting danilo_?
[22:19:46] reverted revisions
[22:20:40] and what does svc stand for?
[22:22:46] it is the module of sklearn I used ( http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html )
[22:25:26] thanks danilo_. are you guys debugging? crossing the 45-degree line is problematic.
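Why a curve crossing the 45-degree line is a warning sign: the diagonal is what random scores produce, so a curve below it means the scores rank the positive class (here, reverted revisions) lower than the negative one, and negating the scores mirrors the curve so that the area under it becomes 1 minus the original area. A pure-Python sketch under stated assumptions: `roc_points` and `auc` are hypothetical helpers, not the Revision-Scoring `test()` method, and the labels and scores are made up.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    # Sweep the threshold from the highest score down to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [1, 0, 1, 0]          # made-up: 1 = reverted, 0 = not reverted
    scores = [0.9, 0.8, 0.7, 0.1]  # made-up classifier scores
    print(auc(roc_points(labels, scores)))
    # Negating the scores reverses the ranking and mirrors the curve.
    print(auc(roc_points(labels, [-s for s in scores])))
```

An area well below 0.5 on held-out data therefore suggests a flipped label or sign somewhere in the pipeline rather than a model that has learned nothing.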
[22:27:29] yes, we are trying to find the best configs, I have also made this tool to analyse the features http://tools.wmflabs.org/ptwikis/Features
[22:30:50] nice tool, danilo_.
[22:31:12] :)
[22:33:52] just a thought danilo_: you can try another algorithm on the same dataset you're using SVC on to see if the problem is from the data or the implementation of the algorithm you're using.
[22:34:26] This kind of behavior is usually not caused by the data but it's good to check.
[22:35:22] what language are you using danilo_?
[22:36:08] yes, I will try with other algorithms
[22:36:10] python
[22:55:20] Naive Bayes algorithms have a better performance: http://tools.wmflabs.org/ptwikis/static/data/roc_bayes1.png
[22:56:23] Eek! That ROC is pretty bad.
[22:56:33] There's no way that's right.
[22:57:03] danilo_, I have a pull request in that will generate ROC data with the test() method.
[22:57:17] https://github.com/halfak/Revision-Scoring/pull/18
[23:00:54] Crap. I just pushed a bunch more changes than I planned to that pull request.
[23:01:57] Ugh. Now how do I roll this back?
[23:40:43] halfak: maybe the problem is in the params I used in SVC, I used the default kernel ('rbf'), when I try to use kernel='linear' it breaks the script, I don't know why
[23:41:14] danilo_, I'll have some changes pushed soon that I think will make this work easier. I have a linear kernel working.
[23:42:01] ok
[23:49:50] damn, today is dull
[23:53:32] danilo_, check out https://github.com/halfak/Revision-Scoring/tree/feature_work
[23:54:14] In that branch, I updated Scorer and how features work so that we can get better test statistics and so that we can pickle/unpickle models into files.
[23:54:23] See https://github.com/halfak/Revision-Scoring/blob/feature_work/demonstrate_scorer.py for a demo of running test() on the model
[23:54:50] I've got to run right now. I might swing by later tonight.
[23:54:54] See ya folks!
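Saving a trained model to a file and loading it back, as described above, comes down to a pickle round trip. A minimal sketch with the standard `pickle` module, assuming the model is just a dict of parameters; `score`, `save_model`, `load_model`, `round_trip` and `is_picklable` are hypothetical names and do not mirror the Scorer API in the feature_work branch. The `is_picklable` helper illustrates one general reason pickling can fail (the standard pickler cannot serialize lambdas); whether that matches the feature pickling error mentioned earlier in the log is not known.

```python
import os
import pickle
import tempfile


def score(model, features):
    """Toy linear score: bias plus the weighted sum of the features."""
    return model["bias"] + sum(w * x for w, x in zip(model["weights"], features))


def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    # Only unpickle files you trust: loading a pickle can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)


def round_trip(model):
    """Write the model to a temporary file, read it back, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)
    try:
        save_model(model, path)
        return load_model(path)
    finally:
        os.remove(path)


def is_picklable(obj):
    """Return True if the standard pickler can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    model = {"weights": [0.5, -1.0], "bias": 0.25}  # made-up parameters
    restored = round_trip(model)
    print(score(restored, [2.0, 1.0]))
    print(is_picklable(model), is_picklable(lambda x: x))
```

Keeping feature definitions as module-level functions or plain data, rather than lambdas or closures, is one way to keep a model object picklable.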