[00:22:00] filin' bugs. [00:22:54] terrrydactyl: *wave* I am performing the traditional new-to-a-codebase task: trying to get set up, finding bugs in the setup process, and filing them :) (in my case, wikimetrics) [00:38:14] sumanah: that's always fun. heh [00:40:04] have you found some? [00:42:34] terrrydactyl: indeed! [00:42:41] https://bugzilla.wikimedia.org/59846 [00:43:09] https://bugzilla.wikimedia.org/59843 [00:43:22] one more coming, I think [00:44:00] hmm, when i was doing the install, dan mentioned something about virtualenv not working very well. i'm not sure if it's changed or what specifically about virtualenv made it break. because i was using virtualenv when i had my clean install and decided to abandon it. [00:45:07] i also ran into some trouble with creating the database from scripts [00:45:07] :) [00:45:18] In my understanding, a lot of people (including me) really like using venvs to keep our dependencies separate [00:45:27] yeah, i think it's pretty nice to have that work [00:45:33] so I think we oughta make it work with venvs and I doubt it is a huge honking problem [00:45:34] it's definitely something that should be working [00:45:37] nod [00:46:16] hey halfak [00:46:24] i'm having trouble deciphering the last oauth change. have been waiting for dan to get less busy about it, but i might just send him an email. [00:46:25] we're going to run the module storage experiment for a week, right? [00:47:07] i'll ping you before disabling anything to confirm either way, just making sure we're on the same page roughly [00:49:24] sumanah: hmm, my version is working without running --mode queue [00:50:12] but then again, i'm not running virtualenv [00:50:46] terrrydactyl: if you run pip freeze, do you get a list of what version things are at? maybe you could pastebin that [00:51:17] also -- I dunno, maybe this is some setup-related thing, where you've been developing on it long enough that you have something config'd right, and I have not yet accidentally done that thing [00:51:37] i had a ton of problems getting my setup working [00:52:00] dan found a ton of bugs that were somewhat unique to my system since i'm running mac [00:52:14] hm [00:52:29] http://pastebin.com/WiEXtxeu [00:52:31] there might be something in MacOS saving you from my troubles [00:52:35] here's my pip freeze [00:52:43] what are you running? [00:53:06] cause it should work on linux machines [00:53:21] hmm, I have argparse and you do not [00:53:48] that shouldn't really affect it [00:54:37] http://tools.wmflabs.org/paste/view/849b068e [00:55:06] I have a higher version of billiard [00:55:09] * sumanah is checking difference [00:55:44] no flake8 or ipython, that should not make a difference [00:55:58] hmm, i wonder if my httplib2 is the wrong version and that's why my oauth isn't working [00:56:08] though that shouldn't be related to your problems. [00:56:55] interesting, I don't have oauthlib at all [00:57:27] or stevedore, pyzmq or tornado [00:57:50] some of them may be specific to other programs [00:57:51] I would be interested to see whether oauthlib and requests-oauthlib are making a difference for me [00:57:58] i'm not running a virtualenv right now [00:58:12] yeah [00:58:31] ok, there's nothing in the git tree that requires oauthlib, I did a git grep to confirm [00:58:36] maybe i can try installing a new instance with virtualenv and see if i come across the same problems [00:58:48] (of course if there were a dependency it should have shown up by now!
but I was just checking) [01:03:56] sumanah: did you follow the setup from here: https://github.com/wikimedia/analytics-wikimetrics ? [01:04:50] yep, tried to, terrrydactyl [01:04:56] (aaand https://bugzilla.wikimedia.org/59850 just filed) [01:05:15] terrrydactyl: average helped me out and we discovered (and Dan agreed) that the README there needs a lot of work [01:05:28] yeah it does [01:05:51] i was gonna edit a bit of it with my os x experience but never got around to it [01:06:19] go ahead :) [01:06:43] ori: Yes. One week. [01:08:51] dinner [01:08:53] see ya later! [01:14:48] halfak: cool, thanks [01:19:55] DarTar: Hey, two questions for you: Would it be preferable for me to use the e2 Limn instance for Multimedia things, and also do you remember if you sent me an example configuration for a bar graph in limn (it may have been milimetric, but he's out now)? [01:21:06] instance: yes that's totally fine [01:21:26] under enwiki-features? [01:21:27] or [project]-features [01:29:26] DarTar: Sorry, I didn't see this 'til now - I did mean to ask whether it was "preferable", not whether it was OK. If it's only OK, I would prefer our own instance. [01:29:59] And we don't separate based on project, we just get all the data [01:31:15] marktraceur: I think it fits in there, consider that at some point this year a big chunk of this data will be moved somewhere else, visualized with other libraries etc etc [01:31:40] DarTar: I find it weird because of the name I guess [01:31:48] e2- doesn't really apply to us [01:31:54] that's right [01:32:06] but there's a ton of stuff in there that has nothing to do with ee [01:32:09] Or ee- or whatever [01:32:21] Well, the fact that other people have done it wrong has never been good enough reason for me [01:32:48] on the flipside, reducing fragmentation of these dashboards is not a bad idea [01:32:57] right now it's impossible to know where to find what [01:33:04] unless you know about the magic page on Meta [01:33:15] http://meta.wikimedia.org/wiki/Research:Data/Dashboards [01:33:16] It shouldn't be terribly complicated, we'll put links from relevant places on mw.org probably [01:33:47] which assumes people already know where to find these pages, but yes [01:33:52] Well [01:33:54] Yeah. [01:34:20] The pages are discoverable from the viewer itself, which is discoverable from article pages, etc. etc. [01:34:20] bottom line is, I don't really care where we host this :) [01:34:26] 'kay, own instance it is [01:34:34] I just know that the current setup sucks and we'll have to replace it [01:34:40] I will set that up tomorrow, and hopefully include some bargraphs [01:34:49] cool [05:03:49] (PS6) Stefan.petrea: [ready for review] Added jumpstart [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/105893 [08:40:45] drdee: too late for the straw poll or not yet? [08:41:03] be quick do it now :) [08:41:10] aye! [08:41:13] thx [08:47:46] drdee: done, thanks! [08:48:00] thank you! [09:54:01] hey qchris, good morning [09:54:41] early bird :) [09:55:21] Hi average. 
[09:55:30] "early bird" hahaha :-) [09:55:46] :D [12:46:08] morning everyone [12:50:16] morning milimetric [12:50:41] hi [12:53:39] (CR) Milimetric: [C: 2 V: 2] Reactivate edits-monthly-5plus-editors graph [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105850 (owner: Jdlrobson) [13:13:40] (PS1) Gerrit Patch Uploader: change SERVER_HOST from 0.0.0.0 to localhost [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 [13:13:45] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 (owner: Gerrit Patch Uploader) [13:27:26] (CR) Milimetric: [C: -1] "Some minor nitpicks and one important unused variable." (10 comments) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851 (owner: Jdlrobson) [14:10:11] (PS2) Milimetric: Added user_name information for CSV output [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/104731 (owner: Terrrydactyl) [14:29:05] (CR) Milimetric: "This looks great but it fails three tests. Just run scripts/test from the root directory and you'll see the tests for the CSV output are " (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/104731 (owner: Terrrydactyl) [14:31:16] (CR) Milimetric: "Thanks Sumana, but the .orig file is not necessary. I like to make changes and rely on git in case I need history." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 (owner: Gerrit Patch Uploader) [14:36:52] (CR) Milimetric: [C: 2 V: 2] Extract repository looping into separate function [analytics/geowiki] - https://gerrit.wikimedia.org/r/103221 (owner: QChris) [14:37:10] (CR) Milimetric: [C: 2 V: 2] Split committing and pushing limn files into separate steps [analytics/geowiki] - https://gerrit.wikimedia.org/r/103222 (owner: QChris) [14:37:58] (CR) Milimetric: [C: 2 V: 2] When cleaning repo, remove untracked files [analytics/geowiki] - https://gerrit.wikimedia.org/r/103223 (owner: QChris) [14:38:26] (CR) Milimetric: [C: 2 V: 2] Switch data repository order when looping over them [analytics/geowiki] - https://gerrit.wikimedia.org/r/103224 (owner: QChris) [14:40:25] (CR) Nuria: [C: 2] "+2 per our conversation yesterday all looks good." [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105869 (owner: Jdlrobson) [14:46:11] (CR) Milimetric: [C: 2 V: 2] "I still think there's some confusion here but I'm merging so that I don't block you any more." [analytics/global-dev/dashboard-data] - https://gerrit.wikimedia.org/r/100737 (owner: QChris) [14:47:58] hello [14:48:39] hi [14:48:41] :) [14:49:06] yo yo [14:49:32] how are things going [14:49:38] I see the EL change is progressing nicely [14:50:28] yeah, i think so [14:50:39] it's a nice initiative, i'm glad that it's happening [14:50:54] indeed [14:51:08] hey nobody replied to your email last night about the client perf right? [14:52:57] this thing I mean: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=Navigation+Timing&hide-hf=false [14:53:19] i chatted about it with paravoid, it doesn't look like a persistent spike [14:54:32] i volunteered to do a metrics meeting lightning talk so i actually stayed up all night to look at the data in aggregate [14:54:38] it doesn't look terribly encouraging :/ [14:55:25] my enthusiasm for matplotlib is also, ahem, dampened -- what a maddeningly insane API! 
[15:00:25] nuria: i'd try to create an account on testwiki and to submit an edit via the mobile site [15:00:30] and then see what shows up in the table [15:00:51] ok, where is test wiki? [15:01:02] test.wikipedia.org :) [15:01:16] not to be confused with test2.wikipedia.org [15:01:21] or the beta cluster [15:04:55] ugh, from the code it appears that the API request is not annotated in any way to specify that it is a mobile web edit [15:17:38] ori: yeah, there's something in MF that does the tagging, IIRC [15:17:40] Jon mentioned all mobile edits are js api edits [15:25:24] well, so i did some poking [15:25:33] there's this change from july: https://gerrit.wikimedia.org/r/#/c/75802/1/javascripts/modules/editor/EditorApi.js [15:26:20] the 'useformat=mobile' query parameter is snuck in so that there's a way for the server to determine that this was a mobile edit [15:26:33] however, it's not in master [15:26:38] or rather, it is, but only for photo uploads [15:26:46] javascripts/modules/uploads/PhotoApi.js [15:26:46] 128: uploadUrl = apiUrl + '?useformat=mobile&r=' + Math.random(), [15:26:46] 157: // send useformat=mobile for sites where endpoint is a desktop url so that they are mobile edit tagged [15:45:15] ok i see, let's ask the mobile team about it [15:46:14] (PS1) Milimetric: Fixes Bug 59218 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106521 [15:46:43] (CR) Milimetric: [C: 2 V: 2] Fixes Bug 59218 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106521 (owner: Milimetric) [16:46:46] (CR) Sharihareswara: "Thanks, Dan. Weird - I think Gerrit Patch Uploader caught something strange in my paste..." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 (owner: Gerrit Patch Uploader) [17:43:42] DarTar: Got a sec? I have an idea for dealing with the old newly registered user problem. [17:43:52] hey [17:43:53] sure [17:44:07] We can define "newly registered user" with negatives. [17:44:27] So if the user doesn't have a recorded "autocreate" or "create2", then they are a "newly registered user" [17:45:05] This will (1) make analysis easier and (2) make the definition clear: we filter out non-new users as much as possible. [17:45:31] halfak, that's pretty much what I do in my query structure, too [17:45:42] remove the ones you know aren't new and let god sort out the rest, or something. [17:45:49] lol [17:46:02] I think that's going to work [17:46:15] is it worth checking with the gods of pre-SUL wiki? [17:46:31] We can communicate with these gods? [17:47:02] james_f is pretty knowledgeable about user registration archaeology [17:47:34] so he might be able to advise if there's something better that we can do other than just removing negatives [17:47:43] but I like to keep things simple [17:48:54] Agreed. Once I finish the Newly registered user --> New editor --> Productive new editor set, I want to spend some time with Attached user. [17:49:43] first record in the newusers log is 20050907221649 [17:49:55] Yeah. The first "autocreate" is in 2008. [17:50:08] The first "create2" is in 2006 [17:50:12] right [17:50:39] we should document all of this here: https://meta.wikimedia.org/wiki/Research:Attached_user [17:54:25] Agreed. I plan to document it through some data visualizations from major wikis. [17:54:33] sweet [17:54:40] In the meantime, I'll get our notes dumped there. [17:55:54] Ack. It looks like the query on that page is wrong. log_action = "autocreate" are attached users. 
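halfak's "define by negatives" idea maps straight onto the newusers log. A minimal sketch of such a filter, assuming the standard MediaWiki logging schema (log_type = 'newusers', with log_action values such as create, create2, autocreate, and byemail); whether byemail should also be excluded is exactly what gets debated below:

    -- Newly registered users, defined by negatives: keep newusers-log rows
    -- whose action is NOT a known non-new registration type.
    -- (Adding 'byemail' to the exclusion list is an open question below.)
    SELECT log_user, log_title, log_timestamp
    FROM logging
    WHERE log_type = 'newusers'
      AND log_action NOT IN ('autocreate', 'create2');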
[17:56:14] I believe that "create2" are users created by another registered user for their own use. [17:56:24] Do we have a name for those types of users? [17:58:00] good catch [17:58:01] If not, I propose "Alternate users". [17:58:08] yes I called them "registered by proxy" [17:58:14] or proxy registrations [17:58:23] not particularly attached to the term [17:58:25] So "byemail" is a good proxy registration. [17:58:47] (dropped a line to james/csteipp in the meantime) [17:59:20] MW defines them as "another user account is created by an existing user." [17:59:25] mw.org that is [17:59:43] Do you think that we should skip byemail users in "Newly registered user"? [18:01:16] hm yes I think we should consider them proxy registered [18:01:36] Why? [18:02:11] It seems that anyone requesting an account be created for them via email would be a real new editor. [18:02:46] because the account creation involves a third party? I don't know how this is used in practice, for example if users routinely batch create accounts by email for something other than genuine account creation requests [18:03:14] oops, we're supposed to talk UA, aren't we? [18:03:19] I should move to some conf room [18:14:50] (PS2) Milimetric: change SERVER_HOST from 0.0.0.0 to localhost [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 (owner: Gerrit Patch Uploader) [18:15:13] (CR) Milimetric: [C: 2 V: 2] change SERVER_HOST from 0.0.0.0 to localhost [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106509 (owner: Gerrit Patch Uploader) [18:16:11] (PS1) Milimetric: Improves readme and db creation script [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106546 [18:16:22] (CR) Milimetric: [C: 2 V: 2] Improves readme and db creation script [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106546 (owner: Milimetric) [18:46:09] (CR) Jdlrobson: Story 1481: Collect graph data only for current month (10 comments) [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851 (owner: Jdlrobson) [18:46:53] (PS3) Jdlrobson: Story 1481: Collect graph data only for current month [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851 [20:02:34] (PS4) Milimetric: Story 1481: Collect graph data only for current month [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851 (owner: Jdlrobson) [20:03:07] tnegrin, JFYI, Rohit applied for a couple of positions [20:03:44] (CR) Milimetric: [C: 1] "The proposed way of injecting from_timestamp and to_timestamp was failing, so I updated it. Everything else looks good." [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/105851 (owner: Jdlrobson) [20:33:05] phew [20:33:06] meetings! [20:33:07] over [20:33:08] qchris [20:33:12] it is 3:30 [20:33:12] hmmmmm [20:33:17] nginx deployment? hmmm [20:33:23] Are you still in the mood? [20:33:31] :-) [20:33:34] I am [20:33:41] Let's break the site! [20:35:11] it really is too bad that there are nginx logs in the webrequest stream at all! [20:35:33] Btw ... how was your demo? [20:35:42] (Youtube did not give me the stream as it was not done yet) [20:35:46] Mhmmm [20:35:47] was good! [20:35:58] Do we want to remove nginx from the logs? [20:36:23] I thought you mentioned that they were [20:36:23] in before [20:36:23] then removed [20:36:23] then came back in? [20:36:35] i don't want to deal with it :p [20:36:48] they shouldn't be in the logs, as they are duplicate reqs :/ [20:36:55] but there was some reason that we had to leave them in [20:36:57] Glad to hear that the demo went well.
I totally will watch it now that you're done filming (Hoping for concluding jokes in there :-D) [20:37:04] haha [20:37:05] no jokes [20:37:05] haha [20:37:12] No jokes :-((( [20:38:43] I'd also prefer to not have ssl requests in twice. [20:38:43] Having them in only once would make counting easier ... but problem detection harder. [20:39:29] hmmmm [20:39:33] you know, can we deploy this tomorrow? [20:39:35] i don't think it will break anything [20:39:43] but i'd rather not do it this close to the end of my work day? [20:39:48] I guess it'll be monday then. [20:39:51] oh? [20:39:55] But yes. you are right. [20:39:57] you aren't working tomorrow? [20:40:03] Let's not do it at the end of your workday. [20:40:25] I am ... but only while you are still sleeping [20:40:28] I'll have to say hello to a newborn tomorrow [20:40:39] ohhhh [20:40:40] hmm, ok [20:40:42] monday ok? [20:40:46] how urgent is it [20:40:52] Community wants it. [20:41:00] Since ages ... [20:41:08] So I hope they can wait some more days. [20:41:14] ok [20:41:17] we will wait [20:41:25] Problem is there since at least May IIRC [20:41:28] Ok. [21:12:07] ah yay! milimetric! [21:12:30] so, the hive json serdes i've been using is one that is provided by cloudera and uses the Jackson json lib [21:12:41] i just tried the one on code.google.com [21:12:47] and it works with the ints and floats [21:13:04] not sure how it performs compared to jackson though [21:13:22] it uses the standard java json lib [21:14:24] ah but it doesn't have write serialization support [21:14:24] hmmm [21:20:48] hello [21:21:07] ottomata: nice presentation! [21:23:34] hi ori! [21:23:35] thanks! [21:23:36] you too [21:28:34] thanks! [21:33:54] qchris, ottomata: http://lists.wikimedia.org/pipermail/analytics/2013-May/000633.html that's relevant for your discussion about nginx duplicate logs :) [21:35:43] tnegrin, you in the office today? [21:35:52] eventually [21:36:20] okay ;p. I guess, all I really want to know is; what's the best way I could contribute to the filtering/categorising of RLs in varnishkafka? [21:36:39] contributing to the codebase, listing all the weird idiosyncrasies I found so that the people contributing to the codebase adapt to them...? [21:36:51] it's a great question and I totally appreciate it. [21:37:11] I was just about to leave -- can we continue this discussion f2f in 30 mins? [21:38:35] assuming I have not left for tabletop gaming and booze, totally [21:45:50] so ottomata, we need the cloudera one to support camus? [21:46:11] or to support writing to other json-backed tables? [21:46:22] not sure I understand why we need write serialization [21:47:04] DarTar: I updated my analysis of "New editor" with data from dewiki and ptwiki. Enwiki is still coming. https://meta.wikimedia.org/wiki/Research:New_editor#Analysis [21:47:26] thank you! Checking it out in a moment [21:47:51] Thanks drdee! [21:47:53] m, nope [21:47:56] we can use whatever we want [21:47:57] * qchris is reading... [21:48:08] Oh drdee you're gone already :-( [21:48:16] if we wanted to insert into a table like this in json format, we couldn't use the other one [21:48:22] btw I forgot to add one thing to our discussion on hashing that might be worth adding to the proposal, I'll ping you later [21:48:29] also, isn't jackson supposed to be super speedy? [21:48:30] dunno [21:48:40] yeah, supposedly jackson is faster [21:48:42] drdee, you once said something like that, right?
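For context on the serde back-and-forth: in Hive, a serde is just the ROW FORMAT class a table declares, so swapping the Cloudera/Jackson serde for the code.google one (or the HCatalog serde that gets picked further down) is a one-line change in the DDL. A rough sketch with hypothetical table, column, and path names; note the HCatalog class name depends on the release (org.apache.hcatalog.data.JsonSerDe in CDH4-era packages, org.apache.hive.hcatalog.data.JsonSerDe after HCatalog merged into Hive):

    -- A Hive table backed by a JSON serde: each line of the underlying
    -- files is one JSON object, and the serde maps its fields to columns.
    -- Table, column, and path names here are hypothetical.
    CREATE EXTERNAL TABLE webrequest_json (
      hostname      STRING,
      dt            STRING,
      http_status   INT,     -- the int/float handling that differs between serdes
      response_size BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hcatalog.data.JsonSerDe'
    LOCATION '/wmf/data/external/webrequest_json';

The write-serialization question matters only if you INSERT into such a table rather than just query it: reads go through the serde's deserializer, writes through its serializer, which is the part the code.google serde reportedly lacks.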
[21:48:43] yeah [21:48:46] i'm reading hive docs [21:48:52] i think i understand how we could patch it and fix the bug [21:49:00] this cloudera one seems much more complete [21:49:04] wouldn't be hard i think [21:49:11] i mean, patching it sounds like the way to go [21:49:14] is it actively developed? [21:49:29] like will there be some people glad of the patch and able to merge? [21:49:32] i don't think so [21:49:40] maybe I should ask around [21:49:44] in hive mailing lists or something [21:49:45] ghmm [21:49:45] hm [21:49:47] or cloudera [21:49:48] hm [21:52:41] oo thread [21:52:41] https://groups.google.com/a/cloudera.org/forum/#!searchin/cdh-user/hive$20json/cdh-user/4iXuZyIb_d8/fY0PQbAmqacJ [22:27:19] halfak: excellent stuff on new editor, thanks for updating the page [22:27:33] No problem. Did you see the comparison section? [22:27:37] should we cherry pick a couple of projects to run similar analyses? [22:27:39] yes [22:27:53] the list I chose for Active Editors is somewhat arbitrary [22:28:03] but I picked the largest wikis either by content or population [22:28:08] :) I'm just finishing up an edit that explains how to interpret the graphs. [22:28:28] k waiting [22:28:33] Yeah. I'd like my list to be bigger. It turns out that generating this data is waking up springle with replag notifications [22:29:21] halfak: that was the index creation, not any query you were running, if i understood correctly [22:30:12] halfak: mailing you something I used before [22:31:33] ori: good point, but the index was for this query. :) [22:31:58] halfak: csv in your mailbox [22:32:22] Thanks [22:33:03] we still need to specify on that page that this is a U with archive [22:33:41] it's another departure from the historical way of calculating these numbers [22:34:03] adding that to Discussion [22:35:02] Hmmm this should be part of the definition I think. [22:36:57] yes but we need to call that out in the discussion too, hang on [22:37:46] it's already in the sample query but not yet in the definition itself [22:38:20] wondering if that should go into the R:Edit page cross-linked from the definition instead [22:38:45] https://meta.wikimedia.org/w/index.php?title=Research%3ANew_editor&diff=7020453&oldid=7020441 [22:39:32] I've got that prose added to the comparison. [22:40:04] Oh! that discussion. Good spot. [22:41:32] extra prose looks great [22:42:29] It's interesting to look at that third plot. I'm excited to see how it looks for enwiki. [22:43:39] The plot I'm referring to: https://meta.wikimedia.org/wiki/File:Wiki_metrics.comparison.n.svg [22:43:43] brb [22:43:45] kk [22:47:43] agreed [22:48:02] so how do you feel about: [22:49:26] - formally qualifying an edit in R:Edit as any change to any ns with a unique rev_id from the union of revision and archive (including redacted revs) [22:49:57] - dealing with limitations to this definition downstream (for example when defining a productive edit) [22:50:17] or a content edit [22:50:19] AH HA! [22:50:21] milimetric: [22:50:29] :) [22:50:33] https://hive.apache.org/hcatalog/ [22:50:41] ah sorry, i gotta run an interview [22:50:43] halfak: ^^ [22:50:47] hehe ok [22:50:52] will look after [22:51:17] many revisions in archive don't have a rev_id, sadly. [22:51:22] DarTar: ^ [22:51:38] wait [22:51:39] anyway, it comes with a much more complete json serde [22:51:42] But I like the idea. [22:51:44] and is included with cdh4 [22:51:47] so we can just use that [22:51:48] done! [22:51:54] like what revisions?
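A minimal sketch of the union-based edit count DarTar just proposed, assuming the standard MediaWiki revision/archive schema; since, as halfak notes, many archive rows have no rev_id (the pre-2008 cases explained just below), the deleted half counts ar_id instead, which is the workaround halfak describes later in the log:

    -- Edits per user across both live and deleted revisions.
    -- Archive rows deleted before ~2008 have a NULL ar_rev_id, so deleted
    -- revisions are counted by ar_id rather than by revision ID.
    SELECT user_id, COUNT(*) AS edits
    FROM (
      SELECT rev_user AS user_id, rev_id AS edit_id FROM revision
      UNION ALL
      SELECT ar_user AS user_id, ar_id AS edit_id FROM archive
    ) all_edits
    GROUP BY user_id;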
[22:52:00] I think everything before 2008 [22:52:10] lol old craziness. [22:52:34] Or rather, revisions to pages that were archived before 2008. [22:52:58] Pages that were archived after 2008, but had revisions that were saved before 2008, will still have ids. [22:53:07] Now try and imagine how much hair I lost on that one. [22:53:11] sweet jesus [22:56:50] that really makes me want to focus on truncated series [22:58:15] halfak: and that's not even including edits that are imported across wikis ;) [22:58:30] you know what Special:Import can do, right? [22:58:34] Oh yeah. I don't know how to think about those yet. [22:58:50] Yes. I'd love a nice writeup on how it works and what to expect in the DB [23:00:15] I suspect that Special:Import is the reason for most of the out-of-order craziness in the XML dumps. [23:01:48] I've never looked into the cross-wiki import logs, we should totally do it to get a sense of volume [23:01:58] https://meta.wikimedia.org/wiki/Research:Edit#Special_cases [23:01:59] ? [23:02:31] yessss this works awesome! [23:02:38] great great great, ok will make that serde fix tomorrow [23:02:39] laters all! [23:05:18] halfak: so back to your pre-2008 case, what happens exactly to these records in archive? [23:05:27] and also ^^ [23:06:25] DarTar, I can think of a special case for edits you may want to take into account [23:06:41] initialising LQT appears as a revision, I think (think. Do not precisely recall.) [23:07:03] in revision or rc? [23:07:13] if it's rc only I don't care [23:07:27] if it's revision, I guess it depends what "initializing" means [23:08:58] cool ottomata, that looks awesome [23:11:19] DarTar: turning something previously a talkpage into a LQT page [23:11:23] and, revision, I believe [23:11:37] but you should probably check with werdna or someone more familiar with the system [23:11:40] do you have an example handy that I could look up? [23:14:12] I can try and find one... [23:14:48] DarTar, https://www.mediawiki.org/wiki/Extension:LiquidThreads#User_documentation [23:14:54] so, yeah, it'd be a revision [23:15:36] hm [23:37:13] Hey DarTar: Sorry I missed you. [23:37:31] We also get a revision when someone moves a page. It's a noop and the comment explains the move. [23:37:56] As for the archive, it appears that everything is intact except for the revision ID. [23:38:12] So, when I count them, I count the "ar_id" instead. [23:39:17] right about moves, ar_id: ok, I thought you were suggesting that records would be missing in archive [23:40:59] Ahh. No. Sorry. It's just that they have a NULL rev_id. [23:42:53] DarTar, this is an awesome CSV [23:43:12] halfak: :p ? [23:43:34] I assume so [23:43:40] re: the date field? [23:43:41] halfak endorsing CSV as a storage format? [23:43:43] Not at all. I want to create some CSVs that contain a metric for ~100 wikis. [23:43:44] TSV, JSON or bust! [23:43:55] ^^ [23:43:59] ha, yeah [23:44:26] Ironholds started a CSV suppression squad a while ago [23:44:43] I'm in. [23:44:44] no comma will survive [23:45:04] Why do we separate values with a comma when THAT'S WHAT TAB WAS DESIGNED FOR? [23:45:06] TSV _and_ quoting. [23:45:12] the quoting is important [23:45:20] and will remain important until morons stop storing tabs in their input data. [23:45:23] Hmm... I'd rather escape than quote. [23:45:30] but I could go either way.
[23:45:37] escaping would be nice [23:45:49] the problem is escaping when it's \tuser\tagent [23:45:58] you end up accidentally escaping the separators ;p [23:46:07] I just mimic the MySQL format because the MySQL TSV format is not configurable and everything else is. [23:46:18] Therefore, NULL --> \N and TAB --> \t [23:46:22] yeah, which is nice [23:46:32] the problem is things like the sampled RLs where I don't get to control input format :( [23:46:38] (curse you, request logs. Curse you.) [23:46:39] yeah, let's just not blame poor tabs for what HTTP headers do [23:46:46] haha [23:46:53] that's fair. [23:46:56] but we can totally blame commas. [23:47:01] OH YES [23:47:18] commas aside, [23:47:22] groan [23:47:26] :D [23:48:34] other than comparing content vs all, values of t and n, what do we want to do with New Wikipedians? [23:49:00] Yup. I'm working on that right now. [23:49:04] if we don't think that measures anything of interest, we should just stop doing comparative analyses of what tracks them [23:49:05] (PS1) Gerrit Patch Uploader: add contact information [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106634 [23:49:06] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106634 (owner: Gerrit Patch Uploader) [23:49:48] Though I think that we would be better off comparing New Wikipedian with Surviving new editor. [23:50:16] Because I think that New Wikipedian is a better measure of survival than of activation [23:58:22] (PS1) Gerrit Patch Uploader: Link to correct Bugzilla bug entry form [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106636 [23:58:27] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/106636 (owner: Gerrit Patch Uploader)
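On the TSV thread that closes the log: the "not configurable" MySQL format halfak mimics is what SELECT ... INTO OUTFILE emits with no export options at all: tab-separated fields, backslash escaping, NULL rendered as \N, and embedded tabs as \t, so no quoting is needed. A minimal sketch (hypothetical table and output path; the file is written by the server, so mysqld needs write access to that location):

    -- MySQL's default export format is exactly the dialect discussed above:
    -- fields terminated by tab and escaped by backslash, lines terminated
    -- by newline; NULL comes out as \N and a literal tab as \t.
    SELECT user_id, user_name, user_registration
    INTO OUTFILE '/tmp/users.tsv'
    FROM user;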