[00:05:07] DarTar, shall we meet?
[00:05:17] in Hangout
[00:09:02] leila: sure
[00:09:14] I'm in the event
[12:55:17] Good morning Science people.
[14:33:39] technical difficulties
[16:01:29] halfak, interesting problem for you.
[16:01:36] Wussup?
[16:02:00] calculating the length of a session. So, sum(intertimes) + sum(intertimes)/length(intertimes) to factor in time spent on the last page, which isn't represented by an inter-time value.
[16:02:31] ...what happens if there's only one intertime value, and that's greater than the [3600/1800/whatever] second threshold?
[16:02:36] (can you tell I'm writing unit tests?)
[16:02:53] So, I use a global estimate of intertime values.
[16:03:13] https://meta.wikimedia.org/wiki/Research:Estimating_session_duration
[16:03:13] hmn; explain?
[16:03:19] *looks*
[16:03:49] oh, interesting!
[16:04:20] It's not great, but given the strong regularities in behavior between users, it should be pretty good -- especially when summing labor hours for many users.
[16:04:45] oh, totally. But it's predicated on having timestamps rather than intertimes, yes?
[16:06:45] hmn. Although you could do it with intertimes.
[16:07:01] I mean, minus the 430s thing, last_stamp - first_stamp == sum(intertimes)
[16:07:27] mean(intertimes) would work just fine.
[16:07:57] as a replacement for 430s? yeah, when it's available
[16:08:28] Seems like we wouldn't necessarily want it to change based on a user-session.
[16:08:44] Maybe we would rather change it based on a user's full history.
[16:08:55] yeah. Which would be difficult to do consistently.
[16:08:55] Then, we'd want to change it again as new data came in.
[16:08:57] Agreed.
[16:09:08] I'd say, let's have a parameter called... average_intertime, with a default value of 430
[16:09:21] I think this might be a good thing to leave to future work (aka people who like to twiddle with complexity that no one will use)
[16:09:22] users can tweak it if they want, or it'll be tweaked with new discoveries that lead to new code releases.
[16:09:36] +1
[16:09:37] The problem I'm trying to headscratch over is what to do in the scenario where, say, there's one intertime event, and it's 3900 seconds.
[16:09:56] that's two sessions represented there. Both of average_intertime?
[16:09:58] I guess so.
[16:10:02] Oh. Two sessions. Each one took ~430 seconds.
[16:10:08] * Ironholds high-fives
[16:10:16] rubber ducking!
[16:10:26] on that note, if someone at the office could send me my tiny Schopenhauer.
[16:10:43] Hey! I have proper organs and no squeaker, thank you very much.
[16:11:16] you have two squeakers!
[16:11:21] They run through your house and smell like honey
[16:11:35] (how on earth did you get ferrets to smell like honey? They're meant to be stinky beasties)
[16:12:08] I'm glad that you don't think they are stinky. I hear the same from others. I'm not sure why I lucked out on non-smelly ferrets.
[16:12:37] They don't get baths unless they got into the trash -- which is something that ferrets don't really grow out of.
[16:13:09] "ooh, stuff nobody wants! I WONDER WHAT'S THERE"
[16:13:34] Must smell amazing to weasel brain.
[16:14:10] Their search algorithms' heuristics are highly weighted toward smelly things.
[16:16:22] how do they feel about good cheese?
[16:17:04] Negative. Also picky eaters. However, there might be something shiny or sweet underneath that cheese.
[16:17:19] ahah
[16:17:25] better open the package and drag it across the kitchen just to make sure.
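(A minimal sketch in Python of the session-length estimate discussed above at 16:02-16:10, assuming intertimes are given in seconds. The 3600s break threshold and the 430s average_intertime default are the values from the conversation; the function name is hypothetical, not the actual library code.)

    # Minimal sketch, not the real implementation: intertimes at or above
    # the cutoff are treated as session breaks; each session's length is
    # the sum of its within-session intertimes plus one average_intertime
    # to stand in for the unmeasured time spent on the last page.
    CUTOFF = 3600            # session break threshold (seconds), per the chat
    AVERAGE_INTERTIME = 430  # global estimate; the tweakable parameter above

    def session_lengths(intertimes, average_intertime=AVERAGE_INTERTIME,
                        cutoff=CUTOFF):
        sessions, current = [], []
        for it in intertimes:
            if it >= cutoff:
                # Close out the current session. A lone over-threshold
                # intertime (e.g. the 3900s case) therefore yields two
                # sessions of ~average_intertime each.
                sessions.append(sum(current) + average_intertime)
                current = []
            else:
                current.append(it)
        sessions.append(sum(current) + average_intertime)
        return sessions

    # session_lengths([3900]) == [430, 430] -- the unit-test case above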
[16:17:40] the saga of Jessica's Protein Bars
[16:17:46] she finally let us forget about that round about Cleveland.
[16:18:02] heh. I think I may have found the remains of one of them.
[16:18:38] hah!
[16:18:58] * halfak kicks off the aggregating query for measures of importance.
[16:19:12] inlink counts vs. page view rate vs. WikiProject classification
[16:19:30] oooh
[16:19:43] *with redirects resolved*
[16:19:49] :)
[16:20:12] If this works for enwiki, we can probably just apply it broadly across other languages.
[16:33:25] halfak, standup!
[16:33:58] thanks!
[16:40:26] halfak, who IS the new person?
[16:40:41] Isn't that the Product analyst guy?
[16:40:45] Jon Katz
[16:40:47] or something
[16:41:37] See "[Wmfall] Welcome Jon Katz, Product Manager"
[16:41:45] Bah.
[16:41:51] "[Wmfall] Welcome Jon Katz, Product Manager"
[16:50:07] yup! Now worked out
[17:03:15] wait, I'm an IDIOT.
[17:03:18] * Ironholds headdesks
[17:03:30] wow, finally he found out by himself...
[17:03:41] heh
[17:03:47] o/ YuviPanda
[17:03:54] hi halfak
[17:03:56] * YuviPanda is in the Himalayas
[17:04:11] also I broke a wiki today!
[17:04:13] * legoktm quips
[17:04:26] YuviPanda, shame!
[17:04:40] Getting some hicking in?
[17:04:44] *hiking
[17:04:59] halfak: I tried walking about a km, had to sit and recover for an hour...
[17:05:01] so maybe not
[17:05:20] and then I broke wikitech, then unbroke it
[17:05:22] Altitude or athleticism?
[17:05:22] good times.
[17:05:49] latter
[17:05:55] I ostensibly have 'smaller lungs' than I should have
[17:05:58] born with it, etc..
[17:06:10] plus I'm a slob
[17:06:35] * Ironholds slaps YuviPanda
[17:06:49] also, I shouldn't be allowed to listen to pop music, I end up having a running commentary
[17:07:09] gave up on Jessie J after the exchange "it's like a thousand degrees" "wait, Celsius or Kelvin?"
[17:11:09] :P Celsius or Kelvin around scales of 1000 doesn't really matter when human flesh is involved.
[17:17:44] halfak, yes, but it's important!
[17:17:49] and no Fahrenheit kthx
[17:17:54] I want a scale that measures something usefully
[17:18:43] (1000F|1000C|1000K) = shits on fire
[17:18:54] *shit's
[17:19:51] I dunno. What's the combustible temperature of bone?
[17:20:08] I seem to recall crematoria tend to run at between 700-1000C.
[17:20:15] 1000K is /barely/ in that range.
[17:20:56] ... you realize whether your bones burn or not is the least of your concerns if you should happen to enter a 1000 degree situation, right?
[17:21:17] I'm with halfak on this, +1 to specification of units being irrelevant
[17:21:23] This conversation is so awesome
[17:21:25] :)
[17:21:42] milimetric, it's totally relevant!
[17:21:51] if I'm gonna burn to death I want whoever created that situation to be held responsible
[17:22:06] in terrible non-bone-burning-related news: Ironholds I just dropped by to check what you need in terms of graphing
[17:22:07] more evidence == increased chance of that.
[17:22:18] if they don't even have my /skeleton/, oy.
[17:22:41] i see - but doesn't DNA survive regardless?
[17:23:10] actually, no
[17:23:14] the Tm for DNA is 60C
[17:23:24] oh :( indeed - no: http://www.exploredna.co.uk/dna-test-after-death.html
[17:23:33] (that's: the point at which 50% of the DNA fuses or dissociates)
[17:23:39] (it's like LD50 but for genetics!)
[17:23:56] so if you give me some graphing requirements on top of this compelling new argument, I might remove my +1
[17:24:22] haha
[17:24:41] milimetric, so, I have a TSV of four fields: YYYY-MM-DD, device_type, access_method, count
[17:24:56] fantastic, and who's it for?
[17:25:00] apps
[17:25:17] In an ideal world I'd visualise it as a literal grid. Outside of that... I'm not entirely sure how to do it efficiently.
[17:25:19] ok, and did you stick it in the rsync-public folder place?
[17:25:22] I guess I could set up a knitr HTML document
[17:25:29] "grid"? as in, table?
[17:25:40] yeah, SAVE_FILE = '/a/public-datasets/readership/traffic_grid.tsv'
[17:25:41] yep
[17:25:50] (no data there yet, I don't think. Background run is still going.)
[17:25:59] i'll check
[17:26:47] right, not there
[17:26:51] * Ironholds nods
[17:26:54] ok, so apps has a limn dashboard
[17:27:16] cool!
[17:27:25] and if you point a limn dashboard at a datafile, it makes a graph
[17:27:50] like literally: graph_ids is an array, and they can add 'http://path/to/file.tsv'
[17:28:11] from that, you get the "show data as table" button
[17:28:15] would that be enough?
[17:29:26] Ironholds: ^
[17:29:36] oooh
[17:29:44] that would work well!
[17:29:45] or... hmn.
[17:29:46] * Ironholds thinks
[17:30:07] okay, this is gonna sound crazy, but here goes
[17:30:28] what if I modified it to write data to one file, in a tabular format, in such a way that it could be referred back to - appending each time - and just kept that to hand
[17:31:01] and then had a second file which contains [latest day's results] in a visually comprehensible form, and is overwritten each time
[17:31:06] please explain "just kept that to hand"
[17:31:23] stuck it in public-datasets but didn't visualise/load by default
[17:32:00] ah, crap, sorry Ironholds, I forgot to look closely at your file format above
[17:32:21] not your fault!
[17:32:23] so limn needs metrics, it can't aggregate rows
[17:32:33] so you'd have to have something like:
[17:32:35] yeh, that was my thinking for the first_file/second_file approach
[17:32:44] What they have asked for is not easy to store in a TSV for multiple views
[17:32:58] date, count-for-device-access-type-combination-1, count-for-device-access-type-combination-2, etc.
[17:33:03] yeah
[17:33:15] that was my suspicion.
[17:33:15] or like date, count-for-device-access-1, count-for-access-type-1
[17:33:35] where what we actually want is a set of views by [date], where each one is access_type by device
[17:33:35] we can easily do that with a custom dataset but if it's easy enough for you, that's better
[17:33:42] halfak: I'm in the hangout
[17:34:08] okay. So, how's this sound: two files, one storing as variable_permutation/value, for storage and long-term referring to
[17:34:17] and one stored as a literal grid, which is overwritten with each run
[17:34:23] we visualise the second one and make THAT accessible as a table
[17:34:31] the first one is just so we have a history and can look at long-term trends.
[17:35:03] it seems like the only real way to display what I'm being asked for, and it's not a massive pain, so it works for me iff it doesn't sound mad to you?
[17:38:54] * halfak curses DarTar's battery
[17:41:00] Ironholds: that doesn't sound mad to me but the second file you're talking about, the one you'll visualize, are you saying you'd restrict that to just the last day? Or to a small time period? It matters if the data is not time-series
[17:41:35] just the last day, yeah. The "visualisation" wouldn't really matter; it'd just be the table display
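(A rough sketch of the two-file scheme Ironholds describes above, in Python with pandas purely for illustration; only the traffic_grid.tsv path is from the conversation, while the history path and function name are assumptions. The long-format file accumulates date/device_type/access_method/count rows for long-term trends, and the grid file is rewritten with the latest day pivoted into the literal device-by-access grid.)

    import pandas as pd

    LONG_FILE = "/a/public-datasets/readership/traffic_long.tsv"  # hypothetical
    GRID_FILE = "/a/public-datasets/readership/traffic_grid.tsv"  # from the chat

    def update_outputs(day_rows):
        # day_rows: DataFrame with date, device_type, access_method, count.
        # 1. Append the day's rows to the long-format history file
        #    (assumed to already exist with a header row).
        day_rows.to_csv(LONG_FILE, sep="\t", index=False, mode="a",
                        header=False)
        # 2. Overwrite the grid file with the latest day as a
        #    device_type x access_method table.
        latest = day_rows[day_rows["date"] == day_rows["date"].max()]
        grid = latest.pivot_table(index="device_type",
                                  columns="access_method",
                                  values="count", aggfunc="sum",
                                  fill_value=0)
        grid.to_csv(GRID_FILE, sep="\t")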
The "visualisation" wouldn't really matter; it'd just be the table display [17:41:42] I mean: that's the important thing [17:41:58] that or I use knitr to construct a HTML representation of the matrix and stick it up somewhere, which sounds fun but would leave it kind of detached [17:47:19] Ironholds: if it's not timeseries, it would just be a bar chart, which I could do but it'll take a minute longer (gotta make the graph by hand) [17:47:32] that's fine, no worries, if the last day is the only thing that matters [17:48:01] but normally people care to see that data over time - any reason this is different? [17:48:22] the way they've asked for this data is impossible to usefully linearly visualise without splitting it amongst multiple graphs [17:48:30] they want it as a grid. [17:51:07] Ironholds: "visualized" as a grid is a bit of a contradiction in my opinion. It sounds like they just want the TSV. Any reason their respective TSV / CSV readers are not enough for them? [17:51:19] fair point; I'll ask [17:51:22] LibreOffice / Mac FancyPantsExcel whatever... [17:51:28] Deskana|Away, ping me when you're back? [18:26:11] leila, re due date for the traffic assignment; for tracking reasons, or are you busy right now? [18:26:23] * halfak tries to load 3.7 million rows into R. [18:27:15] 225MB of text data = how much memory usage [18:27:25] Time to find out. [18:27:32] Ironholds, for day-to-day prioritization [18:29:06] leila, gotcha! [18:29:12] halfak, oh, 3.7? that's nothing. [18:29:21] you're fine [18:31:28] (Ironholds, I need it to figure out should I drop everything I'm working on right now and do it now, or it can wait, and if the latter, for how long. :-) ) [18:31:50] so the answer was "both"! :D [18:32:22] yes, Ironholds. :D [18:32:29] Indeed. It was nothing. :) [18:32:39] but my question for you is: is there any time you're not busy here? :D [18:32:52] cuz I really need to talk to you if that's the case [18:33:08] I'm not busy now! [18:33:19] well, in 30s so I can have a smoke [18:33:36] I mean, serious time interval you can commit to something [18:33:36] :d [18:36:43] an hour? [18:38:23] leila, ^ [18:41:38] for example [18:41:56] well, I'm around for an hour now! :D [18:42:38] If leila isn't making use of Ironholds' hour, I want it. [18:43:04] hahah [18:43:10] halfak, https://github.com/Ironholds/EveryDayImSessioning [18:43:18] I guess we're diverging. You know what I meant, I guess. [18:43:33] leila, yeah. So: I've got an hour free now if you have something you're working on you need assistance with. [18:44:21] thanks, Ironholds. for now I just need to focus and do things, not blocked specifically by anything. thanks for the offer though. [18:44:26] Ironholds, yes./ [18:44:31] So much yes [18:46:02] leila, okay! [18:46:09] halfak, want me to grab editor data as well while I'm here? [18:46:36] "editor data"? [18:46:42] Session stuff? [18:48:00] halfak, indeed. Like, if I'm grabbing uuid-timestamp pairs for app readers, and mobile readers, and desktop readers, and search events... [18:48:09] ...want me to just write a handler for editors as well and do it all in one fell swoop? [18:48:18] Oh yeah. If you could grab them for editors, that would be great :) [18:50:12] totally! [18:51:11] halfak, and what sorta date range do you want? Like, random sample over a week, random sample over a month... [18:51:25] Ironholds: Greetings. [18:51:31] Deskana, the grid of device_class versus method_of_access [18:51:45] would you lot be okay with just a daily-updated .tsv? 
[18:52:24] Ironholds: Sure, that works.
[18:52:51] So just a 3x3 matrix in .tsv format each day?
[18:52:53] That's fine.
[18:53:15] cool!
[18:53:20] Ironholds, random sample over the biggest time window you can manage.
[18:53:24] I'll be storing things in a more-storable-but-less-readable format elsewhere
[18:53:31] halfak, 30 days!
[18:53:42] too small for editors
[18:53:45] and I'll explicitly note the time windows in the queries so we can refer back
[18:53:49] * Ironholds is thinking like a paper-writer
[18:53:57] great :)
[18:53:59] editors I'll go broader, then.
[18:54:07] that doesn't need a distinctly stored query, so is easier.
[18:57:58] halfak, one final question: am I overcomplicating it if I retrieve "Mobile web edits" and "desktop edits" as distinct groups of edits? I suspect so, but...
[19:00:04] Hmm... That's a good question.
[19:00:38] Presumably, we wouldn't see that many sessions that cross devices.
[19:01:01] indeed. Although I can imagine that around /editing/ it's a possibility
[19:01:22] example thought process: "oh, I can totally tweak this article. tweaky tweaky tweak. Ech, this is kind of a pain, with the small screen. I'll load up my laptop"
[19:01:34] I can't see a desktop -> mobile transition but I can see mobile -> desktop transitions happening.
[19:01:35] +1
[19:01:46] so treating them as one dataset may be a better idea
[19:02:17] I think so. We can label intertime values by whether they were a transition or not and look at some stats about how often it happens and if it tends to represent a session break.
[19:05:41] halfak, ah, so you'd like a type field as well?
[19:05:49] mobile/desktop, so we can add that label
[19:06:22] I think so. I haven't thought through the analysis of it yet, but it seems like we could explore it a bit.
[19:06:42] Also, it seems to me that the label would be easy. If that's wrong, then I'd drop it.
[19:08:04] oh naw, it's trivial
[19:08:50] ...CASE WHEN ts_tag RLIKE('mobile') THEN 'mobile' ELSE 'desktop' END FROM revision INNER JOIN tag_summary ON rev_id = ts_rev_id
[19:14:22] excellent
[19:14:35] * halfak touches fingertips
[19:23:20] the queries seem(?) to work, which is nice
[19:23:31] gonna let them run in the background so I can see how many events N[mobile/desktop IPs] corresponds to
[19:48:47] Ironholds, leila: I can only do 4 PT today, does that work?
[19:49:09] 7pm?
[19:49:10] sure
[19:51:10] oh man, I love east coast banks
[19:51:27] "the wait time is: seven minutes. Remain on the line, or press 1 for us to call you back when a representative becomes available"
[19:55:41] Ironholds: thanks (and sorry for the time, let's schedule something at a more east-coast-friendly time from next week on)
[19:56:00] np, and sure
[20:01:14] hey halfak: I'd like to talk quickly about https://bugzilla.wikimedia.org/show_bug.cgi?id=72541 at our next 1:1 (adding it to the etherpad)
[20:01:43] Oooooooooh. :)
[20:23:47] Ironholds: saw the reply from Dan above, about the tsv, I feel like a true zen developer
[20:24:01] * milimetric waves hand: "these are not the requirements you are looking for"
[20:24:23] milimetric, haha
[20:25:04] is a drive-by question considered in very poor taste? Should I ask on the internal list?
[20:25:13] it's about the mobile / desktop breakdown
[20:26:16] naw, ask away!
[20:26:28] I know all of the answers except for the answers I don't know
[20:27:00] so the open question is: how do we break down our metrics by "mobile" and "desktop"?
[20:27:18] by "our metrics" I mean the ones currently implemented in wikimetrics and shown in dashiki, which are: [20:27:37] P, E, RAE, RNAE, RSNAE, NRU, and RROAE [20:27:56] * milimetric fears that his blood acronym level is toxic [20:28:49] so, for those metrics, I have a proposal which we devs could start implementing right away, and ideally if it was crazy and wrong we need to start figuring out the right approach right now [20:29:14] take, for example, Rolling Active Editor (RAE), we need 5 edits in some time period [20:29:40] so, for this to be "mobile", I propose that 4 of those 5 edits are tagged mobile with revision tags [20:30:06] and for it to be "desktop", I propose that 4 of those 5 are not tagged mobile, or apps, or anything else [20:30:24] and for it to be part of "total", we count all edits, regardless of tags [20:30:33] thoughts? [20:31:44] milimetric, I think this is a reasonable approach. I'm confused about why you chose 4 as your cutoff. [20:31:53] aw, edit questions. Womp womp :( [20:32:00] * Ironholds headscratches [20:32:13] because I am not a researcher and I just randomly pick numbers :) [20:32:34] halfak: would a simple majority be better? [20:32:45] OK. one problem I'm still trying to work out is, if we don't have only two options. [20:32:53] We might have apps as an option. [20:32:54] hmn. Me no editor specialist, but... [20:33:04] hmn. no, that's a dumb idea. Ignore me! [20:33:18] halfak: i've made this case too, and argued that whatever breakdown we come up with, people will want and very soon *need* more [20:33:25] word [20:33:33] milimetric, if there's only mobile/desktop, I think simple majority will work for us. [20:33:34] I want to use PVs as the place to push on that. [20:33:39] If not, then we might choose the plurality. [20:33:48] grouping zero and apps traffic as the same thing? What is that shit. [20:33:59] If no plurality exists, we can just put people in the crazy/mixed case. [20:34:01] gotcha, that's sensical to me [20:34:11] Another option is to break editors in half [20:34:17] * halfak growls and stamps the floor [20:34:24] lol [20:34:39] But seriously, if you do 20% of your edits on mobile and 80% on desktop, maybe you count for .8 desktop active and .2 mobile active. [20:35:20] interesting! that might be the easiest to implement as long as people are comfortable with fractional editors [20:35:27] we may get burned at the stake... [20:36:21] * halfak is used to that. [20:36:32] Luckily I'm not a witch and therefore flame-retardant. [20:36:40] Also, I weight more than a duck. [20:36:41] halfak: thanks very much, I'll suggest this for now and let me know if you'd like the discussion somewhere more formal [20:37:26] milimetric, OK. [20:37:36] DarTar, you should read scrollback when you have a chance. [20:39:06] Ironholds, I'm moving the mobile review to tomorrow [20:39:10] you have one giant busy for tomorrow [20:39:33] will you have 1-hour for chat? [20:40:00] halfak: can you guys give me a summary? [20:40:14] leila, no you don't. [20:40:22] tomorrow has an all-day meeting for a reason [20:40:25] it's my scaremeeting [20:40:32] If you did 80% of your monthly edits via the mobile interface, you count for .8 of a mobile active editor. [20:40:35] DarTar, ^ [20:40:36] it is the only day I get to get substantial work done without getting interrupted by meetings [20:40:48] unless Erik's beard is on fire and I can somehow solve that, the scaremeeting is sacrosanct. [20:41:01] Which Erik? 
[20:41:01] Which Erik?
[20:41:06] yes
[20:41:08] k
[20:41:17] Ironholds: let's talk tonight (4pm my time), I'll follow up with Leila later
[20:41:19] (bloody researchers, asking questions! :P)
[20:41:21] I see. Ironholds, DarTar will figure it out then.
[20:41:23] DarTar, cool. Thanks :)
[20:41:39] I don't think our work areas overlap, so it shouldn't bear a cost (other than 2 meetings for DarTar. Sorry DarTar :()
[20:41:50] well, exactly
[20:42:19] remember I was proposing to cancel my 1:1s? it looks like I just doubled them
[20:42:36] Ironholds, I have an interview scheduled for today at the time the meeting was scheduled, so I had to skip our meeting.
[20:42:37] DarTar / halfak: I posted the relevant scrollback on the internal list, feel free to comment there, but I personally got what I came for - thanks again
[20:42:45] leila, ahh. Good luck w/interview!
[20:42:50] DarTar, we can skip our 1:1 this week if it'd help
[20:42:55] milimetric: thx
[20:43:02] unless you know why my C++ is breaking
[20:43:04] in which case I want to talk
[20:43:21] Ironholds: let's keep it for now
[20:43:21] I'm blaming indices. All the C++ breakage I've ever had has involved indices, recursion, divide-by-zeros or some combination thereof.
[20:43:37] and no, we won't be talking C++ ;)
[20:44:12] I C(++)
[20:44:48] * halfak groans
[20:46:23] * Ironholds bows
[20:46:36] seriously though, wtf, this code.
[20:46:51] whatever I input, it returns 0. Whatever I ask it to output, it returns 0.
[20:47:02] I can ask it to return the size of [input]. 0. Value of input? 0.
[20:47:07] The integer "12" is apparently 0.
[20:47:31] I assume I'm accidentally overwriting the object with an empty vector or something. A problem for Future Oliver! Current Oliver has given up.
[21:27:30] Hey Ironholds, do you know if https://dumps.wikimedia.org/other/pagecounts-raw/ is based on sampled logs?
[21:29:28] Hey Nettrom. I finished up a preliminary look at view rates and inlink rates.
[21:29:29] See https://meta.wikimedia.org/wiki/Research_talk:Measuring_article_importance/Work_log/2014-10-28
[21:29:41] I see some strangeness with view rates.
[22:08:04] halfak: cool, looking at it now
[22:08:25] I think that my view rate generation must be wrong.
[22:08:39] halfak: I wrote some code to compare cl_timestamp to my own results parsing talk page revisions
[22:08:58] found that cl_timestamp isn't an authoritative source for when an article was added to a category :(
[22:09:17] Oh yeah. Totally. Just when it was *last* added.
[22:09:57] or when the sort key was updated
[22:10:11] I also think the timestamp will change if you change a template
[22:10:17] Oh really. The sort key. Boo.
[22:10:41] so to figure out when an article was first assessed, I'll still have to go talkpage-parsing
[22:11:19] Yup. That's a bummer.
[22:14:45] I already have the code for that, so I'll have a look at it later this week
[22:14:58] and also potentially go fetch a newer sample of articles
[22:15:01] anways
[22:15:12] *anyways
[22:15:15] so those views...
[22:16:47] Yeah. So, I have a bunch of code that really just does one useful thing.
[22:16:48] https://github.com/halfak/Article-importance-in-Wikipedia/blob/master/importance/sum_pageviews.py
[22:17:11] It takes a directory of hourly pageviews and aggregates them by page_name.
[22:17:33] Luckily the hourly files are in sorted order, so you can intersect them relatively efficiently.
[22:17:46] However, I still end up with a few duplicated pages.
[22:18:17] So I use this to group up the page_names and sum up the views.
[22:18:18] https://github.com/halfak/Article-importance-in-Wikipedia/blob/master/sql/resolved_view_count.sql
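(A stripped-down sketch of the aggregation step halfak describes at 22:17 -- the sum_pageviews.py side, not the SQL just linked -- assuming a simplified page_name<TAB>count line format; the real pagecounts-raw files carry more fields. Because each hourly file is already sorted by page_name, a k-way merge plus a groupby can sum views in one streaming pass.)

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def parse(path):
        # Hypothetical simplified format: one "page_name<TAB>count" per line.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                page, count = line.rstrip("\n").rsplit("\t", 1)
                yield page, int(count)

    def sum_pageviews(paths):
        # heapq.merge keeps the sorted streams in global order, so groupby
        # can sum each page_name's run without loading files into memory.
        merged = heapq.merge(*(parse(p) for p in paths), key=itemgetter(0))
        for page, rows in groupby(merged, key=itemgetter(0)):
            yield page, sum(count for _, count in rows)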
[22:18:50] Note that I use the "inlink_count" table because it already had a field containing the redirected-to page_id.
[22:20:01] I end up with a lot of pages that have an importance class, but zero pageviews.
[22:20:14] how long of a timespan are you aggregating views for?
[22:20:23] The month of Sept. 2014
[22:20:32] (should have done 4 weeks like you did)
[22:20:40] potato-potato
[22:20:53] :)
[22:21:07] and you're not sampling the importance classes?
[22:21:13] I only used 35,000 from each class
[22:23:20] Yeah. I just used all of 'em.
[22:23:30] 3.7 million articles with an importance classification.
[22:24:38] hmmm
[22:25:47] I need to go look at my code again
[22:26:39] Nettrom, I think the clearest difference is pages with zero views. If you don't have them in your dataset, one of us has done something wrong.
[22:30:46] looks like I don't have 0-view articles in my data
[22:31:02] but I checked and only found 379 (0.22%) were without views
[22:31:54] so I didn't think they would matter much
[22:32:15] (in my analysis, that is)
[22:32:32] Interesting.
[22:32:41] I'll look around at the articles without views.
[22:33:01] largest proportion of 0-views as Low-importance, 0.38%
[22:33:09] *was
[22:41:26] Nettrom, here's a page that got no views: https://en.wikipedia.org/wiki/Actrius
[22:41:33] Found it right at the top of my dataset.
[22:42:00] Next one down: https://en.wikipedia.org/wiki/Asparagales
[22:42:14] stats.grok.se says Actrius had 180 views in Sept
[22:42:28] http://stats.grok.se/en/201410/Asparagales
[22:42:30] Yup
[22:42:45] wow, 4,824 views for the second one
[22:42:50] Yup
[22:42:53] Hmm.
[22:43:43] weird, since neither of them has Unicode characters in the title, and both are a single word
[22:44:00] Yeah. Will dig into this and get back to you.
[22:44:05] Thanks for looking at it with me. :)
[22:44:14] no problem!
[22:46:23] Oh, Nettrom. I see a few different versions of "en" in the hourly view logs. There's "EN", "En.d", "En" and "en".
[22:46:28] Which ones did you include?
[22:46:58] I used Wiki ViewStats as my data source
[22:47:11] I haven't figured out how their parser works
[22:48:07] (but I have poked around at their source code to figure out how they parse titles)
[22:48:37] Oh. Can you simply download their dataset?
[22:48:51] nah, I got Hedonil to give me access to their database on Tool Labs
[22:49:08] their page has been down for quite a while though, not sure what's happening there
[22:49:09] Oh. So now you *can* just download their dataset.
[22:49:29] :)
[22:49:51] Hmm... Sure enough, the prefix "en" is just fine.
[22:49:52] "just": it took me about 8 days to get data for the analysis for our CHI paper
[22:50:05] Yikes. Must be a big DB.
[22:50:39] it's complicated, I think it's optimized for their usage, not for extraction of large amounts of data
[22:50:46] to get the views is a join of some tables
[22:50:53] so it's slow
[22:52:09] Just found an issue. It looks like I didn't filter anchors (e.g. Foo#Bar == views to Foo)
[22:52:16] That doesn't explain the problem though.
[22:58:59] OHHH!!! I know what's up. Wooo. It's MySQL handling a NULL in a well-documented but stupid way.
[22:59:09] * halfak proceeds to fix
[22:59:32] sorry to hear MySQL's stupid
[22:59:36] but that's expected? ;)
[22:59:50] Bah. Really needs to be considered user error.
[23:00:08] So, the issue is that if you don't have a redirect that got views, you don't get views.
[23:00:16] ahh
[23:00:19] But if you have a redirect that has at least one view, you get all your views.
[23:00:39] Fun story, I can actually fix this in the plot code. :)
[23:01:38] sounds like it's easy to fix, happy to hear that!
[23:01:40] I gotta head out
[23:03:05] See ya!
[23:06:40] hey Ironholds, 2 mins and I'm yours
[23:12:40] Ironholds: in the hangout, ready when you are
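(Since halfak says the fix can live in the plot code, a plausible hedged version in Python/pandas -- the function and frame names are hypothetical, not his actual code: keep the importance-classified pages with a left join, then treat the NULL/NaN view counts, i.e. pages none of whose redirects ever got a view, as zero instead of dropping them the way the MySQL NULL handling did.)

    import pandas as pd

    def attach_views(pages, views):
        # pages: DataFrame with page_id + importance class
        # views: DataFrame with page_id + summed view_count
        merged = pages.merge(views, on="page_id", how="left")
        # NaN here means "no redirect of this page got a view": count it as 0.
        merged["view_count"] = merged["view_count"].fillna(0).astype(int)
        return merged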