[17:52:39] So, what's up harej?
[17:52:47] halfak: you had some thoughts regarding my Research: page but I don't remember what they are or if i need to re-scope my project
[17:53:38] also, in the absence of the referrer log dashboard, would it be possible to get anonymized referrer logs for specific pages (I will have a list for you)?
[17:57:16] (the last time I brought up the research project, i had to run away to oakland for a meeting... but now, i am firmly planted in my seat, and not going anywhere... until i get lunch)
[17:58:03] * halfak wishes he had his notes from earlier.
[17:58:09] I suppose the chat log should be there.
[17:58:18] wm-bot4, where do you store your log?
[17:58:30] FYI: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-research/
[17:58:41] harej, what day was that?
[17:59:45] * halfak finds harej questioning his Minnesotanness.
[18:00:02] February 2, halfak
[18:00:21] Got it. https://gist.github.com/halfak/bb6e5c3ad103e1397b63
[18:00:39] is GitHub Gist like Pastebin but not lousy with ads?
[18:01:37] yes
[18:01:38] :)
[18:01:51] Sounds wonderful. I should use GitHub, or something.
[18:01:57] +1
[18:01:58] I haven't deployed code in over three years.
[18:02:22] Of any kind.
[18:02:23] So, I was trying to figure out what you were working towards with your hypothesis.
[18:02:51] It seems like you were saying that subject area activity ought to correlate with relevant WikiProject activity.
[18:02:59] Yes.
[18:03:28] So, if we formalize "subject area activity" and "wikiproject activity", we should be able to do a good job.
[18:03:37] But as you mentioned before, WikiProjects have cycles.
[18:03:54] And a more mature, late-stage WikiProject might not correlate the way we expect.
[18:04:42] I would formalize subject area activity as edits to pages within a WikiProject scope (as defined by the WikiProject categories found on talk pages)
[18:04:48] Any time there is a factor like this that could confound the results, I like to try to include it in the model.
[18:05:19] hey all
[18:05:23] WikiProject activity is measured by overall talk page activity, and particularly by average thread size. (You can usually determine which post belongs to which thread via edit summary, which should work for most cases.)
[18:05:24] +1 for edits to WikiProject pages. We might want to control for bot edits and vandalism/counter-vandalism.
[18:05:38] Hey DarTar.
[18:05:57] halfak: did you see the back and forth with Joe over the weekend?
[18:06:18] And a more mature, late-stage WikiProject might not correlate the way we expect. << So perhaps there is correlation for *certain points in the lifecycle*, but not necessarily in second-half 2014. I am starting to see where you're going
[18:06:28] DarTar, I did. Seems like they have a bug?
[18:07:22] yep, their suggested solution is probably the right one: simply test if we get a 200 in return, or maybe try and parse the “noredirect” response of the resolver
[18:07:24] harej, avg thread size is time-dependent and archiving-strategy-dependent.
[18:07:33] the latter sounds like a PITA thought
[18:07:37] *though
[18:07:46] DarTar, I think we need to drop metadata from our thoughts.
[18:07:55] It's a serious pain.
[18:07:57] halfak: how often do wikiprojects archive threads before they're done?
[18:08:04] halfak: totally, at this stage I only care about lightweight validation
[18:08:08] aren't most archives done on the basis of "X days since last edit"?
[18:08:18] DarTar, We're going to need to come up with our own standard if we want the dataset to have any real value for the user.
[18:08:32] DarTar, see my notes on validation in Trello
[18:08:44] I finished a pass on the sample set and did spot checking.
[18:08:50] halfak: k, I still need to catch up with it
[18:09:12] awesome, do you want to check in 15 minutes later today?
[18:09:25] harej, I imagine that depends on whether there is a bot or not.
[18:09:50] DarTar, sure. All my meetings today got cancelled! So I am free whenever.
[18:10:03] wow
[18:10:10] This is my first day since I can remember with *no meeting*
[18:10:10] ** that wasn't a SF holiday.
[18:10:15] Sure, but I still don't see why divergent archiving methods would affect the metric. It's a measurement of how much back-and-forth there is.
[18:10:35] harej, how do we take the measurement?
[18:10:52] Edit summaries. Each post within a talk page section includes the section name at the beginning of the edit summary.
[18:11:11] harej, how do we know when a topic is mature enough to count up the number of replies?
[18:11:57] I've seen topics sit for a year without replies, but still end up with a few.
[18:12:52] As in, what is our cut-off point before we decide "there is no way this topic is indicative of a project's activity", such that we won't count a thread against a project for being all of 15 minutes old?
[18:13:53] Indeed. Or even two months old.
[18:14:14] But if a thread languishes for several months with no replies—even though it totally deserves to get replies—that in itself is a measure of project activity.
[18:14:28] I post to the Teahouse. I get a response very quickly. That is an indication of an active project.
[18:15:04] Sure. That's fine and all, but we need to deal with the time dimension or let it be a confound.
[18:15:59] So what if we changed our metric? Instead of average talk page thread size, average time between post 1 in a thread and post 2 (for threads of 2+ posts)?
[18:16:21] +1 sounds good.
[18:16:38] I wonder if we might include the % of threads that get a response within some amount of time.
[18:16:43] e.g. one week
[18:17:06] Some threads don't deserve responses though. They're just notifications that do not in themselves spur conversations, even on otherwise highly active pages.
[18:18:47] harej, let's say that those are common in a few WikiProjects. Are we willing to take that on as a limitation?
[18:20:37] Interesting point. If they're a consistent tendency across WikiProjects, then it evens out.
[18:22:30] But I don't *know* that, and I don't know if we should assume the presence of notification-only threads (as opposed to queries that get no response) as a constant.
[18:23:46] I propose we operationalize a few different measurements.
[18:24:00] 1. The number of threads started per time period.
[18:24:11] 2. The proportion of threads replied to.
[18:24:22] ^ within a month
[18:24:32] 3. The expected number of monthly replies
[18:25:00] 2 will be unstable for low observations, so I'd like to use the raw number of threads with replies instead.
[18:25:32] 3. will mostly share signal with 2 assuming a long tail of replies #s
[18:25:48] So, we could get away with counts of 1. and 2.
[18:26:48] And those will measure a general sense of "activity"
[18:26:58] Indeed.
[18:27:02] We can then chunk out when WikiProjects are at their most active, and try to deduce correlational activity accordingly.
[18:27:22] chunk out?
[18:27:32] My thoughts merged together.
[18:27:39] Then we can file a lot of deletion proposals for articles in those projects, so that talk notifications increase and activity explodes
[18:28:05] Nemo_bis: I have left two talk page messages on each and every WikiProject. I already screwed up the data and we haven't even started measuring yet!
[18:28:24] ;)
[18:28:33] (I very seriously propose that my edits shouldn't count for this research.)
[18:28:46] harej, makes sense.
[18:28:50] Nemo_bis, no evil :P
[18:29:10] So based on our measurements of project activity, we can then do analyses of what periods projects are most active in.
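Editorial sketch (not part of the original log): the two counts settled on above, threads started per month and the raw number of threads replied to within a month, could be computed from talk page edit summaries roughly as follows. This is a minimal illustration under stated assumptions, not the queries that were actually run. The `Revision` record, the function names, and the 30-day window standing in for "a month" are hypothetical; a "reply" is assumed to be a later post in the same section by a different user; and threads are keyed by section title, so two threads sharing a title would be merged. The `/* Section name */` prefix is the form MediaWiki puts at the start of a section edit's summary.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, Iterable, NamedTuple, Optional, Tuple

# Section edits carry the section title as "/* Title */" at the start of the summary.
SECTION_RE = re.compile(r"^/\*\s*(.+?)\s*\*/")


class Revision(NamedTuple):
    """Hypothetical record for one talk page edit."""
    timestamp: datetime
    user: str
    comment: str


def section_of(comment: str) -> Optional[str]:
    """Return the section title from an edit summary, or None if absent."""
    match = SECTION_RE.match(comment)
    return match.group(1) if match else None


def monthly_thread_counts(
    revisions: Iterable[Revision],
    excluded_users: FrozenSet[str] = frozenset(),
    reply_window: timedelta = timedelta(days=30),
) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map (year, month) to (threads started, threads replied to in the window)."""
    first_post: Dict[str, Revision] = {}
    replied = set()
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        if rev.user in excluded_users:  # e.g. bots or mass-message senders
            continue
        section = section_of(rev.comment)
        if section is None:  # not a section edit; cannot be assigned to a thread
            continue
        if section not in first_post:
            first_post[section] = rev
        else:
            opener = first_post[section]
            # Assumption: a reply is a post by someone other than the opener.
            if rev.user != opener.user and rev.timestamp - opener.timestamp <= reply_window:
                replied.add(section)
    counts = defaultdict(lambda: [0, 0])
    for section, opener in first_post.items():
        key = (opener.timestamp.year, opener.timestamp.month)
        counts[key][0] += 1
        if section in replied:
            counts[key][1] += 1
    return {key: (started, answered) for key, (started, answered) in counts.items()}
```

Passing the mass-message sender's username in `excluded_users` is one way to keep those posts out of the counts, as proposed above. A thread is attributed to the month of its first post, so a reply that arrives the following month still counts toward the month the thread was opened.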
[18:29:29] And then do project activity/subject area correlations for different chunks of time according to when the projects are most/least active.
[18:30:02] Ahh yes. I would suggest that, for this first analysis, we allow ourselves to not worry about lifecycles of WikiProjects and instead simply include WikiProject age as a predictor.
[18:30:16] This could be a simple, multi-variate linear regression.
[18:30:33] So we're sticking to the second-half-of-2014 restriction while using WikiProject age as a variable?
[18:30:39] Indeed.
[18:30:52] And I think we should do the work month-to-month and drop Dec.
[18:30:58] Because December is weird.
[18:31:09] Does Christmas screw things up that much?
[18:31:16] Holy crap yes.
[18:31:33] * halfak looks for the first graph that will demo the problem.
[18:32:38] Here's mobile registrations on Italian Wikipedia: https://meta.wikimedia.org/wiki/File:Itwiki_anon_test.mobile_registrations.trend.svg
[18:32:57] I suppose that Sept. is equally weird.
[18:33:07] I see that nice dip toward the end of December.
[18:33:15] What's wrong with September?
[18:34:12] All the kids go to school
[18:34:22] Probably the teachers get busy too.
[18:35:00] The kids go to school September through June (or August through May)
[18:36:22] Do we see more activity in the summer?
[18:37:58] In general, yes. We also see different types of vandalism in Enwiki.
[18:38:02] (the northern summer, I should clarify)
[18:38:29] Which raises another question, does the southern hemisphere have a different school year if their seasons are different?
[18:38:41] U.S. schoolchildren having summers off stems from the harvest.
[18:39:29] Anyways, I generally accept that December is weird and should be excluded. I don't know if the same case can be made for September.
[18:40:59] On second thought, let's keep 'em in and be skeptical of those months.
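Editorial sketch (not part of the original log): the "simple, multi-variate linear regression" suggested above, with subject-area edits as the outcome and the two thread counts plus WikiProject age as predictors, might look like the following. This is an illustration only. The function names are hypothetical, the model is plain ordinary least squares solved through the normal equations in pure Python, and no claim is made that the participants fit it this way; in practice a statistics package would also report standard errors and fit diagnostics.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation updates both sides together.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit the model")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_ols(predictors: Sequence[Sequence[float]], outcome: Sequence[float]) -> List[float]:
    """Return [intercept, coef_1, ..., coef_k] from ordinary least squares."""
    rows = [[1.0] + list(p) for p in predictors]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

Each row of `predictors` would hold one project-month, for example (threads started, threads replied to, project age in days), and `outcome` the matching count of subject-area article edits. The coefficient on age then absorbs the lifecycle effect that the first analysis chooses not to model separately.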
[18:42:57] if we do it month to month, it won't really matter if december is skewed, so long as december is skewed across the board
[18:43:01] except where it's not—that will be interesting!
[18:43:49] "This project measured as a 0.8 in July, while 0.4 is average for that month. It measured as a 0.5 in December, while 0.1 is average for that month."
[18:43:51] Or something like that.
[18:45:15] Why don't we just use the whole year? That way we'll have a representative "period".
[18:46:20] We can also do that!
[18:46:49] I actually changed it from one year to six months in order to be consistent with an unrelated metric I am developing for measuring WikiProject activity (for the WikiProject directory's purposes), but I suppose that change was arbitrary.
[18:48:12] Either way, we can grab data for the year, run the model on 6 months and compare.
[18:49:36] And WikiProject age is measured based on number of days since the first edit?
[18:50:22] I suppose so.
[18:53:51] harej, do you guys have someone in mind for running these queries -- or is that something you were hoping I'd help with?
[18:54:16] yes, i do have someone running the queries
[18:54:39] kk. Let me know if you need a hand. I have a few tricks.
[18:54:48] i am not sure what would take longer—developing and running the queries, or analyzing the output
[18:55:33] IME, doing the writeup is the most time-consuming part.
[18:55:39] Also, the most essential.
[18:55:46] brb
[18:59:52] So, this is where the project stands, I think: for each month in 2014, we analyze the number of threads started per time period and the number of threads replied to. We correlate those to subject-area article edits in the same time periods. We control for WikiProject age as measured in days since edit 1. We somehow control for me spamming (almost) every WikiProject talk page, if you think we need to.
[19:04:59] harej, what do you mean when you say "spamming?"
[19:05:16] notconfusing, I've got a DOI parser that I think you'll find useful.
[19:05:27] halfak: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_United_States_Federal_Government_Legislative_Data#Comment_on_the_WikiProject_X_proposal << see these last two talk page threads? i did this on almost 2000 talk pages
[19:05:32] halfak, i'm listening
[19:06:37] notconfusing, so I tried a big set of different parsing strategies. See https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia/blob/master/mwcites/extractors/doi.py
[19:06:52] nice, lemme checkout the code
[19:07:06] extract_mwp(), extract_island() and extract_search() all do roughly the same thing.
[19:07:35] Except that extract_search() is 100x faster and catches some valid DOIs that extract_mwp() misses.
[19:07:51] There are other problems with extract_mwp() that I'd love to discuss.
[19:08:00] halfak: incidentally, as an informal metric, i check wikiproject talk pages to see how many talk page threads occurred between my two mass messages.
[19:08:10] notconfusing, My point being, I think we can collaborate on this at least. :)
[19:09:28] ok, im reading over it right now
[19:09:51] you mean collaborate on a "extract doi from wikitext" callback
[19:09:56] yup
[19:10:19] ok, i will test it out today
[19:10:49] here's a simple way you can access just the extractor:
[19:10:50] https://gist.github.com/halfak/ff6f67ceb1ce95f2449e
[19:25:01] halfak: any opinion on this? "In Part 2 we will interview a random sample of editors within highly active, moderately active, and inactive WikiProjects and WikiProject-defined subject areas. These areas will be defined according to the data we received or other standard measurement of activity. Interview questions will focus on the interviewee's motives for editing, use (or non-use) of a WikiProject, and tools that would help facilitate the interviewee's editing. We will make our interview protocol available to the Wikimedia Foundation before carrying out this part."
[19:26:20] What's the goal of that study? What makes it a good methodology for achieving that goal?
[19:30:44] To ascertain the needs of editors and develop solutions around them.
[19:31:22] What do editors need? Do WikiProjects fulfill those needs? Why or why not?
[19:36:17] Only thought is: Make sure your protocol asks them about their needs and motivations before bringing up WikiProjects.
[19:36:39] Right.
[19:36:49] I suspect you'll find many editors who would benefit from working with a WikiProject, but they don't know they exist or how they'd make use of them.
[19:38:34] Well, it's all supposed to be user-centered. I'm not fixated on WikiProjects because I love them to death, but because they're existing infrastructure I want to co-opt to meet a deeper need that Wikipedia has.
[19:38:50] WikiProjects being flexible enough to allow that.
[19:39:12] And that's one thing we're kind of expecting: people to not even know what WikiProjects are.
[19:41:47] (Well, I don't know if we're *expecting* it, so much as accounting for it as a possibility.)
[20:46:51] DarTar, when do you want that 15 minutes?
[20:47:09] ^ to look at DOI stuffs.
[20:47:18] halfak: 1.30, I blocked the calendar event but forgot to invite you
[20:47:27] cool
[20:47:29] Will be there.
[20:47:41] great, fixed
[21:31:30] halfak: coming in 2