[00:27:28] hey harej: keep me posted on the LibraryBase plans too, I had a good chat with tarrow in Montreal
[00:28:11] like halfak said, we might be able to speed up the work on citation extraction either with diego or the new researcher who will join us later
[00:29:29] i would consider tarrow the principal re. matters of librarybase
[01:07:55] halfak: when you are back: i am trying to make sense of it, between the notebook (which is in R, which I don't understand) and the paper (which I think I understand but I don't know) – the number is (weighted sum + 1) divided by the total number of articles as of the most recent month? And then the bar chart you have that shows the inflection point is the derivative of that?
[06:51:05] https://phabricator.wikimedia.org/T149811#3530016
[14:24:03] halfak: would you like me to resubmit my question from last night?
[14:24:58] Hey harej. Just scrolled back and found it.
[14:25:06] So it's not the derivative.
[14:25:44] It's the difference of women_scientists.avg_weighted_sum - all_wiki.avg_weighted_sum
[14:25:59] Is the first formulation correct?
[14:26:31] Yes. Dividing by the number of articles in the most recent month.
[14:26:44] Which is where I get into limitations in the paper.
[14:27:14] One could game this measurement by creating a few FA-class articles
[14:27:46] In practice, the distribution of articles in any interesting cross-section seems to be relatively consistent.
[14:27:54] The idea being that the denominator is very huge if you're dealing with weighted sum + 1 in, say, 2001
[14:28:05] harej, right yeah
[14:28:19] and the +1 is to give a point for an article existing at all
[14:29:06] Right.
[14:29:22] Nothing + 1 = Stub, Stub + 1 = Start...
[14:29:26] That's the idea.
[14:29:56] So you'll want to add 1 to the weighted_sum column in the table.
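(Editor's sketch, not part of the original log.) The measure described above might be computed roughly as follows. This is a minimal illustration under assumptions: the row layout `(month, article_id, weighted_sum)`, the function names, and the sample numbers are all hypothetical, and months are assumed to be "YYYY-MM" strings so that the latest month sorts last. It is not the code from the notebook or the paper.

```python
from collections import defaultdict


def avg_quality_by_month(rows):
    """Per-month sum of (weighted_sum + 1), divided by the number of
    articles that exist in the most recent month.

    rows: iterable of (month, article_id, weighted_sum), one row per
    article per month in which the article exists (hypothetical layout).
    """
    totals = defaultdict(float)
    articles = defaultdict(set)
    for month, article_id, weighted_sum in rows:
        # The +1 gives a point for the article existing at all.
        totals[month] += weighted_sum + 1
        articles[month].add(article_id)
    latest = max(totals)  # "YYYY-MM" strings sort chronologically
    denominator = len(articles[latest])
    return {month: totals[month] / denominator for month in sorted(totals)}


def quality_gap(cohort, baseline):
    """Difference cohort - baseline for every month present in both."""
    return {m: cohort[m] - baseline[m] for m in sorted(cohort) if m in baseline}


# Made-up sample data, only to show the shape of the calculation.
cohort_rows = [
    ("2001-01", "A", 0.0),
    ("2001-02", "A", 1.0),
    ("2001-02", "B", 0.0),
]
all_wiki_rows = [
    ("2001-01", "A", 0.0),
    ("2001-01", "C", 0.0),
    ("2001-02", "A", 1.0),
    ("2001-02", "B", 0.0),
    ("2001-02", "C", 0.0),
    ("2001-02", "D", 0.0),
]

cohort_avg = avg_quality_by_month(cohort_rows)
all_wiki_avg = avg_quality_by_month(all_wiki_rows)
gap = quality_gap(cohort_avg, all_wiki_avg)
```

Because the denominator is fixed at the latest month's article count, early months score low simply because few articles existed yet, which matches the remark about 2001 above.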
[14:30:02] so i did the numbers for wikiproject occupational safety and health, and it looks like there's no inflection point -- the articles have pretty much always been better
[14:30:08] (that wikiproject is mostly chemical articles)
[14:30:34] harej, seems likely. It could be they've never done a substantial outreach or work-push
[14:30:48] because it was never necessary
[14:30:59] Right
[14:31:10] I wonder what military history looks like. I haven't tried it yet.
[14:31:18] we have a distinct problem in that our wikiproject includes articles specifically about the topic as well as articles related to the topic. it is not like women scientists where an article is or isn't related.
[14:31:29] this makes analytics harder
[14:31:46] harej, maybe we could use wikidata
[14:32:00] carbon dioxide is a workplace health issue. it is also many things other than a workplace health issue
[14:32:16] i guess? i also want to segment the articles better
[14:36:02] also, are we going to get numbers after October 2016? I want to see the impact of our newest Wikipedian in Residence, who has been doing a lot of work
[14:38:05] Oh yeah! I have those. I can load them soon.
[14:38:24] can you make a task for that?
[14:39:05] Also, how did you manage to get those into quarry? Does quarry include tools-db as a server or did you just put a user table in the replica space?
[14:39:16] I have so many questions.
[14:39:47] user table on replica. Had an almost year-long conversation with the DBAs about it.
[14:40:04] Impressive
[14:40:14] I take it people should generally not do that
[14:42:33] harej, right. I'd really like to figure out a long term solution for big dataset tables like this.
[14:42:48] E.g. a special db called "datasets_p" and a little process around it.
[14:42:55] I'd be willing to maintain and organize a bit.
[14:43:21] Some days I'm tempted to start my own data hoard
[14:44:49] Speaking of, I think I'd like to go ahead with a VPS with a 300 GB disk? Is it an existing server type now that you've brought it into existence?
[14:46:54] In other words, should I say "I want one of those," or should I come up with a more unique proposal?
[14:48:51] I shut off dataset generation for static media so I'm in a good position for a transition
[14:49:33] harej, I linked to the relevant phab tasks yesterday in cloud.
[14:49:59] I think andrewbogott worked out the details. Not sure if he can manage yet another VM type or not.
[14:50:18] I imagine they don't want just anyone to have access to it in Horizon
[14:51:00] I think there's a good business case. It provides a useful service for many people and is not just a niche activity.
[20:04:02] Time to document the tea!
[20:06:23] I’m documenting diego
[20:06:43] making sure he has a pic and user page linked from a bunch of places
[20:08:06] nice.
[20:08:10] I'll brb to the call