[00:28:54] I'm off. Have a good one!
[00:28:55] o/
[15:52:22] halfak: Hi, I reanalysed the mwcites data and annotated each pmcid point with the title of the paper: https://plot.ly/~tarrow/32/first-appearance-of-pmcids-on-wikipedia/
[15:53:17] oooh!
[15:53:35] I also noticed, from a random sample (not the full set yet), that around 5% of the pmcids found did not resolve. I'll investigate a little further, but mostly I found that they were pmids, not pmcids
[15:53:37] Any secondary analysis of words in titles or anything like that?
[15:53:54] not yet, something that would be interesting to do
[15:54:02] tarrow, gotcha. If you can file some examples as bug reports, we can look into it.
[15:54:22] I've got a set of DOI detection bugs that I'm due to fix too.
[15:54:29] * halfak needs a whole week to devote to this.
[15:54:36] also I think I can find "keywords" added by publishers, which would be nice to see trends in
[15:54:48] DarTar has been pushing to try to organize a workshop/hackathon for wikipedia citation data.
[15:55:07] tarrow, +1 for keywords. That would be very interesting.
[15:55:25] cool, I'll do that (file some bugs) when I know what is going on
[15:55:40] tarrow, would you be interested in a meetup/workshop/hackathon around wiki citations?
[15:55:49] yeah! I sure would
[15:55:57] I'm not sure when DarTar is planning, but I'll keep you in the loop
[15:55:59] :)
[15:56:58] cool, that would be awesome.
[15:57:53] I'm obviously still learning slowly, but the python library for querying EPMC is being improved
[15:58:40] Not sure if you want to integrate that into mwcites for a validation section or not (it will massively slow things down to start with)
[15:58:55] hello tarrow, do you find the librarybase sparql to be useful?
[15:59:05] however it does now cache the requests, so it will get faster
[15:59:32] harej: yeah, I only made it work on monday though
[15:59:55] tarrow, still would like to integrate, but finding time is hard. I did already start some integration work though.
[16:00:17] harej: it is currently the only way I see to query the data that is already in there
[16:00:17] This branch merges extraction, ID extraction and bibliography parsing: https://github.com/mediawiki-utilities/python-mwrefs/tree/merger
[16:00:34] I want to pull in the metadata fetching stuff too, but I can't promise it will be soon :\
[16:00:47] One of my other projects has been getting a lot of attention, so that makes for more work.
[16:01:19] halfak: that's fine. If you'd like, I can consider having a look at it (I'll be very slow at making improvements, but meh.)
[16:01:45] tarrow, totally! Even if you only have notes and suggestions, that would be great.
[16:01:55] What's your github username?
[16:02:09] One thing we mentioned before that I'd be keen on having is being able to feed it RC and extract live citations
[16:02:16] tarrow
[16:02:48] Yes. I've been meaning to get a proof of concept for that together.
[16:02:57] I think we can do that in ~100 lines of code.
[16:03:05] :)
[16:03:27] Depending on what you want to *do* with this live feed of citations.
[16:04:33] tarrow, just sent you an invitation to collaborate on mwrefs -- the library where I am pulling everything together.
[16:04:42] have you seen the demo crossref have? it's at http://wikipedia.labs.crossref.org
[16:05:19] and the code is: https://github.com/crossref/baleen mostly, I think
[16:05:34] that is live DOI event tracking
[16:05:39] Yeah. I've been working with those guys on it.
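Something like the ~100-line "feed it RC and extract live citations" proof of concept mentioned above might look like the sketch below. This is only an illustration, not anyone's actual code: it uses today's Wikimedia EventStreams feed (at the time of this chat the equivalent was RCStream) with the sseclient package, and a naive plaintext DOI pattern rather than mwcites' real extractors.

```python
import json
import re

import requests
from sseclient import SSEClient  # pip install sseclient

API = 'https://en.wikipedia.org/w/api.php'
STREAM = 'https://stream.wikimedia.org/v2/stream/recentchange'
# Naive plaintext DOI pattern; see the regex discussion below.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>|\]]+')


def revision_text(revid):
    """Fetch the wikitext of a single revision via the action API."""
    params = {'action': 'query', 'revids': revid, 'prop': 'revisions',
              'rvprop': 'content', 'format': 'json'}
    pages = requests.get(API, params=params).json()['query']['pages']
    return next(iter(pages.values()))['revisions'][0]['*']


for event in SSEClient(STREAM):
    if event.event != 'message' or not event.data:
        continue  # skip keep-alive events
    change = json.loads(event.data)
    if change.get('wiki') != 'enwiki' or change.get('type') != 'edit':
        continue
    dois = set(DOI_RE.findall(revision_text(change['revision']['new'])))
    if dois:
        print(change['title'], sorted(dois))
```

A real version would diff against the parent revision so it only reports newly added citations, which is most of the remaining ~70 lines.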
[16:06:08] They started out doing some *super* complicated distributed computing in order to do it in realtime.
[16:07:22] Ah cool, I went to see them last week and Joe vaguely talked me through his clojure (at the time it seemed simple enough, but I bet it isn't now)
[16:09:37] harej: were you thinking that is unnecessary and wanted to decommission it?
[16:09:42] Not at all!
[16:10:55] tarrow, nah. they used to be doing it in python with a custom distribution strategy.
[16:10:56] tarrow: any objection if I create a property called "type of item" with potential values {source, publication, author}? comparable to P31 on Wikidata
[16:11:16] Regretfully, I wasn't able to convince Joe to just use python's concurrent.futures and run with that.
[16:11:36] So they rewrote it in Clojure and wrote their own DOI extractor that uses links.
[16:11:42] So, no plaintext DOIs.
[16:12:51] Any convenient way to strip out the DOI from the link?
[16:13:58] also, tarrow: http://sparql.librarybase.wmflabs.org/
[16:14:07] I suppose they are parsing like I am. All DOIs start with "10.<4 digits>"
[16:14:16] So a regular expression works well.
[16:14:49] I use a combination of regular expressions and a look-ahead parser.
[16:17:18] harej: no objection at all; sounds like a plan on the property
[16:17:38] a plan on the property?
[16:18:19] harej: re sparql: that is the address, but you need to follow the 404 to get to the splash page. I haven't yet worked out what to poke to put in an automatic redirect
[16:18:20] "tarrow: any objection if I create a property called "type of item" with potential values {source, publication, author}? comparable to P31 on Wikidata"
[16:18:30] harej, ^
[16:18:40] yeah; it wasn't a very clear way for me to say it :P
[16:18:49] * halfak waits for queries to finish and listens in
[16:19:09] I'm measuring Jimmy Wales' productivity over time in Wikipedia right now :D
[16:19:25] I have no objection to a "type of item" property. It sounds like a good plan :)
[16:19:30] Ah. Thank you.
[16:19:45] I'm kind of compulsive and want each item to be a part of some master ontology
[16:19:54] This will allow us to construct queries for "get all authors who..."
[16:21:21] sure; I'm already realizing that might be helpful on the items I've already made
[16:22:56] o/
[16:23:32] and now I will blow your minds: http://librarybase.wmflabs.org/wiki/Q262
[16:24:16] halfak: I think that some DOIs perhaps don't start with "10."... I'm not sure how many though...
[16:26:27] The people at crossref seemed to say that the prefix didn't even indicate the assigning agency. In theory the whole DOI can be any unicode string
[16:28:10] the sparql endpoint seems like a nice way to mass-query librarybase. but only if there was a nice way to mass-edit...
[16:34:08] tarrow, the "10." is from the standard
[16:35:07] harej: so I'm writing a pywikibot script (it made all of the many author items that are filling it up)
[16:35:29] each author item should be P19:Q265
[16:35:32] "Type of item: person"
[16:36:08] Bah! My index on user_ids didn't finish. Boo.
[16:36:22] No Jimmy Productivity measures for the next couple of hours :(
[16:36:40] harej, what about random paper generators?
[16:37:24] Hey guillom. I'll have an update re. measuring the productivity of Wikipedia for the RG meeting today :)
[16:37:31] Are there Wikimedia projects that cite papers that do not have human writers or preparers?
[16:37:39] halfak: \o/
[16:37:49] I'm excited to see the results!
[16:38:01] harej, not sure. Maybe you could have an organization be an author?
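A quick illustration of both points above: matching plaintext DOIs on the "10." prefix, and stripping a DOI out of a resolver link. The exact pattern here is a guess at what this kind of matching looks like, not mwcites' or baleen's actual code; real extractors pair a pattern like this with a look-ahead parser to trim trailing punctuation.

```python
import re
from urllib.parse import unquote, urlparse

# Every (standard) DOI starts with "10." followed by a registrant
# prefix; the suffix here is terminated at whitespace and common
# wikitext/HTML delimiters. Illustrative only.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>|\]]+)')


def doi_from_link(url):
    """Strip the DOI out of a resolver link like https://doi.org/10.1000/182."""
    path = unquote(urlparse(url).path).lstrip('/')
    match = DOI_RE.match(path)
    return match.group(1) if match else None


print(doi_from_link('https://doi.org/10.1371/journal.pcbi.1002947'))
# -> 10.1371/journal.pcbi.1002947
```

As noted further down, the "10." prefix comes from the DOI standard itself, even though in theory the full string can be almost any unicode.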
[16:38:34] guillom, if this index finishes, we can scrutinize per-editor measurements :D
[16:38:52] It would be fun to run a few queries during the meeting. They should be fast for anyone who doesn't have more than 10k edits.
[16:39:03] ^ to articles.
[16:39:04] * guillom's productivity is probably close to zero these days.
[16:39:09] me too.
[16:39:12] heh
[16:40:16] halfak, in which case, they would be "type of item: organization", no?
[16:40:24] Yeah. That would work.
[16:40:36] Just trying to imagine authors that wouldn't be "person" or "human".
[16:40:43] Ooh. Is a corporation a "person"?
[16:40:51] Or is there some reason corporate authors should be treated the same as individual human authors?
[16:40:54] US law != Wikidata
[16:40:58] Corporations would get the organization designation.
[16:41:09] Actually, I have >18k edits on frwp, but most of those are probably a combination of reverts and admin/process edits, so I doubt my historical productivity is any better.
[16:43:08] just for you, halfak: http://librarybase.wmflabs.org/wiki/Q266
[16:43:28] :D
[16:44:20] harej: cool; I can put that in. I'll try to figure out how to keep my pwb script in git; then you can also see that
[16:44:40] tarrow, any objection to designating http://librarybase.wmflabs.org/wiki/Q261 as the sandbox?
[16:44:53] harej: nope, sounds like a good idea
[16:47:16] * harej wonders if it would be excessive to deem it "type of item: sandbox"
[16:47:51] * harej decides against it, at least as long as there is only one sandbox
[16:56:25] harej: it might also be worth checking that the SPARQL endpoint is working as you imagine it to be. I had great trouble making it stay up to date with librarybase, and then it suddenly started working. I'm nervous that might un-happen, because I don't know what changed.
[18:58:45] Quarry, Labs, Labs-Infrastructure, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1849414 (Dzahn)
[18:59:30] Quarry, Labs, Labs-Infrastructure, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1849419 (Dzahn) looks like https works just fine and only a redirect is missing from http->https; will look into it
[18:59:34] Quarry, Labs, Labs-Infrastructure, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1849420 (Dzahn) a: Dzahn
[21:34:47] J-Mo: Nice presentation earlier :)
[21:35:00] thanks guillom
[21:35:06] J-Mo: \o/ nice plug for ES :)
[21:35:22] J-Mo: I was wondering if there had been any news about that collaboration project with UX design students.
[21:35:23] just for you, yuvipanda
[21:35:31] J-Mo: we might have the Elasticsearch infrastructure ready as early as today :) but definitely this / next week.
[21:35:32] guillom, yes!
[21:35:53] and Trevor said I should sync up with you about it.
[21:36:00] Hah! :D
[21:36:18] We don't have to do it now, I just wanted to know what the status was :)
[21:36:35] want to meet next week? I could find a time Monday or Tuesday (I'm out on Weds-Fri).
[21:37:01] yuvipanda: you are the awesomest
[21:37:16] J-Mo: all credit to bd808 on this one
[21:37:27] Sure; I'm usually in the office from 8 to 3:30. Any time in that window should work.
[21:37:29] I just nerd-sniped appropriately :D
[21:37:42] hehe
[21:37:48] teeaaaahooooussseee
[21:37:51] J-Mo: did you have any time to check any of the links I sent you?
[21:37:52] WIKIPROJECTS
[21:37:55] I'll look for a time for us to meet, guillom
[21:38:05] Great, thanks J-Mo!
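For reference, a minimal sketch of the kind of pwb (pywikibot) script discussed above, stamping an author item with "type of item" (P19) = person (Q265). The "librarybase" site/family name is an assumed local configuration (a user-defined pywikibot family file pointing at librarybase.wmflabs.org), and the item ID passed in at the bottom is hypothetical.

```python
import pywikibot

# Assumption: a custom family file named "librarybase" exists in the
# local pywikibot config; this is not part of pywikibot itself.
site = pywikibot.Site('librarybase', 'librarybase')
repo = site.data_repository()


def tag_as_person(item_id):
    """Add "type of item" (P19) = person (Q265) to an author item."""
    item = pywikibot.ItemPage(repo, item_id)
    item.get()
    if 'P19' in item.claims:
        return  # already typed; nothing to do
    claim = pywikibot.Claim(repo, 'P19')
    claim.setTarget(pywikibot.ItemPage(repo, 'Q265'))
    item.addClaim(claim, summary='Tag author item as "type of item: person"')


tag_as_person('Q123')  # hypothetical author item ID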
[21:38:28] yuvipanda: I looked at the iPython notebook; didn't play with anything, though. So mostly, the answer so far is "not really", I guess :/
[21:38:47] harej: Teahouse and WikiProjects indeed.
[21:39:06] J-Mo: how did you build up a technical volunteer group?
[21:39:35] This is something I have had trouble with.
[21:41:39] it was pretty organic in this case, harej. There have always been enough editors around the Teahouse who knew JavaScript that maintaining/expanding the gadget was no problem.
[21:41:57] Hmm. WikiProject X is more Python-based.
[21:42:05] Worklists based on database queries.
[21:42:26] yeah, I've never had any help with that part of the Teahouse. HostBot has always been my thing.
[21:42:53] Whereas with JavaScript, it's right there on the wiki; anyone can tinker with it.
[21:43:03] I think it's harder because the collaboration has to happen off-wiki (as opposed to gadgets). There's no built-in development environment, unlike JS and Lua
[21:43:07] yep
[21:43:31] though I hear the PyWikiBot folks are making it really easy to develop with that framework these days.
[21:43:56] and yuvipanda has big dreams of providing all sorts of research/dev infrastructure on labs
[21:44:13] they are closer to reality now
[21:44:21] J-Mo: I've implemented 90% of what we talked about
[21:44:24] I need to be less terrified of my customers.
[21:44:38] who are your customers?
[21:44:38] J-Mo: the students for Google Code-In for pywikibot are going to be using the kind of setup we talked about
[21:44:46] nice yuvipanda
[21:44:46] WikiProject users are my customers. Wikipedians, broadly speaking.
[21:45:06] wikipedians are pretty scary! grrr. argh.
[21:45:32] "The thing broke. Let's abandon progress and go back to the stone age."
[21:48:46] well, alpha testers have a right to be fussy.
[21:49:53] And I want to help them. It's hard to do so when you are limited in your ability to do so for reasons beyond your control.
[21:49:56] it's always going to be a heavy lift to stabilize the infrastructure so that people trust it not to break.
[21:50:02] yep
[21:50:23] ^
[21:50:36] * yuvipanda hopes to build technical infrastructure so others can build social infrastructure
[21:51:22] * harej has been building technical infrastructure to build social infrastructure
[21:51:25] it's the premise of WikiProject X
[21:51:26] as long as you don't make people feel like you're demanding their time and not providing them any value in return, you should always be able to return to the well and (humbly) request another chance.
[21:51:48] I can't control how people feel. I try to be responsive.
[21:52:14] that's good. let them know that you will hear and engage, no matter what.
[21:52:50] I can't control how people feel. < The story of my life.
[21:52:55] have they told you about the aspects of WPX that they find most valuable? You may be able to earn a little more buy-in from them if you make it clear that you're prioritizing improving those
[21:53:01] lol guillom
[21:53:55] I have been asking for feedback, yes.
[21:53:55] focus on the parts that they find most useful/fun, and work on the stuff that YOU think is most important on the side.
[21:54:09] There's a lot that needs to happen. Before I can improve, I have to fix the basics.
[21:54:20] sounds right.
[21:54:32] what do your users want from the system?
[21:54:40] Everyone wants something different.
[21:54:42] other than improved stability
[21:54:52] examples?
[21:55:43] Women in Red provided a laundry list of feedback. They're the most active users of our product. They are being prioritized because they actually use the stuff (and because it dovetails nicely with WMF priorities).
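Picking up the "worklists based on database queries" point above: a hedged sketch of what such a query can look like against the Labs database replicas. The host, credential file, and category name follow common Tool Labs conventions of the era, but they are illustrative assumptions, not WikiProject X's actual code.

```python
import pymysql

# Illustrative worklist: mainspace pages in a cleanup category.
# enwiki.labsdb / enwiki_p and ~/replica.my.cnf were the standard
# Tool Labs replica conventions at the time; the category is made up.
conn = pymysql.connect(host='enwiki.labsdb', db='enwiki_p',
                       read_default_file='~/replica.my.cnf',
                       charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        SELECT page_title
        FROM page
        JOIN categorylinks ON cl_from = page_id
        WHERE cl_to = %s AND page_namespace = 0
        ORDER BY page_title
        LIMIT 50
    """, ('All_articles_needing_copy_edit',))
    worklist = [title.decode('utf-8') for (title,) in cursor.fetchall()]

for title in worklist:
    print(title)
```

Unlike an on-wiki JS gadget, a script like this has to live in git and run off-wiki, which is the collaboration barrier being discussed here.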
[21:57:35] sounds good. Might want to ask them to try to rank their suggestions, so that you have a sense of what you need to work on first for the most win.
[21:57:48] laundry lists can be pretty overwhelming when you're resource-constrained.
[21:58:20] or, suggest a prioritization (rank the things yourself, using your best guesses), and present that to them and say "how does this sound? We'll work on this first, then this, then this?"
[21:59:01] phabricator reflects our current priorities: https://phabricator.wikimedia.org/project/board/1370/
[21:59:24] I don't expect people to use it. But there *is* prioritization work. As we begin planning sprints (once our funding is renewed), I can more specifically visit those priorities.
[22:00:24] awesome! and I didn't mean to suggest you didn't have priorities. Just suggesting that prioritizing things with your core users will help them feel more invested in the project :)
[22:19:27] Librarybase has a logo now. http://librarybase.wmflabs.org/wiki/Librarybase:Home