[12:11:44] hi, I helped plot the maps for this blog post
[12:11:45] http://elections.oii.ox.ac.uk/online-presence-of-the-general-election-candidates-labour-wins-twitter-while-tories-take-wikipedia/
[12:12:05] includes UK candidates with a Wikipedia article per party
[12:12:16] http://elections.oii.ox.ac.uk/wp-content/uploads/2015/05/w_all1.png
[12:13:32] it looks like for the vast majority of constituencies only one candidate has a Wikipedia article
[12:14:11] that is, if the conservative candidate has a page, the labour candidate doesn't
[12:14:26] any thoughts?
[15:04:05] Heh. Early on a Sunday is rough for this channel.
[17:01:49] Ironholds: someone asked if my owning MITLicense.org was ironic. Folks, there's no irony; it's just a copy of the MIT License in human- and machine-readable formats.
[18:38:00] o/ science people
[18:38:42] hi halfak
[18:38:49] How you doin' yuvipanda
[18:38:50] ?
[18:39:12] http://s3.amazonaws.com/rapgenius/1325110916_fonzie.jpg
[18:39:16] halfak: pretty good! I got nerdsniped yesterday and ended up deploying apache mesos on labs and writing a puppet module for it
[18:39:33] and now suddenly I’ve a fairly powerful clustered distribution mechanism running with pretty web endpoints!
[18:39:48] http://marathon.wmflabs.org/
[18:40:43] * halfak googles the things.
[18:41:18] What do I do with this thing?
[18:42:29] "Mesos can run Hadoop, Jenkins, Spark, Aurora" Oh!
[18:42:43] So what other mesos-like things are there?
[18:42:57] i.e. what occupies the same/similar niche?
[18:44:56] halfak: the idea being
[18:45:15] halfak: instead of having a hadoop cluster, a spark cluster, a jenkins cluster, a marathon (similar to toollabs/gridengine) cluster
[18:45:22] you have one ‘cluster’ and then use it for all these things
[18:45:44] Hmm.. Seems like that's what we're doing with CDH on the analytics cluster.
[18:45:54] I dunno about Jenkins, but we're running spark and hadoop.
[18:45:57] so I’m interested in it as a toollabs gridengine replacement
[18:46:04] Interesting.
[18:46:10] halfak: it can run about 40 different things
[18:46:13] I like the idea of having a cluster in labs too.
[18:46:15] and building your own is simple too
[18:46:41] so it’s basically a meta-framework - makes it super easy to build your own clustering thing
[18:46:52] or run any of the ‘frameworks’ which themselves are clustering things
[18:47:06] halfak: my immediate usecase is to put ipython notebooks on these
[18:47:20] halfak: so we can open them up fully to public consumption and scale them however we want :D
[18:47:37] halfak: we can also guarantee resources - at least X gb RAM and y CPUs per user or whatever, if we want to
[18:48:14] yuvipanda, I'm a little worried about sectioning off CPU and RAM.
[18:48:28] nah, it’s just guarantees
[18:48:34] it simply means you won’t get overcrowding
[18:48:40] it isn’t sectioning them off
[18:48:40] But if we can guarantee a minimum and let the rest be used flexibly... well. I'll hug you.
[18:48:44] yeah that
[18:48:51] Hug en route.
[18:48:54] heh
[18:48:59] well it isn’t built yet :P
[18:49:00] Say, did I talk to you about coming to town?
[18:49:06] halfak: no
[18:49:11] halfak: but when? I might not be in town ...
[18:49:11] I'm going to be there on May 10-15th.
[18:49:15] ooooh
[18:49:17] I will be in town!
[18:49:18] sweet ;)
[18:49:20] \o/
[18:49:30] I was going to leave on 9th
[18:49:33] but I switched to 16th
[18:49:37] Woot!
[18:49:53] Maybe we can schedule one of those 6 hour hack sessions.
[18:50:01] We've still got the Article stats thingie.
[18:50:15] yeah
[18:50:28] well and toollabs stats too - I’m hoping we can make a case together before we leave :P
[18:51:22] J-Mo had some good feedback.
[18:51:43] I got 1.5h of spage’s time this week to help start rewriting docs
[18:51:50] where?
[18:51:56] * halfak digs.
[18:52:00] Might have been an email.
[18:53:01] Bah. Can't find it.
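For anyone else googling the things: Marathon exposes a REST API that accepts JSON app definitions, which is how the resource guarantees discussed above (minimum CPUs/RAM per app) get expressed. A minimal sketch of what deploying a long-running service might look like; the endpoint path follows Marathon's v2 API, but the app id, command, and resource numbers here are invented for illustration.

```python
# Sketch of a Marathon v2 app definition. The "cpus"/"mem" fields are the
# per-instance resource guarantees discussed above; the id and cmd are
# hypothetical examples, not anything actually deployed.
import json

app = {
    "id": "/tools/ipython-notebook",  # hypothetical app id
    "cmd": "ipython notebook --no-browser --port=$PORT",
    "cpus": 1.0,       # guaranteed CPUs per instance
    "mem": 1024.0,     # guaranteed RAM in MB
    "instances": 2,
}

payload = json.dumps(app)
# Submitting it would need a running Marathon master, e.g.:
# import urllib.request
# req = urllib.request.Request("http://marathon.wmflabs.org/v2/apps",
#                              data=payload.encode(), method="POST",
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
print(payload)
```

Because these are guarantees rather than hard partitions, unused capacity stays available to other apps, which is the "minimum guaranteed, rest flexible" behavior described above.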
Gist = We need to come up with a "What we want to do" and "What we need to do it" pitch.
[18:53:46] Right now we have a "this is important and we need more resources" argument.
[18:54:25] halfak: yeah, totally +1
[18:55:03] halfak: I’ve a fair bit of ‘what we want to do’ (be as good as heroku!) for toollabs, but not ‘what we need to do it’
[18:55:08] and I suppose you have more designs
[18:55:37] I think that we ought to ask for more ops resources. Tool labs/WMF labs needs stability.
[18:56:02] But I think that Heroku should be one of the first things that we list as proposed projects.
[18:56:07] yes
[18:56:10] I think you are absolutely right on with that.
[18:56:30] halfak: are your mediawiki libraries already installed on tool labs?
[18:56:48] use virtualenv!
[18:56:54] harej, not that I know of. I have been helping people set up virtualenvs -- which they should be doing anyway.
[18:56:57] Yeah.
[18:57:55] and if i wanted to install it and invoke it in my code, what should i do? i can't quite piece together what the name of the module is
[18:58:50] halfak: so yeah, more opsen.
[18:59:04] harej, trouble with mediawiki-utilities?
[18:59:14] halfak: like, we need: 1. logstash, 2. an object store, 3. better NFS (or ideally no NFS)
[18:59:23] is mediawiki-utilities the name of the module?
[18:59:32] harej, reference this: https://pythonhosted.org/mediawiki-utilities/
[18:59:49] Ah! MIT license. Good choice.
[18:59:53] regretfully, "mediawiki-utilities" is the name of the package, but the module is just "mw"
[18:59:54] I run mitlicense.org.
[19:00:00] +1 for MIT
[19:00:10] practical open source FTW.
[19:00:12] * yuvipanda introduces harej to wtfpl
[19:00:12] okay so that's problematic because I use another module also called "mw"
[19:00:25] I love that we live in a world where the GPL is mostly counterproductive.
[19:00:28] * harej growls at legoktm
[19:00:42] harej, what module is that?
[19:00:54] this one: https://github.com/legoktm/supersimplemediawiki
[19:00:56] oops
[19:00:57] lol
[19:01:19] legoktm: can i commit something to change the name to legomw or something?
[19:01:21] harej: just move mw.py to better_mw.py and "import better_mw"
[19:01:24] legoktm, yet another api package?
[19:01:27] or that.
[19:01:35] halfak: you can never have too many!!!
[19:01:38] I shouldn't be owning this namespace.
[19:01:42] heh https://github.com/yuvipanda/python-mwapi
[19:01:51] and sigma loves pushing his ceterach
[19:01:58] legoktm, I've been considering breaking up mediawiki-utilities into mwapi, mwxml and mwdb.
[19:02:10] +1
[19:02:25] The problem I have is the shared types.
[19:02:27] I want to build a ‘starter pack’ for toollabs tools, and I’d like them to be composable
[19:02:33] halfaklib
[19:02:36] lol
[19:03:33] mwapi would look like supersimplemediawiki at the base.
[19:03:51] i want to also include auto-continuation.
[19:03:58] So there might be a query() method.
[19:04:13] halfak: i'm assuming your mediawiki module also includes all the same stuff as lego's?
[19:04:17] so i would only need one thing?
[19:04:18] Then, if we feel it is necessary, we can implement a wrapper API.
[19:04:22] harej, yes.
[19:04:31] It seems so.
[19:04:35] I wrote ssmw when I was annoyed at pywikibot for some reason
[19:04:44] And don't uncouple all the different approaches yet! I'm going to need both database and API.
[19:04:46] and apparently other people started using it?
[19:05:20] (I have a strong preference for database queries, but I am cautioned against using database queries for checking for reversion.)
[19:05:49] halfak: how much of this code could be shared with pywikibot?
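The auto-continuation halfak mentions works by merging the `continue` object from each MediaWiki API response into the next request's parameters until the API stops returning one. A minimal sketch of what such a `query()` method might look like; the `fetch` parameter stands in for the real HTTP call so the continuation loop itself is self-contained and testable.

```python
# Sketch of a query() with auto-continuation. MediaWiki's API (1.21+)
# returns a "continue" object that must be merged into the next request
# until it stops appearing. `fetch` is a stand-in for the HTTP layer.

def query(fetch, **params):
    """Yield one API response's 'query' payload at a time, following
    continuations until the API stops returning a 'continue' object."""
    params = dict(params, format="json", action="query")
    last_continue = {"continue": ""}
    while True:
        req = dict(params, **last_continue)
        response = fetch(req)  # real code: GET api.php with these params
        if "query" in response:
            yield response["query"]
        if "continue" not in response:
            break
        last_continue = response["continue"]
```

With a real HTTP layer this would be wired up as something like `fetch = lambda p: requests.get(api_url, params=p).json()` (the `api_url` name is hypothetical); everything above the transport stays the same.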
it already has some extremely robust API code
[19:05:50] harej, only because my DB wrapper assumes reasonable table names and I haven't fixed it for labDB
[19:06:20] pywikibot is robust but it's reaalllly complex
[19:06:23] legoktm, generally I find pywikibot to be super heavyweight and I don't want to deal with it.
[19:06:28] So I'm not sure.
[19:06:53] halfak: you're going to be in lyon, yes?
[19:06:57] Yup
[19:07:08] it's super heavyweight because it covers literally every edge case :P
[19:07:10] ^ _ ^
[19:07:24] legoktm: which is poor software design in my opinion
[19:07:49] harej: not really. it means I can write a bot and only look at it once or twice a year and everything is still running fine
[19:07:50] i've expressed this opinion with regard to wikiproject banners. you make them do too much and then you have to throw the whole thing out because it's unmaintainable
[19:08:02] legoktm, sounds like the bots I run with my own API code.
[19:08:03] i guess pywikibot is not nearly as bad though
[19:08:06] https://pypi.python.org/pypi/wmflabs
[19:08:48] halfak: but every time the API makes a breaking change, you have to go in and adapt for it. I can just git pull and everything is fine
[19:08:48] legoktm, cool :)
[19:08:59] I like the db connector. I can make use of that.
[19:09:06] What kind of conn is returned?
[19:09:44] legoktm, well, that's why I maintain a library.
[19:09:47] https://github.com/legoktm/wmflabs-lib/blob/master/wmflabs/db.py#L24
[19:09:58] I've honestly never run into a breaking change like that though.
[19:10:11] legoktm, nooooo.
[19:10:14] Boo to oursql
[19:10:18] Yay to standards!
[19:10:23] wat :/
[19:10:23] DBAPI v2 FTW
[19:10:44] oursql does not conform to https://www.python.org/dev/peps/pep-0249/
[19:10:45] are you a pymysql person?
[19:10:50] yes
[19:11:06] I used to do oursql because MySQLdb isn't worth it.
[19:11:10] ^
[19:11:19] But when yuvi introduced me to pymysql I converted everything.
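The cheering for standards here is about PEP 249 (DBAPI v2): pymysql, MySQLdb, and even the stdlib's sqlite3 module expose the same `connect()`/`cursor()`/`execute()` surface, which is why swapping drivers is cheap. A sketch of driver-agnostic query code, using sqlite3 so it runs self-contained; the `pymysql.connect(...)` line in the comment shows what it might look like on Tool Labs, with the host and defaults-file names being assumptions about that setup.

```python
# PEP 249 in practice: this function runs unchanged against any conforming
# driver. For a self-contained demo we use sqlite3; on Tool Labs it might
# instead be (hostname/db names assumed, not verified):
#   conn = pymysql.connect(db="enwiki_p", host="enwiki.labsdb",
#                          read_default_file="~/replica.my.cnf")
import sqlite3

def page_titles(conn, namespace):
    """Return page titles in a namespace via the standard DBAPI surface."""
    with conn:
        cur = conn.cursor()
        cur.execute("SELECT page_title FROM page WHERE page_namespace = ?",
                    (namespace,))
        return [row[0] for row in cur.fetchall()]

# Tiny stand-in for the replica "page" table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_title TEXT, page_namespace INT)")
conn.execute("INSERT INTO page VALUES ('Main_Page', 0), ('Talk:X', 1)")
print(page_titles(conn, 0))
```

The one wart PEP 249 leaves per-driver is the paramstyle: sqlite3 uses `?` placeholders while pymysql uses `%s`, so that string is the only thing that would change when moving this code between drivers.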
[19:11:38] apparently I haven't touched that code since 2013 :P
[19:13:46] legoktm, seems like we should have a meetup of the python library people at Lyon to discuss the state of things.
[19:14:25] +1
[19:14:33] yw :D
[19:15:05] how about we standardize libraries and templates for ‘things running on toollabs / labs interacting with things'
[19:15:05] ?
[19:16:29] legoktm: halfak ^
[19:17:24] +1.
[19:17:26] yuvipanda: I wouldn't limit it to just labs, I think we need to look at the entire python mw ecosystem
[19:17:36] +1 again
[19:17:40] I don’t even know where pywikipediabot sits
[19:17:41] yeah
[19:17:57] But it would be good to make sure that all the things work nicely in a labs env. and we document how to do that.
[19:18:45] yuvipanda: pywikibot* runs everywhere :P I know people who run it on the same server where MW is installed
[19:20:12] so how do we schedule a thingy?
[19:20:24] Not sure. I think we do a phab ticket.
[19:20:29] * halfak searches around
[19:20:58] legoktm: halfak but we all work for the WMF! This clearly is useless
[19:21:09] https://phabricator.wikimedia.org/tag/wikimedia-hackathon-2015/
[19:21:41] * halfak adds a meeting proposal
[19:23:30] https://phabricator.wikimedia.org/T97950
[19:36:03] https://lists.wikimedia.org/pipermail/pywikipedia-l/2015-May/009273.html
[19:37:23] I don't know how useful I would be for a python framework sprint/meeting but I am a consumer and would be interested.
[19:39:33] 0.o pywikipedia-l
[19:39:37] and I'm not on the list
[19:40:37] harej, still nice to have you come and say, "Doing is frustrating and poorly covered by available libraries"
[19:40:47] Foo could be the intersection of two libraries too.
[19:42:36] e.g. right now, if you want to use mw.lib.reverts and pywikibot, you'll need to set up two connection managers.
[19:43:00] shit, i'm going to have to use both, aren't i
[19:43:04] if i am going to be making edits
[19:43:05] It would be nice if there were some simple library beneath all API things so that "session" objects can be shared.
[19:43:05] legoktm: can you forward that to labs-l too?
[19:43:46] harej, making edits with mw.api is not a terribly nice experience yet... so yeah.
[19:44:03] 21.11 <+halfak> But when yuvi introduced me to pymysql I converted everything.
[19:44:10] o/ Nemo_bis
[19:44:17] https://github.com/PyMySQL/PyMySQL/issues/339
[19:44:19] Hi
[19:45:48] pymysql doesn't work in tools, unless it does in some non-obvious way; I was surprised not to find a python example in https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database
[19:46:23] yuvipanda: sent
[19:47:25] Nemo_bis, weird.
[19:47:30] * halfak goes to try things out.
[19:47:35] OH! I think I know what it is.
[19:47:51] Nemo_bis, in your defaults file, does the password have quotes around it?
[19:48:03] If so, remove them and try again.
[19:48:13] single quotes
[19:48:18] legoktm: sweet
[19:48:22] halfak: what if as a workaround, i have a mediawiki-utilities script that does the work and then a separate pywikibot script that makes the post?
[19:48:30] halfak: also from the user?
[19:48:48] halfak: we could either submit a patch to pymysql or remove the quotes
[19:48:50] Nemo_bis, yeah.
[19:48:59] you should file a bug
[19:49:12] harej, those two libraries can be imported at the same time. No problem.
[19:49:31] mediawiki-utilities is very small as an import.
[19:49:50] yuvipanda, I think we should file a bug.
[19:49:56] :D
[19:49:58] I've been meaning to do that.
[19:50:03] on pymysql or toollabs?
[19:50:16] pymysql
[19:50:21] halfak: you were right!
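The quoting problem halfak diagnosed (the PyMySQL/PyMySQL#339 discussion above) is that the mysql command-line client strips quotes surrounding option-file values, while pymysql at the time passed them through verbatim, so a quoted password in replica.my.cnf silently failed. One workaround is to parse the defaults file yourself and strip the quotes before connecting. The sketch below uses an invented credentials file; the section and key names follow the standard `[client]` option-file layout.

```python
# Workaround for quoted values in a MySQL option file (cf. the
# replica.my.cnf issue above): read the file ourselves and unquote,
# instead of handing read_default_file to pymysql.
import configparser
import os
import tempfile

def read_credentials(path):
    """Return (user, password) from a MySQL option file, stripping any
    surrounding single or double quotes, as the mysql client does."""
    parser = configparser.ConfigParser()
    parser.read(path)
    unquote = lambda v: v.strip().strip("'\"")
    section = parser["client"]
    return unquote(section["user"]), unquote(section["password"])

# Self-contained demo with a fake replica.my.cnf (values are invented):
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write("[client]\nuser = 's51234'\npassword = 'hunter2'\n")
    path = f.name

user, password = read_credentials(path)
os.unlink(path)
print(user, password)
# The unquoted pair can then go to pymysql.connect(user=..., passwd=...).
```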
updating the bug
[19:50:29] if it works for mysql client, it should work with pymysql
[19:50:33] Nemo_bis, \o/
[19:51:50] halfak: I had just resorted to using mysqldb and it worked well; it's silly that I lost over an hour on such a thing though
[19:52:19] Hence, knowledgeable people, please don't be afraid of documenting your personal preferred solutions on https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database and friends :)
[19:57:16] ^ Agreed. That shouldn't be under "tool labs" though.
[19:57:24] It should be general to LabsDB.
[20:11:26] does notconfusing ever hang out here?
[20:11:52] harej, he usually is around pretty often, but I think he's been AFK a bunch recently.
[20:11:59] o/ sdesabbata
[20:12:07] Saw your post from 7AM localtime.
[20:12:09] :)
[20:12:14] okay. you may know the answer to my question in any instance. did he develop a database of the extant references on wikipedia? or was that someone else? or am i hallucinating?
[20:12:33] I did that!
[20:12:34] :)
[20:12:41] What kind of references are you looking for?
[20:13:08] Is there any way to cross-reference that database with a wikipedia category?
[20:13:10] Oh wait. Notconfusing did something like that too. He has a live DOI extractor. I did a historical DOI extraction and I also did a historical tag extraction.
[20:13:14] halfak: hi :)
[20:13:42] harej, e.g. what references appear in what category?
[20:13:54] halfak: you may know me well enough to know what direction i am taking this ;]
[20:14:01] sdesabbata, I'm not sure how to answer your questions but I bet Ironholds might be able to.
[20:14:31] harej, we're going to take over the world? Mwahahaha!
[20:14:39] halfak: thanks for the ref
[20:15:16] sdesabbata, maybe EdSaperia would have an idea too
[20:15:57] halfak: my guess is that's quite related to the fact that most constituencies are quite stable in terms of votes
[20:16:36] halfak: but the lack of overlap is quite impressive
[20:16:37] halfak: an idea i had is to cross-reference your database with different wikiproject categories, and to come up with a list of most commonly used references in those articles. though i am not sure if such a list would actually be useful.
[20:16:44] Makes sense. Is it the underdog that has the article or the incumbent?
[20:16:52] My big thing these days is cross-referencing existing indices with wikiproject articles
[20:17:18] harej, I think that sounds interesting.
[20:17:31] Do you think you could produce a list of page_id, wikiProject pairs?
[20:17:38] If so, this should be pretty easy.
[20:17:39] harej: that sounds great
[20:17:55] I think most WikiProjects have decent categories.
[20:18:10] * halfak gets a link to the dataset.
[20:18:19] Hey guillom.
[20:18:28] We're looking at links in that dataset
[20:18:35] hey
[20:18:41] guillom, did some work looking at the most common links across the wiki
[20:18:47] halfak: I'm already working on a thing that generates a list of every page in a given WikiProject based on the quality assessment categories
[20:18:52] harej, is looking to do breakdowns by WikiProject.
[20:19:00] oh, hmm
[20:19:11] * halfak feels like Mr. MatchMaker
[20:19:54] harej: you’re aware of https://tools.wmflabs.org/enwp10/cgi-bin/pindex.fcgi right
[20:20:18] oh that doesn’t actually list the articles
[20:20:19] nvm
[20:20:27] it doooes but can an application use it?
[20:20:37] does enwp10 have some kind of interface, for programming applications..?
[20:20:54] harej, I wonder if they have a public DB you could query
[20:21:01] i emailed theo about it
[20:21:58] that thing is amazing though, it so effortlessly lists all the articles for a wikiproject, despite the category clusterfuck that presently exists
[20:22:14] if only it had some kind of application programming interface, an API if you will
[20:23:01] harej, na. Who wants to help other people do things? ;)
[20:23:14] url seems simple enough to be able to query from a script without API
[20:23:39] but I don't want to make a web scraper?
[20:23:41] http://www.crummy.com/software/BeautifulSoup/
[20:23:43] dude i work only with JSON
[20:23:50] IF you have to resort to such things.
[20:25:29] lxml.html > BeautifulSoup
[20:26:17] yuvipanda, good to know. Never used either. Just heard of Beautiful Soup before.
[20:26:27] so like, i'm not sure if i have to resort to that if i already developed a workaround that uses the labs db directly? :)
[20:26:44] though the screen scraping thing, if i pulled it off, would probably be more robust
[20:26:56] you shouldn’t, no :)
[20:27:27] harej, please DB first
[20:27:46] * halfak has never seen screen scraping and "robust" in the same sentence before
[20:28:09] this is essentially the logic i came up with: https://www.evernote.com/shard/s444/nl/75511172/bf2ce00d-4f17-46a0-b5f3-997a8112ed30/
[20:28:17] err
[20:28:23] i need to make that note public, hold on
[20:28:52] https://www.evernote.com/shard/s444/nl/75511172/bf2ce00d-4f17-46a0-b5f3-997a8112ed30/
[20:29:03] i give up, pastebin it is
[20:29:51] okay, this definitely works: http://pastebin.com/raw.php?i=egVW7Gm6
[20:31:45] halfak: btw, stefano.oiilab.org/sv17UpP0/vis-a-wik
[20:31:54] sorry
[20:32:01] wrong link
[20:32:23] http://sdesabbata.github.io/vis-a-wik/
[20:33:21] sdesabbata, this visualizes the interwiki links, right?
[20:33:36] yes, links and langlinks
[20:33:56] Oh, so there's internal links too
[20:34:04] Yeah.
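If screen scraping the enwp10 listing really were the last resort, the heavy libraries are not strictly required either: for simple pages the stdlib's `html.parser` can pull out links on its own. A sketch against an invented HTML fragment (the real enwp10 markup would of course differ).

```python
# Minimal stdlib-only link scraper, as an alternative to BeautifulSoup or
# lxml.html for simple pages. The HTML snippet below is invented.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data))
            self._href = None

html = '<table><tr><td><a href="/wiki/Foo">Foo</a></td></tr></table>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

That said, the channel's advice stands: the labs DB first, scraping only if everything else fails, since any markup change breaks a scraper.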
I see now
[20:34:31] What's a query I can run that will demo some awesomeness?
[20:34:59] I tried "Beccaria" and it was curious
[20:35:09] halfak: still a lot of work to do on the interface
[20:35:14] But I suspect it was because of the selection limit
[20:35:26] the idea is that you should select a number of pages, not just one
[20:35:42] (i.e. I didn't select the same topics in it.wiki vs. de.wiki and it.wiki vs. fr.wiki)
[20:35:44] you can also do multiple searches
[20:35:51] and add pages after pages
[20:36:01] sdesabbata: what's the max?
[20:36:35] I didn't set one, but after some 20, it gets really cluttered
[20:36:56] IIRC I wasn't able to select more than a few dozen
[20:37:06] It refused to proceed or something
[20:37:49] consider that it has to query the Wikipedia API a lot
[20:38:10] at least two queries per page
[20:38:13] harej, why don't you use the "WikiProject_%_articles" categories?
[20:38:16] They are hidden.
[20:38:24] But they are part of the std. template set.
[20:38:27] halfak: not every wikiproject uses them
[20:38:34] harej, you're right.
[20:38:36] milhist e.g. has all the articles in subcategories
[20:38:48] But it doesn't look like it is part of your masterlist query.
[20:39:05] because it would be redundant to query that *and* the different assessment subcats
[20:39:16] Would it?
[20:39:23] Nemo_bis: it's an alpha, just a few hours of work, I hope to get feedback on how editors might use it
[20:39:24] yes?
[20:39:27] Are there more articles with assessments than a Wikiproject category?
[20:39:40] wikiproject % articles = {assessment categories, unassessed articles}
[20:39:51] i also include the category for unassessed articles, the fallback
[20:39:59] yes, but did you check this empirically?
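The masterlist logic being debated leans on the naming convention for enwiki assessment categories: "<Class>-Class <project> articles", with "Unassessed <project> articles" as the fallback. A sketch of a parser for that convention; how consistently individual WikiProjects follow the pattern is the assumption here, which is exactly the empirical question Nemo_bis raises.

```python
# Parse enwiki assessment category names of the form
# "<Class>-Class <project> articles" / "Unassessed <project> articles"
# into (project, quality_class) pairs. The naming convention is assumed,
# not guaranteed, for every WikiProject.
import re

ASSESSMENT = re.compile(
    r"^(?:(?P<cls>[A-Za-z]+)-Class|(?P<un>Unassessed)) (?P<project>.+) articles$"
)

def parse_assessment_category(name):
    """Return (project, quality_class), or None for non-assessment cats."""
    m = ASSESSMENT.match(name.replace("_", " "))
    if not m:
        return None
    cls = "Unassessed" if m.group("un") else m.group("cls")
    return m.group("project"), cls

print(parse_assessment_category("B-Class_military_history_articles"))
print(parse_assessment_category("Unassessed_Ghana_articles"))
print(parse_assessment_category("Living_people"))
```

Running every `cl_to` value from categorylinks through a function like this would give exactly the page_id/WikiProject pairs halfak asked for above, project by project.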
[20:40:01] sdesabbata: I think the main benefit is in finding gaps in coverage or isolated clusters of topics
[20:40:21] i did with at least two wikiprojects
[20:40:26] kk
[20:40:35] sdesabbata: https://it.wikipedia.org/wiki/Progetto:Connettivit%C3%A0
[20:40:37] Nemo_bis: exactly, it might be especially useful for multi-language editors
[20:40:55] sdesabbata: most hyperactive editors are crosswiki anyway
[20:41:15] Nemo_bis: that's interesting
[20:41:24] Or at least Scott Hale found something like that
[20:41:26] Nemo_bis: I will look into that
[20:41:30] sdesabbata: the main tool we currently have to identify gaps is https://meta.wikimedia.org/wiki/Mix%27n%27match
[20:41:48] Nemo_bis: yes, he is a colleague of mine
[20:41:50] We currently lack tools to solve interwiki conflicts, in part due to https://phabricator.wikimedia.org/T43345
[20:42:27] In ancient times we used pywikibot's interwiki.py graph
[20:42:40] https://meta.wikimedia.org/wiki/Interwiki_graphs
[20:43:11] Nemo_bis: I do remember that
[20:43:16] IIRC last time I managed to make it work was 2006 :p
[20:43:33] Nemo_bis: :D
[20:43:56] Nemo_bis: the aim for that alpha is to develop into a collaborative visualization tool
[20:44:08] Nemo_bis: a Wiki-style visualization tool
[20:44:45] Nemo_bis: so that editors can collaboratively look at gaps to fill, pages to translate, etc
[20:45:25] sdesabbata: aharoni will be very interested in hearing about that :)
[20:45:53] sdesabbata: but you should also comment on the talk page of https://it.wikipedia.org/wiki/Progetto:Connettivit%C3%A0 , there are some very active editors there
[20:46:34] Nemo_bis: that's great to know, I am not much in contact with the Italian community yet
[20:46:42] Nemo_bis: but I look forward to it
[20:48:04] sdesabbata: they're just a handful of people, but they're tanks :)
[20:48:36] Nemo_bis: fantastic :)
[20:48:40] https://phabricator.wikimedia.org/T87410 this is not what I was looking for :/ anyway, ask Amir maybe; could test some form of crosslinking
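The "finding gaps in coverage" use case discussed here reduces to set arithmetic over langlinks: which articles exist on one wiki but are missing from another. A toy sketch with invented data (the article titles and language sets below are illustrative, not real langlink query results):

```python
# Toy gap finder over langlink data: given which languages each article
# exists in, list articles present on a source wiki but missing from a
# target wiki. The sample data is invented for illustration.

def coverage_gaps(langlinks, source, target):
    """Titles that exist on the `source` wiki but not on `target`."""
    return sorted(
        title for title, langs in langlinks.items()
        if source in langs and target not in langs
    )

langlinks = {
    "Cesare Beccaria": {"it", "de", "fr", "en"},
    "Dei delitti e delle pene": {"it", "de"},
    "Illuminismo lombardo": {"it"},
}

print(coverage_gaps(langlinks, "it", "fr"))
```

At scale the `langlinks` mapping would come from the langlinks table on the labs replicas rather than a literal dict, but the gap computation itself stays this simple, which is what makes it a plausible feature for a tool like vis-a-wik.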
between the tool and ContentTranslation perhaps
[20:49:33] sdesabbata: anyway, on your tool, search "Beccaria" in fr de it wikipedia and tell me if you see something interesting
[20:50:18] What it *seemed* to show to me is a very closely connected cluster of articles related to Cesare Beccaria on it and de, and a much more dispersed set of articles in fr vs. it
[20:50:45] On a similar observation one could write an entire paper about the influence of the Milanese Enlightenment on the Austrian empire and Germanic culture :P
[20:50:54] But I think it was just a bug of selection
[20:51:21] anyways halfak that reference data set... is it kept up to date? does it match references and article titles?
[20:52:07] Nemo_bis: it seems to give quite different vis if you search in ita and compare to fra, or if you search in fra and compare to ita
[20:52:52] Nemo_bis: well, this is exactly the type of mid-scale analysis I had in mind, that the tool should facilitate :)
[20:52:52] harej, I build it off the dump files periodically. It does include article titles and page_ids.
[20:53:12] yay. and do i have access to this?
[20:53:26] Nemo_bis: as doing this just by looking at pages would take ages
[20:53:37] Do you want the historic one (which includes additions and removals) or the one that just grabs the ref tag from the most recent revisions?
[20:54:10] I'm not sure what I want.
[20:54:31] got to go
[20:54:33] Most recent revisions might be most practical.
[20:54:51] harej, for some reason, I'm not seeing it in the public datasets folder. Will move it. Sec.
[20:55:00] Nemo_bis: I got all your links, thank you so much! see you soon! :)
[20:55:30] harej, in the meantime, see https://github.com/halfak/mwrefs
[20:58:42] * halfak downloads at 6.5MB/s
[20:58:43] :)
[20:59:24] sdesabbata: yes, a graph of just a few dozen articles can save one hours
[20:59:39] see you soon
[21:08:36] * halfak runs off to have a Sunday
[21:08:42] Have a good one, folks!
[21:08:42] o/