[05:34:42] ahh, NOW I remember why I hate writing python
[05:52:24] I used Python to get valuable information, like the number of footy-related edits made to wikipedia!
[05:57:35] harej, I'm using it for Project Thaumiel
[05:57:47] project what
[15:09:57] _o/
[15:36:07] euugh
[15:55:06] morning
[15:59:22] Ironholds, why you hating on python?
[15:59:48] * halfak away your messages from last night
[15:59:52] *saw
[16:00:42] halfak, the language is confusing!
[16:00:45] also I was writing it until 4am
[16:01:06] :P It's commonly picked in CS programs as an introductory language because it is the opposite of confusing!
[16:08:54] halfak, are you kidding?
[16:09:06] "okay, give me the logic chains of C. now fuzz it a bit and introduce formatting-specific syntax"
[16:09:13] "okay, cool"
[16:09:18] Nope.
[16:09:21] Very serious
[16:09:31] Python is held up as a nice language to write.
[16:09:41] well, I spent 4 hours writing it and did not enjoy
[16:09:53] How was the first time you wrote interesting C?
[16:09:59] although that may have been simply because I spent those four hours writing something to turn the dump files into a consumable...thing.
[16:10:10] Why did you do that?
[16:10:14] I already wrote that.
[16:10:14] okay, fair point. The first time I wrote interesting C I think I wandered around the office looking like a madman
[16:10:28] going "WHAT THE FUCK IS A CONST CHAR* AND HOW DO I MAKE IT INTO A CHAR"
[16:10:34] lol
[16:10:39] ooh, you did? Where!
[16:10:59] https://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump
[16:11:09] Turns an XML dump into an iterable of pages and revisions.
[16:11:20] oh!
[16:11:24] the pageview dumps!
[16:11:27] Oh!
[16:11:29] the XML dumps I always use mwutils :D
[16:11:44] yeah... that's an interesting problem :)
[16:11:50] "interesting problem"
[16:11:52] Pull request to mediawiki-utilities eventually?
[16:11:54] :)
[16:12:02] with a pageview processor?
[16:12:15] It would be cool to have a page_counts module in core.
[16:12:17] yeah
[16:12:20] Sure!
[16:12:29] at the moment it's very....eeeeeeeh.
[16:12:42] you know we store mobile pageviews in distinct rows?
[16:12:51] and the rationale for the limited format is "saving space"
[16:13:03] which is fine, but oh, we're waiting HOW MANY bytes including the average response size?!
[16:13:07] *wasting
[16:16:24] on the plus side, I now have a Python IDE I don't hate.
[16:17:32] Ironholds, if you can share the code, I'd love to review and discuss API/processing strategy. :)
[16:17:49] sure!
[16:18:06] I have a couple of open questions I dunno the answer to yet :). I learned something late last night that simplifies things, though, so I'mma implement that first
[16:18:34] kk
[16:36:28] oh jesus hell
[16:36:34] halfak, these files are SPACE DELIMTED
[16:36:39] * Ironholds cries
[16:37:58] * Nettrom passes the kleenex
[16:38:31] and it's not even lower-casing the project names!
[16:42:09] ...space delimited? that one is new
[16:43:07] hi all
[16:46:32] halfak, oh, I found a great storage system to use for this data (CC YuviPanda)
[16:46:41] pingfs?
[16:46:52] Riak?
[16:47:00] oh?
[16:47:07] https://en.wikipedia.org/wiki/Riak
[16:47:13] I… remember looking at it and going ‘looks very opencore'
[16:47:16] but that was ages ago
[16:47:18] SPACE DELIMITED!?
[16:47:26] i know right?
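A minimal sketch of reading the space-delimited per-page pageview count files discussed above, assuming the four-field layout (project, page title, view count, bytes served); the filename and variable names are illustrative, not taken from the conversation.

```python
# Minimal sketch: stream one hour of the space-delimited pagecounts dump
# discussed above. Assumes four fields per line (project, page title,
# view count, bytes served); the filename below is just an example.
import gzip

def iter_pagecounts(path):
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, bytes_served = parts
            yield project, title, int(count), int(bytes_served)

# Example usage (hypothetical filename):
# for project, title, count, _ in iter_pagecounts("pagecounts-20150311-160000.gz"):
#     if project == "en":
#         print(title, count)
```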
[16:47:27] * halfak punches the air
[16:47:30] recommended by a couple of databasey friends; the idea is I'd store project+pagetitle as a hash and use that as a key
[16:47:45] halfak, I had to explain to paultag yesterday that our infrastructure was perl and hell
[16:47:48] his response was...let me find it.
[16:47:55] https://twitter.com/paultag/status/575149371127369728
[16:48:16] Ironholds: nice. you should also look at cassandra. we use it in prod as well, and we just hired one of the major contributors to cassandra...
[16:48:22] (he also coined the term ‘NoSQL’ I am told)
[16:48:28] YuviPanda, ooh, we did? I was looking at Cassandra, yes!
[16:48:34] Reddit uses it for their analytics data
[16:48:38] Ironholds: yup. restbase is cassandra
[16:48:53] but as I understand it Cassandra's main advantage is ultra-fast write times?
[16:49:19] I am not sure. I haven’t looked at them at all
[16:49:41] I was just blindly suggesting it because it would be easier to set up in our environ ;D
[16:50:00] in hindsight, a fairly useless suggestion. ignore me.
[16:50:17] Ironholds: this is just for a subset of the data, right? ‘last month’ perhaps?
[16:50:29] yes, but I'd like the infrastructure and code to be portable and expandable
[16:50:44] like, the idea is if it works I can just go "here is a big, documented Python codebase and import method and datastore. Go nuts."
[16:50:47] right.
[16:51:34] so I guess you’d have to do some minor multiplication about how much data all of it is going to end up being, and make sure that the storage system can handle at least twice taht
[16:52:30] yup
[16:52:50] YuviPanda, FWIW that amount is "less than the amount you calculated"
[16:53:00] our pageviews dumps, they include the bytes_served count for that page
[16:53:06] quite why anyone thought we needed that I have no idea
[16:53:07] that seems a bit useless
[16:53:15] (or a byte useless)
[16:53:18] it seems incredibly useless and incredibly databloaty
[17:05:13] * guillom is also having fun with Python these days.
[17:05:23] ah, standup... /me skips, nothing to report as I've been bus with other things, but I did get an ICWSM paper accepted
[17:06:58] (Where "fun" involves a third-party web API that keeps sending me 200s ever though it's not doing what I want.)
[17:07:07] even*
[17:17:04] congrats guillom :)
[17:17:55] Nemo_bis: Why?
[17:25:09] guillom: well, every other week you mention a new programming language you're learning :p
[17:25:23] Or so it feels.
[17:26:45] Nemo_bis: Ah :p I learned some Python a few months ago to write MrMetadata. I just haven't played with external APIs, because I relied on pywikibot.
[17:38:48] halfak, is there any python data structure that's...I guess, an array?
[17:39:05] so I can say "if this string matches a value in col1, append the pertinent col2 value to it!"
[17:39:15] "list" is an arraylist.
[17:39:15] Ironholds: python has two different type of array things
[17:39:29] halfak, cool! So I can have a list of lists and that'll do it?
[17:39:32] lists are number-indexed; arrays are associative. this=that and so forth
[17:39:34] Yup
[17:39:42] awesome!
[17:39:43] That's a fine way to represent a matrix
[17:39:56] harej, you've been doing this too long. Want to help with a project? :)
[17:40:04] halfak, great! Dezachteing
[17:40:10] Too long? I wrote my first python script ever *last week*
[17:40:27] As I said, too long.
[17:40:40] heh
[17:41:57] I don't know how much free time I have, but what do you need help with?
[17:42:08] Thaumiel!
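On the "match col1, append the pertinent col2 value" question above: a list of lists does work as a matrix, but Python's associative structure is the dict (there is no built-in associative "array"). A small sketch with invented rows:

```python
# Small sketch of the col1 -> col2 lookup described above. The rows are
# invented for illustration. A list of lists works fine as a matrix, but a
# dict (Python's associative mapping) makes the string-keyed lookup direct.
rows = [["enwiki", 120], ["dewiki", 45], ["frwiki", 30]]   # a list of lists

lookup = {col1: col2 for col1, col2 in rows}               # dict keyed on col1

results = []
for key in ["dewiki", "enwiki", "nlwiki"]:
    if key in lookup:
        results.append([key, lookup[key]])    # append the pertinent col2 value

print(results)   # [['dewiki', 45], ['enwiki', 120]]
```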
[17:42:11] I have experience writing Python scripts for analytics purposes. Well, I have experience writing Python *script* for analytics purposes.
[17:42:13] Oooh!
[17:42:18] I would love to help with that.
[17:42:20] yay!
[17:44:10] What do you need me to do?
[18:11:05] harej, so what's the latest WMUSA info?
[18:21:05] Ironholds: there is a planning meeting this weekend. i hope to launch scholarship/request for papers next week. likelihood of that is... we'll see :)
[18:21:33] harej, yay! What sorta papers?
[18:21:38] Erik gave me clearance to attend
[18:21:49] I mean that liberally. Request for sessions, really.
[18:22:34] what sorta sessions? ;)
[18:23:03] anything tangentially wiki-related. submit a presentation and it will probably get accepted
[18:23:19] can it be a standup comedy routine?
[18:23:24] probably not
[18:23:26] I mean, my presentations are always standup comedy
[18:23:28] unless it's relevant!
[18:23:29] but formally, this time
[18:23:40] what if it's wiki-related humour?
[18:23:44] sure
[18:23:54] likely the program commitee will see "Oliver Keyes, Wikimedia Foundation" and go "ooooh" and you'll automatically get points
[18:24:21] also Emily is on the committee, at-large no less.
[18:25:08] Ironholds: harej true story, that happened at the American Consulate today...
[18:25:24] My visa interview was about 1min long, most of which was athe consular officer gushing about how much he loves Wikipedia
[18:25:33] (this was about 3h before NSA vs WMF was unveiled :P)
[18:26:29] harej, Emily is always at large, she's a MENACE!
[18:26:35] YuviPanda, nice!
[18:26:57] UK Border Force is less impressed by Wikipedia.
[18:50:33] hey harej, can I pitch you a visualisation and you can see if you think it'd be useful?
[18:50:43] sure
[18:50:47] see http://bl.ocks.org/kerryrodden/7090426 - I want to do one of these, right?
[18:50:56] in blue and red, where red indicates failed proportion and blue successful
[18:51:34] for, from inside to out, read requests, edit attempts, abuse-filter checks, spam-blacklist checks and revert checks
[18:55:08] Ironholds, milimetric is half-way there./
[18:55:29] hm?
[18:55:32] brt, in a meeting
[18:55:51] ironholds is considering a starbursty plot of edit success
[18:56:55] halfak, yep!
[19:01:29] yes, ok, starburstiness
[19:01:42] Ironholds: where's the data? what's it look like?
[19:03:29] milimetric, not generated it yet!
[19:03:43] right, but where would you get it from?
[19:04:17] request logs + MW
[19:04:35] and use session id to tie it all together?
[19:05:48] no, just a general temporal aggregate
[19:05:58] I'll draw something when I have some spare time :(
[19:07:24] hm, Ironholds: there are simple ways to show just basic un-related proportions
[19:07:42] sunbursts are meant for activity that's all tied together somehow
[19:08:21] "of this inner circle, X % made it to this outer circle segment, while Y % made it to this other segment"
[19:08:27] gotcha
[19:08:45] well, we can say that of requests to ?action=edit in [hour], all of the edits in [hour] (or the vast majority) are likely to come from there ;p
[19:10:27] sure, and you can get that from the EL data too, that ties it together much more tightly (soon for Wikitext too)
[19:10:56] but I'm not sure about the next circle that you're intending - "all read requests"
[19:11:11] mmn, fair point
[19:11:18] I'll noodle on it. Currently at -30 bandwidth anyway
[19:11:18] the proportion going to action=edit is going to be very small
[19:11:55] yeah, i think you can get at this data, my only point is a sunburst would make it look kind of like a circle with a little stick sticking out of it (small segment for editing)
[19:12:04] heh
[19:12:12] and maybe a better viz would be overlapping area charts in logscale
[19:12:28] so you can show the general shape of the edits as it follows the other things you theorize influence it
[19:13:03] but also - if you need help putting this in a sunburst, I'm more than happy to help, I wasn't trying to pee on your idea
[19:13:07] :)
[19:13:27] sure!
[19:13:30] like I said, bandwidth :/
[19:14:01] basically, for a sunburst you need data like this:
[19:14:02] step1-step2-step3, 345
[19:14:02] step1, 2309
[19:14:02] step1-step2-step3-step4, 123
[19:14:28] and if you look at the file I used, it also has wiki and date so I can play with the filters
[19:14:32] neat
[19:14:54] but yeah, those columns are the basic format and the code handles the rest - definitely poke me if you get in trouble
[19:23:38] * halfak finishes off implementation of diff optimizations and prepares to run new altiscale job.
[19:29:55] so many legal requests :/
[19:39:04] halfak, someone just asked me who my mentor was in the context of academia
[19:39:10] JFYI, if you start calling me padawan I'm gonna split ;)
[19:39:47] :P
[19:39:56] Just so long as I get to be a jedi
[19:40:20] downside, you get sliced in half in a terrible film, though
[19:40:47] Yeah, but then I get to be a ghost who shows up at the end of the other movies.
[19:41:05] ghost can program all day without eating or sleeping.
[19:41:11] And haunt the computers of their foes.
[19:43:24] I have *my* python to haunt me
[19:45:05] We'll make you a convert yet.
[19:45:21] Come over to the ... er ... awesome side?
[19:47:40] the pythonic side?
[19:48:00] :)
[19:49:18] hey halfak: have you ever had to POST urlencoded JSON data to an API? I can't seem to get it to work and would welcome a working example, if you know of one.
[19:50:12] Oh yeah. I have done it a couple different ways.
[19:50:25] So you want the post body to be *just* the URL encoded JSON?
[19:50:28] urllib.quote! :D
[19:50:47] I4ve been trying with urllib & urllib2
[19:50:52] I've*
[19:51:01] guillom, I recommend "requests"
[19:51:06] best library for it by far.
[19:51:25] halfak: yes, I was eyeing that one, but I wasn't seeing a difference in what it was doing
[19:51:32] Maybe I'll just try and see if it works
[19:51:33] thanks :)
[19:52:18] guillom, I'm scoping out the docs now on setting the body content.
[19:52:24] Because I keep getting that code 200, and it isn't helpful.
[19:52:39] halfak: I have the docs for requests open.
[19:53:36] Looks like you can do "requests.post(, json="
[19:53:44] http://docs.python-requests.org/en/latest/api/#requests.post
[19:53:49] :)
[19:53:55] Thanks :)
[19:53:58] So they have you covered with a builtin
[19:54:02] no problem :)
[19:55:03] Very few people seem to use the API I'm trying to access, so the docs aren't great. Few examples, and a few time-wasting typos. Hopefully this will help.
[21:26:40] @seen halfak
[21:26:53] Hi Amir1!
[21:26:57] Hi!
[21:27:03] I'm in a meeting now, but I'll be done in ~30 minutes.
[21:27:10] great :)
[22:01:03] Ironholds: waiting for erik (and the room)
[22:02:39] kk
[22:31:24] o/ Amir1
[22:31:30] sorry for the delay.
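Following up on the sunburst input format described above (one breadcrumb path plus a count per row, optionally with wiki and date columns): a sketch of writing a hypothesised read/edit/filter funnel into that shape. All step names, counts, and the column order are invented placeholders.

```python
# Sketch of shaping funnel counts into the "path, count" rows the sunburst
# code discussed above expects (plus optional wiki and date columns).
# The step names, counts, column order, and filename are all placeholders.
import csv

funnel = [
    ("read",                                    1000000),
    ("read-edit_attempt",                          4200),
    ("read-edit_attempt-abusefilter_pass",         3900),
    ("read-edit_attempt-abusefilter_pass-saved",   3600),
]

with open("funnel_sunburst.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path, count in funnel:
        writer.writerow(["enwiki", "2015-03-11", path, count])
```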
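And for the requests question above, a minimal example of the `json=` keyword that the linked documentation describes; the URL and payload here are placeholders.

```python
# Minimal example of POSTing a JSON body with requests, per the docs linked
# above. Passing json= makes requests serialize the dict and set the
# Content-Type header. The URL and payload are placeholders.
import requests

payload = {"title": "Example", "tags": ["test"]}
response = requests.post("https://api.example.org/v1/items", json=payload)

response.raise_for_status()  # a bare 200 doesn't prove the call did what you wanted
print(response.status_code, response.json())
```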
[22:31:36] hey halfak – in hindsight, I don’t need to be there with Terry, but if you talk to him please defend the work we’ve done and make sure we don’t throw it away just because implementation is lagging ;)
[22:31:48] halfak: hey :) np
[22:32:07] Can you hop into a hangout?
[22:32:32] DarTar, +1
[22:32:40] Will bring teeth and nails to fight with
[22:33:08] my boy :)
[23:13:28] there?
[23:13:33] halfak: ?
[23:13:41] Hey
[23:13:56] :-)
[23:13:58] So, that diagram depicts the dependency tree for our feature extractor.
[23:14:19] sorry for the noise, I'm in dorm computer site and people playing DOTA in three AM, shouting
[23:14:34] No worries :) I'm glad you could make it to chat either way.
[23:14:52] no problem
[23:15:09] So, a cool thing that this dependency tree lets you do is specify features based on other features and "datasources"
[23:15:29] I approve of Dota
[23:15:40] Note that added_badwords_ratio depends on proportion_of_badwords_added and proportion_of_prev_badwords.
[23:16:16] Those features in turn depend on other features (e.g. words_added and badwords_added)
[23:16:26] But beneath that, we start to look at "datasources"
[23:16:41] Datasources are extracted from the API.
[23:17:11] "contiguous_segments_added" is one very important datasource that a lot of our features depend on.
[23:17:58] So, the cool thing that we do with dependency injection and a cache is allow many features to be specified while minimizing the computation that it takes to generate them.
[23:18:08] So, let's say that we wanted two features for a model.
[23:18:23] added_badwords_ratio and added_misspellings_ratio
[23:18:40] Both of those features depend on a lot of features and datasources, but it's mostly overlap.
[23:18:51] The dependency solver will make sure that we take advantage of that overlap.
[23:19:32] So, the text of the current revision will only be requested once even though many features may depend on it.
[23:20:05] This strategy for extracting features gives you a lot of power in expressing the feature set used by your machine learning model.
[23:20:36] You don't really need to think about where the data is coming from. All you need to do is specify the features that you want and the dependency solver will figure out the best way to gather them.
[23:20:59] See https://github.com/halfak/Revision-Scoring/blob/master/demonstrate_extractor.py
[23:21:46] Amir1, I'm curious what features you extract from WikiData and whether this strategy would work for those features as well.
[23:22:03] * Amir1 is looking
[23:22:42] some featured would work
[23:22:45] like user
[23:22:53] user-related features
[23:23:22] but since Wikidata doesn't support direct editing
[23:23:49] and editing is limited to certain areas, we should define several new feattures
[23:23:52] *features
[23:24:07] like
[23:24:16]
[23:25:06] +1 :)
[23:25:13] halfak: e.g. see https://www.wikidata.org/wiki/Q19547921 (a randomly picked item)
[23:25:53] but we shouldn't give up on things like bad words added
[23:25:55] and what about something like , for each property existing on wikidata?
[23:26:32] because certain kind of vandalism I saw before is replacing good descriptions with swears
[23:26:49] Helder: number of properties is changing all the time
[23:26:54] or "subfeatures" for each property, like
[23:27:06] We might have an interesting time performing a diff against JSON.
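As a rough illustration of the dependency-injection-with-a-cache idea described above (features declaring dependencies on other features and datasources, with shared intermediate results computed once), here is a toy solver. This is not the actual revscoring API; the class, feature names, and values are made up.

```python
# Toy illustration of the dependency-solving/caching idea described above.
# This is NOT the revscoring API; all names and values here are invented.
class Dependent:
    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = depends_on

def solve(dependent, cache):
    """Resolve a feature, computing each dependency at most once."""
    if dependent.name not in cache:
        args = [solve(dep, cache) for dep in dependent.depends_on]
        cache[dependent.name] = dependent.process(*args)
    return cache[dependent.name]

# A "datasource" (pretend this is an expensive API call) and features built on it:
revision_text = Dependent("revision.text", lambda: "some added wiki text")
words_added = Dependent("words_added",
                        lambda text: len(text.split()),
                        depends_on=(revision_text,))
badwords_added = Dependent("badwords_added",
                           lambda text: sum(w in {"badword"} for w in text.split()),
                           depends_on=(revision_text,))
added_badwords_ratio = Dependent("added_badwords_ratio",
                                 lambda bad, total: bad / max(total, 1),
                                 depends_on=(badwords_added, words_added))

cache = {}
print(solve(added_badwords_ratio, cache))  # revision.text is computed only once
```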
[23:27:10] It would make things unreasonably complicated
[23:27:13] And by interesting, I mean fun :)
[23:27:42] halfak: that would be possible too
[23:27:57] and I think it would make things a lot easier
[23:28:38]
[23:28:50] (or removed)
[23:28:58] yes, that is good
[23:29:23] It looks like we can keep using the API strategy to extract features. https://www.wikidata.org/w/api.php?action=query&prop=revisions&titles=Q19547921&rvprop=timestamp%7Cuser%7Ccomment%7Ccontent
[23:29:35] The content is just a json blob that we can parse.
[23:30:53] there is one big challenge
[23:31:02] It's a little bit hard to explain
[23:31:18] while ago wikidata changed their parsing system
[23:32:12] so in some edits you see the json content is changed entirely but the actual edit was a minor modification in the sitelink
[23:32:37] Amir1 was it changed in a structured way?
[23:32:44] yes
[23:32:48] Such that we could replicate it internally?
[23:33:14] * Amir1 is looking for an example
[23:33:23] lzia: salam :)
[23:33:47] halfak: I think so, but we have to define two ways of parsing the json
[23:33:52] old way or the new way
[23:34:14] Amir1 +1
[23:34:33] or we check if the json is the old version and we change it to the new version internally and then compare it
[23:34:56] Yeah... The later was what I had in mind.
[23:34:59] So, the revscoring system is designed to be re-used in other python projects.
[23:35:02] in pywikibot uses the latter
[23:35:31] halfak: an exmaple: see https://www.wikidata.org/w/index.php?title=Q7251&action=history
[23:35:39] (cur | prev) 00:29, 30 August 2014 Dexbot (talk | contribs | block) . . (55,466 bytes) (+24,459) . . (Changed link and badges for [aswiki]: এলান ট্যুৰিং, Q17437798) (undo) (restore)
[23:35:47] the bot just added a badge
[23:35:59] but 24,459 chars were added
[23:36:17] I see.
[23:36:52] I can also work on linking the Revscore with pywikibot
[23:37:10] so people can easily write a revert bot using both pywikibot and revscore
[23:37:59] That should be pretty easy. There are a few datasources that will need to be re-implemented.
[23:38:37] e.g. revision.doc, parent_revision.doc, user_info.doc, site.doc, previous_user_revision.doc
[23:38:47] But then everything else will just work.
[23:39:00] Currently, the system makes use of mediawiki-utilities API library.
[23:39:16] But I don't see a reason why it would be a problem to switch.
[23:39:40] Just so long as the tests pass :)
[23:40:08] I've got to run.
[23:40:21] I'm not saying revscore uses pywikibot
[23:40:35] I'm saying pywikibot uses revscore
[23:40:37] OH!
[23:40:42] :) That too
[23:40:44] :D
[23:40:57] Amir1, you'll be hearing back from me in the next couple of days re. IEG stuff.
[23:41:16] ok :)
[23:41:19] I'd like to have a brief hack session with you to dig into the feature extractor.
[23:41:33] thanks for letting me contribute :)
[23:41:41] sure
[23:41:50] No problem. Glad to have you around :D
[23:42:03] talking about hack, Will you come the Lyon hackathon?
[23:42:12] I'll be at a conference over the weekend and next week, so some time in the next couple of days or after Thursday next week would work.
[23:42:15] Yes. :)
[23:42:42] We can work on the revscore or other things there
[23:42:58] I would be very happy to do that :)
[23:43:03] Ok really leaving now. o/
[23:43:03] me too
[23:43:07] :)
[23:43:12] see you around
[23:44:26] bye!
[23:44:32] o/
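To make the "normalize the old JSON to the new format, then compare" idea above concrete, a rough sketch follows. It is not revscoring or pywikibot code: normalize() is only a stub, since the exact serialization change isn't spelled out in the conversation, and the field name assumes the current Wikidata item layout with a top-level "sitelinks" mapping.

```python
# Rough sketch of the normalize-then-compare approach discussed above.
# normalize() is a stub: the real conversion from the legacy Wikidata
# serialization to the current one isn't specified here. Assumes item
# documents carry a top-level "sitelinks" mapping.
import json

def normalize(item_doc):
    # TODO: detect the legacy serialization and rewrite it into the
    # current structure before comparing.
    return item_doc

def sitelink_changes(old_content, new_content):
    old = normalize(json.loads(old_content)).get("sitelinks", {})
    new = normalize(json.loads(new_content)).get("sitelinks", {})
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed

# With both revisions normalized, a badge-only edit shows up as one changed
# sitelink rather than as a rewrite of the entire JSON blob.
```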