[00:22:46] hey halfak, guess what I did?
[00:22:59] you know the whole, geolocate through Python by passing a big-ass TSV or JSON blob out and using pygeoip?
[00:23:32] That there's for suckers. Next Monday, I unveil *drumrolls*.... C++-based geolocation. It's (fast) fun on the bun.
[00:32:13] Ooh. Sounds fun. How much faster than pygeoip?
[00:32:19] Ironholds, ^
[00:33:12] halfak, still finding out!
[00:33:35] From an R POV, much faster because of the lack of writing to file, reading in, etc, etc.
[00:33:50] I still need to compare native Python versus C++
[00:34:03] (I'm also working on integrating the user agent parser's C++ port. That's gonna be a lot more work.)
[00:41:23] * halfak creates 150 technical debt
[00:41:28] take that, future me
[00:43:23] hahah
[00:43:28] Future Oliver is my favourite sucker.
[00:47:12] * halfak documents profusely to show a good-faith effort.
[01:17:13] Ironholds: I guess in some old version of the app we used OpenSearch briefly...
[01:17:34] well, it's a pretty commonly used old version ;p
[01:18:44] Now that I can't explain. There's nothing particularly prominent about it as far as I know.
[01:19:09] * Deskana looks at the mobile release history.
[01:19:48] Hmm, well, it's the last one before Nearby (which required a permission change, so will block auto-updates)
[01:21:21] Ironholds: https://gerrit.wikimedia.org/r/#/c/153809/
[01:21:27] "Use generator:prefixsearch instead of opensearch"
[01:21:30] Well, there you have it then
[01:23:30] ta
[02:13:10] * halfak kicks off a streaming mapreduce for diffs
[02:13:14] \o/
[02:13:20] --> evening
[02:13:26] have a good one folks!
[02:14:09] I'm still working. Let's go with no ;p
[02:14:26] * halfak found more work too.
[02:14:30] But this should be quick.
[02:16:22] fair
[02:16:26] currently building geo_region, btw
[02:16:40] (it's hopefully going to be temporary and replaced with C++ soon enough, but getting that working has proven...a pain.)
[02:16:51] fucking std::string to const char * conversion
[02:17:22] truncation!
[02:17:47] hmn?
[02:18:29] Oh -- just that std::string would have a variable width and a char * wouldn't.
[02:18:34] yup
[02:18:45] std::string actually contains const_pointer which references the underlying C-like basic string
[02:18:51] const char * = a pointer to a const char, right?
[02:18:54] yup
[02:19:03] You could also have char * const.
[02:19:08] don't gimme a headache ;p
[02:19:17] I've got it 90% working. It now accepts inputs without segfaulting.
[02:19:24] I just need to get it to output without segfaulting ;p
[02:19:28] Which would be a pointer that can only point to one particular memory location.
[02:19:48] you're hurting my lawyer-brain!
[02:20:15] or even const char * const, which is a pointer that can only point to a particular memory location which contains a const char.
[02:20:20] :)
[02:20:33] I taught C++, I don't really know it though ;)
[02:20:56] dammit, I was gonna enlist you
[02:21:15] start with a C++ connector to GeoIP.h, then a standardised CRAN library!
[02:21:19] AND THEN, THE WORLD, or something.
[02:21:24] I suck at practical problems.
[02:21:46] I'm not clear on how one practically works with libraries in C++.
[02:22:01] I just figured it out for the sake of putting together a couple of assignments.
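A minimal sketch (names invented here, nothing from the log) of the std::string-to-const char * conversion and the three const placements being walked through above:

```cpp
#include <iostream>
#include <string>

int main() {
    std::string ip = "192.0.2.1";

    // std::string owns a variable-length buffer; c_str() exposes a
    // read-only, NUL-terminated view of it without copying.
    const char* p1 = ip.c_str();        // pointer to const char:
                                        //   *p1 read-only, p1 can move.
    char* const p2 = &ip[0];            // const pointer to char:
                                        //   p2 fixed, *p2 writable.
    const char* const p3 = ip.c_str();  // const pointer to const char:
                                        //   neither can change.

    p1 = "198.51.100.7";  // fine: only the pointee was const
    *p2 = '1';            // fine: only the pointer was const
    // *p1 = 'x';         // compile error: pointee is const
    // p3 = p1;           // compile error: pointer is const

    std::cout << ip << " " << p3 << "\n";

    // The classic output-side segfault from the log: keeping a c_str()
    // pointer alive after the std::string it came from is destroyed
    // (or reallocated) leaves it dangling.
    return 0;
}
```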
[02:22:24] aha
[02:22:45] I finally worked out how to make the libraries play nice (sourceCpp still doesn't work, but build-and-reload does) - that was my original blocker
[02:22:53] now it's "trying permutations of things until something doesn't make it crash"
[02:23:01] which is...pretty much how I learn anything, now that I think about it.
[02:23:14] oh god. Is that how practical C++ works?
[02:23:21] yes
[02:24:09] o/ Guerillero
[02:24:11] :)
[02:24:19] it's the wonders of compiled languages!
[02:24:32] if you write something that is syntactically invalid, it'll tell you when it compiles
[02:24:47] if you write something syntactically valid that just doesn't work, it'll silently compile and then blow up when you look at it funny
[02:24:51] * Ironholds jazz hands
[02:25:07] at least when R blows up it doesn't segfault. Actually, I've managed to make it segfault a few times, but it was pretty hard.
[02:25:21] regardless, this was not how I wanted to spend 9:30pm
[02:25:29] dereference NULL and it will happen
[02:26:21] how are you halfak?
[02:27:02] Not bad. Also trying to salvage an evening.
[02:27:15] Gotta get a links table loaded for tomorrow.
[02:27:23] * halfak becomes a human data pipeline
[02:28:20] It works! But my hadoopin crashed.
[02:29:30] Just crash already hadoop!
[02:37:37] Ok. That's enough. Good night!
[02:38:16] night halfak!
[14:55:07] wheee, got the C++ working.
[14:55:18] You know, I should really port this into a generic R/Cpp library.
[14:55:25] R doesn't have a geolocation API through MaxMind, see.
[14:59:35] ...bwuh
[14:59:40] dispatch happened so fast my terminal didn't change
[14:59:42] it just completed
[15:04:14] hey halfak, who wants benchmarks? ;)
[16:13:08] o/ everyone
[16:13:53] morning Nettrom!
[16:13:58] wow, this place is quiet this morning
[16:35:54] Hey Ironholds. Having a weird morning.
[16:35:59] you okay?
[16:36:03] Totally.
[16:36:10] Gave an interview to a russian newspaper.
[16:36:23] Re. coverage bias in Wikipedia.
[16:36:37] ...please tell me it wasn't Russia Today ;p
[16:39:12] Nope. Sputnik something or other.
[16:39:22] they are in London and serve russian readers.
[16:40:18] Say, where do I find the maxmind DB for use in pygeoip?
[16:40:30] I'm trying a speed test.
[16:41:05] 'angon
[16:41:20] halfak, https://github.com/Ironholds/WMUtils/blob/master/inst/geo_country.py
[16:41:34] you want flags=1, which caches the db in memory, if you want it equivalent
[16:41:54] Boo. We need to identify ipv6 ourselves.
[16:41:55] Lame.
[16:42:01] ehh
[16:42:06] if(re(":"))
[16:42:10] hey leila, I got bored and wrote an interface to MaxMind's API into WMUtils
[16:42:17] it can geolocate 10k IPs to country level in /sixty milliseconds/
[16:42:26] * Ironholds dances
[16:42:38] that's very useful Ironholds. thanks!
[16:42:50] don't thank me yet, it's not released
[16:43:01] but it means (amongst other things) that there will no longer be Python dependencies to the library
[16:43:07] just install.packages() and you're off!
[16:51:48] 785ms for python
[16:52:03] So, an order of magnitude.
[16:58:24] yay!
[16:58:36] now I just need to work out why region retrieval keeps crashing. I think I know why...
[17:50:37] Ironholds: your c++ libs are on stat1002 now!
[17:50:44] ottomata, thankee!
[17:50:52] that means I can integrate ua-parser's C++ port :D
[17:51:15] Ironholds: methinks if you can code C++, you can code java!
[17:51:21] i will no longer believe you when you say that!
[17:51:30] I can barely write C++! I can, however, now read it.
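A hedged sketch of the kind of GeoIP.h connector being discussed, assuming the legacy libGeoIP C API with its standard .dat databases installed (paths here are illustrative). GEOIP_MEMORY_CACHE is the C-level counterpart of pygeoip's flags=1 in-memory caching, and the ':' test is the "identify ipv6 ourselves" check from the log:

```cpp
#include <GeoIP.h>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Return the ISO country code for one IP, or "Invalid" on failure.
// libGeoIP keeps separate v4/v6 databases, so the caller has to
// detect IPv6 itself; presence of ':' is the cheap test used here.
std::string country_of(GeoIP* v4, GeoIP* v6, const std::string& ip) {
    const char* result = std::strchr(ip.c_str(), ':') != nullptr
        ? GeoIP_country_code_by_addr_v6(v6, ip.c_str())
        : GeoIP_country_code_by_addr(v4, ip.c_str());
    return result ? result : "Invalid";
}

int main() {
    // Hypothetical database locations; adjust to your install.
    GeoIP* v4 = GeoIP_open("/usr/share/GeoIP/GeoIP.dat", GEOIP_MEMORY_CACHE);
    GeoIP* v6 = GeoIP_open("/usr/share/GeoIP/GeoIPv6.dat", GEOIP_MEMORY_CACHE);
    if (!v4 || !v6) {
        std::cerr << "Could not open GeoIP databases\n";
        return 1;
    }
    std::vector<std::string> ips = {"91.198.174.192", "2620:0:862:ed1a::1"};
    for (const auto& ip : ips) {
        std::cout << ip << "\t" << country_of(v4, v6, ip) << "\n";
    }
    GeoIP_delete(v4);
    GeoIP_delete(v6);
    return 0;
}
```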
[17:51:35] but yes, I'll look at Java next ;p
[17:51:38] i can barely do either!
[18:30:08] wth https://metrics.wmflabs.org/static/public/dash/#projects=itwiki/metrics=RollingSurvivingNewActiveEditor
[18:43:58] halfak: ping
[18:44:15] Hey gwicke
[18:44:17] hey
[18:44:31] Erik and Siko are asking me about mentoring https://meta.wikimedia.org/wiki/Grants:IEG/Revision_scoring_as_a_service
[18:44:42] and I'm wondering if you are planning to mentor this as well
[18:44:50] I'm leading it. :)
[18:44:57] ah, great ;)
[18:45:02] can someone get ewulczyn another "achievement unlocked"? ;p
[18:45:21] gwicke, We could definitely use some support. I'm also interested in your plans for restbase and how we might take advantage of that.
[18:45:24] halfak: I knew that you are deeply involved there, but saw you listed as a 'volunteer'
[18:45:35] Yeah. That's 'cause I can't get paid.
[18:45:37] ;)
[18:46:14] as in "can't get paid from the grant"?
[18:46:26] Oh yes. I accept moneys otherwise :)
[18:46:33] kk ;)
[18:46:53] But seriously though, I'd like to sit down with you to discuss the project and your vision for MediaWiki services.
[18:47:02] However, I'd like to delay that about 1 week.
[18:47:03] yes, that would be great
[18:47:04] Would that be OK?
[18:47:20] I shall respond soon re mentoring
[18:47:46] Are you in PST?
[18:47:47] for that it would be useful to have an implementation sketch
[18:47:57] yes, PST
[18:48:20] it sounds like most of this would be offline processing, with the results being stored somehow & exposed through an API
[18:48:39] although it should also be timely I suspect
[18:49:04] to be useful to detect vandalism quickly
[18:49:19] Yeah. I was imagining that we'd have a service that generates and caches scores when requested -- and another service that asks the first to generate scores for revisions as they happen.
[18:49:41] So that new requests would always be able to get a score and that most requests would get a score from the cache.
[18:49:45] yeah
[18:49:56] similar to a lot of other services
[18:50:03] parsoid for example
[18:50:04] Yup
[18:51:02] sounds sane to me
[18:51:23] I am happy to help with the normal plumbing around caching, refresh jobs & API
[18:51:50] but I would assume that you'd mentor anything to do with the actual classification / scoring
[18:52:30] gwicke, I just sent an invite for next Wednesday.
[18:52:34] gwicke, sounds about right.
[18:52:47] okay, cool
[18:53:00] We already have substantial code in place to make building new classifiers from the feature set straightforward.
[18:53:31] It seemed like that'd be the hardest problem so we tackled it first.
[18:54:08] is this straightforward, or does it involve special resource requirements?
[18:54:42] You need a lot of CPU to train the classifier, but then it can be used efficiently from there forward.
[18:54:53] I suspect that training new classifiers (or updating old ones) will be manual.
[18:55:07] *nod*
[18:55:29] and the classifier fits in memory on an average node?
[18:55:36] Yeah. Shouldn't be a problem.
[18:55:53] We should be able to have several in memory at the same time.
[18:55:54] good
[18:56:12] which tech are you using for this?
[18:56:47] (not that it matters too much, just curious)
[18:56:48] Python + scikit-learn
[18:57:42] is this already a long-running service, or is it currently a one-shot commandline thing?
[18:58:24] No service is up and running yet.
[18:59:25] kk
[18:59:47] let's chat about it in more detail next week
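The design described above boils down to a generate-on-miss cache plus a second service that requests scores as revisions arrive. A toy sketch of that shape, with all names and the scoring function invented (the real system is Python + scikit-learn; this only illustrates the caching pattern):

```cpp
#include <iostream>
#include <unordered_map>

// Stand-in for the trained model; a real deployment would hold a
// classifier in memory and score the revision's features.
double classify(long rev_id) {
    return (rev_id % 1000) / 1000.0;  // hypothetical "vandalism" score
}

class ScoreCache {
public:
    // Generate-on-miss: new requests can always get a score, and
    // repeat requests are served from the cache.
    double score(long rev_id) {
        auto it = cache_.find(rev_id);
        if (it != cache_.end()) return it->second;
        double s = classify(rev_id);
        cache_.emplace(rev_id, s);
        return s;
    }
private:
    std::unordered_map<long, double> cache_;
};

int main() {
    ScoreCache scores;
    // The "second service": score each revision as it happens,
    // pre-warming the cache for later consumers.
    const long new_revisions[] = {641234567L, 641234568L};
    for (long rev : new_revisions) scores.score(rev);
    // A later consumer (e.g. a vandalism patroller) hits the cache.
    std::cout << scores.score(641234567L) << "\n";
    return 0;
}
```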
[18:59:50] I have a Quarry / SQL / Wikimetrics question. I'm trying to duplicate the bytes_added metric, so that I'll be able to run that sort of query on arbitrary cohorts without manually uploading them to Wikimetrics.
[18:59:52] Sounds good.
[19:00:00] Here's what I have that doesn't work: http://quarry.wmflabs.org/query/963
[19:00:09] * halfak looks at ragesoss's query
[19:01:10] ragesoss Use revision_userindex
[19:01:20] * halfak reads more
[19:06:39] ragesoss, page_namespace is an int
[19:07:34] halfak: one final question: when is the revision scoring project scheduled to start?
[19:08:19] The funding starts in January. For me, the project starts once I've cleared my evening and weekend "work" enough to take it on -- so late next week. :)
[19:08:33] mostly wondering when most of the work on the services end would need to happen
[19:08:39] ragesoss, timestamp format should be YYYYMMDD
[19:09:00] ragesoss, halfak: also, enwiki_p.revision should be enwiki_p.revision_userindex ?
[19:09:28] Nettrom, yup ([13:01:07] ragesoss Use revision_userindex)
[19:09:37] halfak: I was testing it, and the other format seems to work fine in otherwise working queries.
[19:09:54] ragesoss, I don't think that is possible.
[19:10:00] MySQL doesn't know that rev_timestamp is a date
[19:10:06] It just knows it is a string of bytes.
[19:10:15] Where "1" < "2"
[19:12:17] okay. I was a little confused, but that came right from the wikimetrics documentation.
[19:13:29] eek. Can you point me to that?
[19:13:52] ragesoss, http://quarry.wmflabs.org/query/1049
[19:14:25] halfak: http://git.wikimedia.org/blob/analytics%2Fwikimetrics/c6793b96c5e9a225b4d9de467aeecb983e5765e8/wikimetrics%2Fmetrics%2Fbytes_added.py
[19:15:48] halfak: thanks! really, really appreciate it!
[19:15:57] :D hth, dude
[19:16:58] ragesoss. A quick demo of the timestamp problem: https://gist.github.com/halfak/d80a87ae739ee1b238ff
[19:17:26] halfak: yeah, I tested another query and discovered that it also gives an answer, just different.
[19:24:28] halfak: another question: is there a big difference in efficiency if I run that for rev_user_text instead of rev_user?
[19:26:23] I don't think so. I was going to suggest that. Let me check.
[19:27:57] Looks like there's an appropriate index on rev_user_text too.
[19:28:10] How do you plan to handle the "c.rev_user IN (319203)"
[19:28:22] yeah, that's what I sort of remembered, since all the other queries worked.
[19:28:33] halfak: I switched it for rev_user_text.
[19:28:44] see http://quarry.wmflabs.org/query/963
[19:29:04] Ragesock!? INAPPROPRIATE!
[19:29:06] ;)
[19:29:20] see also Rage Sauce
[19:30:49] In January, we'll be building a dashboard system for courses, and it'll probably rely on running SQL queries on labs to pull a lot of the data, including this (as well as lists of articles edited)
[19:31:14] so I'm trying to remove as many of the roadblocks to gathering the data as I can, ahead of time.
[19:31:29] Ragesoss, it seems that there's growing demand for such a solution.
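A small illustration, in the spirit of halfak's gist above (values here invented): MySQL compares rev_timestamp byte-wise, exactly like std::string's comparison operators do, so the YYYYMMDDHHMMSS format sorts chronologically while a mismatched format compares without error but gives the wrong answer:

```cpp
#include <iostream>
#include <string>

int main() {
    // In the matching format, byte order and chronological order agree.
    std::string january  = "20140115000000";
    std::string december = "20141201000000";
    std::cout << (january < december) << "\n";  // 1: correct

    // With a mismatched format the comparison silently goes wrong:
    // at the fifth byte, '0' (0x30) > '-' (0x2D), so a January 2014
    // revision compares "later" than a December cutoff.
    std::string cutoff = "2014-12-01 00:00:00";
    std::cout << (january > cutoff) << "\n";    // 1: wrong, but no error
    return 0;
}
```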
[19:32:40] now if only we could get pageview data like that...
[19:33:07] instead of my current plan of flooding stats.grok.se with json requests.
[19:34:55] ragesoss: if you're building it on labs, might want to see if the Wiki Viewstats folks can give you access to their database, works if you have a limited number of pages/articles
[19:35:41] Nettrom: The dashboard itself is probably not going to be on labs, but I was imagining it could ssh into labs to do the stuff it needs to do.
[19:36:05] but I'll follow up with Wiki Viewstats.
[19:36:21] I'm not 100% sure what's up with that project, the main one is down but there's a copy up
[19:36:30] http://tools.wmflabs.org/wikiviewstats2/
[19:39:19] heh... that tool also seems to pull data from stats.grok.se (in addition to doing its own stuff, I guess?)
[19:39:34] yes, they show you the stats.grok.se numbers for comparison
[19:39:40] they count things slightly differently, I think
[19:39:57] there should be a description of it in the documentation somewhere, IIRC
[19:40:09] stats.grok.se is probably good enough for our purposes for now, since it's been pretty reliable and we won't (I think) be overwhelming it with the scale of our project.
[20:18:56] okay, structure pointers with char pointers within them
[20:18:58] I give up for today.
[20:19:18] we'll just have to do city- and region-level geolocation via python
[21:04:40] ewulczyn, good news!
[21:04:41] nothing broke
[21:11:24] DarTar, new run complete
[21:11:30] Hey Ironholds, do you have an example pageview hive query I could use?
[21:11:34] still got some problems with project names, which I'll have to iron out later, but.
[21:11:40] halfak, like, the entire definition in one query?
[21:11:51] nope :(. I've got a very barebones example?
[21:12:04] it doesn't handle some localised idiosyncrasies, but it gets most of the weirdness.
[21:12:37] Ironholds: \o/
[21:12:42] halfak: would you be interested in using the webstatscollector ported one?
[21:12:50] Ironholds: who should we poke from the other channel?
[21:13:12] ottomata, that one's 15% off.
[21:13:16] haha, ok cool
[21:13:17] just checkin!
[21:13:18] :)
[21:13:18] DarTar, marcel, probably.
[21:13:23] it's the only one I know of :p
[21:13:29] Ironholds, barebones would be great.
[21:13:36] ottomata, what do you mean?
[21:13:51] Are you talking about a HIVE query?
[21:14:01] Or code for processing request logs?
[21:14:05] this crazy thing
[21:14:06] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webstats/insert_hourly_pagecounts/insert_hourly_pagecounts.hql
[21:14:07] yeah
[21:14:18] it is a hive query
[21:14:28] halfak, very basic but https://github.com/Ironholds/EveryDayImSessioning/blob/master/Queries/mobile_web_readers.hql#L17-L30
[21:14:29] using the same logic that webstatscollector uses
[21:14:37] you need different params for apps/mobile web, unfortunately (womp womp)
[21:14:38] that is what pagecount-all-sites is using
[21:14:50] Ironholds, ottomata: can we haz them on wikitech?
[21:16:20] hah, DarTar, would you also like me to put the code of webstatscollector on wikitech ?:p
[21:17:00] ottomata: no, I'll pass on that one :)
[21:17:22] DarTar, for you, I'll write up a /complete/ definition for wikitech ;)
[21:17:59] Yes please. :) Really, I just want a way to filter out the non-pageview requests so that I can work with what's left over.
[21:18:10] I wonder if setting up a view would be possible/appropriate.
[21:23:57] will do!
[21:27:24] leila: is WDI the package you were referring to?
[21:27:35] DarTar, no let me check
[21:27:47] k
[22:45:49] hey DarTar, you getting in the call?
[22:46:42] brb going to make tea.
[23:39:11] FYI (leila, ewulczyn, Ironholds) -- DarTar and I are going to move around a bunch of trello cards. Don't worry.
[23:39:28] tnegrin, ^
[23:39:29] tnegrin: ^
[23:39:34] Thanks for the heads up, halfak
[23:40:19] things are going to be crazy for an hour or so while we clean up, but eventually all the cards will be back; before we archive something we're not sure about, you'll receive a ping on Trello
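For reference, on the "structure pointers with char pointers within them" that prompted the give-up at 20:18: the legacy libGeoIP region lookup returns a heap-allocated GeoIPRegion struct whose members are small char arrays, and the usual crash is reading those members after the struct has been freed. A hedged sketch (paths illustrative, assuming libGeoIP and its region-edition database):

```cpp
#include <GeoIP.h>
#include <iostream>
#include <string>

// Copy the char arrays out of the returned struct into a std::string
// before freeing it, so no pointer into freed memory survives.
std::string region_of(GeoIP* gi, const char* addr) {
    GeoIPRegion* r = GeoIP_region_by_addr(gi, addr);
    if (!r) return "Unknown";                // lookup failed entirely
    std::string out = r->region[0] ? r->region : "Unknown";
    GeoIPRegion_delete(r);                   // struct is heap-allocated
    return out;
}

int main() {
    // Hypothetical database path; GeoIPRegion.dat is the region edition.
    GeoIP* gi = GeoIP_open("/usr/share/GeoIP/GeoIPRegion.dat",
                           GEOIP_MEMORY_CACHE);
    if (!gi) {
        std::cerr << "Could not open region database\n";
        return 1;
    }
    std::cout << region_of(gi, "91.198.174.192") << "\n";
    GeoIP_delete(gi);
    return 0;
}
```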