[04:22:38] evening research party people
[04:28:24] i optimized my report script
[04:28:30] instead of taking two months it took under 18 hours
[04:29:16] NICE
[04:29:17] what did you do?
[04:29:28] replaced a couple of the API calls with database queries
[04:29:45] loading thousands of pages over the API is... slow
[04:29:48] yeahhhh
[04:30:14] tools.projanalysis@tools-trusty:~/projanalysis$ cat WikiProject\ Football.csv | grep *TOTAL*
[04:30:14] *TOTAL*,9708,1394522,1404230
[04:30:40] 1.4 million edits to football-related articles in a year
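(Editor's note: a minimal sketch of the database-side approach mentioned at 04:29:28 — counting a year of edits in one replica query per batch instead of loading thousands of pages over the API. The host name, credentials file, and schema here follow the usual Tool Labs replica conventions; none of this is taken from the actual report script.)

```python
import os
import pymysql

# Connect to the enwiki Labs replica; host and credentials path are the
# conventional Tool Labs defaults and may differ in other environments.
conn = pymysql.connect(
    host="enwiki.labsdb",
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8",
)

def edit_count(page_ids, start="20140101000000", end="20150101000000"):
    """Count edits to the given pages in the window, in a single query."""
    placeholders = ",".join(["%s"] * len(page_ids))
    query = (
        "SELECT COUNT(*) FROM revision "
        "WHERE rev_page IN ({}) AND rev_timestamp BETWEEN %s AND %s"
    ).format(placeholders)
    with conn.cursor() as cur:
        cur.execute(query, list(page_ids) + [start, end])
        return cur.fetchone()[0]
```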
[15:52:31] morning
[15:54:20] Hey Ironholds
[16:01:23] _o/
[16:01:45] o/ guillom
[16:01:55] Good morning :)
[16:02:01] halfak: Great work on the CC track research page!
[16:03:03] :D I hope it proves useful
[16:03:14] Still a lot of work I'd like to do if hours were free
[16:03:36] Oh, how I understand.
[16:04:08] the cluster needs more machines or less stupid jobs
[16:04:09] urgh
[16:06:19] guillom: hey! wanna see one of the things I’m working on in my spare time?
[16:06:21] https://jupyter.wmflabs.org/
[16:06:25] (since you liked quarry)
[16:06:39] can someone explain to me why we don't just stick stats.grok.se on labs?
[16:06:48] given that the dumps are publicly wgettable
[16:07:19] good Q
[16:07:25] ...
[16:07:32] this is going to end with me trying to reimplement stats.grok, isn't it
[16:09:08] heh. Someone already did that on labs and then they disappeared.
[16:09:31] YuviPanda: It looks cool; I've got no idea what it does though :) I logged in, clicked on "My server" and got a 500.
[16:09:38] bah
[16:09:41] https://tools.wmflabs.org/wikiviewstats/
[16:09:45] guillom: I’ll poke you again once I fix it :)
[16:09:55] YuviPanda: Fair enough :)
[16:09:57] darnit
[16:10:00] guillom: but basically, it is a way to run ipython notebooks on labs, with access to db, dumps, etc
[16:10:01] well, I'll implement it BETTER
[16:10:15] Ironholds, implement a maintainer :)
[16:10:29] Or a service that requires no human intervention.
[16:10:29] ah, I'm not a good enough engineer for that, I'm afraid
[16:10:33] heh
[16:11:16] a service that requires no human intervention? Got it!
[16:11:20] Presenting... Rocks As A Service.
[16:11:39] Solid as a rock. Actually a rock.
[16:11:57] this reminds me of the dude I know who did a presentation at BarCamp London on a noSQL store with system-speed read/write times and infinite capacity
[16:12:09] "Storing your data in /dev/null"
[16:12:16] it was actually a really well done presentation!
[16:12:22] Ironholds: BTW I have something for you if you're bored during your Copious Free Time. It involves looking at the most-viewed enwp articles and figuring out 1. which are the most viewed "all the time" and 2. which are the most viewed seasonally (e.g. Christmas, July 4th, etc.).
[16:12:44] guillom, unfortunately, that would require me to believe that the per-article pageview counts are worth a warm bucket of spit
[16:13:16] they have approximately zero automata or crawler detection
[16:13:46] Ironholds: Well, I need to use *something* to figure out "popular articles" to add them to my keyword list.
[16:14:01] It's either that or Google Webmaster Tools.
[16:14:09] then those are as good as anything!
[16:14:14] just don't include Angelsburg
[16:14:35] Do I even want to ask… why?
[16:14:41] because Angelsburg gets 20 million hits a week
[16:14:58] Angelsburg gets 20 million hits a week because Operations decided they needed a page to test server response times on
[16:15:03] and they picked a random city in Germany
[16:15:05] and didn't tell us
[16:15:37] * halfak facepalms
[16:15:38] I'm laughing but I don't know if it's because it's funny or because it's terrifying.
[16:15:38] and, when I threw a patch into puppet to change it to /wiki/Undefined, which ALSO gets millions of fake hits as a result of an error somewhere in our JS, they -2'd it because "oh, analytics should just add an exception"
[16:16:17] That doesn't make any sense.
[16:16:30] and then I spent an extended period of time fantasising about hauling off and screaming at some of our engineers for being INCREDIBLY FUCKING STUPID in deciding that changing something in a way that cost them precisely NOTHING was more of a PITA than putting OTHER engineers on the hook for adding an infinitely expanding list of exceptions to sort out /their/ shitty business practices that they /fail to inform us about/
[16:16:33] Do they set their user-agent to something reasonable for the request?
[16:16:41] halfak, Twisted. Fucking. PageGetter.
[16:16:47] * Ironholds seethes
[16:17:01] it's just a python script. A python script they hit us with millions of times a week for testing.
[16:17:24] guillom, oh, and testing mobile responsiveness? yeah, that hits the main page. Because why be consistent?
[16:17:34] >_<
[16:18:03] I've got half a mind to push the patch in again and ask for an explanation of why they think it's acceptable to make WMF employees spend donor money implementing an ever-increasing list of exceptions rather than saving all that money by hitting +2.
[16:18:23] so yes, I seethed and raged and felt like James F for a long time and now I try not to think about it because that attitude makes me /so very angry/.
[16:19:18] Ironholds, any idea if someone is picking up automata detection now?
[16:19:31] to my knowledge, nobody, except some halfhearted ellery experimentation
[16:20:06] which is overkill because he's looking at concentration measures as a first line of defence when it's a computationally EXPENSIVE first line of defence, and we'd do better with a regex and then concentration measures on everything left over, but..
[16:20:49] (it's better than nothing, mind. I'm just tired of seeing us pick the slow-to-run way instead of the slow-to-implement way. We run a lot more often than we implement)
[16:21:03] what I really want to do in an ideal world is take ottomata's streaming pageviews experimentation, right?
[16:21:15] and then I want to steal a page from FR-Tech's playbook and build up a set of heuristics for automata
[16:21:26] and then I want to give each request a probability as it comes through
[16:22:03] * Ironholds nods firmly
[16:22:05] Sounds solid to me.
[16:22:32] agreed, but I'm not seen as skilled enough to implement it usefully.
[16:23:15] so instead, we get zero automata filtering. Woo? ;p
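(Editor's note: a rough sketch of the two-stage filtering described at 16:20:06 — a cheap regex over the user agent as the first line of defence, with the expensive concentration measures reserved for whatever survives it, each request getting a probability as it comes through. The pattern and probability values are illustrative assumptions, not a production list.)

```python
import re

# Cheap first pass: obvious automata by user agent. "Twisted PageGetter"
# is the default UA of Twisted's getPage, the ops test script from 16:16:41.
BOT_UA = re.compile(
    r"bot|crawler|spider|curl|wget|python|Twisted PageGetter",
    re.IGNORECASE,
)

def automata_probability(user_agent, concentration_check=None):
    """Assign a request a rough probability of being automated."""
    if not user_agent or BOT_UA.search(user_agent):
        return 1.0  # caught by the cheap regex pass
    if concentration_check is not None:
        # Expensive second pass (e.g. per-client request concentration),
        # run only on the traffic the regex could not classify.
        return concentration_check(user_agent)
    return 0.0
```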
[16:23:38] guillom, so what's the most pressing thing on our plate right now? I'm working on this www.wikipedia.org piece.
[16:23:47] (also, do *we* get an IRC channel? All the cool kids are doing it ;p)
[16:25:26] Ironholds: My priority this week is the keyword thing. Because once it's set up it can just work silently on its own.
[16:25:54] Ironholds, you guys should stay in this channel.
[16:25:56] Ironholds: I say the cool kids are already in this channel.
Works for me :p
[16:26:01] Note that it isn't called R&D :)
[16:26:03] Heh :)
[16:26:57] neat!
[16:28:04] Ironholds: Which is why I was asking about the pageview-based keyword list. But it's not my place to assign work to you, hence why I said "if you have time and are bored" :)
[16:28:25] Aha. Well, it's kind of a massive task because our infrastructure is...not oriented to do that with historic data :(
[16:28:35] Fair enough.
[16:28:44] like: we are very gradually shifting away from infrastructure with a series of flaws, the most major of which is: it's optimised for very specific tasks/ways of working.
[16:28:57] Google Webmaster Tools is also limited to the last 3 months. I'll figure out something.
[16:29:05] and one legacy of that is that if you want the most prominent titles in the last 12 months?
[16:29:12] whelp, grab a year of pageviews, aggregate and sort.
[16:29:25] I may ask leila, since she's done some work on this as well.
[16:29:28] is there an API? No, you screen-scrape to work out the filenames and then wget them and then unzip them and then smush them together :(
[16:29:40] makes sense!
[16:30:33] Everything that involves the word "scrape" sounds painful.
[16:31:38] guillom, yep
[16:31:55] there's no API and while the filenames are sequential, there are variable numbers
[16:32:11] so, August 2014? Ah! 22 sequential files starting 20140801
[16:32:14] Ironholds, do you know if we can access the directories that host those files directly?
[16:32:14] September? 14.
[16:32:28] Of course. Because it wouldn't be fun if things were too simple.
[16:32:43] halfak, I actually don't. One would think they'd live on the partition on stat2 with the other dumps?
[16:32:49] I haven't investigated
[16:33:17] Oh dear god. They are in a single directory. /mnt/data/pagecounts/incoming/
[16:33:45] Wait... this seems to only go back to the beginning of the year.
[16:37:25] * Ironholds headdesks
[16:37:41] but it's okay, we don't have any pressing need for more warm bodies who know how to burn legacy stuff with fire
[16:38:18] We should be able to get the pagecount logs on NFS for both the analytics cluster and labs.
[16:38:24] That would remove the need to scrape.
[16:38:27] da
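(Editor's note: a sketch of the "grab a year of pageviews, aggregate and sort" workflow from 16:29. As noted above, the number of hourly files per month varies, so real code has to scrape the directory index for filenames first; this sketch assumes one name is already known. Written in Python 3 as an illustration, not as anyone's actual script.)

```python
import gzip
import urllib.request
from collections import Counter

# Hourly pagecounts dumps; the per-month file list varies, so in practice
# you list the directory for names first (the painful part described above).
BASE = "https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/"

def add_hour(counts, filename, project="en"):
    """Fetch and unzip one hourly file, smushing its counts into `counts`."""
    local, _ = urllib.request.urlretrieve(BASE + filename)
    with gzip.open(local, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split(" ")
            # Line format: project page_title view_count bytes_transferred
            if len(fields) == 4 and fields[0] == project:
                counts[fields[1]] += int(fields[2])

counts = Counter()
add_hour(counts, "pagecounts-20141001-000000.gz")
print(counts.most_common(10))  # most prominent titles so far
```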
[16:39:59] halfak, this might actually be the thing that makes me learn python
[16:40:44] making a RESTful pageviews API
[16:41:00] halfak: yeah, a thing in stat* that rsyncs to labs NFS would be useful, I reckon
[16:41:09] http://pageviews.wmflabs.org/en.wikipedia.org/20141001/Double-entry_bookkeeping_system/
[16:41:34] hey YuviPanda, you wanna collaborate on a thing? ;)
[16:41:40] \o/ python
[16:41:46] the pageview thingy?
[16:41:53] oh man
[16:41:55] I saw the link
[16:41:57] AND WAS SO HAPPY
[16:41:58] writing the only good API in the wikimedia system
[16:41:58] YuviPanda: if that rsync is to exist, you should do it from datasets, where that data actually lives, not the stat mount
[16:42:04] YuviPanda, that is, my fake link? ;p
[16:42:06] datasets1001 i think
[16:42:09] WHY DO YOU RAISE MY HOPES AND THEN CRASH EVERYTHING
[16:42:14] hahahahaha
[16:42:18] I'm saying we should build this thing!
[16:42:20] ottomata: yup yup. just a ‘shared’ one. in some form.
[16:42:23] then you could be responsible for that URL!
[16:42:32] then mediawiki devs can look at the URL and go "oh man, I wish our thing did that"
[16:42:34] Ironholds: so toby tried to convince me to do that.
[16:42:42] and I did some calculations...
[16:42:49] and nope’d out. our labs infra isn’t big enough.
[16:42:52] you’d need raw hardware
[16:42:59] really? :/
[16:43:01] hrm
[16:43:06] yup
[16:43:13] say you want hourly counts?
[16:43:16] stick that data in RESTbase
[16:43:20] talk to gabriel :)
[16:43:25] so it’s 24 * 365 * whatever
[16:43:26] hour*numberofpages==hour
[16:43:30] hi everyone
[16:43:30] *ow
[16:43:34] hey leila :)
[16:43:38] * whatever years
[16:43:41] that’s a lot of data :)
[16:43:42] hi Ironholds. :-)
[16:43:46] o/ leila
[16:43:51] the largest labs instance gives you 160gigs of space
[16:43:52] hi halfak.
[16:43:52] morning leila
[16:43:55] YuviPanda, 24 * 365 * 742 * number_of_pages * 2
[16:43:56] mornin guillom.
[16:44:09] each year
[16:44:11] that is a big number
[16:44:14] we can double it to give you 300G if people need.
[16:44:16] yeah
[16:44:20] hmn
[16:44:26] I just did a back of the envelope calculation for enwiki alone
[16:44:26] well, I may build a prototype system in my spare time
[16:44:30] like, just take a month of data or something
[16:44:31] you should!
[16:44:33] yup
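(Editor's note: the back-of-envelope arithmetic from 16:43:55, written out. The per-wiki page count and record size are assumptions purely to show the order of magnitude; the 742 wikis and the hourly/mobile-desktop factors come from the chat.)

```python
# 24 * 365 * 742 * number_of_pages * 2: hourly counts, every wiki,
# mobile and desktop split, for one year.
hours_per_year = 24 * 365           # 8,760
wikis = 742
pages_per_wiki = 50000              # assumed average, purely illustrative
rows = hours_per_year * wikis * pages_per_wiki * 2
bytes_per_row = 16                  # assumed compact (timestamp + count) encoding
print(rows)                         # 649,992,000,000 rows per year
print(rows * bytes_per_row / 1e12)  # ~10.4 TB/year: far past a 160 GB instance
```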
[16:44:43] particularly since our pageview count files currently contain more data than they need
[16:44:51] I'll use URLdecode + the new pageviews count to make it less stupid
[16:44:54] * Ironholds nods firmly
[16:44:58] * Ironholds starts a job running
[16:45:22] ooh, you know what we need?
[16:45:28] a uri_path normaliser UDF
[16:45:28] Ponies?
[16:47:35] actually, what we really need is something that UTF-8s and decodes a URL when it makes it into the system (ottomata, cc)
[16:47:58] because at the moment we have localised variants throughout the db and it makes filtering a pain
[16:48:04] %D%I%E%D%I%E
[16:48:29] YuviPanda, oh, did you not know this?
[16:48:36] what locale are the strings in hadoop in? ;)
[16:48:50] why, whatever locale they were sent with, of course! :D
[16:48:53] Ironholds: you should use page_id :
[16:48:53] :)
[16:48:55] anyway :)
[16:49:01] ottomata, that doesn't work for apps!
[16:49:11] relatedly, I discovered a load of /wiki/ alternatives we've been missing
[16:49:16] I should send an email about that.
[16:49:25] Ironholds: we should make it work for apps
[16:49:36] getting the title from the url is just buggy all around
[16:49:51] ottomata, well, I think apps needs to make it work for apps
[16:50:05] we need to make apps make it work for apps
[16:50:05] as I understand it the problem is they don't use mediawiki so they bypass the extension that resolves this
[16:50:09] * Ironholds nods
[16:50:14] it needs to be an analytics priority to make it happen
[16:50:19] I agree!
[16:50:22] it wasn't for mw, i just had ori do it
[16:50:59] I bet ori will just do it for apps too someday :P
[16:51:07] I think he already has a patch in one of the apps...
[16:52:50] neat!
[16:52:57] then we should get them to prioritise reviewing it ;p
[16:53:52] YuviPanda, but seriously
[16:53:56] the structure we need is something like...
[16:54:12] http://pageviews.wmflabs.org/en.wikipedia.org/2014/10/01/Barack_Obama/
[16:54:27] Ironholds, what about a date range?
[16:54:36] This looks like it would have the partition problem
[16:54:41] halfak, yeah; point!
[16:54:59] so we could do project/page/dates and then have a reserved pagename for "all" or something.
[16:55:16] and then that defaults to actually showing some nice JS visualisation of the JSON data, while */raw gives you the actual JSON blob
[16:55:19] * Ironholds beardstrokes
[16:55:36] you know who we need in on this?
[16:55:42] we need paultag.
[16:55:48] http://pageviews.wmflabs.org/enwiki/Barack_Obama?start=20140101&end=20140102
[16:56:01] I like! I don't like the /enwiki/ but I like everything else.
[16:56:04] Or maybe just ?date=20140101
[16:56:08] How come?
[16:56:12] It's a standard identifier
[16:56:17] within the db structure
[16:56:18] Used in XML dumps and the DB.
[16:56:32] okay, if it's in the XML dumps too, fair point
[16:56:35] enwiki is fine then
[16:56:37] also, does en.wikipedia.org include en.m.wikipedia.org?
[16:56:41] it does!
[16:56:49] so that's ambiguous
[16:57:01] you are smarter than I :D. I'm mostly just spitballing
[16:57:11] All I want is paultag, a new cluster and about three months
[16:57:15] that's not so much to ask
[16:59:18] halfak, the stats.grok.se json structure is pretty sensible, though
[16:59:28] you might have to split it to provide separate mobile/desktop views
[16:59:30] but other than that!
[17:00:09] +1
[17:00:44] and if the API is not written in a dumb way, dealing with the new pageviews structure will be a piece of cake...
[17:00:46] * Ironholds beardstrokes
[17:01:01] I can write this. I have the technology and the absence of a life, and I've almost wound up my existing spare-time project.
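(Editor's note: a minimal Flask sketch of the URL scheme converging above — /dbname/title with a start/end range at 16:55:48, returning a stats.grok.se-style JSON blob with mobile and desktop split out per 16:59:28. The storage lookup is a stub and every name here is illustrative, not the API that was eventually built.)

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_daily_views(dbname, title, start, end):
    """Stub: fetch per-day desktop/mobile counts from whatever backs the API."""
    return {"desktop": {}, "mobile": {}}  # e.g. {"20140101": 12345, ...}

# <path:title> so titles containing slashes survive routing.
@app.route("/<dbname>/<path:title>")
def pageviews(dbname, title):
    # e.g. /enwiki/Barack_Obama?start=20140101&end=20140102
    start = request.args.get("start")
    end = request.args.get("end", start)
    return jsonify({
        "project": dbname,  # the standard db identifier, per 16:56:12
        "title": title,
        "start": start,
        "end": end,
        "daily_views": get_daily_views(dbname, title, start, end),
    })

if __name__ == "__main__":
    app.run()
```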
[17:10:37] Ironholds: you were telling me something about wikiwand, weren’t you?
[17:10:43] do we know how much traffic they send us?
[17:14:15] no
[17:14:19] :(
[17:17:17] Ironholds: like, they don’t set a proper UA?
[17:17:28] I’m not asking you to do any work now, btw. :) Just asking if you already have prior knowledge
[17:18:21] YuviPanda, the wikiwand service caches locally, is the problem
[17:18:33] aaah
[17:18:42] on their own servers?
[18:36:02] Does anyone know if http://wikistics.falsikon.de/long/wikipedia/en/ is actually updated?
[18:36:14] not actually heard of it before
[18:36:32] It doesn't have any "Last updated on".
[18:39:02] And as runner-up in the "Geocities-looking pages that might have the data I'm looking for" category, there's also http://wikitop.alwaysdata.net/wikitop_en_portal.html
[18:40:40] ow
[18:41:14] I can't even figure out what half the stuff on that page refers to.
[18:42:41] Hmm, only one month of data.
[18:53:59] Some of the items on https://tools.wmflabs.org/wikitrends/english-most-visited-this-month.html I can understand, but others puzzle me.
[21:04:37] DarTar, are we meeting?
[21:22:25] o/ kevinator
[21:22:33] Do you have a minute to get on a call?
[21:22:45] sure...
[21:22:52] batcave?
[21:23:33] yup
[21:35:49] hey halfak
[21:42:03] hey Ironholds, are you ironholds@gmail.com?
[21:42:12] J-Mo, yup
[21:42:25] k. wanted to make sure it wasn't a clever troll.
[21:42:33] (or rather, that it wasn't some OTHER clever troll)
[21:43:07] :(
[21:43:17] J-Mo: did I show you https://github.com/harej/projanalysis ?
[21:52:55] hey DarTar
[21:53:03] howdy
[21:53:03] (Sorry I missed you earlier)
[21:53:08] same here
[21:53:11] what’s up
[21:53:42] Hanging out with kevin in the batcave. Making the C-levels prioritize our uniques work :)
[21:54:09] good!
[22:52:48] o/ kevinator
[22:52:52] just talked to Nuria
[22:53:08] and :-)
[22:53:14] We discussed an implementation that will get us day/week/month granularity arbitrarily.
[22:53:27] She'll update docs and try that in the changeset :)
[22:53:41] awesome \o/
[22:53:55] any changes to the use case matrix?
[22:53:56] There might be some pushback from ops because we'll be trying to set a cookie on every request.
[22:54:18] ok
[22:54:25] Nope.
[22:54:27] Looks good.
[22:59:19] kevinator, halfak : document updated: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_visit_solution#How_will_we_be_counting:_Plain_English
[23:09:42] cool, thanks Nuria. I’m reading the diff right now
[23:14:41] nuria, looks good to me.
[23:21:12] nuria: looks good to me too
[23:21:21] halfak, kevinator : ok
[23:32:51] * Ironholds headscratches
[23:33:10] I just got asked to do an OSBridge talk. I don't actually know what I'd do one on, buuut...
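(Editor's note: a hedged sketch of the "last visit" counting idea behind the wikitech document linked at 22:59:19 — tag each client with a cookie holding their previous visit time, and count a request as a new unique for any granularity whose window has rolled over since then. Function names and window rules here are illustrative, not the Analytics team's actual design.)

```python
import datetime

def new_unique_for(last_visit, now):
    """Which granularities count this request as a new unique client?"""
    granularities = []
    if last_visit is None or last_visit.date() < now.date():
        granularities.append("day")
    if last_visit is None or last_visit.isocalendar()[:2] < now.isocalendar()[:2]:
        granularities.append("week")
    if last_visit is None or (last_visit.year, last_visit.month) < (now.year, now.month):
        granularities.append("month")
    # The caller then resets the cookie to `now`, so each client is counted
    # at most once per window with no per-user log kept server-side.
    return granularities

# Example: a client last seen 2014-12-20, returning 2015-01-02, is a new
# daily, weekly, and monthly unique.
print(new_unique_for(datetime.datetime(2014, 12, 20),
                     datetime.datetime(2015, 1, 2)))
```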