[00:06:46] drdee: any ideas on pig OOM error?
[00:10:13] drdee: seems like it happened right at the end, fyi
[00:23:22] YuviPanda labs is dead
[00:23:23] :(
[00:23:29] we've been trying to figure out why
[00:23:42] whee :D
[00:24:09] Hey DarTar - sorry labs is down and we were in the middle of deploying
[00:24:20] but when it's back up, you'll have a custom datasources page on your own instance
[00:24:59] oh ok, labs list says it's because someone was running processes on Labs
[00:25:17] ask on -labs?
[00:25:27] also - apologies to everyone for not being responsive on IRC these last two weeks. Being in the office is Very distracting
[00:25:37] yeah, Yuvi, they're rebooting Bastion
[00:26:02] well, the process thing was 3 days ago.
[00:26:13] but yeah. labs is unavailable.
[00:30:41] YuviPanda, DarTar: Labs is having an issue with ssh. ryan's ETA is "hopefully soon"
[00:30:48] :)
[00:32:27] I still have a shell on kripke
[00:32:31] so it's possible i can work around it.
[00:32:50] oo, that is a good call, just keep a screen session on stat1 to kripke
[00:54:27] YuviPanda http://mobile-reportcard.wmflabs.org/
[00:54:38] wooohooo
[00:55:58] dschoon: I should probably clean that up a little :)
[00:56:01] also, http://mobile-reportcard-dev.wmflabs.org/
[00:56:05] *nod*
[00:56:12] sadly, there's no way into labs atm
[00:56:18] but you can test locally.
[01:09:10] dschoon: my local limn instance doesn't work, sadly (neither me nor milimetric were able to debug)
[01:09:56] okay, we'll try to take a look at that tomorrow (maybe)
[01:10:48] but thanks for getting this deployed :)
[01:12:35] dschoon: I suppose I can update the data / configs myself by logging into labs and updating them?
[01:12:39] (once labs is back up
[01:12:39] )
[01:13:24] if you update the data / configs, you'd have to pull from the labs instance and reset permissions
[01:13:33] (since www-data:www needs access to these files)
[01:13:45] so the easiest way to do that is with our fabric deployer
[01:13:50] ooh.
[01:14:04] is there a doc on how I can use the fabric deployer?
[01:15:30] YuviPanda, it's literally gonna be issuing "fab mobile_dev deploy"
[01:15:37] wheee :D
[01:15:40] but one sec, I'll get you the repository
[01:15:43] oh, sorry
[01:15:51] in your case it's fab mobile_dev deploy.only_data
[01:15:58] and if I change config?
[01:16:06] deploy does code and data, deploy.only_data does only data
[01:16:12] okay!
[01:16:13] what do you mean by config?
[01:16:24] graphs/* datasource/*
[01:16:37] (I am getting rid of a few timeseries, too cluttered)
[01:16:39] yep, those are all considered "data" and deployed by deploy.only_data
[01:17:09] ah
[01:17:10] okay :D
[01:17:11] thanks!
[01:17:17] still can't login to labs though
[01:17:33] milimetric: also I just figured that I have an rpi running linux. I should / could setup limn there and see how that goes :)
[01:17:48] it's not going to be blazing fast - I had to turn off half my vim plugins to get vim to a usable speed
[01:17:53] ok, my guess is that you're still hitting permissions issues
[01:18:05] on my local instance?
[01:18:15] yeah
[01:18:20] unless npm start does funky things with it...
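The "fab mobile_dev deploy" / "fab mobile_dev deploy.only_data" workflow discussed above boils down to a small Fabric file. The real fabfile lives in the deploy repository dschoon points YuviPanda to and is not shown in this log; the sketch below only illustrates the idea, and the hostname, paths and task layout are assumptions.

    # fabfile.py -- illustrative sketch only; the actual deploy tasks differ.
    from fabric.api import cd, env, sudo, task

    @task
    def mobile_dev():
        # assumed host and data directory for the mobile-reportcard-dev instance
        env.hosts = ['mobile-reportcard-dev.wmflabs.org']
        env.limn_data = '/var/lib/limn/mobile-data'

    @task
    def only_data():
        # "deploy.only_data" suggests the deploy tasks sit in a `deploy`
        # namespace (e.g. a deploy.py module) in the real fabfile layout.
        with cd(env.limn_data):
            sudo('git pull')
            # graphs/*, datasource/* and datafiles/* all count as "data",
            # and www-data:www needs to be able to read them
            sudo('chown -R www-data:www .')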
[01:18:37] yeah, it's very strange either way
[01:18:48] I'll have to get a blank VM and try it
[01:18:52] milimetric: ok
[01:19:00] milimetric: I see that everyone has 'r
[01:19:17] permissions are what i'd expect
[01:20:38] lemme paste what permissions I have
[01:20:57] ok
[01:21:42] drwxrwxr-x for all the directories, and -rw-rw-r-- for all the users
[01:21:51] *not users, files
[01:22:23] but you should get some sleep :) it's working on labs at least and we'll help you more tomorrow
[01:23:52] milimetric: yeah i have that
[01:23:53] too
[01:24:06] so weird
[01:24:15] milimetric: are you logged into labs?
[01:24:45] ok, maybe I'll try to make a .sh file that does all the necessary commands, then run it on a blank VM and see if I get the same issue
[01:24:59] yup!
[01:25:01] no, I can't get to labs yet, there's some magic ssh config change I can't figure out
[01:25:07] maybe i've an older version?
[01:26:12] milimetric: if I make a commit now, can it be deployed?
[01:26:25] yes, I'll deploy it to test out the fabric changes
[01:26:35] as soon as I get access to labs (shouldn't take long)
[01:27:03] I'm making changes blind
[01:27:05] * YuviPanda no likey
[01:27:39] do you guys put sugar in your coffee?
[01:28:13] I'm unsure if you're asking me, but I do put an average of 10 packets of sugar in a large cup of coffee
[01:28:18] washes out the bitterness
[01:29:09] yes, I thought so, I try to do that too
[01:29:22] lol
[01:29:44] it's ok Yuvi, hopefully Edit UI will get prioritized soon and we can all stop making blind changes
[01:29:53] :)
[01:29:55] milimetric: I pushed
[01:30:05] * milimetric looks
[01:30:23] milimetric: also json sucks for editing by hand :P
[01:30:31] i saw yaml there sometime. I guess it takes yaml too?
[01:31:02] yeah
[01:31:23] erosen: here?
[01:31:23] we prefer yaml when we do it by hand, we're ok with json if it has to be done that way
[01:31:35] yup
[01:31:35] sup?
[01:31:37] the datasources don't need to have their columns removed for the graphs to stop showing them
[01:31:37] erosen: do you track just *.m.wikipedia.org in your stats ?
[01:31:42] erosen: or also wikibooks
[01:31:46] erosen: erm wiktionary
[01:31:48] that's the idea of those
[01:31:53] erosen: and all those other ones
[01:31:58] they point to data and the graphs visualize data
[01:32:00] everything
[01:32:02] milimetric: yeah, but I was removing them from the SQL too
[01:32:03] so
[01:32:05] erosen: oooooOOOOooo
[01:32:11] hmmm
[01:32:11] gotcha
[01:32:14] i mean not in the number we're comparing though
[01:32:38] erosen: ok, in the number we're comparing you just count *.m.wikipedia.org right ?
[01:32:42] i just track them, as I have that information parsed out and I have files sitting around with counts for all those projects
[01:32:49] yes
[01:32:56] but now that you ask I might as well check
[01:33:00] erosen: so your input is preprocessed somehow ?
[01:33:05] erosen: how is it preprocessed ?
[01:33:15] no
[01:33:20] ok
[01:34:11] if you are referring to the files sitting around, those are just the results of my counting
[01:34:43] erosen: oh so that's the output
[01:34:47] yeah
[01:35:03] i'm checking the actual code, just to make sure ;)
[01:35:07] erosen: do you count images also ?
[01:35:23] i don't think so
[01:35:27] ok
[01:35:34] so you have a filter for that
[01:35:38] YuviPanda, the changes look ok, I'll ping when they're deployed
[01:35:41] erosen: do you filter on mimetype ?
[01:35:42] my decision criterion for a page view is just whether the url is a ….org/wiki/...
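erosen's criterion above (count only *.m.wikipedia.org requests whose url looks like ….org/wiki/…) amounts to a one-regex test. The sketch below just restates what the chat says; it is not the actual embr_py code and the exact pattern is an assumption.

    # A page view is counted only for mobile Wikipedia hosts and only for
    # /wiki/ article paths; no mime-type or status-code check, per the chat.
    import re

    MOBILE_WIKI_PAGEVIEW = re.compile(r'^https?://[^/]+\.m\.wikipedia\.org/wiki/.')

    def is_mobile_pageview(url):
        return MOBILE_WIKI_PAGEVIEW.match(url) is not None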
[01:35:46] milimetric: thanks
[01:35:51] no mime-type
[01:35:58] erosen: mm ok
[01:36:01] I tried that and it had a very small effect
[01:38:00] erosen: so you are counting /wiki/ but not /w/
[01:38:23] well I'll try with /wiki/ and see what happens
[01:38:47] yup, just wiki
[01:39:19] one last thing I should point out is that my code does periodically raise exceptions when parsing log lines
[01:39:23] and I skip them
[01:39:29] YuviPanda: all better: http://mobile-reportcard-dev.wmflabs.org/
[01:39:29] :)
[01:39:40] i should really know how often this happens
[01:39:44] erosen: invalid lines I suppose
[01:39:56] output them to a file in the catch block
[01:39:58] average_1rifter: yeah, well, by definition
[01:40:15] good call
[01:41:23] milimetric: "name" gets displayed as the title, and "slug" is for URL, right?
[01:42:07] yes YuviPanda, but slug must mirror the url, which comes from the file
[01:42:21] sure, that's fine
[01:42:21] id actually must mirror the url and slug must mirror id
[01:42:25] I just needed to know which one to change to change the display
[01:42:30] I can explain why but it's complicated :)
[01:42:38] change all three
[01:42:44] filename/slug/id
[01:42:53] for now, that's how it has to be unfortunately
[01:43:15] YuviPanda: pretty sure, yes
[01:43:34] milimetric: no, I don't want to touch those :D
[01:43:36] DarTar: dashboards are up
[01:43:37] just the "name" field
[01:43:48] \o/
[01:43:53] dschoon: link?
[01:44:00] DarTar: http://ee-dashboard.wmflabs.org/ is http://ee-dashboard.wmflabs.org/dashboards/metrics
[01:44:05] there's also http://ee-dashboard.wmflabs.org/dashboards/features
[01:45:55] that's great :)
[01:47:20] I'll start hacking the settings, glad that these dashboards have a permanent home
[01:48:55] sounds great. let us know if you have any issues, DarTar
[01:50:06] heading home
[01:50:08] ta~
[01:56:59] milimetric: i'm going to test fabric deploy now :)
[01:57:42] milimetric: I'm getting a 'command not found'
[01:57:44] (on kripke)
[01:58:29] oh well
[01:59:04] erosen: response status code check ?
[01:59:11] erosen: none right?
[01:59:18] i don't think so
[01:59:25] ok
[01:59:38] * average_1rifter is deleting code now to adapt to embr_py
[02:02:21] average_1rifter: just checked, no status code checking
[02:02:27] I also tried that and found minor differences
[02:03:02] cool
[02:08:59] does anyone know how to run fab on kripke?
[02:09:06] (it tells me currently that fabric is not installed)
[02:10:15] install fabric?
[02:10:17] with pip ?
[02:10:25] * average_1rifter hasn't used fabric
[02:10:54] average_1rifter: oh
[02:11:06] well, I also guess I need to know *where* the fab file is
[02:11:40] http://docs.fabfile.org/en/1.6/installation.html
[02:13:32] hmm
[02:13:39] clearly I'm missing some critical piece of info
[02:13:41] also it is 8AM
[02:13:44] I should go sleep :|
[02:20:29] ok
[05:18:54] milimetric: hey :)
[05:18:57] milimetric: you up ?
[05:30:03] drdee: hi
[05:30:10] drdee: new reports are on the way
[05:30:16] ok
[05:30:21] just december?
[05:30:31] december and november, so we can compare and see if the bump occurs
[05:31:12] drdee: can you let me know when you crunch december and november through kraken ?
[05:31:16] I'm really curious about the outcome
[05:32:09] I mean if you plan to do that
[05:32:22] sure but we first need to solve this problem :)
[05:32:26] ok
[05:32:51] when were tabs introduced ?
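milimetric's suggestion above — write the exception-raising lines to a file instead of silently skipping them — could look roughly like this; parse_line() is a hypothetical stand-in for whatever the real per-line parser is called.

    # Keep skipping unparseable lines, but save them so the skip rate can be
    # measured later ("i should really know how often this happens").
    def count_pageviews(log_path, error_path):
        matched = skipped = 0
        with open(log_path) as logs, open(error_path, 'w') as errors:
            for line in logs:
                try:
                    if parse_line(line):   # hypothetical parser
                        matched += 1
                except Exception:
                    errors.write(line)     # the "output them to a file" part
                    skipped += 1
        return matched, skipped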
[05:32:58] feb 1
[05:33:05] ok
[05:33:43] we should make like a timeline of events that produced changes in the datasets, I'll try to find some mediawiki extension that does this. I think there's a timeline
[05:46:15] erosen: can you link me up one more time to your results please ?
[05:46:20] the results of embr_py
[05:46:26] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r31-embr-py-rules/pageviews.html
[05:46:29] this is the new one
[05:46:35] uhm still the gap is there
[05:47:06] the code now has exactly the same rules as Evan's, I'll have to look again on the languages to see if the arrays are the same
[05:47:25] i am not surprised
[05:49:22] ok
[05:49:32] hey average_1rifter , one sec
[05:49:39] erosen: ok
[05:52:35] switched to Evan's list of languages
[05:52:38] another run
[05:53:35] erosen: can we diff our input sets ? or md5sum their contents
[05:53:44] sure
[05:53:52] ok we'll check sizes too
[05:53:53] which file are you running it on exactly?
[05:54:14] erosen: /a/squid/archive/sampled-geocoded
[05:54:24] erosen: you're using /a/squid/archive/sampled right ?
[05:54:30] yup
[05:54:36] ok, I'm going to check sizes now
[05:55:24] average_1rifter: here is the output http://gp.wmflabs.org/data/datafiles/gp/daily_mobile_wp_views_by_country.csv
[05:55:54] you can visualize it in limn but it crashes because there are too many lines
[05:56:02] hey average_1rifter sorry missed your ping
[05:56:21] I'm up, trying to tie together some user metrics API stuff for DarTar
[05:56:35] hey milimetric
[05:56:42] hey Dario :)
[05:56:44] how goes
[05:56:50] good
[05:57:15] I'm working on the timeseries example, seems pretty simple
[05:57:26] nice :)
[05:57:39] hopefully it's not too late but I should have something by tomorrow morning
[05:57:56] my only impediment is I have to do laundry at some point because I have no more socks :)
[05:58:17] we kind of concluded we wouldn't showcase any dataviz but we can always add something at the very last minute if available
[05:59:16] socks: if you want to deprioritize them I can lend you a couple :D
[06:00:40] haha
[06:01:39] yeah, if it's ready, and we have time to add it to the presentation, it'd be nice to have
[06:02:38] http://diffchecker.com/4jPg3L7i
[06:02:43] erosen: LHS is you
[06:02:45] erosen: RHS is me
[06:02:59] :( I don't know why the output got considerably bigger
[06:03:05] woah
[06:03:06] weird
[06:03:17] all I did between the two was use udp-filter to geocode them
[06:03:18] average_1rifter where do your files come from?
[06:03:42] erosen: they're the files you're using passed either through udp-filter or through a oneliner to split the ip column for geocoding
[06:03:59] ya
[06:04:13] I'll check one last thing, the size in lines
[06:04:21] and after that I'm switching to your datasource
[06:04:33] the geocoding just adds one field, right?
[06:04:47] yes
[06:04:54] which looks about right then
[06:05:08] it does ?
[06:05:10] the factor is roughly 21 / 20
[06:05:34] unless they are just two letter country codes
[06:05:42] they're 2 letter
[06:05:44] hmm
[06:06:03] that would imply that the log line is only 40 characters long
[06:06:06] which can't be true
[06:06:16] okay, I agree that they are weird again
[06:34:51] erosen: the whole dataset differs by just 415 lines
[06:35:07] interesting
[06:35:23] i assume the geocoded dataset has fewer?
[06:35:31] because udp-filter eats a few?
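The "md5sum their contents / check sizes" comparison being set up above is just a hash-and-count over each gzipped sampled log; the sketch below is one way to do it, not anyone's actual script.

    # Hash and count the lines of a gzipped log so two input sets can be
    # compared quickly (identical md5 => identical contents; the line counts
    # explain differences like the 415-line delta mentioned above).
    import gzip
    import hashlib

    def summarize(path):
        digest, lines = hashlib.md5(), 0
        with gzip.open(path, 'rb') as f:
            for line in f:
                digest.update(line)
                lines += 1
        return digest.hexdigest(), lines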
[06:35:42] has more, for some reason
[06:35:56] I would expect it to have fewer also
[06:36:03] interesting
[06:36:17] well that clearly can't be throwing us off
[06:36:20] good to know
[06:36:36] tried with your list of languages
[06:36:37] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r32-embr-py-logic-replaced-languages/pageviews.html
[06:36:50] difference is extremely small (I had one or two more languages)
[06:37:08] ok, now I'm switching to your datasource
[06:37:13] /a/squid/archive/sampled
[06:37:28] seems reasonable
[06:41:33] wait, I don't even need the geocoding for this
[06:41:45] I can directly switch to your datasource
[06:41:52] erosen: you have any field count checks ?
[06:41:58] yeah
[06:41:58] sorry, just double checking
[06:42:01] yes ?
[06:42:05] what's your minimum ?
[06:42:25] they aren't checks so much as custom ways of dealing with it when two of the fields get merged
[06:43:00] hm, I don't have that
[06:43:19] erosen: do you happen to know if a big percentage of lines have fields that get merged ?
[06:43:42] i don't actually know how significant an effect it has
[06:43:48] i could check though
[06:45:15] ok, what kind of logic do you have for that ?
[06:45:21] is embr_py still on github ?
[06:45:29] ya
[06:45:47] can't access it
[06:45:47] https://github.com/embr/embr_py
[06:46:06] https://github.com/wikimedia/metrics/tree/master/pageviews/embr_py
[06:46:11] oh sorry
[06:46:11] ok
[06:46:39] basically i delete the 11th token (zero-indexed) if i have too many tokens
[06:47:18] because it tends to arise from a space in the mime_type field
[06:47:25] ok
[06:48:13] i'm running a quick count on a recent file right now
[06:48:22] ok
[06:48:42] I'm switching datasources and doing another run
[06:49:58] k
[06:52:14] running
[06:52:23] eta 20m-30m
[06:52:27] less actually
[06:52:45] hopefully i'll have an answer about the effect of the extra field by then
[06:52:51] ok
[06:53:21] thanks btw, if a lot of lines need field-merging then I'll need to implement that as well
[06:53:58] erosen: how long did it take you to develop your python framework for making these reports ?
[06:54:38] they emerged over a few weeks off and on
[06:54:39] I think things went wrong on the Perl script I've written because I pumped too much logic without making reports for each new discarding rule..
[06:56:53] I also made another mistake
[06:57:02] I should've made tags for each new report I'd release
[06:57:05] so on a sample of 100k total lines from a few days ago, I found that I deleted the extra token 3.2k times
[06:57:17] so as to be able to return to a previous version of the code associated with a given report version
[06:57:26] good call
[06:57:51] I've had the same headaches because I can't figure out the difference between code versions
[06:58:36] so again, it shouldn't really matter
[06:59:31] the delta is 41%
[06:59:36] so 3% yes, won't account for it
[06:59:51] but looking closer it happens on 1677 / 22021.0, or 7%, of the total page view requests (after filtering)
[07:00:11] it did ?
[07:00:15] still not enough, i know, but interesting that there is a correlation
[07:02:33] in a few minutes the new reports are coming out
[07:02:47] cool
[07:03:20] could you run the counting of the merged-field lines on a bigger sample than 100k ?
[07:03:54] sure, I'm running it on an entire file now
[07:04:04] ok, thanks
[07:07:02] [travis-ci] develop/bb8b5b6 (#107 by milimetric): The build passed. http://travis-ci.org/wikimedia/limn/builds/5306716
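erosen's merged-field handling described above — drop the 11th token (zero-indexed) when a line splits into one token too many, since the extra split usually comes from a space inside the mime_type field — would look something like the sketch below. The expected field count is an assumption, not taken from embr_py.

    EXPECTED_FIELDS = 14   # assumed width of a sampled squid log line

    def repair_fields(line):
        # A space inside the mime_type field yields one extra token;
        # deleting token 11 (zero-indexed) undoes the bad split.
        tokens = line.rstrip('\n').split(' ')
        if len(tokens) == EXPECTED_FIELDS + 1:
            del tokens[11]
        return tokens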
[07:10:08] average_1rifter: Counter({(False,): 1630509, (True,): 105015, (None,): 515})
[07:10:23] read: no-merge, merge, other-error
[07:10:44] 0.0644
[07:11:02] that is on filtered requests, btw
[07:12:31] so it's 6.5% ?
[07:12:39] yeah
[07:12:52] ok, could be part of the explanation
[07:13:20] erosen: do you count discarded lines ?
[07:13:29] yeah
[07:13:32] that is the last number
[07:13:33] 515
[07:14:03] apologies for the cryptic formatting
[07:14:33] I changed the datasource
[07:14:34] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r33-embr-py-logic-changed-datasource/pageviews.html
[07:14:46] lowered minimum fields to 9 instead of 10
[07:14:54] gap is still there
[07:20:46] ok let me re-check my code
[07:20:51] maybe I'm doing something wrong
[07:26:10] well obviously I am uhm..
[07:26:36] [travis-ci] master/2bcf24b (#72 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5306965
[07:53:37] average_1rifter: I'm off to bed, catch you tomorrow probably
[08:05:46] [travis-ci] master/a6b5c44 (#73 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5307429
[08:21:14] [travis-ci] master/2201d31 (#74 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5307613
[14:55:41] hi erosen
[14:55:46] hey sup?
[14:56:05] erosen: the data is beating me
[14:56:17] erosen: but I'm not tapping out
[14:56:38] erosen: http://www.mediawiki.org/wiki/User:Spetrea#2013-March-07
[14:56:42] average_1rifter hehe, well glad to hear you've still got some hope
[14:57:42] what do you think about doing a more granular analysis on a few thousand lines, so we can see how they differ?
[14:58:01] erosen: I am all for that
[14:58:39] i'm wondering what the best approach would be. maybe we just create a file with the lines that matched
[14:58:55] and then we can diff those, to see which you are counting and I'm not (if any)
[15:00:15] dschoon: seems like something you might like: http://www.brainpickings.org/index.php/2013/03/07/a-map-of-the-world-according-to-illustrators-and-storytellers/
[15:37:53] erosen: ok finished the last run I had to do, proved it's not related to the forks I'm doing either
[15:38:11] average_1rifter: that's good news
[15:38:37] average_1rifter: do you want to go ahead with comparing a specific file?
[15:38:40] erosen: yes
[15:38:49] presumably the one that is the first day of the bump
[15:39:00] erosen: I'll create a sampled input file with data from nov => dec
[15:39:08] sounds good
[15:39:11] erosen: then we'll use that in both my implementation and yours
[15:39:15] sounds good
[16:08:34] erosen: ls sampled-1000.log-201211* sampled-1000.log-201212* | xargs zcat | perl -ne 'print if $. % 10000 == 0' | gzip > /tmp/mix.gz
[16:08:55] I'm running that to produce a 1:10000 sampled mixed gzip for november and december
[16:09:14] hopefully it will be large enough to reproduce the problems
[16:09:17] but small enough to run fast
[16:09:36] if it doesn't work, we'll try again (with a different remainder in that modulo thing, or some other stuff)
[16:09:48] by "doesn't work" I mean that we don't reproduce the problem
[16:09:49] sounds good
[16:09:53] ya
[16:10:48] average_1rifter just one thought: to make it easy to tell whether it does reproduce the problem, might it make sense to create two files--one for nov and one for dec?
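The per-month variant erosen suggests here (and that gets agreed to just below, producing /tmp/mix_nov.gz and /tmp/mix_dec.gz) could look roughly like the sketch that follows: the same 1:10000 thinning as the perl one-liner above, but routed into one output per month based on the date in each source filename. The paths and filename pattern are assumptions.

    import glob
    import gzip

    def build_monthly_mixes(pattern='/a/squid/archive/sampled/sampled-1000.log-2012*',
                            step=10000):
        outputs = {'201211': gzip.open('/tmp/mix_nov.gz', 'wb'),
                   '201212': gzip.open('/tmp/mix_dec.gz', 'wb')}
        seen = 0
        for path in sorted(glob.glob(pattern)):
            month = path.rsplit('-', 1)[-1][:6]   # date suffix of the filename
            if month not in outputs:
                continue
            with gzip.open(path, 'rb') as source:
                for line in source:
                    seen += 1
                    if seen % step == 0:          # keep 1 line in 10000, like the one-liner
                        outputs[month].write(line)
        for out in outputs.values():
            out.close()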
[16:11:08] or do you already have code for grouping log lines by month
[16:11:20] it's not an issue for me
[16:12:06] erosen: I'll add code for that
[16:12:16] k
[16:12:37] average_1rifter i'm getting off a train now, i'll be back in 20
[17:10:55] erosen: sweet!
[17:12:29] glad you're excited
[17:28:41] good morning peoples
[17:29:07] aye
[17:38:08] erosen: got the datasets ready
[17:38:13] nice
[17:38:18] where do they live?
[17:38:18] /tmp/mix_nov.gz
[17:38:23] /tmp/mix_dec.gz
[17:38:25] stat1
[17:39:17] average_1rifter: k, should I create just one file for the matched page requests? or one for the matches and one for the non-matches
[17:40:48] erosen: two, matched and non-matched please
[17:40:53] then we will diff against mine
[17:41:00] yeah, that makes the most sense to me
[17:41:21] average_1rifter: ok, just finishing up something else, real quick but it should be done in 20 or 30 min
[17:43:07] ok
[18:23:52] erosen: I have some results ready
[18:24:02] erosen: when you have some time ping me
[18:26:55] average_1rifter: hey, just finished a surprise meeting
[18:26:59] erosen: great
[18:27:09] average_1rifter: looks like it will be another few minutes for my results
[18:27:31] erosen: my results are in stat1:/home/spetrea/comparative_nov_dec_sampled_results.zip
[18:27:38] great
[18:27:38] erosen: alright
[18:27:50] average_1rifter: I'll ping you when mine are ready
[18:27:57] ok
[18:28:17] average_1rifter: also, I thought it made sense to have a file with lines which had an error, unless you want to just count those as non-matches
[18:28:23] average_1rifter: what do you think
[18:28:24] ?
[18:28:39] erosen: I have
[18:28:48] erosen: erm wait
[18:28:57] erosen: so we need accepted/discarded lines right?
[18:29:08] yeah
[18:29:11] erosen: if we can make just this distinction it would be good
[18:29:16] sounds good
[18:29:36] erosen: do you have more than these two types ?
[18:30:00] so you have error, accepted, discarded ?
[18:30:17] well, I have ones which explicitly get discarded and ones which throw an error that I catch
[18:30:20] yeah
[18:31:15] it would be helpful if you could separate them at a granular level. although I should do the same in that case
[18:31:24] let's first try with just accepted/discarded
[18:31:29] if that doesn't work out, we can go more in-depth
[18:31:32] yeah
[18:31:36] sounds like a plan
[18:31:47] DarTar: just created RT ticket for stat1001. I'll setup crons, etc there once I get access.
[18:31:47] heh
[18:34:45] PissedPanda: thank you!
[18:59:32] average_1rifter: results!
[19:00:09] average_1rifter - I randomly found a cool pie chart example if you're still interested: http://theartofasking.com/question/wjucdg6k
[19:02:52] average_1rifter: /home/erosen/src/metrics/pageviews/embr_py/results
[19:04:31] just as a note, Andrew is off today (his brother is in town)
[19:23:11] milimetric: is it possible to directly have limn execute sql? IIRC no, right?
[19:24:19] YuviPanda: not yet …
[19:24:32] is it a planned thing?
[19:24:36] there is talk of a data broker which
[19:25:02] would expose saved views / queries / files over http
[19:25:59] but limn is trying to stay agnostic afaik
[19:26:16] and just point at urls
[19:28:05] average_1rifter: can you point me one more time to the filter code of your mobile pageviews?
[19:28:52] erosen: okay.
[19:29:02] erosen: sounds right to me. juliusz just wanted to check
[19:29:02] YuviPanda: sorry that was sort of vague
[19:29:11] nah, that was good enough.
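The comparison agreed on above — run each implementation over the shared /tmp/mix_nov.gz and /tmp/mix_dec.gz, writing accepted and discarded lines to separate files so the outputs can be diffed — could be driven by something like this. is_mobile_pageview() stands in for either side's real per-line test, and the url field index is an assumption.

    import gzip

    def split_matches(sample_path, accepted_path, discarded_path):
        with gzip.open(sample_path, 'rb') as sample, \
             open(accepted_path, 'w') as accepted, \
             open(discarded_path, 'w') as discarded:
            for raw in sample:
                line = raw.decode('utf-8', 'replace')
                try:
                    url = line.split(' ')[8]              # assumed position of the url field
                    matched = is_mobile_pageview(url)     # hypothetical per-line test
                except Exception:
                    matched = False                       # count error lines as non-matches, per the chat
                (accepted if matched else discarded).write(line)

    # e.g. split_matches('/tmp/mix_nov.gz', 'nov.accepted', 'nov.discarded')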
[19:29:21] there *might* be a databroker at some point, but right now this is good enough
[19:33:30] erosen: thanks
[19:33:55] average_1rifter: i just refreshed about 5min ago
[19:34:10] i had forgotten to add the wikipedia filter
[20:46:37] YuviPanda: are you happy with your dashboards? can we consider this as being finished?
[20:46:48] drdee: needs to auto-update data.
[20:46:58] drdee: I need stat1001 access for that, so waiting on that
[20:47:30] is there an RT ticket for that?
[20:47:38] yes
[20:47:45] drdee: https://rt.wikimedia.org/Ticket/Display.html?id=4687&results=dc1f8f2885139f8be90da1071f84e76f
[20:47:58] i get a no permission error
[20:48:22] to view that?
[20:48:24] :|
[20:48:25] yes
[20:48:39] drdee: https://rt.wikimedia.org/Ticket/Display.html?id=4687 any better?
[20:48:45] no
[20:48:48] make sure it's not private
[20:49:18] I don't see anything that could potentially mark it as private
[20:49:25] but the auto-update of data is more of a new request than something that is related to the dashboard, right?
[20:50:02] drdee: well, the dashboard is stale unless that happens, no?
[20:50:15] dashboard is stale but it works :)
[20:50:19] drdee: data currently lives inside the git repo, rather than accessed over http. milimetric said this is a 'temp solution' until we get that out.
[20:50:24] drdee: sure, the dashboard itself works great :)
[20:50:36] but won't be particularly useful until we have update setup.
[20:52:19] i understand
[20:53:03] any other acceptance criteria?
[21:06:02] hey, i clicked around on http://stat1.wikimedia.org:3307/ a bit after seeing it in metrics meeting. fyi: when going to "See all generated requests" and then hitting some, i get Tracebacks and BuildError: ('cohort', {}, None)
[21:06:30] rfaulkner: ^^
[21:06:36] drdee: I think that is it.
[21:06:47] YuvIPanda: awesome!
[21:06:53] :D
[21:07:22] drdee: you about?
[21:09:46] mutante, that instance was really just meant for the demo
[21:10:23] We're looking at having a stable version hosted at metrics-api.wikimedia.org this month
[21:10:41] rfaulkner: gotcha, no worries, i was just curious and wanted to report
[21:11:20] appreciate it :) You should be able to kick-off requests however
[21:11:46] some may take 10 minutes or so, you can see the status in the job queue
[21:12:30] picks "moodbar_confused" very randomly :)
[21:12:47] cool, thx
[21:26:55] erosen: hey! limnpy questions, is this a good time?
[21:28:09] hey YuviPanda - your app is down?
[21:28:19] the dashboard?
[21:28:21] it shouldn't be
[21:28:24] yeah I wanted to showcase it :)
[21:28:41] DarTar: no it isn't!
[21:28:42] http://mobile-reportcard-dev.wmflabs.org/
[21:28:42] YuviPanda: hey
[21:28:43] works
[21:29:20] oh I was thinking of http://stat1.wikimedia.org:1337/app-stats.html
[21:29:23] DarTar: code is now at github.com/wikimedia/limn-mobile-data, and is getting rather generic. People should be able to write SQL and get csvs out of it :)
[21:29:27] has it been decommissioned?
[21:29:30] DarTar: oh, no that has been dead for a while
[21:29:35] DarTar: yes, that was just a super-quick hack :)
[21:29:39] ok np
[21:30:27] erosen: hey! I see that limnpy is generating "columns" data for "datasources" by having separate "labels" and "types" fields
[21:30:39] erosen: but milimetric rewrote it by hand yesterday to have them together as a dictionary
[21:30:44] erosen: will the older format still work?
[21:30:49] interesting...
[21:30:57] I can show you a diff...
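The two datasource "columns" shapes being compared here, reconstructed from the chat alone (the actual diff is the pastebin linked a little further down, and the key names inside each column object are assumptions, not the limn schema): limnpy emits parallel "labels" and "types" lists, while the direction described below is an array of objects, one per column.

    # what limnpy emits at the time: separate, parallel lists
    old_style_columns = {
        'labels': ['date', 'unique visitors'],
        'types':  ['date', 'int'],
    }

    # the expected future shape: an array of objects, one per column
    # (field names here are illustrative guesses)
    new_style_columns = [
        {'id': 'date',    'label': 'date',            'type': 'date'},
        {'id': 'uniques', 'label': 'unique visitors', 'type': 'int'},
    ]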
[21:31:04] YuviPanda: the old format definitely works, but dan might have different plans for the future
[21:31:31] he is sitting next to me right now, let me just ask him what he thinks will happen
[21:31:40] erosen: okay!
[21:32:01] erosen: I'm trying to make the script in limn-mobile-data generic enough that anyone can give it some SQL and it'll spit out appropriate limn files and data csvs
[21:33:45] nice
[21:34:04] he says that the old format works
[21:34:07] okay
[21:34:10] but that he is expecting an array of objects
[21:34:20] erosen: http://pastebin.com/agnqKs5R is the diff from his hand-written one to limnpy ones
[21:34:37] erosen: okay, in that case I assume at some point limnpy will change to emit that, and I shouldn't worry :)
[21:34:42] indeed
[21:34:52] YuviPanda I might do it right now…
[21:34:58] hah, sweet :)
[21:35:11] erosen: if you can, also look at that diff to see other things I might be missing
[21:35:27] will do
[21:36:45] erosen: thanks :)
[21:36:50] np
[21:37:15] erosen: does limnpy also write datafiles/*?
[21:37:19] ya
[21:37:21] (just confirming)
[21:37:22] okay
[21:37:49] it is intended to do exactly what you're trying to do: generate graphs programmatically
[21:38:21] yeah, it's doing 99% of the work :)
[21:38:29] I'm just building a wrapper script on top of it
[21:46:52] milimetric, are you there?
[21:47:13] hey kraigparkinson I'm upstairs in the collab space
[21:47:33] attending the e3 workshop
[21:47:35] what's up?
[21:48:13] milimetric, could you make it possible for me to edit the recurring Sprint Planning meeting? I want to extend it by half an hour to include our improvement discussion time, so I can eliminate the afternoon meeting...
[21:48:27] sure, one sec
[21:48:51] or you could do it. :)
[21:49:38] ok, i'll just do it
[21:50:02] so the wednesday 2-3, make it 2-3:30, right?
[21:50:34] done, and made it editable by all
[21:50:40] You're talking Eastern Time, right? :)
[21:55:12] erosen: is ds.__source__ immutable?
[21:55:15] or rather, ds in general?
[21:55:26] no, you can mess with it before writing
[21:55:28] I see that the examples in the README construct new ones
[21:55:29] ah okay
[21:55:34] fine, so that's just the examples
[21:55:34] ok
[21:55:52] (urls for the datafiles are different, so...)
[21:55:59] i was wondering whether it was an abuse of the norms to use the __ naming
[21:56:14] it probably is. __ is magic stuff
[21:56:17] or rather
[21:56:21] __a__ is definitely magic stuff
[21:56:29] you could do _ or __ but they'd be mangled
[21:56:39] proper way would be to not do _, I think
[21:56:47] yeah I might change that as well
[21:57:25] I started out intending for it to be private, but realized that it wasn't a bad way to do advanced configuration
[22:30:08] erosen: added a few more commits. Now it is fully pushing data / config from limnpy :) one hack remains, I'll get rid of it once I get access to stat1001
[22:30:19] erosen: the 'url' field can be an actual external HTTP URL?
[22:30:29] or is it always supposed to be same domain?
[22:32:39] i think so
[22:32:47] but I admit I haven't messed around with it myself
[22:33:01] YuviPanda: ^^
[22:33:04] alright
[22:33:09] i'll try to poke milimetric tomorrow
[22:33:18] hi! :)
[22:33:26] url can be anything
[22:33:40] each limn instance has its own whitespace
[22:33:42] ugh
[22:33:44] whitelist
[22:33:49] aha!
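The generic limn-mobile-data wrapper described above ("give it some SQL and it'll spit out appropriate limn files and data csvs", with limnpy doing most of the work) reduces at its core to running a query and dumping the cursor to CSV. The sketch below only illustrates that idea, not the actual script in the repository; the MySQLdb driver and connection details are assumptions.

    import csv
    import MySQLdb   # assumed driver; connection parameters are placeholders

    def sql_to_csv(sql, out_path, **conn_args):
        conn = MySQLdb.connect(**conn_args)
        try:
            cur = conn.cursor()
            cur.execute(sql)
            with open(out_path, 'w') as f:
                writer = csv.writer(f)
                writer.writerow([col[0] for col in cur.description])  # header row
                writer.writerows(cur.fetchall())
        finally:
            conn.close()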
[22:33:49] for where the URL can point to
[22:33:50] right
[22:33:53] but by default they're all *
[22:33:54] but it can be external
[22:33:59] (because it's labs and we don't care yet)
[22:34:03] and is it fetched serverside or clientside?
[22:34:11] yeah, they can be external, they get processed through a proxy in Limn
[22:34:19] aaah, that makes sense
[22:34:19] streamed through the limn server
[22:34:35] we're looking at doing cooler things like CORS
[22:34:43] juliusz was talking about how you can actually have the URL point to a flask server running elsewhere :D
[22:34:47] and get, in theory, realtime stats
[22:35:06] does it just stream? is there any caching at all?
[22:35:15] no caching, it just streams
[22:35:17] okay
[22:35:31] Dario's dashboards all point to remote toolserver urls
[22:35:38] http://ee-dashboard.wmflabs.org
[22:35:47] i gotta run because my battery's about to die :(
[22:35:59] right
[22:40:38] * YuviPanda goes to sleep
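What milimetric describes above — each datasource url is fetched server-side, checked against a per-instance whitelist, and streamed through with no caching — is sketched below in Python purely for illustration; Limn itself is a Node app and its actual proxy code is not shown in this log.

    import fnmatch
    import urllib2

    WHITELIST = ['*']   # the labs default mentioned above: everything allowed

    def proxy_datasource(url, chunk_size=64 * 1024):
        # reject urls outside the instance's whitelist
        if not any(fnmatch.fnmatch(url, pattern) for pattern in WHITELIST):
            raise ValueError('url not in whitelist: %r' % url)
        remote = urllib2.urlopen(url)
        # no caching: just stream the remote datafile through in chunks
        while True:
            chunk = remote.read(chunk_size)
            if not chunk:
                break
            yield chunk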