[00:16:11] drdee: i'm talking to dan foy about VUMI stuff, and I keep forgetting whether we actually have a reliable unsampled log for all requests [00:16:13] do you know? [00:18:33] dschoon: do you know if we have an unsampled log? ^ [00:23:07] i guess it's unsampled [00:23:21] but you would have to check with ottomata to be sure [00:24:21] k, basically dan raised the question of how we track how much SMS traffic is coming from each partner [00:24:53] and I realized I don't think we can do that unless we have an unsampled log, or go down the udp-filter route, or add custom varnish behavior [00:24:59] drdee: thoughts? [00:50:14] i'll think about it, erosen. i glanced at those varnish configs when looking into the beta site stuff. [00:53:47] cool [01:49:46] dschoon: what is the chart type for the new datasources? [01:50:33] that field is obsolete [01:50:36] erosen ^^ [01:50:40] awesome [01:50:51] I'm updating the limnpy to use the new formats [01:51:21] so it is just 'chart' : {'type' : 'timeseries'}? [01:51:26] these are the fields: [01:51:26] https://github.com/wikimedia/limn/blob/master/src/data/datasource/datasource.co#L65 [01:51:52] cool [01:52:03] usually, "defaults" contains an exhaustive list of the attributes a configuration object supports [01:52:04] do you know where in the limn object in the browser I can find examples? 
[01:52:12] totes [01:52:35] so, the src directory hierarchy is replicated as an object hierarchy on limn [01:52:45] k [01:53:00] src/graph/node/vis/line-node.co --> limn.graph.node.vis.LineNode [01:53:11] (for the classes, that is) [01:53:25] the active model and view are at limn.model and limn.view [01:53:39] aha that's what I was forgetting [01:53:48] for a page with a single graph, limn.view is a GraphView [01:53:55] limn.model will be a Graph [01:54:01] you can also go [01:54:05] I see [01:54:20] how do i find a datasource object model from such page [01:55:34] limn.data.DataSource.lookup( DATASOURCE_ID, function(err, ds) ) [01:56:00] if you're already sure the source is loaded, you can just go: [01:56:15] limn.data.DataSource.get(ID) [01:56:33] that merely checks the cache. [01:56:54] cool [01:57:05] actually maybe i was making things too complicated [01:57:20] does something like this yield an up to date version? http://reportcard.wmflabs.org/data/datasources/rc/rc_comscore_region_uv.yaml [01:58:44] pretty sure datasources have not changed at all [01:59:12] excepting the new field "type" to go along with "format" [01:59:21] type defaults to "timeseries", so you should be fine [02:00:53] dschoon: can you clarify: """ excepting the new field "type" to go along with "format" """ [02:01:16] check out the comment here: https://github.com/wikimedia/limn/blob/master/src/data/datasource/datasource.co#L69 [02:01:21] and lmk if that helps? [02:01:56] the idea is that "format" is how the data is serialized/encoded/formatted. csv, tsv, xml, json, etc [02:02:05] that makes sense [02:02:08] type is about the contents [02:02:17] gotcha [02:02:29] so geojson has special meaning. 
it's polygon data for drawing map outlines [02:02:29] i just couldn't tell if there was some relationship between the two [02:02:58] because i think format was in the old version [02:03:12] they're usually independent, but the XXXjson types obviously imply format=json [02:03:14] so really just type is new [02:03:20] yep. [02:03:20] ya, fo sho [02:03:24] as i said, i think :) [02:03:29] hehe [02:03:30] and type defaults to "timeseries" [02:03:31] we'll find out [02:03:35] cool [02:03:36] so i think you don't have to do anything? [02:03:51] so hopefully datasources haven't changed at all [02:04:02] ...it also occurs to me i'm sitting in legal. [02:04:05] well i'm going to drop the "chart" key [02:04:08] and you might be on the other side of the room [02:04:10] nope [02:04:13] ah, ok :) [02:04:15] i'm in PA today [02:04:15] hehe [02:04:17] yeah, drop it [02:04:26] cool [02:07:27] aiight, i should head home [02:07:27] i'll be back online in a bit. [02:07:27] laterz [02:07:28] cool [02:07:29] in case you have questions [02:07:30] lates [05:05:34] erosen [05:05:37] oh too late [05:05:42] haha, i just read his question [05:05:52] I have numbers on unsampled reliability [05:06:03] want to graph it once I get data syncing from /wmf/public to stat1001 [14:27:59] mornin [15:10:33] gooooood morning guys [15:14:17] milimetric, ottomata [15:14:30] hey diederik :) [15:16:37] morning [15:18:41] mooooororoning [15:27:49] milimetric [15:27:52] shall we puppetize limn? [15:28:04] yes! [15:28:06] let's do it [15:28:30] hokay, so, uhhh, yeah, what's needed [15:28:39] install package... [15:28:53] set some configs [15:28:54] start a service? [15:29:08] ok, so here's the list of things: [15:29:14] 1. apache installed [15:29:22] 2. supervisor installed [15:29:41] 3. deb package that average_drifter built installed [15:30:00] i wonder if we can use upstart instead of supervisor, ops would be much happier about that [15:30:10] 4. conf file for apache -> input is the path where 3. 
installs to and the port that 2. will use [15:30:26] 5. conf file for supervisor -> input is the same as apache [15:30:41] oh, one more input for 4. -> the domain name that it's going to serve on [15:30:49] oh ok, lemme see what upstart says [15:31:01] should be easy [15:31:06] its just running a process [15:31:13] setting some env vars [15:31:18] yeah, looks the same from their description [15:31:32] where is stefan's .deb? [15:32:10] well, it can be built using his scripts and debianize submodule - he's working on it in a limn branch [15:32:22] let me pull latest. average_drifter if you're around we're about to use your deb [15:32:24] ah ok [15:32:37] do we have puppet modules for the other stuff? [15:32:39] apache I'm sure [15:32:40] well i just want to see where it puts stuff, and if we should make the upstart script part of it [15:32:42] but upstart? [15:32:44] apache yeah, supervisor no [15:32:50] upstart no, but it is just a couple of files [15:32:53] upstart comes with ubuntu [15:32:57] ok, cool [15:32:57] but, usually [15:33:02] packages that need to be run as services [15:33:15] include their own init (init.d or upstart) scripts [15:33:18] unless [15:33:24] we intend to run multiple limn instance on a single node [15:33:28] which probably makes sense, right? [15:33:33] well, the long term plan is to allow that [15:33:37] milimetric , ottomata can I have the apache vhost conf for limn to include it in the .deb please ? [15:33:44] but right now, we would need to install it to multiple directories [15:33:50] hmmmm [15:33:58] wait limn doesn't require apache, does it? [15:34:07] no, it's just how it's set up on labs now [15:34:10] milimetric: deb is not ready yet, working on it [15:34:14] that's just for static files, right? 
[15:34:36] i think it handles the routing to the localhost:LIMN_PORT [15:34:36] running server.co and going http://whatever:LIMN_PORT [15:34:38] right [15:34:40] ok [15:34:47] i don't think apache confs should be part of the .deb then [15:34:51] server.co is run by upstart [15:34:51] ottomata: cool [15:34:51] we can puppetize that bit [15:34:54] right [15:35:07] but yeah, if we want to run multiple instances [15:35:15] then maybe don't worry about including an upstart script [15:35:20] we'll puppetize it [15:35:30] so that puppet will include an init/upstart script per instance [15:35:38] yeah, ok [15:35:42] and when the deb supports that, it'll be easier [15:36:00] is it possible to run multiple instances from the same install? [15:36:01] like [15:36:12] server.co is basically the binary [15:36:17] I should be able to set env vars [15:36:38] or config vars [15:36:42] well [15:36:45] yea, BUT [15:36:46] and run server.co from the same path but in different processes [15:36:47] milimetric: did you talk to dschoon about the version thingie ? [15:36:47] right? [15:36:54] hi drdee [15:37:06] it uses the limn_install/var/data directory [15:37:16] so that's in common [15:37:17] what's limn_install ? [15:37:18] ooooh! [15:37:27] wherever the deb puts it average_drifter [15:37:35] but I just realized, it doesn't have to use that - we can configure it [15:37:39] milimetric: /srv/limn ? [15:37:42] yes [15:37:45] ok [15:38:05] milimetric: we need the version thing ready to include it in the package [15:38:06] so the limn code should be installed somewhere common, right? [15:38:11] ottomata: ok, so in theory, it should be able to use the same limn install to run multiple instances if I change the data directory to be configurable [15:38:12] so basically the version is governed by the tags [15:38:16] /usr/lib/limn or whatever [15:38:26] we were doing /srv/limn ottomata, is that no good? [15:38:34] as default data dir? [15:38:42] or for installing the code? 
[15:38:44] no as the place to install the code [15:38:46] hm [15:39:03] we can move it wherever we think it'd be more standard [15:39:20] yeah, i'm not sure in this case, limn is kinda a standalone service, but it's also a website [15:39:22] right? [15:39:35] i think /srv should contain instance specific stuff [15:39:50] common reusable code and executables shouldn't go there [15:40:13] where is node installed? [15:40:55] no idea actually [15:41:32] I have a big fat juicy 5.6M limn.deb [15:41:44] containing all the node_modules [15:41:46] and all the stuff [15:42:00] /usr/lib/nodejs/ [15:42:11] ok, so /usr/lib/limn then? [15:42:28] but i feel like limn is more like apache than node [15:42:39] yeah i think so [15:42:41] (currently, of course) [15:42:44] apache is in /usr/lib/apache2 [15:42:48] ok, perfect [15:43:05] also, maybe we can symlink /usr/bin/limn to server.co [15:43:07] ottomata: so the node_modules should go in /usr/lib/nodejs/ ? [15:43:12] yo ottomata: i know you are like super super super busy but if you could setup the rsync of /wmf/public to stat1001 that would be really really helpful [15:43:13] so you could spawn up a non-daemonized instance by doing [15:43:16] so average_drifter, we should change the limn install to /usr/lib and I have to make it so you can configure Limn to read data from any directory [15:43:25] limn --config-file path/to/limn.conf [15:43:27] or whatever [15:43:58] milimetric , ottomata , dschoon where would you prefer to have node_modules installed by the deb package ? 
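A rough sketch of the /usr/bin/limn wrapper idea being discussed: resolve per-instance settings from environment variables with defaults, then hand off to server.co. The variable names LIMN_PORT and LIMN_DATA, the default values, and the paths are all illustrative assumptions, not confirmed limn configuration.

```shell
#!/bin/sh
# Hypothetical wrapper sketch: per-instance settings come from the
# environment, with fallback defaults, before launching the server.
# LIMN_PORT, LIMN_DATA, and the defaults here are illustrative only.
LIMN_PORT="${LIMN_PORT:-8081}"
LIMN_DATA="${LIMN_DATA:-/srv/limn-data}"
NODE_ENV="${NODE_ENV:-production}"
export LIMN_PORT LIMN_DATA NODE_ENV

echo "limn: port=$LIMN_PORT data=$LIMN_DATA env=$NODE_ENV"
# a real wrapper would end with something like: exec /usr/lib/limn/server.co
```

Because the settings are plain environment variables, puppet could generate one small /etc/default-style file per instance and reuse the same wrapper.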
[15:44:00] well, how about "set environment variables" && server.co [15:44:13] hmm, yeah I guess that's fine [15:44:19] we can make a wrapper that does that too [15:44:24] which could be /usr/bin/limn [15:44:25] /usr/lib/limn/node_modules <- average_drifter [15:44:37] ok [15:44:50] well no, the deb has to have those because ops doesn't want npm doing its rogue crap on target machines [15:45:13] drdee, that's on my list, I just thought I'd talk with milimetric this morning about this, since that change will require review and won't happen until SF wakes up at the earliest [15:45:22] ok [15:45:41] reason i am asking is i am dying to show some of the stuff that's actually built [15:45:51] yeah [15:46:07] i was still kiiinda hoping that we could just productionize limn and not have to sync /wmf/public [15:46:20] if I could install hadoop client on stat1, then it would be easy too :p [15:46:24] could just do hadoop fs -get [15:46:44] buuuut [15:46:44] ok ok [15:46:45] i can do [15:46:47] yeah, drdee we can have this all working and in prod by the 28th [15:46:58] and then we can show it off as a team [15:47:07] it'll probably be good to have the public stuff synced anyway [15:47:17] so we don't have to deal with any lockdowns of analytics cluster blocking access [15:47:37] right [15:48:09] ok, ottomata, average_drifter, I'll work on those configuration improvements now [15:48:14] mk, danke [15:48:50] milimetric: ok, and if you can find out about the version thing, we'll talk about it soon, ping me when you want me to pull [15:48:52] drdee, do you want to use concat_sort on /wmf/public/mobile [15:48:53] ? [15:49:07] so that you only have a single file to deal with? [15:49:07] yes [15:49:32] yep, average_drifter, I'm updating that as well [15:49:56] drdee, here's an example of how to do that: [15:49:56] https://github.com/wikimedia/kraken/blob/master/oozie/webrequest_loss_by_hour/workflow.xml [15:50:36] oink [15:51:54] what happened to stat1 ? 
[15:52:20] http://stat1.wikimedia.org/spetrea/ <== doesn't load [15:53:39] hm [15:53:40] uhh [15:53:42] we turned off apache [15:53:43] let's see [15:53:44] we did? [15:53:47] on stat1? [15:53:54] yes me and paravoid yesterday night [15:54:05] oh, apache is running though [15:54:14] i think we stopped it [15:54:53] can we have it up again ? [15:55:00] not on stat1 [15:55:07] we really should be running all our web stuff on stat10001 [15:55:22] alright, uhm [15:55:31] can I move my stuff to stat1 ? [15:55:34] sorry [15:55:37] to stat1001 [15:55:40] or stat10001 [15:55:47] stat1001 [15:55:51] i mean, i guess so, the idea is that stat1001 is production web stuff [15:56:01] not for random 1 off hosting of things [15:56:01] no, i spoke with faidon about this [15:56:04] oh? [15:56:14] the issue is that stat1 contains private data [15:56:23] and we should minimize public access to that machine [15:56:57] i think stat1001 should be both random and production things [15:57:26] but average_drifter, you can just send me the raw data files or put them in my home folder [15:57:36] hm, ok, i was hoping that it wouldn't be random things, that way things that we want to be online are more stable (stats.wikimedia, limn report card eventually, metrics-api, etc.) [15:57:37] and right now you are working on debianization anyways :) [15:57:55] well, real random things should live in labs [15:58:00] sure [15:58:05] but not on stat1 either [15:58:10] that's fine with me [15:58:13] drdee: yes, well, a place to put the .debs somewhere where milimetric and ottomata can download them easily is good.. [15:58:17] but it's optional.. [15:58:22] right [15:58:45] i can scp from stat1 pretty easy [15:58:53] average_drifter - I think as long as anyone can build the deb easily, it doesn't matter if it takes a little long to go through the revlog [15:58:59] since that's only gonna be the first time anyway right? 
[15:59:18] milimetric: --update isn't implemented yet so it would be every time for now [15:59:22] uhm [15:59:27] but yeah we'll have that soon as well [16:00:05] exactly, I think in idealistic ways :) [16:14:35] average_drifter: I have a pageview by country question / proposal [16:14:51] do you have the code you use for doing this somewhere I can look at? [16:15:25] and would you be interested in starting a little repository of canonical (or de facto) methods for doing page view counting, so that we could compare different implementations on a standard data set? [16:15:54] erosen: sure [16:16:22] I gave Amit a version of the page views by country report and he found a few cases where the numbers don't match wikistats by significant margin [16:16:43] so I'm hoping we can pin down the problem, and start on a shared metrics repo [16:17:40] erosen: https://plus.google.com/hangouts/_/96856de55c688666f7bc3f769d67f799fa69298f [16:18:17] average_drifter: can't join quite yet--still on my train [16:18:22] erosen: ok [16:18:30] i can do that in 40 min, if that still works for you [16:18:50] erosen: average_drifter: let's make some flow diagrams we started for mobile page views already [16:18:58] drdee: yes [16:19:43] drdee: sure, but I'm primarily concerning with finding the differences at the moment (though i can imagine that the data streams could be different--but I think that is unlikely) [16:19:56] s/concerning/concerned/ [16:21:43] i can tell you all the differences [16:21:51] don't worry :) [16:21:58] swing by my desk this morning! 
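A minimal sketch of the kind of apples-to-apples comparison the shared metrics repo is meant to enable: tally requests per mobile wikiproject into a count table. The sample URLs are made up, and this is neither party's actual implementation, only the shape of the output.

```shell
# Sketch: extract the wikiproject from mobile URLs and tally them.
# Input lines are fabricated examples; a real run would read the
# sampled squid logs instead of this printf.
counts=$(printf '%s\n' \
    'http://en.m.wikipedia.org/wiki/Main_Page' \
    'https://en.m.wikipedia.org/wiki/Cat' \
    'http://ja.m.wikipedia.org/wiki/Neko' \
    'http://example.com/not-a-pageview' |
  sed -n 's|^https*://\([a-z]*\)\.m\.wikipedia\.org.*|\1|p' |
  sort | uniq -c | sort -rn)
echo "$counts"
```

Non-matching lines (like the example.com one) simply drop out of the tally, which is the same effect as the `*.m.wikipedia.org` filtering discussed later in the log.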
[16:22:01] hehe [16:22:07] seriously [16:22:28] i also pushed a pig script called Pageview to come up with a canonical page view count for kraken [16:22:30] drdee: not that I don't have utter faith in your knowledge of the stats goings on, but I'm talking about the difference between my counting algorithm and wikistats [16:22:37] it's not finished yet [16:22:42] very nice [16:22:50] right but i can tell you how wikistats works [16:22:55] and then you will be like [16:23:03] ohhhhhh yes of course there is a discrepancy [16:23:11] i see [16:23:13] see you after scrum? [16:23:18] relocating to office right now [16:23:29] sounds good [16:41:16] drdee: can I request permission to publish stuff on stat1001 ? [16:41:19] I don't have access to it atm [16:41:33] I'm referring to the new mobile pageviews reports [17:07:21] milimetric: i mostly finished the viewport refactor yesterday [17:07:36] just a few tweaks to make sure it works for all node types [17:21:10] awesome dschoon - that should be a very clean happy feeling [17:33:11] erosen: hey. thanks for the last link. I've now gathered everything here: https://github.com/geohacker/indicwiki/tree/master/data [17:33:47] erosen: https://www.mediawiki.org/wiki/User:Spetrea/What_is_a_pageview [17:33:55] nice [17:34:05] erosen: also, about the edits by geography - do you think we can find edits within India? maybe at the state level? [17:34:08] erosen: add that to your watch list, I'm adding more details to it [17:34:13] average_drifter: is that a graphviz graph? [17:34:28] erosen: yes, it is [17:34:31] average_drifter: will do, do you want to video chat after scrum with drdee? [17:34:38] erosen: yes [17:34:40] geohacker: definitely [17:35:14] geohacker i have a rather rich store of data on that matter. What level of granularity are you looking for? 
[17:35:31] geohacker: I can give you by city edits (for cities which contribute more than 10% of total edits) [17:35:44] or I can just give you country level edits for each language [17:36:04] erosen: city and country level would make sense right now. [17:36:14] I can think of a pretty nifty map mashup. [17:36:24] only if time allows me to code it up. [17:36:41] but otherwise we can look at both sets separately. [17:37:06] erosen: would be fantastic. [17:41:30] geohacker: I see you're having fun :) [17:41:53] geohacker: so here is the country level data http://gp-dev.wmflabs.org/graphs/hi_top10 [17:43:58] YuviPanda: indeed. did you see https://github.com/geohacker/indicwiki/tree/master/data [17:44:03] ah no :) [17:44:04] erosen: checking [17:45:43] erosen: ah yes. when I said country level I imagined these are edits within the country. [17:46:00] erosen: do you have city level data handy? [17:46:21] geohacker: sort of [17:46:39] erosen: awesome :) [17:48:35] erosen: ping me with a g+ link when you're ready [17:49:02] k [17:53:21] geohacker: I have a database with all of the city edits for all languages. can I just give you a csv version and you can filter out the languages / countries you care about? [17:53:37] geohacker: meanwhile I'll work on making some line charts of city-level data [17:53:43] but that will take much longer [17:54:42] erosen: that sounds perfect. [17:55:29] erosen: I'll filter all the indic projects and their activity in Indian cities to begin with. [17:55:38] cool [17:56:03] oh I'll geocode them and send it back to you if I'm successful. [18:02:53] milimetric: scrum? [18:10:38] erosen: just poke me whenever you have the link. thanks! [18:11:28] geohacker: sounds good -- in a meeting for a bit more [18:11:56] erosen: no worries. later. [18:36:20] average_drifter: how about now for the country report hangout? 
[18:36:34] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [18:36:50] erosen: sure [19:32:17] erosen: https://github.com/embr/metrics <== this is the repo ? [19:32:23] yup [19:32:26] ok [19:32:27] just putting my code in it, now [19:32:30] ok [19:40:13] milimetric: can you please try following this Deb.md file and see if you can get a deb ? https://github.com/wikimedia/limn/blob/debianization/Deb.md [19:40:29] milimetric: you should be able to get a deb that way. if you encounter any problems please let me know so I can fix them [19:41:06] ok, awesome average_drifter. I'm working on the data directory thing and will try as soon as I finish [19:41:14] ok [19:43:39] erosen: I need to ask a question. So what I've been working on is a monthly mobile pageviews report per wiki project [19:43:46] kk [19:43:48] sup? [19:43:59] erosen: but you mentioned a monthly mobile pageviews report per country [19:44:02] right ? [19:44:08] yeah [19:44:18] so we'll be working on one per country [19:44:39] we can do either [19:44:51] erosen: is Amit ok with your monthly mobile pageviews report per country ? [19:44:57] not sure [19:44:59] I should talk to him about that [19:45:02] I'll do that today [19:45:07] ok [20:02:53] ottomata, drdee: I'm stuck, can't access kripke or reportcard in labs [20:03:12] i think labs is having issues, i hear ryan lane say anyway [20:03:22] ssh kripke tells me Permission denied [20:03:25] k [20:10:04] erosen: I'm stalking your embr/metrics repo :) [20:10:09] hehe [20:10:12] updating the dot file presently [20:10:17] ok [20:18:28] okay, average_drifter: at long last: https://github.com/embr/metrics/tree/master/pageviews/embr_py [20:18:57] also, I just added you as a contributor [20:19:01] erosen: thanks [20:19:06] ok average_drifter, ottomata - Limn can now serve multiple instances from the same install / clone [20:19:23] erosen: I'll add my stuff too and we can try to compare our results on a 4-day period ? 
[20:19:31] erosen: what do you think of that ? [20:19:32] sounds good [20:19:41] erosen: you mentioned you could help me run it on kraken ? [20:19:48] i made my script so that it takes in file names as command line args [20:19:54] indeed [20:20:02] woohooo, nice! [20:20:02] ottomata: for the upstart config, the only addition to what's in supervisor configs now is the variable "LIMN_DATA" which points to the directory that *this* instance's data is linked in [20:20:07] average_drifter: I've been meaning to clean that code up a bit, so gibe me a sec [20:20:12] perfect [20:20:15] erosen: ok [20:20:17] and that is a setable environment variable, right? [20:20:25] yeah, just like NODE_ENV [20:20:34] cool [20:20:36] and I used /srv/limn-data as the default [20:20:45] hmm, ok [20:20:51] so we can probably keep all the installs similar like /srv/reportcard-limn-data [20:20:58] so, what if we created a wrapper script for server.co [20:20:59] in bin/ [20:21:00] or would you rather some other place more Linuxey? [20:21:02] just called limn [20:21:13] well, i think /srv isn't actually that standard [20:21:18] maybe, hm [20:21:23] either [20:21:30] milimetric: can I merge debianization into master ? or should we keep it as a separate branch ? [20:21:38] /var/www/limn or /var/lib/limn [20:21:42] well, we can't put something in bin because it would need different values per instance [20:21:53] right, but it would have defaults and cli options [20:22:12] and the ability to read its configs from a .conf file, or an /etc/default file even [20:22:29] the debian could then just install limn into /usr/bin/limn [20:22:32] which would just launch server.co [20:22:34] hm, ok but since this stuff makes my brains hurt (all of them) can we make it a nice to have? 
[20:22:39] yup [20:22:44] phew [20:22:44] :) [20:22:53] adding to Asana though [20:23:01] i could probably work on that wrapper [20:23:05] probably should do it in bash [20:23:28] but cool [20:23:41] https://app.asana.com/0/701374192205/4080872673353 [20:24:01] ok, so average_drifter I'm working on debianization now [20:24:07] not in master, let's merge it into develop [20:24:17] cool [20:24:18] I will merge in master when I update all the documentation and everything [20:25:34] average_drifter - are you merging into develop or should I? [20:29:07] milimetric: I can merge it [20:29:09] I'll do that now [20:29:53] average_drifter: I'm tweaking the deb as I find stuff that's different on my system [20:30:15] milimetric: merged [20:30:36] milimetric: ok no problem, just the debian/rules right ? [20:30:39] ok, I'll tweak Deb.md in the develop branch then [20:30:50] well so far sudo aptitude install libjson-xs-perl was needed for git2deblogs [20:30:51] oh yeah, that one too [20:30:56] and the ln -s for git2deblogs wasn't done [20:30:56] yea [20:31:02] I'll add - no prob [20:31:07] ok [20:34:24] average_drifter - i don't have aptitude and I don't think it's standard [20:34:34] is it ok to change the instructions to apt-get or do we need aptitude? [20:35:39] milimetric: apt-get is fine too [20:37:43] average_drifter: does the tag have to start with a 0? I added the tag "v0.6.0" [20:37:53] it has to start with 0 yes [20:37:56] I mean [20:38:03] like a standard version [20:38:07] number.number.number [20:38:10] or number.number [20:38:18] ok, cool [20:47:24] erosen: I think I forgot to mention in the flowchart that I'm checking for http://(wikiproject1|wikiproject2|...).m.wikipedia.org.* [20:47:29] erosen: do you do that filtering too [20:47:30] ? [20:56:37] hm, drdee, brain bounce w me for a sec [20:56:38] you there? 
[20:56:40] (via chat) [20:56:44] yo bounce [20:57:02] k, i'm looking into making stats user be default user for oozie jobs [20:57:06] will work great [20:57:08] but [20:57:12] need to give it access to /wmf/raw [20:57:20] we were using ldap labs projects for that [20:57:22] ryan lane said not to [20:57:24] ldap will be good [20:57:27] but not labs projects [20:57:28] so [20:57:42] also i need to get the hadoop direct ldap thing working [20:57:48] rather than shell nss stuff [20:57:53] shell nss works but is messy and no fun [20:57:53] ok [20:58:02] so, in the meantime [20:58:08] i think we should change group ownership of /wmf/raw files [20:58:12] but i'm not sure to what [20:58:15] something not managed in ldap for now [20:58:20] but I don't want to manage groups manually [20:58:25] group hdfs? [20:58:39] naw too super, we could create a new group [20:58:45] ok [20:58:47] the stats user has user group 'stats' [20:58:51] we could do that [20:58:58] and add ourselves to stats group on namenode [20:59:16] i should/will ask ops to see what they think eh? 
[20:59:22] yes [20:59:25] :) [20:59:44] ottomata: limn has a file called /Deb.md on the wikimedia develop branch [20:59:51] following those instructions, you can get a .deb out of it [20:59:54] oo, k [21:00:07] we can do that tomorrow - I'm out for the day, Valentine's plans :) [21:01:28] average_drifter: sorry for the lapse, I was chatting with amit [21:01:48] erosen: no problem [21:02:10] stats:stats sounds good to me, too ottomata [21:02:12] so I do do that filtering you were talking about [21:02:20] i'll figure out a way to represent that as well [21:02:53] but i did just get an update from Amit, which suggests that he is okay with starting from scratch [21:03:03] he just wants to know that the numbers are pretty good [21:03:52] average_drifter: so i'm thinking that we should join forces with Diederik and write some nice pig scripts to do this correctly [21:04:35] erosen: I think that sounds great [21:04:38] erosen [21:04:43] you talking mobile numbers in kraken? [21:04:49] ya [21:04:55] i got reliability numbers for you [21:04:57] ottomata: already exists? [21:05:00] percent loss per hour [21:05:06] ? 
[21:05:12] yes, i'm working on standardizing there stuff for now [21:05:17] that* [21:05:19] but [21:05:31] it has been running on all data since feb 1 and outputting in my user dir [21:05:33] uhh [21:05:34] check out [21:06:01] /user/otto/webrequest_loss_by_hour.tsv [21:06:20] (I actually just stopped this job to work on computing these over again and saving in /wmf/public) [21:06:41] (or is this not at all what you are talking about) [21:06:59] i'm thinking we are talking about different things [21:07:08] we need mobile page views by country by project [21:07:20] ok, i'm talking about giving you numbers of percent log loss per hour, so you can be sure of how accurate your data is [21:07:28] gotcha [21:07:43] well we are also thinking of using the old log files [21:07:59] aye ok [21:08:05] no idea about those :p, i guess pretty good [21:08:13] unless nagios reports packet loss [21:10:14] drdee: can i chat with you in person about mobile country report stuff on kraken/ [21:12:37] erosen: he's in with kraig [21:12:42] just saw [21:27:17] erosen: so uhm [21:27:25] sum? [21:27:29] :D [21:27:30] sup* [21:27:46] 22:47 < average_drifter> erosen: I think I forgot to mention in the flowchart that I'm checking for http://(wikiproject1|wikiproject2|...).m.wikipedia.org.* [21:27:49] 22:47 < average_drifter> erosen: do you do that filtering too [21:27:52] 22:47 < average_drifter> ? [21:27:54] oh yeah [21:27:54] erosen: ^^ [21:28:08] average_drifter: i'm doing a check like that as well [21:28:45] average_drifter: actually I take that back. 
I wasn't doing the check, i was just parsing that part into a canonical string [21:28:53] i'll update the code and repo to only consider wikipedia lines [21:29:23] erosen: I'll tell you how I do it [21:29:25] erosen: https://github.com/wikimedia/fast-field-parser-xs/blob/master/PageViews-FieldParser/Parser.xs#L387 [21:29:33] erosen: I throw in a hash stuff like [21:29:48] erosen: http://en.m.wikipedia.org , https://en.m.wikipedia.org [21:30:00] erosen: http://ja.m.wikipedia.org , https://ja.m.wikipedia.org [21:30:01] average_drifter: just to be clear, all I am going to add is "lambda r : r['project'] == 'wikipedia'" [21:30:04] the parsing is already done [21:30:51] erosen: ok [21:31:09] but i think this would be a good thing to compare for example [21:31:43] erosen: I'm doing a run right now, after that's done, let's get 10 days (you pick a range, I'm fine with any range you pick), and then I run on that range and we can compare [21:32:12] sounds good [21:32:19] what are you thinking as for an output format? [21:32:29] just number of lines? [21:32:39] or number of requests per day? [21:32:47] erosen: so for example we can have a text table like this [21:32:50] or sliced by other features [21:34:02] erosen: so for example we can first limit ourselves to the range 1-10 december 2012 [21:34:10] erosen: and we can both output the following table [21:34:27] erosen: wikiproject, count [21:34:29] in that period [21:34:46] by wikiproject I mean "en", "ja", "de", "nl", etc [21:35:32] erosen: and after that we can compare. if I get a lower/higher count, we then discuss our definitions again [21:35:41] erosen: and we find the best definition [21:36:15] average_drifter: have you been able to send the table? [21:36:19] the best definition may be yours, or mine, or a mix thereof [21:36:29] erosen: to send the table ? 
[21:36:32] which table [21:37:08] average_drifter: my bad, i missed this: average_drifter> [21:37:09] erosen: wikiproject, count [21:37:19] yea [21:38:47] average_drifter: updated flow chart and criteria [21:38:48] https://github.com/embr/metrics/tree/master/pageviews/embr_py [21:40:05] average_drifter: how about the schema: date, wikiproject, count [21:40:19] do you have the ability to pull out the request date without too much trouble? [21:43:29] average_drifter: can we use this range instead? /a/squid/archive/sampled/sampled-1000.log-2012120{1..9}.gz [21:43:40] just to make it easier to use bash completion [21:44:06] or bash iteration/substitution [21:46:06] erosen: I updated also [21:46:14] erosen: let me link you up [21:46:52] erosen: https://raw.github.com/wikimedia/fast-field-parser-xs/master/img/pageview_definition.png [21:47:04] erosen: we can use that range yes [21:47:43] drdee: new report with action=opensearch disarded is almost done [21:48:01] *discarded [22:05:18] erosen: the sh compliant way to do that is $(seq 1 9), i think [22:05:29] not that it matters :) [22:05:37] /a/squid/archive/sampled/sampled-1000.log-2012120$(seq 1 9).gz [22:06:16] interesting [22:06:23] you mean it doesn't rely on bash? [22:06:56] yeah [22:07:11] I don't think {1..9} works in /bin/sh or zsh [22:07:17] or csh [22:07:26] ya [22:07:43] i know all this stuff because i don't use bash :) [22:07:51] so i'm constantly patching other people's shit to not break my shell [22:07:55] {seq -s, 1 9} ? [22:08:05] do you need commas? 
[22:08:13] oh, sorry [22:08:15] you're right [22:08:22] I just read what you wrote above [22:08:26] :) [22:08:34] and you can't use {} [22:08:38] it has to be $() [22:08:40] (subshell) [22:08:50] seq is an executable [22:09:44] i like that decision tree, average_drifter [22:09:49] it reminds me of http://www.asciiflow.com/ [22:09:57] which i can never find an excuse to use :) [22:09:58] but i love it [22:11:44] dschoon: <3 for Valentine's day [22:11:49] hehe [22:12:03] decision tree love [22:12:20] uhm well I tried magicdraw repeatedly, then dia, then umbrello, then some other stuff [22:12:28] I ended up sticking to graphviz ... [22:12:45] asciiflow is cool too [22:13:05] it's nice when you just want to paste the diagram into an email [22:13:12] less good for IRC :( [22:14:02] 24" display here, big ascii diagrams, no problem [22:15:10] dschoon: I actually had ascii charts for a presentation a year ago, and they told me "What is that ?! You need to convert that to Visio because that's what we use here" [22:15:20] heh [22:15:23] so sad [22:15:48] drdee: http://garage-coding.com/_wiki/new_pageview_mobile_reports/r29-api-requests-with-opensearch-discarded/pageviews.html [22:16:29] drdee: they're almost 200M lower, but the bump between november and december is still present [22:16:43] yes [22:16:49] but we are going in the right direction [22:17:28] ok [22:17:48] drdee: erosen made a diagram of the logic he's using, I made a diagram of the logic I'm using [22:17:50] can you give me another random sample of mime type '-' using the current filter logic [22:17:56] that's great!
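[editor's note] One caveat on the `$(seq 1 9)` one-liner above: unlike brace expansion, a command substitution pasted into the middle of a word does not replicate the surrounding prefix and suffix, so `sampled-1000.log-2012120$(seq 1 9).gz` would not expand to nine filenames. The portable sh equivalent of the `{1..9}` brace range is a loop; a sketch using the path quoted above:

```shell
#!/bin/sh
# Brace ranges like {1..9} are not POSIX, and $(seq 1 9) glued to a word
# word-splits in the middle rather than distributing the prefix/suffix.
# A plain for loop works in any POSIX shell.
for i in $(seq 1 9); do
    echo "/a/squid/archive/sampled/sampled-1000.log-2012120${i}.gz"
done
```

In practice the loop body would pass each filename to the parser instead of `echo`.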
[22:17:58] erosen <== https://raw.github.com/embr/metrics/master/pageviews/embr_py/Pageview_definition.png [22:18:05] <== https://raw.github.com/wikimedia/fast-field-parser-xs/master/img/pageview_definition.png [22:19:01] drdee: in ~10m I'll give you some new samples for mimetype [22:19:08] ok [22:19:11] ty [22:19:19] np [22:21:25] when all this is said and done, could https://www.mediawiki.org/wiki/Analytics/Metric_definitions#Page_views be updated? [22:22:18] HaeB: yes good point [22:55:12] preilly: do you mind transferring github.com/embr/metrics to the wikimedia account and giving the analytics team ownership? [22:56:02] erosen: https://github.com/wikimedia/metrics done! [22:56:07] yay [22:56:07] thanks [22:56:12] erosen: np [23:01:31] drdee, average_drifter: metrics repo is under wikimedia [23:01:40] average_drifter: does that give you the go-ahead to commit code? [23:06:13] i think so [23:06:28] just try and see if it works [23:06:34] agree [23:06:49] drdee: what do you mean by works? [23:07:06] drdee: dvanliere is now a member of the Owners Team on GitHub [23:07:09] whether it gives you the go-ahead [23:07:17] awesome preilly! ty [23:08:14] drdee: np [23:11:37] geohacker: i've got the city data for you when you're ready [23:12:40] drdee: i'm about to share the geocoded fraction of edits from each city for each language [23:13:19] i've looked over the data and it seems to fit our plan of not showing any city with less than 10% of that country's edits [23:16:28] ok [23:16:52] drdee: i just mention it in case we should take any "data release" precautions [23:17:08] check with philippe to be real sure [23:17:13] it is all just numeric data, so in general it would take some work to abuse it [23:17:19] k, not sure he is around [23:17:25] do you know his IRC handle? [23:26:15] dschoon: python packaging q, when you get a sec [23:26:22] sure.
[23:26:26] i need to head downstairs now [23:26:29] so i'll swing by [23:26:34] dschoon: do you know how to expose a method inside a file to the top-level package namespace [23:26:37] cool [23:29:52] erosen: exports.methodName = ... ? [23:31:03] oh but you're saying top-level package namespace so I guess not.. [23:31:14] but top-level package namespace sounds like a global to me.. I may be wrong.. [23:33:43] just needed to import the files into the __init__.py namespace [23:33:50] (and optionally put them in __all__) [23:34:02] that namespace *is* __init__.py for the package root [23:35:54] oh for python [23:36:39] erosen: ok so in wikimedia/metrics I will write an implementation of the mobile pageviews from scratch? [23:36:51] no [23:37:05] just document what you are doing in perl in a flow diagram [23:37:10] ok [23:38:26] drdee: i think a working code snippet would be important though [23:38:59] i think it's better to first document, then jointly write canonical code [23:42:33] erosen: I can fall back to the logic in your diagram, and then incrementally add my filters until I get the 500M bump [23:42:52] I think that way I'll be able to find out where the bump comes from [23:43:13] that seems like a good idea [23:51:47] milimetric: i'm finishing the changes to limnpy for the new data format and I am wondering if you have a preference between json and yaml for the datasources [23:51:55] dschoon: thoughts? ^^ [23:52:10] in meeting [23:52:27] k
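[editor's note] The `__init__.py` re-export pattern erosen settles on above can be sketched as follows. The package and function names (`metrics_demo`, `count_views`) are hypothetical, and the files are written to a temp directory purely so the example is self-contained and runnable:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package on disk to demonstrate the re-export pattern.
pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "metrics_demo"
pkg.mkdir()

# metrics_demo/pageviews.py -- a submodule defining the function.
(pkg / "pageviews.py").write_text(textwrap.dedent("""
    def count_views(records):
        return len(records)
"""))

# metrics_demo/__init__.py -- importing the name here lifts it into the
# package's top-level namespace; __all__ controls `from metrics_demo import *`.
(pkg / "__init__.py").write_text(textwrap.dedent("""
    from .pageviews import count_views
    __all__ = ['count_views']
"""))

sys.path.insert(0, str(pkg_root))
import metrics_demo

# Callers can now skip the submodule path (metrics_demo.pageviews.count_views).
total = metrics_demo.count_views([1, 2, 3])
```

In a real checkout the `from .pageviews import count_views` line would simply live in the package's committed `__init__.py`; the temp-directory scaffolding is only for the demo.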