[00:00:22] that's great news!!!!!!!! [01:24:52] BOOM ! [01:25:03] 300% improvement in parsing speed [01:25:15] BAM BAM BAM [01:25:24] dschoon: you around? [01:25:45] drdee: can you make a bitbucket account ? [01:25:56] bitbucket? [01:25:57] why [01:26:12] drdee: I got a small repo, but it's private so you need an account to see it [01:27:19] drdee: can I put some code I've fiddled with on github ? it's wikistats-related [01:27:35] actually just mobile-pageviews related [01:28:42] just put it on github.com/wmf-analytics [01:28:45] ok [01:31:40] do you have a new report ready? [01:32:20] not right now, but I'll have it x3 times faster [01:32:22] working on it [01:32:26] https://github.com/wmf-analytics/fast-field-parser-xs/blob/master/Extreme-Field-Parser/Parser.xs [01:32:31] this is the parser [01:33:45] this is basically the way I'll use it https://github.com/wmf-analytics/fast-field-parser-xs/blob/master/Extreme-Field-Parser/benchmark/efp-xs.pl [01:35:14] you are funny [01:35:56] averag_drifter ^^ [01:36:03] :D [01:36:42] now I can have a report in 8h [01:36:53] awesome! [01:37:31] I was thinking of writing assembly instead of the XS to make it more fast, but my conscience kicked in [01:37:53] i hope you are kidding :D [01:38:07] i would have come over to romania to kick your ass [01:38:16] :) [13:34:09] moooooorning guys! [13:35:35] drdee: morning :) [13:36:30] you wanna demo something :D ? [13:37:50] I just wanna push to gerrit first [13:38:11] I'll fire up the new report after, had to implement a lot of stuff [13:41:03] new patchset [13:41:04] https://gerrit.wikimedia.org/r/#/c/41979/ [13:41:39] drdee: used patricia tries for the ip ranges for googlebots [13:41:57] now starting report [13:42:09] AWESOME VERY VERY AWESOME! [13:42:55] in which file are the patricia tries implemented? [13:46:03] drdee: https://gerrit.wikimedia.org/r/#/c/41979/8/pageviews_reports/lib/PageViews/BotDetector.pm [13:46:10] drdee: didn't implement them, just used them [13:46:16] Net::Patricia from CPAN [13:46:21] http://search.cpan.org/~gruber/Net-Patricia-1.20/Patricia.pm [13:47:02] k [14:17:13] morning ottomata [14:19:11] morning! [14:26:12] wanna try the vumi stuff again? (jeremy emailed and he fixed it) [14:27:15] well, i see he's getting logs in /var/log/vumi/metrics.log [14:27:18] so it looks like it is working to me! [14:29:34] and this can be easily relayed to a new instance of udp2log? [14:31:20] yup, well [14:31:28] not if the app is running on labs [14:31:33] but if they deploy in production, ja no prob [14:31:46] eqiad somewhere would be best [14:33:38] i don't think any of them knows about this, we should poke somebody in ops, maybe preilly knows about the vumi setup [14:34:04] well i mean, they can't be planning to deploy this publicly while it is on labs, right? [14:34:44] I keep asking in that thread about if/when they are deploying to production, no answer yet [14:43:58] just asked in ops channel, mark does not know anything about it either [14:44:11] i will ask one more time once west coast wakes up [14:49:46] drdee: what did i do? :) [14:50:27] 09 14:26:11 < drdee> wanna try the vumi stuff again? (jeremy emailed and he fixed it) [14:51:03] not you :) another jeremy [14:51:16] who dat? [15:07:14] ottomata: did you mess around with the limnify thing yet? [15:07:26] yup, see emails :) [15:07:42] weird, just refreshed inbox and there they were [15:21:11] goood morning [15:23:41] Good afternoon. 
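A side note on the Patricia-trie change mentioned above (BotDetector.pm pulling in Net::Patricia): the trie answers "which CIDR block, if any, contains this address?" in time proportional to the key length instead of scanning every range. A rough Python sketch of the same lookup, using only the standard-library ipaddress module with a linear scan standing in for the trie (the ranges and labels are illustrative, not the real bot list):

```python
import ipaddress
from typing import Optional

# Hypothetical CIDR blocks -> labels (e.g. Googlebot ranges); values are illustrative only.
RANGES = {
    "66.249.64.0/19": "googlebot",
    "216.239.32.0/19": "googlebot",
}

# Parse the networks once up front, the way Net::Patricia builds its trie at load time.
NETWORKS = [(ipaddress.ip_network(cidr), label) for cidr, label in RANGES.items()]

def classify(ip: str) -> Optional[str]:
    """Return the label of the most specific range containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [(net, label) for net, label in NETWORKS if addr in net]
    if not matches:
        return None
    # Longest prefix wins, which is what the Patricia trie gives you without the scan.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(classify("66.249.66.1"))   # -> googlebot
print(classify("192.0.2.1"))     # -> None
```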
:) [15:25:03] gooood morning milimetric [15:25:17] hey drdee [15:25:20] deploying to prod as we speak [15:25:28] this, for once, should be painless [15:25:36] then I'll process EZ's dagta [15:25:37] *data [15:25:45] i was just gonna ask :D [15:26:12] and then I'll look at ottomata's work hopefully there's still time to make a dashboard or something out of it [15:26:26] aight [15:26:34] but this raises a question [15:26:41] where would we deploy such a dashboard? [15:26:48] not the reportcard of course [15:26:54] we need a new site somewhere [15:27:08] milimetric, I have that file for you, trying to put in in hdfs public, but my proxy is being weird [15:27:11] should have that fixed in a moment [15:27:38] cool, no problem, I'm behind with the deployment still [15:44:42] hey drdee, milimetric [15:44:55] about mobile webrequest data [15:45:11] dschoon suggested that I import it from kafka every 15 minutes, or as often as possible [15:45:15] I'm doing that [15:45:23] but, my pig script generates counts per continent per hour [15:45:30] just import once per hour [15:45:47] we go from 1 report per month [15:45:50] to hourly reporting [15:45:58] there is really no need to go even more granular [15:46:02] i think dschoon was hoping to almost be able to see live data with the graph updating [15:46:09] sure i understand [15:46:14] but who asked for that? [15:46:21] hmmm, yeah maybe we can do that wehn we have more robust stuff, especially if when we are doing a full unsampled stream [15:46:34] this is nice-to-have, not must-to-have IMHO [15:47:07] aye [15:47:08] hm [15:47:22] drdee, can I make an oozie dataset work on 3 directories at once? [15:47:42] hey otto, I've got reportcard offline deploying it, just a few moments I'll give you my attention [15:47:45] you could specify multiple input datasets [15:48:11] hmm, but then each one would need to be passed as an arg to pig? [15:51:04] yes, or you would have to do a merge of the 3 directories first as a separate oozie task [15:51:17] or [15:51:30] you have one parent folder that contains the 3 sub folderes [15:51:36] then you just supply the parent folder as input [15:52:03] naw, can't do the latter there [15:52:12] this is the problem where for Jan 7 I need Jan 6 and Jan 8 data [15:52:15] too [15:52:23] and then for Jan 8, I'll need Jan 7 and Jan 9 [16:01:21] hm, I have a problem [16:01:35] I can't deploy because npm install needs /home/milimetric/.npm [16:01:48] but on reportcard2, /home/milimetric is a read-only filesystem it says [16:01:54] restart the machine [16:02:01] ... :) [16:02:04] after reboot it will be fixed [16:02:11] huH? [16:02:24] homes are readonly because of migration of some backend labs stuff [16:02:29] oh really? [16:02:29] they will become write again after rerboot [16:02:30] yes [16:02:33] ok [16:02:39] glad I asked [16:02:50] subscribe to wikimedia-labs [16:03:39] ottomata: "restart -r now" is ok to run? [16:06:24] shutdown I mean [16:06:28] and I just did it :) [16:14:02] drdee - you screwed me pretty hard :) [16:14:09] I can't ssh into the machine anymore [16:14:24] you're welcome :D [16:14:33] haha [16:14:39] but this was the solution :) [16:14:41] which machine milimetric? [16:14:43] ottomata, can I get your help somehow? ssh into reportcard2 [16:14:44] Creating directory '/home/milimetric'. [16:14:44] Unable to create and initialize directory '/home/milimetric'. 
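An aside on the "oozie dataset on 3 directories at once" problem above: the job for one period has to read its neighbours too, because requests belonging to that period can land in the adjacent imports. A minimal sketch of that window, assuming a hypothetical per-day directory layout:

```python
from datetime import date, timedelta

def input_dirs(day: date, base: str = "/wmf/raw/webrequest-wikipedia-mobile"):
    """Directories the job for `day` must read: the day before, the day itself,
    and the day after, since log lines for `day` can spill into either neighbour."""
    window = (day - timedelta(days=1), day, day + timedelta(days=1))
    return [f"{base}/{d:%Y-%m-%d}" for d in window]

print(input_dirs(date(2013, 1, 7)))
# ['/wmf/raw/webrequest-wikipedia-mobile/2013-01-06', '.../2013-01-07', '.../2013-01-08']
```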
[16:14:57] so i would say, head over to #wikimedia-labs and ask for assistance [16:15:01] k [16:15:34] i can't get in either [16:15:43] *** /dev/vda1 will be checked for errors at next reboot *** [16:15:43] *** /dev/vdb will be checked for errors at next reboot *** [16:15:55] sounds promising [16:16:48] ottomata: i found an interesting case where two cidr ranges overlap [16:16:58] 41.66.28.73 , [u'orange-botswana', u'orange-ivory-coast'] [16:17:17] haven't found exactly which range, but will shortly [16:20:56] oh, you mean that IP shows up in two of the logs? [16:21:15] sort of [16:22:24] I am finding out by explicitly checking an ip against all of the cidr ranges [16:22:24] and I am finding that a lot of the ips which are supposed to be coming from orange-cameroon are also matching the orange-botswana range [16:22:26] ottomata: do you think it'd be easier to spawn a new reportcard3 or something instead of fixing reportcard2? Is the puppetization for that up to date with supervisorctl and all that? [16:23:31] naw, that's not puppetized at all [16:25:06] erosen, just looking at Partner IP Ranges page, that IP you listed should only be in orange ivory coast [16:25:14] hmmm [16:25:26] sounds like I'm doing it wrong then hehe [16:26:27] uhhhhhh, I don't have an orange botswana filter running [16:26:34] yeah [16:26:49] i have my own copy of the zero partner ranges [16:27:36] um, no , i mean, i'm looking at the wiki page and noticing [16:27:46] I have 17 specific filters for partner ranges in the udp2log stuff [16:27:53] and there are 21 defined providers on that page [16:27:59] https://office.wikimedia.org/wiki/Partner_IP_Ranges [16:28:05] yeah [16:28:14] is that ok? [16:28:19] i manually translated those ip ranges into a json file [16:28:20] i thought Partner IP Ranges was supposed to be in sync [16:28:26] yeah but [16:28:26] and I am doing my own ip range checking [16:28:27] i mean [16:28:35] this means we aren't collecting logs for a lot of partners [16:28:44] yeah [16:28:50] is that ok? [16:28:51] true [16:28:53] not sure [16:29:06] afaik, amit was supposed to tell me whenever he changes that page [16:29:09] i assume that if they haven't bugged me or you about it, it means it is low priority, or just starting [16:29:11] and I keep the udp2log filters in sync [16:29:27] botswana says [16:29:27] Launch Date: Oct 2012 [16:29:36] hmm [16:29:52] i'll ping amit when he get's online [16:29:58] gets? [16:31:17] I am not running these filters: [16:31:17] 4.15 Hello Cambodia (HL) [16:31:17] 4.16 Celcom Malaysia (CL) [16:31:18] 4.17 Orange Congo (CD) [16:31:18] 4.18 Orange Botswana [16:31:35] i guess worst case we can point reportcard.wmflabs.org to test-reportcard.wmflabs.org [16:31:52] those are the last 4 with the exception of morocco right? [16:32:03] seems likely they just forgot to notify us [16:33:25] average_drifter: https://github.com/TheWeatherChannel/dClass/pull/1 [16:34:40] milimetric, no response in # labs? can I try restarting the instance again? [16:34:50] going to restart through labsconsole, maybe it does soethign special [16:34:54] sure [16:34:59] nothing in labs [16:35:00] Ryan Lane will be online in an hour or so, he can certainly help [16:35:10] someone said that they might reconsider helping me if drdee also has a problem :) [16:35:19] * milimetric doesn't have enough cred [16:35:21] :) [16:35:33] :D [16:35:59] cool, i guess it's no huge rush [16:36:19] but this is probably why we should host reportcard on production somewhere [16:36:35] ungnhhhh build your .deb! 
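On the overlapping-CIDR mystery that gets resolved above: a routing prefix mistyped as /2 keeps only the first two bits of the address, so the "range" covers a quarter of the IPv4 space and swallows other providers' blocks. A quick illustration (the intended /22 is an assumption, picked only to contrast with the typo):

```python
import ipaddress

ip = ipaddress.ip_address("41.66.28.73")

intended = ipaddress.ip_network("41.66.28.0/22")               # plausible carrier-sized block
typoed = ipaddress.ip_network("41.66.28.0/2", strict=False)    # the /2 slip collapses to 0.0.0.0/2

print(ip in intended)   # True  - the match that was meant
print(typoed)           # 0.0.0.0/2, i.e. 1,073,741,824 addresses
print(ip in typoed)     # True  - as is every address below 64.0.0.0
```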
[16:36:36] hehe [16:36:42] actually......>>>...... [16:36:47] milimetric, in the meantime, shall we have a look at ez's data? [16:36:48] if we hosted it on an01 [16:36:52] hmmmm [16:36:53] naw [16:38:33] yeah, the hack for now, if all is lost with reportcard2, is just to point to kripke and work on the deb ASAP [16:39:17] im' sure all is not lost [16:39:21] we'll just wait for Ryan Lane :) [16:39:41] he usually just kicks the machine and then it magically works again [16:39:46] awesome [16:39:55] i have whatever the opposite of that is [16:51:11] ottomata: btw, the whole overlapping ip ranges thing was a mistake on my part i had a typing which created the range with the routing prefix /2--whoops, so no reason to worry about orange-botswana other than the fact that there is not filter for it currently [16:54:45] ahh, cool [16:54:46] ok cool [16:54:51] also talking to amit [16:55:09] he says he doesn't need the filters for most of the missing ones, except congo [16:55:14] but he is e-mailing you presently [17:17:21] drdee, of course there's a bug in the scripts (running EZ's data) [17:17:39] uuuuhhhhhhhh, really? [17:17:55] yep, YALP [17:17:57] [17:17:59] D: [17:18:00] yet another labelling problem [17:18:00] :D [17:18:19] the day we can bury this part of the data pipeline we should throw a party [17:19:09] oh, consider it thrown [17:26:18] drdee, brainbounce [17:26:28] i am understanding oozie datasets more, and think I can solve my problem with them [17:26:35] at your service! [17:26:49] but, I need to tell the pig script to filter out data for only the hour we are currently interested in [17:26:59] param? [17:27:03] i could easly make that a paramter, and do [17:27:10] FILTER DATA BY hour == '$HOUR' [17:27:11] or whatever [17:27:14] but [17:27:22] we have a pig udf that does that IIRC [17:27:34] i would like this script to be able to generate data for ALL hours if the $HOUR parameter is not given [17:27:34] or at least convert it to a timestamp [17:27:46] so, what I want is [17:28:00] if ($HOUR IS NOT NULL) [17:28:00] FILTER DATA BY hour == '$HOUR' [17:28:01] right? [17:28:04] yes [17:28:05] but I don't konw if I can do that in pig [17:28:10] but not sure if pig understands that [17:28:12] i tried ternary conditional [17:28:22] FILTER DATA BY ('$hour' IS NOT NULL ? hour == '$hour' : 1 == 1); [17:28:27] but that doesn't work :p [17:28:35] does pig have the concept NULL? [17:28:37] yes [17:28:56] and it actually starts the script if you don't supply the $hour variable (in the shell) [17:30:41] ? right [17:30:47] yeah I want to not have to make 2 different scripts [17:30:58] one that only works witha given hour, and one that will output buckets for all hours in the data given [17:31:06] hmm, did you know about this? 
[17:31:07] http://pig.apache.org/docs/r0.9.1/cont.html#embed-python [17:31:07] hehe [17:31:17] i could use that, but seems complicated [17:31:22] can do pig script templating [17:31:23] that way [17:31:42] see the Conditional Compilation example [17:32:56] yes but let's not do that :D [17:33:00] yeah, agreed [17:33:06] hmm, i could make the date match an expression [17:33:15] and provide a default expression that matches all dates [17:33:42] so if you first try this: [17:33:42] that's actually cool, because then you could use the parameter to restrict your output to whatever hours you wanted via regex [17:33:50] FILTER DATA BY ('$hour' IS NOT NULL) [17:33:56] and don't supply $hour [17:33:59] to see if that runs [17:34:17] so basically it should just ignore the filter statement [17:34:27] if that works [17:34:38] ok, i mean, that compiles and works [17:34:40] but doesn't do anything [17:34:50] that's like [17:34:53] FILTER DAT BY false; [17:34:55] this line: FILTER DATA BY ('$hour' IS NOT NULL ? hour == '$hour' : 1 == 1); [17:34:57] there's no actual filter [17:35:26] right but that's good because it means in theory you can do what you want [17:35:41] you just need to fix the ternary conditional [17:35:58] if you go into grunt [17:36:02] and do EXPLAIN() [17:36:13] that might give some insights [17:36:48] hmm, i could do [17:37:21] something like: [17:37:21] FILTER DATA BY hour MATCHES ($hour IS NOT NULL ? '$hour' : '*') [18:00:14] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [19:09:44] drdee, its amazing how active hue development is, we keep finding things that we need that have just been implmeneted [19:09:45] http://grokbase.com/t/cloudera/hue-issues/12cmqsrncq/jira-created-hue-983-oozie-coordinator-could-support-previous-dates [19:09:52] this is what I need to do my datasets [19:10:03] i can do it in oozie coordinator.xml, but not in hue yet [19:10:13] cool [19:23:44] milimetric: http://reportcard.wmflabs.org/ [19:23:47] == 502 [19:24:08] or back now? [19:24:11] :) lol [19:24:14] you must have been deploying [19:24:18] yes dude [19:24:22] I just finished [19:24:37] i think your illness has left you with strangely good timing [19:24:44] must be [19:24:48] prescient [19:24:52] you're prescientman [19:25:06] so there are issues [19:25:26] which I'm not sure whether or not we should fix before presenting to EM [19:25:36] the main one is that the wikivoyage graph no longer works [19:25:37] which? [19:25:41] huh [19:25:48] and neither do any graphs like kbye and newer things we added [19:25:54] except editors_by_geo, that one's fine [19:26:16] milimetric, just email ez [19:26:17] lol [19:26:20] http://reportcard.wmflabs.org/graphs/wikivoyage [19:26:21] okay, we should probably fix that. 
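Picking up the Pig hour-filter thread from earlier (the ternary-conditional attempt and the MATCHES idea): the variant that later ships is a regex parameter that defaults to '.*', so the same script handles "one hour" and "all hours" without two code paths. The equivalent decision in a small Python sketch (the parameter name and bucket format are assumptions based on the chat):

```python
import re

def keep(row_hour: str, hour_regex: str = ".*") -> bool:
    """Mirror of a Pig filter along the lines of `hour MATCHES '$HOUR_REGEX'`.

    With the default ".*" every bucket passes, so the filter is effectively a no-op;
    fullmatch is used because Pig's MATCHES tests the whole string."""
    return re.fullmatch(hour_regex, row_hour) is not None

rows = ["2013-01-09_06", "2013-01-09_07", "2013-01-09_08"]
print([h for h in rows if keep(h)])                    # all three buckets (default .*)
print([h for h in rows if keep(h, "2013-01-09_07")])   # only the 07:00 bucket
```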
[19:26:37] yeah, I'm trying to re-write the graph into the new format [19:26:40] see if that does it [19:26:53] yeah [19:26:54] so let me give you a little refresher on what's happened [19:27:01] we're down to only develop and master [19:27:05] cool [19:27:08] for both reportcard-data and limn [19:27:15] i barely have a brain atm [19:27:18] just fyi [19:27:18] ok [19:27:20] oh [19:27:34] nvm, just feel free to do whatever you like and run it by me if you want to commit :) [19:27:34] hopefully the merge didn't break anything [19:27:42] a LOT of stuff was broken [19:27:45] but mostly deployment crap [19:28:00] we can tell stories around the campfire later, now we gotta get this shit running [19:28:09] the deployer is totally black magic, i said that :) [19:28:17] +1 [19:28:19] agreed [19:28:30] i'm going to stop pestering and resume recuperating [19:28:34] k [19:28:44] if you want to look at this: http://reportcard.wmflabs.org/graphs/wikivoyage [19:28:52] and see if you can think of anything, that'd be useful [19:28:53] drdee, brainbounce! [19:28:57] yooooo [19:29:07] we'll see [19:29:18] siiick [19:29:35] so, getting there, but ok [19:29:48] i need to get oozie coordinator to pass a param to pig [19:29:57] that says the current hour it is operating on [19:31:25] k [19:31:38] so, somehow in oozie i'm almost certain that is possible [19:31:54] ${coord:nominalTime()} something something [19:31:55] dunno yet though [19:32:03] i guess you haven't run into that yet? :p [19:32:37] no not yet but look at https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases [19:32:39] for examples [19:32:48] of ${coord:nominalTime()} [19:33:27] yeah i have seen that doc….HMM oh i think I found it [19:33:38] i can set properties in the coord's workflow def [19:35:56] i think thi will do it: [19:35:56] [19:35:57] HOUR_REGEX [19:35:57] ${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd_hh')} [19:35:57] [19:36:33] you know, we really start writing this down in a wiki [19:36:43] because right only you and i know how to do this [19:37:23] right, and I baarreelelly know how [19:37:26] i mean, right now we are working it out [19:37:34] if we actually make this work [19:37:40] we can standardize and doc lots of this [19:37:50] how do I submit a coordinator via CLI? [19:38:33] (googling… : ) ) [19:38:45] oozie -job submit job.properties [19:38:49] then you get an id in return [19:38:52] then [19:38:58] oozie -job run {jobid{ [19:39:20] job.properties contains the workflow path etc [19:40:32] so, deployment of new Limn to http://reportcard.wmflabs.org/ is complete [19:40:47] please kick the tires, let me know if you see anything broken [19:40:56] so I need to upload my workflow.xml and coordinator.xml to the wf.application.path I set in job.properties? [19:41:00] I know about the wikivoyage graph being messed up [19:41:11] brb, lunch [19:41:49] milimetric, clickin on the .csv file link just refreshes the page [19:42:26] it also downloads for me [19:42:41] OH! [19:42:42] sorry [19:42:45] it does for me too [19:42:47] but it does seem to do something weird [19:42:47] didn't notice it [19:42:50] to the ui [19:42:59] i looked to see if it was actually reloading and it isn't [19:43:23] "so I need to upload my workflow.xml and coordinator.xml to the wf.application.path I set in job.properties?" 
[19:43:37] yes but it assumes by default hdfs://user/otto/ [19:43:44] so the path should be relative to that [19:43:57] you have it set absolute in your /home/diederik/job.properties [19:43:58] the path to coordinator.xml i mean [19:44:00] can I just do that? [19:44:08] oozie.wf.application.path=${nameNode}/user/otto/oozie/webrequest/count_by_hour_by_continent_A [19:44:17] i think that's actually bad practice) [19:44:51] i shoudl just say [19:44:51] oozie.wf.application.path=oozie/webrequest/count_by_hour_by_continent_A [19:44:51] ? [19:44:57] try it [19:44:58] hue creates it with absolute path [19:45:00] ok [19:45:07] um, I need oozie url? [19:45:07] brb relocating [19:45:12] ? [19:58:34] oh that would be analytics1010.eqiad.wmnet:11000/oozie [20:14:53] ok dschoon, figured out what was wrong with wikivoyage - updates to graph format didn't make it back to kbye and wikivoyage was written manually in the old format so it didn't quite work [20:22:53] ugh... and i'm wrong again - it's the data, drdee you were right [20:23:04] i'm not sure what EZ could do about it, I'll take a look [20:26:04] huh [20:26:15] what's the issue? [20:27:06] (milimetric) [20:27:35] it doesn't appear to actually load any datafiles, btw [20:27:37] LineNode was borking if stroke wasn't specified, so I thiought it was that [20:28:19] but it's not - I think it's either Limn can't work with multiple datasources or the datasources are messed up somehow [20:28:28] ohh [20:28:38] the line thing seems possible. [20:28:46] i remember i was working on that ages ago... [20:28:58] i forget if i finished. i'm not the sharpest tool in the shed atm [20:33:02] ok, fixed the line thing [20:41:31] milimetric, dschoon [20:41:38] what's up [20:41:40] do all datafiles have to be in one file? [20:41:45] no [20:41:47] what if I had hourly data files? [20:42:00] oh you mean per metric? [20:42:03] yea, right now yes [20:42:17] each metric needs to be totally spelled out in one datafile [20:42:19] we could add support to read from stdin on limnify [20:42:25] the graph can point to metrics from different datafiles [20:42:26] and then you can just cat them? [20:42:28] oh to cat them? [20:42:31] hmmmm, yeahhhhhhhHHHHhhh [20:42:45] i'm only asking, because I can't append to a file in hdfs [20:42:49] so i'll have to recreate it anyway [20:42:55] yeah, you have to use the api to append [20:42:59] erosen, that would be cool, i can probably do with out it though too [20:43:21] i'm in the middle of something for a bit, but I could work on it this afternoon [20:43:24] i could liminify the latest, then cat datafiles/* new_data > datafile and put into hdfs [20:43:29] no hurry [20:43:31] k [20:43:42] i guess it would be cool if limnify did it, hmm, yeah [20:43:54] ooo, yeah, then I could keep them separately generated by oozie [20:43:58] yeah that would be pretty easy to just accept any number of data files [20:44:18] pd.concat will do what we need [20:44:27] and as the end of every workflow jsut rm datafile.csv && cat generated/* | limnify > datafile.csv [20:44:38] oh cool [20:44:54] which way is better stdin or multiple data args? [20:45:26] just as a note from ages ago [20:45:30] but it'd be awesome if limnify was part of limn [20:45:39] which really only means "was written in JS" [20:45:48] does it have python deps these days? [20:45:52] yeah [20:46:02] nontrivial ones, like numpy? [20:46:03] it pretty heavily relies on pandas [20:46:04] yeah [20:46:05] or pandas? 
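On the limnify question just above (stdin vs. multiple file arguments): either way the core is concatenating the per-hour CSVs into one Limn datafile, which pandas does directly with pd.concat. A minimal sketch, assuming the hourly outputs share a header (paths and column names here are hypothetical):

```python
import glob
import pandas as pd

# Hypothetical per-hour outputs from the Oozie/Pig job, all with the same header,
# e.g. columns: date, Africa, Asia, Europe, ...
parts = sorted(glob.glob("generated/*.csv"))

frames = [pd.read_csv(p) for p in parts]
combined = pd.concat(frames, ignore_index=True)

# One datafile per metric, as Limn expects; sort so the time series stays in order.
combined.sort_values("date").to_csv("datafile.csv", index=False)
```

Reading a cat-ed stream from stdin is the same thing with pd.read_csv(sys.stdin) as the only input, which matches the `cat generated/* | limnify > datafile.csv` idea from the chat.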
[20:46:05] yeah [20:46:07] figured [20:46:11] which in turn relies on numpy [20:46:12] okay, scratch that idea :) [20:46:19] why does it have to be written in JS? so that it can use js libs to figure out format? [20:46:33] i mean I'm happy to pull out the transformation stuff into js [20:46:39] but I at least need an interface [20:46:42] for python [20:46:53] i was hoping to run it in the client [20:47:00] ah [20:47:02] so we could do all this in the UI [20:47:04] but! [20:47:09] that's okay [20:47:15] drdee!!!! YESSSSSS IT IS WORKING [20:47:21] http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/mobile_hour_by_continent_A [20:47:27] its running now, going through the existing dat [20:47:28] data [20:47:40] COOOL BEANZ MR OTTOMATA!!!!! [20:47:43] hot! [20:47:44] using 6 input files each run, but only generating output for a single hour each run [20:47:51] its really smart! [20:48:02] drdee, can I show you how this works real quick? [20:48:09] yes please do! [20:48:22] http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/oozie/webrequest/count_by_hour_by_continent_A?file_filter=any [20:48:36] coordinator? [20:48:41] there is also a coordinator.properties file, but it has the usual stuff you'd expect [20:48:41] yeah [20:48:58] look at workflow.xml first [20:49:01] because you're used to it [20:49:18] yup [20:49:19] actually, I think everythign in workflow is stuff you have seen before [20:49:23] with the parameters etc. [20:49:31] yes i have [20:49:32] the only thing I added for my stuff was ${HOUR_REGEX} [20:49:38] i'll show you how that gets computed [20:49:50] but, that can be anythign you want to figure out which hours you are interested in [20:49:55] we can probably abstract that concept out later for any timestamp [20:50:02] only one minor thing is to change the kill action, and make it send an email [20:50:03] when we get better at abstracting pig stuff [20:50:05] but that's minor [20:50:07] ah ok [20:50:08] cool [20:50:11] ok cool, so that's workflow [20:50:15] aight [20:50:16] now checkout coordinator.xml [20:50:22] reading [20:50:30] at the top is the dataset definition [20:50:48] k [20:50:55] you've seen that before too [20:50:55] for mobile [20:50:57] frequency is 15 minutes [20:50:58] yup [20:51:10] ok, the new stuff is in input-events and output-events [20:51:14] just below that [20:51:17] [20:51:18] ${coord:current(-5)} [20:51:18] ${coord:current(0)} [20:51:38] here i'm creating an input event parameter called INPUT [20:51:56] k [20:51:59] and saying that it includes all webrequest-wikipedia-mobile dataset instance between −5 instance ago and the current one [20:52:02] so [20:52:05] 6 instance total [20:52:14] so if the current one is 08:00 [20:52:22] that would be ${coord:current(0) [20:52:24] then [20:52:27] ${coord:current(-5) [20:52:35] would be [20:52:40] 06:45 [20:52:57] so for 15 minute intervals stating at 06:45 and ending at 08:00 [20:52:59] that ends up being [20:53:17] 06:45 [20:53:17] 07:00 [20:53:17] 07:15 [20:53:17] 07:30 [20:53:17] 07:45 [20:53:17] 08:00 [20:53:20] smart indeed [20:53:40] so [20:53:50] even though coord:current(0) is 08:00 [20:54:00] we are actually interested in computing the data for the 07:00 hour [20:54:10] because that is the hour for which we know we have all the data [20:54:11] so [20:54:23] the output instance is defined as [20:54:24] [20:54:24] ${coord:current(-4)} [20:54:34] coord:current(-4) == 07:00 [20:54:46] (you could compute a few different ways, i'm sure if you wanted to) [20:54:48] but that works [20:54:51] 
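To restate the coordinator arithmetic walked through above: with a 15-minute dataset frequency, ${coord:current(-5)} through ${coord:current(0)} names six consecutive instances ending at the run's nominal time, and the instance four steps back is the last hour for which all the data has arrived. The same bookkeeping in Python, using the 08:00 example from the chat (assumes the nominal time falls on an instance boundary, as it does here):

```python
from datetime import datetime, timedelta

FREQUENCY = timedelta(minutes=15)        # dataset frequency from coordinator.xml

def instance(nominal: datetime, offset: int) -> datetime:
    """Rough equivalent of ${coord:current(offset)} for this dataset."""
    return nominal + offset * FREQUENCY

nominal = datetime(2013, 1, 9, 8, 0)     # this run's nominal time

inputs = [instance(nominal, i) for i in range(-5, 1)]   # current(-5) .. current(0)
output = instance(nominal, -4)                          # current(-4)

print([t.strftime("%H:%M") for t in inputs])  # ['06:45', '07:00', '07:15', '07:30', '07:45', '08:00']
print(output.strftime("%H:%M"))               # '07:00' - the hour with complete data
```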
soooooooo, copy this explanation from the chat log and put it in a wiki :D [20:55:03] I will! I will! I'm not quite ready yet [20:55:08] one more thing [20:55:11] ok [20:55:13] ok ok ok ok ;) [20:55:15] i tell you waht, i'll copy it for now [20:55:18] but i won't organize it yet [20:55:24] it will be organized, believe you me! [20:55:31] I BELIEF YOU! [20:55:31] but, yeah, one more thing [20:55:32] hehe [20:55:54] down below, [20:56:01] in the section [20:56:25] there are a bunch of properties defined, those get passed to the workflow as variables [20:56:26] (that's how OUTPUT and INPUT get set for the pig sscript) [20:56:40] and that makes it a full circle! [20:56:50] I also made it so pig will filter for only an hour regex [20:56:53] which is by default .* [20:56:56] k [20:56:59] so i'm setting $HOUR_REGEX [20:57:05] to ${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), 'yyyy-MM-dd_HH')} [20:57:20] coord:nominalTime() is jsut the timestamp for the current run [20:57:26] and final final request is to run this using the 'stats' user [20:57:32] in my example that would be 08:00 [20:57:44] so, i'm subtracting an hour from 08:00 [20:57:52] and putting in the hour format the the pig script is filtering on [20:57:56] yupyup [20:58:08] anyway, yeah! [20:58:10] its running [20:58:17] whenever they release a new hue [20:58:24] this is really f***** cool [20:58:26] we should be able to select multiple datasets like this with hue [20:58:38] for now, that's not supported though [20:58:44] so you have to edit your .xml files and submit via cli [20:58:54] and this data is already visualized through limn as well [20:58:54] ? [20:59:16] not yet, that's what I need to figure out next [20:59:19] k [20:59:19] best way to script that together [20:59:24] aight [20:59:26] hopefully as an oozie action in the workflow too [20:59:46] erosen is going to make limnify accept via stdin [20:59:56] and i'm going to look into running a shell oozie action with it [20:59:58] so i think we should think about two general oozie workflow's [21:00:08] 1) merge different datasets together [21:00:16] 2) limnify a dataset [21:00:27] so that other workflows can just reuse those as necessary [21:01:46] yeah, we shoudl have a shared limnify action that can be easily tacked on [21:01:58] not sure what you mean by 1) though [21:32:09] drdee, just brain bounced with ryan for a bit about the cluster access problem i've been putting off [21:32:20] I think the final solution is going to be to keep using the browser proxy [21:32:29] and maybe even disabling the haproxy name based one (hue., oozie. etc.) [21:32:39] VPN isn't gonna happen [21:33:06] (i'm thinking for regular client access, analytists, etc) [21:35:19] lame. [21:35:20] but yes. [21:35:30] that'd be better than requiring passwords to access. [21:35:36] what's the problem with a vpn? [21:36:57] well, there would still need to be the http auth as is though, which is poopy…unless there is an apache ldap module, hmmmmm [21:37:05] it gives access to the rest of the network [21:37:14] HMMMmmmmmmmmm, wait lemme ask Ryan another q [21:37:16] why would we need that for a vpn? [21:37:24] you log in through the vpn... [21:37:35] (it patches into ldap) [21:38:08] nono, i mean [21:38:22] the http auth password has to remain if we are using a proxy [21:38:28] because it is a public facing service [21:38:31] anyone can try to hit the proxy [21:38:44] including with the IP whitelist? 
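And the HOUR_REGEX property from the same exchange is simply "nominal time minus one hour, rendered in the bucket format the Pig script filters on". The same computation in Python for the 08:00 example (the yyyy-MM-dd_HH pattern is the one quoted above):

```python
from datetime import datetime, timedelta

nominal = datetime(2013, 1, 9, 8, 0)   # ${coord:nominalTime()}

# ${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), 'yyyy-MM-dd_HH')}
hour_regex = (nominal - timedelta(hours=1)).strftime("%Y-%m-%d_%H")

print(hour_regex)   # 2013-01-09_07 -> passed to the workflow, then to Pig as $HOUR_REGEX
```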
[21:38:46] my ldap comment was so that we wouldn't have to all use 'wmf-analytics' login, we could just use our own [21:38:52] no that's fine too [21:38:57] i'm thikning more generally [21:39:00] for future analytics [21:39:02] analysits [21:39:05] analysts [21:39:18] okay, so backing up a moment [21:39:22] what was the issue with the vpn? [21:39:31] (presuming we do it right) [21:39:33] access to whole internal network [21:39:41] (that is: coming in through the vpn doesn't mean access to -- that.) [21:39:44] i'm asking ryan about restricting via iptables on an01, we'll see what he says [21:39:51] right, so we'd have to do a lot of other legwork [21:39:59] but we'd do it the way most people do it [21:40:12] you restrict name services and access based on the interface [21:40:26] and since all vpn traffic would come in through a separate interface, you can regulate access [21:44:01] ha, ok, I got an 'I don't like it but ask the ops list' from Ryan :p [21:44:03] haha, [21:44:18] i think that's about as good as it gets :) [21:44:26] the downside is that i doubt our machine has two interfaes [21:45:29] it doesn't need it though, virtual interfaces works ok [21:45:56] vpn traffic ends up on tun0 with a different source IP for all packets [21:46:05] ah! [21:46:07] that's hot. [21:46:12] but, i don' thave much networking security experience, so I don't know if source packets could be spoofed or something [21:46:17] definitely adequate for our purposes. [21:46:25] yeah, good question. i have no idea either :) [21:52:15] hah, neither does leslie! she's looking it up :) [21:53:04] i think this means the answer is that we ignore the problem until someone says otherwise [21:53:31] my intuition says that device-based firewall rules are based on the iface origin on the machine [21:53:38] not a piece of info contained in the packet [21:53:55] which means spoofing should be impossible [21:54:21] unless you compromised the machine. [21:54:32] ...which would be an amusingly subtle way to elevate access [21:54:37] i bet that'd take forever to find [22:14:23] oh dschoon - type: timeseries was missing from the new datasources I had created. They were on the other branches I merged into. To think of it, I'm surprised this was the only problem :) [22:14:30] ah. [22:14:31] maybe that's why it was hard to find [22:14:35] yes. [22:14:38] i think i mentioned that [22:14:41] so all is well now in limn land [22:14:48] that i went through all the feature/d3 datasources [22:14:52] and added type:timeseries [22:14:53] btw [22:14:57] the reason it didn't default to it [22:15:03] http://reportcard.wmflabs.org/ is the new limn with the new data, and the ProjectColors palette [22:15:17] is because of the source to AttributeBase::update [22:15:28] it deletes all extant attributes that don't exist in the newly updated source [22:15:29] http://test-reportcard.wmflabs.org is the new limn with the new data, and the colors defined by the graphs (small bug prevented those from coming through before) [22:15:33] I asked EM which he likes better [22:15:34] er, newly updated data [22:16:02] unfortunately, it appears http://reportcard.wmflabs.org/graphs/wikivoyage has no colors [22:16:04] yeah, i might have forgotten [22:16:12] yes, that's 'cause they're not in ProjectColors [22:16:25] right. 
[22:16:32] I told EM that the new palette needs a few tweaks, but I don't want to bother unless he prefers it to the old one [22:16:46] my back-compat module uses whatever was defined for the line [22:16:47] s [22:16:57] nah, it didn't [22:17:02] oh? [22:17:13] oh, i probably changed the pointers for stroke [22:17:16] 'cause it assigns those properties to stroke and the LineNode was looking for them elsewhere [22:17:23] in .color not .stroke.color [22:17:25] yeah, the reason is that in the past, stroke was a nested object [22:17:32] which i started changing [22:17:39] and did not finish, due to illness [22:17:40] it's all cleaned up now in the latest [22:17:45] no worries [22:17:51] strokeWidth, strokePattern, etc [22:17:55] rather than stroke.width [22:17:56] I tried to not bother you at all, I hope you're starting to feel better [22:18:03] i am, a bit [22:18:14] i don't get sick that often, though moreso on this job, for some reason [22:18:17] so I think the biggest problem right now is 1. performance 2. reloading the page [22:18:22] probably stress, due to doing nine jobs :) [22:18:26] :) no kidding [22:18:42] milimetric, any luck with hourly limness? [22:18:48] do you remember my comments on dashboard perf? [22:18:52] i think that's the best place to start [22:19:10] about how it should render in page-order, which prevents (N-1)! reflows of the page [22:19:23] yeah, but reloading the page is something else. Once something loads in our new Limn, it grabs on really hard and doesn't let the page reload [22:19:42] re, N reflows of the page, (N-1)! reflows of graphs [22:19:45] ottomata: not yet, just got done with deployment madness [22:20:44] yeah dschoon so I'll be working on making a new tab for reportcard and adding andrew's data there, if you feel good enough to look at something, that'd be it (performance and reloading) [22:20:49] otherwise I'll take a look at it next [22:20:57] i'll do my best [22:21:03] that might mean random nonsense atm [22:21:07] maybe not [22:24:22] cool [22:24:30] ok boys, ttyt [22:26:58] laterz