[00:00:22] that's great news!!!!!!!! [01:24:52] BOOM ! [01:25:03] 300% improvement in parsing speed [01:25:15] BAM BAM BAM [01:25:24] dschoon: you around? [01:25:45] drdee: can you make a bitbucket account ? [01:25:56] bitbucket? [01:25:57] why [01:26:12] drdee: I got a small repo, but it's private so you need an account to see it [01:27:19] drdee: can I put some code I've fiddled with on github ? it's wikistats-related [01:27:35] actually just mobile-pageviews related [01:28:42] just put it on github.com/wmf-analytics [01:28:45] ok [01:31:40] do you have a new report ready? [01:32:20] not right now, but I'll have it x3 times faster [01:32:22] working on it [01:32:26] https://github.com/wmf-analytics/fast-field-parser-xs/blob/master/Extreme-Field-Parser/Parser.xs [01:32:31] this is the parser [01:33:45] this is basically the way I'll use it https://github.com/wmf-analytics/fast-field-parser-xs/blob/master/Extreme-Field-Parser/benchmark/efp-xs.pl [01:35:14] you are funny [01:35:56] averag_drifter ^^ [01:36:03] :D [01:36:42] now I can have a report in 8h [01:36:53] awesome! [01:37:31] I was thinking of writing assembly instead of the XS to make it more fast, but my conscience kicked in [01:37:53] i hope you are kidding :D [01:38:07] i would have come over to romania to kick your ass [01:38:16] :) [13:34:09] moooooorning guys! [13:35:35] drdee: morning :) [13:36:30] you wanna demo something :D ? [13:37:50] I just wanna push to gerrit first [13:38:11] I'll fire up the new report after, had to implement a lot of stuff [13:41:03] new patchset [13:41:04] https://gerrit.wikimedia.org/r/#/c/41979/ [13:41:39] drdee: used patricia tries for the ip ranges for googlebots [13:41:57] now starting report [13:42:09] AWESOME VERY VERY AWESOME! [13:42:55] in which file are the patricia tries implemented? [13:46:03] drdee: https://gerrit.wikimedia.org/r/#/c/41979/8/pageviews_reports/lib/PageViews/BotDetector.pm [13:46:10] drdee: didn't implement them, just used them [13:46:16] Net::Patricia from CPAN [13:46:21] http://search.cpan.org/~gruber/Net-Patricia-1.20/Patricia.pm [13:47:02] k [14:17:13] morning ottomata [14:19:11] morning! [14:26:12] wanna try the vumi stuff again? (jeremy emailed and he fixed it) [14:27:15] well, i see he's getting logs in /var/log/vumi/metrics.log [14:27:18] so it looks like it is working to me! [14:29:34] and this can be easily relayed to a new instance of udp2log? [14:31:20] yup, well [14:31:28] not if the app is running on labs [14:31:33] but if they deploy in production, ja no prob [14:31:46] eqiad somewhere would be best [14:33:38] i don't think any of them knows about this, we should poke somebody in ops, maybe preilly knows about the vumi setup [14:34:04] well i mean, they can't be planning to deploy this publicly while it is on labs, right? [14:34:44] I keep asking in that thread about if/when they are deploying to production, no answer yet [14:43:58] just asked in ops channel, mark does not know anything about it either [14:44:11] i will ask one more time once west coast wakes up [14:49:46] drdee: what did i do? :) [14:50:27] 09 14:26:11 < drdee> wanna try the vumi stuff again? (jeremy emailed and he fixed it) [14:51:03] not you :) another jeremy [14:51:16] who dat? [15:07:14] ottomata: did you mess around with the limnify thing yet? [15:07:26] yup, see emails :) [15:07:42] weird, just refreshed inbox and there they were [15:21:11] goood morning [15:23:41] Good afternoon. :) [15:25:03] gooood morning milimetric [15:25:17] hey drdee [15:25:20] deploying to prod as we speak [15:25:28] this, for once, should be painless [15:25:36] then I'll process EZ's dagta [15:25:37] *data [15:25:45] i was just gonna ask :D [15:26:12] and then I'll look at ottomata's work hopefully there's still time to make a dashboard or something out of it [15:26:26] aight [15:26:34] but this raises a question [15:26:41] where would we deploy such a dashboard? [15:26:48] not the reportcard of course [15:26:54] we need a new site somewhere [15:27:08] milimetric, I have that file for you, trying to put in in hdfs public, but my proxy is being weird [15:27:11] should have that fixed in a moment [15:27:38] cool, no problem, I'm behind with the deployment still [15:44:42] hey drdee, milimetric [15:44:55] about mobile webrequest data [15:45:11] dschoon suggested that I import it from kafka every 15 minutes, or as often as possible [15:45:15] I'm doing that [15:45:23] but, my pig script generates counts per continent per hour [15:45:30] just import once per hour [15:45:47] we go from 1 report per month [15:45:50] to hourly reporting [15:45:58] there is really no need to go even more granular [15:46:02] i think dschoon was hoping to almost be able to see live data with the graph updating [15:46:09] sure i understand [15:46:14] but who asked for that? [15:46:21] hmmm, yeah maybe we can do that wehn we have more robust stuff, especially if when we are doing a full unsampled stream [15:46:34] this is nice-to-have, not must-to-have IMHO [15:47:07] aye [15:47:08] hm [15:47:22] drdee, can I make an oozie dataset work on 3 directories at once? [15:47:42] hey otto, I've got reportcard offline deploying it, just a few moments I'll give you my attention [15:47:45] you could specify multiple input datasets [15:48:11] hmm, but then each one would need to be passed as an arg to pig? [15:51:04] yes, or you would have to do a merge of the 3 directories first as a separate oozie task [15:51:17] or [15:51:30] you have one parent folder that contains the 3 sub folderes [15:51:36] then you just supply the parent folder as input [15:52:03] naw, can't do the latter there [15:52:12] this is the problem where for Jan 7 I need Jan 6 and Jan 8 data [15:52:15] too [15:52:23] and then for Jan 8, I'll need Jan 7 and Jan 9 [16:01:21] hm, I have a problem [16:01:35] I can't deploy because npm install needs /home/milimetric/.npm [16:01:48] but on reportcard2, /home/milimetric is a read-only filesystem it says [16:01:54] restart the machine [16:02:01] ... :) [16:02:04] after reboot it will be fixed [16:02:11] huH? [16:02:24] homes are readonly because of migration of some backend labs stuff [16:02:29] oh really? [16:02:29] they will become write again after rerboot [16:02:30] yes [16:02:33] ok [16:02:39] glad I asked [16:02:50] subscribe to wikimedia-labs [16:03:39] ottomata: "restart -r now" is ok to run? [16:06:24] shutdown I mean [16:06:28] and I just did it :) [16:14:02] drdee - you screwed me pretty hard :) [16:14:09] I can't ssh into the machine anymore [16:14:24] you're welcome :D [16:14:33] haha [16:14:39] but this was the solution :) [16:14:41] which machine milimetric? [16:14:43] ottomata, can I get your help somehow? ssh into reportcard2 [16:14:44] Creating directory '/home/milimetric'. [16:14:44] Unable to create and initialize directory '/home/milimetric'. [16:14:57] so i would say, head over to #wikimedia-labs and ask for assistance [16:15:01] k [16:15:34] i can't get in either [16:15:43] *** /dev/vda1 will be checked for errors at next reboot *** [16:15:43] *** /dev/vdb will be checked for errors at next reboot *** [16:15:55] sounds promising [16:16:48] ottomata: i found an interesting case where two cidr ranges overlap [16:16:58] 41.66.28.73 , [u'orange-botswana', u'orange-ivory-coast'] [16:17:17] haven't found exactly which range, but will shortly [16:20:56] oh, you mean that IP shows up in two of the logs? [16:21:15] sort of [16:22:24] I am finding out by explicitly checking an ip against all of the cidr ranges [16:22:24] and I am finding that a lot of the ips which are supposed to be coming from orange-cameroon are also matching the orange-botswana range [16:22:26] ottomata: do you think it'd be easier to spawn a new reportcard3 or something instead of fixing reportcard2? Is the puppetization for that up to date with supervisorctl and all that? [16:23:31] naw, that's not puppetized at all [16:25:06] erosen, just looking at Partner IP Ranges page, that IP you listed should only be in orange ivory coast [16:25:14] hmmm [16:25:26] sounds like I'm doing it wrong then hehe [16:26:27] uhhhhhh, I don't have an orange botswana filter running [16:26:34] yeah [16:26:49] i have my own copy of the zero partner ranges [16:27:36] um, no , i mean, i'm looking at the wiki page and noticing [16:27:46] I have 17 specific filters for partner ranges in the udp2log stuff [16:27:53] and there are 21 defined providers on that page [16:27:59] https://office.wikimedia.org/wiki/Partner_IP_Ranges [16:28:05] yeah [16:28:14] is that ok? [16:28:19] i manually translated those ip ranges into a json file [16:28:20] i thought Partner IP Ranges was supposed to be in sync [16:28:26] yeah but [16:28:26] and I am doing my own ip range checking [16:28:27] i mean [16:28:35] this means we aren't collecting logs for a lot of partners [16:28:44] yeah [16:28:50] is that ok? [16:28:51] true [16:28:53] not sure [16:29:06] afaik, amit was supposed to tell me whenever he changes that page [16:29:09] i assume that if they haven't bugged me or you about it, it means it is low priority, or just starting [16:29:11] and I keep the udp2log filters in sync [16:29:27] botswana says [16:29:27] Launch Date: Oct 2012 [16:29:36] hmm [16:29:52] i'll ping amit when he get's online [16:29:58] gets? [16:31:17] I am not running these filters: [16:31:17] 4.15 Hello Cambodia (HL) [16:31:17] 4.16 Celcom Malaysia (CL) [16:31:18] 4.17 Orange Congo (CD) [16:31:18] 4.18 Orange Botswana [16:31:35] i guess worst case we can point reportcard.wmflabs.org to test-reportcard.wmflabs.org [16:31:52] those are the last 4 with the exception of morocco right? [16:32:03] seems likely they just forgot to notify us [16:33:25] average_drifter: https://github.com/TheWeatherChannel/dClass/pull/1 [16:34:40] milimetric, no response in # labs? can I try restarting the instance again? [16:34:50] going to restart through labsconsole, maybe it does soethign special [16:34:54] sure [16:34:59] nothing in labs [16:35:00] Ryan Lane will be online in an hour or so, he can certainly help [16:35:10] someone said that they might reconsider helping me if drdee also has a problem :) [16:35:19] * milimetric doesn't have enough cred [16:35:21] :) [16:35:33] :D [16:35:59] cool, i guess it's no huge rush [16:36:19] but this is probably why we should host reportcard on production somewhere [16:36:35] ungnhhhh build your .deb! [16:36:36] hehe [16:36:42] actually......>>>...... [16:36:47] milimetric, in the meantime, shall we have a look at ez's data? [16:36:48] if we hosted it on an01 [16:36:52] hmmmm [16:36:53] naw [16:38:33] yeah, the hack for now, if all is lost with reportcard2, is just to point to kripke and work on the deb ASAP [16:39:17] im' sure all is not lost [16:39:21] we'll just wait for Ryan Lane :) [16:39:41] he usually just kicks the machine and then it magically works again [16:39:46] awesome [16:39:55] i have whatever the opposite of that is [16:51:11] ottomata: btw, the whole overlapping ip ranges thing was a mistake on my part i had a typing which created the range with the routing prefix /2--whoops, so no reason to worry about orange-botswana other than the fact that there is not filter for it currently [16:54:45] ahh, cool [16:54:46] ok cool [16:54:51] also talking to amit [16:55:09] he says he doesn't need the filters for most of the missing ones, except congo [16:55:14] but he is e-mailing you presently [17:17:21] drdee, of course there's a bug in the scripts (running EZ's data) [17:17:39] uuuuhhhhhhhh, really? [17:17:55] yep, YALP [17:17:57] [17:17:59] D: [17:18:00] yet another labelling problem [17:18:00] :D [17:18:19] the day we can bury this part of the data pipeline we should throw a party [17:19:09] oh, consider it thrown [17:26:18] drdee, brainbounce [17:26:28] i am understanding oozie datasets more, and think I can solve my problem with them [17:26:35] at your service! [17:26:49] but, I need to tell the pig script to filter out data for only the hour we are currently interested in [17:26:59] param? [17:27:03] i could easly make that a paramter, and do [17:27:10] FILTER DATA BY hour == '$HOUR' [17:27:11] or whatever [17:27:14] but [17:27:22] we have a pig udf that does that IIRC [17:27:34] i would like this script to be able to generate data for ALL hours if the $HOUR parameter is not given [17:27:34] or at least convert it to a timestamp [17:27:46] so, what I want is [17:28:00] if ($HOUR IS NOT NULL) [17:28:00] FILTER DATA BY hour == '$HOUR' [17:28:01] right? [17:28:04] yes [17:28:05] but I don't konw if I can do that in pig [17:28:10] but not sure if pig understands that [17:28:12] i tried ternary conditional [17:28:22] FILTER DATA BY ('$hour' IS NOT NULL ? hour == '$hour' : 1 == 1); [17:28:27] but that doesn't work :p [17:28:35] does pig have the concept NULL? [17:28:37] yes [17:28:56] and it actually starts the script if you don't supply the $hour variable (in the shell) [17:30:41] ? right [17:30:47] yeah I want to not have to make 2 different scripts [17:30:58] one that only works witha given hour, and one that will output buckets for all hours in the data given [17:31:06] hmm, did you know about this? [17:31:07] http://pig.apache.org/docs/r0.9.1/cont.html#embed-python [17:31:07] hehe [17:31:17] i could use that, but seems complicated [17:31:22] can do pig script templating [17:31:23] that way [17:31:42] see the Conditional Compilation example [17:32:56] yes but let's not do that :D [17:33:00] yeah, agreed [17:33:06] hmm, i could make the date match an expression [17:33:15] and provide a default expression that matches all dates [17:33:42] so if you first try this: [17:33:42] that's actually cool, because then you could use the parameter to restrict your output to whatever hours you wanted via regex [17:33:50] FILTER DATA BY ('$hour' IS NOT NULL) [17:33:56] and don't supply $hour [17:33:59] to see if that runs [17:34:17] so basically it should just ignore the filter statement [17:34:27] if that works [17:34:38] ok, i mean, that compiles and works [17:34:40] but doesn't do anything [17:34:50] that's like [17:34:53] FILTER DAT BY false; [17:34:55] this line: FILTER DATA BY ('$hour' IS NOT NULL ? hour == '$hour' : 1 == 1); [17:34:57] there's no actual filter [17:35:26] right but that's good because it means in theory you can do what you want [17:35:41] you just need to fix the ternary conditional [17:35:58] if you go into grunt [17:36:02] and do EXPLAIN() [17:36:13] that might give some insights [17:36:48] hmm, i could do [17:37:21] something like: [17:37:21] FILTER DATA BY hour MATCHES ($hour IS NOT NULL ? '$hour' : '*') [18:00:14] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [19:09:44] drdee, its amazing how active hue development is, we keep finding things that we need that have just been implmeneted [19:09:45] http://grokbase.com/t/cloudera/hue-issues/12cmqsrncq/jira-created-hue-983-oozie-coordinator-could-support-previous-dates [19:09:52] this is what I need to do my datasets [19:10:03] i can do it in oozie coordinator.xml, but not in hue yet [19:10:13] cool [19:23:44] milimetric: http://reportcard.wmflabs.org/ [19:23:47] == 502 [19:24:08] or back now? [19:24:11] :) lol [19:24:14] you must have been deploying [19:24:18] yes dude [19:24:22] I just finished [19:24:37] i think your illness has left you with strangely good timing [19:24:44] must be [19:24:48] prescient [19:24:52] you're prescientman [19:25:06] so there are issues [19:25:26] which I'm not sure whether or not we should fix before presenting to EM [19:25:36] the main one is that the wikivoyage graph no longer works [19:25:37] which? [19:25:41] huh [19:25:48] and neither do any graphs like kbye and newer things we added [19:25:54] except editors_by_geo, that one's fine [19:26:16] milimetric, just email ez [19:26:17] lol [19:26:20] http://reportcard.wmflabs.org/graphs/wikivoyage [19:26:21] okay, we should probably fix that. [19:26:37] yeah, I'm trying to re-write the graph into the new format [19:26:40] see if that does it [19:26:53] yeah [19:26:54] so let me give you a little refresher on what's happened [19:27:01] we're down to only develop and master [19:27:05] cool [19:27:08] for both reportcard-data and limn [19:27:15] i barely have a brain atm [19:27:18] just fyi [19:27:18] ok [19:27:20] oh [19:27:34] nvm, just feel free to do whatever you like and run it by me if you want to commit :) [19:27:34] hopefully the merge didn't break anything [19:27:42] a LOT of stuff was broken [19:27:45] but mostly deployment crap [19:28:00] we can tell stories around the campfire later, now we gotta get this shit running [19:28:09] the deployer is totally black magic, i said that :) [19:28:17] +1 [19:28:19] agreed [19:28:30] i'm going to stop pestering and resume recuperating [19:28:34] k [19:28:44] if you want to look at this: http://reportcard.wmflabs.org/graphs/wikivoyage [19:28:52] and see if you can think of anything, that'd be useful [19:28:53] drdee, brainbounce! [19:28:57] yooooo [19:29:07] we'll see [19:29:18] siiick [19:29:35] so, getting there, but ok [19:29:48] i need to get oozie coordinator to pass a param to pig [19:29:57] that says the current hour it is operating on [19:31:25] k [19:31:38] so, somehow in oozie i'm almost certain that is possible [19:31:54] ${coord:nominalTime()} something something [19:31:55] dunno yet though [19:32:03] i guess you haven't run into that yet? :p [19:32:37] no not yet but look at https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases [19:32:39] for examples [19:32:48] of ${coord:nominalTime()} [19:33:27] yeah i have seen that doc….HMM oh i think I found it [19:33:38] i can set properties in the coord's workflow def [19:35:56] i think thi will do it: [19:35:56] [19:35:57] HOUR_REGEX [19:35:57] ${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd_hh')} [19:35:57] [19:36:33] you know, we really start writing this down in a wiki [19:36:43] because right only you and i know how to do this [19:37:23] right, and I baarreelelly know how [19:37:26] i mean, right now we are working it out [19:37:34] if we actually make this work [19:37:40] we can standardize and doc lots of this [19:37:50] how do I submit a coordinator via CLI? [19:38:33] (googling… : ) ) [19:38:45] oozie -job submit job.properties [19:38:49] then you get an id in return [19:38:52] then [19:38:58] oozie -job run {jobid{ [19:39:20] job.properties contains the workflow path etc [19:40:32] so, deployment of new Limn to http://reportcard.wmflabs.org/ is complete [19:40:47] please kick the tires, let me know if you see anything broken [19:40:56] so I need to upload my workflow.xml and coordinator.xml to the wf.application.path I set in job.properties? [19:41:00] I know about the wikivoyage graph being messed up [19:41:11] brb, lunch [19:41:49] milimetric, clickin on the .csv file link just refreshes the page [19:42:26] it also downloads for me [19:42:41] OH! [19:42:42] sorry [19:42:45] it does for me too [19:42:47] but it does seem to do something weird [19:42:47] didn't notice it [19:42:50] to the ui [19:42:59] i looked to see if it was actually reloading and it isn't [19:43:23] "so I need to upload my workflow.xml and coordinator.xml to the wf.application.path I set in job.properties?" [19:43:37] yes but it assumes by default hdfs://user/otto/ [19:43:44] so the path should be relative to that [19:43:57] you have it set absolute in your /home/diederik/job.properties [19:43:58] the path to coordinator.xml i mean [19:44:00] can I just do that? [19:44:08] oozie.wf.application.path=${nameNode}/user/otto/oozie/webrequest/count_by_hour_by_continent_A [19:44:17] i think that's actually bad practice) [19:44:51] i shoudl just say [19:44:51] oozie.wf.application.path=oozie/webrequest/count_by_hour_by_continent_A [19:44:51] ? [19:44:57] try it [19:44:58] hue creates it with absolute path [19:45:00]