[01:46:59] dschoon:
[01:47:12] dschoon: I meant to tell you a joke about UDP
[01:47:18] dschoon: but I was afraid you wouldn't get it
[01:47:55] * average_drifter just got this from a lady
[01:48:05] * average_drifter thought it would be nice to share it
[01:49:16] dschoon: you there ?
[14:44:22] moooooooorning ottomata, milimetric, average_drifter
[14:44:45] morning!
[14:44:58] did you see erik z's email?
[14:45:37] hm no
[14:45:42] mobile edits?
[14:45:47] that one?
[14:45:51] if so yes :)
[14:48:12] you wanna dive into the log format?
[14:51:22] ?
[14:53:35] the missing x-carrier header
[14:56:39] aye, k, who wrote that bit?
[14:56:46] someone added something to varnish to do that, right?
[14:57:05] patrick reilly
[14:59:16] aye
[15:00:38] adding orange morocco zero filter for amit now
[15:00:43] i'm on ops RT duty this week
[15:01:43] ok
[15:02:13] there are some RT stat1 tickets left ;)
[15:02:23] or maybe they can already be closed
[15:02:53] did you create the stats user for kraken?
[15:05:08] naw
[15:05:13] coudl do though
[15:05:25] add me a todo, if you haven't already :)
[15:07:07] https://app.asana.com/0/828917834272/2831359148905
[15:07:26] title might have been a bit misleading
[15:07:58] what's the title? (asana links don't really work well for me)
[15:08:13] is it in kraken project?
[15:08:15] "Create non-admin hdfs account for recurring jobs"
[15:08:52] this is also some low hanging fruit: Fix khadoop restart to also restart oozie and hive
[15:16:19] can I respond to Erik's email?
[15:58:13] morning guys
[15:58:18] sorry - computer problems this morning
[15:59:08] dschoon, I'm working on the d3 thing, found something recent MBostock did
[16:23:09] morning
[16:24:37] heya drdee
[16:24:40] what's up with this ticket?
[16:24:40] https://rt.wikimedia.org/Ticket/Display.html?id=2307
[16:24:48] looking at it
[16:24:57] an old wish
[16:25:01] ignore it
[16:25:26] or close it, i like to have the stuff we make more standards compliant
[16:25:36] ideally it should be part of jenkins
[17:30:52] do we have much in the way of tickets in rt or bugzilla?
[17:31:00] milimetric: what's up?
[17:34:48] hey ottomata: in re the two new fields for udp log, I am seeing cp1043.wikimedia.org not sending those new fields
[17:35:52] pgehres, hm
[17:35:57] can you gist a log line example?
[17:36:25] sure?
[17:37:04] https://gist.github.com/9d6492ea0832c5adf6d0
[17:37:14] i manually changed the IP, but that's it
[17:37:37] cp1043 is the only box not matching my updated regex
[17:41:34] hmmm ok cool
[17:45:32] hmm, weird, the loggers are running with that format
[17:46:27] pgehres, when I tail the sampled-1000 log and grep for that hostname
[17:46:30] I see the extra fields
[17:46:33] where did you get that log line?
[17:46:46] it is from the fundraising banner logger
[17:47:09] the most recent 15-minute slice that I had
[17:47:54] weeird, some lines from that host ahve it, and others don't!
[17:47:56] i am def seeing the fields on other hostnames
[17:48:21] hmm, but the seq numbers are totally different for the lines
[17:48:25] werid
[17:48:25] hm
[17:49:02] do you need anything more from me?
[17:49:10] naw, looking into it
[17:49:21] awesome, thanks. just wanted to give you a heads up
[17:50:08] weird! there are two varnishncsa instances logging to locke right now!
[17:50:13] killing the old one
[17:50:30] weird!
[17:50:46] yeah, one was old and stale, hm
[17:50:48] ok, fixed.
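A quick way to confirm what pgehres spotted — one host emitting lines in two different formats — is to count the distinct field widths seen per cache host. This is a minimal sketch, assuming the udp2log layout where the first whitespace-separated field is the cache hostname and the two new fields simply widen each line; the exact field counts are not asserted here.

```python
# check_formats.py -- flag hosts emitting a mix of line widths.
# Sketch only: assumes field 1 of each udp2log line is the cache
# hostname; the "new format" is just a longer line, so a host with
# more than one width is suspect (e.g. two varnishncsa instances).
import sys
from collections import defaultdict

widths = defaultdict(set)  # hostname -> set of field counts seen

for line in sys.stdin:
    fields = line.split()
    if fields:
        widths[fields[0]].add(len(fields))

for host, seen in sorted(widths.items()):
    if len(seen) > 1:
        print('%s: mixed line widths %s' % (host, sorted(seen)))
```

Run it over a slice of the sampled log, e.g. `tail -100000 sampled-1000.log | python check_formats.py` (file name illustrative).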
[17:50:58] good catch pgehres, thanks
[17:51:07] not a problem
[17:51:26] thanks for supporting udp2log
[17:52:02] ottomata: would that have been doubling the impression count for things hitting cp1043?
[17:52:02] yup!
[17:52:08] yes
[17:52:14] yikes
[17:52:18] k, thx
[17:53:01] so, pgehres, basically, anything after Jan 2 22:16 UTC without those extra fields should be discarded
[17:53:17] k, was there only one instance before that?
[17:53:33] well, i'm not sure exactly, but 22:16 UTC is approxmiately when we deployed this change
[17:53:41] yeah
[17:53:42] i'm not sure when the change would have made it to this host
[17:53:47] uhm, let's see
[17:54:11] i'm more concerned with how long we have been doubling the impressions
[17:54:37] hm, you know, i'm not sure if that time is UTC, i got that from the analyics server admin log
[17:54:50] cp1043 says it has been running the newer varnishncsa since
[17:54:54] i can easily ignore the lines without the fields
[17:57:46] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[18:01:47] ok, pgehres
[18:01:56] ignore fields without those fields after
[18:02:01] Wed, 02 Jan 2013 21:49:43 UTC
[18:02:01] ottomata ^^
[18:02:22] cool, thanks ottomata
[18:28:41] how can I get an account on https://rt.wikimedia.org/ ?
[18:30:26] i have to vouch for you :)
[18:30:36] brb , lunch
[18:30:38] and I think CT is the one to talk to
[18:30:58] drdee: can you vouch for me please ? :)
[18:31:11] erosen: expand(CT)= ?
[18:31:33] CT is director of Ops
[18:31:35] CT Woo (i think he's the head of ops)
[18:31:42] i need to talk to him
[18:31:47] gotcha
[18:31:48] but why do you need access to RT?
[18:32:08] to see the bug report Andrew mailed just now
[18:32:19] https://rt.wikimedia.org/Ticket/Display.html?id=859#txn-93804
[18:32:21] this one
[18:33:08] i can relate--not know what the RT tickets are, but hearing talk of them is always sort of frustrating
[18:33:24] s/know/knowing/
[18:33:50] i agree. it's stupid that all engineers do not have access to RT
[18:34:23] i added you as CC to that ticket
[18:43:57] ottomata: you about?
[18:44:29] i was thinking about the mobile pageview anomaly -- that it's up by an absurd amount
[18:44:49] are we sure varnishncsa isn't running as a dupe on any other machines?
[18:44:56] this would explain the +44%
[18:46:33] dschoon: so you're saying it would've produced duplicate lines ?
[18:46:48] dschoon: I can hash all the lines and check how many of them have the same hash
[18:47:00] like putting them into buckets and seeing what the top buckets contain
[18:47:07] of course, without the time-field
[18:47:20] becuase that would most likely be different even for the duplicate ones
[18:48:17] average_drifter: not necessarily the same hash
[18:48:28] because we changed the output fields
[18:48:53] so the first X characters on a line would be the same
[18:49:58] it'd be because we added the UA Language and X-Carrier fields
[18:50:04] the last two are new
[18:50:07] I could ignore those
[18:50:19] okay.
[18:50:34] it'd have started on Jan 2
[18:50:42] ok, having a look
[18:50:47] but the problem surfaced around xmas
[18:50:49] what we need is the cache server hostname
[18:50:51] hmm
[18:50:52] before the log format chnage
[18:50:53] shit.
[18:50:54] yes.
[18:50:55] true.
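For the duplicate check average_drifter describes — bucketing lines by a key that ignores the trailing fields — something like the sketch below would do. It uses the key refined just below (hostname + sequence number + timestamp + client IP) and assumes the usual udp2log field order with the client IP at field 5; adjust the indexes if the real layout differs. One caveat grounded in the cp1043 incident above: two parallel varnishncsa instances keep independent sequence counters, so a seq-based key may undercount that particular failure mode.

```python
# count_dupes.py -- bucket lines by hostname+seq+timestamp+IP and
# count collisions. Sketch only: the field positions are an
# assumption about the udp2log format, and the key deliberately
# excludes the trailing fields so old- and new-format duplicates
# hash to the same bucket.
import hashlib
import sys
from collections import Counter

buckets = Counter()
total = 0

for line in sys.stdin:
    f = line.split()
    if len(f) < 5:
        continue
    key = '|'.join((f[0], f[1], f[2], f[4]))
    buckets[hashlib.md5(key.encode('utf-8')).hexdigest()] += 1
    total += 1

dupes = sum(n - 1 for n in buckets.values() if n > 1)
print('%d duplicate lines out of %d (%.2f%%)'
      % (dupes, total, 100.0 * dupes / max(total, 1)))
```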
[18:51:35] dschoon: zoom success http://bl.ocks.org/4458705
[18:51:36] thanks
[18:51:54] we can spend a lot of time trying to figure out what happened but average_drifter is already working on a report to replace that report using the squid logs
[18:52:02] so i would say let's continue doing that
[18:52:03] truth.
[18:52:15] very nice, erosen
[18:52:20] ok
[18:56:19] average_drifter, if you hash just the hostname and the sequence number, that should be enough
[18:56:24] maybe through the client ip and req url in there to be sure
[18:56:49] just read the rest of the chat
[18:56:51] nm then!
[18:56:56] so, is this meeting starting in 4 minutes?
[18:58:49] yes
[18:58:51] pls join
[19:03:23] i think it would have to be IP+Seq+timestamp, just to be safe
[19:05:25] +hostname
[19:05:31] (of the cache node)
[19:06:11] average_drifter: you there?
[19:06:17] are you joining the hangout?
[19:09:12] dschoon: I can join yes
[19:49:38] THAT email.
[19:50:10] If anyone else doesn't know what email we're talking about, please ping me and i'll forward
[20:17:45] brb lunch
[20:23:39] we're in october with the reports... still going, 2 more months left, it's going to be here soon
[20:27:08] awesoem!
[20:42:50] erosen, gonna run downstairs and make some tea, but when I come back up I want to look at the limnpify thing, did you get any further with it on friday after I signed out?
[20:43:00] not much further
[20:43:03] aye rats
[20:43:06] but I can work on it now
[20:43:12] oh awesome, ok
[20:52:39] can I help with that at all erosen?
[20:52:54] i'm just trying to remember the date format again
[20:53:27] --datefmt="%Y-%m-%d_%H"
[20:53:53] thanks
[20:54:27] okay so I still can't reproduce the error
[20:54:36] either i can try to replicate your environment more
[20:54:43] one sec, lemme do it on an01
[20:54:46] (we can check things like numpy version)
[20:54:57] or I can try to debug remotely
[20:55:38] basically the error is saying that the pandas.DataFrame object which is passed into the limnpy.DataSource object has the default headers
[20:55:48] which in theory is okay
[20:56:00] but just seems like it is more likely to be a mistake
[21:01:31] ottomata: any luck?
[21:01:34] hmmmm
[21:01:44] weird
[21:01:45] uh
[21:01:54] ok on an01
[21:01:56] in my homedir
[21:01:57] blog.cont
[21:02:08] looks like it mabye kinda works,
[21:02:22] https://gist.github.com/4478394
[21:02:35] ah yeah
[21:02:49] so limnpy expects the columns to have labels
[21:02:54] ok i can add those
[21:03:05] you can do it through the command line if yo uwant
[21:03:08] i think…
[21:03:30] mv,
[21:03:33] nvm*
[21:03:35] ok and now
[21:03:35] https://gist.github.com/4478406
[21:03:38] thought about it but didn't implement it
[21:03:47] yeah
[21:03:56] oh 'date'
[21:03:57] ?
[21:04:01] --datecol DATECOL the date column
[21:04:13] pass in the argument --date=Timstamp
[21:04:19] or just call it 'date'
[21:04:20] Timestamp
[21:04:22] yeah
[21:04:26] that worked! isn't it supposed to be csv though?
[21:04:35] you mean the output?
[21:04:38] is it tsv?
[21:04:42] yeah
[21:04:52] it prints to screen using a special formatter
[21:05:09] but i think the files which it creates in the datafiles/ dir are csv
[21:05:29] oh!
[21:05:31] datafiles and datasources
[21:05:38] sorry if it is a bit bulky to have it create all those directories, but that is the way limnpy works at the moment
[21:05:40] nice!
[21:05:46] glad it works!
[21:16:37] back
[21:16:44] so, erosen, q now
[21:16:47] ya
[21:16:48] sup?
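What the fix above amounts to, roughly: the input frame needs real column labels (instead of pandas' default 0, 1, 2), and limnify needs to be told which column holds the dates. A sketch in plain pandas — not limnpy's actual API; the file and column layout for blog.cont are assumptions, with only the flag names and date format taken from the chat.

```python
# Rough pandas equivalent of the fix above. Sketch only: assumes
# blog.cont is tab-separated pig output with three columns, which
# is not confirmed in the chat.
import pandas as pd

df = pd.read_csv('blog.cont', sep='\t', header=None)
df.columns = ['Timestamp', 'continent', 'count']  # assumed layout

# The limnify invocation would then be something like (per the chat):
#   limnify --datecol=Timestamp --datefmt="%Y-%m-%d_%H" blog.cont
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d_%H')
print(df.head())
```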
[21:16:54] dschoon might know answer too
[21:17:04] so I want to use this as an oozie action
[21:17:16] so I need it installed everywhere
[21:17:25] i see
[21:18:04] how's pip install -e work when you make changes?
[21:18:09] well
[21:18:17] when you import a file
[21:18:19] it automatically sees the changes
[21:18:28] dschoon, it will reinstall?
[21:18:28] so long as the process restarts
[21:18:29] and then check where it is located, it is the very version that is checked in
[21:18:35] it doesn't need to "reinstall"
[21:18:36] good point
[21:18:43] unless you have actions that trigger on install
[21:18:48] if it's just file changes, you
[21:18:50] re fine
[21:19:01] hm, well erosen has an executable that is somewhere on my path after I run pip install -e
[21:19:10] yeah
[21:19:13] that's fine
[21:19:16] nothing needs to change there
[21:19:21] it is symlinked i think
[21:19:24] it's a dynamic import
[21:19:33] it runs a distutils command
[21:19:42] which will use the right code
[21:19:47] hmmmmmm, cool
[21:20:01] not a symlink, but hte script uses from pkg_resources import load_entry_point
[21:20:04] (i take it back, it's not symlinked)
[21:20:04] to get the right one
[21:20:05] cool!
[21:20:08] exactly
[21:20:08] (maybe hardlinked)
[21:20:11] nope
[21:20:13] no links.
[21:20:20] it uses distutils.
[21:20:30] hehe
[21:20:32] (that's what pkg_resources is)
[21:20:33] finally reading
[21:20:50] hmm, ok so I need to find out how oozie shell actions work
[21:20:54] just a warning, i feel rather gross after lunch
[21:20:56] maybe I don't need this everywhere, will see
[21:21:02] good to know
[21:21:02] daw,
[21:21:04] whatdya eat?
[21:21:07] i might need to disappear and retch.
[21:21:09] sushi.
[21:21:10] oy
[21:21:12] from the metreon
[21:21:14] hmm
[21:21:26] but if it doesn't fell robla too, its Real Illness
[21:21:30] which sucks even more
[21:26:56] hey drdee
[21:27:06] when you submit a coordinator with an ${HOUR} variable
[21:27:12] and the ui asks you to enter HOUR
[21:27:14] do you just leave it blank?
[21:28:21] yes
[21:28:26] that's a small bug in Hue
[21:28:26] that
[21:28:27] will
[21:28:30] be fixed soon
[21:29:13] do you understand the 'calendar' bit in the coordinator dashboard?
[21:29:25] I just submitted it, and I have 4 jobs starting at my start date for the dataset in the calendar
[21:32:00] drdee do you understand the 'calendar' bit in the coordinator dashboard?
[21:32:00] I just submitted it, and I have 4 jobs starting at my start date for the dataset in the calendar
[21:32:44] 1 sec chatting with milimetric
[21:33:09] christ this is bad. i think i have to take a half-day
[21:33:18] oy
[21:33:22] good luck
[21:36:21] ok, leaving now before the power to do so leaves me
[21:36:21] heheh, hey erosen
[21:36:24] yo
[21:36:31] email if y'all need me
[21:36:33] why 'limnpify' instead of 'limnify'?
[21:36:36] yeah
[21:36:37] ok, feel better dschoon?
[21:36:38] hehe
[21:36:38] !
[21:36:39] just changed it
[21:36:41] ncie
[21:36:46] i agree
[21:36:55] i was gonna mention that before you hardcode it
[21:37:07] i'll push now so you can test it
[21:37:13] k
[21:38:39] done
[21:38:45] also I'm updating the readme
[21:40:00] danke, rats, i'm looking into how to execute this as an oozie action
[21:40:09] might be difficult! need to upload all of the deps to hdfs
[21:40:14] since it runs inside of hadoop
[21:40:18] rr
[21:40:47] can you puppetize it so that is just lives on the machines?
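The reason the `pip install -e` executable keeps tracking source changes, as dschoon says above: the generated wrapper script copies no code, it resolves the entry point at run time via pkg_resources. A minimal sketch of the relevant setup.py metadata — the package and function names here are illustrative, not limnpy's actual packaging.

```python
# setup.py -- minimal sketch; names are illustrative. With
# `pip install -e .`, setuptools drops a 'limnify' wrapper on $PATH
# that does roughly:
#     from pkg_resources import load_entry_point
#     load_entry_point('limnpy', 'console_scripts', 'limnify')()
# i.e. it imports whatever source is currently checked out, so edits
# take effect on the next run with no reinstall.
from setuptools import setup

setup(
    name='limnpy',
    version='0.1.0',
    py_modules=['limnpy'],
    entry_points={
        'console_scripts': [
            'limnify = limnpy:main',  # assumes a main() in limnpy.py
        ],
    },
)
```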
[21:40:56] or is that not enough
[21:42:19] i think I need to bundle it so I can upload it to hadoop as part of the job,
[21:42:41] that's annoying
[21:43:11] i successfully made a streaming job work by untiring it's own dependencies and then importing them
[21:43:39] but I'm not sure if that will work for things like numpy and pandas
[21:44:37] untiring dependencies?
[21:44:47] tar
[21:44:49] sorry
[21:45:01] i think that might have been colloquy
[21:47:33] ?
[21:47:38] oh tar
[21:47:40] aye
[21:47:45] my client autocorrects
[21:48:00] apologies for cryptic correction
[22:07:11] drdee, any suggestions for limnify of my data?
[22:07:18] options:
[22:08:09] A. figure out how to use erosen's limnify as an oozie shell action (need to import python deps)
[22:08:09] B. Go back to pig and figure out how to pivot the data in pig
[22:08:09] C. Run limnify as a cron job or something outside of hadoop
[22:08:43] B, especially if the cube thing is gonna do it, then C, then A
[22:09:26] aye right, hm cube thing
[22:10:23] naw, cube doesn't do that
[22:10:36] i think cube is just a shortcut for group by counts with multiple dimensions
[22:11:02] ok, for B, there are two ways
[22:12:01] B1: make my previous attempt work. This hardcodes continent names into the pig script and will only work for a small number of pivoted columns
[22:12:02] B2: Write a Pig UDF in Java that is basically a port of erosen's limnify in Java (minus the metadata)
[22:12:20] what about a jython jar
[22:13:15] i would say B2 assuming that this is something we will need in the future as well
[22:13:27] jython jarrrrrr
[22:13:35] jython jar…… another thing that we need to investigate..
[22:13:45] i think it is possible, but haven't tried it myself
[22:14:08] i am pretty sure it is possible but how.......
[22:15:29] ottomata: just updated the readme and made the limnify --help screen a bit more useful
[22:15:39] i also changed a little bit about the way it deals with header rows
[22:15:53] what is limn format again?
[22:15:56] not that you are evening going to use it at the moment, but just fyi
[22:16:13] date col1 col2 ...
[22:16:28] and what is the format in pig
[22:16:29] ?
[22:16:39] and then there is a metadata file that gives the columns names and give the data source a title and a unique id
[22:16:56] the pig format isn't aggregated by date i think
[22:16:57] date continent value
[22:16:58] xxxx aaa 1
[22:16:58] yyy aaa 4
[22:16:58] zzz bbb 5
[22:17:00] is the problem missing observeaat ions?
[22:17:15] right, no
[22:17:19] we can deal with that if we need to
[22:17:33] this is easy to do via a script or something, the problem is doing it in oozie :)
[22:17:39] totally
[22:17:57] maybe just a cronjob outside of hadooop
[22:18:07] it's also a very small dataset
[22:18:21] yeah, that would work immediately
[22:18:31] wont' solve the problem for the long term though
[22:18:36] let's go easy
[22:18:41] why not?
[22:18:42] ok sounds good
[22:18:55] that will be easier for me too, cause tehn I can use limnify and already ahve metadata
[22:19:04] k
[22:19:06] i'm also a fan of figuring out how to use python more easily
[22:19:08] i'll script it so that the data gets saved in /wmf/public
[22:19:15] and then milimetric can figure out how to make limn graph it
[22:19:19] i know it comes with headaches but in theory it is quicker to develop with
[22:19:39] at your command good sirs
[22:19:48] I've just fixed my d3 zooming problem
[22:19:51] cool!
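For option C — the pivot done by a small script outside Hadoop — the long-to-wide step implied by the two formats above is a one-liner in pandas. A sketch with hypothetical file names; it maps the Pig output `date continent value` onto Limn's `date col1 col2 ...` layout, and filling missing (date, continent) observations with 0 is one way to "deal with that if we need to".

```python
# pivot_limn.py -- pivot Pig's long output into Limn's wide layout.
# Sketch only: file names are hypothetical, and the zero-fill for
# missing observations is a choice, not something the chat settled.
import pandas as pd

long_df = pd.read_csv('part-r-00000', sep='\t', header=None,
                      names=['date', 'continent', 'count'])

# One row per date, one column per continent: date, col1, col2, ...
wide = (long_df.pivot_table(index='date', columns='continent',
                            values='count', aggfunc='sum')
        .fillna(0))

wide.to_csv('mobile_by_continent.csv')  # limnify would put this
                                        # under datafiles/ instead
```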
[22:19:53] turns out the problem was my rotten brain
[22:20:04] damn thing gets me every time :)
[22:20:09] ok thanks guys, i'll work on limnify cron job tomorrow, I got some sanding to do righ tnow!
[22:20:15] oh rotten brain!
[22:20:36] sanding!
[22:20:39] you need brain derot
[22:20:45] yeah man! made a cutting board with my dad on xmas
[22:20:53] he's got an old planer from the 60s
[22:20:59] needs sanded and oiled
[22:20:59] ummm
[22:21:16] awesome
[22:21:34] http://www.flickr.com/photos/ottomatona/8342124913/in/set-72157632357603690/
[22:21:44] perdy
[22:21:50] what kine wood?
[22:21:53] cherry and oak
[22:22:18] looks like a nice shop
[22:22:25] that's awesome!
[22:22:39] dad's shop is super nice, he just built those cabinets/workbench
[22:22:41] we're making a little microwave shelf stand thing
[22:22:51] milimetric! I was jsut thikning about doing that today
[22:23:15] i want it to have slats too, so that I can stack pot lids upright in it somewhere
[22:23:16] :) I can send pics and plan. We're just waiting on our pockethole drill guide thing and we'll get it done by Friday I think
[22:23:26] my dad just got me a pockethole drill guide for xmas!
[22:23:29] i do not know how to use it!
[22:23:47] I've got a google sketchup of the thing so it'll be easy to change
[22:23:50] i'm digging the video
[22:24:12] thanks!
[22:24:19] i've used the pockethole thing before. It's not too hard, just clamp it in the right place and use the drill bit that comes with it
[22:24:36] coooool
[22:24:39] np, I'll email it tonight
[22:24:40] i'm off to lunch
[22:24:43] right, and that is just so you keep a straigh angle and hide the screw in the wood?
[22:24:49] ottomata: let me know how the limnify thing goes
[22:24:56] will do, i'll try it out more tomorrow
[22:24:59] yep, me too on limnify
[22:25:20] will it be running on an10?
[22:25:27] hmmmm, probably an01
[22:25:33] i can test it out just to make sure it works
[22:25:34] and yes on the screw, it sort of bends a bit so you can drill on an angle but still get a fairly straight screw
[22:25:54] alrighty
[22:28:10] hmm, drdee
[22:28:21] yahhh
[22:28:27] when you submitted your coordinator
[22:28:32] did it ask you to fill in INPUT and OUTPUT?
[22:28:59] hold on
[22:29:09] i see it in your configuration in the coordinator dashboard
[22:29:32] oh wait, mine is working I think!
[22:29:35] i thought it wasn't
[22:29:56] ah nope, no output files
[22:29:58] the path you define in the dataset that is the input/output
[22:30:11] right that's what I thought
[22:30:34] i'm comparing the configuration of our coordinators
[22:30:40] ohhh, no i'm sorry
[22:30:43] i'm looking at a workflow
[22:30:45] oops
[22:31:10] ahh yes!
[22:31:12] it is working!
[22:31:14] i see a coordinator running for you
[22:31:18] yeah
[22:31:19] http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/mobile_hour_by_continent/2013-01-01_00.00.00/part-r-00000?file_filter=any
[22:31:33] AWESOME!
[22:31:56] so we should (tomorrow) move the blog and this script to the 'stats' user
[22:32:03] HMmmmmmmmm
[22:32:10] no?
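For reference, the INPUT/OUTPUT names Hue asks about come from the coordinator's dataset definitions — "the path you define in the dataset", as drdee puts it. A minimal sketch of what such a coordinator looks like in Oozie's own XML (all names, paths, dates and frequencies here are illustrative, not the actual kraken job config): an hourly job fed by four 15-minute dataset instances.

```xml
<!-- Minimal coordinator sketch; everything here is illustrative. -->
<coordinator-app name="mobile_hour_by_continent"
                 frequency="${coord:hours(1)}"
                 start="2013-01-01T00:00Z" end="2013-02-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <dataset name="INPUT" frequency="${coord:minutes(15)}"
             initial-instance="2013-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///wmf/raw/mobile/${YEAR}-${MONTH}-${DAY}_${HOUR}.${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- wait for the four 15-minute chunks that make up the hour -->
    <data-in name="input" dataset="INPUT">
      <start-instance>${coord:current(-3)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///user/otto/workflows/mobile_hour_by_continent</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Note the data dependency only guarantees that a directory instance exists, not that its contents match its timestamp — which is exactly the "jan 8 data in the jan 7 folder" problem discussed next.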
[22:32:18] yeah I started on that today and then got distracted by meeting
[22:32:21] ok
[22:32:22] yeah something like that
[22:32:25] but my hmmm
[22:32:37] was that these are 15 input intervals, and I am generating per hour rollups
[22:32:39] I need to fix that
[22:32:52] hmmmm
[22:32:55] weird, hmm, yeah
[22:33:01] my pig script works fine on a large dataset
[22:33:08] where it counts everything at once
[22:33:11] just run the script hourly
[22:33:12] and examines all the data
[22:33:23] but the data is in 15 minute chunks
[22:33:34] and, also, that isn't foolproof, right?
[22:33:42] oh yeah that's actually a more common use case
[22:33:43] since the file names are not guaruntees
[22:33:52] of what timestamps they contain
[22:33:55] they are only approximations
[22:34:02] since they only represent when the data was imported into hadoop
[22:34:04] not what the data contains
[22:34:05] but oozie has data dependencies
[22:34:14] so it can wait until a folder is created
[22:34:28] yeah, but it doesn't matter, I can't run iterations of this job as is though
[22:34:43] say there is jan 8 data in the jan 7 folder
[22:34:48] ohhhhhhh
[22:34:51] that sucks
[22:35:06] but i am sure that is pretty common problem ;)
[22:35:36] yeah its gotta be, i mean, this job doesn't take that long to run, i could just make it count all the data instead of running whena new dataset happens
[22:35:43] buuut that's not cool
[22:35:52] it took like 30 minutes or something on the 2013 mobile data
[22:35:55] and that is only a weeks worth of data
[22:36:53] also, drdee, before we make this stuff run as stats user, the scripts and jobs need to be more standardized
[22:36:59] your blog thing isn't even in kraken yet :p
[22:37:09] blog.pig
[22:37:12] yeah sorry about taht
[22:37:17] i'll push
[22:37:24] naw, no worries,
[22:37:34] we're both trying to get these scripts nice before we rely on them
[22:37:38] so its cool
[22:38:51] done
[23:36:23] average_drifter: do you know what happened to this wikistats file: /a/wikistats_git/dumps/csv/csv_wp/StatisticsMonthly.csv?
[23:36:27] on stat1
[23:38:00] looking
[23:38:46] spetrea@stat1:/a/wikistats_git$ find -name "StatisticsMonthly.csv"
[23:38:46] ./animations/requests/geocitylite/StatisticsMonthly.csv
[23:38:46] ./dumps/csv/csv_wx_bak/StatisticsMonthly.csv
[23:38:46] ./dumps/csv/csv_ws/StatisticsMonthly.csv
[23:38:46] ./dumps/csv/csv_wb/StatisticsMonthly.csv
[23:38:48] ./dumps/csv/csv_wv/StatisticsMonthly.csv
[23:38:51] ./dumps/csv/csv_wn/StatisticsMonthly.csv
[23:38:53] ./dumps/csv/csv_wx/StatisticsMonthly.csv
[23:38:55] ./dumps/csv/csv_wo/StatisticsMonthly.csv
[23:38:58] ./dumps/csv/csv_tst/StatisticsMonthly.csv
[23:39:00] ./dumps/csv/csv_wq/StatisticsMonthly.csv
[23:39:03] ./dumps/csv/csv_wp/StatisticsMonthly.csv
[23:39:05] ./dumps/csv/csv_wk/StatisticsMonthly.csv
[23:39:10] look like it's there
[23:39:13] weird
[23:39:36] erosen: if you look in top/htop you will see it's processing right now
[23:39:43] cool
[23:40:03] spetrea@stat1:/a/wikistats_git$ ps xau | grep "/a/wikistats_git/dumps/csv"
[23:40:06] ezachte 889 89.0 0.1 94512 37804 ? R 23:36 3:26 perl WikiCounts.pl -r -m wb -i /mnt/data/xmldatadumps/public/dewikibooks -o /a/wikistats_git/dumps/csv/csv_wb/ -l dewikibooks -d auto -s /a/mediawiki/core/languages
[23:40:11] i must have been in another directory
[23:40:13] my bad
[23:40:18] ok
[23:41:14] thanks for checking
[23:43:51] ty average_drifter
[23:46:05] np