[11:09:28] drdee: Hi [11:09:45] drdee: Having familiarized myself a bit with the Kraken code basis [11:09:52] drdee: And reading through the Pig documentation [11:10:04] drdee: I was wondering, what the next steps would be? [11:10:23] drdee: You mentioned getting rid of dclass at some point� [11:10:32] drdee: That looks totally doable. [11:10:54] Could that be a first task? [13:21:36] morning everyone [14:13:28] MOORNING [14:13:38] BOEEM chakalalkakakakakaklall [14:14:05] * drdee snugs into the the IRC room very discretely [14:14:13] very discretely :-) [14:14:24] Morning :-)) [14:14:26] sorry for closing the door too hard [14:14:32] How do you do? [14:14:45] preettyyy gooood [14:14:48] Awesome! [14:14:49] whats up with you? [14:15:05] mornin! [14:15:06] Did you read my message in this channel from ~3 hours ago? [14:15:27] i just did [14:15:32] uhmmmm [14:15:44] we are waiting for tomasz with some feedback on an existing pig script [14:15:55] Ok. [14:16:10] Just let me know what to do next :-) [14:16:21] but maybe you can review this: https://github.com/wikimedia/kraken/pull/7 and take over from me? [14:16:43] basically we need email alerts when an oozie job fails [14:16:51] I'll have a look. Yes. [14:17:23] qchris, i can give you an overview of oozie and related concepts here if you want [14:17:32] relevantly my comment there about sub workflows [14:17:40] Yes, that'd be very welcome :-) [14:17:59] Gimme a few minutes. [14:18:26] uhhh me too [14:18:40] brb, laundry things [14:24:08] back [14:24:15] in a cafe, but can talk if you wanna [14:24:44] Back as well :-) [14:25:15] I am more of an IRC guy, but I can come to the hangout as well. [14:27:04] we can IRC right now I guess [14:27:05] hmm [14:27:05] ok [14:27:12] Ok great :-) [14:27:18] so I wrote this when I was figuring things out myself [14:27:18] https://www.mediawiki.org/wiki/Analytics/Kraken/Oozie [14:27:27] it probably needs updated with some more best practices stuff [14:27:31] but it is still relevant [14:27:46] Ok. [14:28:08] i guess, hm, peruse that for a sec, then I'll show you how some of the more recent coordinators are working [14:28:14] and also give ideas for how they could be made better [14:28:21] Great! [14:29:33] qchris: did you write the comments regarding setting up the kraken repo btw? [14:29:34] k ,ja lemme know when you've read through that [14:29:42] drdee: Yes. [14:30:19] can you add those notes to the README.md and push it to github? [14:30:53] drdee: Yes, sure. [14:33:48] ty [14:48:49] ottomata: I read it and now have an idea what you meant in your github comment :-) [14:49:01] aye, cool [14:49:12] i don' thtnk taht document talks about sub workflows though [14:49:18] those are somehting david figured out [14:49:22] they're really handy [14:49:24] lemme show ya :) [14:49:30] Ok :-) [14:49:38] so [14:49:40] * qchris takes the Oozie tour. [14:49:41] in kraken repo [14:49:46] oozie/util [14:49:56] that's the coalesce sub workflow [14:50:17] yes. [14:50:21] it contains a couple of hadoop streaming actions [14:50:33] so, basically just a big parallelized cat | sort [14:50:47] we use this to pull together the time bucketed data output of periodic jobs [14:50:49] into a single file [14:51:07] its really inefficient, in that it deletes the current file, and then re-cats all of the previous data into a new file every time it runs [14:51:08] but it works [14:51:11] anyway [14:51:19] :-) [14:51:21] since coalese-wf.xml is a standalone workflow [14:51:27] it can be used on its own as a one-off job [14:51:29] New patchset: Milimetric; "added csrf back into Metric forms" [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72098 [14:51:41] Change merged: Milimetric; [analytics/wikimetrics] (master) - https://gerrit.wikimedia.org/r/72098 [14:51:48] if you have some files you want to coalesce, you can create a workflow.properties with the relevant settings [14:51:51] and submit it to oozie [14:51:53] OR [14:52:00] Sounds good. [14:52:01] you can use is as a subworkflow action of another workflow [14:52:02] so [14:52:22] check out oozie/mobile/platform/workflow.xml [14:52:32] its pretty self explanitory [14:52:39] after the process action [14:52:46] the coalesce action happens [14:52:51] which is a sub-workflow [14:52:57] ok to="coalesce" :-) [14:53:03] pointing at the coalesce-wf.xml for app-path [14:53:04] exactly [14:53:18] and it passes the output of the process action as the input of the coalesce action [14:53:20] effectively [14:53:27] process | coelesce [14:53:29] So a separate workflow to notify about errors is what we want :-) [14:53:36] right, well, maybe! i don't know [14:53:40] i suggested that as on option [14:53:44] drdee's might be nicer in this case [14:53:55] since it might not make sense to have a standalone workflow that just emails errors... [14:53:56] or maybe it would [14:53:57] ? [14:53:59] i'm not sure [14:54:16] can I show you a couple more things? [14:54:21] my thought was to have a macro that we could easily reuse across different workflwos [14:54:21] Sure. [14:54:26] just to give you an idea of some intentions I have for all of this? [14:54:27] xInclude seemed easy [14:54:37] yeah, that might be better, i think i just wanted to consider both ways [14:54:41] ok so qchris [14:54:45] while working on this stuff [14:55:01] i noticed that there is a very easily abstractable pattern here [14:55:13] and right now there is a *lot* of copy/pasted xml all over the place [14:55:19] which is fine [14:55:23] :-/ [14:55:30] but makes creating new jobs that follow this pattern very cumbersome [14:55:42] Yes, obviously [14:55:46] so, since oozie itself is so finicky and difficult to debug, i've been taking abstractions one step at a time [14:55:53] so, let's look at the most recent one :) [14:56:04] oozie/mobile/zero/carrier_country [14:56:12] I put a README file in there [14:56:26] * qchris reads it. [14:57:43] So you're talking about the last paragraph? [14:57:48] yeah, basically [14:57:48] so [14:57:56] we currently supply 2 differnet metrics for Zero [14:57:58] carrier and country [14:58:08] ok. [14:58:10] there is no difference in the oozie configs between the two [14:58:16] we used to have two directories there [14:58:20] (uhh, i think) [14:58:24] with the exact same stuff [14:58:28] except the pig script was different [14:58:46] so, I combined them into one, and made the pig script a parameter to set in the .properties files [14:59:07] sounds reasonable. [14:59:11] the next step would be to create even more generic coordinators [14:59:19] that parameterize everything in property files [14:59:27] anything that works like this could be done that way: [15:00:06] - input data set, input frequency [15:00:06] - output data set, output frequency [15:00:07] - pig script (takes $input and $output parameters) [15:00:21] which is pretty much how all of our oozie jobs work right now [15:00:33] Annyyyywyayyyyy :) [15:00:41] that would be nice one day :) [15:00:45] :-) [15:00:48] i just wanted to give you an overview of where things should be headed [15:01:00] but now back to drdee's email thang [15:01:02] : [15:01:02] Yes, sounds pretty good :-) [15:01:04] Ok. [15:01:09] ja, whatever is best :) [15:01:20] subworkflow vs xInclude [15:01:26] whatcha think? [15:01:27] Well ... I do not know our setup well enough to make a decision :-) [15:01:29] haa [15:01:30] yeah [15:01:42] what more information would you need? [15:02:06] we can also do both and see in practice what works bets [15:02:12] :-) I saw your java code [15:02:23] and from that I cannot know how you like to configure things :-/ [15:02:30] i'm looking over some of our oozie stuff now, thinking... [15:02:34] I'd also say that both variants look like they'd work [15:02:55] So it's rather a matter of taste to me, that any hard criteria. [15:03:04] s/that/than/ [15:03:19] what do you prefer given your current understanding? [15:03:53] As I understood it, xinclude is now to the team, subworkflows are not. So I'd take subworkflows. [15:04:20] hmm, subworkflows are basically only not new to me and david :p [15:04:36] gimme a sec, I am forming an opinion :) [15:04:40] ottomata: hehe. But better 2 than 0. [15:05:22] ok, let's do subworkflows [15:05:23] with a little tweaking, I think I'm leaning towards xInclude [15:05:40] so, i *think* that since the subworkflow is a separate workflow [15:05:40] Ha! You switched positions :-) [15:06:04] oozie would submit it as a separate workflow, which has some overhead, [15:06:13] the more subworkflows, the more job IDs there are [15:06:20] the more you have to dig deeper to debug things [15:06:37] each workflow and action (sub or not) gets its own ID [15:06:56] and is maintained in oozie as a distinct concept [15:07:23] qchris, lets do a quick hangout so I can share my screen and show you what I mean [15:07:26] this will be useful too :) [15:07:28] s'ok? [15:07:37] Ok. I'll boot the hangout machine. [15:07:44] * qchris heads to the batcave [15:19:19] drdee, we should submit an RT to get qchris access to analytics nodes, eh? [15:19:32] yes, i thought i already had done that [15:19:44] https://rt.wikimedia.org/Ticket/Display.html?id=5403 [15:20:28] ok cool [15:20:30] we need to get toby to approve [15:59:17] lol [15:59:18] self.mwSession.add(Revision( [15:59:18] rev_id=5, rev_page=1, rev_user=2, rev_comment='Evan edit 3', [15:59:18] rev_len=-4, rev_timestamp='20130724000000', [15:59:18] )) [16:01:56] In my test fixture, I'm adding an edit from evan on July 24, contributing -4 bytes. All in the spirit of making him not regret leaving :) [16:04:08] ottomata, qchris: so what is it: xInclude or sub workflow? [16:05:03] milimetric: thanks, hehe [16:05:09] :) [16:05:31] i'm gonna commit these bytes added tests (all failing) before getting lunch [16:05:36] but i can catch up if you want [16:05:53] erosen ^ [16:07:16] milimetric: k [16:07:30] let's check in post scrum; still working on cleaning up some dashboard code [16:15:10] baack [16:27:42] drdee, i'm still not sure about these file imports, they are 1.5g right now [16:27:49] even though i'm pretty sure i'm importing the proper stuff [16:28:03] do you have some charts? [16:33:54] of filesizes? [16:34:00] no but that would be a good thing to monitor [16:37:07] drdee [16:37:07] https://gist.github.com/ottomata/5935722 [16:56:08] ottomata: maybe yurik's change? [16:56:26] mayyybe, dunno [16:56:42] we did see a slight uptick in varnish traffic, so we were investigating that [16:57:10] ottomata: where should I place the cronjob in puppet ? [16:57:13] ottomata: I read the docs you gave me [16:57:23] ottomata: but I mean in the puppet repo where should I place the cron job [16:57:47] want to make a gerrit patchset for this [16:58:26] aye, (i just puppetized your deps, btw) [16:58:27] hm [16:58:33] um [16:58:44] HMM, actually, if there is a cron job to come [16:58:47] hm [16:58:51] what does your cronjob do? [16:58:56] what code does it run? [16:59:00] (like, where on stat1002?) [16:59:56] average ^ [17:00:44] ottomata: http://goo.gl/Bd0Db [17:00:51] ottomata: that's a small script to install the cron job [17:02:10] hmmm, drdee, i'm not so sure this cron job should be puppetized…….hmmm [17:02:13] 20 7 01 * * stats /bin/bash /a/wikistats_git/pageviews_reports/bin/stat1-cron-script.sh [17:02:17] unless we also want to puppetized /a/wikistats_git [17:02:20] ottomata: ^^ this is basically the cron line [17:23:13] ottomata: looks like the the varnish hostname bug is in fact bubbling up: http://gp.wmflabs.org/graphs/free_mobile_traffic_by_version [17:25:41] makes sense about %50 here [17:25:59] aye, i need to get that working, i realized it is a weee ahrder than I thought [17:26:11] because I don't know the exact times when things changed [17:26:17] I can search and make guesses [17:26:23] i worked on that a bit yesterday [17:26:27] i think i will work on that today [17:26:29] and ask you some qs [17:29:43] ottomata: let me know when you have qs [17:38:49] k [18:06:30] changing locs, back in a bit [18:09:11] erosen: I factored out the comma separated integer list field and the better boolean field [18:09:22] they're in /metrics/form_fields.py [18:09:25] (pushed) [18:12:27] milimetric: great [18:31:34] baaack [18:36:42] average, how goes it? [18:44:45] heya erosen [18:44:49] yo [18:44:49] do you konw what % of mobile traffic is zero? [18:44:55] hmm [18:44:59] i think quite smal [18:45:03] yeah i woul dthink so too [18:45:06] but I don't know off the top of my head [18:45:13] actually we have a udf that would answer that question pretty easily, eh? [18:45:21] yup [18:45:30] isZeroPageView, right? [18:51:19] ja [18:55:49] 20:55 <+gerrit-wm> New patchset: Stefan.petrea; "Added cronjob for new mobile pageviews" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/72125 [18:56:20] ottomata: what machine will the cronjob run on ? [18:56:24] stat1001 ? [18:56:25] stat1002, right? [18:56:28] uh [18:56:45] /a/wikistats_git is not on stat1001 [18:57:00] so, anything that you used to do for wikistats on stat1, should now be done on stat1002, right? [18:57:10] yes [18:57:17] and stat1001 is just for publishing reports [18:57:37] so why did Erik tell me to rsync the csv to stat1002 .. hmm [18:57:46] yurik, erosen, i'm only looking at a single 15 minute period of traffic [18:57:56] but it looks like about 0.5% of mobile traffic is zero [18:58:03] now, stat1002, through this cronjob is basically rsync-ing the csv to itself, which is obviously wrong [18:58:12] so, yurik, if your change doesn't affect the regular mobile traffic, that won't have anything do to with the increase I'm seeing [18:58:31] ottomata: does Erik run his cronjobs on stat1002 or on stat1 ? [18:58:39] he should be using stat1002 [18:58:43] since we switched over [18:58:55] so there's virtually no need to rsync. regular copying is sufficient [18:59:09] hm, well [18:59:12] where does it copy to? [18:59:20] average, my understanding is: [18:59:24] stat1002 generates data [18:59:26] ottomata, i don't think we have changed anything in terms of mobile traffic being identified as such. Zero extension rides on top of the mobile one, and at the begining it makes a decision if it needs to insert a header or not [18:59:29] which is then rsynced over to stat1001 [19:00:00] /a/wikistats_git/dumps/csv/csv_sp/ [19:00:49] ottomata: yes, but my cronjob is only producing a csv, which is then being picked up by one of Erik's cronjobs(which are also on stat1002 now) and then the reports are made and sent to stat1001 [19:01:13] ok, so you don't need to rsync then [19:01:15] but rsync is fine locally [19:01:19] so if it works its fine [19:01:25] ok [19:01:26] average [19:01:35] average, does that script take a long time to run? [19:01:41] i'd like to try to run it as the stats user before I merge [19:01:47] so we can see if there are any permission problems [19:02:48] last runtime was 10h50m [19:03:07] hmm, ok [19:03:12] i think i see some with group writing [19:03:16] i'm just looking at dir permissions [19:03:20] i will fix them, hopefully it will work :/ [19:03:51] somehow I doubt it will :p [19:04:08] there are some more problems. it's rsync-ing to stat1002.wikimedia.org which doesn't exist, the correct hostname being stat1002.eqiad.wmnet [19:04:14] I need to fix that too [19:04:37] oh ha [19:04:37] ok [19:05:16] average [19:05:16] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter day at /var/lib/git/operations/puppet/manifests/misc/ [19:05:46] i think you want [19:05:47] weekday [19:06:23] oh [19:11:21] ottomata: https://gerrit.wikimedia.org/r/72128 [19:25:42] ottomata: what is the status of the hostname mixup? [19:26:17] ottomata: Amit wants to share the dashboards with the carriers, but I wanted to give him a realistic picture of the extent of the problem [19:26:28] what problem? [19:26:53] well, basically, you can 2x your numbers, if you want but [19:26:54] righ tnow [19:27:00] i am running a pig job to find the last occurence of a bad host [19:27:04] i already have the first [19:27:07] that will give me a time range [19:27:10] once I have that [19:27:13] i'll remove the bad data [19:27:18] and then duplicate the resulting fileset [19:27:21] and put it back in place [19:27:27] then I can rerun the oozie jobs for that range [19:27:30] drdee: are you familiar? [19:27:46] drdee, i'm fixing the data I messed up with my regex typo [19:27:50] starting june 25 [19:28:06] ok yup i am familiar [19:28:40] ottomata: what is the rough date of the issue? [19:28:49] just so i can give amit the high-level picture [19:28:50] june 25 - july 4 [19:30:15] tell amit i am sorry for all the data trouble [19:30:17] ! [19:30:20] hehe [19:30:21] i will [19:30:31] or rather, I'll say WE are sorry [19:30:33] erosen, i am very curious as to what the last few days will look like [19:30:34] and that you are working on fixing it [19:30:35] especially today [19:30:36] july 5 [19:30:56] there's 30 or 40% more mobile traffic, as far as I can tell [19:31:04] why's that? [19:31:06] dunno [19:31:13] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=&vl=&x=&n=&hreg[]=%5Ecp%28104%5B6-7%5D%7C1059%7C1060%7C301%5B1-4%5D%29&mreg[]=frontend.client_conn>ype=stack&glegend=show&aggregate=1&embed=1&_=1373049536222 [19:32:02] interesting... [19:33:21] average: cronjob is installed :) [19:34:15] ottomata: thank you [19:34:37] ottomata: how did you find that out or .. what has to be done after making the change in puppet ? [19:35:05] * average wants to learn more about puppet and hopes to become Ops when he grows up [19:35:07] once it is merged, puppet will eventually run and install it [19:35:10] haha [19:35:13] i ran puppet manually [19:35:22] i saw that is installed because I can look at stats user's crontab [19:37:29] erosen, got my dates [19:37:33] bad data is roughly between [19:37:34] 2013-06-26T13:39:57 [19:37:34] 2013-07-04T14:18:45 [19:38:50] great [19:39:30] ottomata: crontab -u stats -l ? [19:39:42] except I'm not allowed -u but yeah. [19:39:46] yeah [19:39:56] thanks [19:41:36] erosen: you have any graphs around with markdown in the description? [19:41:45] i used to [19:41:46] you used to have those nice M/Z sections [19:41:48] k [19:41:48] i think i removed it [19:41:50] np [19:42:07] milimetric: let me check real quick [19:42:57] milimetric: yeah, just pure html [19:43:33] cool, thanks erosen [19:56:32] erosen, w do have missing data though [19:56:42] from the times that udp2log was borked because of all the data [20:04:06] ottomata: k [20:04:18] ottomata: what do you think about making a wikipage with outage notes and such? [20:04:27] not a bad idea [20:06:09] I'll start it if you'll fill it in..;) [20:06:12] k [20:06:24] oh hey [20:06:25] wanted to ask [20:06:33] why does [20:06:34] http://gp.wmflabs.org/graphs/free_mobile_traffic_by_version [20:06:40] still have those gaps in feb march? [20:06:51] didn't drdee, run pig over the whole timeframe and output data? [20:07:23] we manually dropped some outliers in feb and march [20:07:33] we should really interpolate missing values [20:07:37] hm, i thought those dates were because the data was in the wrong direcotries [20:07:47] also [20:07:56] hm [20:08:01] i have a hardcoded list of dates to ignore [20:08:08] hm, ok [20:08:17] ideally we could keep such a list in sync with a wikipage that explains each outage [20:08:29] erosen, I am removing bad host data from dt=2013-06-26_14.00.00 [20:08:29] …dt=2013-07-04_14.30.00 now [20:08:36] great [20:09:02] i'm going to do the same thing I did with the may data, i.e., dump the output into a single direcotry, so it won't be timebucketed well [20:09:13] but i'll be able to just run pig jobs outside of oozie [20:09:17] and save the data into /wmf/data [20:09:24] which will then be coalesced into the .tsv [20:09:32] i will do the same for the end of may [20:09:38] hm, actually, i think i can just rerun oozie for the end of may [20:09:39] hmm [20:09:40] excellent [20:09:46]