[00:08:04] (CR) Nuria: "We need corresponding puppet changes for ARCHIVE_TABLENAME : 'archive_userindex'" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161365 (owner: Milimetric)
[01:20:57] (PS7) Nuria: [WIP] Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887)
[01:21:45] (PS8) Nuria: [WIP] Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887)
[01:31:19] i am going to have to do a bit of rewriting to properly test
[01:31:30] the state changes
[12:45:10] (PS3) QChris: Add Oozie setup to rendering webstatscollector files [analytics/refinery] - https://gerrit.wikimedia.org/r/161906
[12:45:12] (PS5) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files [analytics/refinery] - https://gerrit.wikimedia.org/r/160636
[12:52:42] haha, qchris, you are supposed to argue with me about the 'webstats' name
[12:52:50] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[12:53:00] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[12:53:40] ottomata: I first thought about calling out the repo name. And that there is also a "filter" binary.
[12:53:50] The use of webstatscollector on wikitech, and many arguments.
[12:54:01] Then I realized that I don't care :-D
[12:54:09] puppet uses /a/webstats .
[12:54:15] Which makes your argument valid.
[12:54:33] And as webstatscollector will go away anyways ... I thought it would not be worth it.
[12:55:49] haha, but I made puppet use /a/webstats :p
[12:56:14] And you also made wikitech use webstatscollector.
[12:56:28] I did?
[12:56:51] https://wikitech.wikimedia.org/w/index.php?title=Analytics/Webstatscollector&action=history
[12:56:56] i did create the page
[12:57:10] well that is documentation of the code! :)
[12:57:22] But as it's technology that we're phasing out, I do not care too much about the name of the dataset.
[12:57:31] ok ok, then I win! :p
[12:57:47] * Ironholds quietly drags sheet metal into the back of the room and begins assembling a bikeshed ;p
[12:57:48] Congratulations :-D
[12:57:49] ok, as for the data location, /user/hive vs /wmf/data
[12:57:53] haha
[12:58:13] i am not sure which I like better. I agree that /wmf/data on its own makes more sense
[12:58:38] but i guess i had kinda thought that we would keep hive generated data within hive defaults
[12:59:04] Sure, but that makes it awkward to use outside of hive.
[12:59:13] true, agree.
[12:59:39] When using non-external tables and providing a location, we should get best of both worlds.
[12:59:47] Hive can kill the data files, and still
[12:59:56] other tools find them at a sane place.
[13:00:04] ha, now that we are making this decision, i kinda wish we had not put the leading /wmf on that path...
[13:00:18] It's not too late to change that.
[13:00:26] We're only having a few jobs.
[13:00:40] hmm, i suppose not. users would have to get used to it, but I guess most of them just use it through hive anyway
[13:00:42] so they might not notice
[13:00:47] Right.
[13:01:11] We'd need some top-level name ( /a, /srv, ... ).
[13:01:15] /data//
[13:01:18] i'm ok with /data i think
[13:01:59] Where would refinery go? I guess /refinery ? Mhmm. Not sure.
[13:02:10] oh
[13:02:15] forgot about /wmf/refinery
[13:02:15] oh
[13:02:16] hm
[13:02:18] HMMM
[13:02:21] mehhhHHHHH
[13:02:28] maybe /wmf/data is fine then
[13:02:34] /wmf/data//...
[13:02:44] we'll just have a redundant path there because one of the databases is 'wmf'
[13:02:51] /wmf/data/wmf
[13:02:52] ?
[13:02:56] I am not too sure about the database_name.
[13:03:00] We're mostly hive now.
[13:03:05] For hive, that makes sense.
[13:03:09] But outside of hive?
[13:03:32] We could also use /wmf/data/raw/$DATASET and /wmf/data/$DATASET
[13:03:47] hm, well, that's ok even outside of hive, as there would be a reason beyond hive to have the separate databases or paths
[13:04:01] i think i'd like to keep the directory level the same
[13:04:04] for datasets, if we can
[13:04:11] Ok.
[13:04:35] Then we keep everything as is?
[13:04:41] /wmf/data/raw
[13:04:41] /wmf/data/wmf
[13:04:47] i guess the database name is a loose mapping :p
[13:04:50] :/
[13:04:54] wmf_raw
[13:05:10] i'm ok with it, but it feels inconsistent so I don't like it
[13:05:14] I like that "wmf_raw" transitions into "wmf" when it no longer is raw.
[13:05:23] yeah i like that too
[13:05:30] but i feel like the path should have been /wmf/data/wmf_raw
[13:05:31] :/
[13:05:45] We can migrate to that.
[13:05:58] I think both variants are ok.
[13:06:22] ok, well, at least we don't have to change the wmf_raw path right now, as that is irrelevant for the webstats work, ja?
[13:06:28] Right.
[13:06:34] ok, so we will keep /wmf/data/wmf
[13:06:37] you win on this one! :)
[13:06:52] * qchris does victory dance :-D
[13:07:31] (CR) Ottomata: Add Oozie setup to aggregate Webstatscollector pagecount files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[13:09:29] qchris, is it not possible to calculate year_plus_1_hour in the workflow?
[13:09:33] does that have to be a parameter?
[13:10:48] The dateOffset function is not available in workflows.
[13:11:08] also formatTime and the other time handling functions.
[13:11:23] hm ok
[13:11:44] We could prepare the archive filename in the coordinator, but that would feel wrong.
[13:16:09] hm, why would that feel wrong?
[13:18:44] Because it does mess with the split of coordination of which datasets to operate on (coordinator) with what/how to work (workflow)
[13:19:21] That sentence does not make sense ... let me try again ...
[13:19:43] Because it messes with the split of "coordination of which datasets to operate on" (coordinator) and "what/how to work" (workflow)
[13:21:16] hmm, yeah i suppose, especially because you need to base the base time values for the hive query
[13:21:22] btw, i just looked up if there was another way
[13:21:23] and there is!
[13:21:26] but we should not do it.
[13:21:31] http://blog.cloudera.com/blog/2013/09/how-to-write-an-el-function-in-apache-oozie/
[13:21:47] :p
[13:22:00] sorry
[13:22:05] "hmm, yeah i suppose, especially because you need the base time values for the hive query"
[13:24:00] Sure, we can have our own EL functions.
[13:24:10] But would it be worth it in this case?
[13:24:29] nope
[13:24:36] we should not do it
[13:24:51] that was just new to me
[13:24:53] didn't know you could do that
[13:25:02] I see.
[13:25:10] EL is great.
[13:25:15] It's soooo extendable.
[13:25:22] It's one of the cool things about JSP.
[13:25:39] Sadly, we cannot use JSP's taglibs for our EL :-/
[13:26:00] We only have bare-bones EL, which is ... limited.
[13:29:19] qchris, is the reason you have to do the fancy ${"$"} stuff in webstats/datasets.xml because of the single padded values?
[13:29:28] so you have to do something fancy to cast them to numbers?
[13:29:32] sorry
[13:29:34] not single padded
[13:29:38] unpadded*
[13:29:47] e.g. month=9 vs month=09
[13:29:48] It's because the file is evaluated twice.
[13:29:56] ?
[13:29:58] We cannot do the comparisons in the first pass.
[13:30:09] As there, ${MONTH} evaluates to ${MONTH}
[13:30:26] So we need to skip the first pass, and evaluate in the second pass.
[13:30:28] why is this not a problem in webrequest/datasets.xml?
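The workaround being discussed, precomputing time values like year_plus_1_hour in the coordinator because coord:dateOffset and coord:formatTime exist only in coordinator EL (not workflow EL), might look roughly like the hypothetical fragment below; the property and variable names are invented for illustration:

```xml
<!-- Hypothetical sketch of a coordinator action that precomputes a
     time value for its workflow. coord:dateOffset and coord:formatTime
     are only available in coordinator EL, so the value is computed
     here and handed to the workflow as an ordinary property. -->
<action>
  <workflow>
    <app-path>${workflow_file}</app-path>
    <configuration>
      <property>
        <name>year_plus_1_hour</name>
        <!-- Year of (nominal time + 1 hour), e.g. for naming archive files. -->
        <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), 1, 'HOUR'), 'yyyy')}</value>
      </property>
    </configuration>
  </workflow>
</action>
```

The workflow then just references `${year_plus_1_hour}` like any other parameter, which is why it has to be passed in rather than computed in place.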
[13:30:36] There ${MONTH} evaluates to "09"
[13:31:01] Because we do not do EL functions there.
[13:31:24] It's only EL constants there, and they evaluate to the EL construct to get their value.
[13:31:32] So ${MONTH} evaluates to ${MONTH}
[13:31:40] (in the first run)
[13:32:11] still don't understand...
[13:32:24] where is the EL function for webstats?
[13:33:18] The + is an EL function.
[13:34:57] and that is to make sure it is 9 and not 09?
[13:35:14] Yes.
[13:35:24] weird
[13:35:39] that's pretty annoying that hive and oozie can't work together better on that
[13:35:47] it is 9 and not 09 because hive, right?
[13:35:49] that's just how it does it?
[13:35:51] month=9
[13:35:52] ja?
[13:36:09] If you want weird ... the first working version of that conversion was a line that's 1060 characters long :-D
[13:36:35] Right, that's because of hive.
[13:37:39] haha
[13:37:45] hm.
[13:51:59] qchris, could you just make the coordinator's month format with the padding in the first place?
[13:52:26] ${coord:formatTime(coord:nominalTime(), "M")}
[13:53:12] hm MM?
[13:53:13] I am not sure I understand ...
[13:53:13] hm
[13:53:21] so, i just checked this
[13:53:21] We want to unpad ... not to pad.
[13:53:28] why do we want to unpad?
[13:53:37] if the original hive query inserts with partition (month=09)
[13:53:39] then the file path will be
[13:53:42] .../month=09/...
[13:54:00] OH, because of that other hive problem?
[13:54:00] Right, but we do not want to insert with partition (month=09)
[13:54:10] We want partition (month=9)
[13:54:20] because of this?
[13:54:20] https://wikitech.wikimedia.org/wiki/File:Hive_partition_formats.png
[13:54:24] Otherwise, comparisons in Hive will ...
[13:54:25] Right.
[13:54:29] oof
[13:54:32] so annoying
[13:54:35] Yes. Totally.
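The ${"$"} escaping and the unpadding "+" described above could be sketched as the hypothetical datasets.xml fragment below (the dataset name and directory property are invented). The file goes through EL evaluation twice: on the first pass ${"$"} yields a literal $, leaving ${MONTH + 0} intact for the second pass, where the EL "+" coerces "09" to the unpadded 9 that Hive uses as a partition value.

```xml
<!-- Hypothetical fragment. First EL pass: ${"$"} becomes a literal
     "$", so ${MONTH + 0} survives unevaluated. Second pass: MONTH
     is "09", and the EL "+" coerces it to the number 9, matching
     Hive's unpadded partition values (month=9, not month=09). -->
<dataset name="webstats_hourly" frequency="${coord:hours(1)}"
         initial-instance="${start_time}" timezone="Universal">
  <uri-template>${webstats_data_directory}/hourly/year=${"$"}{YEAR + 0}/month=${"$"}{MONTH + 0}/day=${"$"}{DAY + 0}/hour=${"$"}{HOUR + 0}</uri-template>
  <done-flag>_SUCCESS</done-flag>
</dataset>
```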
[13:55:20] ok, but it's ok to have the path have 09, we just don't want the partition value in hive to be 09, but because we are not manually specifying the path, we can't control that
[13:55:29] ah, but webrequest has paths with unpadded values
[13:55:34] (which is annoying. :p)
[13:55:46] wait
[13:55:48] does it?
[13:56:09] I think it doesn't.
[13:56:13] In Hive we want unpadded.
[13:56:14] it is padded in the path
[13:56:19] /wmf/data/raw/webrequest/webrequest_bits/hourly/2014/08/08
[13:56:26] but the hive partition value is unpadded
[13:56:27] ok
[13:56:29] so, IF
[13:56:33] In HDFS we want padded.
[13:56:39] we could control the partition path location, we would want padded
[13:56:55] which is possible in hive 0.13 :/
[14:00:57] This works in Hive 0.13?
[14:01:01] Mhmm. Interesting.
[14:02:01] qchris, maybe only for external tables though...
[14:02:02] not sure
[14:02:02] https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables
[14:27:11] (CR) Ottomata: Add Oozie setup to aggregate Webstatscollector pagecount files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[14:31:20] qchris, more naming nits are coming out to me! :p
[14:31:25] can we use 'generate' instead of 'render'
[14:31:26] ?
[14:31:32] generate_hourly_files
[14:31:51] we did that for sequence stats
[14:33:36] Sure, but I think that those two are different things.
[14:33:58] sequence_stats is generating new data
[14:34:19] For the hourly files, it's just changing representation a bit.
[14:34:49] hm, but generate_...file is generating a file
[14:34:58] somehow, rendering means something different to me
[14:35:09] like, rendering a template, or graphics
[14:35:15] hm
[14:35:20] ok meeting time..:/
[15:31:50] (CR) Ottomata: "I don't like the word 'render' here, but you might be able to convince me in IRC. :)" (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[16:55:12] nuria: eventlogging seems to already have qunit setup. Jenkins runs *some* qunit tests according to the comment on the patch
[17:03:14] prtksxna: great to know
[17:03:24] I do not think we have js tests though
[17:28:38] (PS4) QChris: Add Oozie setup to rendering webstatscollector files [analytics/refinery] - https://gerrit.wikimedia.org/r/161906
[17:28:40] (PS6) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files [analytics/refinery] - https://gerrit.wikimedia.org/r/160636
[17:29:24] (CR) QChris: [C: -2] "Whoops. Accidental push." [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[17:40:09] (CR) QChris: "Now the tests finished and passed just fine." [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[17:40:22] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[17:50:12] (CR) QChris: Add Oozie setup to rendering webstatscollector files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[17:50:44] (CR) QChris: [C: -1] "Overlooked the top-level comment about the README." [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[18:00:34] (CR) QChris: "> I don't like the word 'render' here, but you might be able to convince" [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[19:03:56] (CR) Ottomata: "Ok, no that's fine, I missed that somehow. workflow.properties is fine for documentation." [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[19:10:22] Analytics / Tech community metrics: List of Phabricator users - https://bugzilla.wikimedia.org/35508#c18 (Sumana Harihareswara) I figure ECT will be able to use a list of Phabricator users once we've made the switch.
[19:29:24] (CR) Ottomata: [C: 2 V: 2] Add Oozie setup to aggregate Webstatscollector pagecount files [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[19:30:14] (CR) Ottomata: [C: 2 V: 2] Add Oozie setup to rendering webstatscollector files [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[19:30:33] (PS1) QChris: Make Oozie job name endings reflect the job type [analytics/refinery] - https://gerrit.wikimedia.org/r/162364
[19:34:59] (CR) Ottomata: [C: 2 V: 2] Make Oozie job name endings reflect the job type [analytics/refinery] - https://gerrit.wikimedia.org/r/162364 (owner: QChris)
[19:35:06] ah just saw that one
[19:35:09] going to deploy with that now too
[19:35:16] Ok. Thanks!
[19:42:57] oh, i didn't ask but I assume you checked, GzipCodec worked just fine with gunzip and ilk?
[19:43:03] qchris ^?
[19:43:19] It works with gunzip.
[19:43:25] cool
[19:43:32] What is "ilk"?
[19:43:35] Gonna check ...
[19:43:50] :)
[19:44:38] ok, i'm attempting to launch the insert hourly pageview coord
[19:44:48] Yippie!
[19:44:51] just gonna start it at 00:00 today
[19:45:42] it is running!
[19:46:08] my queue_name didn't stick!
[19:46:10] GRRR
[19:46:10] oh well
[19:46:20] dunno, i'll let it go
[19:46:23] we are going to restart them soon anyway
[19:48:42] I see the workflow running, but "oozie jobs -verbose -jobtype coordinator" does not work for me on stat1002.
[19:48:48] "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long"
[19:48:54] whoa, weird
[19:49:00] i'm running on analytics1027
[19:49:34] There it works. Thanks.
[19:49:44] i get that on stat1002 too
[19:49:45] weird
[19:51:06] this works on stat1002 though
[19:51:06] oozie job -info 0027591-140725140105408-oozie-oozi-C
[19:51:25] Strange. Thanks.
[19:53:26] Analytics / General/Unknown: Listing oozie coordinators on stat1002 fails with type conversion error - https://bugzilla.wikimedia.org/71192 (christian) NEW p:Unprio s:normal a:None On stat1002, running oozie jobs -jobtype coordinator fails with java.lang.ClassCastException [1]. The co...
[19:55:28] ah, qchris, typo in comment in bundle.properties
[19:55:34] refers to coordinator.properites there
[19:56:24] * qchris looks ...
[19:57:04] Right. Fixing it.
[19:58:46] oo, didn't think about this one
[19:58:46] archive_directory = ${name_node}/wmf/data/archive
[19:58:47] hm
[19:58:51] i suppose that's cool
[19:58:52] hm
[19:58:54] hm
[19:58:55] hm
[19:58:55] hm
[19:58:57] ok yeah
[19:59:09] since we are not forcing /wmf/data/ mappings
[19:59:11] ok ok
[19:59:12] (PS1) QChris: Fix usage example in properties to generate hourly webstats files [analytics/refinery] - https://gerrit.wikimedia.org/r/162396
[19:59:48] Would you prefer /wmf/archive ?
[20:00:06] I figured that even the archive is "data", so I put it there.
[20:00:23] yeah, i'm not sure........
[20:00:28] on first glance I think I preferred /wmf/archive
[20:00:28] but
[20:00:34] i think i like this
[20:00:44] hmmm
[20:00:58] unless we want to enforce database names in /wmf/data...but i think we don't
[20:01:05] you convinced me earlier that that would make it tied too closely to hive
[20:01:09] so, /wmf/archive is fine
[20:01:32] You mean /wmf/data/archive, or /wmf/archive?
[20:01:35] sorry
[20:01:38] /wmf/data/archive
[20:01:38] yes
[20:01:40] i think its fine
[20:01:40] Ok.
[20:01:42] probably good.
[20:01:44] Cool.
[20:01:45] i think :/
[20:01:47] :)
[20:01:55] oh boy!
[20:02:01] Also ... we'll only probably have two tables ... wmf_raw and wmf.
[20:02:01] /wmf/data/archive/webstats/2014/2014-09/pagecounts-20140923-010000.gz
[20:02:09] databases.
[20:02:10] ja
[20:02:17] databases. Right. Sorry.
[20:03:16] projectcounts are pouring in too \o/
[20:05:28] :D :D
[20:05:35] thar she blows!
[20:06:08] i'll push the fuse issue again, and hopefully get that puppetized and mounted on stat1002 soon
[20:06:22] maybe we can even rsync over somewhere public before the quarterly...MAYBE
[20:06:26] Awesome!
[20:08:14] (CR) Ottomata: [C: 2 V: 2] Fix usage example in properties to generate hourly webstats files [analytics/refinery] - https://gerrit.wikimedia.org/r/162396 (owner: QChris)
[20:22:40] @tobie — Are you using Apache Mesos at all?
[20:23:37] tobie: ^^
[20:25:25] Nope. But why do you ask?
[20:25:38] preilly: ^
[20:26:27] Well I sold my company OrlyAtomics to Mesosphere and Mesosphere is the company behind Apache Mesos http://techcrunch.com/2014/09/17/mesosphere-snags-orlyatomics-in-acquihire-deal/
[20:26:59] preilly: and so?
[20:27:07] I wanted to try out Mesos at the foundation if it wasn’t already being used ;-)
[20:28:14] preilly: I think you're talking to the wrong tobie
[20:28:29] Okay my bad
[20:29:09] Oh this is Tobie Langel my bad
[20:29:29] preilly: ^ this :)
[20:47:47] preilly: Maybe, you're looking for tnegrin?
[20:48:04] yeah I think so my bad
[20:51:58] tnegrin: any desire to try out Mesos?
[20:52:23] ok -- you got me
[20:52:53] heh heh
[20:52:58] I'm open to it but I'm curious as to what services we should run on top of it initially?
[20:53:25] congrats on the sale btw
[20:53:31] tnegrin: thanks!
[20:54:02] we're experimenting with some of the twitter stack (scalding) right now
[20:54:08] tnegrin: want to come over to Mesosphere for one of our Friday happy hours?
[20:54:15] is there beer ;)
[20:54:20] yeah -- that would be great
[20:54:29] tnegrin: did you see https://gigaom.com/2014/09/22/benjamin-hindman-co-creator-of-apache-mesos-joins-mesosphere/
[20:54:37] Benjamin Hindman, co-creator of Apache Mesos, joins Mesosphere
[20:55:02] very cool
[20:55:07] tnegrin: free beer
[20:55:11] sold
[20:56:12] can you shoot me an email? tnegrin@wikimedia.org
[20:56:17] will do
[20:57:11] thanks preilly
[20:59:10] Are you using MR2?
[21:00:41] yes
[21:06:45] Are you using YARN features?