[08:11:07] gooood morning :) [08:11:54] my schedule for this morning is to complete some mediawiki server installs, reboot aqs100[23] for kernel upgrades and then follow up on port 7000 and firewall [08:16:37] elukey: trusty kernel is out, I'm testing it a bit, will keep you posted once I've installed it on analytics* [08:17:32] super thanks! [08:17:48] context for the team: we need to reboot the whole hadoop cluster \o/ [09:28:52] moritzm: aqs100[123] are running with 4.4 now, just finished [09:30:12] nice, thanks [09:48:06] Morning all! :) [09:49:00] o/ [10:43:54] Heya elukey, how is it going ? [10:43:59] o/ [10:44:13] good, aqs100[123] are now running with linux 4.4, nothing exploded [10:44:28] Great ! [10:44:35] I am reading puppet atm to figure out how to add the hadoop nodes to ferm [10:44:40] * joal loves when things don't epxlode [10:45:05] k, thanks for that :) [10:45:18] addshore: need help on testing your spark ? [10:45:30] oohh yes, but I wont be free for another 30 mins or so! [10:45:39] okey :) [10:45:48] Let me know when you need me [10:46:20] elukey: I think we can simply add a Hiera variable to list them [10:48:47] moritzm: I think that each hadoop node has a variable called cdh::hadoop::cluster_name = analytics-hadoop [10:49:09] can we use that one to have a dynamic list or do we need to explicitly list them all? [10:49:40] never done and I am not sure if this is ETOOMAGIC or not [10:52:24] (brb in ~30 mins, lunch!) [11:10:00] * urandom explodes [11:10:28] Woooow ... Anything we can help on urandom? [11:10:34] heh [11:18:49] joal: I'm ready when you are! :) [11:19:03] addshore: Hey :) [11:19:06] Here I am :) [11:19:19] so first I have to mvn build it? [11:19:47] in the refinery-source folder: mvn clean package [11:19:55] For build and package the thing [11:20:14] (clean is to make sure you don't have leftovers from a previous compilation) [11:21:03] Warning: 'build.plugins.plugin.version' for org.codehaus.mojo:exec-maven-plugin is missing. @ line 166, column 21 [11:21:09] Wikimedia Analytics Refinery ...................... FAILURE [0.004s] [11:21:33] hm [11:21:36] oh wait, I think intelij did the command wrong! [11:21:50] oh addshore, I was thinking in a terminal :) [11:21:56] yup, running now [11:21:57] not in intellij [11:22:11] I was making intellij essentially run "mvn mvn clean package" ;) [11:22:14] intellij has it's own way of building (even with maven) [11:22:19] Right [11:23:05] okay, now i get some failed tests in core "junit.framework.ComparisonFailure: expected: but was:" [11:23:12] Tests run: 795, Failures: 4, Errors: 0, Skipped: 0 [11:23:42] Ohhh ? [11:23:46] If I'm guessing I put this down to running and building on Windows. [11:23:49] Which module ? That's weird, [11:23:56] hm, maybe [11:24:25] whole output https://phabricator.wikimedia.org/P3308 [11:24:58] indeed addshore, seems to be encoding related [11:25:03] Let's skip tests [11:25:12] (let me recall how to that in maven) [11:25:15] :D [11:25:27] -DskipTests [11:25:58] *runs* [11:26:18] * addshore goes to put the milk away quickly [11:28:56] Cool, build success!! [11:29:14] great [11:29:45] Now, find the job jar: Normally in refinery-source/refinery-job/target/ [11:29:58] and upload it to stat1002 [11:30:13] Other way to do would have been to build on stat1002 straigt [11:30:35] so just refinery-job-0.0.32-SNAPSHOT.jar ? 
[11:30:41] yessir [11:30:48] You have everything needed in that [11:32:55] Uploading [11:33:45] addshore: next time, we'll go for build on stat1002, like that no need to upload :) [11:33:59] yeh, that probably makes sense :) [11:34:04] and then I expect the tests would also pass [11:34:14] addshore: normally they do :) [11:35:03] bah, uploaded it to the wrong machine >.> hit 3 instead of 2... [11:35:17] addshore: should work IIRC [11:35:57] addshore: actually, no, not working :( [11:36:11] But we should be avle to copy fropm stat1003 to stat1002 easily [11:36:17] ok, removed from 3, its on 2 now! [11:36:25] ok great [11:36:41] now you'll launch the associated spark job :) [11:37:29] How? ;) [11:38:52] spark-submit --class org.wikimedia.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode cluster /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar [11:39:12] This should give you the error message associated with wrong parameters parsing [11:39:14] running [11:40:03] Now addshore, I assume you want to override yout namespace default value to provide a test one ? [11:40:11] When running for real I mean :) [11:41:09] that may make sense. [11:41:19] diagnostics: Application application_1465403073998_57106 failed 6 times due to AM Container for appattempt_1465403073998_57106_000006 exited with exitCode: 10 [11:41:36] hm, I made a mistake in command: --deploy-mode client [11:42:01] I need to setup a proxy so that I can actually reach pages like http://analytics1001.eqiad.wmnet:8088/cluster/app/application_1465403073998_57106 too [11:42:18] addshore: you can, but you also can do without [11:42:22] oooh [11:42:43] Now a working command: spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar [11:43:08] two errors on the previous one: wrong class, and deploy mode cluster preventing us from having logs [11:43:28] The command gives us the messages from missing parameters [11:43:36] yup! [11:43:42] So, let's add parameters to that job :) [11:44:29] spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client --year 2016 --month 06 --day 27 --hour 01 /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar ?? [11:44:54] Nope, parameters for the job come after the jar [11:45:00] ahh, okay [11:45:08] And, you also want to override namespace and graphite host [11:46:19] spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar --year 2016 --month 06 --day 27 --hour 01 --namespace daily.test.articleplaceholder --graphiteHost graphite.eqiad.wmnet [11:46:52] spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar --year 2016 --month 06 --day 27 --hour 01 --namespace daily.test.articleplaceholder --graphite-host graphite.eqiad.wmnet [11:47:31] yes, with graphite being: graphite-in.eqiad.wmnet [11:47:46] graphite-in ? [11:48:05] Yes, that's what we have in the restbase spark job conf [11:48:37] interesting, yes that resolves, not seen that one before! 
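(For context on the wrong-parameters error above: spark-submit consumes everything before the jar path, so the job's own options have to follow the jar. Below is a minimal sketch of how a job like this might read those trailing arguments; it is an illustrative simplification with hypothetical names, not the actual WikidataArticlePlaceholderMetrics code, which would normally use a proper option-parsing library.)

    import org.apache.spark.{SparkConf, SparkContext}

    object ArticlePlaceholderMetricsSketch {

      // Hypothetical container for the job's parameters.
      case class Params(
        year: Int = 0,
        month: Int = 0,
        day: Int = 0,
        namespace: String = "daily.test.articleplaceholder",
        graphiteHost: String = "graphite-in.eqiad.wmnet"
      )

      // Hand-rolled parser that walks "--name value" pairs. Anything that
      // spark-submit consumed (--master, --deploy-mode, the jar path) never
      // reaches args, which is why job parameters must come after the jar.
      def parseArgs(args: Array[String]): Params =
        args.sliding(2, 2).foldLeft(Params()) {
          case (p, Array("--year", v))          => p.copy(year = v.toInt)
          case (p, Array("--month", v))         => p.copy(month = v.toInt)
          case (p, Array("--day", v))           => p.copy(day = v.toInt)
          case (p, Array("--namespace", v))     => p.copy(namespace = v)
          case (p, Array("--graphite-host", v)) => p.copy(graphiteHost = v)
          case (_, other) =>
            sys.error(s"Unknown or incomplete parameter: ${other.mkString(" ")}")
        }

      def main(args: Array[String]): Unit = {
        val params = parseArgs(args)
        val sc = new SparkContext(new SparkConf().setAppName("ArticlePlaceholderMetricsSketch"))
        // ... query pageview data and report to Graphite (see later sketches) ...
        sc.stop()
      }
    }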
[11:48:39] I think graphite is the host for getting data out [11:48:53] and graphite-in for sending data in [11:49:01] they both resolve to the same place, I use graphite.eqiad.wmnet in my other scripts [11:49:10] okey :) [11:49:11] perhaps I should change that though... [11:49:18] I'm not sure :) [11:49:28] So: spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client /home/addshore/refinery-job-0.0.32-SNAPSHOT.jar --year 2016 --month 06 --day 27 --hour 01 --namespace daily.test.articleplaceholder --graphite-host graphite-in.eqiad.wmnet [11:49:38] Go for it ! [11:49:46] *runs* [11:50:25] okey, now that it started: Maybe you should go for articleplaceholder.test.daily (and articleplaceholder.daily) as namespaces (like that, you find them easily [11:50:41] so how can I access http://analytics1001.eqiad.wmnet:8088/proxy/application_1465403073998_57110/ without seting up proxying? [11:50:56] oh if you want to access, you need proxy [11:51:07] But you don't need to acess since you have logs :) [11:51:12] test.* has special / different retention periods & granularity! [11:51:18] wait, I mean daily.* [11:51:28] addshore: Oh ! Didn't know that [11:51:33] ok reat [11:51:35] great [11:51:49] So what's up with the job? [11:52:07] https://github.com/wikimedia/operations-puppet/blob/5209c241eb56dd4989978573a8a57b061c52361c/modules/role/manifests/graphite/base.pp#L41 [11:52:26] saves a bit of space on graphite and also allows for longer retention [11:53:03] Exception in thread "main" org.apache.spark.sql.AnalysisException: Specifying database name or other qualifiers are not allowed for temporary tables. If the table name has dots (.) in it, please quote the table name with backticks (`).; [11:53:41] As well as that exception I have just realised that I'm trying to store hourly data in a space that will only accept daily data ;) [11:53:56] addshore: that last one is a good call :) [11:55:44] addshore: My bad on the CR : the SQL context you need is val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) [11:55:54] okay, I just checked with Lydia and we only want daily data! [11:56:06] ok cool addshore [11:56:12] it means more modifs [11:56:21] and not only my mistake on sqlContext [11:56:22] ;) [11:56:52] do I need the full class name therE/ [11:56:53] ? [11:57:03] addshore: default sorg.apache.spark.sql.SQLContext doesn't know about the hive metastore, while the subclass org.apache.spark.sql.hive.HiveContext does [11:57:12] No you don't, you can just modify the import [11:57:18] cool, just checking :) [11:57:22] :) [11:58:07] so I guess I also want to remove the --hour param [11:58:24] addshore: probably :) [11:59:25] and add day to group by? [12:00:24] addshore: remove hour from where, no need to add day in group by, since you only request data for the given day [12:00:56] but will that not give me 24 records then? [12:01:10] wait, no, bah, ignore that! [12:01:19] :) [12:01:59] (PS6) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [12:02:09] re packaging [12:02:22] addshore: doing on stat1002? [12:02:25] :-P [12:02:44] addshore: nevermind, I'm just teasing ;) [12:02:51] still locally, but I'll clone the repo on stat1002 now too! [12:03:25] Wikimedia Analytics Refinery Jobs ................. 
FAILURE [8.801s] :( [12:03:36] :( [12:03:40] object HiveContext is not a member of package org.apache.spark.sql [12:03:49] ahh sql.hive [12:03:52] org.apache.spark.sql.hive.HiveContext [12:03:54] Yup [12:04:26] (PS7) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [12:04:32] (CR) jenkins-bot: [V: -1] Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore) [12:05:39] okay, building locally and on stat1002 now! [12:05:45] You rock :) [12:05:52] no, you do! :D [12:06:21] Mutual back padding, always good ;) [12:06:46] I should have said you rule! https://s-media-cache-ak0.pinimg.com/564x/2d/86/9a/2d869aad72fe7175fd1dd5cabc2a2964.jpg [12:07:01] addshore: +1 ;) [12:07:24] okay, uploading the new jar (as stat1002 is still working) [12:09:15] okay, running again [12:09:45] joal hi :] [12:09:50] Hi mforns :) [12:10:03] what's up mforns ? [12:10:18] I discovered what was the thing [12:10:25] actually the code was working [12:10:27] yes ? [12:10:45] there are a couple users with lots of rights changes (not renames) [12:10:53] mforns: riiiiiiight [12:10:57] mforns: makes sense [12:10:59] user groups changes [12:11:23] I also discovered a bug in passing the user groups and blocks from events to states [12:11:27] and fixed it [12:11:35] That's good :) [12:12:23] and checked that the events that remain in the left side after the fixed point finishes (with size <= 1) are: 2350 [12:12:29] joal: afaik it finished running [12:12:32] aprox a 7-8% of all ervents [12:13:19] hm mforns, I'd like to investigate a bit on those [12:13:26] addshore: cool ! [12:13:30] addshore: success? [12:13:30] joal, sure [12:13:51] not appeared in graphite yet but normally there is a bit of delay! [12:14:58] joal: can you check https://grafana.wikimedia.org/dashboard/db/pageviews ? Do you see all the graphs correctly? [12:15:18] because sometimes grafana tricks me [12:15:24] not showing datapoints [12:15:48] some datapoints are missing, always on the same 2 charts (http status, 50X [12:15:49] and I start checking the whole world just to find out that it was nothing :D [12:15:52] elukey: -^ [12:16:46] weird [12:16:50] AQS seems working fine [12:18:24] hmm, there was: 16/06/28 12:12:16 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(73,WrappedArray()) [12:18:45] and 16/06/28 12:12:09 ERROR YarnScheduler: Lost executor 124 on analytics1049.eqiad.wmnet: remote Rpc client disassociated [12:18:46] elukey: don't know really :( [12:18:55] addshore: those are normal [12:19:09] no error in your client log? [12:19:13] addshore: --^ [12:19:40] my client log? just the stdout or should I also be looking somewhere else?> [12:19:48] stdout [12:20:01] nope, just those 2, the last line was 16/06/28 12:12:17 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. [12:21:15] addshore: ok, now that we know the job works, let's try with deploy-mode cluster [12:21:55] care to quickly explain the difference between the two? ;) now I will see less output I guess? [12:22:09] difference is about the master of the job [12:22:20] when in client mode, you have the master on stat1002 [12:22:26] for you to see the logs [12:22:38] so now the master / main work is being done on some other analytics machine not stat1002? 
[12:22:39] when in cluster mode, master in on one of the hadoop nodes [12:23:01] and, the last steps you do (after collect), happen in the master [12:23:15] And I'm not sure if stat1002 has write access to graphite [12:23:43] It does :) (again, thats where my current pile of scripts run) [12:23:54] Exception in thread "main" org.apache.spark.SparkException: Application application_1465403073998_57168 finished with failed status [12:24:16] https://www.irccloud.com/pastebin/bQ2Aon0T/ [12:27:19] addshore: when you want more detailled logs, you need to ask yarn (hadoop resource manager) and know your application Id [12:27:40] you can do : yarn logs --applicationId | less [12:28:21] And this tells me: ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: no such table wmf.pageview_hourly; line 4 pos 11 [12:28:24] org.apache.spark.sql.AnalysisException: no such table wmf.pageview_hourly; [12:28:27] Which makes no sense :( [12:28:42] yup, thats what I see too [12:29:55] when trying to run the query in hive I get "NoViableAltException(292@[287:1: atomExpression : ( KW_NULL -> TOK_NULL | constant | castExpression | caseExpression | whenExpression | nonParenthesizedFunction | ( functionName LPAREN )=> function | tableOrColumn | LPAREN ! expression RPAREN !);])" [12:30:38] FAILED: ParseException line 5:15 cannot recognize input near '%' 'd' 'AND' in expression specification [12:30:46] oh... [12:30:54] %d should be %s? [12:30:59] addshore: double percent is scala escaping [12:31:20] so at the end of the LIKE .... [12:31:24] you should only have 1 percent [12:31:51] oh wait *realises why copying the query into hive didn't work* hah...... [12:31:58] okay! will repackage! [12:32:06] ? [12:32:10] Can you tell me more? [12:32:11] (PS8) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [12:32:28] well, I copied it in with the %d and didn't replace them with a year month and day ;) [12:32:39] Ah, ok :) [12:32:51] But you shouldn't change the double percent in your scala [12:32:55] packaging on stat1002 directly this time :) [12:32:59] ok [12:33:25] this double % is needed for scala to escape the format [12:34:39] okay, now I'm confused, is https://gerrit.wikimedia.org/r/#/c/295896/7..8/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala right? [12:35:02] I don't think so: I was saying to remove the double % for hive :) [12:35:16] ahhh! :D [12:36:11] so why is it %d in the query and not %s ? [12:36:43] addshore: same as in java or c, year, month and are integers, so formatted as so [12:36:55] %d Decimal number (integer, base 10) ahh yes [12:39:24] addshore: currently trying your code directly in spark-shell [12:39:31] okay! [12:40:52] addshore: Actually, seems there is no data for day = 27 [12:41:07] I just went and checked that in Hive and I see some :/ [12:41:27] select * from pageview_hourly where year = 2016 and month = 06 and day = 27 limit 1; gives me a result [12:41:39] oh wait, you mean no data for the articleplaceholder page? [12:41:42] correct [12:41:55] *facepalm* I think I know what is wrong again [12:42:05] If there was no data for pageview hourly for yesterday, I wouldn't have time for you ;) [12:42:12] It should be Special:AboutTopic ..... 
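(Summarising the two fixes discussed above as a sketch: the default org.apache.spark.sql.SQLContext cannot see Hive metastore tables such as wmf.pageview_hourly, so the Hive-aware subclass is needed, and the double percent in the Scala format string escapes down to the single % wildcard that Hive expects in LIKE. The query below is illustrative only, not the exact code under review.)

    import org.apache.spark.{SparkConf, SparkContext}

    // SQLContext does not know about the Hive metastore, so wmf.pageview_hourly
    // is invisible to it; the HiveContext subclass resolves that.
    val sc = new SparkContext(new SparkConf().setAppName("ArticlePlaceholderQuerySketch"))
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val (year, month, day) = (2016, 6, 27)

    // In a Scala format string, %% emits a single literal '%', which becomes the
    // SQL LIKE wildcard, and %d formats the integer partition values. Pasting
    // this text into the hive CLI only works after substituting the %d
    // placeholders and collapsing %% to %.
    val query = """
      SELECT project, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE year = %d AND month = %d AND day = %d
        AND page_title LIKE 'Special:AboutTopic%%'
      GROUP BY project
    """.format(year, month, day)

    val results = sqlContext.sql(query).collect()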
[12:42:55] (PS9) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [12:43:39] re packaging on stat1002 now.. [12:45:29] addshore: You should test your hive queries before trying to include them in production jobs: ) [12:47:27] yup, that would also explain why nothing appeared in graphite.. [12:47:56] addshore: probably: no data = nothing to iterate over in the foreach = no data [12:49:27] hmm, but I do still get the "no such table wmf.pageview_hourly; line 4 pos 11" issue [12:49:46] addshore: try again with client mode please [12:51:08] (PS10) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [12:51:12] doing [12:53:17] no exceptions.... [12:53:39] right addshore ... I need to talk about that with my fellow ops [12:53:43] elukey: --^ [12:55:54] Can I ask for a TL;DR? [12:56:01] :) [12:56:09] otherwise I'll read the backlog [12:56:10] elukey: You surely can :) [12:56:36] elukey: seems that we have an issue with spark connecting to hive metastore when in cluster mode [12:58:02] wow elukey, I have understood it now (maybe) [12:58:22] elukey: seems related to the hive-site.xml file not being found on remote workers from spark [13:02:24] addshore: seems that you have values in graphite [13:02:51] *refreshes* [13:03:16] yup! [13:03:53] (PS11) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) [13:04:00] im guessing that was from the last client run! [13:04:22] addshore: I think so, I'm investigating the cluster thing [13:06:03] addshore: yeah, found the thing [13:06:11] hmm [13:07:23] anything that I can do? [13:08:41] elukey: not really, I'm trying a fix with spark parameters now [13:10:21] joal: I'm going to dash for lunch and potentially a haircut! I'll be back shortly! [13:10:45] addshore: np, can you paste your command before going (for me to tesT) [13:11:22] moritzm: created https://gerrit.wikimedia.org/r/#/c/296389/, not sure if there is a smarter way [13:11:27] joal: yup [13:12:02] spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode client /home/addshore/refinery-source/refinery-job/target/refinery-job-0.0.32-SNA PSHOT.jar --year 2016 --month 06 --day 25 --namespace daily.test.articleplaceholder --graphite-host graphite-in.eqiad.wmnet [13:12:51] thanks addshore [13:16:36] one comment added [13:17:07] but looks fine in general [13:17:29] super thanks! Will fix it in a minute [13:20:26] the cassandra line above uses a similar construct, maybe you can fix that along? [13:22:00] ah just seen the comment, will also amend them [13:26:26] done! https://gerrit.wikimedia.org/r/#/c/296389/4/manifests/role/aqs.pp [13:26:58] I'll review in 20 mins or so, can you make a new PCC run? [13:27:50] halfak: I guess we are not having that meeting, huh ? [13:28:02] See my meassges above. [13:28:20] hmm... maybe they didn't come through. [13:28:27] * halfak copy-pasted [13:28:27] halfak: no message :( [13:28:32] o/ joal & milimetric [13:28:35] I don't have anything for the agenda this week. 
[13:28:38] But am available to chat [13:28:41] ok cool :) [13:28:43] * halfak is waist deep in some ORES refactoring [13:28:48] While relevant to "live systems" -- probably not something that needs a meeting [13:28:52] * halfak hops out of the call [13:29:05] Last one was sent at 9 minutes past the hour [13:29:14] sorry halfak, I'm half asleep, just got back last night [13:29:21] No worries. FUn time? [13:29:31] oh yeah, great [13:30:02] halfak: as for me, didn't mean to disturb, just didn't get the pings :) [13:30:05] now the worst part starts: prioritizing between all the amazing ideas [13:31:17] hi milimetric :] [13:31:25] milimetric, yeah. this is a good problem. [13:31:30] TOo many cool things to work on :) [13:31:44] :) good but so sad :) [13:31:57] halfak: meeting not needed now I think, but it will be good to have your opinion when we'll have moved forward on data extraction :) [13:32:06] hey mforns! [13:32:25] wanna hang out in a bit when I get set up? [13:35:04] joal, cool sounds good [13:37:10] milimetric, sure! [13:58:47] addshore: for when you come back, I have found the solution for having the thing work in cluster mode (needed for next step: oozie) [14:03:31] elukey: also, can we move the ops sync meeting to a bit earlier (why not now if you want), I have my soon to care at that time (might even miss standup depending on my wife agenda) [14:07:00] joal: I am back! (although dashing about again in 25 mins)! [14:07:59] addshore: spark needs some more conf to work with hive in cluster mode [14:09:47] okay! [14:12:12] joal: I think we can skip it, we shouldn't have too many things to discuss [14:12:15] right? [14:12:36] correct, but in case you want, I'm here :) [14:12:40] elukey: --^ [14:17:15] sure :) [14:26:44] any smart analytics around please ? :) I am looking for stats about Browser version usages on our sites [14:27:06] namely percent usage of Firefox latest version versus Firefox ESR (long term release) [14:28:42] no matter eventually found https://analytics.wikimedia.org/dashboards/browsers/ :D [14:30:21] though that is one year old apparently :( [14:34:05] hashar: would you need older data? [14:38:41] elukey: na it is fine [14:38:53] I got confused by the interface, apparently it default to June 2015 for the time range [14:39:10] selecting June 1st 2016 -- June 28th 2016 got me the data I wanted :] [14:40:54] super :) [14:56:16] milimetric: I'm asking for a debrief of great new ideas in standup ;) [15:00:43] definitely, I wrote them down and I'll share [15:00:59] mforns: you wanna hang out? [15:01:06] milimetric, yes :] [15:01:17] to the batcave! [15:01:20] (I missed that :)) [15:01:26] hehe [15:11:15] okay joal back for good this time! [15:25:59] mforns: I'm going to break up https://phabricator.wikimedia.org/T134790 into subtasks now that we know what there is left to do [15:26:13] And I'll make a set for you and a set for me. (user & page) [15:27:48] milimetric, makes total sense, I also have a task in "in progress" that should be the analog of yours... [15:28:01] oh ok [15:28:57] milimetric, I can break up the user one, after you do the page one if you want [15:29:22] ok, sure, I'll ping you [15:29:25] if I can see the way yo did the page one, I can follow the naming and structure you did [15:29:48] cool, thanks! 
[15:32:02] Analytics-Kanban: Page History: sqoop tables into Hadoop and build Hive tables - https://phabricator.wikimedia.org/T138850#2412019 (Milimetric) [15:33:53] Analytics-Kanban: Page History: design algorithm to reconstruct page history - https://phabricator.wikimedia.org/T138851#2412036 (Milimetric) [15:35:25] Analytics-Kanban: General: Write deserializer for php-serialized data in mediawiki - https://phabricator.wikimedia.org/T138852#2412051 (Milimetric) [15:38:02] Analytics-Kanban: Page History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138853#2412079 (Milimetric) [15:38:30] ok mforns those are the only ones I think are in scope for this task for me ^ [15:38:42] you'll probably have another for you that deals with all the other types of events [15:38:50] milimetric, thanks a lot [15:39:09] but the other tasks we talked about, like productionizing the code and running on enwiki is for the next task I think [15:40:17] milimetric, ok [15:40:28] Analytics-Kanban: Page History: design algorithm to reconstruct page history - https://phabricator.wikimedia.org/T138851#2412036 (Milimetric) p:Triage>Normal [15:40:39] Analytics-Kanban: Page History: sqoop tables into Hadoop and build Hive tables - https://phabricator.wikimedia.org/T138850#2412019 (Milimetric) p:Triage>Normal [15:40:42] Analytics-Kanban: General: Write deserializer for php-serialized data in mediawiki - https://phabricator.wikimedia.org/T138852#2412051 (Milimetric) p:Triage>Normal [15:41:00] Analytics-Kanban: Page History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138853#2412079 (Milimetric) p:Triage>Normal [15:41:13] heh, we probably shouldn't echo these types of changes ^ :) [15:41:36] addshore: sorry didn't notice the ping [15:41:43] addshore: to me, the thing looks good :) [15:41:53] I'm still here! :) [15:41:58] cool! [15:41:58] addshore: next step will be oozie :) [15:42:11] this part will be less fun [15:42:16] awesome (I'm going to scroll back through everything here later and write a doc page) ;)( [15:42:24] sounds great ! [15:42:37] so, how do I oozie? ;) [15:42:40] milimetric, I will rename my task that is already in progress and make it a parent task [15:42:47] Here is a command that worked for me : [15:42:48] spark-submit --class org.wikimedia.analytics.refinery.job.WikidataArticlePlaceholderMetrics --master yarn --deploy-mode cluster --jars /usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/hive/lib/datanucleus-core-3.2.10.jar,/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar --files /usr/lib/hive/conf/hive-site.xml /home/addshore/refinery-source/refinery-job/target/refinery [15:42:55] -job-0.0.32-SNAPSHOT.jar --year 2016 --month 06 --day 23 --namespace daily.test.articleplaceholder --graphite-host graphite-in.eqiad.wmnet [15:43:12] addshore: You can notice the --jars and --files options [15:43:25] yup [15:43:29] addshore: easiest is to take example [15:43:43] folder is different: refinery (not refinery-source) [15:44:06] then use oozie/restbase as a starting poitn [15:44:23] addshore: I think we have doc on oozie on wikitech, let me have a look [15:44:40] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie [15:45:34] oooh, a lovely big page! :D [15:45:59] addshore: :D [15:46:08] elukey: here? 
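(The --namespace and --graphite-host parameters in the commands above are where the job's counts end up. Below is a minimal sketch of that reporting step over Graphite's plaintext protocol; port 2003 and the hand-rolled sender are assumptions for illustration, not necessarily how refinery-job actually submits metrics.)

    import java.io.PrintWriter
    import java.net.Socket

    object GraphiteSketch {
      // Send one data point using Graphite's plaintext protocol:
      // "<metric.path> <value> <unix-timestamp>\n" written to TCP port 2003.
      // The port and this ad-hoc sender are assumptions; the real job may use
      // a client library instead.
      def send(host: String, metricPath: String, value: Long, timestamp: Long): Unit = {
        val socket = new Socket(host, 2003)
        val out = new PrintWriter(socket.getOutputStream, true)
        try {
          out.println(s"$metricPath $value $timestamp")
        } finally {
          out.close()
          socket.close()
        }
      }
    }

    // Hypothetical use after collect() has brought the counts back to the driver
    // (this is why deploy-mode matters: in client mode this final step runs on
    // stat1002, in cluster mode on whichever hadoop node hosts the application
    // master):
    //   GraphiteSketch.send("graphite-in.eqiad.wmnet",
    //                       "daily.test.articleplaceholder.enwiki", 42L,
    //                       System.currentTimeMillis / 1000)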
[15:51:51] joal: here I am sorry, was writing into a phab task [15:51:56] np elukey [15:52:03] will discuss after standup [15:57:46] (PS1) Addshore: DRAFT Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 [15:57:51] joal: ^^ [15:58:32] purely taken from the restbase example with things changed (is there a way to test this)? [15:58:57] addshore: this reminds me a question as well: in graphite, why not use daily.wikidata.... ? [15:59:22] I could! [15:59:23] Analytics-Kanban: User history in hadoop - https://phabricator.wikimedia.org/T134793#2412191 (mforns) p:Triage>Normal [15:59:40] addshore: in metrics, there seem to be some exisitng with the prefix [15:59:46] It's an interesting one, as none of this data actually comes from wikidata [15:59:52] :) [16:00:01] but an extension that uses wikidata, and all data is from other sites [16:02:38] Analytics-Backlog, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: More solid Eventlogging alarms for raw/validated {oryx} - https://phabricator.wikimedia.org/T116035#2412215 (Milimetric) [16:26:32] Analytics-Kanban: General: sqoop tables into Hadoop and build Hive tables - https://phabricator.wikimedia.org/T138850#2412261 (mforns) [16:27:17] Analytics-Kanban: General: sqoop tables into Hadoop and build Hive tables - https://phabricator.wikimedia.org/T138850#2412019 (mforns) I changed to General, instead of Page History, because it actually serves both page history and user history. Please, change back if you think it was better before. [16:29:31] Analytics-Kanban: User History: design algorithm to reconstruct page history - https://phabricator.wikimedia.org/T138859#2412267 (mforns) [16:34:43] going offline! bye!! o/ [16:35:46] Analytics-Kanban: User History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138861#2412308 (mforns) [16:37:06] Analytics-Kanban: User history in hadoop - https://phabricator.wikimedia.org/T134793#2412322 (mforns) [16:37:08] Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for small wiki - https://phabricator.wikimedia.org/T134790#2412324 (mforns) [16:37:54] Analytics-Kanban: User History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138861#2412308 (mforns) [16:37:56] Analytics-Kanban: User History: design algorithm to reconstruct page history - https://phabricator.wikimedia.org/T138859#2412327 (mforns) [16:37:58] Analytics-Kanban, Patch-For-Review: Extract edit oriented data from MySQL for small wiki - https://phabricator.wikimedia.org/T134790#2277147 (mforns) [18:55:46] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2412626 (Nuria) Any updates? [19:11:28] wikimedia/mediawiki-extensions-EventLogging#566 (wmf/1.28.0-wmf.8 - 4e4ebf1 : Mukunda Modell): The build has errored. [19:11:28] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/4e4ebf1f4b9e [19:11:28] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/140891141 [20:06:51] Quarry: Queries running for more than 4 hours and not killed - https://phabricator.wikimedia.org/T137517#2412801 (Dvorapa) Look on the screenshot. There are queries marked as running under queries marked as completed. This looks broken at least [20:47:53] bye a-team! 
see you tomorrow [20:50:30] hallo [20:50:44] as you may have heard, I've been collecting interlanguage links clicks data [20:51:01] there's something extra I'd like to check [20:51:19] to correlate them with the number of site visits on each day [20:51:32] what's a good way to do that? [20:54:06] a-team ^ [20:55:20] aharoni: the pageview api has that data, and you can get it on the pageview tool on tool labs [20:56:27] aharoni: http://tools.wmflabs.org/pageviews [20:56:45] Click on the project level views at the bottom [20:57:17] If you want it on the cluster, the data is in the projectview_hourly table [21:00:25] Oh, nice [21:01:26] Thanks milimetric
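(For correlating the click counts with daily site visits on the cluster, the same HiveContext approach used earlier in the log works against projectview_hourly; a rough sketch follows, with the column names assumed rather than verified.)

    // Daily total views per project for one day, summed over the 24 hourly rows.
    // The column names (project, view_count) are assumptions here; check
    // DESCRIBE wmf.projectview_hourly before relying on them. sqlContext must
    // be Hive-aware (a HiveContext), as discussed earlier in the log.
    val dailyViews = sqlContext.sql("""
      SELECT project, SUM(view_count) AS daily_views
      FROM wmf.projectview_hourly
      WHERE year = 2016 AND month = 6 AND day = 27
      GROUP BY project
    """).collect()

    dailyViews.foreach(println)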