[00:25:40] odd, actually its not the map or the reduce stage thats failing, its the actual container that manages the hive operation. [00:26:34] i have to dig around to find what actually sets the memory for that ... [01:25:14] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3063363 (10Pchelolo) I've put a very WIP solutions that uses change-prop here: https://github.com/wikimedia/change-propagation/pull/165 The solution re-uses the ro... [01:39:53] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3063406 (10Nuria) >The webrequests_text events conform to this schema: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema No, they do not... [05:49:57] joal: so hate to bother you, but if you have any ideas: Since the upgrade getting hive jobs killed for running over memory. https://hue.wikimedia.org/jobbrowser/jobs/job_1488294419903_2065/job_attempt_logs/0 is an example, check syslog and look for beyond to see container_e42_1488294419903_2065_01_000004 killed for using 1.1GB of 1GB physical memory used. that container is [05:50:03] https://hue.wikimedia.org/jobbrowser/jobs/job_1488294419903_2065/tasks/task_1488294419903_2065_m_000000/attempts/attempt_1488294419903_2065_m_000000_0/logs [05:50:55] (stdout confirms). afaict that is the oozie driver (not a mapper or reducer, but the orchestration container). I can't figure out what config param sets that 1GB memory size. i thought perhaps oozie.launcher.yarn.app.mapreduce.am.resource.mb, but changing that had no change [05:51:06] * ebernhardson also knows you wont be here for a couple hours, but leaving it here for later :) [06:57:44] thanks ebernhardson! [06:57:55] I am seeing other oozie jobs killed too.. [06:59:35] !log restarted manually via Hue UI the webrequest-load-coord-maps failed jobs [06:59:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:06:25] !log restarted manually via Hue UI the webrequest-load-coord-misc failed jobs [07:06:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:07:10] only a test, let's see how it goes.. it keeps saying that we are using al old workflow version, it might be resolvable if we kill/restart the bundles? [07:09:22] !log restarted manually the pageview-druid-monthly-coord (february job failed) [07:09:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:09:46] 10Analytics, 10ChangeProp, 10EventBus, 06Revision-Scoring-As-A-Service, and 3 others: Create generalized "precache" endpoint for ORES - https://phabricator.wikimedia.org/T148714#3063692 (10Ladsgroup) The endpoint is being implemented https://github.com/wiki-ai/ores/pull/189. Once this gets merged and deplo... 
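A quick way to confirm from the CLI that it really is the launcher container hitting the limit is to grep the aggregated YARN logs for the diagnostics line ebernhardson quotes above. A rough sketch, assuming log aggregation is enabled and that the application id mirrors the job id:

    # pull the aggregated logs for the whole application and keep only the OOM diagnostics
    sudo -u hdfs yarn logs -applicationId application_1488294419903_2065 \
        | grep -B 2 'beyond physical memory limits'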
[07:13:38] !log restarted manually the pageview-hourly-coord failed jobs [07:13:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:17:38] !log restarted manually the browser-general-coord failed jobs [07:17:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:17:53] turns out i ignored the most obvious parameter, upping ooozie.launcher.mapreduce.map.memory.mb fixed the problem [07:18:40] i thought that controlled something different, a job that was only the oozie launcher but not the hive driver, but increasing the memory there has let all my failed jobs rerun succesfully [07:19:05] nice :) [07:19:16] but you didn't have this issue before the upgrade right? [07:19:25] (just to start collecting feedbacks) [07:19:28] right, this job has been running without errors for a month or two [07:19:44] and then today i had ~6 failures [07:22:28] weird, load jobs failing again [07:30:33] druid monthly fails for "FAILED: IllegalArgumentException java.net.URISyntaxException: Illegal character in scheme name at index 0: >hdfs://analytics-hadoop/tmp/0000936-170228165458841-oozie-oozi-W-monthly-druid-pageviews-2017-2" [07:36:43] meanwhile misc/maps seems failing in the generate_sequence_statistics step (got the info from the oozie cli) [07:37:02] sounds different from my errors, they were exiting with exit code 143 [07:38:08] yeah [07:38:17] there is something subtle going on :D [07:38:37] to be honest I was expecting it, no big upgrade brings little issues :D [07:39:08] (better: every big upgrade brings some issues) [07:40:39] mmm can't find logs for the generate_sequence_statistics step [07:40:42] grrr [07:48:11] for example: LauncherMapper died, check Hadoop LOG for job [resourcemanager.analytics.eqiad.wmnet:8032:job_1488294419903_2368] [07:48:36] but that job id is not retrievable with oozie -log job_1488294419903_2368 [07:49:26] elukey: you may need to look at mapred logs [07:49:37] may be [07:49:42] madhuvishy: hellooooooo [07:49:43] o/ [07:49:46] hiii [07:50:18] yeah not sure how! Hue does not show them to me since it complains that the job used an old workflow (we upgraded it) [07:50:33] surely joal will come online and resolve the problem in 5 mins [07:50:38] ahahha [07:51:03] ha ha - hue link? i may not remember enough but can look [07:51:19] https://hue.wikimedia.org/oozie/list_oozie_workflow/0000956-170228165458841-oozie-oozi-W/?coordinator_job_id=0000144-170209095235657-oozie-oozi-C&bundle_job_id=0000143-170209095235657-oozie-oozi-B [07:51:45] but you are right, maybe I can try with hadoop logs somewhere? [07:51:49] never used the cli [07:51:51] checking [07:52:03] mapred logs --applicationId or something like that [07:56:15] madhuvishy: how are you??? [07:57:58] sudo -u hdfs mapred job -logs job_1488294419903_2368 - nice!! [07:58:14] yes that ^ :) [07:58:18] elukey: I'm good! [07:58:22] 2017-03-01 07:27:06,622 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1488294419903_2368_m_000000_3: Container [pid=25625,containerID=container_e42_1488294419903_2368_01_000006] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 5.0 GB of 2.1 GB virtual memory used. Kil [07:58:25] in portland for an acm conference [07:58:28] ling container. [07:58:30] ebernhardson: ---^ [07:58:51] madhuvishy: ahhh I saw the update in the holiday section! 
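ebernhardson's fix is to raise the memory of the map container that hosts the Oozie launcher. A minimal sketch of what that override can look like; the 2048 value and the properties file name are illustrative, not the exact change he made:

    # one-off override at (re)submission time
    oozie job -config job.properties \
        -Doozie.launcher.mapreduce.map.memory.mb=2048 \
        -run
    # or persist the same property in the job's .properties / workflow configuration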
Have a good time in portland :) Never been there, only in SEA [07:59:50] elukey: :) next time then! it's great [08:04:17] yeah I keep seeing " is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 5.0 GB of 2.1 GB virtual memory used. Killing container." [08:04:23] joal: ---^ [08:05:35] * elukey brb [08:31:08] not sure what changed and why we are seeing this issue only now [08:31:24] a quick workaround is to add the Xmx config to the oozie workflow [08:31:29] but it seems not very clean [08:41:42] I need to step away ~1hour to bring my cat to the vet, will brb asap [08:42:10] not sure what is the best course of action now, need more feedback (and confirmations!) before proceeding [09:17:29] Hi a-team [09:17:35] Hi ebernhardson [09:17:53] ebernhardson: I'm glad you found the oozie launcher option :) [09:18:34] a-team: We're gonna need to do the same ebernhardson did: bump oozie launcher memory [09:23:27] joal: o/ [09:23:33] Hi elukey [09:23:45] Was waiting for you to dicuss actions for our dear oozie [09:24:15] is there a preferred place where the new oozie option should reside? [09:24:30] afaics we can put it in the workflow config but it is not that great [09:24:39] not sure if there is a cdh option though [09:24:49] to control JAVA_OPTS [09:26:00] elukey: this config is already set in jobs workflow.xml files if I don't mistake [09:26:06] ah nice [09:26:18] so we just need to change/deploy/kill/start? [09:26:19] elukey: that means we can override it using a -D setting while launching job [09:26:31] elukey: But, errors still look weird [09:27:20] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:27:23] the weird thing is that 1GB is really low for Xmx, we set more than that.. afaics oozie launcher does not use hadoop's config [09:28:16] elukey: what you call hadoop config is the default setting for containers, that is overridable (and usually overwritten) by jobs [09:28:20] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [09:28:41] elukey@analytics1039:~$ ./check_hadoop_yarn_node_state [09:28:41] WARNING: YARN NodeManager Node-State: UNHEALTHY [09:28:55] weird.. [09:29:14] joal: yes sorry for the lack of precision :) [09:29:42] elukey: oozie uses a special yarn container to manage each of its jobs (the oozie launcher container) [09:30:02] elukey: this container memory limits are set using oozie_launcher_memory property [09:30:31] joal: in the workflow files as property or somewhere else? [09:31:43] elukey: the parameter is defined by default in workflow files (to be sure it is set), and used in hive actions with oozie.launcher.mapreduce.map.memory.mb [09:32:24] joal: so do we need to do a refinery change? 
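The analytics1039 PROBLEM/RECOVERY flaps above come from the NodeManager health check; the same state can be read cluster-wide with stock YARN commands (nothing refinery-specific), for example:

    # list every NodeManager, including UNHEALTHY/LOST ones, with its health report
    yarn node -list -all
    # or narrow it down to the unhealthy ones
    yarn node -list -states UNHEALTHY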
[09:32:37] (I am trying to get where we should put the new Xmx) [09:33:03] elukey: for testing, we can restart a job (i suggest load-maps) overriding the parameter manualkly [09:33:17] elukey: If it does the trick, let's bump the default value everywhereb [09:33:27] I like it [09:33:42] elukey: trying so [09:39:09] !log Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00) [09:39:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:39:38] elukey: new maps job: 0001114-170228165458841-oozie-oozi-C [09:42:36] elukey: for jobs in oozie, looks like we are hitting that: https://issues.cloudera.org/browse/HUE-5419 [09:45:02] mmm I didn't see it [09:45:04] but might be [10:04:16] Reportcards http://reportcard.wmflabs.org/ hasn't been updated in a while, is there any other place where one can find monthly pageviews for all projects? [10:07:16] Ainali: sure there is! https://analytics.wikimedia.org/dashboards/vital-signs [10:08:44] elukey: Is there a quick way to get a total or do I need to manually select all 800+ wikis there? [10:10:44] never done it, not sure if there is the option [10:11:03] I found this in which I only need to make 10 additions to get a total: https://stats.wikimedia.org/EN/ProjectTrendsPageviews.html [10:11:14] But at least it gives a total for all wikipedias [10:11:47] are you interested on raw pageview numbers only or unique visitors? [10:13:26] Pageviews mostly, but if visitors is the only thing available, I'll take it [10:14:07] joal: 0001130-170228165458841-oozie-oozi-W failed for the same reason.. [10:14:27] Ainali: nono we also have uniques, I wanted to ask :) [10:14:41] elukey: quick question - how do you get the logs? yarn logs? [10:14:46] And really it is not the absolute numbers that I am interested in, but monthly trends [10:15:14] joal: oozie job -info 0001130-170228165458841-oozie-oozi-W && sudo -u hdfs mapred job -logs job_1488294419903_2877 [10:15:41] makes sense elukey, thanks [10:15:48] elukey: will try with 2G mem [10:15:59] joal: where are you putting the setting? [10:16:02] (curious) [10:16:10] elukey: command line [10:16:21] ahhh ok [10:16:46] joal: what threshold you used for the last attempt? [10:16:52] !log Kill and restart webrequest-load-maps coordinator checking for new oozie_loader_memory parameter (starting from 2017-02-28T18:00 - 2g launcher memory) [10:16:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:17:00] elukey: used 1G [10:17:14] elukey: default setting in workflow file says 256 [10:17:22] last time used 1024 [10:17:31] Now tried 2048 [10:17:37] elukey: 0001159-170228165458841-oozie-oozi-C [10:18:11] When I present the trends for the pageviews for a small subset of articles (around 1600 in ~70 projects) it would be interesting to compare that with the overall trend [10:18:52] joal: mmm because in the logs it says that it was crossing the 1GB boundary even before [10:19:02] elukey: weird :( [10:20:02] Ainali: ah got that.. I would reccomend to follow up with milimetric on this, he should be online in the early afternoon [10:20:52] elukey: just double checked: memory setting is correctly passed to worker in hadoop [10:20:59] elukey: now will it be enough ... [10:21:37] joal: also we cap the containers to 2GB IIRC [10:21:46] elukey: false [10:22:09] elukey: I can run spark jobs with 1 yarn container being 32Gb memory [10:22:15] elukey: Thanks for the tip! 
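For the kind of question Ainali is asking, the Pageview API also exposes project-wide aggregates over REST. A sketch; the exact path shape is from memory, so treat it as an assumption, and daily counts can be summed to get monthly trends:

    # total pageviews across all projects, user traffic only, one number per day
    curl -s 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/user/daily/2017010100/2017030100'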
[10:22:43] joal: oh ok, so you can override it if you wish? [10:23:01] elukey: 2Gb is the DEFAULT value for a yarn container [10:23:10] joal: okok thanks :) [10:23:19] elukey: I even thing that default value for reducer is overriden to 4Gb [10:23:48] yup [10:23:49] # Reduce container size and JVM max heap size (-Xmx) [10:23:49] cdh::hadoop::mapreduce_reduce_memory_mb: 4096 # 2 * 2G [10:23:52] cdh::hadoop::mapreduce_reduce_java_opts: -Xmx3276m # 0.8 * 2 * 2G [10:24:20] ooooook now it is clearer [10:24:22] elukey: One day I'll need to forget all those memory settings ;) [10:24:51] joal: beers help!! :D [10:25:07] hehehe, I prefer wine, but globally the same :) [10:25:45] me too! [10:25:57] red preferably [10:26:38] especially everything that comes from https://en.wikipedia.org/wiki/Valpolicella [10:29:49] :D [10:32:20] joal: 0001162-170228165458841-oozie-oozi-W is running refine \o/ [10:33:16] elukey: Yay ! [10:33:24] elukey: hue bug is really a pain :( [10:34:30] elukey: the thing I don't understand is how the heck are text and upload jobs not failing ???? [10:35:36] joal: sorry I am not getting what is the issue.. [10:36:14] well elukey, if oozie webrequest-load jobs fail because of memory issue in the launcher, I'd expect all of them to fail [10:36:22] and so far, only some of them do [10:36:33] Even worse: some of them NEVER fail [10:37:09] * joal is full of perplexitudeness [10:38:06] ah yes that part is also a mistery to me [10:38:22] the other weirdness is getting "This workflow was imported from an old Hue version, save it to create a copy in the new format or open it in the old editor." [10:38:35] that maybe resolved killing/starting again? [10:38:51] (I checked Coordinator webrequest-load-coord-text : Workflow webrequest-load-wf-text-2017-3-1-8) [10:39:09] in the meantime 0001162-170228165458841-oozie-oozi-W completed [10:39:22] so I vote to apply it everywhere [10:39:56] elukey: Do you mind if we wait for at least a few of them to finish first? [10:40:07] sure sure [10:40:36] but I am pretty sure that we have resolved the issue (especially after the report of ebernha*dson ) [10:41:37] elukey: I think so as well - but paranoid is not only for you ;) [10:42:08] joal: you are completely right, no rush [10:45:13] it would be wise also to check periodically https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?from=now-2d&to=now [10:45:20] something has changed but nothing dramatic [10:45:25] (in GC trends) [10:59:36] git st [10:59:39] oops :) [11:02:13] 06Analytics-Kanban: Bump default oozie launcher memory usage - https://phabricator.wikimedia.org/T159324#3064080 (10JAllemandou) [11:02:55] (03PS1) 10Joal: Bump default oozie launcher memory usage to 2048 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340489 (https://phabricator.wikimedia.org/T159324) [11:03:00] elukey: --^ [11:03:03] if you have a minute [11:04:04] checking! [11:05:29] (03CR) 10Elukey: [C: 032] Bump default oozie launcher memory usage to 2048 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340489 (https://phabricator.wikimedia.org/T159324) (owner: 10Joal) [11:05:45] elukey: I think I have not forgotten any ... [11:07:25] joal: I checked with WikimediaSource/refinery grep -rni "256" oozie/ [11:07:36] it matches what you sent.. [11:08:30] also grep -rni "oozie_launcher_memory" -A 1 oozie/ seems sound [11:31:00] joal: Moritz upgraded nginx on the host in which archiva runs.. it seems fine to me after a quick check but let me know if you'll see any issue [11:31:10] elukey: sure ! 
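The puppet values quoted above follow the usual convention of giving the JVM roughly 80% of the YARN container and leaving the rest for off-heap overhead; a trivial illustration of the arithmetic:

    # container size in MB -> suggested -Xmx at ~0.8 of it
    CONTAINER_MB=4096
    HEAP_MB=$(( CONTAINER_MB * 8 / 10 ))   # 3276, matching mapreduce_reduce_java_opts above
    echo "-Xmx${HEAP_MB}m"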
[11:31:15] elukey: Thanks for the ping [11:34:28] elukey: do we wait for ottomata approval before restarting jobs, or do we move forward? [11:37:54] joal: I would vote to proceed, maybe we restart only the ones that have been failing to test them carefully [11:38:04] (before applying it everywhere else) [11:38:08] wdyt? [11:38:21] there is not much that we can really do otherwise [11:38:26] and I believe this is the correct fix [11:41:49] elukey: my idea would have been to relauch every job using our restart script [11:42:03] elukey: I'm going to dryrun it, see what it would do [11:45:08] do we have a restart script?? [11:45:12] :O [11:49:00] elukey: refinery/bin/refinery-oozie-rerun [11:49:19] 10Analytics, 10Analytics-EventLogging, 07Russian-Sites: Add ops-reportcard dashboard with analysis that shows the http to https slowdown on russian wikipedia - https://phabricator.wikimedia.org/T87604#3064166 (10MarcoAurelio) [11:49:20] elukey: it misses a config param overwriting option, but I can do that [11:49:49] and actually elukey, trying the script anew told me we have jobs we forgot to starst :) [11:50:37] 10Analytics, 10Analytics-EventLogging, 07Russian-Sites: Add ops-reportcard dashboard with analysis that shows the http to https slowdown on russian wikipedia - https://phabricator.wikimedia.org/T87604#3064177 (10MarcoAurelio) [11:53:07] :) [11:53:36] elukey: we need to wait for mforns, but the 2 jobs he wrote for banners are not started ;) [12:46:35] 06Analytics-Kanban, 13Patch-For-Review: Bump default oozie launcher memory usage - https://phabricator.wikimedia.org/T159324#3064283 (10JAllemandou) a:03JAllemandou [12:46:58] (03PS1) 10Joal: Add configuration update to ozzie rerun script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340498 (https://phabricator.wikimedia.org/T159324) [12:47:03] elukey: --^ if you have a minute [12:51:39] (03PS2) 10Joal: Add conf change option to oozie rerun script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340498 (https://phabricator.wikimedia.org/T159324) [12:57:40] elukey: oozie is in such a meesy state that we''l have to rerun jobs manually I think :( [13:12:56] joal: here I am sorry [13:13:08] (I was having lunhc) [13:16:43] joal: what is the plan? fixing the script and then restart the jobs? [13:31:49] (03CR) 10Elukey: [C: 031] "Checked argument positions and naming, code looks good to me. One minor nit that is not important." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340498 (https://phabricator.wikimedia.org/T159324) (owner: 10Joal) [13:32:16] as far as I can see the code looks good [13:58:30] elukey: it was turn to be gone :) [13:59:11] elukey: I think best would be to indeed merge both, use the script to relanuch all jobs, and then fix single failed ones [13:59:17] (03PS3) 10Joal: Add conf change option to oozie rerun script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340498 (https://phabricator.wikimedia.org/T159324) [13:59:21] elukey: does it make sense? [14:02:55] no halfak nor milimetric :( [14:03:23] joal: yep it does [14:04:09] elukey: the script realaunches with start date equals to last not run one - so we'd have to manually go over failed ones, but I still think it's better to relaunch everything automagically :) [14:04:26] yes :) [14:07:56] HI! [14:08:00] just read backscroll [14:08:02] yall are the best [14:08:13] sounds really tricky. lemme make sure I understand: [14:08:19] some oozie launchers OOM [14:08:20] but not all [14:08:27] and we don't know why some OOM but not others? 
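For context, the idea behind refinery/bin/refinery-oozie-rerun is to resubmit every running coordinator/bundle with its stored configuration. The option names below are hypothetical (the config-override flag is exactly what joal's follow-up patch adds), so treat this as a sketch rather than the script's real interface:

    # hypothetical: preview what would be restarted, then rerun with the bumped launcher memory
    refinery-oozie-rerun --dry-run
    refinery-oozie-rerun --config-override oozie_launcher_memory=2048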
[14:08:35] correct ottomata, that's current status [14:10:26] is this also probably why those couple of maps/misc jobs failed yesterday? [14:10:31] not the camus weirdness hypothesis? [14:11:15] ottomata: yessir ! [14:12:03] o/ [14:12:15] it is also true that 256Mb are really low for the oozie launcher [14:12:41] ya, i remember a while ago joal and I messed with it, and added the option into the properties files [14:12:48] but we reset it to 256 because that was the default [14:12:53] :) [14:12:56] and it turned out to not be our problem [14:13:04] don't remember what was going on at that time...do you remember that joal? [14:15:17] i don't ottomata [14:16:06] ottomata: Since yo're here now, do you mind having a quick look at the patches I suggested? [14:16:16] Then idea would be to move on and fix :) [14:16:20] i see the one that bumps mem up to 2048 [14:16:31] i can merge that [14:16:34] what other one? [14:16:53] this one: https://gerrit.wikimedia.org/r/340498 [14:17:06] Would allow to automagically relaunch all jobs with new conf [14:17:54] (03CR) 10Ottomata: [V: 032] Bump default oozie launcher memory usage to 2048 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340489 (https://phabricator.wikimedia.org/T159324) (owner: 10Joal) [14:20:39] joal: i don't even remember this awesome script! it grabs job def json from oozie server, and then mangles the actual submitted xml job conf, instead of using .properties to fill stuff in like we usually do on the CLI? [14:21:07] (03CR) 10Ottomata: [V: 032 C: 032] Add conf change option to oozie rerun script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340498 (https://phabricator.wikimedia.org/T159324) (owner: 10Joal) [14:22:03] almost that ottomata - doesn't use CLI, uses oozie API to get/submit job info [14:22:16] right [14:22:22] oozie server to get the json/xml [14:22:25] of the job def [14:22:36] ottomata: wrote that after the kafka upgrade mess - I sweared to myself never to relaunch ALL jobs at once manually :) [14:24:08] hah yea [14:24:23] elukey: moving forward ? [14:24:50] yep [14:25:10] at this point we should deploy the refinery [14:25:11] elukey: can you deloy refinery ? I'm taking notes on jobs to restart manually [14:25:15] okok [14:26:46] rsync: write failed on "/srv/deployment/analytics/refinery-cache/revs/33db287d98335c08017a94014252852c99983b2e/.git/fat/objects/fbaef2dcac5abbc2a6dcc10b21408e3c358e97e9": No space left on device (28) [14:26:50] ahhahaha [14:26:58] :) [14:27:36] hahah [14:27:43] on which host?! [14:27:48] 1002/ [14:27:49] ? [14:29:42] yeah I am deleting some scap revs [14:32:05] joal: all good, refinery deployed [14:34:43] elukey: awesome - got the list of stuff to restart [14:35:06] ottomata, elukey: Weird though, some jobs have sent emails and are marked as ok in hue :S [14:37:12] ??? [14:37:53] I think they might have been relaunched manually, but it's weord [14:38:09] I relaunched a lot of them early this morning, this might explain it? [14:38:18] elukey: It would :) [14:38:30] elukey: The weird thing is that sometimes it doesn't fial :S [14:38:32] How weird [14:38:47] Anyway, proceeeeeeeeding! Fire a da hole ! [14:38:50] !log Restart all hdfs oozie jobs with 2048M launcher memory (using script) [14:38:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:39:35] elukey: quick check: have you deployed onto hdfs? 
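As ottomata and joal describe, the script talks to the Oozie server's web services API rather than the CLI; roughly, fetching a job's info and stored definition looks like this (host and port are illustrative):

    # job status and submitted configuration, as JSON
    curl -s 'http://an-oozie-host:11000/oozie/v1/job/0000956-170228165458841-oozie-oozi-W?show=info'
    # the workflow definition XML as stored by the server
    curl -s 'http://an-oozie-host:11000/oozie/v1/job/0000956-170228165458841-oozie-oozi-W?show=definition'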
[14:40:41] I'm assuming no, I'll do it [14:41:46] !log Deploying refinery onto hdfs (before restarting jobs) [14:41:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:42:56] ah no sorry I forgot! [14:59:52] joal: I'm sorry I got my days confused, I thought yesterday was today and we skipped the meeting, I ran some errands this morning [15:00:06] np milimetric :) [15:00:17] milimetric: prop issues to solve anyway :) [15:01:04] joal: let me know if you need help, don't want to step on your shoes [15:01:49] triple checking stuff elukey - Seems ok for most, one issue to investigate [15:03:13] elukey: typos in one job :( [15:09:09] okok, no rush, just wanted to help :) [15:10:55] ottomata, elukey: Further investigation on pageview failures: all errors happened at whitelist_check stage [15:14:45] (03PS1) 10Joal: Fix typos in monthly job loading pageview in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340510 [15:15:09] joal: it is strange that oozie launcher would die as a result of some other part [15:15:12] when i looked at that map job [15:15:20] i didn't see any particular errors in the actually yarn app [15:15:23] just that it had died [15:15:34] ottomata: I came to the same conclusion [15:16:17] ottomata: this gives a different log: https://yarn.wikimedia.org/jobhistory/attempts/job_1488294419903_1135/m/FAILED [15:17:18] ottomata/ elukey: Do you mind quick merging 2 things --> https://gerrit.wikimedia.org/r/#/c/340510/, https://gerrit.wikimedia.org/r/#/c/339661/ [15:20:49] ottomata, elukey: Even weirder stuff: restbase-coord - not using hive, only spark - 3 failures shown in hue UI, only one in hadoop jobs ! [15:20:52] How bad [15:21:52] (03CR) 10Elukey: [V: 032 C: 032] Fix typos in monthly job loading pageview in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340510 (owner: 10Joal) [15:22:42] (03CR) 10Elukey: [V: 032 C: 032] Correct webrequest comments [analytics/refinery] - 10https://gerrit.wikimedia.org/r/339661 (https://phabricator.wikimedia.org/T157951) (owner: 10Joal) [15:22:48] Thanks elukey [15:23:02] joal: shall I deploy? [15:23:33] If you don't mind, please go (+hdfs ;) [15:23:37] sure [15:24:25] thanks elukey [15:24:42] ok, got some answers to my weirdnesses [15:25:30] huh that's the oozie launcher [15:25:44] ok, sooooo, def OOM, right? that's where we get that from? 
[15:25:58] ottomata: nothing better [15:25:58] maybe just some OOM for some reason, and they just happen to take a similiar amount of time to do it [15:26:05] so they die at about the same way through the job [15:28:55] ottomata: possible [15:30:33] refinery deployed [15:35:18] (03PS1) 10Joal: Bump jar version of oozie restbase metrics job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340512 [15:36:15] 10Analytics, 10Analytics-Cluster: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337#3064622 (10Ottomata) [15:36:22] 10Analytics, 10Analytics-Cluster: Automate refinery jar cleanup - https://phabricator.wikimedia.org/T159337#3064635 (10Ottomata) p:05Triage>03Low [15:36:53] (03CR) 10Ottomata: [V: 032 C: 032] Bump jar version of oozie restbase metrics job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340512 (owner: 10Joal) [15:43:25] git up [15:43:28] oop [15:46:34] y [16:00:57] a-team: standddupppp [16:13:49] 10Analytics, 10Analytics-EventLogging: Move eventlogging backend to hadoop - https://phabricator.wikimedia.org/T159170#3064701 (10Milimetric) @Tbayer I think you, @JKatzWMF, @Neil_P._Quinn_WMF, and @mpopov raise good points that we definitely want to address in this work. I'll try to catalog them here and I a... [16:29:22] 10Analytics, 10RESTBase, 06Services: REST API entry point web request statistics at the Varnish level - https://phabricator.wikimedia.org/T122245#3064767 (10Nuria) Sorry, forgot to include raw data table schema: col_name data_type comment hostname string from deserializer sequenc... [16:30:20] 10Analytics, 10Recommendation-API: productionize recommendation vectors - https://phabricator.wikimedia.org/T158973#3064784 (10Fjalapeno) Thanks @leila! [16:31:40] milimetric: Error in Piwik (tracker): An unexpected website was found in the request: website id was set to '6' ., referer: https://analytics.wikimedia.org/dashboards/vital-signs/ [16:32:22] hm... [16:32:25] that's new [16:34:13] milimetric, elukey : with piwik upgrade php code changed [16:34:37] milimetric, elukey : we just need to update our links to piwik to be in the new form [16:34:37] nuria: we didn't upgrade it recently right? [16:34:57] elukey: mmm.. we must have [16:35:11] no, the code I was looking at was the saem [16:35:14] *same [16:35:31] mmmm, no wait, our prior link: [16:35:34] couple of small differences in the JS code [16:36:15] and now I can also see Error in Piwik (tracker): SQLSTATE[HY000] [2002] Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (11) [16:36:18] hahahah [16:36:35] so probably we fixed varnish and now piwik is failing again [16:36:40] ah sorry you are right, no tracking code is correct [16:42:29] but joal, we had spark 1.6.0 before, no? [16:42:35] oh no [16:42:36] it was 1.5 [16:42:40] yessir [16:42:56] We were using external shuffle though [16:43:02] so, bizare [16:43:03] (03PS2) 10Milimetric: Clean up Config [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/340375 [16:43:28] joal: do we have a support spark 2.x ticket yet? [16:43:33] if not, let's sneak it in for tasking :p [16:43:44] we d oott [16:43:47] we do [16:43:51] sneak it in! 
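One cheap way to check whether the tracker really rejects site id 6 (the error milimetric and elukey are chasing above) is to fire a single hit at Piwik's tracking endpoint by hand; the host name is illustrative and the query parameters are the standard Piwik tracking API ones:

    # a manual test hit against site 6; a healthy tracker answers HTTP 200 with a 1x1 gif
    curl -s -o /dev/null -w '%{http_code}\n' \
      'https://piwik.example.org/piwik.php?idsite=6&rec=1&url=https%3A%2F%2Fanalytics.wikimedia.org%2Fdashboards%2Fvital-signs%2F'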
[16:45:28] done ottomata, top column in tasking [16:46:58] (03PS1) 10Milimetric: Update piwik tracking [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/340526 [16:47:12] (03CR) 10Milimetric: [V: 032 C: 032] Update piwik tracking [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/340526 (owner: 10Milimetric) [17:21:16] joal: you want me to stand in at scrum of scrums? [17:21:36] I owe you and Marcel one from last week? [17:25:53] milimetric: last error on bohrium seems to be at [Wed Mar 01 16:43:18.563464 2017] UTC [17:26:00] that is more or less when you merged right [17:26:01] ? [17:26:35] elukey: yep, it gets deployed by puppet so yeah, I merged beginning of tasking [17:27:15] I just checked and it's not throwing it any more. It's nice to know piwik throws an error in that case [17:27:34] super [17:27:39] thx much for checking [17:27:54] now I need to figure out what explodes once in a while :D [17:30:18] hola nuria et al. [17:30:27] lzia: hi [17:31:18] nuria: do you think it makes sense to start an excel sheet for the headcounts of AP? it's very hard to see who is requesting how much of what team [17:31:45] and people change 0.01 to 0.05 and this is not tracked and reflected, nuria. [17:40:57] lzia: mmm. no, 0.01 is too detailed, probably should not be there at all, anything less than 0.5 actually [17:41:11] lzia: so, i do not think that is needed at this time [17:42:12] got you. I'm going to make a quick one for research and we can expand it if others find it useful (I agree that some numbers are too decimal;) [17:49:19] lzia: i comented on doc, but think that 0.1 is 1 month and a half of work of 1 person [17:50:11] lzia: with this level of planning (yearly) i do not think is possible to do such a refined alocation (might be for some projects though but doesn't seem likely) [17:50:43] agreed, nuria. [17:51:07] Hey milimetric, would be great [17:51:27] ottomata, elukey: Looks like we lost the dynamic allocation setting for spark :( [17:51:40] k, will do then [17:54:27] oh?! [17:54:31] in spark confs? [17:55:24] joal: [17:55:25] ? [17:59:04] joal: how can I help right now, want to batcave? [18:00:55] ottomata: sorry was away for a minute [18:01:00] ottomata: batcave ! [18:13:08] looks like you guys are working on it, I am going afk ! [18:20:25] milimetric: thanks a lot mate ! [18:21:44] ottomata: modified worked great - will try it with oozie [18:22:30] great [18:33:03] (03PS1) 10Joal: Bump CDH version and update spark jobs accordingly [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340540 [18:34:00] OHHHH right joal [18:34:44] joal: added '- make sure a big spark job works, not just hive webrequest refine (sometimes spark deps change)" to list of things to do next time [18:35:23] col ottomata :) [18:35:33] (03CR) 10jerkins-bot: [V: 04-1] Bump CDH version and update spark jobs accordingly [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340540 (owner: 10Joal) [18:35:35] (03PS1) 10Joal: Update oozie mobile apps metrics jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340542 [18:36:39] joal: basic +2 on those, can merge when you are ready, i haven't done the refinery-source release in a while [18:36:57] great ottomata [18:38:12] wow I commited a patch that doesn't compile !!!! I owe a round of beers to the team ! 
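On joal's "we lost the dynamic allocation setting for spark" remark: dynamic allocation on YARN is the pair of settings below, plus the external shuffle service on the NodeManagers. Whether the keys live in spark-defaults.conf or are passed per job here is not shown, so take the placement as an assumption:

    # per-job form (jar name illustrative); the same keys can also sit in spark-defaults.conf
    spark-submit \
      --master yarn \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      my-job.jar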
[18:39:46] (03PS2) 10Joal: Bump CDH version and update spark jobs accordingly [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/340540 [19:12:57] (03PS1) 10Joal: Remove pagecounts projectcounts from dump-check [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340548 [19:13:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:14:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [19:27:03] going for dinner a-team, spark issue is not yet solved (a new one occured) [20:11:44] Hey ottomata, are you around? [20:11:55] ya [20:12:06] I think I hit that: https://community.cloudera.com/t5/Cloudera-Manager-Installation/CDH-5-5-1-fresh-Oozie-gt-Spark-gt-Hive-Unable-to-instantiate/td-p/35593 [20:12:29] ottomata: do you recall how to add a jar to the oozie lib ? [20:12:56] ottomata: I remember we thought about that some times ago, but can't put my head around it anymore [20:14:43] looking [20:16:12] ottomata: might not be derby (could be mysql connector) [20:16:31] hm, there is the oozie sharelib, but it shoudl have the jars it needs [20:16:33] looking more [20:17:39] joal: have you seen this error? [20:18:04] ottomata: sudo -u hdfs yarn logs -applicationId application_1488294419903_5295 | less [20:18:35] weird it does mention derby though [20:19:43] and, the derby jar with that class is in the oozie sharelib [20:20:53] this is very weird [20:21:25] and also ./mobile_apps/session_metrics/bundle.properties:oozie.use.system.libpath = true [20:23:48] ottomata: another issue of jar versions :) [20:24:00] maybe so, but where is the conflicting one? [20:24:28] nonono ottomata : just noticed another set of failing jobs because of old jar version [20:26:03] oh [20:26:07] sheesh [20:26:09] hm [20:26:10] https://hue.wikimedia.org/jobbrowser/jobs/job_1488294419903_5583 [20:26:22] mapreduce.job.classpath.files [20:26:27] maybe its not printing all of them out? [20:30:22] I don't get it ottotmata :( [20:31:23] joal: i wonder if oozie is not properly adding all of the required hive stuff whne its running a spark action [20:31:24] hm [20:35:54] joal: can you try [20:36:17] -Doozie.libpath=hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/hive [20:36:18] ? [20:38:41] ottomata: trying [20:42:56] ottomata: jobs started [20:48:14] ottomata: I tried adding the derby jar as dependency: didn't work [20:49:55] added as dependency how? [20:50:01] in pom.xml [20:51:41] hm [20:51:47] oh and recompiling [20:51:49] that gave you the same error!? [20:52:14] it failed (didn't check the error) [20:52:27] New one with your suggestion fails as well [20:52:34] ottomata: https://hue.wikimedia.org/oozie/list_oozie_workflow/0002264-170228165458841-oozie-oozi-W/?coordinator_job_id=0002263-170228165458841-oozie-oozi-C&bundle_job_id=0002261-170228165458841-oozie-oozi-B [20:53:37] ottomata: error looks way better with the seting you provided [20:54:06] org.apache.spark.sql.AnalysisException: Table not found: `wmf`.`webrequest`; line 5 pos 9 [20:54:06] ? 
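Two standard knobs relevant to the sharelib dance above, both stock Oozie CLI (server URL illustrative): listing what the server-side sharelib actually contains, and resubmitting a job with the extra libpath ottomata suggests:

    # what jars the spark/hive sharelibs really ship
    oozie admin -oozie http://an-oozie-host:11000/oozie -shareliblist spark
    oozie admin -oozie http://an-oozie-host:11000/oozie -shareliblist hive
    # resubmit with the hive sharelib added to the job's libpath
    oozie job -config job.properties \
        -Doozie.libpath=hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/hive \
        -run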
[20:54:10] yes [20:54:15] ooo ya getting somewhere [20:54:17] that sounds fixable :) [20:57:26] (03PS2) 10Joal: Update oozie mobile apps metrics jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340542 [20:57:28] (03PS2) 10Joal: Remove pagecounts projectcounts from dump-check [analytics/refinery] - 10https://gerrit.wikimedia.org/r/340548 [21:02:29] ottomata: dumb question, have we updated hdfs://user/hive/hive-site.xml ? [21:02:48] ottomata: Cause the db issue, while better, still seems bizarre to me [21:03:20] ottomata: from the dates, looks like not [21:05:39] hmm - looks like this file hasn't changed [21:06:28] ? [21:06:34] hmmmmm [21:06:40] yeah, probably not actually! looking [21:06:51] i think it is only ensured that it is there by puppet [21:06:53] i think... [21:07:46] but joal [21:07:48] no diff? [21:07:58] in current hive-site and that file [21:08:06] joal: the db issue sounds like it sa problem in the recent patch [21:08:07] no? [21:08:17] you added webrequest_table as a property [21:08:18] right? [21:08:28] ottomata: I did that [21:08:45] ottomata: I tested the job manually, worked fine [21:09:05] hm, but maybe the thing is being quoted really weird by oozie [21:09:07] when passing to the job [21:09:24] ottomata: nope [21:09:25] https://hue.wikimedia.org/oozie/list_oozie_workflow/0002264-170228165458841-oozie-oozi-W/?coordinator_job_id=0002263-170228165458841-oozie-oozi-C&bundle_job_id=0002261-170228165458841-oozie-oozi-B [21:09:36] hm [21:10:30] joal: just curious, why did you change to table? [21:10:38] so that you could use spark hive integeration rather than read data from file? [21:11:09] ottomata: to prevent an error from the previous patch due to base-path and partition-style folders in spark [21:15:59] ottomata: used exactly the same params as oozie did (almost extacly) - The spark job started - Seems to be realted to hive config well launch from oozie :( [21:16:53] hmmmm [21:18:03] joal hm [21:18:09] you don't have hive_site_xml in this job [21:18:11] is that necessary? [21:18:25] hm that's for a hive action [21:18:26] hm [21:18:40] ottomata: I do pass it [21:18:44] how? [21:19:29] joal: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Oozie-Spark-Action-Hive/td-p/40129 [21:19:30] ? [21:19:31] maybe? [21:19:37] Add this to the tag in the action definition: [21:19:37] --conf spark.yarn.appMasterEnv.hive.metastore.uris=thrift://:9083 [21:20:16] ottomata: Will try ! [21:20:26] ottomata: if this work, that's really ugly :) [21:20:47] hm maybe better [21:20:47] --files hdfs:///user/hue/oozie/workspaces/hue-oozie-1463575878.15/hive-site.xml [21:20:56] hmmm [21:21:04] wonder if we just put hive-site into oozie share lib... [21:21:27] ottomata: I do: --files HIVE-SITE-PATH [21:21:32] oh [21:21:36] on the CLI? or where? [21:21:41] how do you pass thta in oozie? [21:21:47] to spark-opts? 
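The two variants being weighed for getting the Hive metastore config to a Spark action, shown here as plain spark-submit flags for clarity (the oozie <spark-opts> equivalent is the same string; paths, jar name, and metastore host are illustrative):

    # ship hive-site.xml with the job so HiveContext can find the metastore
    spark-submit --master yarn --files /etc/hive/conf/hive-site.xml my-job.jar
    # or point the AM at the metastore directly, as in the Cloudera thread linked above
    spark-submit --master yarn \
      --conf spark.yarn.appMasterEnv.hive.metastore.uris=thrift://hive-metastore-host:9083 \
      my-job.jar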
[21:22:07] ottomata: https://gerrit.wikimedia.org/r/#/c/340542/1/oozie/mobile_apps/session_metrics/workflow.xml [21:22:12] line 107 [21:22:43] oh its not merged [21:22:44] sorry [21:22:46] i was reading my lcoal [21:24:31] hm, i guess try the --conf, but i think the files thing is better and should do the same thing [21:24:40] ottomata: I think so [21:25:33] ottomata: also, other spark jobs fail (wikidata ones), but not all - the ones not using sqlContext strongly (restbase) are not failing [21:25:43] or more precisiely: not using Hive context [21:28:07] ottomata: I'm gonna stop for tonight and will continue investigation tomorrow [21:28:16] I stopped the spark related jobs [21:29:12] ok joal rats. [21:29:16] so, wait, did --conf not work? [21:29:18] As you say [21:29:22] I didn't try [21:29:24] oh ok [21:29:27] hm ok [21:29:30] checked other jobs [21:29:35] so, basically: oozie using hive context doesn't work? [21:29:46] ottomata: correct, that's where I stand so far [21:30:10] ottomata: hiveContext works in shell and submit, but not with oozie [21:30:35] ok [21:30:38] i'll see if i can poke a bit [21:30:54] thanks - will continue tomorrow as well [21:30:59] good evening [21:32:30] leila: i was wronggg and EVERYONE loved teh spreadsheet [21:32:54] leila: I guess I am the only one with spreadsheet allergy [21:33:21] nuria: I get your point. but I saw how things were going super crazy [21:33:30] ottomata: another thing while I'm at it: the other job that had an old version was mediacounts [21:33:34] nuria: I wish I can retire from being the spreadsheet-lady soon. ;) [21:34:06] ottomata: Problem is that a function used in that old version has gone a major refacto, and therefore not usable as before [21:34:16] leila: they still are crazy as headcount should add up to whole numbers at the end (not halfs unles]s you have part timers) [21:34:22] hm pk [21:34:25] ottomata: I have checked results, they don't really match :( [21:34:27] joal: so what hive context jobs are there? [21:34:31] yikes [21:34:34] just this session one [21:34:37] and a discovery one? [21:35:11] ottomata: hive context issues: mobiles apps sessions (2 jobs), wikidata (2 jobs) [21:35:12] agreed, nuria. [21:36:20] ottomata: old version issue: Mediacounts load (from 0.0.8) --> see insert_hourly_mediacounts.hql, line 27 [21:36:22] leila: i showed to finance and they were like [21:36:26] leila: no compredou [21:36:42] ;p [21:37:33] anyway ottomata - need to sleep - will get back to that tomorrow morning with a fresher head [21:38:17] joal: super thanks for your work [21:40:12] ok joal ya you go sleep! [21:40:13] thank you [21:40:15] good night [22:43:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [22:59:33] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:01:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [23:10:09] nuria: in number 9, you will need Erik's help, right? [23:10:30] Erik's work is not currently captured, and a lot of it is with Analytics, nuria. right? [23:23:19] 10Analytics, 10Analytics-Cluster, 10EventBus, 10MediaWiki-Vagrant, and 2 others: Kafka logs are not pruned on vagrant - https://phabricator.wikimedia.org/T158451#3066167 (10Pchelolo)
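joal's "hiveContext works in shell and submit, but not with oozie" check can be reproduced with a one-liner; this assumes the CDH build's spark-shell exposes a Hive-backed sqlContext by default (it normally does), so treat it as a sketch:

    # if the metastore is reachable this lists the wmf tables; inside the failing oozie action it does not
    echo 'sqlContext.sql("SHOW TABLES IN wmf").show()' | spark-shell --master yarn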