[06:10:25] PROBLEM - YARN NodeManager JVM Heap usage on analytics1053 is CRITICAL: CRITICAL - analytics_hadoop_yarn_nodemanager is 3956580202 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[06:14:25] RECOVERY - YARN NodeManager JVM Heap usage on analytics1053 is OK: OK - analytics_hadoop_yarn_nodemanager is 3888158670 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[07:35:59] !log re-run wikidata-specialentitydata_metrics-wf-2018-2-17 via Hue
[07:36:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:51:45] today only ttttteeeeaaaaammm eeeeuuuurrrooooopppppppppppppeeeeeeeeeeeee
[08:10:08] morning elukey :)
[08:12:38] morning :_
[08:12:40] :)
[08:12:41] [zk: localhost:2181(CONNECTED) 5] ls /yarn-leader-election/analytics-hadoop-labs
[08:12:44] [ActiveBreadCrumb, ActiveStandbyElectorLock]
[08:12:44] joal: --^
[08:12:59] I am checking zookeeper znodes and found out exactly what yarn uses
[08:13:14] more or less the same stuff for hdfs ha
[08:13:15] :)
[08:13:55] it is a bit tricky to expand a zookeeper cluster in a safe way
[08:14:14] hm
[08:14:24] never done that
[08:17:03] so the main issue is that both zk servers and clients have the list of nodes
[08:17:36] and up to now our version (3.4.x) doesn't have dynamic repartitioning, which would allow changing the cluster on the fly (of course)
[08:17:46] (why wouldn't we be lucky?)
[08:18:07] now the procedure is something like:
[08:18:34] 1) add the new zk cluster configuration to the new zk nodes, spin them up and check that they join the zk ensemble correctly
[08:18:48] at this stage the other zk nodes are not fully aware of them
[08:19:06] 2) restart the followers, and make sure that they are ok with the new cluster
[08:19:15] 3) restart the zk leader
[08:19:34] at this stage though the clients are not aware of the new nodes
[08:19:54] sooo then, as a last step, a rolling restart of all the clients (kafka + hadoop) to pick up the new config
[08:20:09] Double checking my understanding elukey
[08:20:23] After 1), only the new nodes know they are part of the cluster
[08:20:45] after 2), new nodes + followers know, and after 3) "everybody knows"
[08:20:56] in theory yes
[08:21:02] makes sense
[08:21:17] elukey: And about data rebalancing - any option?
[08:21:22] everybody == zk nodes only, the clients are still aware only of the old cluster nodes
[08:21:44] joal: data rebalancing?
[08:22:30] elukey: You were saying that in our version, dynamic repartitioning is not present - to me that means that, even after growing the cluster the way you describe, data will still be
[08:22:48] data will still be only present on the nodes that were present before
[08:23:03] I might be misunderstanding something though :)
[08:23:11] ah no sorry, that thing is to manage the cluster; when a new node joins it should get the full view
[08:23:19] but I am not fully sure about this
[08:23:29] elukey: ok super - I should read and learn more about zk
[08:23:55] joal: me too, I am writing these things down to get pointers :)
[08:29:45] joal elukey morning yall! :)
[08:30:01] Morning fdans :)
[08:32:30] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3982335 (10JAllemandou) Took longer than I expected but it's done: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01 Looks like the patches work :)
[08:35:23] elukey: would you have a minute this morning to help me with my un-root-ness?
[08:36:33] joal: sure!
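The zkCli inspection above can also be scripted; a minimal sketch of the same znode listing, assuming the kazoo Python client (not what was used in the channel) and an illustrative localhost ensemble rather than the production one:

    # Sketch: list the znodes that YARN ResourceManager HA keeps for leader
    # election, as shown with zkCli above. kazoo and the host string are
    # assumptions, not the production setup.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()

    children = zk.get_children("/yarn-leader-election/analytics-hadoop-labs")
    print(children)  # expected: ['ActiveBreadCrumb', 'ActiveStandbyElectorLock']

    zk.stop()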
[08:38:04] elukey: For presto to run with slider I need folders created on every cluster node, owned by the yarn user, which launches the processes
[08:38:30] elukey: I had planned to use: /tmp/yarn/presto/data and /tmp/yarn/presto/etc
[08:39:01] elukey: Thing is, I don't have yarn sudoers :(
[08:40:22] joal: do you want to launch a process on each node?
[08:40:41] elukey: nope, but I can't know where they'll be launched
[08:41:10] joal: I didn't get what you need sudo for though
[08:41:45] elukey: those folders I want to create need to belong to the yarn user
[08:42:37] joal: ok but you don't need sudo for that, if you create them first.. if you want I can do it with cumin, is that what is needed?
[08:44:05] elukey: I get "Operation not permitted" when chowning
[08:44:25] on what host/dir? weird
[08:44:49] elukey: I first created folders under my username, so the folder is: /tmp/joal/presto
[08:45:03] They exist on every worker node - here I tried an1028
[08:45:16] elukey: could it be that yarn is a system-only user?
[08:49:09] it shouldn't be an issue, I am probably still asleep, checking on the host
[08:52:48] I think it is the sticky bit on tmp that might trick us
[08:53:46] or I am super ignorant about chown
[08:54:28] joal: sorry gimme 1 min, I'll find the problem, today is a difficult morning for my brain
[08:54:31] :D
[08:55:02] No worries elukey - :)
[08:58:14] joal: I am deeply embarrassed, you are right, I need to sudo to chown
[08:58:22] I was completely unaware of that
[08:58:36] and it makes sense
[08:59:01] elukey: in my mind, if I want to impersonate yarn to give it some more space/power/whatever, I need to be it
[08:59:03] otherwise you could inject arbitrary files owned by other users
[08:59:10] right
[08:59:40] joal: you are completely right, I show the deep ignorance that comes from having sudo handy on a daily basis :D
[09:00:14] now that I wasted 20 minutes of your time with a stupid thing, lemme do what you asked :d
[09:00:35] Learning is no waste of time elukey :)
[09:00:46] to recap: /tmp/yarn/presto/data and /tmp/yarn/presto/etc on all the workers, chown -R yarn:yarn ?
[09:01:04] elukey: actually, creating them and chowning to yarn please :)
[09:01:15] elukey: I'm gonna drop the /tmp/joal ones
[09:01:27] or elukey, if you could do it with cumin that'd be great
[09:01:40] yep yep
[09:02:00] can I go or do you need to save files?
[09:02:08] elukey: please go
[09:04:15] /tmp/joal cleared
[09:05:54] joal: all fone
[09:05:56] *done
[09:06:02] Many thanks elukey! Trying
[09:06:12] joal: thanks for the patience :)
[09:07:05] elukey: to me, learning basic linux is good on a monday morning :)
[09:09:28] * elukey blames fdans
[09:10:07] :D
[09:12:05] elukey: excuse me sir, how dare you
[09:16:29] ahahha
[09:48:16] Gone for some errands - will be back in 1h
[09:55:28] 10Analytics, 10Analytics-EventLogging: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#3982615 (10Tbayer) >>! In T187277#3973410, @Nuria wrote: >>appy to review older discussions if you find the links, but for now it looks like it's c...
[11:27:56] * elukey errand + lunch!
[11:49:49] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983026 (10bmansurov) @JAllemandou thanks! Would you be able to give some instructions on how to run these patches? The first one seems straightforward, but not sure about the scala one.
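What the cumin run above did on each worker is essentially "mkdir + chown -R yarn:yarn"; a minimal per-host sketch of that step, assuming it runs as root (the whole point of the sudo discussion) - in practice it was driven through cumin rather than a script like this:

    # Sketch: create the scratch directories that presto-on-slider expects and
    # hand the tree to the yarn user, which launches the containers.
    # Must run as root: chown to another user needs elevated privileges.
    import os
    import shutil

    for d in ("/tmp/yarn/presto/data", "/tmp/yarn/presto/etc"):
        os.makedirs(d, exist_ok=True)

    # Equivalent of `chown -R yarn:yarn /tmp/yarn`
    for dirpath, dirnames, filenames in os.walk("/tmp/yarn"):
        shutil.chown(dirpath, user="yarn", group="yarn")
        for name in filenames:
            shutil.chown(os.path.join(dirpath, name), user="yarn", group="yarn")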
[11:52:58] 10Analytics-Tech-community-metrics, 10Developer-Relations: Review entries in https://github.com/Bitergia/mediawiki-repositories/ to exclude/include - https://phabricator.wikimedia.org/T187711#3983039 (10Aklapper) p:05Triage>03Normal
[11:53:30] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Explain difference in number of repositories when trying to manually exclude imported third party repositories - https://phabricator.wikimedia.org/T184420#3983055 (10Aklapper) 05Open>03Resolved I come to believe that this is not a pr...
[11:57:05] 10Analytics-Tech-community-metrics: Consider setting up "Projects" in wikimedia.biterg.io - https://phabricator.wikimedia.org/T187661#3983071 (10Aklapper)
[11:57:07] 10Analytics-Tech-community-metrics, 10Regression: Git repo blacklist config not applied on wikimedia.biterg.io - https://phabricator.wikimedia.org/T146135#3983070 (10Aklapper)
[11:58:49] 10Analytics-Tech-community-metrics, 10Developer-Relations, 10Epic: Visualization/data regressions after moving from korma.wmflabs.org to wikimedia.biterg.io - https://phabricator.wikimedia.org/T137997#3983078 (10Aklapper)
[11:58:55] 10Analytics-Tech-community-metrics, 10Regression: Git repo blacklist config not applied on wikimedia.biterg.io - https://phabricator.wikimedia.org/T146135#2651767 (10Aklapper) 05Open>03stalled I think the way to go forward here would be putting blocklisted repositories into a "Project" (depends on T187661)...
[11:59:35] 10Analytics-Tech-community-metrics, 10Regression: Exclude blocklist upstream repositories in the default view on wikimedia.biterg.io (by setting up a "Project"?) - https://phabricator.wikimedia.org/T146135#3983081 (10Aklapper)
[12:01:32] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983090 (10JAllemandou) @bmansurov : There is an example command line in the header-comment of the XmlConverter file. Little reminder: these two patches deal with huge datasets (2TB of bz2 compressed XML and 18TB of snappy compres...
[12:20:12] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983143 (10bmansurov) @JAllemandou makes sense. @diego how soon do you need these files? Can we wait until the patches are productionized?
[12:22:45] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983147 (10JAllemandou) @bmansurov and @diego : Data is available up to 2018-01 included at `hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01`. I think we're not going to put more effort into productionization as...
[12:25:08] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983153 (10bmansurov) That's great. I didn't realize you did it for every wiki!
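For anyone wanting to poke at that snapshot, a minimal sketch of reading it from Spark; this assumes pyspark is available on an analytics client host and that the converted data is readable as Parquet, which the thread does not actually state (it only mentions snappy compression), so the reader may need adjusting:

    # Sketch: peek at the converted wikitext snapshot on HDFS.
    # The path comes from the Phabricator comment above; the Parquet format
    # is an assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wikitext-snapshot-peek").getOrCreate()

    path = "hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01"
    df = spark.read.parquet(path)

    df.printSchema()
    df.limit(5).show(truncate=False)  # look at a few rows, not the full ~18TB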
[12:26:08] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3983157 (10JAllemandou) @bmansurov No worries :) The whole point of these two things is to work for 'every' wiki :)
[12:36:45] PROBLEM - YARN NodeManager JVM Heap usage on analytics1069 is CRITICAL: CRITICAL - analytics_hadoop_yarn_nodemanager is 3896638819 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[12:38:45] RECOVERY - YARN NodeManager JVM Heap usage on analytics1069 is OK: OK - analytics_hadoop_yarn_nodemanager is 3844894675 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[12:43:45] PROBLEM - YARN NodeManager JVM Heap usage on analytics1069 is CRITICAL: CRITICAL - analytics_hadoop_yarn_nodemanager is 3896638819 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[12:51:45] RECOVERY - YARN NodeManager JVM Heap usage on analytics1069 is OK: OK - analytics_hadoop_yarn_nodemanager is 3844894675 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=17&fullscreen
[13:32:47] elukey: help !!!
[13:42:07] joal: here I am!
[13:42:21] these are false positives, new prometheus alarms, everything is good
[13:42:34] I had a new idea about how to check heap usage
[13:42:42] going to fix those monitors in 10 min
[13:42:42] ok elukey :)
[13:42:51] elukey: I'd need sudo help again
[13:43:16] joal: why do you need sudo? Can't you just chown yourself?
[13:43:18] * elukey runs away
[13:43:20] :D :D :D
[13:43:25] :D
[13:43:33] what can I do to help??
[13:43:58] elukey: you know how many of me there are inside my brain, don't ask me to chown too much or I'll overflow
[13:44:28] elukey: for debugging purposes I'd need to access yarn folders on an analytics node
[13:44:50] But it's kinda tricky because I actually don't really know what I'm expecting to find :(
[13:45:28] joal: what if we chown joal:yarn?
[13:45:30] do you think that it would help?
[13:46:06] or even r+x to others for this experiment, if no pii data is expected to be there
[13:46:40] elukey: if this place is not empty, could we do that? analytics1052:/var/lib/hadoop/data/e/yarn/local/usercache/joal/appcache/application_1518549639452_21660/
[13:50:13] joal: long term I'd say that we could file an access request to sudo as yarn on all the hadoop nodes
[13:50:25] elukey: could help
[13:50:39] elukey: phab ticket to ops?
[13:51:20] elukey@analytics1052:~$ sudo ls -l /var/lib/hadoop/data/e/yarn/local/usercache/joal/appcache/
[13:51:23] total 0
[13:51:28] joal: yep
[13:53:31] elukey: Ops-Access-Requests I guess
[13:53:46] yep, if you want I can take care of it later
[13:54:00] created elukey
[13:55:02] I think we could simply add a sudoers rule to analytics-admins
[13:57:35] elukey: very much
[14:03:06] joal: https://gerrit.wikimedia.org/r/#/c/412704/
[14:03:08] there you go
[14:03:10] elukey: would you mind giving me a few minutes in the batcave to try to unlock my stuff?
[14:03:19] so the process is: 3 waiting days, plus the ops meeting approval
[14:03:23] that will happen next monday
[14:03:34] awesome :)
[14:03:46] I'll try to help in the meantime
[14:22:39] hey a-team, I'm feeling pretty poorly right now so I'm going to lie down for a couple of hours, so I'll miss standup :(
[14:23:12] hey fdans - No worries, hope you'll feel better soon
[14:25:06] * elukey sends wikilove to fdans
[14:36:08] helllooo
[14:51:26] Hi mforns :)
[14:51:31] hello joal :]
[14:51:43] How are you?
[14:56:45] good!
[14:57:06] you?!?!
[14:58:39] Good as well :)
[15:04:16] Gone to catch Lino :)
[15:12:07] elukey, ping? o.o
[15:12:41] mforns: pong!
[15:15:24] elukey, :]
[15:15:38] how are ya?
[15:15:55] on friday I was looking into the problems in the mediawiki snapshot cleaner
[15:16:12] and it turns out the problem is heap space
[15:16:24] I think
[15:16:43] because the table that is iterated over has too many partitions, around 85k
[15:17:20] I tried to look at grafana, but could not find metrics
[15:18:36] mforns: about hive? In there we used to have only the server and the metastore, but jmxtrans is not there anymore so we don't have them (due to some outstanding upstream bugs that prevent prometheus from running). In your case I think it is the hive cli, no?
[15:18:53] yes
[15:19:32] elukey, is there a way to give hive more heap?
[15:19:47] in theory simply adding JAVA_OPTS
[15:19:54] lemme look at the hive script
[15:23:16] mforns: like https://community.hortonworks.com/content/supportkb/48788/i-am-seeing-outofmemory-errors-when-i-run-a-hive-q.html
[15:23:54] elukey, ok! will try thanks!
[15:23:56] did you get an OOM somewhere that we can check? Is it the container that fails, or the hive cli itself when getting the data back?
[15:24:02] I guess the former
[15:24:17] elukey, I got GC overhead errorsa
[15:24:25] *errors
[15:25:03] but googling around, I found some people that had a similar problem and the GC error was due to heap space
[15:25:18] not sure though in our case
[15:37:24] GC overhead is almost surely related to heap size, the GC spends more time removing garbage than running the actual code
[15:37:48] mforns: but where do you get it? from the hive CLI or in a yarn container?
[15:38:03] elukey, both
[15:38:25] wait, no, cron also executes the hive cli
[15:38:26] so, cli
[15:39:27] mforns: sure, but is it the output from a container? Because two things might happen: 1) the Yarn container's JVM fails or 2) the hive CLI JVM (getting results back from hadoop) fails
[15:39:54] I see
[15:40:02] not sure
[15:40:41] elukey, want me to show you in da caif for a sec
[15:40:42] ?
[15:41:06] mforns: sure
[15:41:10] ok, omw
[17:04:30] mforns, elukey - Sorry ! Had to take that call
[17:04:36] joal, np!
[17:04:40] we're deploying
[17:04:42] mforns, elukey - Query still running, no failure as of now
[17:04:49] ok :]
[17:04:57] mforns: There must have been something wrong with the DB state
[17:05:10] aha, I hope!
[17:05:37] mforns: Actually, we should have done that: not delete the data, but delete, restore, and try to delete the partition ...
[17:05:56] aha
[17:05:58] anyway mforns - Might be for next month if things still go wrong (hopefully not!)
[17:06:59] joal, yea, we can propose to reduce the #snapshots in standup tomorrow
[17:07:12] maybe for that table only
[17:07:20] it will be a small change in the script
[17:07:21] mforns: we can, but we're not even sure it is the issue
[17:07:33] right
[17:14:32] !log deployed eventlogging - https://gerrit.wikimedia.org/r/#/c/405687/
[17:14:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:24:14] elukey: presto has started :)
[17:24:20] elukey: and it's fast :)
[17:24:26] niceeeeeeeeee
[17:25:41] joal: if I get it correctly, slider adds a "Meso-like" appeal to a bare Yarn+HDFS environment right?
[17:25:46] *Mesos
[17:26:05] elukey: that's the idea
[17:26:28] * elukey finally got it
[17:26:39] elukey: slider acts as a middle man, negotiating resources with yarn and launching instances where they get assigned
[17:27:24] also keeping stuff running, even if containers fail etc..
[17:27:32] really really coool
[17:27:42] elukey: with elasticity
[17:28:10] joal: on how many workers does it run currently?
[17:28:15] (slider I mean)
[17:28:27] elukey: I have 5 containers
[17:28:48] elukey: 1 AM (managing the slider-app), 1 presto-coord, 3 presto workers
[17:28:59] elukey: currently trying to scale to 8 presto workers
[17:30:06] ah so each slider app has a "scheduler" basically that runs as a yarn container, and that in turn launches the other ones (in your case the presto ones)
[17:30:35] This is super awesome - scaled to 8 nodes in 10 seconds :)
[17:31:17] wow
[17:31:22] elukey: actually, this pattern is yarn - any app has a special container named ApplicationMaster that manages the app, and other containers that do the work
[17:32:58] elukey: with 8 nodes, it takes 5 seconds to sum up one day of pageviews :)
[17:35:48] joal: yep I am aware of the AM but I thought that slider was somehow bypassing it
[17:36:07] elukey: given the AM is yarn and slider is inside yarn, well, it has to do with it :)
[17:36:20] a-team - Going for dinner ! Back after
[17:36:22] o/
[18:27:50] * elukey off!
[18:47:07] mforns: Request for repairing the table has finished successfully :)
[18:47:15] joal, woohoo
[18:47:39] ok, I will execute the snapshot script to clear the other 2 tables
[18:48:00] the script broke with the metrics one, and there were still 2 others remaining
[18:48:23] k mforns - let me know if I can help :)
[19:07:02] mforns: I think I'm gonna be able to show you presto tomorrow :)
[19:07:12] suuuure!
[19:07:13] mforns: you and the rest of the team I mean
[19:07:19] perfect
[19:07:26] pretty impressive
[19:07:32] by your enthusiasm it looks good!
[19:07:57] Well, being better than hive is not really difficult, but man that is some improvement!
[19:09:38] :D
[19:09:56] joal, the script finished without problems
[19:10:03] Yay ! \o/
[19:10:17] let's see next month... xD
[19:10:25] thanks for your help!
[19:41:59] Gone for tonight a-team :)
[19:42:26] bye joal!
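Going back to the afternoon's hive CLI heap question: the knob the linked Hortonworks note points at is the client-side JVM heap (HADOOP_HEAPSIZE / HADOOP_CLIENT_OPTS), not the query itself. A minimal sketch of how a cron wrapper might raise it, assuming Python and purely illustrative values - the real snapshot-cleaner script and its table are not shown in the channel, and exactly how the hive launcher combines these variables depends on the Hive/Hadoop version in use:

    # Sketch: run a hive query with a larger client-side heap. Hypothetical
    # wrapper with illustrative values; the placeholder query stands in for
    # the real cleaner's statements.
    import os
    import subprocess

    env = dict(os.environ)
    env["HADOOP_HEAPSIZE"] = "2048"              # hive CLI heap, in MB
    env["HADOOP_CLIENT_OPTS"] = "-XX:+UseG1GC"   # extra client JVM options

    subprocess.run(
        ["hive", "-e", "SHOW PARTITIONS some_db.some_heavily_partitioned_table"],
        env=env,
        check=True,
    )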