[08:08:50] Good morning :)
[08:30:05] o/
[09:41:08] elukey: how did the bigtop test on 1031 work out?
[09:43:10] moritzm: not bad, I also reached out to some people in bigtop, they are really welcoming
[09:43:59] moritzm: the main issue in swapping one node at a time is that the yarn node manager doesn't work with mixed protocol buffer versions between 2.6 (CDH) and 2.8 (bigtop)
[09:44:04] but hdfs does
[09:44:31] nice
[09:44:42] I am testing spark2 encryption in hadoop test (that still gives me headaches) so I tried to remove other variables :D
[09:44:59] what we are trying to figure out now is the best strategy if we migrate
[09:45:07] we can either:
[09:45:28] 1) do it in place, shutting down the cluster, swapping packages and finally upgrading HDFS
[09:45:59] 2) use the 34 new hadoop workers that we'll get to build a new bigtop cluster, and move the data to it
[09:46:11] both options are a bit complicated
[09:47:09] but the bigtop community is 100 times better than CDH's
[09:47:18] so I'd be really happy to work with them
[09:48:31] joal: very interesting output from analytics1031's rollback
[09:48:31] 2020-01-20 09:43:09,203 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/var/lib/hadoop/data/k/hdfs/dn/
[09:48:35] org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /var/lib/hadoop/data/k/hdfs/dn. Reported: -57. Expecting = -56.
[09:49:58] basically /var/lib/hadoop/data/m/hdfs/dn/current/VERSION of course changes
[09:53:01] I think I need the -rollback option
[09:53:16] upgrading is not super easy :D
[10:01:07] need to go to the doctor in a few, it will probably last an hour (physiotherapy), ttyl!
[11:12:26] Analytics, Product-Analytics: request access to Hue - https://phabricator.wikimedia.org/T243109 (nshahquinn-wmf) Adding #analytics since they're the ones who can get you access 🙂
[11:49:47] Analytics, Product-Analytics: request access to Hue - https://phabricator.wikimedia.org/T243109 (elukey) Open→Resolved a: elukey Done! :)
[12:16:00] interesting, with -rollback the datanode works now
[12:16:14] found it in https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#Downgrade_and_Rollback
[12:19:26] !log restart zookeeper on an-conf100X to pick up openjdk-11 updates
[12:19:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:29] nice finding elukey ! Will be very useful for the upgrade in case of rollback !
[12:38:33] joal: also another interesting thing
[12:38:42] hadoop checknative -a | grep openssl
[12:38:47] openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!
[12:39:10] WAT?
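
A quick way to reproduce elukey's check and see whether the native hadoop code can find an openssl library at all; a minimal sketch, assuming the Debian multiarch path used on these hosts:

    # Ask the Hadoop native-code loader which optional libraries it can load.
    hadoop checknative -a | grep openssl

    # List the libcrypto shared objects the dynamic linker knows about; the
    # unversioned libcrypto.so name that checknative asks for is normally only
    # shipped by the -dev package, hence "cannot open shared object file".
    ldconfig -p | grep libcrypto
    ls -l /usr/lib/x86_64-linux-gnu/libcrypto.so*
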
[12:39:25] this is on an-tool1006, it seems that we use Java's JCE stuff instead of openssl's due to https://issues.apache.org/jira/browse/HADOOP-12845
[12:39:44] that is not a big deal I think, but I am wondering if this causes some trouble between scala and python
[12:40:09] I was about to say that
[13:34:52] joal: if you have time, there is something that I don't understand
[13:35:01] in test I usually run
[13:35:02] spark.sql("SELECT count(*) FROM wmf.webrequest where year=2019 and month=11 and day=20 and hour=10 ").show();
[13:35:03] Please elukey :)
[13:35:19] that hour is, on hdfs, 4mb
[13:35:26] and it was usually quick
[13:35:43] now it seems to take ages, with a ton of workers and stages
[13:35:58] I am a bit confused about why this is happening
[13:36:21] elukey: I have an idea - is it the first time you run it?
[13:36:26] I rolled back the bigtop change on the only worker that had it, and also, as a precaution, the extra crypto settings (256 bits etc..)
[13:36:55] joal: the count(*) yes, but I switched from the "usual" select etc.. limit 10
[13:37:18] what do you mean by switch?
[13:37:26] changed the sql query
[13:37:31] because I get an OOM error
[13:38:49] with dynamic allocation it consumes like 180 workers
[13:38:53] (checked via the Yarn UI)
[13:38:58] never happened before
[13:39:05] Bizarre!
[13:39:26] elukey: could something have changed in the conf and made you query the prod cluster instead of test?
[13:40:42] in theory no, will try with debug logging to see if it hits hive in test
[13:40:51] ok
[13:41:16] this crypto thing is making me crazy :D
[13:41:16] elukey: might be related to the schema change and table recreation
[13:41:21] :(
[13:41:32] it worked very well
[13:50:21] checked the LOCATION of the webrequest table on hive, it is analytics-test, then
[13:50:24] Trying to connect to metastore with URI thrift://analytics1030.eqiad.wmnet:9083
[13:50:29] and from hive the query takes less than a min
[13:51:25] elukey: have you run a query against webrequest after having recreated the table?
[13:51:30] with spark I mean?
[13:51:43] I think these are the first
[13:52:05] elukey: this is the thing then
[13:52:38] can you explain? :)
[13:52:43] elukey: spark precomputes some parquet schema/stats into the metastore, and when those are not present, it goes over the full table
[13:52:59] and it takes some time
[13:53:50] with debug logging I can see
[13:53:51] 20/01/20 13:51:00 INFO DAGScheduler: ResultStage 0 (sql at :24) finished in 58.572 s
[13:53:59] try again now
[13:54:06] should be fast as usual
[13:54:11] but then the query hangs after a lot of "Remove broadcast" etc..
[13:55:05] hm
[14:03:18] elukey: I think the hanging is due to updating mysql, I hope the thing will finish :S
[14:05:49] joal: the parquet schema stats?
[14:06:03] yeah - feels weird
[14:07:58] I think that the pi example works, so it's probably hive-related.. it makes sense, it is the last big thing that I did
[14:11:12] going afk for a coffee, maybe it will help :)
[14:16:35] ok elukey - I got it to work - and through that I also understand better how the "parquet-stats" thing works: tasks get stats, which are collected onto the driver, which aggregates them and then writes them to the metastore - for this to work for large tables (many partitions), enough driver memory is needed (it succeeded fast for me with 4G)
[14:16:59] joal: ahhhhhhhhhhhhhhhh
[14:17:09] that explains it! And you get the hit on the first query
[14:17:16] does it work now?
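What joal describes above boils down to giving the driver enough heap for that first stats-collection pass. A sketch of doing that when starting a session; the 4G figure is joal's from above, while the spark2-shell wrapper (the shell counterpart of the spark2-submit wrapper seen later in this log) and the rest are illustrative:

    # Launch with a larger driver heap: on the first query against a freshly
    # recreated table, the driver has to aggregate parquet schema/stats from
    # every partition before writing them to the metastore.
    spark2-shell --master yarn --driver-memory 4G

    scala> spark.sql("SELECT count(*) FROM wmf.webrequest WHERE year=2019 AND month=11 AND day=20 AND hour=10").show()
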
[14:17:36] correct - it works yes (retried a query, done with 2 workers in seconds)
[14:17:59] joal: wow, thanks a million
[14:18:08] you are da best as always
[14:18:18] elukey: force-killing your process might be needed (driver overwhelmed, linux kill helps)
[14:18:59] big lesson learned today about spark
[14:19:48] yep works like a charm
[14:46:11] ok now I can make queries, and of course the mysterious heisenbug does not appear
[14:47:44] ah no, I cannot from python, but can from scala this time
[14:53:02] very interesting
[14:53:07] I have done something like
[14:53:18] sudo ln -s /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2 /usr/lib/x86_64-linux-gnu/libcrypto.so
[14:53:25] on an-tool1006 and all the workers
[14:53:42] so hadoop checknative -a now finds libcrypto.so (the 1.0.2 version)
[14:53:52] and so far all seems good, no problems
[14:56:47] mmmm
[15:02:46] joal: not sure if the heisenbug is gone, but it looks good
[15:03:14] encryption is back to 256 bit, RPC + IO (so any shuffle file spilled on disk is encrypted as well)
[15:32:19] Analytics, Analytics-Kanban, Patch-For-Review: Enable encryption in Spark 2.4 by default - https://phabricator.wikimedia.org/T240934 (elukey) ok so today I found in the debug logs a warning that was indicating the failure to load openssl's crypto libs, and the fallback to standard JCE crypto. After a...
[15:32:24] --^
[15:53:25] Analytics, Analytics-Kanban, Patch-For-Review: Enable encryption in Spark 2.4 by default - https://phabricator.wikimedia.org/T240934 (elukey) Very interesting that the heisenbug now seems to only trigger a warning, but not stop pyspark: ` elukey@an-tool1006:~$ spark2-submit --master yarn /home/j...
[16:01:46] indeed elukey - pyspark2 started straight away on an-tool1006
[16:02:04] let's give it half a day of semi-life, and then declare the heisenbug gone :)
[16:18:40] (PS15) Fdans: Add vue-i18n integration, English strings [analytics/wikistats2] - https://gerrit.wikimedia.org/r/558702 (https://phabricator.wikimedia.org/T240617)
[16:47:24] a-team https://imgflip.com/i/3mo1iu
[16:47:36] hahahah
[16:47:38] nice
[16:48:25] btw a-team, mediarequests per file finished loading to cassandra!
[16:48:41] yayyy!
[16:48:56] \o/ Awesome fdans :)
[16:49:12] awesome pic mforns :)
[16:49:36] hehe
[16:51:14] fdans: greaaattt
[16:51:18] joal: https://www.hipeac.net/2020/bologna/#/ :O
[16:52:26] Nice elukey! Do you think it's something that could be of interest for us (in addition to coming to visit you, obviously)?
[16:53:23] maybe, but it started today, I just found out :(
[16:53:27] could have been interesting!
[17:10:14] (PS2) Fdans: Add language selection functionality to Wikistats [analytics/wikistats2] - https://gerrit.wikimedia.org/r/564047 (https://phabricator.wikimedia.org/T238752)
[17:48:05] * elukey off!
[19:56:56] a-team: I'm trying to submit an Oozie job on the command line, but I'm getting an "Oozie URL not available" error. What URL should I specify? The docs on Wikitech (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie) show commands being run without any URL specified. It seems like that's because there should be an $OOZIE_URL environment variable, but for some reason I don't have one.
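Going back to the encryption thread from earlier: "encryption is back to 256 bit, RPC + IO" presumably corresponds to Spark properties along these lines. A sketch of plausible values, not the actual puppet-managed config:

    # spark-defaults.conf sketch: RPC encryption between driver and executors
    # (which requires spark.authenticate), plus IO encryption so shuffle data
    # spilled to local disk is encrypted as well.
    spark.authenticate               true
    spark.network.crypto.enabled     true
    spark.network.crypto.keyLength   256
    spark.io.encryption.enabled      true
    spark.io.encryption.keySizeBits  256
    # The IO encryption path is where the libcrypto.so symlink matters:
    # without it the crypto code falls back to the slower JCE ciphers,
    # which is the warning elukey found in the debug logs.
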
[19:57:43] Hi nshahquinn
[19:58:39] if you don't have $OOZIE_URL specified (not nice :( ) here is mine: http://an-coord1001.eqiad.wmnet:11000/oozie
[19:58:44] nshahquinn: --^
[19:59:17] nshahquinn: another thing - if you launch a job as another user, it is expected not to have the env var defined
[20:00:05] nshahquinn: when launching prod jobs, I use: oozie job --oozie $OOZIE_URL
[20:00:26] this passes the env var defined in my session to the other user's session :)
[20:00:35] joal: do you mean if I launch a job, sudoing as someone else, it won't be defined? I'm not doing anything like that
[20:00:47] that's what I meant nshahquinn
[20:01:07] on stat1004 when I do: echo $OOZIE_URL
[20:01:14] I get the url I pasted above
[20:01:39] joal: okay, thank you! when I do that, I get nothing... don't know why. But I'll set it again and add it to the docs :)
[20:01:49] :(
[20:01:58] Thanks a lot nshahquinn :)
[20:02:10] thank you :)
[20:06:05] joal: ahh, I think it wasn't set because I was in my SWAP virtual environment :)
[20:06:14] Ahhhhh !
[20:06:33] not sure if that makes sense, I'm not good enough with venvs to know :)
[20:40:28] Analytics, Product-Analytics: Add Kerberos authentication to Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T241092 (nshahquinn-wmf)
[20:41:58] Analytics, Product-Analytics: Add Kerberos authentication to Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T241092 (nshahquinn-wmf) Open→Resolved
[20:49:32] Analytics, Analytics-Kanban, Dumps-Generation: Some xml-dumps files don't follow BZ2 'correct' definition - https://phabricator.wikimedia.org/T243241 (JAllemandou)
[20:49:43] Analytics, Analytics-Kanban, Dumps-Generation: Some xml-dumps files don't follow BZ2 'correct' definition - https://phabricator.wikimedia.org/T243241 (JAllemandou) a: JAllemandou
[20:50:05] gone for tonight :)
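
For anyone else hitting nshahquinn's "Oozie URL not available" error, the fix discussed above amounts to the following; the URL is the one joal pasted, and the -info invocation with a hypothetical <job-id> is just one example of an oozie subcommand:

    # Point the oozie CLI at the Oozie server; a SWAP virtualenv session
    # doesn't inherit the default env var, so export it explicitly.
    export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie

    # Or pass it per invocation, as joal does for prod jobs:
    oozie job --oozie "$OOZIE_URL" -info <job-id>
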