[08:08:50] Good morning :)
[08:30:05] o/
[09:41:08] elukey: how did the bigtop test on 1031 work out?
[09:43:10] moritzm: not bad, I also reached out to some people in bigtop, they are really welcoming
[09:43:59] moritzm: the main issue in swapping one node at a time is that the yarn node manager doesn't work with mixed protocol buffer versions between 2.6 (CDH) and 2.8 (bigtop)
[09:44:04] but hdfs does
[09:44:31] nice
[09:44:42] I am testing spark2 encryption in hadoop test (that still gives me headaches) so I tried to remove other variables :D
[09:44:59] what we are trying to figure out now is the best strategy if we migrate
[09:45:07] we can either:
[09:45:28] 1) do it in place, shutting down the cluster, swapping packages and finally upgrading HDFS
[09:45:59] 2) use the 34 new hadoop workers that we'll get to build a new bigtop cluster, and move the data to it
[09:46:11] both options are a bit complicated
[09:47:09] but the bigtop community is 100 times better than CDH's
[09:47:18] so I'd be really happy to work with them
[09:48:31] joal: very interesting output from analytics1031's rollback
[09:48:31] 2020-01-20 09:43:09,203 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/var/lib/hadoop/data/k/hdfs/dn/
[09:48:35] org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /var/lib/hadoop/data/k/hdfs/dn. Reported: -57. Expecting = -56.
[09:49:58] basically /var/lib/hadoop/data/m/hdfs/dn/current/VERSION of course changes
[09:53:01] I think I need the -rollback option
[09:53:16] upgrading is not super easy :D
[10:01:07] need to go to the doctor in a few, it will probably last an hour (physiotherapy), ttyl!
[11:12:26] Analytics, Product-Analytics: request access to Hue - https://phabricator.wikimedia.org/T243109 (nshahquinn-wmf) Adding #analytics since they're the ones who can get you access 🙂
[11:49:47] Analytics, Product-Analytics: request access to Hue - https://phabricator.wikimedia.org/T243109 (elukey) Open→Resolved a: elukey Done! :)
[12:16:00] interesting, with -rollback the datanode works now
[12:16:14] found it in https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#Downgrade_and_Rollback
[12:19:26] !log restart zookeeper on an-conf100X to pick up openjdk-11 updates
[12:19:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:29] nice finding elukey ! Will be very useful for the upgrade in case of rollback !
[12:38:33] joal: also another interesting thing
[12:38:42] hadoop checknative -a | grep openssl
[12:38:47] openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!
[12:39:10] WAT?
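
A quick way to reproduce elukey's check and see whether the native hadoop code can find an openssl library at all; a minimal sketch, assuming the Debian multiarch path used on these hosts:

    # Ask the Hadoop native-code loader which optional libraries it can load.
    hadoop checknative -a | grep openssl

    # List the libcrypto shared objects the dynamic linker knows about; the
    # unversioned libcrypto.so name that checknative asks for is normally only
    # shipped by the -dev package, hence "cannot open shared object file".
    ldconfig -p | grep libcrypto
    ls -l /usr/lib/x86_64-linux-gnu/libcrypto.so*
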
[12:39:25] this is on an-tool1006, it seems that we use Java's JCE stuff instead of openssl's due to https://issues.apache.org/jira/browse/HADOOP-12845
[12:39:44] that is not a big deal I think, but I am wondering if this causes some trouble between scala and python
[12:40:09] I was about to say that
[13:34:52] joal: if you have time, there is something that I don't understand
[13:35:01] in test I usually run
[13:35:02] spark.sql("SELECT count(*) FROM wmf.webrequest where year=2019 and month=11 and day=20 and hour=10 ").show();
[13:35:03] Please elukey :)
[13:35:19] that hour is, on hdfs, 4mb
[13:35:26] and it was usually quick
[13:35:43] now it seems to take ages, with a ton of workers and stages
[13:35:58] I am a bit confused about why this is happening
[13:36:21] elukey: I have an idea - is it the first time you run it?
[13:36:26] I rolled back the bigtop change on the only worker that had it, and also, as a precaution, the extra crypto settings (256 bits etc..)
[13:36:55] joal: the count(*) yes, but I switched from the "usual" select etc.. limit 10
[13:37:18] what do you mean by switch?
[13:37:26] changed the sql query
[13:37:31] because I get an OOM error
[13:38:49] with dynamic allocation it consumes like 180 workers
[13:38:53] (checked via the Yarn UI)
[13:38:58] never happened before
[13:39:05] Bizarre!
[13:39:26] elukey: could something have changed in the conf and made you query the prod cluster instead of test?
[13:40:42] in theory no, will try with debug logging to see if it hits hive in test
[13:40:51] ok
[13:41:16] this crypto thing is making me crazy :D
[13:41:16] elukey: might be related to the schema change and table recreation
[13:41:21] :(
[13:41:32] it worked very well
[13:50:21] checked the LOCATION of the webrequest table on hive, it is analytics-test, then
[13:50:24] Trying to connect to metastore with URI thrift://analytics1030.eqiad.wmnet:9083
[13:50:29] and from hive the query takes less than a min
[13:51:25] elukey: have you run a query against webrequest after having recreated the table?
[13:51:30] with spark I mean?
[13:51:43] I think these are the first
[13:52:05] elukey: this is the thing then
[13:52:38] can you explain? :)
[13:52:43] elukey: spark precomputes some parquet schema/stats into the metastore, and when those are not present, it goes over the full table
[13:52:59] and it takes some time
[13:53:50] with debug logging I can see
[13:53:51] 20/01/20 13:51:00 INFO DAGScheduler: ResultStage 0 (sql at :24) finished in 58.572 s
[13:53:59] try again now
[13:54:06] should be fast as usual
[13:54:11] but then the query hangs after a lot of "Remove broadcast" etc..
[13:55:05] hm
[14:03:18] elukey: I think the hanging is due to updating mysql, I hope the thing will finish :S
[14:05:49] joal: the parquet schema stats?
[14:06:03] yeah - feels weird
[14:07:58] I think that the pi example works, so it's probably hive-related.. it makes sense, it is the last big thing that I did
[14:11:12] going afk for a coffee, maybe it will help :)
[14:16:35] ok elukey - I got it to work - and through that I also understand better how the "parquet-stats" thing works: tasks get stats, which are collected onto the driver, which aggregates them and then writes them to the metastore - for this to work for large tables (many partitions), enough driver memory is needed (it succeeded fast for me with 4G)
[14:16:59] joal: ahhhhhhhhhhhhhhhh
[14:17:09] that explains it! And you get the hit on the first query
[14:17:16] does it work now?
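What joal describes above boils down to giving the driver enough heap for that first stats-collection pass. A sketch of doing that when starting a session; the 4G figure is joal's from above, while the spark2-shell wrapper (the shell counterpart of the spark2-submit wrapper seen later in this log) and the rest are illustrative:

    # Launch with a larger driver heap: on the first query against a freshly
    # recreated table, the driver has to aggregate parquet schema/stats from
    # every partition before writing them to the metastore.
    spark2-shell --master yarn --driver-memory 4G

    scala> spark.sql("SELECT count(*) FROM wmf.webrequest WHERE year=2019 AND month=11 AND day=20 AND hour=10").show()
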
[14:17:36] correct - it works yes (retried a query, done with 2 workers in seconds)
[14:17:59] joal: wow, thanks a million
[14:18:08] you are da best as always
[14:18:18] elukey: force-killing your process might be needed (driver overwhelmed, linux kill helps)
[14:18:59] big lesson learned today about spark
[14:19:48] yep works like a charm
[14:46:11] ok now I can make queries, and of course the mysterious heisenbug does not appear
[14:47:44] ah no, I cannot from python, but can from scala this time
[14:53:02] very interesting
[14:53:07] I have done something like
[14:53:18] sudo ln -s /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.2 /usr/lib/x86_64-linux-gnu/libcrypto.so
[14:53:25] on an-tool1006 and all the workers
[14:53:42] so hadoop checknative -a now finds libcrypto.so (the 1.0.2 version)
[14:53:52] and so far all seems good, no problems
[14:56:47] mmmm
[15:02:46] joal: not sure if the heisenbug is gone, but it looks good
[15:03:14] encryption is back to 256 bit, RPC + IO (so any shuffle file spilled on disk is encrypted as well)
[15:32:19] Analytics, Analytics-Kanban, Patch-For-Review: Enable encryption in Spark 2.4 by default - https://phabricator.wikimedia.org/T240934 (elukey) ok so today I found in the debug logs a warning that was indicating the failure to load openssl's crypto libs, and the fallback to standard JCE crypto. After a...
[15:32:24] --^
[15:53:25] Analytics, Analytics-Kanban, Patch-For-Review: Enable encryption in Spark 2.4 by default - https://phabricator.wikimedia.org/T240934 (elukey) Very interesting that the heisenbug now seems to only trigger a warning, but not stop pyspark: ` elukey@an-tool1006:~$ spark2-submit --master yarn /home/j...
[16:01:46] indeed elukey - pyspark2 started straight away on an-tool1006
[16:02:04] let's give it half a day of semi-life, and then declare the heisenbug gone :)
[16:18:40] (PS15) Fdans: Add vue-i18n integration, English strings [analytics/wikistats2] - https://gerrit.wikimedia.org/r/558702 (https://phabricator.wikimedia.org/T240617)
[16:47:24] a-team https://imgflip.com/i/3mo1iu
[16:47:36] hahahah
[16:47:38] nice
[16:48:25] btw a-team, mediarequests per file finished loading to cassandra!
[16:48:41] yayyy!
[16:48:56] \o/ Awesome fdans :)
[16:49:12] awesome pic mforns :)
[16:49:36] hehe
[16:51:14] fdans: greaaattt
[16:51:18] joal: https://www.hipeac.net/2020/bologna/#/ :O
[16:52:26] Nice elukey! Do you think it's something that could be of interest for us (in addition to coming to visit you, obviously)?
[16:53:23] maybe, but it started today, I just found out :(
[16:53:27] could have been interesting!
[17:10:14] (PS2) Fdans: Add language selection functionality to Wikistats [analytics/wikistats2] - https://gerrit.wikimedia.org/r/564047 (https://phabricator.wikimedia.org/T238752)
[17:48:05] * elukey off!
[19:56:56] a-team: I'm trying to submit an Oozie job on the command line, but I'm getting an "Oozie URL not available" error. What URL should I specify? The docs on Wikitech (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie) show commands being run without any URL specified. It seems like that's because there should be an $OOZIE_URL environment variable, but for some reason I don't have one.
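Going back to the encryption thread from earlier: "encryption is back to 256 bit, RPC + IO" presumably corresponds to Spark properties along these lines. A sketch of plausible values, not the actual puppet-managed config:

    # spark-defaults.conf sketch: RPC encryption between driver and executors
    # (which requires spark.authenticate), plus IO encryption so shuffle data
    # spilled to local disk is encrypted as well.
    spark.authenticate               true
    spark.network.crypto.enabled     true
    spark.network.crypto.keyLength   256
    spark.io.encryption.enabled      true
    spark.io.encryption.keySizeBits  256
    # The IO encryption path is where the libcrypto.so symlink matters:
    # without it the crypto code falls back to the slower JCE ciphers,
    # which is the warning elukey found in the debug logs.
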
[19:57:43] Hi nshahquinn
[19:58:39] if you don't have $OOZIE_URL specified (not nice :( ) here is mine: http://an-coord1001.eqiad.wmnet:11000/oozie
[19:58:44] nshahquinn: --^
[19:59:17] nshahquinn: another thing - if you launch a job as another user, it is expected not to have the env var defined
[20:00:05] nshahquinn: when launching prod jobs, I use: oozie job --oozie $OOZIE_URL
[20:00:26] this passes the env var defined in my session to the other user's session :)
[20:00:35] joal: do you mean if I launch a job, sudoing as someone else, it won't be defined? I'm not doing anything like that
[20:00:47] that's what I meant nshahquinn
[20:01:07] on stat1004 when I do: echo $OOZIE_URL
[20:01:14] I get the url I pasted above
[20:01:39] joal: okay, thank you! when I do that, I get nothing... don't know why. But I'll set it again and add it to the docs :)
[20:01:49] :(
[20:01:58] Thanks a lot nshahquinn :)
[20:02:10] thank you :)
[20:06:05] joal: ahh, I think it wasn't set because I was in my SWAP virtual environment :)
[20:06:14] Ahhhhh !
[20:06:33] not sure if that makes sense, I'm not good enough with venvs to know :)
[20:40:28] Analytics, Product-Analytics: Add Kerberos authentication to Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T241092 (nshahquinn-wmf)
[20:41:58] Analytics, Product-Analytics: Add Kerberos authentication to Product Analytics Oozie jobs - https://phabricator.wikimedia.org/T241092 (nshahquinn-wmf) Open→Resolved
[20:49:32] Analytics, Analytics-Kanban, Dumps-Generation: Some xml-dumps files don't follow BZ2 'correct' definition - https://phabricator.wikimedia.org/T243241 (JAllemandou)
[20:49:43] Analytics, Analytics-Kanban, Dumps-Generation: Some xml-dumps files don't follow BZ2 'correct' definition - https://phabricator.wikimedia.org/T243241 (JAllemandou) a: JAllemandou
[20:50:05] gone for tonight :)
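
For anyone else hitting nshahquinn's "Oozie URL not available" error, the fix discussed above amounts to the following; the URL is the one joal pasted, and the -info invocation with a hypothetical <job-id> is just one example of an oozie subcommand:

    # Point the oozie CLI at the Oozie server; a SWAP virtualenv session
    # doesn't inherit the default env var, so export it explicitly.
    export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie

    # Or pass it per invocation, as joal does for prod jobs:
    oozie job --oozie "$OOZIE_URL" -info <job-id>
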