[00:10:07] fdans: I asked my family in India to check and while it's possible they missed an app, they certainly covered the most popular ones [00:10:19] I am leaning towards that the image is requested but not displayed [01:26:12] 10Analytics, 10Analytics-Kanban, 10Anti-Harassment, 10Event-Platform, and 2 others: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 (10Ottomata) Hi all, this PHP client migration blocker has been a pain, hopefully we are getting close to done. I... [01:38:18] is there a gap between say when I make a webrequest and when it shows up in a Hive query? [01:38:31] like if I were to query for Feb 9 UTC, would it show up now? [03:47:02] sukhe: just logging in to say wowowowowow super amazing job, incredible detective work [06:11:37] !log disable systemd timers on an-launcher1002 (prep step for bigtop) [06:11:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:11:48] * elukey coffee [06:14:44] !log disable timers on labstore nodes (prep step for bigtop) [06:14:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:24:11] Good morning [06:29:51] joal: morning! [06:36:20] (03PS1) 10Lex Nasser: Fix unit tests that ensure certain requests fail and clean up all unit tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) [06:38:30] joal: goodnight! [06:42:42] Good night lexnasser :) [06:51:30] good morning :) [06:54:43] joal: very interesting readin https://github.com/prestodb/presto/issues/15685 [06:55:05] basically the issues that I was seeing with presto 246 were all related to a typo in the worker hive config file (my bad) [06:55:20] but the division by zero stacktrace was basically (no nodes available) [06:55:49] (in our case - no nodes available for the given catalog) [06:56:12] not particularly happy about this, Trino might have a little bit more flexibility [06:56:21] anyway, the presto upgrade should be unblocked :) [07:09:46] !log stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode [07:09:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:12:06] !log stop airflow on an-airflow1001 (prep step for bigtop) [07:12:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:13:23] elukey: I'm going to manually kill all jobs except Flink - they are users jobs [07:14:00] joal: sure [07:19:51] !log Killing yarn users applications [07:19:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:20:03] elukey: sending a message to slack to tell we start [07:23:39] joal: let's advertise that the actual downtime will happen after the backup, just in case [07:23:49] Ack elukey - makes sense [07:24:00] better: let's say that we are draining the cluster etc.. (prep steps) [07:25:11] in the meantime I am rolling out the new apt config (plus the hive/yarn settings that the cluster needs) [07:25:16] ack elukey [07:25:23] hello teammmmm :] [07:25:39] Good morning mforns :) [07:25:43] hola mforns! [07:25:47] morning [07:26:54] ahahha did you see https://phabricator.wikimedia.org/T273741 ? 
We hit hacker news :D [07:27:42] :) [07:28:06] !log roll out new apt bigtop changes across all hadoop-related nodes [07:28:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:30:46] elukey: distcp is happier with the healthier backup namenode :) [07:34:43] subtitle: distcp complained about Luca not doing his job correctly and it was right :D [07:38:11] not exactly :) [07:38:30] elukey: last run of distcps has started, so far so good [07:39:31] elukey: there is a script somewhere restarting a spark job for analytics-search [07:39:40] elukey: ores_bulk_ingest.py [07:39:54] I kill the app, it's back in no time [07:39:59] ahahahh [07:40:05] :( [07:40:12] so it cannot be on an-launcher1002 [07:40:14] dcausse: if by any chance you're nearby [07:40:23] lemme check puppet [07:40:27] sure elukey [07:41:23] not in there [07:42:18] maybe it is a stat100x timer [07:42:28] elukey: could be a cron :( [07:44:56] don't find it on stat mm [07:45:14] could be from airflow machine? [07:46:37] elukey: launched by Erik - I'm assuming the cron/timer is on one of discovery machine [07:48:19] joal: in theory no, they don't have access to our network [07:48:24] hm [07:49:14] spark.yarn.dist.archivesfile:///home/ebernhardson/ores_venv.zip#venv [07:49:27] not great paste buuut it is Erik for sure :D [07:49:41] :) [07:49:44] srv/deployment/wikimedia/discovery/analytics/spark/wmf_spark.py [07:50:05] lemme check Erik's crons [07:50:07] ack [07:51:05] nothing on the stat boxe [07:51:08] *boxes [07:51:09] :( [07:51:14] an-launcher? [07:52:11] he doesn't have access to it [07:52:40] elukey: it's Erik we're talking about :) [07:53:22] I know I know :) [07:54:32] nothing on airflow1001 that I can see [07:55:18] elukey: I think the thing could be a process launched in backgrounf [07:55:23] instead of a timer [07:55:44] a process polling yarn every minute of so, checking that the app is running and relaunching if needed [07:56:25] could be a good bet.. so I found /home/ebernhardson/ores_venv.zip on an-airflow1001 [07:56:32] lemme check again [07:56:32] hm [07:57:08] bingo :) [07:57:39] it is a tmux yes [07:57:47] \o/ [07:58:14] that I have to kill now though.. 
hopefully it is fine [07:58:24] elukey: +1 [07:58:47] joal: done [07:59:14] elukey: yarn app killed [07:59:18] checking [07:59:46] all good [08:00:01] elukey: We can continue to tear-down the tools :) [08:00:09] Thanks for finding the script :) [08:01:59] :) [08:02:04] so the new apt config is deployed [08:03:04] !log stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade [08:03:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:04:54] done [08:05:11] !log stop oozie an-coord1001 - prep step for bigtop upgrade [08:05:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:05:17] joal: --^ [08:05:24] ack elukey [08:05:41] done, lemme know when ok to go for hive [08:05:51] elukey: I'm monitoring distcp jobs - still quite some to go, but it's moving [08:05:55] please go elukey [08:06:42] David is killing flink [08:06:55] awesome - thanks dcausse [08:07:00] !log stop hive on an-coord100[1,2] - prep step for bigtop upgrade [08:07:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:08:01] all right done [08:08:17] !log stop jupyterhub on stat100x [08:08:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:15:12] joal: if you are using the esam bastion change it, Arzhel is doing maintenance :D [08:15:21] ack elukey [08:15:26] flink job is killed but there's still the session cluster (application_1609949395033_179785) running, can't login to stat machines I can't kill it [08:15:37] ack dcausse - doing it now - thanks mate [08:16:06] !log Kill flink yarn app [08:16:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:16:09] done [08:16:23] thanks! [08:19:19] !log umount /mnt/hdfs from all nodes using it [08:19:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:19:54] ok joal we are ready [08:20:08] elukey: wait wait wait, i'm not !!!! :) [08:20:13] I can also set safemode if you want [08:20:19] ahahah nono I mean for the backup [08:20:25] I know - joking :) [08:21:24] last backup run has started, 'big' stuff is mostly done, 'numerous' stuff not yet [08:21:35] one thing that we forgot is the million cron jobs scheduled on the stat boxes [08:21:43] but we are not going to stop all of them : [08:21:44] :D [08:21:51] nope :) [08:22:08] empty cluster - \o/ [08:24:00] elukey: let's activate safemode if easy [08:25:27] !log set hdfs safemode on for the Analytics cluster [08:25:29] joal: done [08:25:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:26:01] ok so now we wait one/two hours [08:26:09] we are on track timing wise [08:26:23] elukey: I hope I can make it faster, but really not sure [08:27:40] elukey: you can go for coffee, I'm on backup copy :) [08:28:43] elukey: actually no [08:28:57] elukey: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot issue delegation token. Name node is in safe mode. [08:29:04] I had not forsee that!!! 
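For reference, the safe-mode toggling logged just above uses the stock hdfs dfsadmin subcommands, run as the HDFS superuser on this kerberized cluster (auth details omitted); the distcp failure happens because launching a new job needs a fresh delegation token, which the namenode refuses to issue while in safe mode. A minimal sketch:

    sudo -u hdfs hdfs dfsadmin -safemode enter   # metadata read-only; no new delegation tokens issued
    sudo -u hdfs hdfs dfsadmin -safemode get     # check the current state
    sudo -u hdfs hdfs dfsadmin -safemode leave   # back to normal so distcp jobs can still be launched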
[08:29:06] looool [08:29:19] elukey: already running jobs are ok, but new ones can't be launched [08:29:23] MEH [08:29:44] !log leave hdfs safemode to let distcp do its job [08:29:44] meaning: you can't set hdfs in read-only mode and use it [08:29:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:29:47] how bad [08:30:04] fixed elukey - job started [08:30:10] it is also called "safe mode" so maybe it is considered more than read only [08:30:12] thanks [08:30:17] yeah you're right [08:30:19] but interesting indeed [08:30:23] I wouldn't have thought about it [08:30:26] anyway, all good :) [08:30:30] :) [08:30:38] I am going to stop replication on mysqls in a bit [08:30:57] sure [09:44:20] elukey: last 2 jobs running [09:45:03] * elukey dances [09:45:19] elukey: unfortunately it still represents some time :( [10:04:05] backup of hive/oozie dbs done [10:04:25] !log stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108 [10:04:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:20:24] joal: going to step away for ~15 mins, I did everything before the stop cluster step [10:20:31] once distcp finishes we can start [10:21:19] exactly elukey - I monitor, but it's long :( [10:54:27] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [10:54:43] extended the maintenance to 14PM [10:55:34] I am going to quickly eat something before starting [10:56:58] ack elukey - the last distcp jobs are in the long phase of commiting (deletion of old phase) [10:57:20] elukey: this part is not parallelized, and going over millions of files take time :( [10:58:00] elukey: If you prefer, I can also stop the jobs, as I am sure that all the needed data is copied with the correct perms - We might just have a little bit more than needed :) [11:02:43] I target the job being fully done in less than 1/2h - I'm sorry for the extended wait :( [11:08:15] nono let's do things properly :) [11:08:31] half an hour is good to avoid headaches later on [11:08:43] also I hope that the cookbooks will do their job in a reasonable amount of time [11:08:58] ack elukey [11:09:06] only one letf [11:31:36] elukey: We're ready :) [11:32:25] wow [11:32:32] all right, stopping the cluster joseph [11:32:46] BANZAI [11:32:52] sending a message on slack [11:35:59] crossing fingers [11:36:11] hdfs save namespace is taking a bit but it is expected with such a cluster [11:36:49] * mforns back in 40 mins [11:37:27] elukey: imagine a thousand-nodes cluster :) [11:37:59] I cannot :D [11:38:09] hdfs namenode federation etc.. [11:38:12] what a mess [11:38:24] in such case you hire hadoop committers :D [11:38:28] yep [11:38:59] I have become a distcp aficinados - I can tell where to improve :) [11:40:05] it is now stopping daemons [11:43:22] hdfs datanodes are stopped gently so 60 of them take time :) [11:44:32] then master nodes and finally journals [11:44:49] once done we'll start the upgrade [11:47:20] it should take ~10/15 mins more [12:00:32] joal: we are at the journalnodes, after them (5) we should be all down [12:03:13] ack elukey - ready for anything I can help with :) [12:04:52] joal: ok kicking off the upgrade :) [12:05:05] Ack! 
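The backup copy discussed above is plain DistCp between the production and backup clusters; the slow final phase joal mentions is the non-parallelized commit/cleanup step at the end of the job. A rough sketch of such an incremental run (the nameservice names and the path are placeholders; the flags are stock DistCp options for copying only changed files while preserving ownership and permissions):

    # incremental copy, pruning target files that no longer exist on the source
    hadoop distcp -update -delete -pugp \
      hdfs://analytics-hadoop/wmf/data/SOME_DIR \
      hdfs://analytics-backup-hadoop/wmf/data/SOME_DIR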
[12:10:04] it is upgrading the worker nodes [12:13:25] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [12:14:03] this step should take 15ms more in theory [12:14:08] 15mins [12:14:20] ms would have been a too high bar to reach :D [12:14:25] :) [12:21:08] good morning team, reading backscroll [12:21:16] good morning fdans :) [12:22:56] hellooo fdans :]\ [12:26:21] good morning [12:26:43] fdans: so we are upgrading the packages basically, ETA 20 mins more or less for master+workers [12:26:55] then we'll go through clients one-by-one testing [12:27:13] all steps in https://etherpad.wikimedia.org/p/analytics-bigtop-upgrade [12:28:08] the / partition is full on an-master1001, known? [12:28:30] wow - thanks moritzm [12:28:43] elukey: --^ [12:29:19] ouch [12:29:51] +Well, 9G of logs in /var/log/hadoop-hdfs [12:30:27] I also recommend apt clean to clean out /var/cache/apt/archives (another 2.5G) [12:30:57] (And to find these things `sudo ncdu -x /`) [12:31:55] I dropped the namenode saved for the kerberos migration (LOL) and now we are down to 85% usage [12:31:59] moritzm: thanksssss [12:32:26] klausman: +1 on apt clean, doing it [12:32:48] we are at 79% all good [12:32:52] for the logs we have a task about it [12:32:57] Also, the hadoop logs are not compressed. Are they rotated by hadoop itself or by logrotate? [12:33:20] by hadoop itself, I hope with the new version to have the log4j version to be able to do it [12:33:26] Ack. [12:33:41] also if you check lvs there are a lot of things unbalanced [12:34:13] TIL ncdu [12:34:19] Ah, today's biznis has created a shit-ton of audit logs [12:34:29] yep yep [12:34:34] Basically, hdfs-audit.log.* is from today [12:35:01] in this case I think that distcp is the one to blame [12:35:13] since it just tried to scan 50M+ files for changes :D [12:35:22] Soory :S :) [12:35:41] it must have been ongoing for some time (a few days at least) [12:35:51] -rw-r--r-- 1 hdfs hdfs 268435648 Feb 9 09:59 hdfs-audit.log.20 [12:35:53] Not really. [12:35:59] -rw-r--r-- 1 hdfs hdfs 268435570 Feb 9 10:29 hdfs-audit.log.1 [12:36:23] half an hour of doing things -> 5.1G of logs. [12:36:29] this was from this morning probably, I suspect all getfileinfo RPC calls [12:36:43] yes yes it would make sense [12:37:08] klausman: you can drop 3/4 of the last audit ones if you have time [12:37:09] Yep, about half a million of those per logfile [12:37:24] We have twenty files, so drop the fifteen oldest? [12:38:15] let's say 10 [12:38:20] ok [12:38:26] thanks :) [12:38:46] And done. 12G free/73% used on / now [12:39:54] <3 [12:39:56] I presume the two backup tars in /root are temporary and will be removed once we're done with today's migration? [12:40:02] I'm surprise we have no older audit files - I run regular distcp jobs since a few days [12:40:04] weird [12:40:29] Well, I dunno how many files log4j is keeping around. It might be that the traffic created today expired everything older [12:40:45] The number 20 sounds like a limit of that kind. [12:41:06] agreed [12:43:11] ok so the hdfs daemons are ok after the install, the yarn nodemanager are a little angry that the RM is not up, but all good [12:43:49] now I am going to re-enable slooowly on worker nodes [12:44:05] * klausman puts fingers in ears and waits for explosion [12:44:59] thanks for the trust :D [12:45:59] changes are ok in the canary, proceeding with all [12:52:25] joal: in the meantime, can you folks coordinate on what to test etc..? 
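The an-master1001 cleanup above comes down to three shell steps; the audit-log pruning keeps the ten newest rotated files, matching the "let's say 10" agreement:

    sudo ncdu -x /        # interactively find what fills the root partition
    sudo apt clean        # drop cached .deb files from /var/cache/apt/archives
    # remove all but the 10 newest rotated HDFS audit logs
    cd /var/log/hadoop-hdfs
    ls -t hdfs-audit.log.* | tail -n +11 | xargs -r sudo rm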
[12:52:32] sure [12:52:39] <3 [12:52:53] I'm available to test anything [12:53:15] so the first step will be hdfs only, then we'll upgrade stat100x and hive etc.. [12:53:20] about tests: ack [12:55:49] namenode 1001 loading inodes [12:55:55] fdans: for hdfs, lets both do stuff (read, touch files, try other users) [12:56:22] joal: sounds good [12:56:28] fdans: then when ready there will be: Hive, beeline, spark, oozie + hue (AQS rerun) [12:56:40] klausman: can you please keep an eye on an-master100[1,2] disk usage if you have time? [12:57:18] the only thing we don't test manually is camus [12:57:26] Will do [12:57:34] elukey: do you wish we do that (test camus manually) ? [12:58:02] joal: we can restart the timer, but later on :) [12:58:22] sure elukey - I can also devise a manual run of something else [12:58:28] just to test the mechanics [12:59:47] joal: which host should I be testing hdfs stuff? [13:00:43] fdans: let's here from elukey about the ones updated [13:01:13] joal: sorry, got ahead of myself :) I'm writing down a list of commands to test [13:01:26] no prob fdans [13:01:27] :) [13:08:17] the namenode is still not out of safemode [13:10:04] me too will test! [13:10:59] Yay - MOAR TESTERS :) [13:11:59] so the namenode sees only 26 workers up, instead of 59 [13:12:02] going to check why [13:15:14] disk space is still unchanged, btw [13:16:17] so some datanodes try to contact 1002 for some reason, going to upgrade the standby as well, since 1001 is active [13:17:27] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10mforns) @Ottomata Is a sanitization refine job needed? I thought the current sanitization job was sanitizing everything that is inside the event database. My id... [13:20:17] and 1002 is fetching the fsimage from 1001, so it will take a bit [13:24:52] it is taking a long time to move few GBs, but I checked via tcpdump and things are moving [13:25:23] ok cookbook finished [13:25:31] now let's see how 1002 behaves [13:26:07] it seems as if half of the datanodes need to try to contact the standby [13:26:11] and half the active [13:26:29] elukey: round robin? [13:30:52] no idea... 
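For the two things being puzzled over here, whether the namenode has left safe mode and how many datanodes have actually reported in, the stock client commands are enough; the HA service ids in the last two lines are placeholders for whatever dfs.ha.namenodes.* ids this cluster configures:

    sudo -u hdfs hdfs dfsadmin -safemode get                        # still in safe mode?
    sudo -u hdfs hdfs dfsadmin -report | grep -i 'live datanodes'   # how many workers have checked in
    sudo -u hdfs hdfs haadmin -getServiceState NN_ID_1              # active or standby?
    sudo -u hdfs hdfs haadmin -getServiceState NN_ID_2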
[13:30:59] so the standby is up and running [13:31:42] I am restarting one datanode that doesn't show up to see how it goes [13:32:06] never happened in testing [13:32:59] very slow to start [13:33:11] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:33:39] PROBLEM - Hadoop DataNode on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:36:15] RECOVERY - Hadoop DataNode on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:36:16] for some reason some datanodes are refusing to read the datanode dirs [13:36:23] :( [13:38:27] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:38:41] !log restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report) [13:38:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:01] !log restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs [13:39:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:56] ok I see some progress [13:41:55] elukey: do you wish to brainstorm? [13:42:47] elukey: I have access to the yarn UI - it says 56 workers [13:43:45] s/56/59 [13:43:48] joal: yeah but those are yarn node managers, datanodes are in trouble [13:43:57] I read your note yes [13:44:06] do you wish to explain in the cave? [13:44:09] elukey: --^ [13:45:07] joal: gimme a second to check [13:45:18] of course elukey [13:50:11] elukey: I don't know how ou do but the number of nodes seen by the active namenode grows every now and then [13:50:29] joal: we can bc [13:50:36] sure elukey - I'm there [13:54:11] fdans: can you do PR and advertise that it is taking more than expected? [13:54:35] elukey: understood [13:54:51] no ETA to be given yet right? [13:56:14] fdans: I'd say a couple of hours, we are having trouble in bootstrapping hdfs sadly [13:56:19] for some unknown issue [13:58:02] elukey: [13:58:14] (srry, pressed enter accidentally) [13:58:57] elukey, all good, let's all take a deep breath and carry on, I'm informing :) [14:01:57] o/ just getting on checking email! 
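The per-node restarts that follow bump the datanode JVM heap so the post-upgrade scan of the datanode dirs can complete; a rough sketch of that kind of one-off restart, assuming the stock hadoop-env.sh variable and a systemd unit named hadoop-hdfs-datanode (both assumptions for these hosts), with puppet paused so the manual change is not reverted mid-bootstrap:

    sudo puppet agent --disable "bigtop upgrade: temporary 16g datanode heap"
    # in /etc/hadoop/conf/hadoop-env.sh (stock variable name, assumed here):
    #   export HADOOP_DATANODE_OPTS="-Xmx16g $HADOOP_DATANODE_OPTS"
    sudo systemctl restart hadoop-hdfs-datanode
    sudo journalctl -u hadoop-hdfs-datanode -f    # watch the datanode dir scan progress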
[14:03:51] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [14:04:19] PROBLEM - Presto Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [14:05:41] downtimed --^ [14:08:24] !log restart analytics1069's datanode with bigger heap size [14:08:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:10:03] ottomata: we are in bc [14:10:07] trobles sadly :( [14:10:34] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10Ottomata) Oh, my understanding was that tables that only tables that didn't have underscores in them were dropped from event. Hm, I suppose we should add entri... [14:12:08] elukey: coming in a just a few [14:29:29] PROBLEM - Hadoop DataNode on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:29:31] PROBLEM - Hadoop DataNode on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:31:10] downtiming [14:37:27] RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:40:07] RECOVERY - Hadoop DataNode on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:50:41] !log restart analytics1069 with 16g heap size to allow bootstrap [14:50:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:50:49] !log restart analytics1072 with 16g heap size to allow bootstrap [14:50:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:54:49] !log restart an-worker1090 with 16g heap size to allow bootstrap [14:54:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:57:15] !log restart an-worker1102 with 16g heap size to allow bootstrap [14:57:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:32] !log restart an-worker1103 with 16g heap size to allow bootstrap [15:01:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:35] !log restart an-worker1104 with 16g heap size to allow bootstrap [15:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:18:50] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [15:32:27] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) [15:38:49] !log restart namenode on an-master1002 [15:38:59] Logged the message at 
https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:55:25] !log restart datenode on an-worker1115 [15:55:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:58:44] !log restart datanode on analytics1058 [15:58:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:08:06] !log restart datanode on an-worker1080 withh 16g heap [16:08:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:14:26] !log restart datanode on analytics1059 with 16g heap [16:14:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:30:58] !log restart datanode on ana-worker1100 [16:31:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:39:46] (03CR) 10Milimetric: "looks great, small note about the console.log" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: 10Lex Nasser) [16:44:32] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (10Milimetric) a:05Milimetric→03Ottomata [16:59:32] 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [17:45:58] (03PS1) 10Nettrom: Remove the Growth schemas from the whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663020 (https://phabricator.wikimedia.org/T273826) [17:52:58] 10Analytics-Radar, 10Product-Analytics, 10Growth-Team (Current Sprint), 10Patch-For-Review: Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10nettrom_WMF) [17:53:58] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10mpopov) @fdans: Maybe? I'm not sure what modifications @mforns did to the standard sanitization process to enable 270 days of retention for @nettrom_WMF's... [18:10:25] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10LGoto) [18:11:04] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10LGoto) p:05Triage→03Medium [18:27:01] hello a-team, I'm not able to connect to jupyterhub. I'm trying with "ssh -N stat1004.eqiad.wmnet -L 8880:127.0.0.1:8880" as usual, but this time it fails with error "channel 2: open failed: connect failed: Connection refused". any ideas? 
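The "connection refused" on the forwarded port is expected at this point: JupyterHub was stopped on the stat hosts earlier as part of the prep ("stop jupyterhub on stat100x"), so nothing is listening on 8880 for the tunnel to reach. A quick client-side check (the systemd unit name is an assumption):

    # the tunnel from the question
    ssh -N stat1004.eqiad.wmnet -L 8880:127.0.0.1:8880
    # in another terminal: is anything listening on the remote end?
    ssh stat1004.eqiad.wmnet 'ss -tln | grep 8880; systemctl status jupyterhub --no-pager'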
[18:27:37] daniram: there's a big upgrade in progress, the cluster is going through a lot at the moment :) [18:28:09] (was announced on the mailing lists and various slack channels, status is that it's in progress and will be for a few more hours) [18:31:39] I didn't know:)  thanks [18:34:12] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:35:10] daniram: you should sign up for this mailing lislt [18:35:10] https://lists.wikimedia.org/mailman/listinfo/analytics-announce [18:35:11] :) [18:37:03] ottomata: None of us has an anaconda notebook at hand to test (not yet but soon) - would you have one? [18:37:24] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:37:24] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:38:24] RECOVERY - Presto Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [18:39:10] thanks ottomata :) [19:06:16] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) @fdans : The parent task asks to revert the changes made in T237124. There's also the second child task, T273826, for updating the whitelist.... [19:13:38] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: remove deletion timers for Growth's sanitized EL tables - https://phabricator.wikimedia.org/T274297 (10nettrom_WMF) [19:18:42] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) [19:20:44] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) I updated the task description to make the ask clear. I've also updated the description of the parent task to make it clearer what we're look... [19:21:49] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) [19:35:40] joal i have anaconda envs/notebooks on stat1008 I can test on... 
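With the Hive metastore, Oozie and Presto checks recovering above, a short smoke test from a stat host is enough to confirm the client side again; a sketch, assuming a valid Kerberos ticket and the usual data-lake paths (the exact path and table are illustrative):

    kinit                                   # this cluster is kerberized
    hdfs dfs -ls /wmf/data | head           # HDFS reads through the upgraded namenodes
    hive -e 'SHOW DATABASES;'               # metastore is answering
    hive -e 'SELECT COUNT(1) FROM wmf.webrequest WHERE year=2021 AND month=2 AND day=9 AND hour=0;'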
[19:35:59] Hi fkaelin - Thanks for the offer, andrew tested :) [20:05:33] PROBLEM - Hive Metastore on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:06:27] !enabling puppet and runnin puppet on an-launcher1002, restarting all timers there [20:22:50] RECOVERY - Hive Metastore on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:31:25] !log Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported to camus [20:31:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:34:34] 10Analytics-Radar, 10Better Use Of Data, 10MediaWiki-API, 10Platform Engineering, and 2 others: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414 (10sdkim) >>! In T108414#6812482, @mpopov wrote: > Also, with the addition of SQL Lab & Presto to Su... [20:35:33] 10Analytics: Make camus (or gobblin) jobs run in `essential` or `production` queue - https://phabricator.wikimedia.org/T274298 (10JAllemandou) [20:40:11] joal [20:40:16] https://www.irccloud.com/pastebin/KFLW8XKx/ [20:40:16] Yes :) [20:40:19] thanks :0 [20:50:37] !log rebalance kafka partitions for codfw.resource-purge [20:50:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:51:30] !log Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported to camus [20:51:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:18:24] fdans: checked that there's no other 'rows' literal in any hive query under refinery/oozie [21:19:20] 10Analytics-Radar, 10Better Use Of Data, 10MediaWiki-API, 10Platform Engineering, and 2 others: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414 (10bd808) >>! In T108414#6816549, @sdkim wrote: > Given there is no data being collected and this ta... 
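The check fdans describes above (no other 'rows' literal in any Hive query under refinery/oozie) is a whole-word grep over the HQL sources; one way to run it from a refinery checkout (flags are just one reasonable choice):

    grep -rnw --include='*.hql' -e 'rows' oozie/
    # case-insensitive variant, in case a query spells it Rows/ROWS
    grep -rnwi --include='*.hql' -e 'rows' oozie/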
[21:28:41] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:29:07] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:30:27] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:31:29] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:00:17] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:02:23] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:04:05] !log rebalance kafka partitions for eqiad.resource-purge [22:04:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:11:59] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:14:35] RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:23:14] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:23:50] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:25:58] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:29:14] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:31:35] 10Analytics-Radar, 10MediaWiki-API, 10Patch-For-Review, 10Platform 
Team Initiatives (Modern Event Platform (TEC2)), 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Mholloway) I could probably pick this up as a 10%-ish exerci... [22:51:21] paste for refine URI stacktrace [22:51:23] https://www.irccloud.com/pastebin/qfy1lpD8/ [23:00:29] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:06:23] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:08:23] RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:10:19] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:22:28] 10Analytics, 10Product-Data-Infrastructure, 10Wikimedia-Logstash, 10observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (10colewhite) >>! In T265938#6812467, @Krinkle wrote: > What does "presented alongside" mean? The ability to issu... [23:28:46] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:31:16] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:33:56] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:44:13] 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10JAllemandou) Problems we have found a solution for (even if not great): * Hive jobs failing due to new reserved keywords in HQL * Hive jobs failing due to UDF type problem whe...
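On the "new reserved keywords in HQL" item above: newer Hive versions treat words such as ROWS (part of the windowing syntax) as reserved, so queries that used them as plain column aliases start failing after the upgrade; the usual fix is to backtick-quote the identifier or rename it. Illustration of the quoting, run through the hive CLI (table and partition values are only for the example):

    hive -e "SELECT COUNT(1) AS \`rows\` FROM wmf.webrequest WHERE year=2021 AND month=2 AND day=9 AND hour=6;"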