[04:09:38] <wikibugs>	 (03PS5) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556)
[04:09:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric)
[09:15:03] <elukey>	 joal: o/
[09:15:28] <elukey>	 I have all the patches ready to go for the Hadoop heap size and prometheus changes
[09:16:04] <elukey>	 but since some environment files are changing I'd wait for Andrew to triple check this afternoon
[09:16:33] <elukey>	 in the meantime, I am going to reboot analytics1029 and 1030 (this one has the increased Xmx settings) to verify that kernel + jvm are ok
[09:25:34] <elukey>	 all the changes are in https://puppet-compiler.wmflabs.org/compiler02/9084/
[09:25:45] <elukey>	 basically what I'd like to get a +1 for is the following
[09:26:43] <elukey>	 1) hadoop-env.sh is the same across all the analytics nodes, but not all the variables are used on every node. For example, YARN_RESOURCEMANAGER_OPTS will not be picked up by a worker node
[09:27:05] <elukey>	 due to some hiera refactoring, YARN_RESOURCEMANAGER_OPTS will only be configured on analytics100[12]
[09:27:17] <elukey>	 2) same thing for yarn-env.sh
[09:27:51] <elukey>	 (sorry bad examples, YARN_RESOURCEMANAGER_OPTS is contained in yarn-env.sh, but you got the point :)
[09:30:47] <elukey>	 added comments to https://gerrit.wikimedia.org/r/#/c/394256/
[09:32:13] <joal>	 Hi elukey - reading a bit
[09:33:03] <joal>	 elukey: I think \I don't understand the point of yarn-env.sh
[09:36:12] <elukey>	 sorry my bad
[09:36:27] <elukey>	 so YARN_RESOURCEMANAGER_OPTS should be set in yarn-env.sh, not hadoop-env.sh
[09:36:45] <elukey>	 I used the wrong variable to make the hadoop-env.sh's example
[09:37:09] <elukey>	 I should have used say HADOOP_NAMENODE_OPTS
[09:39:29] <joal>	 Ahh, I get it now
[09:42:57] <elukey>	 does it make any sense?
[09:43:15] <elukey>	 (in the meantime, 1029 and 1030 rebooted fine and are running with new kernel+jvm)
[09:44:00] <joal>	 Yay !
[09:44:48] <joal>	 elukey: I think it makes sense, but given my non-knowledge of hadoop-puppet codebase, naming convention issues are not super easy to grasp ;)
[09:46:25] <joal>	 Also elukey, I'm a bit suprised we use 4G for nodemanagers and 2G for resource manager - I would have expected the latter to be more greedy
[09:49:40] <elukey>	 joal: yeah but from https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=12&fullscreen&orgId=1&from=now-7d&to=now it seems doing really fine with 2G
[09:50:14] <elukey>	 the nodemanagers are a bit busier - https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=17&fullscreen&orgId=1&from=now-7d&to=now
[09:50:45] <elukey>	 the blue line that spikes up to 4G (almost) is analytics1030, that is running with the new settings
[09:52:02] <joal>	 elukey: is it better to have less of those bumps? Cause the pattern looks the same as the blue line for others, except with a smaller scale
[09:54:03] <elukey>	 joal: bigger bumps will translate in less frequent young GC collections and old gen collections as far as we saw with the past Xmx tunings
[09:54:37] <elukey>	 once in a while a old gen collection (longer than the actual ones, but not much more) will happen dropping not used objects
[09:56:28] <elukey>	 I also wanted a bit more heap to have room for the prometheus agent, that should not be heavy of course but I didn't want it to impact in any way with the datanode/nodemanager business
[09:58:12] <elukey>	 let me know your thoughts, I am all open for discussion
[09:58:42] <joal>	 I get the old/new GC frequency pattern change - I am just not good enough in GC to know how better it'll be to do less frequent old-GC passes
[09:59:31] <joal>	 I'm assuming it should be better by not slowing down the JVM process as regularly, but don't know if it'll impact a lot the system perfs
[10:00:01] <joal>	 anyway elukey - giving some more meory for the system to work presumably better is a good idea :)
[10:01:11] <elukey>	 it has worked reasonably well so far (especially the -Xms == Xmx settings) so I hope that it will give some breath to nodemanagers/datanodes 
[10:01:26] <joal>	 sounds good elukey :)
[10:01:37] <elukey>	 it will reduce, of course, the theoretical maximum containers that a single node can allocate
[10:01:48] <elukey>	 but it is never a win/win :D
[10:02:10] <joal>	 elukey: we are not resource constraints so much in grid-computing space, so if our managers feel better, it's all good :)
[10:02:48] <elukey>	 joal: super good to have these brain bounces with you, it is good since we are all in sync with what happens with the cluster
[10:03:09] <elukey>	 and I can also get a more experienced feedback from you and Andrew
[10:03:36] <elukey>	 so please feel free to stop me anytime when I do some change that you don't trust completely :)
[10:03:37] <joal>	 Many thanks to you elukey for leting me know - I don't provide value for real, but at least I know, and ask questions :)
[10:04:54] <joal>	 elukey: if you have a minute, would you mind giving your opinion on https://phabricator.wikimedia.org/T176785 new name for metric? I'll ask ottomata as well, and possibly we could merge today for a deploy next week
[10:12:12] <elukey>	 joal: checking now!
[10:15:17] <elukey>	 +1 to have only analytics.mw_api.varnish_request added
[10:15:44] <elukey>	 this will be a clear and new namespace that eventually restbase should move too
[10:16:00] <elukey>	 it shouldn't be difficult in grafana to use both
[10:16:09] <elukey>	 in the same graph I mean (old name + new name)
[10:16:15] <elukey>	 to have continuity in datapoints
[10:16:29] <elukey>	 but this is of course a choice for the services team :)
[10:17:41] <joal>	 It is - And from what I read on that ticket, I have the feeling thay don't want to have to hcange dashboards :)
[10:20:26] <joal>	 ok elukey, i have your +1, I'll update the oozie patch, nad wait for ottomata +1 to merge
[10:39:31] <elukey>	 about the remaining dbs to check in https://phabricator.wikimedia.org/T156844#3785514
[10:39:34] <elukey>	 for db1047
[10:39:59] <elukey>	 I am pretty sure that they are useless, since 3 are belonging to past employees
[10:40:10] <elukey>	 and one is Dario's, but few GBs
[10:40:20] <elukey>	 so my plan is to just mysqldump them to stat1006
[10:40:27] <elukey>	 and then proceed with the decom
[10:41:44] <joal>	 elukey: do we delete database shawn as suggested by JMo?
[10:42:23] <elukey>	 yep yep, I'll just let Chris destroy the disks during decom :)
[10:42:40] <elukey>	 I am planning to dump dartar, zexley and diederik
[10:43:27] <elukey>	 so db104[67] will be free to go through decom
[10:44:19] <joal>	 okey
[10:45:18] <joal>	 elukey: I'm actually pretty sure the DBs for zexley and diederik will never be used, but we've not received confirmations, so let's not drop :)
[10:45:44] <elukey>	 I am sure too, but better be safe :)
[10:45:51] <joal>	 :)
[10:56:06] <joal>	 elukey: about druid restarts - I'd like us to brainstorm on the correct way to do it given that it's FR-banner-period, meaning the realtime jobs are more important than during other periods
[11:01:01] <wikibugs>	 (03PS3) 10Joal: Change restbase job to also count mw_api requests [analytics/refinery] - 10https://gerrit.wikimedia.org/r/392703 (https://phabricator.wikimedia.org/T176785)
[11:02:58] <elukey>	 joal: yes definitely
[11:08:30] <wikibugs>	 10Analytics: Enhance mediawiki-history page reconstruction with best historical information possible - https://phabricator.wikimedia.org/T179692#3799233 (10JAllemandou)
[11:14:34] <elukey>	 I am trying to figure out from https://github.com/druid-io/tranquility/blob/master/docs/overview.md if there is a way to configure tranquillity in a way that prevents dataloss when a node is down
[11:14:45] <wikibugs>	 10Analytics: Implement digest-only mediawiki_history_reduced dataset in spark - https://phabricator.wikimedia.org/T181703#3799239 (10JAllemandou)
[11:15:21] <elukey>	 but probably the main issue is that we have on-hour segments right?
[11:15:22] <joal>	 elukey: IIRC we duplicate ingesting tasks - I don't know however how it works internally
[11:15:46] <fdans>	 joal: is it possible for me to load data to cassandra in beta from stat1005 if I specify that host in the job properties? 
[11:15:48] <joal>	 elukey: i don't know if it's 'main' issue, but it is one of them :)
[11:15:55] <joal>	 nope
[11:15:58] <joal>	 fdans: --^
[11:16:10] <joal>	 fdans: no access between hadoop and beta 
[11:16:39] <fdans>	 so I have to load on prod cassandra, on a test keyspace joal 
[11:17:49] <joal>	 fdans: that's correct - the beta deploy of AQS will help you with test-keyspace definition
[11:24:15] <elukey>	 joal: there is an interesting bit about replication
[11:24:17] <elukey>	 Replication involves creating multiple tasks with the same partitionNum. Events destined for a particular partition are sent to all replica tasks at once. The tasks do not communicate with each other. If a replica is lost, it cannot be replaced, since Tranquility does not have any way of replaying previously sent data.
[11:27:59] <elukey>	 I am wondering if having two replicas for each real time would be enough
[11:28:41] <elukey>	 the "task.replicants" setting
[11:28:48] <elukey>	 from https://github.com/druid-io/tranquility/blob/master/docs/configuration.md
[11:29:52] * elukey is probably talking nonsense
[11:30:33] <wikibugs>	 (03PS1) 10Joal: Fix mediawiki page history reonstruction [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/394279
[11:31:01] <joal>	 elukey: currently looking at tranquility conf in banner job
[11:31:39] <elukey>	 I am also assuming that the replica will not be on the same host :D
[11:31:45] <elukey>	 but it might be a strong assumption 
[11:31:49] <joal>	 elukey: I think we use 3 tranquility replicants
[11:33:29] <joal>	 elukey: indexing tab on druid-coord UI confirms
[11:34:43] <elukey>	 joal: checked with ps auxff | grep druid, the three realtime jobs are on druid1003 currently :(
[11:34:55] <joal>	 elukey: overlord confirms
[11:35:01] <elukey>	 children of middleManager
[11:35:04] <elukey>	 uffffff
[11:35:07] <joal>	 hm
[11:35:37] <elukey>	 joal: so we can definitely reboot two hosts :D
[11:35:43] <joal>	 :D
[11:36:11] <elukey>	 maybe it spawns those jobs on the druid overlord leader?
[11:36:40] <elukey>	 not sure if we can force a change 
[11:36:41] <joal>	 I think that's what it does
[11:36:49] <elukey>	 that would be handy
[11:37:15] <elukey>	 need to go in ~10 mins for lunch + errand, but I'll double check later on
[11:37:19] <elukey>	 very interesting
[11:37:27] <joal>	 elukey: later !!
[11:38:15] <elukey>	 it feels like the overlord leader do not control all the middle manager, but only its local one
[11:38:39] <joal>	 elukey: I have the same feeling, or something similar
[11:39:53] <elukey>	 " Overlords and middle managers may run on the same node or across multiple nodes while middle managers and Peons always run on the same node."
[11:40:22] <elukey>	 this is from http://druid.io/docs/0.9.2/design/indexing-service.html
[11:40:25] <joal>	 hm - this seems a bit contradictrory to me :)
[11:41:06] <joal>	 it probably means an overlord can manage middle-managers accross multiple machines, while a middle-manager always manages peons on its own machine
[11:41:32] <elukey>	 yep
[11:41:37] <elukey>	 all right going afk, ttl!
[11:41:43] <ema>	 tranquillity replicants
[11:41:47] <ema>	 druid overlords
[11:41:52] <ema>	 this channel sometimes sounds like #wikimedia-metal at times
[11:41:57] <joal>	 :D
[11:42:15] * joal listens to headbaging stuff
[11:42:21] <ema>	 :)
[11:42:36] <joal>	 how are you ema?
[11:43:40] <ema>	 joal: I'm alive! Yourself? :)
[11:44:20] <ema>	 (sorry for the interruption btw, I could not keep the #-metal joke to myself)
[11:44:23] <joal>	 mostly alive as well - a bit zombistic with the kids not letting me sleep as I'd like - but nothing major
[11:44:38] <joal>	 I've not yet starting to bite them, it's ok
[11:45:01] <joal>	 Thanks for sharing some laugh ema :)
[12:08:20] <wikibugs>	 (03CR) 10Joal: [C: 032] "Self merging fix" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/394279 (owner: 10Joal)
[12:12:04] <wikibugs>	 (03Merged) 10jenkins-bot: Fix mediawiki page history reonstruction [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/394279 (owner: 10Joal)
[12:13:55] <wikibugs>	 (03PS4) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[12:31:58] <joal>	 taking a 
[12:32:03] <joal>	 break a-team
[13:45:09] <elukey>	 joal: (whenever you are back after the break) - I found an interesting thing in http://druid.io/docs/0.9.2/configuration/indexing-service.html
[13:45:22] <elukey>	 selectStrategyHow to assign tasks to middle managers. Choices are fillCapacity, fillCapacityWithAffinity, equalDistribution and javascript.
[13:45:56] <elukey>	 but there seems to be not a lot documentation about it
[13:47:05] <elukey>	 it would be awesome if "fillCapacity" means "fill local first, then check remote" and "equalDistribution" means "equally distributed between all overlords"
[13:48:01] <elukey>	 so in the ideal world, we have three realtime jobs running, one for each druid node, and as long as we don't restart all the nodes in the same time window  (like one hour bucket) we are fine
[13:56:36] <elukey>	 ah nice there is documentation about them
[13:57:55] <elukey>	 so each overlord has the concept of "workers", that I didn't know
[13:57:59] <elukey>	 (checking the console now)
[13:58:13] <elukey>	 Fill Capacity
[13:58:14] <elukey>	 Workers are assigned tasks until capacity.
[13:58:20] <elukey>	 Equal Distribution
[13:58:21] <elukey>	 The workers with the least amount of tasks is assigned the task.
[13:59:06] <elukey>	 and this can be changed via POST 
[13:59:17] <elukey>	 so the default is Fill Capacity
[13:59:24] <elukey>	 that is basically what the overlord is doing
[13:59:46] <elukey>	 so I'd like to test equalDistribution
[14:00:25] <elukey>	 because if it behaves as I think, those three realtime indexers will likely be spread among the nodes, rather than being in one
[14:02:27] <wikibugs>	 10Analytics-Kanban, 10User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3799759 (10elukey)
[14:20:29] <ottomata>	 hiii elukey!  when you have some time : https://gerrit.wikimedia.org/r/#/c/394144/ :)
[14:20:44] <ottomata>	 i'm going to get cergen and certs created today, then that patch should work.
[14:22:34] <wikibugs>	 10Analytics-Kanban, 10DBA, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3799814 (10elukey) >>! In T156844#3786146, @Capt_Swing wrote: > The `shawn` table belonged to Shawn Walker, a research intern in 2011. Th...
[14:24:02] <elukey>	 ottomata: o/ - I have it on my list! 
[14:29:08] <ottomata>	 dankeee!
[14:34:07] <joal>	 Hi elukey - very interesting finding !
[14:35:34] <elukey>	 joal: what do you think about it? Worth a try?
[14:35:57] <joal>	 elukey: I'd like to better understand the concept of worker in that case, but yes, definitely
[14:36:29] <elukey>	 joal: I think it should be == to a task
[14:36:38] <elukey>	 buuuut ETOOMANYTERMS
[14:37:12] <joal>	 hm - :D
[14:37:44] <joal>	 I'm assuming a worker is a peon (or something similar) - started by middlemanager - and we have 1 worker per task
[14:38:07] <joal>	 but this doesn't fit the explanation for equal-distribution
[14:38:31] <joal>	 another try: worker is a middlemanager, and its work is all the work it manages with its peons
[14:38:43] <joal>	 Taking this definition, it works I think
[14:39:03] <elukey>	 # --- Druid MiddleManager
[14:39:03] <elukey>	 profile::druid::middlemanager::monitoring_enabled: true
[14:39:04] <elukey>	 profile::druid::middlemanager::properties: druid.worker.ip: "%{::fqdn}" druid.worker.capacity: 12
[14:39:11] <elukey>	 argh horrible paste
[14:39:17] <elukey>	 anyhow, druid.worker.capacity: 12
[14:39:29] <joal>	 ok, makes sense
[14:39:31] <elukey>	 that is the available workers per node from the console
[14:39:43] <elukey>	 "Maximum number of tasks the middle manager can accept."
[14:40:00] <elukey>	 so task should be == to worker
[14:40:21] <joal>	 more precisely, every taks is done by a worker :)
[14:40:31] <elukey>	 yes :)
[14:41:29] <elukey>	 ottomata: not sure if you've read my conversation with Joseph about the distribution of the tasks by the overlord, whenever you have time I'd like to know your opinion
[14:43:16] <ottomata>	 elukey:  just read backscroll from last couple of mins
[14:43:17] <ottomata>	 is there more?
[14:46:15] <joal>	 ottomata, elukey - I don't think so
[14:47:06] <joal>	 ottomata, elukey - reading http://druid.io/docs/latest/operations/rolling-updates.html, there seem to be a nice way to disable middlemanagers, therefore ensure new tasks get started on different nodes
[14:47:38] <ottomata>	 elukey:  not sure what you want opinion on!  :)
[14:48:00] <joal>	 elukey: if you agree, I suggest rolling-restart of druid100[12], disable middlemanager on druid1003, wait for task to be finished, then restart
[14:48:25] <elukey>	 one at the time :D
[14:48:30] <joal>	 huhu
[14:48:37] <wikibugs>	 (03PS5) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[14:49:35] <elukey>	 ottomata: so from what I can read in http://druid.io/docs/0.9.2/configuration/indexing-service.html there seems to be a way to set how the overlord schedules tasks to the middle managers.. atm we run the realtime jobs in three "replicas", but they end up on the same node since  selectStrategy is by default fillCapacity
[14:50:00] <elukey>	 but there is the option of "equalDistribution"
[14:50:12] <ottomata>	 ah
[14:50:22] <elukey>	 so I wanted to know if it makese sense in our case
[14:50:37] <ottomata>	 hmm, don't see why not...fill capacity will just fill up one node before assigning to the next?
[14:50:39] <elukey>	 joal: reading the doc! It is surely what we need, lemme see what it says
[14:50:49] <elukey>	 ottomata: IIUC yes
[14:52:24] <ottomata>	 elukey:  +1 then
[14:52:26] <ottomata>	 :)
[14:52:49] <elukey>	 joal: ahh nice! 
[14:52:51] <joal>	 elukey: here is what I suggest - we try to update overlords dynamically with JSON
[14:53:10] <joal>	 A new set of tasks will start very soon - could be fun to try
[14:53:15] <elukey>	 +1
[14:53:20] <joal>	 ok, doing so 
[14:53:23] <elukey>	 <3
[14:56:03] <elukey>	 ottomata: hahhaha I just realized that I +1 my own code review with the comments for your ssl kafka changes and didn't realize
[14:56:09] <elukey>	 just +1ed the correct one sorry
[14:58:32] <joal>	 !log Update druid overlord config to equalDistribution dynamically
[14:58:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:59:40] <joal>	 elukey: I'm watching druid overlord UI, checking for new tasks :)
[15:01:17] <elukey>	 me too!
[15:01:45] <joal>	 elukey: WORKS !
[15:01:56] <elukey>	 \o/
[15:02:06] * elukey dances
[15:02:11] <ottomata>	 haha
[15:02:11] <ottomata>	 :)
[15:02:18] <joal>	 elukey: let's wait for druid1003 to be empty (in 10 minutes, given realtime wait for that amount of time for late arrival of events)
[15:02:29] <joal>	 Then we can restart it :)
[15:02:49] <joal>	 elukey: Good catch on overlord config !
[15:03:05] <elukey>	 it is still not perfect since two replicas got to the same node, but it is good enough
[15:03:29] <joal>	 elukey: knowing there was 3 tasks on d1003, i's not surprising :)
[15:03:30] <elukey>	 I am going to read again the page that you linked to be sure that I don't miss anything!
[15:03:34] <elukey>	 yeah
[15:04:13] <elukey>	 the other thing is druid.indexer.task.restoreTasksOnRestart=true
[15:04:18] <elukey>	 really interesting
[15:04:24] <joal>	 elukey: Yessir
[15:04:38] <joal>	 elukey: I have seen that - could be interesting to set since we need to restart?
[15:04:47] <elukey>	 definitely
[15:04:54] <elukey>	 joal: if you agree I'd proceed in this way
[15:05:15] <elukey>	 merge the hadoop patches and start the rolling reboots of some nodes with new heap + prometheus
[15:05:25] <elukey>	 tomorrow morning druid, together with druid.indexer.task.restoreTasksOnRestart=true
[15:05:43] <elukey>	 and now we just see how equalDistribution works over night
[15:06:23] <joal>	 elukey: works for me except that tomorrow I'll do a late start - we do druid tomorrow around 14:00?
[15:07:28] <elukey>	 ack!
[15:07:35] <elukey>	 all right starting with hadoop then
[15:07:51] <joal>	 awesome elukey, thanks a lot :)
[15:07:59] <joal>	 Nice one: 16:05:25 < elukey> tomorrow morning druid, together with druid.indexer.task.restoreTasksOnRestart=true
[15:08:02] <joal>	 oops
[15:08:08] <joal>	 nice one: https://i.redd.it/h4ngqma643101.jpg
[15:09:10] <elukey>	 ahahahah
[15:13:10] <joal>	 ottomata: do you mind double cheking https://phabricator.wikimedia.org/T176785 ? I got a +1 from elukey, updated patches - waiting for your approval :)
[15:14:53] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review, 10Services (watching): Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3799976 (10Ottomata) I don't love it, but I don't have a better suggestion.  Proceed!
[15:14:53] <ottomata>	 commented
[15:19:32] <joal>	 Thanks ottomata
[15:20:31] <wikibugs>	 (03Abandoned) 10Joal: Update mw-history page reconstruction (restores) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/388444 (https://phabricator.wikimedia.org/T179690) (owner: 10Joal)
[15:21:48] <joal>	 ottomata: would you mind reviewing (can be quick) the 2 CRs associated with the task above?
[15:22:58] <elukey>	 people on analytics1029 we have the prometheus jmx exporter running for both nodemanager and datanode
[15:23:04] <elukey>	 plus the new heap settings
[15:23:04] <joal>	 elukey: very interestingly, indexing tasks shuffle between realtime and batch - that's great :)
[15:23:15] <joal>	 elukey: Great !
[15:23:44] <wikibugs>	 (03CR) 10Ottomata: [C: 031] Grow RestbaseMetrics spark job to count MW API [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/392700 (https://phabricator.wikimedia.org/T176785) (owner: 10Joal)
[15:24:14] <wikibugs>	 (03CR) 10Ottomata: [C: 031] Change restbase job to also count mw_api requests [analytics/refinery] - 10https://gerrit.wikimedia.org/r/392703 (https://phabricator.wikimedia.org/T176785) (owner: 10Joal)
[15:24:24] <ottomata>	 woohoo elukey!
[15:25:11] <joal>	 Thanks ottomata 
[15:29:46] <milimetric>	 nice about the equalDistribution
[15:41:11] <elukey>	 ouch just seen the oozie alers
[15:46:54] <elukey>	 can you guys access hue?
[15:47:01] <elukey>	 it gives me 500 and it times out
[15:47:13] <elukey>	 it happened already once days ago, and then recovered by itself
[15:47:38] <elukey>	 I can only see WebHdfsException: HTTPConnectionPool(host='analytics1001.eqiad.wmnet', port=14000): like exceptions
[15:47:46] <elukey>	 that are right since webhdfs is disabled
[15:50:52] <elukey>	 !log restart hue on thorium - timeouts and 500s
[15:50:53] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:51:21] <elukey>	 now it works
[15:52:07] <ottomata>	 hm, i wonder if we need to tell hue not to enable the hdfs browser app
[16:01:06] <nuria_>	 ping joal
[16:01:47] <wikibugs>	 (03PS6) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[16:13:00] <joal>	 elukey: oozie alerts are fdans testing
[16:13:47] <elukey>	 goooood
[16:19:07] <wikibugs>	 (03CR) 10Joal: [V: 032 C: 032] "Merging to deploy next week" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/392703 (https://phabricator.wikimedia.org/T176785) (owner: 10Joal)
[16:19:23] <wikibugs>	 (03CR) 10Joal: [C: 032] Grow RestbaseMetrics spark job to count MW API [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/392700 (https://phabricator.wikimedia.org/T176785) (owner: 10Joal)
[16:21:55] <joal>	 !log wikidata-wdqs_extract-wf-2017-11-30-15
[16:21:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:24:20] <wikibugs>	 (03Merged) 10jenkins-bot: Grow RestbaseMetrics spark job to count MW API [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/392700 (https://phabricator.wikimedia.org/T176785) (owner: 10Joal)
[16:25:27] <wikibugs>	 (03CR) 10Nuria: Fix mediawiki page history reonstruction (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/394279 (owner: 10Joal)
[16:52:09] <joal>	 elukey: are we restarting nodes currently?
[16:52:25] <ottomata>	 oh, haha
[16:52:28] <ottomata>	 was about to ask that too
[16:52:33] <ottomata>	 analytics1032 just get restarted?
[16:52:50] <elukey>	 yep
[16:52:56] <elukey>	 also 1033
[16:52:58] <ottomata>	 ok
[16:53:00] <ottomata>	 cool
[16:53:32] <joal>	 mwarf - it killed my spark job :()
[16:53:48] <joal>	 elukey: any plan on restarting others soon?
[16:54:20] <elukey>	 joal: yeah all the 103* row, but I can stop if you want
[16:54:58] <elukey>	 I can restart tomorrow morning
[17:01:08] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban: Purge refined JSON data after 90 days - https://phabricator.wikimedia.org/T181064#3800366 (10Ottomata) https://gerrit.wikimedia.org/r/#/c/392733/
[17:01:31] <wikibugs>	 10Analytics-Kanban: Check data from new API endpoints against existing sources  - https://phabricator.wikimedia.org/T178478#3800368 (10Nuria)
[17:06:12] <wikibugs>	 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3551186 (10Nuria) This requires ops dicussion, likely to get done next quarted (starting January)
[17:10:50] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Add "Interwicket" to the list of bots - https://phabricator.wikimedia.org/T154090#3800395 (10Nuria) a:05Nuria>03ezachte
[17:16:02] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Provide top domain and data to truly test superset - https://phabricator.wikimedia.org/T166689#3800420 (10Ottomata)
[17:16:36] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Productionize Superset  - https://phabricator.wikimedia.org/T166689#3304286 (10Ottomata)
[17:17:28] <wikibugs>	 10Analytics-Kanban: Add documentation for .m suffix code to pagecounts-ez doc page - https://phabricator.wikimedia.org/T180452#3800422 (10Nuria) a:03mforns
[17:20:22] <wikibugs>	 (03PS8) 10Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/348207
[17:20:39] <joal>	 ottomata: quick question for you - would you mind if I move your scala code from refinery-core to refinery-job?
[17:21:03] <ottomata>	 sure, which code?
[17:21:29] <joal>	 ottomata: this scala code is the only one in core as of today, and I don't know which way to go - either scalaify java code in core, remove scala code from core
[17:22:01] <joal>	 ottomata: HivePartition, SparkJsonToHive amd SparkSQLHiveExtensions
[17:22:50] <wikibugs>	 10Analytics-Kanban, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Trending-Service, and 2 others: Trending Edit's worker offsets disappear from Kafka - https://phabricator.wikimedia.org/T181346#3786853 (10Nuria) Ping @mobrovac we have little availability in the near term to look at this, can you tak...
[17:24:27] <wikibugs>	 10Analytics, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Trending-Service, and 2 others: Trending Edit's worker offsets disappear from Kafka - https://phabricator.wikimedia.org/T181346#3800432 (10Nuria)
[17:24:57] <wikibugs>	 (03PS7) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[17:29:41] <ottomata>	 oh i see, otherwise you'd hvae to make  a refinery-core-spark2 module or seomthig joal?
[17:32:33] <joal>	 That's the idea - any bump of version in scala oriented deps has 2 modules
[17:32:48] <joal>	 And, in opinion, the job itself belongs to job (spark oriented)
[17:32:52] <joal>	 ottomata: --^
[17:37:36] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3800473 (10elukey) Just spoke to Chris and Faidon on IRC, and with my team. The best option seems to be repurpose notebook1002.eqiad.wmnet to kafka1023.eqiad.wmnet (new hostname), and assign to...
[17:41:54] <elukey>	 rebooting an1034/36 (35 is a journal node and will be done later
[17:44:28] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3800495 (10elukey) Updating after a chat with Faidon: better to see if there is a onsite spare to repurpose, but for that I'd need to ping @RobH  :)
[17:55:20] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3800512 (10RobH) So kafka1018 is a Dell PowerEdge R720xd.  It is a 2U server, with 12 LFF disk bays.   It has dual Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 48Gb memory, and 12 * 2TB disks.  We...
[17:55:41] <wikibugs>	 10Analytics, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Trending-Service, and 2 others: Trending Edit's worker offsets disappear from Kafka - https://phabricator.wikimedia.org/T181346#3800516 (10mobrovac) p:05High>03Low We don't have much bandwidth to look into it right now, but given that {T...
[18:19:17] <elukey>	 !log re-run webrequest-load-wf-text-2017-11-30-14 (failed due to hadoop reboots)
[18:19:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:20:03] <elukey>	 !log re-run  webrequest-load-wf-upload-2017-11-30-16 (failed due to hadoop reboots)
[18:20:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:21:10] <elukey>	 ottomata: I just rebooted an1029->34,36 (not 29 and 35 since they are journal nodes, will do it as last step)
[18:21:15] <elukey>	 all seems good
[18:22:50] <elukey>	 going afk but please ping me if anything comes up :)
[18:22:52] * elukey off!
[18:22:56] <elukey>	 byyeee
[18:23:36] <wikibugs>	 (03PS8) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[18:24:13] <ottomata>	 k! great
[18:24:14] <ottomata>	 laters
[18:28:20] <wikibugs>	 (03PS9) 10Fdans: [wip] Add pageview by country oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[20:06:42] <wikibugs>	 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all data available - https://phabricator.wikimedia.org/T181751#3800952 (10Nuria)
[20:06:55] <wikibugs>	 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3800965 (10Nuria)
[20:08:48] <nuria_>	 ottomata: what directory do i need to put things in so it ends up on https://analytics.wikimedia.org/datasets/archive/public-datasets/
[20:10:33] <ottomata>	  /srv/published-datasets
[20:10:42] <ottomata>	 oh but archive
[20:10:42] <ottomata>	 hm
[20:10:50] <ottomata>	 ah yeah
[20:10:53] <ottomata>	 stat1006
[20:11:03] <ottomata>	  /srv/published-datasets/archive/public-datasets
[20:12:40] <nuria_>	 ottomata: we have pageviews from antartica
[20:13:09] <nuria_>	 ottomata: like 20 per year
[20:14:45] <wikibugs>	 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3800982 (10Nuria) Select run:  use wmf;      SELECT         year,  country,         views,         row_number() OVER (PARTITION BY year ORDER BY views DESC) as number...
[20:15:43] <ottomata>	 nice!
[20:15:44] <ottomata>	 :)
[20:15:52] <ottomata>	 and some from space I suppose!
[21:00:57] <wikibugs>	 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3801168 (10Nuria)
[21:32:12] <wikibugs>	 10Analytics-Kanban, 10Operations, 10ops-eqiad: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3792843 (10Dzahn) I saw this host as DOWN when looking at Icinga as it was in the unacknowledeged section (though notifications were disabled).  Then i searched Phab for the host name, which usu...
[21:45:18] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3801261 (10Ottomata) > As explained before we also need to explicitly set ACLs for cluster operations between...
[21:47:05] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3801277 (10Ottomata) @elukey, SSL and auth enabled, and `log4j.logger.kafka.authorizer.logger=DEBUG`, I get in...
[21:48:31] <HaeB>	 ottomata: saw my note here yesterday?  the web team plans to re-enable the popups instrumentation for another week (to record data from a new version of the schema) https://phabricator.wikimedia.org/T181493 
[21:49:05] <HaeB>	 just as a heads-up, because i don't know how schema updates are handled in the new Hive refinement process
[21:49:38] <ottomata>	 HaeB ok cool, what's been changed about the new schema?
[21:49:42] <ottomata>	 just field additions?
[21:49:46] <ottomata>	 no type changes, right? :)D
[21:51:31] <HaeB>	 ottomata: yes https://meta.wikimedia.org/w/index.php?title=Schema:Popups&diff=17430287&oldid=17092584
[21:53:30] <ottomata>	 great HaeB that should work then
[21:54:04] <ottomata>	 so, we haven't started any 90 day deletion stuff yet, but just to be sure, let us know when the experiement is over
[21:54:16] <ottomata>	 and we'll make sure to back up the event.popups table as is into your hive db
[21:54:23] <ottomata>	 so it won't be touched by whatever future deletion stuff we enable
[21:55:57] <HaeB>	 cool, thanks! it should still be purged too eventually according to the protocol (as opposed to deleted)
[22:16:07] <ottomata>	 ya eventually we'll get there