[00:05:49] HaeB: where was the pageviews05 table? [00:09:23] nuria_: in staging on stat1003 (where did that get moved to during the server update?) [00:10:01] HaeB: mysql:research@analytics-slave.eqiad.wmnet [staging]> [00:10:36] HaeB: but that database does not have a pageviews05 table [00:13:08] HaeB: or wait was it not mysql? [00:13:33] ...more precisely, on analytics-store [00:14:08] https://www.irccloud.com/pastebin/xL2skQKw/ [00:15:50] HaeB: ok, i see [00:16:00] HaeB: there are two staging dbs now [00:17:15] well, they always differed between -store and -slave, IIRC [00:17:31] i understand everything in them was preserved during https://phabricator.wikimedia.org/T156844 [08:17:35] HaeB: correct! [08:45:00] completing the reboot of the analytics103* nodes! [09:16:11] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [09:28:21] fdans: o/ [09:28:33] if you are checking oozie alerts please ping before restarting jobs [09:28:47] I am going to avoid restarts of failed jobs for a bit [09:28:54] since I need to reboot a lot of nodes [09:29:24] gotcha elukey [10:11:15] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3659740 (10Jason821) Hi, Is this prometheus druid metrics exporter open-sourced? really interested to integrate! [10:15:29] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3659740 (10MoritzMuehlenhoff) @Jason821: Sure, everything we develop is open source. You can get it from https://gerrit.wikimedia.org/r/#/admin/projects... 
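Side note on the staging-table hunt above: since there are now two staging databases (on analytics-store and analytics-slave), a quick way to locate a table is to check both hosts. A sketch only — the hostnames come from the log, but the exact client options and grants are assumptions, so the commands are printed rather than executed:

```shell
# Print (dry-run) the check for the table on each replica host.
# Hostnames are from the log above; credentials/options are assumed.
TABLE="pageviews05"
for HOST in analytics-store.eqiad.wmnet analytics-slave.eqiad.wmnet; do
  CMD="mysql -h $HOST staging -e 'SHOW TABLES LIKE \"$TABLE\"'"
  echo "$CMD"
done
```

Running the printed commands on a host with database access would show which replica actually carries the table.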
[10:22:33] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3802245 (10Jason821) Hi, That's great, but does it support druid 0.10.0 already? The git repo says it only supports 0.9.2 so far. [10:24:56] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3802246 (10elukey) >>! In T177459#3802245, @Jason821 wrote: > Hi, > > That's great, but does it support druid 0.10.0 already? The git repo says it onl... [10:33:35] * elukey is happy --^ [10:39:56] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3802296 (10elukey) [10:46:42] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:07:58] elukey: should I restart the jobs that just failed? [11:08:49] nope [11:09:05] I mean, if it is yours please do [11:09:09] otherwise I'll do it in a bit [11:09:44] I'm guessing my query got interrupted because of a reboot right elukey? :) [11:10:25] 10Analytics-Kanban, 10DBA, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3802437 (10elukey) Opened https://phabricator.wikimedia.org/T181784 to fully decom db104[67] [11:17:26] fdans: yep sorry :( [11:17:56] nonono that's totally fine elukey :) [11:18:38] I'd like to do 1044->1049 today and then stop [11:18:58] I'll do 1050->68 next week [11:19:02] soooo booring :D [12:05:39] Hi my friends ! [12:05:43] I'm here from now on :) [12:06:10] joal_: o/ [12:06:32] don't pay attention to oozie, I am waiting for the last reboots before doing the job restarts :) [12:06:40] no prob elukey [12:06:50] how many still to go elukey ?
[12:07:03] three to reach 1049 [12:07:14] then next week will do 50->69 [12:07:25] Almost there for this week ! [12:16:46] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3802681 (10elukey) >>! In T167304#3801261, @Ottomata wrote: >> As explained before we also need to explicitly... [12:18:43] elukey: While I'm thinking of it - Shouldn't we productionize the overlord config for equalDistribution? [12:20:09] joal_: I thought the same, but I didn't find the option to set in Druid's properties [12:20:19] * joal_ grumbles [12:20:20] only the one for dynamic config [12:20:25] but maybe I have missed it [12:20:28] it is really handy [12:28:31] elukey: from what I read, it seems that config can only be done dynamically :( [12:28:34] https://groups.google.com/forum/#!topic/druid-user/USzFUGUO8SY [12:29:52] :( [12:30:37] joal_: shall we reboot druid1001 and see how it goes? Maybe a couple of minutes after the hour [12:31:43] elukey: You can go for it now - tasks are shared among 3 workers [12:32:22] elukey: all hadoop nodes done for today?
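For reference on the equalDistribution discussion above: the worker select strategy has no druid.properties knob, so (as the linked thread says) it can only be set through the overlord's dynamic worker config endpoint. A sketch — the overlord host is from the log, the payload shape follows Druid's indexing-service docs, and the command is printed rather than executed here:

```shell
# Dry-run sketch of setting the worker select strategy via the overlord's
# dynamic config API (POST /druid/indexer/v1/worker). Host from the log.
OVERLORD="druid1003.eqiad.wmnet:8090"
PAYLOAD='{"selectStrategy": {"type": "equalDistribution"}}'
CMD="curl -X POST -H 'Content-Type: application/json' -d '$PAYLOAD' http://$OVERLORD/druid/indexer/v1/worker"
echo "$CMD"
```

Because this lives in the overlord's metadata rather than in a config file, it survives leader changes but has to be re-applied if the cluster is rebuilt — which is presumably why "productionizing" it is awkward.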
[12:32:23] * elukey likes when joseph feels bold [12:32:36] 1049 still running a ton of containers [12:33:00] elukey: cluster is super busy now - Beginning of month [12:33:11] elukey: So I suggest going for it, and that'll be it [12:34:38] !log re-run webrequest-druid-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [12:34:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:35:58] !log re-run pageview-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [12:35:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:36:55] !log re-run webrequest-load-wf-text-2017-12-1-10 and webrequest-load-wf-text-2017-12-1-9 (failed due to Hadoop reboots) [12:36:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:37:23] (03PS6) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [12:37:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [12:37:40] !log re-run webrequest-load-wf-upload-2017-12-1-10 and webrequest-load-wf-upload-2017-12-1-7 (failed due to Hadoop reboots) [12:37:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:38:11] not sure about aqs-hourly-wf-2017-12-1-8 [12:38:27] was it run as part of a coordinator? 
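The `!log re-run ...` entries above correspond to Oozie workflow re-runs after the reboot-induced failures. A hedged sketch of the CLI involved — the Oozie URL and workflow id below are placeholders, not the real ones from the log, and the command is printed rather than executed:

```shell
# Dry-run sketch of re-running only the failed nodes of an Oozie workflow.
# URL and workflow id are placeholders; real ids come from `oozie jobs`.
OOZIE_URL="http://oozie-server.example.net:11000/oozie"
WF_ID="0000000-000000000000000-oozie-oozi-W"
CMD="oozie job -oozie $OOZIE_URL -rerun $WF_ID -Doozie.wf.rerun.failnodes=true"
echo "$CMD"
```

Setting `oozie.wf.rerun.failnodes=true` re-executes just the failed action nodes instead of replaying the whole workflow.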
[12:39:51] all right stopping daemons on druid1001 and reboot (pivot will be impacted) [12:40:07] (03PS7) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [12:40:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [12:40:33] even if I can temporarily switch it to druid1002 [12:40:43] ok, fdans, everything's working now I think, but I may have left some ugly code in there somewhere and I still have to fix the tests and this ugly purple color that now happens by default [12:41:10] but you can take a look if you want. I'll be finishing today and then we can have a review anyway, just in case you're curious early [12:42:08] !log temporarily switch pivot's config to druid1002 (to reboot druid1001) [12:42:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:43:15] joal_: druid1001 stopped [12:44:46] from the overlord console I can still see two real time indexers [12:50:14] elukey: It seems the task got back up after restart ! [12:50:55] And no impact on pivot I think [12:51:12] some drops in number around 12, but hopefully not related [12:51:37] yep was about to say that [12:51:51] didn't touch druid before 10m ago [12:52:03] elukey: I think it's not related [12:52:11] all right analytics reboots completed for today [12:52:18] awesome :) [12:52:30] elukey: I think we're safe to reboot all druid then :) [12:52:54] one every hour is fine [12:53:02] joal_: do you know aqs-hourly-wf-2017-12-1-8 ? 
[12:53:16] I do elukey :) [12:53:18] it failed but I don't see what coordinator it belongs to (if any) [12:53:36] This extracts aqs-aggregated logs, for usage analysis [12:53:52] actually, I should push those numbers to druid :) [12:58:24] have you found the coordinator elukey ? [13:01:53] nope :( [13:04:38] elukey: hidden in second page [13:05:40] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3802747 (10elukey) Metrics exported by the basic config from the HDFS datanode: ``` Hadoop_DataNode_BlockChecksumOpAvgTime{name="DataNodeActivity-analy... [13:06:33] * elukey cries in a corner [13:07:12] !log re-run aqs-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [13:07:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:09:09] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3802752 (10elukey) Metrics exported by the Yarn Node manager: ``` elukey@analytics1029:~$ curl http://analytics1029.eqiad.wmnet:8141/metrics -s | grep... [13:13:03] awesome milimetric! [13:13:10] (sorry, was lunchin) [13:25:23] elukey: I don't know what we have changed, but last realtime tasks failed :( [13:27:17] lovely [13:27:22] all three replicas? [13:27:51] hm, don't know elukey - they're just listed as failed [13:28:10] Have we changed the parameter for tasks -restart? [13:28:34] no we didn't, but it shouldn't matter [13:29:13] the tasks should be created for each hour right?
New ones I mean [13:29:31] yes, they've been created, but failed [13:31:16] Unable to grant lock to inactive Task [index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0] [13:31:31] :( [13:35:02] https://groups.google.com/forum/#!topic/druid-user/Wj91V3nby6I [13:35:13] so from here it seems that it might have something to do with Zookeeper [13:35:34] but it seems weird [13:35:43] I have seen that elukey - but I really wonder how it's even possible [13:36:33] Ah ! Maybe - tranquility was connected to overlord 1003 - it changed to 1002 - so tranquility creates its task, updates zookeeper but doesn't send it correctly to overlord [13:37:06] wasn't it 1002 -> 1003 ? [13:37:18] hm, possible elukey [13:38:00] atm the overlord leader is 1003 [13:38:10] k [13:38:47] so in theory the next hour should be fine [13:38:52] if this is a temp glitch [13:39:50] I hope it is elukey :) [13:40:04] if not, I'll restart the tranquility job [13:41:48] 2017-12-01T13:41:23,016 INFO com.metamx.http.client.pool.ChannelResourceFactory: Generating: http://druid1003.eqiad.wmnet:8090 [13:41:51] 2017-12-01T13:41:23,556 WARN io.druid.indexing.common.actions.RemoteTaskActionClient: Exception submitting action for task[index_realtime_banner_activity_minutely_2017-12-01T12:00:00.000Z_2_0] [13:41:55] java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[druid1003.eqiad.wmnet:8090] logs for exceptions. [13:41:58] this is from the middlemanager on 1003 [13:42:19] Maaaaan [13:44:35] so it probably didn't like the change in the overlord [13:46:00] hm, what did we change elukey ? [13:46:05] only the master, right? [13:46:44] nothing as far as I know, 1001 was not the overlord master [13:48:02] hm, that's really uncool [13:58:03] man - cluster is super hugely busy [14:01:21] interesting from druid1002's overlord log [14:01:22] 2017-12-01T12:43:32,887 INFO io.druid.indexing.overlord.RemoteTaskRunner: Kaboom! Worker[druid1001.eqiad.wmnet:8091] removed!
[14:01:34] then [14:01:51] 2017-12-01T12:44:44,060 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xe95e328c024806da, [14:01:54] likely server has closed socket, closing socket connection and attempting reconnect [14:02:13] so this --^ was probably a zk session on druid1001 [14:02:32] 2017-12-01T12:44:44,161 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED [14:02:35] 2017-12-01T12:44:44,166 INFO io.druid.curator.discovery.CuratorServiceAnnouncer: Unannouncing service[DruidNode{serviceName='druid/over [14:02:38] lord', host='druid1002.eqiad.wmnet', port=8090}] [14:02:56] 2017-12-01T12:44:44,955 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to druid1003.eqiad.wmnet/10.64.53.103:2181, initiating session [14:03:05] this might explain why the overlord master changed [14:03:38] 2017-12-01T12:44:44,979 INFO io.druid.indexing.overlord.TaskMaster: Bowing out! [14:04:04] now if tranquillity thinks that druid1002 is the master [14:04:10] and it doesn't change its settings [14:04:24] elukey: I have no clue what tranquility thinks :) [14:04:32] I just hope it'll change its settings [14:04:35] any logs that we can check? [14:04:58] funny [14:05:06] now we have two indexers running [14:05:10] for 14:00 [14:05:12] elukey: not really - streaming job runs on hadoop, therefore doesn't finish - therefore no log aggregation [14:05:50] not even the logs of the appmaster?
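Given the leader flip from druid1002 to druid1003 traced in the logs above, a quick way to see which overlord currently leads is the leader endpoint each overlord exposes. A sketch — hosts from the log, endpoint per Druid's indexing-service API; the commands are printed rather than executed here:

```shell
# Dry-run: print the leader-check command for each overlord host.
# The /druid/indexer/v1/leader endpoint returns the current leader's address.
for H in druid1001 druid1002 druid1003; do
  CMD="curl -s http://$H.eqiad.wmnet:8090/druid/indexer/v1/leader"
  echo "$CMD"
done
```

All three hosts should report the same leader; running this before and after a reboot would make a leader change visible without digging through overlord logs.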
This is weird indeed [14:06:13] elukey: we can check logs one by one - I can tell you where appmaster is [14:09:45] elukey: driver is on 10.64.36.106, application_1504006918778_285237 [14:11:09] ok, found why we are so behind in hadoop [14:11:35] the 2 monthly jobs for uniques (per_project_family and per_domain) are running concurrently [14:11:39] I need to change that [14:17:46] elukey: another weird thing: pivot data is superweirdly present during the time task was supposedly down [14:17:49] Man [14:17:56] That is very unreliable [14:19:35] joal_: lol? [14:19:51] mwarfol [14:20:35] I am almost convinced that zookeeper was the issue [14:20:45] but this complicates the reboot procedure a lot [14:22:42] I followed all the logs from druid1002 to druid1003 but then I can't really find a good root cause [14:23:32] elukey: maybe making sure we disable the realtime workers before restarting? [14:24:40] but then we lose data no? [14:24:44] "realtime data" [14:26:34] elukey: if we do it one by one, normally no [14:28:42] joal: but we just did that shutting down the one on druid1001 and it caused this mess [14:28:44] the other weird thing elukey is the failure for new task on druid1001 [14:29:31] I bet that restarting tranquillity fixes this [14:40:11] I restarted the middle manager on druid1001 [14:40:18] so now it lists a [] [14:40:52] let's wait for the new hour elukey [14:53:47] 2017-12-01T13:05:26,354 INFO io.druid.indexing.overlord.TaskQueue: Received FAILED status for task: index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 [14:53:52] 2017-12-01T13:05:26,354 ERROR io.druid.indexing.overlord.RemoteTaskRunner: WTF?!
Asked to cleanup nonexistent task: {class=io.druid.indexing.overlord.RemoteTaskRunner, taskId=index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0} [14:54:07] I tried to follow index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 [14:59:10] That's weird man :( [14:59:22] elukey: Do you want me to restart streaming job? [15:00:03] index_hadoop_webrequest_2017-12-01T14:31:08.876Z FAILED [15:00:05] :( [15:00:28] Mwarf :( [15:00:29] wait a sec for the new jobs [15:01:43] joal: index_hadoop_webrequest_2017-12-01T14 is sent by hadoop right? [15:02:01] webrequest-druid-hourly-wf-2017-12-1-8 failed [15:02:16] it is elukey [15:02:35] but it is managed by druid [15:02:55] ahhaha joal take a look at the overlord console [15:03:29] hm - looks like druid1001 is back in the game [15:03:52] and also that there are real time indexers for 14:00 [15:04:08] Yes, up to 16:10 [15:04:23] This is expected behavior ) [15:04:28] is it?? [15:04:33] I am missing some stuff the [15:04:35] *then [15:04:58] elukey: tranquility allows for waiting for late events [15:05:04] we have set this to 10 minutes [15:05:59] it still doesn't make sense why there are indexers for UTC 14 and UTC 15 [15:06:26] elukey@druid1003:/var/log/druid$ date [15:06:26] Fri Dec 1 15:05:30 UTC 2017 [15:06:50] what is banner_activity_minutely-2017-12-01T14:00:00.000Z-0001 supposed to index ? [15:06:52] UTC14 are waiting for events to arrive, up to 10 minutes late [15:06:58] They'll be finished at 16:10 [15:07:27] ah sorry you are saying 16:00 our timezone [15:07:36] correct sir, sorry my mistake [15:07:42] nono now it makes sense!
[15:07:45] slow friday [15:07:52] okok so it seems back in the correct shape then [15:08:20] agreed [15:08:33] except for hadoop related jobs :( [15:09:00] !log rerun pageview-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency [15:09:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:09:21] elukey: let's wait for that one [15:09:24] yep [15:09:29] super curious [15:10:26] another theory that I have is that when rebooting nodes, we should check which zookeeper node holds sessions from the overlord master [15:10:37] and possibly avoid an overlord change [15:11:30] there you go, task submitted to the overlord (from the logs) [15:11:44] and assigned to druid1001 correctly [15:20:14] !log rerun webrequest-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency [15:20:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:20:33] actually I flipped them, I restarted webrequest then pageviews [15:20:41] no prob [15:21:57] elukey: pivot is now refreshed with data making more sense [15:22:16] elukey: like, a drop between 1 and 2 pm UTC [15:23:01] * elukey nods [15:33:25] Hi all! Quick question... have there been issues with analytics infrastructure this morning? Seeing some bizarre data in banner activity logs on Druid/Pivot: https://goo.gl/WfKTKG [15:33:32] thx in advance!!
[15:33:50] (03PS6) 10Fdans: [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) [15:35:04] (03PS8) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [15:35:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [15:36:38] (03CR) 10jerkins-bot: [V: 04-1] [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans) [15:38:18] AndyRussG: hi! yes [15:38:45] we had issues while rebooting a druid node for maintenance :( [15:38:54] so realtime data got affected [15:38:58] elukey: ah ok understood [15:39:17] but so Hive data is still good, correct? [15:39:38] yep, plus we'll backfill that data once the regular hadoop jobs will run [15:40:38] right :) cool thanks much :D Mmmm also got a report about some issues on stat1005, don't have the details yet... any ideas about possible issues there? [15:41:33] AndyRussG: nope, but possibly some overload due to people crunching data.. let me know if you have more info and I'll try to check! [15:41:38] no alarms fired today though [15:42:27] elukey: ok thanks! yeah much, I'll let you know in a bit.. [15:43:47] heheh switch positions of the words "yeah" and "much" ^ [15:45:05] word-level dyslexia? or dyscribia? 
[15:46:44] probably simply Friday kicking in :) [15:47:00] elukey: dear luca [15:47:03] it's time [15:47:35] fdans: tell me more [15:47:51] if you could dump the contents of test_pageviews_bycountry in a tsv i'd be eternally grateful [15:48:09] the job just finished running elukey [15:48:11] do you have a query that does it ? [15:48:19] oh yes, 1sec [15:50:48] this will do I think elukey : [15:50:49] cqlsh -e "select * from \"test_pageviews_bycountry\".data" > out.csv [15:51:12] how many rows are we talking about ? [15:51:36] about 2700 [15:51:52] (number of wikimedia projects * 3) [15:52:44] where do you want the file uploaded to? [15:53:29] fdans: --^ [15:54:56] elukey: is it ok to send it to me by email? or any way that's convenient for you that isn't public [15:55:26] fdans: I'll upload it on one of the stat boxes [15:55:29] in your home dir [15:55:32] stat1005 ok? [15:55:40] sounds great! [15:55:49] thank you luca [15:57:24] fdans: /home/fdans/aqs_test_out.csv [15:57:26] there you go [15:57:39] graaaaaaazie elukey !!!! [15:57:48] can't wait to test this in beta aqs [16:01:02] ping milimetric [16:08:37] Hi ebernhardson - I quickly looked at your CR in scala - Do you want me to thoroughly review it (I don't know much of what it does, might take me long), or just approve that the approach looks good? [16:16:03] joal: well, mostly you're the only person i know that actually writes scala more than once in a blue moon :) Tbh i don't know exactly what the calculations are either, it's a port of the algorithm from a python implementation.
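A note on the dump command above: `cqlsh -e "select * ..."` writes cqlsh's pretty-printed table, not a clean delimited file. For a proper CSV, cqlsh's COPY command is an alternative. A sketch — keyspace and table from the log, output path assumed; printed rather than executed:

```shell
# Dry-run sketch of exporting the Cassandra table as real CSV via cqlsh COPY.
# Keyspace/table from the log; 'out.csv' is an assumed output path.
KS_TABLE='"test_pageviews_bycountry".data'
CMD="cqlsh -e \"COPY $KS_TABLE TO 'out.csv' WITH HEADER = TRUE\""
echo "$CMD"
```

COPY handles quoting and headers itself, which matters less at ~2700 rows but avoids hand-cleaning the pretty-printed output.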
Mostly that the approach is sane and it's not doing things in weird ways that scala has better methods for [16:16:41] joal: and i suppose a note that because i pretty much only port from python to scala for performance reasons (this is ~20x faster) the scala code is somewhat un-idiomatic using arrays and mutable data [16:17:28] ebernhardson: I noticed that (scala idioms) [16:18:01] ebernhardson: In CR message, you said there was perf gain of using mutables - you confirm? [16:18:54] joal: yes, my first round i was using things like (0 until N).map { ... } to build up arrays, but converting them all to pre-filled arrays and using while loops increased speed from 13s to 4s on the included benchmark (which takes 90s in python) [16:19:19] ebernhardson: I'd be interested in double checking that - We heavily use scala functional way of doing stuff in some other jobs, and maybe it would be better using mutables ... [16:20:17] hum - I think there is an array-filling function in scala - Ok, I'll read the code and suggest possible idiomatic changes [16:20:46] this code is all pretty simple math running in a tight loop, so i think that's why it benefits.
the inner loops run something like 4M times in the benchmark [16:20:59] right [16:21:06] let's keep it this way then :) [16:28:06] also ebernhardson, this example gives us a strong +1 for using scala - I thank you for that :) [16:29:06] ebernhardson: Not sure if you tried spark 2.1 (spark2-submit from any stat machine), but it should also be really faster than 1.6 [16:34:07] Ah - Looks like you've actually already done that :) [16:34:11] ebernhardson: --^ [16:34:15] joal: yea, we are on 2.1 :) [16:34:21] great [16:46:49] joal: let me know when you give a 1st pass through the docs for aqs edit endpoint and i can help as needed [16:53:06] the more I read on druid's user mailing list the more I think that the indexing service is not super resilient to zk failures [16:54:21] Fangjin wrote (long time ago but it might still be relevant): "The indexing service requires ZK to be able to assign tasks, without it task assignment will timeout. I agree Druid needs to be made more resilient to ZK problems." [17:31:05] the druid people are awesome [17:31:07] 2017-12-01T12:44:44,985 INFO io.druid.indexing.overlord.TaskMaster: By the power of Grayskull, I have the power!
[17:31:12] joal: --^ [17:31:32] huhuhu :) [17:39:53] (03PS9) 10Milimetric: Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [17:39:59] (03CR) 10jerkins-bot: [V: 04-1] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [17:44:10] nuria_: I'm thinking of the scheme for docs: if I follow the current way we do, I'll create 5 new pages, one per endpoint header (edits, editors, new-registered-users, edited-pages and bytes-differences) [17:45:04] I'm happy to do that, with a getting started for each, and links to AQS and mediawiki-history-reconstruction docs [17:45:09] nuria_: --^ [17:45:24] joal: on meeting can talk in a bit [17:46:11] np nuria_ [17:50:36] joal: found something interesting [17:50:57] index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 and the other two replicas have running traces only on druid1001's middle manager [17:51:07] (those ones failed) [17:51:53] the main issue seems to be the middle manager asking to acquire the time window lock (12->13) from the overlord, that returned 500 since it didn't find any lock registered for the tasks [17:52:38] and IIUC it is the overlord that is responsible for managing the task locks [17:53:29] so, I am wondering if this is a problem due to an unclean shutdown of the middle manager, maybe together with the change in the overlord [17:54:28] maybe we could re-try following the procedure to drain the middle manager on druid1002, wait for the host to be free from work, and then shut down its overlord/coordinator/broker/etc.. [17:54:32] and reboot [17:55:02] anyhow, this smells a lot like a Druid bug [17:56:55] going offline people! [17:56:57] o/ [17:57:04] have a good weekend :) [17:57:06] * elukey off!
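The "drain first" procedure proposed above maps onto the middle manager's worker API: disable the worker so the overlord assigns it no new tasks, poll until its task list is empty, then stop the daemons and reboot. A sketch — host from the log, endpoints per Druid's worker API; commands are printed rather than executed:

```shell
# Dry-run sketch of draining a middle manager before a reboot.
# Host from the log; /druid/worker/v1/* endpoints per Druid's worker API.
MM="druid1002.eqiad.wmnet:8091"
# 1. Stop accepting new task assignments:
DISABLE_CMD="curl -X POST http://$MM/druid/worker/v1/disable"
# 2. Poll until no running tasks remain, then stop daemons and reboot:
TASKS_CMD="curl -s http://$MM/druid/worker/v1/tasks"
echo "$DISABLE_CMD"
echo "$TASKS_CMD"
```

After the reboot, the symmetric `/druid/worker/v1/enable` call puts the worker back into rotation; draining one host at a time should avoid losing realtime data, per the discussion above.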
[17:59:04] gone for dinner a-team [19:05:21] joal: I would just create a quickstart for all with examples like "how do you get the number of active editors for jp wiki" [19:05:32] "how do you get the number of edited pages" [19:05:47] and from each question link to maybe a more in depth doc [19:11:29] milimetric: is your changeset for FF anywhere? [19:11:45] milimetric: I thought I would start webpack changes on it [19:13:12] milimetric: I think I found It: https://gerrit.wikimedia.org/r/#/c/391490/ [19:13:21] nuria_: I'm still rebasing though [19:13:28] milimetric: ok [19:14:21] nuria_: but if you make a dependent change it should be fine, you should be able to rebase cleanly after I do [19:15:37] milimetric: the approach i am going to use might imply reshuffling imports [19:15:48] milimetric: so i will wait for rebase [19:15:56] I see, ok [19:24:20] 10Analytics-Kanban: Beta: Wikistats split webpack bundle - https://phabricator.wikimedia.org/T181841#3804054 (10Nuria) [19:32:35] back [19:33:53] nuria_, milimetric, fdans: I suggest Analytics/AQS/Wikistats2 as a first page to document all the endpoints we'll have serving wks2 related data [19:34:22] When we'll have more than wikistats-oriented data (historical, deletion drift and so on), we'll rename if we think it's needed [19:34:25] joal: just do Analytics/AQS/Wikistats (no 2) because we don't have two Wikistatseses on AQS :) [19:34:27] Ok for you guys? [19:34:39] Works for me milimetric :) [19:34:52] yeah, it has enough context that it doesn't need the version here [19:35:07] joal: agreed [19:37:39] fdans: I wait a few minutes, then proceed ;) [19:40:09] yesshhh [19:40:17] joal [19:40:27] yooow fdans [19:40:34] sorry to ping you so late [19:40:40] I know you're an early starter [19:40:54] fdans: ok for Analytics/AQS/Wikistats? [19:42:07] yes!
sounds great joal [19:42:13] awesome :) [19:42:16] Thanks mate [20:10:14] 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3804275 (10Nuria) Countries for last 24 months: use wmf; SELECT country, views FROM ( SELECT country, SUM(view_c... [20:42:29] (03PS10) 10Milimetric: Simplify and fix breakdowns and other data [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [20:42:55] ok nuria_ (fdans): finished, rebased, tested ^ [20:43:13] it looks like there's not enough data to show the last month alone, but everything else seems to work [20:43:31] please do test yourselves too and let me know if I slipped on anything, this concludes like two weeks of intensive thinking and coding [20:43:34] * milimetric is spent :) [20:43:58] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Alpha Release: Breakdowns don't work in Firefox - https://phabricator.wikimedia.org/T180556#3804403 (10Milimetric) [21:15:34] milimetric, nuria_ : 20:44:47 -!- leila [~leila@tan2.corp.wikimedia.org] has quit [Quit: Leaving.] [21:15:39] oops [21:15:52] milimetric, nuria_: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats [21:17:29] joal: I would put the data quality notice in a warning box template. Also maybe something about the api being alpha [21:30:24] milimetric: better? [21:32:01] nice joal [21:32:32] milimetric: Fighting with wiki-templates is not something I'm good at :) [21:33:44] joal: oh, I think there are like 3-4 people who are actually good at that [21:33:54] :) [21:41:34] Ok, good for tonight - Have a good weekend a-team [22:55:15] 10Analytics, 10MediaWiki-API: There is not an easy way to tag API requests by application for analytics - https://phabricator.wikimedia.org/T181862#3804738 (10dbarratt)