[00:05:49] HaeB: where was the pageviews05 table? [00:09:23] nuria_: in staging on stat1003 (where did that get moved to during the server update?) [00:10:01] HaeB: mysql:research@analytics-slave.eqiad.wmnet [staging]> [00:10:36] HaeB: but that database does not have a pageviews05 table [00:13:08] HaeB: or wait was it not mysql? [00:13:33] ...more precisely, on analytics-store [00:14:08] https://www.irccloud.com/pastebin/xL2skQKw/ [00:15:50] HaeB: ok, i see [00:16:00] HaeB: there are two staging dbs now [00:17:15] well, they always differed between -store and -slave, IIRC [00:17:31] i understand everything in them was preserved during https://phabricator.wikimedia.org/T156844 [08:17:35] HaeB: correct! [08:45:00] completing the reboot of the analytics103* nodes! [09:16:11] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [09:28:21] fdans: o/ [09:28:33] if you are checking oozie alerts please ping before restarting jobs [09:28:47] I am going to avoid restarts of failed jobs for a bit [09:28:54] since I need to reboot a lot of nodes [09:29:24] gotcha elukey [10:11:15] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3659740 (10Jason821) Hi, Is this prometheus druid metrics exporter open-sourced? really interested to integrate! [10:15:29] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3659740 (10MoritzMuehlenhoff) @Jason821: Sure, everything we develop is open source. You can get it from https://gerrit.wikimedia.org/r/#/admin/projects... 
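Side note on the staging-table hunt above: since there are now two staging databases (on analytics-store and analytics-slave), a quick way to locate a table is to check both hosts. A sketch only — the hostnames come from the log, but the exact client options and grants are assumptions, so the commands are printed rather than executed:

```shell
# Print (dry-run) the check for the table on each replica host.
# Hostnames are from the log above; credentials/options are assumed.
TABLE="pageviews05"
for HOST in analytics-store.eqiad.wmnet analytics-slave.eqiad.wmnet; do
  CMD="mysql -h $HOST staging -e 'SHOW TABLES LIKE \"$TABLE\"'"
  echo "$CMD"
done
```

Running the printed commands on a host with database access would show which replica actually carries the table.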
[10:22:33] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3802245 (10Jason821) Hi, That's great, but does it support druid 0.10.0 already? The git repo says it only supports 0.9.2 so far. [10:24:56] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3802246 (10elukey) >>! In T177459#3802245, @Jason821 wrote: > Hi, > > That's great, but does it support druid 0.10.0 already? The git repo says it onl... [10:33:35] * elukey is happy --^ [10:39:56] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3802296 (10elukey) [10:46:42] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:07:58] elukey: should I restart the jobs that just failed? [11:08:49] nope [11:09:05] I mean, if it is yours please do [11:09:09] otherwise I'll do it in a bit [11:09:44] I'm guessing my query got interrupted because of a reboot right elukey? :) [11:10:25] 10Analytics-Kanban, 10DBA, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3802437 (10elukey) Opened https://phabricator.wikimedia.org/T181784 to fully decom db104[67] [11:17:26] fdans: yep sorry :( [11:17:56] nonono that's totally fine elukey :) [11:18:38] I'd like to do 1044->1049 today and then stop [11:18:58] I'll do 1050->68 next week [11:19:02] soooo booring :D [12:05:39] Hi my friends ! [12:05:43] I'm here from now on :) [12:06:10] joal_: o/ [12:06:32] don't pay attention to oozie, I am waiting for the last reboots before doing the job restarts :) [12:06:40] no prob elukey [12:06:50] how many still to go elukey ?
[12:07:03] three to reach 1049 [12:07:14] then next week will do 50->69 [12:07:25] Almost there for this week ! [12:16:46] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Understand Kafka ACLs and figure out what ACLs we want for production topics - https://phabricator.wikimedia.org/T167304#3802681 (10elukey) >>! In T167304#3801261, @Ottomata wrote: >> As explained before we also need to explicitly... [12:18:43] elukey: While I'm thinking of it - Shouldn't we productionize the overlord config for equalDistribution? [12:20:09] joal_: I thought the same, but I didn't find the option to set in Druid's properties [12:20:19] * joal_ grumbles [12:20:20] only the one for dynamic config [12:20:25] but maybe I have missed it [12:20:28] it is really handy [12:28:31] elukey: from what I read, it seems that config can only be done dynamically :( [12:28:34] https://groups.google.com/forum/#!topic/druid-user/USzFUGUO8SY [12:29:52] :( [12:30:37] joal_: shall we reboot druid1001 and see how it goes? Maybe a couple of minutes after the hour [12:31:43] elukey: You can go for it now - tasks are shared among 3 workers [12:32:22] elukey: all hadoop nodes done for today?
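For reference on the equalDistribution discussion above: the worker select strategy has no druid.properties knob, so (as the linked thread says) it can only be set through the overlord's dynamic worker config endpoint. A sketch — the overlord host is from the log, the payload shape follows Druid's indexing-service docs, and the command is printed rather than executed here:

```shell
# Dry-run sketch of setting the worker select strategy via the overlord's
# dynamic config API (POST /druid/indexer/v1/worker). Host from the log.
OVERLORD="druid1003.eqiad.wmnet:8090"
PAYLOAD='{"selectStrategy": {"type": "equalDistribution"}}'
CMD="curl -X POST -H 'Content-Type: application/json' -d '$PAYLOAD' http://$OVERLORD/druid/indexer/v1/worker"
echo "$CMD"
```

Because this lives in the overlord's metadata rather than in a config file, it survives leader changes but has to be re-applied if the cluster is rebuilt — which is presumably why "productionizing" it is awkward.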
[12:32:23] * elukey likes when joseph feels bold [12:32:36] 1049 still running a ton of containers [12:33:00] elukey: cluster is super busy now - Beginning of month [12:33:11] elukey: So I suggest going for it, and that'll be it [12:34:38] !log re-run webrequest-druid-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [12:34:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:35:58] !log re-run pageview-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [12:35:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:36:55] !log re-run webrequest-load-wf-text-2017-12-1-10 and webrequest-load-wf-text-2017-12-1-9 (failed due to Hadoop reboots) [12:36:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:37:23] (03PS6) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [12:37:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [12:37:40] !log re-run webrequest-load-wf-upload-2017-12-1-10 and webrequest-load-wf-upload-2017-12-1-7 (failed due to Hadoop reboots) [12:37:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:38:11] not sure about aqs-hourly-wf-2017-12-1-8 [12:38:27] was it run as part of a coordinator? 
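The `!log re-run ...` entries above correspond to Oozie workflow re-runs after the reboot-induced failures. A hedged sketch of the CLI involved — the Oozie URL and workflow id below are placeholders, not the real ones from the log, and the command is printed rather than executed:

```shell
# Dry-run sketch of re-running only the failed nodes of an Oozie workflow.
# URL and workflow id are placeholders; real ids come from `oozie jobs`.
OOZIE_URL="http://oozie-server.example.net:11000/oozie"
WF_ID="0000000-000000000000000-oozie-oozi-W"
CMD="oozie job -oozie $OOZIE_URL -rerun $WF_ID -Doozie.wf.rerun.failnodes=true"
echo "$CMD"
```

Setting `oozie.wf.rerun.failnodes=true` re-executes just the failed action nodes instead of replaying the whole workflow.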
[12:39:51] all right stopping daemons on druid1001 and reboot (pivot will be impacted) [12:40:07] (03PS7) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [12:40:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [12:40:33] even if I can temporarily switch it to druid1002 [12:40:43] ok, fdans, everything's working now I think, but I may have left some ugly code in there somewhere and I still have to fix the tests and this ugly purple color that now happens by default [12:41:10] but you can take a look if you want. I'll be finishing today and then we can have a review anyway, just in case you're curious early [12:42:08] !log temporarily switch pivot's config to druid1002 (to reboot druid1001) [12:42:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:43:15] joal_: druid1001 stopped [12:44:46] from the overlord console I can still see two real time indexers [12:50:14] elukey: It seems the task got back up after restart ! [12:50:55] And no impact on pivot I think [12:51:12] some drops in number around 12, but hopefully not related [12:51:37] yep was about to say that [12:51:51] didn't touch druid before 10m ago [12:52:03] elukey: I think it's not related [12:52:11] all right analytics reboots completed for today [12:52:18] awesome :) [12:52:30] elukey: I think we're safe to reboot all druid then :) [12:52:54] one every hour is fine [12:53:02] joal_: do you know aqs-hourly-wf-2017-12-1-8 ? 
[12:53:16] I do elukey :) [12:53:18] it failed but I don't see what coordinator it belongs to (if any) [12:53:36] This extracts aqs-aggregated logs, for usage analysis [12:53:52] actually, I should push those numbers to druid :) [12:58:24] have you found the coordinator elukey ? [13:01:53] nope :( [13:04:38] elukey: hidden in second page [13:05:40] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3802747 (10elukey) Metrics exported by the basic config from the HDFS datanode: ``` Hadoop_DataNode_BlockChecksumOpAvgTime{name="DataNodeActivity-analy... [13:06:33] * elukey cries in a corner [13:07:12] !log re-run aqs-hourly-wf-2017-12-1-8 (failed due to Hadoop reboots) [13:07:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:09:09] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Add the prometheus jmx exporter to all the Hadoop daemons - https://phabricator.wikimedia.org/T177458#3802752 (10elukey) Metrics exported by the Yarn Node manager: ``` elukey@analytics1029:~$ curl http://analytics1029.eqiad.wmnet:8141/metrics -s | grep... [13:13:03] awesome milimetric! [13:13:10] (sorry, was lunchin) [13:25:23] elukey: I don't know what we have changed, but last realtime tasks failed :( [13:27:17] lovely [13:27:22] all three replicas? [13:27:51] hm, don't know elukey - they're just listed as failed [13:28:10] Have we changed the parameter for tasks -restart? [13:28:34] no we didn't, but it shouldn't matter [13:29:13] the tasks should be created for each hour right?
New ones I mean [13:29:31] yes, they've been created, but failed [13:31:16] Unable to grant lock to inactive Task [index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0] [13:31:31] :( [13:35:02] https://groups.google.com/forum/#!topic/druid-user/Wj91V3nby6I [13:35:13] so from here it seems that it might have something to do with Zookeeper [13:35:34] but it seems weird [13:35:43] I have seen that elukey - but I really wonder how it's even possible [13:36:33] Ah ! Maybe - tranquility was connected to overlord 1003 - it changed to 1002 - so tranquility creates its task, updates zookeeper but doesn't send it correctly to overlord [13:37:06] wasn't it 1002 -> 1003 ? [13:37:18] hm, possible elukey [13:38:00] atm the overlord leader is 1003 [13:38:10] k [13:38:47] so in theory the next hour should be fine [13:38:52] if this is a temp glitch [13:39:50] I hope it is elukey :) [13:40:04] if not, I'll restart the tranquility job [13:41:48] 2017-12-01T13:41:23,016 INFO com.metamx.http.client.pool.ChannelResourceFactory: Generating: http://druid1003.eqiad.wmnet:8090 [13:41:51] 2017-12-01T13:41:23,556 WARN io.druid.indexing.common.actions.RemoteTaskActionClient: Exception submitting action for task[index_realtime_banner_activity_minutely_2017-12-01T12:00:00.000Z_2_0] [13:41:55] java.io.IOException: Scary HTTP status returned: 500 Server Error. Check your overlord[druid1003.eqiad.wmnet:8090] logs for exceptions. [13:41:58] this is from the middlemanager on 1003 [13:42:19] Maaaaan [13:44:35] so it probably didn't like the change in the overlord [13:46:00] hm, what did we change elukey ? [13:46:05] only the master, right? [13:46:44] nothing as far as I know, 1001 was not the overlord master [13:48:02] hm, that's really uncool [13:58:03] man - cluster is super hugely busy [14:01:21] interesting from druid1002's overlord log [14:01:22] 2017-12-01T12:43:32,887 INFO io.druid.indexing.overlord.RemoteTaskRunner: Kaboom! Worker[druid1001.eqiad.wmnet:8091] removed!
[14:01:34] then [14:01:51] 2017-12-01T12:44:44,060 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xe95e328c024806da, [14:01:54] likely server has closed socket, closing socket connection and attempting reconnect [14:02:13] so this --^ was probably a zk session on druid1001 [14:02:32] 2017-12-01T12:44:44,161 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED [14:02:35] 2017-12-01T12:44:44,166 INFO io.druid.curator.discovery.CuratorServiceAnnouncer: Unannouncing service[DruidNode{serviceName='druid/over [14:02:38] lord', host='druid1002.eqiad.wmnet', port=8090}] [14:02:56] 2017-12-01T12:44:44,955 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to druid1003.eqiad.wmnet/10.64.53.103:2181, initiating session [14:03:05] this might explain why the overlord master changed [14:03:38] 2017-12-01T12:44:44,979 INFO io.druid.indexing.overlord.TaskMaster: Bowing out! [14:04:04] now if tranquillity thinks that druid1002 is the master [14:04:10] and it doesn't change its settings [14:04:24] elukey: I have no clue what tranquility thinks :) [14:04:32] I just hope it'll change its settings [14:04:35] any logs that we can check? [14:04:58] funny [14:05:06] now we have two indexers running [14:05:10] for 14:00 [14:05:12] elukey: not really - streaming job runs on hadoop, therefore doesn't finish - therefore no log aggregation [14:05:50] not even the logs of the appmaster?
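Given the leader flip from druid1002 to druid1003 traced in the logs above, a quick way to see which overlord currently leads is the leader endpoint each overlord exposes. A sketch — hosts from the log, endpoint per Druid's indexing-service API; the commands are printed rather than executed here:

```shell
# Dry-run: print the leader-check command for each overlord host.
# The /druid/indexer/v1/leader endpoint returns the current leader's address.
for H in druid1001 druid1002 druid1003; do
  CMD="curl -s http://$H.eqiad.wmnet:8090/druid/indexer/v1/leader"
  echo "$CMD"
done
```

All three hosts should report the same leader; running this before and after a reboot would make a leader change visible without digging through overlord logs.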
This is weird indeed [14:06:13] elukey: we can check logs one by one - I can tell you where appmaster is [14:09:45] elukey: driver is on 10.64.36.106, application_1504006918778_285237 [14:11:09] ok, found why we are so behind in hadoop [14:11:35] the 2 monthly jobs for uniques (per_project_family and per_domain) are running concurrently [14:11:39] I need to change that [14:17:46] elukey: another weird thing: pivot data is superweirdly present during the time task was supposedly down [14:17:49] Man [14:17:56] That is very unreliable [14:19:35] joal_: lol? [14:19:51] mwarfol [14:20:35] I am almost convinced that zookeeper was the issue [14:20:45] but this complicates the reboot procedure a lot [14:22:42] I followed all the logs from druid1002 to druid1003 but then I can't really find a good root cause [14:23:32] elukey: maybe making sure we disable the realtime workers before restarting? [14:24:40] but then we lose data no? [14:24:44] "realtime data" [14:26:34] elukey: if we do it one by one, normally no [14:28:42] joal: but we just did that shutting down the one on druid1001 and it caused this mess [14:28:44] the other weird thing elukey is the failure for new task on druid1001 [14:29:31] I bet that restarting tranquillity fixes this [14:40:11] I restarted the middle manager on druid1001 [14:40:18] so now it lists a [] [14:40:52] let's wait for the new hour elukey [14:53:47] 2017-12-01T13:05:26,354 INFO io.druid.indexing.overlord.TaskQueue: Received FAILED status for task: index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 [14:53:52] 2017-12-01T13:05:26,354 ERROR io.druid.indexing.overlord.RemoteTaskRunner: WTF?!
Asked to cleanup nonexistent task: {class=io.druid.indexing.overlord.RemoteTaskRunner, taskId=index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0} [14:54:07] I tried to follow index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 [14:59:10] That's weird man :( [14:59:22] elukey: Do you want me to restart streaming job? [15:00:03] index_hadoop_webrequest_2017-12-01T14:31:08.876Z FAILED [15:00:05] :( [15:00:28] Mwarf :( [15:00:29] wait a sec for the new jobs [15:01:43] joal: index_hadoop_webrequest_2017-12-01T14 is sent by hadoop right? [15:02:01] webrequest-druid-hourly-wf-2017-12-1-8 failed [15:02:16] it is elukey [15:02:35] but it is managed by druid [15:02:55] ahhaha joal take a look at the overlord console [15:03:29] hm - looks like druid1001 is back in the game [15:03:52] and also that there are real time indexers for 14:00 [15:04:08] Yes, up to 16:10 [15:04:23] This is expected behavior ) [15:04:28] is it?? [15:04:33] I am missing some stuff the [15:04:35] *then [15:04:58] elukey: tranquility allows for waiting for late events [15:05:04] we have set this to 10 minutes [15:05:59] it still doesn't make sense why there are indexers for UTC 14 and UTC 15 [15:06:26] elukey@druid1003:/var/log/druid$ date [15:06:26] Fri Dec 1 15:05:30 UTC 2017 [15:06:50] what is banner_activity_minutely-2017-12-01T14:00:00.000Z-0001 supposed to index ? [15:06:52] UTC14 are waiting for events to arrive, up to 10 minutes late [15:06:58] They'll be finished at 16:10 [15:07:27] ah sorry you are saying 16:00 our timezone [15:07:36] correct sir, sorry my mistake [15:07:42] nono now it makes sense!
[15:07:45] slow friday [15:07:52] okok so it seems back in the correct shape then [15:08:20] agreed [15:08:33] except for hadoop related jobs :( [15:09:00] !log rerun pageview-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency [15:09:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:09:21] elukey: let's wait for that one [15:09:24] yep [15:09:29] super curious [15:10:26] another theory that I have is that when rebooting nodes, we should check which zookeeper node holds sessions from the overlord master [15:10:37] and possibly avoid an overlord change [15:11:30] there you go, task submitted to the overlord (from the logs) [15:11:44] and assigned to druid1001 correctly [15:20:14] !log rerun webrequest-druid-hourly-wf-2017-12-1-8 after an unexpected Druid Overlord inconsistency [15:20:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:20:33] actually I flipped them, I restarted webrequest then pageviews [15:20:41] no prob [15:21:57] elukey: pivot is now refreshed with data making more sense [15:22:16] elukey: like, a drop between 1 and 2 pm UTC [15:23:01] * elukey nods [15:33:25] Hi all! Quick question... have there been issues with analytics infrastructure this morning? Seeing some bizarre data in banner activity logs on Druid/Pivot: https://goo.gl/WfKTKG [15:33:32] thx in advance!!
[15:33:50] (03PS6) 10Fdans: [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) [15:35:04] (03PS8) 10Milimetric: [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [15:35:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [15:36:38] (03CR) 10jerkins-bot: [V: 04-1] [wip] Map component and Pageviews by Country metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/392661 (https://phabricator.wikimedia.org/T181529) (owner: 10Fdans) [15:38:18] AndyRussG: hi! yes [15:38:45] we had issues while rebooting a druid node for maintenance :( [15:38:54] so realtime data got affected [15:38:58] elukey: ah ok understood [15:39:17] but so Hive data is still good, correct? [15:39:38] yep, plus we'll backfill that data once the regular hadoop jobs will run [15:40:38] right :) cool thanks much :D Mmmm also got a report about some issues on stat1005, don't have the details yet... any ideas about possible issues there? [15:41:33] AndyRussG: nope, but possibly some overload due to people crunching data.. let me know if you have more info and I'll try to check! [15:41:38] no alarms fired today though [15:42:27] elukey: ok thanks! yeah much, I'll let you know in a bit.. [15:43:47] heheh switch positions of the words "yeah" and "much" ^ [15:45:05] word-level dyslexia? or dyscribia? 
[15:46:44] probably simply Friday kicking in :) [15:47:00] elukey: dear luca [15:47:03] it's time [15:47:35] fdans: tell me more [15:47:51] if you could dump the contents of test_pageviews_bycountry in a tsv i'd be eternally grateful [15:48:09] the job just finished running elukey [15:48:11] do you have a query that does it ? [15:48:19] oh yes, 1sec [15:50:48] this will do I think elukey : [15:50:49] cqlsh -e "select * from \"test_pageviews_bycountry\".data" > out.csv [15:51:12] how many rows are we talking about ? [15:51:36] about 2700 [15:51:52] (number of wikimedia projects * 3) [15:52:44] where do you want the file uploaded to? [15:53:29] fdans: --^ [15:54:56] elukey: is it ok to send it to me by email? or any way that's convenient for you that isn't public [15:55:26] fdans: I'll upload it on one of the stat boxes [15:55:29] in your home dir [15:55:32] stat1005 ok? [15:55:40] sounds great! [15:55:49] thank you luca [15:57:24] fdans: /home/fdans/aqs_test_out.csv [15:57:26] there you go [15:57:39] graaaaaaazie elukey !!!! [15:57:48] can't wait to test this in beta aqs [16:01:02] ping milimetric [16:08:37] Hi ebernhardson - I quickly looked at your CR in scala - Do you want me to thoroughly review it (I don't know much of what it does, might take me long), or just approve that the approach looks good? [16:16:03] joal: well, mostly you're the only person i know that actually writes scala more than once in a blue moon :) Tbh i don't know exactly what the calculations are either, it's a port of the algorithm from a python implementation.
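A note on the dump command above: `cqlsh -e "select * ..."` writes cqlsh's pretty-printed table, not a clean delimited file. For a proper CSV, cqlsh's COPY command is an alternative. A sketch — keyspace and table from the log, output path assumed; printed rather than executed:

```shell
# Dry-run sketch of exporting the Cassandra table as real CSV via cqlsh COPY.
# Keyspace/table from the log; 'out.csv' is an assumed output path.
KS_TABLE='"test_pageviews_bycountry".data'
CMD="cqlsh -e \"COPY $KS_TABLE TO 'out.csv' WITH HEADER = TRUE\""
echo "$CMD"
```

COPY handles quoting and headers itself, which matters less at ~2700 rows but avoids hand-cleaning the pretty-printed output.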
Mostly that the approach is sane and it's not doing things in weird ways that scala has better methods for [16:16:41] joal: and i suppose a note that because i pretty much only port from python to scala for performance reasons (this is ~20x faster) the scala code is somewhat un-idiomatic using arrays and mutable data [16:17:28] ebernhardson: I noticed that (scala idioms) [16:18:01] ebernhardson: In CR message, you said there was perf gain of using mutables - you confirm? [16:18:54] joal: yes, my first round i was using things like (0 until N).map { ... } to build up arrays, but converting them all to pre-filled arrays and using while loops increased speed from 13s to 4s on the included benchmark (which takes 90s in python) [16:19:19] ebernhardson: I'd be interested in double checking that - We heavily use scala functional way of doing stuff in some other jobs, and maybe it would be better using mutables ... [16:20:17] hum - I think there is an array-filling function in scala - Ok, I'll read the code and suggest possible idiomatic changes [16:20:46] this code is all pretty simple math running in a tight loop, so i think that's why it benefits.
the inner loops run something like 4M times in the benchmark [16:20:59] right [16:21:06] let's keep it this way then :) [16:28:06] also ebernhardson, this example gives us a strong +1 for using scala - I thank you for that :) [16:29:06] ebernhardson: Not sure if you tried spark 2.1 (spark2-submit from any stat machine), but it should also be really faster than 1.6 [16:34:07] Ah - Looks like you've actually already done that :) [16:34:11] ebernhardson: --^ [16:34:15] joal: yea, we are on 2.1 :) [16:34:21] great [16:46:49] joal: let me know when you give a 1st pass through the docs for aqs edit endpoint and i can help as needed [16:53:06] the more I read on druid's user mailing list the more I think that the indexing service is not super resilient to zk failures [16:54:21] Fangjin wrote (long time ago but it might still be relevant): "The indexing service requires ZK to be able to assign tasks, without it task assignment will timeout. I agree Druid needs to be made more resilient to ZK problems." [17:31:05] the druid people are awesome [17:31:07] 2017-12-01T12:44:44,985 INFO io.druid.indexing.overlord.TaskMaster: By the power of Grayskull, I have the power!
[17:31:12] joal: --^ [17:31:32] huhuhu :) [17:39:53] (03PS9) 10Milimetric: Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [17:39:59] (03CR) 10jerkins-bot: [V: 04-1] Work so far on simplifying and fixing breakdowns [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) (owner: 10Milimetric) [17:44:10] nuria_: I'm thinking of the scheme for docs: if I follow the current way we do, I'll create 5 new pages, one per endpoint header (edits, editors, new-registered-users, edited-pages and bytes-differences) [17:45:04] I'm happy to do that, with a getting started for each, and links to AQS and mediawiki-history-reconstruction docs [17:45:09] nuria_: --^ [17:45:24] joal: on meeting can talk in a bit [17:46:11] np nuria_ [17:50:36] joal: found something interesting [17:50:57] index_realtime_banner_activity_minutely_2017-12-01T13:00:00.000Z_0_0 and the other two replicas have running traces only on druid1001's middle manager [17:51:07] (those ones failed) [17:51:53] the main issue seems to be the middle manager asking to acquire the time window lock (12->13) from the overlord, that returned 500 since it didn't find any lock registered for the tasks [17:52:38] and IIUC it is the overlord that is responsible for managing the task locks [17:53:29] so, I am wondering if this is a problem due to an unclean shutdown of the middle manager, maybe together with the change in the overlord [17:54:28] maybe we could re-try following the procedure to drain the middle manager on druid1002, wait for the host to be free from work, and then shut down its overlord/coordinator/broker/etc.. [17:54:32] and reboot [17:55:02] anyhow, this smells a lot like a Druid bug [17:56:55] going offline people! [17:56:57] o/ [17:57:04] have a good weekend :) [17:57:06] * elukey off!
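The "drain first" procedure proposed above maps onto the middle manager's worker API: disable the worker so the overlord assigns it no new tasks, poll until its task list is empty, then stop the daemons and reboot. A sketch — host from the log, endpoints per Druid's worker API; commands are printed rather than executed:

```shell
# Dry-run sketch of draining a middle manager before a reboot.
# Host from the log; /druid/worker/v1/* endpoints per Druid's worker API.
MM="druid1002.eqiad.wmnet:8091"
# 1. Stop accepting new task assignments:
DISABLE_CMD="curl -X POST http://$MM/druid/worker/v1/disable"
# 2. Poll until no running tasks remain, then stop daemons and reboot:
TASKS_CMD="curl -s http://$MM/druid/worker/v1/tasks"
echo "$DISABLE_CMD"
echo "$TASKS_CMD"
```

After the reboot, the symmetric `/druid/worker/v1/enable` call puts the worker back into rotation; draining one host at a time should avoid losing realtime data, per the discussion above.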
[17:59:04] gone for dinner a-team [19:05:21] joal: I would just create a quickstart for all with examples like "how do you get the number of active editors for jp wiki" [19:05:32] "how do you get the number of edited pages" [19:05:47] and from each question link to maybe a more in depth doc [19:11:29] milimetric: is your changeset for FF anywhere? [19:11:45] milimetric: I thought I would start webpack changes on it [19:13:12] milimetric: I think I found It: https://gerrit.wikimedia.org/r/#/c/391490/ [19:13:21] nuria_: I'm still rebasing though [19:13:28] milimetric: ok [19:14:21] nuria_: but if you make a dependent change it should be fine, you should be able to rebase cleanly after I do [19:15:37] milimetric: the approach i am going to use might imply reshuffling imports [19:15:48] milimetric: so i will wait for rebase [19:15:56] I see, ok [19:24:20] 10Analytics-Kanban: Beta: Wikistats split webpack bundle - https://phabricator.wikimedia.org/T181841#3804054 (10Nuria) [19:32:35] back [19:33:53] nuria_, milimetric, fdans: I suggest Analytics/AQS/Wikistats2 as a first page to document all the endpoints we'll have serving wks2 related data [19:34:22] When we'll have more than wikistats-oriented data (historical, deletion drift and so on), we'll rename if we think it's needed [19:34:25] joal: just do Analytics/AQS/Wikistats (no 2) because we don't have two Wikistatseses on AQS :) [19:34:27] Ok for you guys? [19:34:39] Works for me milimetric :) [19:34:52] yeah, it has enough context that it doesn't need the version here [19:35:07] joal: agreed [19:37:39] fdans: I wait a few minutes, then proceed ;) [19:40:09] yesshhh [19:40:17] joal [19:40:27] yooow fdans [19:40:34] sorry to ping you so late [19:40:40] I know you're an early starter [19:40:54] fdans: ok for Analytics/AQS/Wikistats? [19:42:07] yes!
sounds great joal [19:42:13] awesome :) [19:42:16] Thanks mate [20:10:14] 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3804275 (10Nuria) Countries for last 24 months: use wmf; SELECT country, views FROM ( SELECT country, SUM(view_c... [20:42:29] (03PS10) 10Milimetric: Simplify and fix breakdowns and other data [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/391490 (https://phabricator.wikimedia.org/T180556) [20:42:55] ok nuria_ (fdans): finished, rebased, tested ^ [20:43:13] it looks like there's not enough data to show the last month alone, but everything else seems to work [20:43:31] please do test yourselves too and let me know if I slipped on anything, this concludes like two weeks of intensive thinking and coding [20:43:34] * milimetric is spent :) [20:43:58] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Alpha Release: Breakdowns don't work in Firefox - https://phabricator.wikimedia.org/T180556#3804403 (10Milimetric) [21:15:34] milimetric, nuria_ : 20:44:47 -!- leila [~leila@tan2.corp.wikimedia.org] has quit [Quit: Leaving.] [21:15:39] oops [21:15:52] milimetric, nuria_: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats [21:17:29] joal: I would put the data quality notice in a warning box template. Also maybe something about the api being alpha [21:30:24] milimetric: better? [21:32:01] nice joal [21:32:32] milimetric: Fighting with wiki-templates is not something I'm good at :) [21:33:44] joal: oh, I think there are like 3-4 people who are actually good at that [21:33:54] :) [21:41:34] Ok, good for tonight - Have a good weekend a-team [22:55:15] 10Analytics, 10MediaWiki-API: There is not an easy way to tag API requests by application for analytics - https://phabricator.wikimedia.org/T181862#3804738 (10dbarratt)