[00:06:10] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3849995 (10jmatazzoni)
[00:07:19] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3850018 (10jmatazzoni)
[00:07:54] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3849995 (10jmatazzoni)
[00:16:15] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3850061 (10Aklapper)
[00:24:43] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0 ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3850073 (10jmatazzoni)
[00:28:42] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0 ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3850096 (10jmatazzoni)
[00:42:44] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make Kafka JobQueue use Special:RunSingleJob - https://phabricator.wikimedia.org/T182372#3850105 (10Tgr)
[00:42:48] <wikibugs>	 10Analytics, 10EventBus, 10MediaWiki-API, 10MediaWiki-JobQueue, and 3 others: Handling of structured data input in MediaWiki APIs - https://phabricator.wikimedia.org/T182475#3850102 (10Tgr) 05Open>03Resolved a:03Tgr >>! In T182475#3828649, @Anomie wrote: > Probably go with your second bullet: Define...
[00:44:05] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3850110 (10jmatazzoni) I just tried the Content tab, and see the same inability to combine filters in the Edited Pages stats.
[06:51:00] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10ops-eqiad, 10User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3850463 (10elukey) 05stalled>03Resolved
[08:03:26] <joal>	 Hi elukey - Here early today (but will be gone early as well)
[08:03:41] <joal>	 elukey: Let me know when you're coffeinated enough to look at those druid logs :)
[08:19:20] <elukey>	 joal: o/
[08:19:30] <elukey>	 going to bootstrap 10m and then I think we can go
[08:19:40] <joal>	 please take your time :)
[08:23:22] <elukey>	 so db1107 is nicely sanitizing data, ~30 tables sanitized
[08:23:31] <joal>	 \o/
[08:23:32] <elukey>	 so I guess it will run for another week or so
[08:23:39] <elukey>	 hopefully without exploding
[08:23:40] <elukey>	 :D
[08:23:44] <elukey>	 but metrics looks good
[08:23:44] <joal>	 :D
[08:24:34] <elukey>	 kafka1023 is borderline, two partitions with some GBs left, but it should last until Andrew is here
[08:25:02] <joal>	 ok elukey - If there's too much pressure, we can do it together, but I'd rather wait for master
[08:25:03] <elukey>	 we are planning to attempt a data move that should be fine but I'd do it with him around :)
[08:25:09] <elukey>	 yeah :)
[08:25:32] <elukey>	 so the idea is to stop the broker, move the topic partitions from the fullest disk partition to other ones, restart the broker
[08:25:40] <elukey>	 it should work nicely
[08:26:01] <joal>	 elukey: I hear-dropped that yesterday :)
[08:26:10] <joal>	 elukey: That's a smart yway to do it
[08:27:43] <elukey>	 ok so nothing is exploding so far, good :)
[08:28:07] <elukey>	 joal: did you find anything interesting in druid logs?
[08:28:40] <joal>	 elukey: no obvious errors, but a lot of interesting things on what is served
[08:29:19] <joal>	 elukey: I have the feeling Druid is spending a lot of time swapping between segments
[08:30:56] <joal>	 elukey: What ahve you found?
[08:31:02] <elukey>	 the jmx scraper (jvm metrics only) might need some tuning
[08:31:10] <elukey>	 a lot of garbage
[08:31:26] <joal>	 elukey: I recall having seen a few lines from that guy
[08:31:26] <elukey>	 I am reading now
[08:32:28] <elukey>	 joal: one thing that gehe*l mentioned some days ago is that the jmx exporter by itself it is a bit "naive" in the way it retrieves the mbeans to visualize 
[08:33:02] <elukey>	 because some of them might be expensive to retrieve and without any whitelist/blacklist it might affect performances a bit
[08:33:18] <elukey>	 but it is an easy fix that also doesn't require any druid daemon restart
[08:34:17] <joal>	 k
[08:34:25] <joal>	 elukey: just did a quick check for lines-number
[08:35:07] <joal>	 ~half of them come from io.prometheus.jmx.shaded.io.prometheus.jmx.JmxScraper
[08:35:27] <joal>	 1/4+ from io.druid.segment.IndexIO
[08:36:52] <joal>	 Let's be prices, top3: 47% from JmxScrapper, 30% from IndexIo, and 9% from JmxCollector
[08:36:59] <joal>	 s/prices/precise
[08:37:19] <joal>	 So it could indeed be related to metrics
[08:37:37] * elukey loves joal's precision 
[08:38:20] <joal>	 elukey: You're the one knowing about the internals of the Jmx thing - do you have ideas on things to improve (I'm not even sure how/if things chould be improveD)
[08:38:59] <elukey>	 joal: so I'd just blacklist all the mbeans that it can't read properly, like the runtime
[08:39:11] <elukey>	 we do a similar thing for hdfs datanode
[08:39:17] <elukey>	 lemme give you an example
[08:39:38] <elukey>	 blacklistObjectNames: - 'Hadoop:service=DataNode,name=DataNodeInfo'
[08:39:45] <joal>	 elukey: so that I understand better - Your metrics exporter uses Jmx, which in turns reads metrics from beans
[08:39:54] <joal>	 elukey: And Jmx has trouble reading some beans
[08:40:24] <elukey>	 well it just can't read or map some to prometheus metrics and it bails out, but timings seems pretty quick so I think it is not our problem
[08:40:25] <joal>	 elukey: And the line you pasted allows for tuning jmx so that it doesn't even try to extract metrics from some beans
[08:40:28] <elukey>	 but it removes noise
[08:40:34] <elukey>	 exactly
[08:40:40] <joal>	 cool :)
[08:40:49] * joal loves to better understand things :)
[08:41:05] <joal>	 elukey: Could certainly remove some noise
[08:41:11] <joal>	 elukey: Shall we start with that?
[08:41:45] <elukey>	 yep yep finishing my log newspaper reading :D
[08:41:52] <joal>	 :Dn
[08:44:29] <joal>	 elukey: Did a quick analysis as well over JmxScrapper lines
[08:44:50] <joal>	 elukey: And I think I don't understand everything I read :)
[08:48:36] <elukey>	 joal: usually it is the same for me :) what I do is checkout https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Ports#JMX and use jconsole to inspect the mbeans
[08:49:09] <elukey>	 most of the garbage in there is probably a mbean exporter that we don't care
[08:49:17] <elukey>	 I use https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/Jconsole
[08:49:36] <joal>	 Nice :)
[08:50:00] <joal>	 elukey: What I find difficult here is to know which beans provide data that would be used by the exporter
[08:52:58] <elukey>	 joal: did you connect via jconsole?
[08:53:04] <joal>	 Nope
[08:53:06] <elukey>	 we can do it over hangouts
[08:55:07] <joal>	 elukey: trying now
[08:55:12] <joal>	 hangout can do as well :)
[08:55:27] <joal>	 To the cave !
[09:12:16] <elukey>	 joal: this is the idea - https://gerrit.wikimedia.org/r/#/c/399360/1/modules/profile/files/druid/jvm_prometheus_jmx_exporter.yaml
[09:12:34] <elukey>	 syntax in https://github.com/prometheus/jmx_exporter
[09:12:37] <joal>	 elukey: Makes sense
[09:12:58] <elukey>	 ah double ::
[09:12:59] <elukey>	 sign
[09:13:00] <elukey>	 fixing
[09:13:45] <elukey>	 so if you are ok I'd test it on druid1002 
[09:25:55] <joal>	 good for me elukey 
[09:26:35] <elukey>	 applied to historical, looks fine
[09:27:36] <elukey>	 so in numbers
[09:27:45] <elukey>	 jmx scrape duration on druid1001 (no fix applied)
[09:27:46] <elukey>	 jmx_scrape_duration_seconds 0.030194671
[09:27:51] <elukey>	 on druid1002
[09:28:07] <elukey>	 9.71485E-4
[09:28:13] <elukey>	 (0.000971485)
[09:29:01] <joal>	 :)
[09:32:54] <wikibugs>	 10Analytics, 10Analytics-Wikistats: When searching for a project language, display a full list of languages - https://phabricator.wikimedia.org/T182960#3850654 (10Trizek-WMF) >>! In T182960#3848944, @Nuria wrote: >>See what has been done for Compact languages link. Th > This is heavily desktop-focused, our app...
[09:32:57] <mforns>	 helooooo team
[09:33:04] <elukey>	 mforns: o/
[09:33:13] <joal>	 Hi mforns :)
[09:33:13] <elukey>	 joal: rolling it out on druid1002 first, then on the others
[09:33:21] <joal>	 awesome elukey 
[09:37:26] <elukey>	 tried to retrieve metrics for all the daemons, good
[09:37:31] <elukey>	 re-enabling puppet 
[09:40:05] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Druid Woes - https://phabricator.wikimedia.org/T183273#3850677 (10elukey) For the historical daemon, the prometheus jmx scrape timings are the following (before/after the change):  jmx_scrape_duration_seconds 0.030194671 vs jmx_scrape_duration_seconds 9.71485E-4  There w...
[09:40:13] <elukey>	 joal: puppet run everywhere
[09:41:32] <joal>	 okey :)
[09:41:39] <elukey>	 I just tested pageviews-hourly 1d/7d and it didn't timeout, but it might probably be luck
[09:41:43] <elukey>	 can you test your query?
[09:41:49] <joal>	 Testing on something else now
[09:41:54] <joal>	 doesn't seem good :(
[09:42:32] <elukey>	 yep it makes sense, I'd have been really surprised if it was only this
[09:42:37] <joal>	 My test is: take some simple things (webrequest, yesterday, by hour)
[09:42:46] <joal>	 It worked )
[09:43:23] <elukey>	 really?
[09:43:25] <joal>	 now testing last quarter of pageviews-daily // This one seems to fail
[09:43:38] <joal>	 yes, failed
[09:43:50] <elukey>	 last quarter means 3 months?
[09:44:46] <joal>	 yes, but pageviews-daily is way smaller than hourly (1 month = 1.5G of data)
[09:44:57] <joal>	 so it should be able to do it
[09:45:00] <joal>	 But failed
[09:46:35] <joal>	 Interestingly enough, smaller datasets (popups) manage to make it
[09:46:42] <joal>	 I might be too greedy
[09:46:56] <joal>	 But I recall a time when a quarter on pageview-daily was easy peasy
[09:48:04] <elukey>	 so https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1 shows a ton of cached objects 
[09:48:10] <elukey>	 after our queries
[09:48:56] <joal>	 This is good news
[09:49:13] <elukey>	 I'd like to draw a line, if possible, between queries that were surely working a week ago vs the ones that should work but we don't recall if they were consistently not ending in timeouts before
[09:49:36] <joal>	 makes sense elukey 
[09:50:12] <elukey>	 for me now queries for oct1->Dec (pageviews daily) seem to work
[09:50:27] <elukey>	 (at least the view count)
[09:50:49] <joal>	 elukey: trying superset dashboard
[09:52:05] <elukey>	 because I am pretty sure that at some point we'll hit the current misconfiguration limitations
[09:52:43] <elukey>	 I'd also restart one historical with debug logging
[09:52:44] <elukey>	 maybe two
[09:53:03] <joal>	 elukey: dashboard still fails
[09:53:20] <joal>	 This is weird though - dashboard worked last week for sure :(
[09:54:15] <elukey>	 last change to the metrics was on the 12th
[09:54:26] <elukey>	 (i deployed the realtime metrics)
[09:54:44] <joal>	 That day I had dashboard created (evening IIRC)
[09:56:45] <elukey>	 and it was working fine?
[09:56:55] <joal>	 elukey: It was slow, but working
[09:57:14] <joal>	 elukey: timeout for superset is higher (60 seconds) - And now i does timeout
[09:58:16] <elukey>	 so the other theory that I have is that each druid process/daemon might get stuck in sending metrics to the prometheus-druid-exporter
[09:58:26] <joal>	 elukey: I'm testing 1 query (instead of dashboard) - It looks at yesterday's pageviews-hourly, filter for user only, by project
[09:58:46] <elukey>	 but why all of a sudden
[09:58:57] <joal>	 elukey: That's in the back of my mind as well :(
[09:59:02] <joal>	 I have no clue :(
[09:59:09] <joal>	 Maybe it piles up?
[09:59:48] <elukey>	 ok restarted historical on druid1001 with debug
[10:00:14] <elukey>	 once it loads the segments we can try to test
[10:01:59] <elukey>	 I think we are good
[10:02:45] <elukey>	 jmx scraper much better :)
[10:02:51] <joal>	 great :)
[10:03:06] <joal>	 elukey: my query was cached I think
[10:03:21] <joal>	 or something else happened: got result in almost realtime
[10:03:34] <joal>	 esting dashboard
[10:03:52] <joal>	 works WAY faster, got results
[10:04:08] <joal>	 Yes
[10:04:20] <joal>	 Something happens, as if back into normal state
[10:04:45] <joal>	 Every test I run now is working as exepected
[10:04:46] <elukey>	 debug logging speeding up druid haha
[10:04:50] <joal>	 :D
[10:04:56] * joal loves new technology
[10:05:41] <joal>	 elukey: This feels like a memory-leak in broker (since the only thing on druid1001 that is different from other is that broker is used)
[10:08:43] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3850760 (10fgiunchedi) >>! In T179093#3845141, @Ottomata wrote: >>  Regarding 1 [...] for the large traffic cross-dc communication to go over verified protocols like TCP/Kafka, as...
[10:09:50] <elukey>	 joal: currently watching things like
[10:09:52] <elukey>	 2017-12-20T10:08:55,988 DEBUG com.metamx.emitter.core.HttpPostEmitter: Running export with version[8], eventsList count[123], bytes[50210], batches[1]
[10:09:55] <elukey>	 2017-12-20T10:08:55,989 DEBUG com.metamx.emitter.core.HttpPostEmitter: Sending batch to url[http://localhost:8000/], batch.size[123]
[10:09:58] <elukey>	 2017-12-20T10:08:55,989 DEBUG com.metamx.http.client.NettyHttpClient: [POST http://localhost:8000/] starting
[10:10:01] <elukey>	 2017-12-20T10:08:55,990 INFO com.metamx.http.client.pool.ChannelResourceFactory: Generating: http://localhost:8000
[10:10:04] <elukey>	 2017-12-20T10:08:56,006 DEBUG com.metamx.http.client.NettyHttpClient: [POST http://localhost:8000/] messageReceived: DefaultHttpResponse(chunked: false)
[10:10:07] <elukey>	 HTTP/1.0 200 OK
[10:10:10] <elukey>	 Date: Wed, 20 Dec 2017 10:08:56 GMT
[10:10:12] <elukey>	 Server: WSGIServer/0.2 CPython/3.4.2
[10:10:12] <elukey>	 those are the timings for the metrics
[10:10:15] <elukey>	 Content-Length: 0
[10:10:17] <elukey>	 2017-12-20T10:08:56,006 DEBUG com.metamx.http.client.NettyHttpClient: [POST http://localhost:8000/] Got response: 200 OK
[10:11:03] <joal>	 elukey: I'm assuming there is a dedicated thread for that as well
[10:11:11] <elukey>	 and I expect that the com.metamx.emitter.core.HttpPostEmitter does not block the daemon while pushing metrics
[10:11:26] <elukey>	 exactly :D
[10:11:27] <joal>	 elukey: This is siomething we should chek
[10:11:44] <joal>	 elukey: let's stop the daemon and see how druid brehaves?
[10:12:48] <joal>	 Currently playing with superset dashboard - It super fast
[10:13:08] <elukey>	 we can stop the daemon but all the druid daemons will keep pushing to it
[10:13:20] <joal>	 hm, makes sense
[10:13:24] <elukey>	 plus we don't really have a query that constantly fails now
[10:13:27] <joal>	 elukey: would be an interesting test though :)
[10:13:50] <joal>	 As I'm querying, I see the caching growing in metrics - sounds fine
[10:15:04] <joal>	 elukey: We should also setup LVS for broker in private cluster
[10:15:22] <joal>	 I'm looking at load for last 1/2 hour
[10:15:29] <joal>	 And 1001 takes most of it
[10:16:04] <elukey>	 poor 1001
[10:16:10] <elukey>	 joal: another interesting thing - https://grafana.wikimedia.org/dashboard/db/prometheus-druid?panelId=8&fullscreen&orgId=1&from=now-1h&to=now
[10:16:52] <joal>	 elukey: hm
[10:17:14] <joal>	 not sure how to intperet that - Possibly that druid now actually try to do something
[10:18:00] <joal>	 elukey: https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=druid1001&var-network=eth0&from=now-30m&to=now&panelId=16&fullscreen
[10:19:13] <elukey>	 ouch
[10:19:30] <elukey>	 we have also these options for historical: -XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxDirectMemorySize=32g
[10:20:13] <elukey>	 but total heap size of 12g
[10:20:15] <joal>	 elukey: super fun how 3 of them have a peak of disk usage when broker restarts
[10:20:34] <elukey>	 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?panelId=3&fullscreen&orgId=1&from=now-1h&to=now
[10:20:55] <joal>	 pfff
[10:21:12] <elukey>	 so it feels like the jvm is trying to be  conservative in newgen heap usage
[10:21:44] <elukey>	 plus it is swapping
[10:22:03] <elukey>	 because of course it can't really map a lot of segments to page cache
[10:22:59] <joal>	 elukey: his is really super weird - Currently doing other trials (queries on other datasets, to force druid to load segments) - works like a charm
[10:23:39] <elukey>	 maybe if it manages to load segments then it works fine
[10:23:50] <elukey>	 but if it encounters slowdowns
[10:23:56] <elukey>	 then it timesout
[10:24:19] <joal>	 I'm pretty sure that's the thing - But how-come a broker-restart made it work globally - That's super weird
[10:24:44] <elukey>	 that was a historical restart, not broker right?
[10:24:46] <joal>	 elukey: looking at disk-usage for machines - they do work now
[10:24:49] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3850774 (10fgiunchedi) >>! In T179093#3845440, @Krinkle wrote: >>>! In T179093#3845424, @Ottomata wrote: >> More thoughts about ^:  in my conversation with @bblack, I had originall...
[10:25:05] <joal>	 hm, you restart historical on 1001 is all?
[10:25:12] <elukey>	 yep only it
[10:25:18] <joal>	 so weird
[10:27:21] <joal>	 elukey: if feels historical on 1001 was stuck, and preventing the system to work
[10:27:55] <joal>	 Now everything I try works
[10:28:30] * joal feels a bit like when using windows - When in doubt, reboot
[10:28:50] <elukey>	 let's think about what happens when a historical is down
[10:29:09] <joal>	 Coordinator reassing segments to other machines I think
[10:29:12] <elukey>	 the coordinator knows it a probably assigns segments responsibilities to other machines
[10:29:17] <elukey>	 :D
[10:29:26] <joal>	 ok, we're on the same page
[10:29:51] <joal>	 Then historical get's back up, meaning there possibly is some segment shuffle
[10:31:06] <joal>	 My guess would be: historical is given responsibility for some segment on d1, but because of RAM load, has problem to fetch them (no RAM available for system-cache)
[10:31:28] <elukey>	 joal: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=now-6h&to=now&panelId=14&fullscreen&var-server=druid1001&var-network=eth0
[10:31:31] <elukey>	 checkout cached memory
[10:31:53] <joal>	 yup
[10:32:02] <elukey>	 meanwhile on others there was a drop, and then a rise again
[10:33:12] <joal>	 elukey: I really think we should move forward with new mem-settings
[10:33:24] <joal>	 And setup LVS for broker
[10:34:00] <elukey>	 the second one is a big problem though
[10:34:26] <joal>	 Diff between 1 and 2/3 is broker - I think broker allocates memory on 2/3 but doesn't use more than initial allocation, thefore leaving some free space
[10:34:32] <joal>	 mwarf
[10:34:51] <joal>	 well elukey, having the first would already be a good step I guess :)
[10:35:56] <joal>	 elukey: We also could setup a NGINX in front of broker on 1001
[10:36:17] <joal>	 Given 1001 is a SPOF anyway, at least making use of other borkers would help
[10:40:19] <elukey>	 so about https://gerrit.wikimedia.org/r/#/c/399205/2
[10:40:32] <elukey>	 given what we know about memory consumption for the new gen
[10:40:48] <elukey>	 I'd review the NewSize/MaxNewSize settings
[10:41:08] <joal>	 elukey: hm, --verbose?
[10:41:18] <elukey>	 "The NewSize and MaxNewSize parameters control the new generation’s minimum and maximum size"
[10:41:25] <elukey>	 https://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html
[10:41:55] <joal>	 Yes - I recall that - What values should we use then?
[10:42:21] <elukey>	 I am not even sure if they are needed or not
[10:42:33] <elukey>	 the brokers are now using between 3 and 4.something GB of heap
[10:43:00] <joal>	 We have Xms-5G for it, LARGE
[10:43:06] <elukey>	 one historical holds 5.something GB of heap
[10:43:42] <elukey>	 lemme check one thing via jconsole
[10:43:49] <joal>	 sure
[10:45:10] <joal>	 As of the CR, we are at 30Gb max for processes
[10:45:28] <joal>	 I also think overlord Xms could be lowered
[10:46:14] <elukey>	 so on druid1001 the broker uses ~4,3GB of heap now
[10:46:18] <elukey>	 75% is new gen
[10:46:48] <elukey>	 we are setting 4g as max size for new gen now, that might be ok
[10:46:59] <elukey>	 it will probably trigger some GC activity
[10:47:02] <elukey>	 but it should be fine
[10:47:20] <joal>	 k
[10:47:35] <joal>	 I mean, GC is still not that bad ;)
[10:48:22] <elukey>	 oh yes I am only reviewing the values :)
[10:53:34] <elukey>	 so with the new settings we have a max of 36GB and a min of 13GB usable for heap
[10:54:15] <joal>	 elukey: sounds awesome
[10:56:06] <elukey>	 "The amount of direct memory needed by Druid is at least druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1). You can ensure at least this amount of direct memory is available by providing -XX:MaxDirectMemorySize=<VALUE> at the command line."
[10:59:11] <elukey>	 that should be ~5.6GB for broker and 5.6GB for historical
[10:59:46] <elukey>	 that would leave in both min/max heap usage use cases a lot of ram for os page cache
[11:01:23] <joal>	 Yessir
[11:01:30] * joal likes that
[11:03:14] <elukey>	 ok I like it
[11:03:53] <elukey>	 joal: ready to merge + rollout ?
[11:04:06] <elukey>	 sanity check at druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)
[11:04:09] <elukey>	 nope
[11:04:14] <elukey>	 https://gerrit.wikimedia.org/r/#/c/399205/3/hieradata/role/common/druid/analytics/worker.yaml
[11:04:42] <joal>	 elukey: proof-reading again
[11:05:49] <joal>	 elukey: given that we know (out of the line you pasted before) that druid will try to get 5.6Gb for historical and broker straight-away - Should we put 6Gb as Xms?
[11:07:05] <joal>	 Also elukey - IIRC druid prod doc was advising 20/30Gb for broker - Should we move Xmx for broker to 16G?
[11:07:57] <joal>	 All the rest seems fine :)
[11:08:13] <joal>	 And actually, all seems fine, but I wonder :)
[11:11:32] <elukey>	 so I am reading https://groups.google.com/forum/#!topic/druid-user/ccT4ItX1_Wc
[11:11:48] <elukey>	 Segments are just memory mapped in the OS cache and not governed by max direct memory setting. 
[11:11:51] <elukey>	 direct memory is used to allocate bytebuffers used for query processing. 
[11:12:14] <elukey>	 so I think that those 5.6 are off heap
[11:12:28] <joal>	 elukey: Ah !
[11:12:32] <elukey>	 I thinkk
[11:12:37] <elukey>	 does it make sense?
[11:13:43] <joal>	 it does
[11:14:20] <elukey>	 shall we?
[11:15:09] <joal>	 elukey: rethinking the need for Xms/Xmx so big if buffers are not in heap
[11:15:18] <joal>	 elukey: sorry :(
[11:15:49] <elukey>	 good to discuss it! I see your point but nontheless I'd be conservative and not shrink Xmx too much
[11:15:55] <elukey>	 paranoid me speaking
[11:16:26] <elukey>	 calculations are consistent now and if we see the need we can review them later on
[11:16:46] <elukey>	 like when loading a year of pageviews in less than 10ms will be annoying
[11:16:52] <elukey>	 :D
[11:17:02] <joal>	 Good elukey - Works for me
[11:17:13] <joal>	 :D
[11:17:20] <joal>	 LET'S GO !
[11:18:02] <elukey>	 joal: batcave?
[11:18:06] <joal>	 sure elukey 
[11:22:19] <fdans>	 mforns: mind taking a look at this? https://gerrit.wikimedia.org/r/#/c/394062/
[11:22:41] <mforns>	 lookin!
[11:46:43] <elukey>	 mforns: db1107 is sanitizing data nicely!
[11:46:49] <mforns>	 :D
[11:47:03] <mforns>	 elukey, does this mean the parent task is done?
[11:49:20] <elukey>	 I'd say so yes :D
[11:49:28] <elukey>	 it will take some days to finish
[11:49:33] <elukey>	 but we should be done :)
[11:49:34] <elukey>	 \o/
[11:50:00] * elukey lunch!
[12:14:17] <wikibugs>	 (03CR) 10Mforns: "Looks good overall, but I have some subjective opinions on a couple things: reporting only top 100, and bucket names. Also there is a typo" (0310 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521) (owner: 10Fdans)
[12:17:30] <mforns>	 elukey, oh yes, we still have to wait for it to finish, but it's ok, so closeeeeee
[12:31:57] <wikibugs>	 (03CR) 10Mforns: [C: 032] "LGTM!! I didn't test it though. Will give a +2, but won't merge." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: 10Fdans)
[12:53:37] <elukey>	 mforns: also super good that we got "flock: failed to get lock" for today's sanitization run
[12:53:43] <elukey>	 since another one is already running
[12:54:11] <elukey>	 we are now purging letter E tables
[13:14:07] <mforns>	 elukey, good
[13:19:50] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Druid Woes - https://phabricator.wikimedia.org/T183273#3851146 (10elukey) Today me and Joseph set debug logging for the Druid historical daemon on druid1001 and we restarted it to check the source of the timeouts. We noticed something really strange, namely that all of a...
[13:19:50] <wikibugs>	 (03CR) 10Joal: "One typo and one question: is the modtime of an input partition the max update-time of any file of that partition?" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399105 (owner: 10Ottomata)
[13:27:18] <wikibugs>	 10Analytics, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3851217 (10elukey)
[13:28:41] <wikibugs>	 10Analytics-Kanban, 10DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3851221 (10elukey)
[13:30:41] <wikibugs>	 10Analytics-Kanban, 10DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (10elukey) Logged activity on the parent task instead of this one :)  https://gerrit.wikimedia.org/r/398869 https://gerrit.wikimedia.org/r/399149 https://gerrit.wikimedia.org/r/399153  Men...
[13:31:30] <wikibugs>	 10Analytics-Kanban, 10DBA: Purge all old data from EventLogging master - https://phabricator.wikimedia.org/T168414#3363866 (10elukey) a:03elukey
[13:44:48] <wikibugs>	 (03CR) 10Mforns: [C: 031] "LGTM! Didn't test that though. If you want to pair for testing, let me know!" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399105 (owner: 10Ottomata)
[13:45:43] <ottomata>	 elukey:  hiiiii
[13:45:59] <elukey>	 ottomata: hiiiiii
[13:46:06] <ottomata>	 i think we may have accidentally lost auto hdfs user creation when we refactored hadoop profile stuff...am fixing buuut Q:
[13:46:14] <ottomata>	 i didn't realize there was a profile hiera backend?!
[13:46:29] <ottomata>	 hieradata/common/profile/hadoop/common.yaml
[13:46:48] <elukey>	 ah sorry did I miss a piece of code ? :(
[13:47:09] <ottomata>	 i think so?  the hadoop users group is still defined in the cdh module hiera
[13:47:14] <ottomata>	 and i think it isn't getting applied?
[13:47:15] <ottomata>	 not totally sure
[13:47:19] <ottomata>	 but, i'm just poking around
[13:47:31] <ottomata>	 the profile hiera just what, always gets lookedup for a profile?
[13:47:38] <ottomata>	 if those values aren't defined in a role?
[13:47:39] <ottomata>	 hiera?
[13:47:48] <elukey>	 ahhhh snap it is auto-looked up?
[13:47:53] <ottomata>	 # Ensure that users in these posix groups have home directories in HDFS.
[13:47:53] <ottomata>	 cdh::hadoop::users::groups: "analytics-users analytics-privatedata-users analytics-admins analytics-search-users"
[13:48:03] <elukey>	 yes I missed it then
[13:48:42] <ottomata>	 i guess i'll move that into profile hadoop common.yaml?
[13:48:48] <elukey>	 yep exactly, sorry :(
[13:48:52] <ottomata>	 i need to look it up in both master and standby
[13:48:58] <ottomata>	 but it won't be used by profile common .pp
[13:48:58] <ottomata>	 hmm
[13:51:43] <ottomata>	 wow elukey i didn't quite realize what's happening here with this profile yaml
[13:51:48] <ottomata>	 its using module hiera lookup
[13:51:50] <ottomata>	 for profile
[13:51:56] <ottomata>	 are we supposed to do that?
[13:52:03] <ottomata>	 i thought not...but i do see other yaml files in there
[13:52:43] <elukey>	 well it is in the hiera lookup path, and since it is standard for everything I thought it was appropriate to use, rather than replicating all the parameters each time
[13:52:49] <elukey>	 but I might be wrong
[13:52:57] <elukey>	 I thought you were ok this is why I proceeded :D
[13:53:10] <ottomata>	 haha, i was!  i don't think i realized what was happening! :p
[13:53:25] <ottomata>	 i mean, i don't mind, but i don't see the benefit now, of moving from cdh module params to profile module params
[13:53:37] <ottomata>	 its exactly the same thing as it was before, except there are new parameter names
[13:53:50] <ottomata>	 instead of the hiera being looked up per role
[13:54:14] <ottomata>	 ANY role that might include the profile will get those vars
[13:55:27] <ottomata>	 hmm, ok we will need to think about this more
[13:55:36] <elukey>	 sure, in the future if we want to change it it will be easier though than having classes that auto-lookup parameters
[13:55:44] <elukey>	 and each time it was a no-op
[13:55:58] <ottomata>	 ?
[13:56:01] <ottomata>	 it is the same though
[13:56:14] <ottomata>	 all we've done is moved assignment of params from cdh:: var names to profile:: var names
[13:56:19] <ottomata>	 and then pass them manually to the class
[13:56:25] <ottomata>	 you can't override them with role
[13:56:48] <ottomata>	 module lookup takes precedence over role
[13:56:52] <ottomata>	 i think..?
[13:56:55] <elukey>	 yep
[13:56:59] <ottomata>	 wait no that can't be right because you are overriding with role?
[13:57:08] <ottomata>	 e.g. profile::hadoop::common::yarn_heapsize: 2048
[13:57:13] <ottomata>	 in standby.yaml
[13:57:16] <elukey>	 no I am not, the variables that stays the same are in the profile
[13:57:21] <elukey>	 the ones that changes are in roles
[13:57:42] <elukey>	 I remember you commenting like "YAY FOR DRY" to this change :D
[13:57:45] <ottomata>	 yes, i was just reading the hierarchy backwards
[13:57:50] <ottomata>	 yesyesyesyes
[13:58:03] <elukey>	 I am sorry if this is not what you wanted :(
[13:58:04] <ottomata>	 haha, i know i liked it, i just didn't realize it was profile
[13:58:05] <ottomata>	 nono
[13:58:07] <ottomata>	 it was!
[13:58:11] <ottomata>	 i didn't realize what was happening
[13:58:14] <ottomata>	 not your fault at all
[13:58:20] <ottomata>	 i'm just really confused by all these hiera puppet rules
[13:58:23] <ottomata>	 they don't really make sense to me
[13:59:25] <ottomata>	 elukey:  i think we probably need to change the client.pp role to common.pp, and then move the common.yaml into role/ instead of profile/ hiera
[13:59:57] <ottomata>	 ah, but will the common role stuff be looked up if the common class isn't included via role() function?
[14:00:01] <ottomata>	 and instead embedded in aanother role?
[14:00:03] <ottomata>	 aye yai yai
[14:00:07] <ottomata>	 ok need to think more i guess
[14:00:33] <ottomata>	 elukey:  i need to figure out what's happening with users, buuut let's do aother things first
[14:00:38] <ottomata>	 kafka and then data deletion?
[14:01:13] <elukey>	 ottomata: I can do kafka while you check the users
[14:01:43] <ottomata>	 ok elukey! :)
[14:02:02] <ottomata>	 so you are going to shut down ka23 broker, move a few big webrequest partitions into a different partition, adn then start it back up?
[14:02:57] <elukey>	 yep I thought to move one text topic partition from the busiest disk partition to the emptiest ones
[14:03:09] <ottomata>	 elukey: , maybe just for safety, mv the partition out of the source dir e.g. /var/spool/kafka/c/data first into /var/spool/kafka/c (dir above)
[14:03:14] <ottomata>	 then copy it do its dest data/ dir
[14:03:28] <ottomata>	 that way if you have to quick revert you can quickly just move it back into c/data
[14:03:45] <ottomata>	 but if you dont' have to, you can delete the partition data from c/ after all is well
[14:03:54] <ottomata>	 elukey:  just one partition?
[14:04:09] <elukey>	 it should be around 340G, so enough no?
[14:05:27] <fdans>	 mforns: I tested the AQS patch in beta yesterday :)
[14:05:43] <mforns>	 ok, then feel free to merge!
[14:05:45] <mforns>	 :]
[14:06:33] <wikibugs>	 (03CR) 10Mforns: [C: 032] "LGTM! I'll give a +2. You probably already tested that, if so, feel free to merge! Otherwise, let me know if you want to pair :]" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399126 (https://phabricator.wikimedia.org/T182000) (owner: 10Ottomata)
[14:07:01] <ottomata>	 oh elukey yeah suppose so!
[14:11:19] <ottomata>	 elukey:  i'm wrong, the user module lookup is working (we should move it though)...something else is weird
[14:11:34] <ottomata>	 this new jonas guy: user jk's hdfs home dir isn't being created
[14:11:39] <ottomata>	 but other folks in analytics-privatedata-users are
[14:11:39] <ottomata>	 hmmm
[14:12:41] <elukey>	 mmmmm
[14:12:43] <elukey>	 stopping kafka
[14:18:00] <ottomata>	 I KNOW WHY
[14:18:04] <ottomata>	 its a bug in the create user dirs script
[14:19:58] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n: Move non-SI prefixes to user- or locale-specific interface - https://phabricator.wikimedia.org/T179906#3851381 (10mforns) a:03mforns
[14:20:54] <elukey>	 kafka is taking ages to copy the files
[14:22:48] <elukey>	 iostat shows 99% usage so it must be normal sigh
[14:31:24] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3851420 (10Ottomata) > I like this idea.  > I like this idea as well,  IIRC, @bblack did not like this idea, as he didn't like coupling service routing with traffic routing.  I don...
[14:36:50] <wikibugs>	 (03CR) 10Fdans: "@Mforns I'm just going to update the endpoint description, removing the provision that the results are limited to 100 countries." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: 10Fdans)
[14:43:57] <wikibugs>	 (03PS18) 10Fdans: Add top by country pageviews oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[14:55:29] <joal>	 elukey: I'm assuming you're repearing kafka, so we'll check druid later
[14:56:30] <ottomata>	 !log dropping some old wmf.webrequest partitions and data
[14:56:31] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:04] <elukey>	 joal: yeah :(
[14:57:11] <joal>	 no worries elukey :)
[14:57:21] <elukey>	 joal: all good up to now for the analytics druid cluster?
[15:00:24] <joal>	 elukey: I have not done anything more, queries are still super-fast
[15:05:44] <elukey>	 \o/
[15:05:57] <wikibugs>	 (03CR) 10Joal: "Ah!!!! I completely missed something: instead of pageview_hourly, you should use projectview_hourly !" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521) (owner: 10Fdans)
[15:07:37] <elukey>	 joal: do you want to do ops-sync now?
[15:08:00] <joal>	 sure elukey 
[15:08:55] <fdans>	 joal: I was testing the oozie job and it failed on load_cassandra... how do I find out more about this? 
[15:09:33] <wikibugs>	 (03PS1) 10Mforns: Use SI number suffixes [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/399413 (https://phabricator.wikimedia.org/T179906)
[15:09:38] <joal>	 fdans: in hadoop logs --> Find the app_id of the failed job, and look at this app logs with yarn logs
[15:17:00] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3851506 (10Jdcc-berkman) I have a home directory on stat1005, so whatever accounts are necessary to access that.  This is for [[ https://meta.wikimedia.org/wiki/Research:Analyzing_Accessibility...
[15:17:17] <AndyRussG>	 hey joal elukey, how's everything? Just to report Pivot/Druid looks to be back to its normal speedy/stable self :) thanks much!!!!!
[15:18:06] <joal>	 Hi AndyRussG - We spent time fixing that otday
[15:18:14] <joal>	 Thanks for letting us know it works for you :)
[15:18:27] <joal>	 AndyRussG: And of course if it happens again, please let us know again :)
[15:18:28] <elukey>	 AndyRussG: \o/ thanks for the confirmation! 
[15:26:08] <fdans>	 https://www.irccloud.com/pastebin/uLDnmm3I/
[15:26:22] <fdans>	 joal: from yarn logs - does this ring a bell at all?
[15:27:50] <joal>	 fdans: no really - I'd assume that there is a null value in your data, but would need to investigate more
[15:28:07] <joal>	 fdans: in meeting now, and with Lino after - Wil look at it later on tonight
[15:28:21] <joal>	 fdans: can you give me the app_id?
[15:28:37] <fdans>	 joal: 0016125-171207105303491-oozie-oozi-C
[15:28:43] <joal>	 k fdans 
[15:31:06] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3851537 (10Krinkle) >>! In T179093#3850774, @fgiunchedi wrote: > I like this idea as well, it makes easy to reason about the kafka flow of messages following the same path/routing...
[15:33:02] <wikibugs>	 (03CR) 10Mforns: [C: 032] "Oh, yea thanks!" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: 10Fdans)
[15:34:11] <Krinkle>	 ottomata: Thanks for your patience regarding multi-DC statsv. Just want to emphasise that I/we (perf-team) will support whatever you think is best for statsv. My main purpose is to understand it well, and not to have preferences for how its done. The ideas I'm describing were mainly to show the different bits of information I know of that I hadn't seen mentioned before in case it would matter, but in the end, don't mind either
[15:34:11] <Krinkle>	 way.
[15:34:12] <wikibugs>	 (03PS8) 10Fdans: Add pageviews by country endpoint [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520)
[15:36:54] <fdans>	 mforns: if you're ok with the description change in aqs I'll merge now
[15:40:31] <ottomata>	 !log removing some old webrequest data from hdfs
[15:40:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:42:06] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3851544 (10Milimetric) @jmatazzoni I answered this in our new FAQ: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats...
[15:49:17] <AndyRussG>	 joal elukey :) yeah really helpful!!! such a neat tool... BTW I'm right now working on a wee python library to consolidate and simplify some typical queries we do for CentralNotice stats in Jupyter... I'll send a link here when it's a far enough along :)
[15:51:01] <wikibugs>	 (03CR) 10Mforns: Add pageviews by country endpoint (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: 10Fdans)
[15:51:23] <mforns>	 fdans, LGTM! I suggested sth, but it's cool like that.
[15:53:10] <wikibugs>	 (03CR) 10Milimetric: [C: 032] Use SI number suffixes [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/399413 (https://phabricator.wikimedia.org/T179906) (owner: 10Mforns)
[15:53:59] <milimetric>	 thanks mforns!
[15:54:17] <mforns>	 thx milimetric :]
[15:59:04] <wikibugs>	 (03PS9) 10Fdans: Add pageviews by country endpoint [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520)
[16:02:13] <fdans>	 mforns: standuuup!
[16:03:07] <ottomata>	 Krinkle:  :) will respond on ticket after meetings.  thanks, i'm having a great time! :)
[16:03:38] <wikibugs>	 (03CR) 10Mforns: [V: 032 C: 032] "LGTM!" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: 10Fdans)
[16:12:58] <wikibugs>	 10Analytics-Kanban: Regular backups for EL data - https://phabricator.wikimedia.org/T183383#3851649 (10Nuria)
[16:25:37] <elukey>	 ottomata: kafka1023 bootstrapped
[16:25:46] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3851674 (10jmatazzoni) Thanks for the info @Milimetric.  I think the idea that combining filters would be cool understates the imp...
[16:25:53] <elukey>	 if you want to double check 
[16:26:19] <elukey>	 maybe before removing the temp data we can issue a preferred-replica-election?
[16:27:13] <elukey>	 https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&from=now-1h&to=now mmm metrics are not stable yet
[16:27:53] <ottomata>	 elukey:  looks good, but ya let's wait until the replica catches up
[16:27:59] <ottomata>	 and we do electio
[16:28:00] <ottomata>	 n
[16:31:11] <elukey>	 ah ottomata now I am getting what's happening
[16:31:36] <elukey>	 I moved webrequest-15 from spool/c to spool/e
[16:31:48] <elukey>	 and now it is ~2.7G
[16:31:56] <elukey>	 so all the traffic is the broker streaming back the data
[16:32:09] <elukey>	 err getting the data from other brokers
[16:32:31] <elukey>	 that explains https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=12&fullscreen&orgId=1&from=now-1h&to=now
[17:05:55] <elukey>	 	Topic: webrequest_text	Partition: 15	Leader: 20	Replicas: 18,14,20	Isr: 20,14
[17:05:58] <elukey>	 	Topic: webrequest_text	Partition: 16	Leader: 20	Replicas: 20,18,22	Isr: 20,22
[17:06:01] <elukey>	 ottomata: --^
[17:06:13] <elukey>	 looks that only those are not in sync, because data is coming in now
[17:06:23] <elukey>	 so it will probably take a lot of hours 
[17:06:34] <elukey>	 can I remove the old data?
[17:08:59] <ottomata>	 ya i think its fine
[17:09:02] <ottomata>	 if it bootstrapped we good
[17:10:52] <elukey>	 ok doing so!
[17:11:05] <elukey>	 good to know if it re-happens
[17:13:15] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3851801 (10Milimetric) @jmatazzoni we aren't actively developing that feature right now, because there are higher priorities, but...
[17:16:29] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Add clear definitions to all metrics, along with links to Research: pages - https://phabricator.wikimedia.org/T183261#3851802 (10Milimetric) The FAQ should be useful in the short term, especially for feature questions and roadmap questions.  But for metric questions,...
[17:38:52] <wikibugs>	 (03PS1) 10Milimetric: Fix width of axis [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/399438 (https://phabricator.wikimedia.org/T182817)
[17:42:13] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "@Ottomata Yes, but this commit renames the cli param from topic to topics, which is a breaking change." [analytics/statsv] - 10https://gerrit.wikimedia.org/r/391703 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata)
[17:45:56] <elukey>	 ottomata: just completed the jvm restarts for zookeper in conf2*
[17:46:02] <elukey>	 will do conf1* tomorrow
[17:46:04] <ottomata>	 ok cool
[17:55:08] <wikibugs>	 10Analytics-Kanban, 10Patch-For-Review: Add link to wikistats 2 from analytics.wikimedia.org - https://phabricator.wikimedia.org/T182904#3838173 (10Nuria) 05Open>03Resolved
[18:15:03] * elukey off!
[18:42:01] <wikibugs>	 (03CR) 10Ottomata: "> is the modtime of an input partition the max update-time of any file of that partition" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399105 (owner: 10Ottomata)
[18:42:09] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Add _REFINE_FAILED failure flag and skip refinement if it exists [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399105 (owner: 10Ottomata)
[18:42:35] <wikibugs>	 (03CR) 10Ottomata: "OOPS, i didn't submit my new patch fixing typos before merging.  will include typo fixes in next change." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399105 (owner: 10Ottomata)
[18:45:23] <ottomata>	 joal:  when you are back let's talk about jsonrefine stuff
[18:45:53] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Add clear definitions to all metrics, along with links to Research: pages - https://phabricator.wikimedia.org/T183261#3852160 (10Nuria) Ok, let's compile things to FAQ and move metric info elsewhere when it pertains.
[18:46:25] <wikibugs>	 (03PS5) 10Ottomata: JsonRefine - Be nicer about what type coercions we will accept [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399126 (https://phabricator.wikimedia.org/T182000)
[18:51:10] <wikibugs>	 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Support multi DC statsv - https://phabricator.wikimedia.org/T179093#3852180 (10Ottomata) Interesting, I think your solution is the simplest.  It allows us to support manual failover for multi-DC statsv, but wont' let us do any active/active stuff....
[18:57:42] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Can't combine 'Editor type' and editor 'Activity level' filters to narrow results (in WikiStats 2.0) - https://phabricator.wikimedia.org/T183316#3852196 (10jmatazzoni) I apologize for using intemperate language. But the figure I reference is a good example of what I mean. Ha...
[19:22:52] <wikibugs>	 (03CR) 10Ottomata: "Ah, but puppet does not set --topic (yet), it just uses the statsv.py default by not providing the CLI opt all." [analytics/statsv] - 10https://gerrit.wikimedia.org/r/391703 (https://phabricator.wikimedia.org/T179093) (owner: 10Ottomata)
[19:29:50] <joal>	 Heya ottomata 
[19:30:10] <ottomata>	 joal ehYYYy :)
[19:30:17] <joal>	 ottomata: JsonRefine?
[19:30:20] <ottomata>	 ya bc?
[19:30:23] <joal>	 OMW
[19:51:46] <wikibugs>	 (03CR) 10Joal: [C: 031] "Good for me !" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399126 (https://phabricator.wikimedia.org/T182000) (owner: 10Ottomata)
[19:54:09] <wikibugs>	 (03PS6) 10Ottomata: JsonRefine - Be nicer about what type coercions we will accept [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399126 (https://phabricator.wikimedia.org/T182000)
[19:59:42] <wikibugs>	 (03CR) 10Ottomata: [C: 032] JsonRefine - Be nicer about what type coercions we will accept [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399126 (https://phabricator.wikimedia.org/T182000) (owner: 10Ottomata)
[20:00:31] <wikibugs>	 (03PS1) 10Ottomata: Bump changelog to v0.0.56 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399458
[20:01:24] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Bump changelog to v0.0.56 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/399458 (owner: 10Ottomata)
[20:02:48] <joal>	 fdans: still here?n
[20:03:50] <fdans>	 joal: yeshh
[20:03:55] <joal>	 sorry for late ping
[20:04:02] <joal>	 I think i have an idea about the job failure
[20:04:15] <fdans>	 I'm all ears joal  :)
[20:04:51] <joal>	 I reviewed more in detail the loading conf, and found there was a leftover from previous job that probably leads to failure: constant_output_granularity_field = granularity
[20:05:00] <joal>	 in coord_....properties file
[20:06:10] <fdans>	 joal: ohh, I guess that gets added as a column?
[20:06:14] <ottomata>	 joal:  any idea why refinery-source builds need to download older cdh versions from archiva?  e.g. cdh5.7.0
[20:06:15] <ottomata>	 ?
[20:06:18] <joal>	 I think so yes
[20:06:33] <fdans>	 I'm going to try erasing that property
[20:06:35] <fdans>	 nice catch joal 
[20:06:46] <joal>	 ottomata: hm - I dont know - camus maybe?
[20:06:52] <ottomata>	 hm
[20:07:08] <joal>	 fdans: Not sure if you've seen the other comment about using projectview instead of pageview
[20:07:15] <joal>	 it';ll make the jobs consume a lot less
[20:07:27] <fdans>	 joal: already applied that :D
[20:07:32] <joal>	 wow
[20:07:35] <joal>	 fdans: You're fast !
[20:07:37] <fdans>	 2 days cpu time => 20min
[20:07:41] <joal>	 :D
[20:07:59] <joal>	 I'm so glad I saw that - And also so sorry not to have caught it earlier
[20:08:03] <fdans>	 joal: I was praising that idea during standup :D
[20:08:07] <joal>	 would have made tests way easier :)
[20:08:26] <fdans>	 the only thing was using a UDF to convert ISO codes to country names
[20:08:39] <fdans>	 but the rest was pretty straightforward
[20:09:11] <joal>	 fdans: no country names in project view ? Rooooh - :(
[20:09:24] <joal>	 cool
[20:09:54] <joal>	 fdans: same patch or a different one? Looks like I can't see new patches after my comment (patch18) onthe original CR_
[20:10:16] <fdans>	 joal: haven't pushed yet, sorry
[20:10:36] <joal>	 fdans: No worries :)
[20:10:57] <joal>	 fdans: If the fix works about the line removal, well, you're good to try :)
[20:11:03] <wikibugs>	 (03PS19) 10Fdans: Add top by country pageviews oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[20:12:17] <fdans>	 testing job...
[20:20:51] <joal>	 Crap, it failed fdans :(
[20:21:17] <fdans>	 yep, dammit
[20:22:12] <joal>	 Same exact problem fdans - Ir actually didn't even manage to write a signle line
[20:24:44] <fdans>	 joal:in workflow there's still the constant_output_granularity_value property
[20:25:22] <joal>	 fdans: Mwarf :(
[20:25:46] <joal>	 fdans: I actually think my catch was incorrect
[20:27:38] <fdans>	 joal: the query is definitely not returning anything strange
[20:40:40] <fdans>	 joal: one clue is that this job was working fine before adding 'access'
[20:40:57] <joal>	 ItAh, this is interesting :)
[20:41:07] <joal>	 Thanks  fdans for the hint )
[20:41:35] <joal>	 fdans: this means the granularity field I suggested to remove has no impact
[20:41:38] <joal>	 hm
[20:46:19] <joal>	 fdans: I have an idea
[20:46:41] <fdans>	 I'm excited
[20:46:58] <joal>	 How about user permissions on the new keyspace I just created?
[20:47:34] <fdans>	 joal: want me to restart the job?
[20:47:42] <joal>	 fdans: not yet, checking
[20:50:15] <joal>	 fdans: that's not it :S
[20:51:15] <fdans>	 dammit
[20:59:56] <joal>	 rhmmm
[21:05:38] * fdans hmmmmm intensifies
[21:06:35] <wikibugs>	 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#1036629 (10Ottomata) If we aren't going to do this soon, @Milimetric can you re-enable the home-grown script you had for this somewhere (stat1005?).  We don't have the databases from the last...
[21:09:30] <joal>	 fdans: Do you have the last master of the things?
[21:10:00] <joal>	 I noticed some props missing and wondered about that: cassandra_write_consistency for instance
[21:10:55] <fdans>	 ooooohhhh lemme try that
[21:13:23] <joal>	 fdans: it's weird, I think oozie should fail the job if this one is not sret
[21:14:28] <joal>	 fdans: I have no more clues
[21:14:54] <joal>	 fdans: I think I'm gonna try your patch tommorow - kick it a bit if ti deoesn't cooperate
[21:16:16] <fdans>	 joal: makes sense, have a good evening!
[21:18:11] <joal>	 fdans: still posting another list of stuff (no rush, just to save them)
[21:18:33] <wikibugs>	 (03CR) 10Joal: "Another list of small things in cassandra config" (038 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521) (owner: 10Fdans)
[21:19:56] <wikibugs>	 (03PS20) 10Fdans: Add top by country pageviews oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[21:24:28] <nuria_>	 milimetric: I think yesterday the wrong layout made it to reportcard
[21:24:31] <nuria_>	 milimetric: https://analytics.wikimedia.org/dashboards/reportcard/#empty
[21:25:03] <milimetric>	 nuria_: I think that's just the #empty you have in the URL: https://analytics.wikimedia.org/dashboards/reportcard/
[21:25:07] <milimetric>	 (that works for me)
[21:25:14] <milimetric>	 reportcard is on the tabs layout
[21:25:36] <fdans>	 joal: 
[21:25:45] <fdans>	 ADD JAR ${refinery_cassandra_jar_path};
[21:25:49] <nuria_>	 milimetric: if you reload you see vital-signs layout though
[21:25:52] <fdans>	 but where do the parameters go?
[21:25:53] <nuria_>	 milimetric: right?
[21:26:07] <nuria_>	 fdans: parameters to udf?
[21:26:19] <joal>	 fdans: should actually be ${refinery_hive_jar_path} I think :)
[21:26:23] <milimetric>	 nuria_: what?! the hell...
[21:26:29] <milimetric>	 that makes no sense
[21:27:04] <fdans>	 joal: and if I specify that property in the .properties file that should be enough?
[21:27:23] <joal>	 fdans: the parameters needs to be given to the hive action in workflow, therefore passed from props --> bundle --> coord --> workflow
[21:27:45] <milimetric>	 ok, nuria_, yeah, I messed up the deploy somehow, don't know how
[21:27:46] <milimetric>	 will fix
[21:27:47] <joal>	 fdans: we usually specify it everywhere, but having it in .properties file is enough
[21:27:51] <nuria_>	 milimetric: k
[21:28:08] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats:  ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3852592 (10Nuria)
[21:28:26] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3850073 (10Nuria) Reportcard deployment broke, fixing it as we speak.
[21:28:30] <joal>	 fdans: actually we specify it everywhere for coherence, since it started like that - For conciseness, we could not specify it :)
[21:29:06] <wikibugs>	 (03PS1) 10Milimetric: Fix reportcard deploy [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/399472
[21:29:16] <wikibugs>	 (03CR) 10Milimetric: [V: 032 C: 032] Fix reportcard deploy [analytics/analytics.wikimedia.org] - 10https://gerrit.wikimedia.org/r/399472 (owner: 10Milimetric)
[21:29:27] <milimetric>	 k, nuria_ should be fixed on next puppet run: ^
[21:30:30] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3852595 (10Nuria)
[21:30:50] <wikibugs>	 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#3852598 (10Milimetric) aha, yeah, the script was running but it's broken and it was grounding its errors.  I'll try to fix and we should talk about it tomorrow.
[21:31:37] <wikibugs>	 (03PS21) 10Fdans: Add top by country pageviews oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[21:32:07] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2.0. - https://phabricator.wikimedia.org/T130256#3852602 (10Nuria)
[21:32:09] <wikibugs>	 10Analytics-Kanban: Initial Launch of new  Wikistats 2.0  website - https://phabricator.wikimedia.org/T160370#3852601 (10Nuria) 05Open>03Resolved
[21:32:37] <wikibugs>	 10Analytics-Kanban: Provide breakdown of pageviews per country per year for all timeperiod available - https://phabricator.wikimedia.org/T181751#3852603 (10Nuria) 05Open>03Resolved
[21:32:38] <milimetric>	 ottomata: is it possible that the .git folder didn't copy over in the stat1002 -> stat1005 migration?
[21:32:42] <milimetric>	 that seems to be the problem, it's gone
[21:33:09] <wikibugs>	 10Analytics-Kanban, 10Page-Previews, 10Readers-Web-Backlog, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), and 3 others: Popups timestamp field contains multiple types - https://phabricator.wikimedia.org/T182000#3852604 (10Ottomata)
[21:33:15] <milimetric>	 looks like it happened sometime in late July
[21:34:40] <nuria_>	 fdans: this might be of help: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie#Run_oozie_job_overriding_what_pertains
[21:36:04] <wikibugs>	 (03PS22) 10Fdans: Add top by country pageviews oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521)
[21:37:01] <wikibugs>	 10Analytics-Kanban, 10Page-Previews, 10Readers-Web-Backlog, 10MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), and 3 others: Popups timestamp field contains multiple types - https://phabricator.wikimedia.org/T182000#3852630 (10Ottomata) OK!  Phew!  https://gerrit.wikimedia.org/r/#/c/399126/ s...
[21:38:08] <wikibugs>	 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#3852634 (10Milimetric) ok, so the .git folder got lost in the migration from stat1002.  It of course included old versions of the GeoIP database, the only thing in that directory is the latest...
[21:38:16] <wikibugs>	 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#3852635 (10Ottomata) (I keep reading this ticket title as "HANDS OFF of Christian's MaxMind geolocation databases repository".  KEEP YOUR GRUBBY MITTS OFF OF MY DATABASES
[21:41:15] <joal>	 Gone for tonight team- See you tomorrow
[21:41:25] <fdans>	 night joal !
[21:46:39] <wikibugs>	 10Analytics: Hand off of Christian's MaxMind geolocation databases repository - https://phabricator.wikimedia.org/T89453#3852673 (10Milimetric) Ok, crisis averted, nuria pointed out the files were at /srv/stat1002-a/user_dirs_from_stat1002/milimetric/GeoIP-toolbox/MaxMind-database so I moved that folder into my...
[21:48:34] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3852677 (10Nuria) Report card working: https://analytics.wikimedia.org/dashboards/reportcard/#new-editors
[21:48:40] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3852678 (10Nuria) 05Open>03Resolved
[21:51:24] <wikibugs>	 10Analytics, 10Analytics-Wikistats: New page stats are inaccurate for fawiki - https://phabricator.wikimedia.org/T183208#3846927 (10Nuria) Adressed the questions on FAQ, will be closing ticket as the task regarding docs are filed. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats/Metrics/FAQ
[21:51:31] <wikibugs>	 10Analytics, 10Analytics-Wikistats: New page stats are inaccurate for fawiki - https://phabricator.wikimedia.org/T183208#3852705 (10Nuria) 05Open>03Resolved
[21:54:52] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3849363 (10Nuria) Can you explain a bit more why do you need to account extended? I understand that you need to run some reports and that process broken on December 1st, give that seems like ac...
[21:57:10] <wikibugs>	 10Analytics, 10Analytics-Wikistats: Link to 'more info' doesn't always work - https://phabricator.wikimedia.org/T183188#3852709 (10Nuria) ping @JAllemandou  what is the wiki place for this metric to link to? Please be so kind as to create a new page if one does not exists
[21:57:39] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Link to 'more info' doesn't always work - https://phabricator.wikimedia.org/T183188#3852711 (10Nuria)
[22:03:56] <fdans>	 nuria_: does this say anything to you?
[22:03:58] <fdans>	 https://www.irccloud.com/pastebin/JIc00AnB/
[22:04:48] <nuria_>	 fdans: ya, those dirs cannot be process cause  SUCESS flag does not exists
[22:05:06] <nuria_>	 _SUCESS is really  a file named success
[22:05:34] <nuria_>	 and  its presence tells oozie that dir has data reday for another job to use
[22:05:50] <nuria_>	 fdans: let me see
[22:05:51] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3852764 (10Jdcc-berkman) The other concern is that the output from these reports is supposed to be made publicly available by WMF. That's been agreed to in principle, but the process has not be...
[22:06:47] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3852766 (10Nuria) Who in WMF is taking the task of making these reports public to add them to ticket?
[22:07:50] <nuria_>	 fdans: but SUCESS is there
[22:07:56] <nuria_>	 https://www.irccloud.com/pastebin/hXK6O3nn/
[22:08:22] <nuria_>	 fdans: maybe paths are specified wrong?
[22:09:09] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3852767 (10Jdcc-berkman) Stephen LaPorte. Sorry, I don't know his username.
[22:09:41] <fdans>	 nothing wrong with those permissions right nuria_ ?
[22:10:09] <nuria_>	 fdans: no if you are running as hdfs user
[22:10:30] <nuria_>	 fdans: everybody else shoudl be able to read
[22:26:48] <fdans>	 nuria_: what do you mean by paths specified wrong?
[22:33:45] <wikibugs>	 10Analytics, 10Analytics-Cluster: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291#3849363 (10Nuria) @Slaporte: can you advice on progress of this project  and requirements for data access?
[22:46:51] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: Please add download option 'as csv file' to Wikistats 2 - https://phabricator.wikimedia.org/T183192#3852849 (10Nuria) a:05Milimetric>03Nuria
[23:46:30] <wikibugs>	 10Analytics-Kanban, 10Analytics-Wikistats: ReportCard does not work (at all) - https://phabricator.wikimedia.org/T183321#3852942 (10jmatazzoni) Brilliant. Thank you! It looks different from what I saw before; is this new or was that a regression?