[06:32:41] (03CR) 10Elukey: [C: 031] Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 (owner: 10Mforns)
[07:26:42] morning :)
[07:26:50] I am gathering logs in https://etherpad.wikimedia.org/p/analytics-namenode-down-26042018
[07:26:50] o/
[07:27:08] elukey: please let me know when you want me to fake helping :)
[07:28:21] joal: whenever you want, you know more than me about hadoop HA :)
[07:28:37] elukey: not that sure
[07:30:10] so let's start with what we know
[07:30:23] I think this is our main reference:
[07:30:25] 2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.36.128:8485, 10.64.53.14:8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961))
[07:30:46] at this point, the namenode on an1001 declared defeat and started to shutdown
[07:31:09] elukey: yessir - journalnode quorum unreachable, stopping
[07:31:33] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.36.128:8485
[07:31:37] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.53.14:8485
[07:32:08] these are 28 and 35
[07:32:37] so I'd say that there are two possible root causes:
[07:33:02] 1) two out of three journal nodes overloaded and responding after 20s to the namenode
[07:33:08] I don't see those lines in the etherpad elukey :(
[07:33:09] 2) the namenode overloaded
[07:33:23] joal: line 61/62
[07:33:26] Ah yes
[07:33:36] Was looking for them in journalnodes logs
[07:33:38] ok thanks
[07:34:22] ah no sorry I should have mentioned an1001 :)
[07:35:25] one thing worth mentioning about 2) is this
[07:35:26] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms
[07:35:39] happened basically at the same time as the timeout
[07:35:48] yes
[07:35:51] and it is exactly 20s
[07:36:00] (more in reality)
[07:36:08] Now what I don't understand is: which of those guys took 20s for a GC?
[07:36:18] the namenode :(
[07:36:31] line 60 in the etherpad
[07:36:44] basically an1001's log says
[07:36:57] "hey the journalnodes are taking more than 20s to respond)
[07:37:17] and btw, the GC ran for 20s"
[07:37:23] :)
[07:37:53] elukey: I think it should have said: Hey journalnode guys, why are you not answering ... Or is it me not listening?
[07:38:19] yeah :D
[07:39:24] joal: I also took a look at the load metrics of the journal nodes, indeed there was an increase, but judging from the past 7d of data it seems "inline" with what they do
[07:39:29] (correct me if I am wrong)
[07:39:58] elukey: looking at the namenode charts in grafana, interesting one: https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=41&fullscreen&orgId=1&from=1524756345000&to=1524765112946
[07:40:15] yes that one too
[07:40:46] elukey: can you remind me our time-setting about fixing under-replicated blocks? How long do we wait?
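The JvmPauseMonitor warning quoted at 07:35:26 comes from a simple drift check inside the JVM, as the Hadoop docs quoted later at 10:00 describe. A minimal Python sketch of the same idea, purely illustrative and with made-up interval and threshold values:

```python
# Sketch of the JvmPauseMonitor idea (not the Hadoop code): sleep for a short,
# fixed interval and measure how late the loop wakes up. A large drift means the
# process (or the whole host) was paused, e.g. by a long stop-the-world GC.
import time

SLEEP_SECONDS = 0.5     # hypothetical check interval
WARN_SECONDS = 10.0     # hypothetical pause threshold

def pause_monitor():
    while True:
        before = time.monotonic()
        time.sleep(SLEEP_SECONDS)
        drift = time.monotonic() - before - SLEEP_SECONDS
        if drift > WARN_SECONDS:
            print("Detected pause of approximately %dms" % int(drift * 1000))
```

A ~20.5s pause reported this way, lining up with the ~20.5s "slow" journal writes, is what points at the namenode itself being stalled rather than the journalnodes being slow.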
[07:41:18] not sure about this one (missing knowledge probably)
[07:43:27] elukey: IIRC correctly, HDFS waits before fixing underreplicated blocks (in case for instance a missing node gets back up)
[07:46:23] ah with fixing you mean replicating them to another worker
[07:46:28] yes yes now I get it
[07:46:29] Correct
[07:46:38] not sure about how long it waits, good question
[07:47:39] I am adding graphs for journal nodes to grafana
[07:47:51] for the moment, none of them indicates trouble with journal nodes
[07:48:12] elukey: I'm pretty sure the trouble was not with journalnodes but with namenode actually
[07:51:14] yeah me too
[07:51:26] buuut adding more metrics is good :)
[07:51:41] in the beginning I thought that the new prometheus agent was causing troubles
[07:51:42] it's always good :)
[07:51:50] but I triple checked and it seems super fast
[07:51:50] Thanks for that elukey
[07:55:57] (03PS1) 10Joal: Correct webrequest dataloss false-positive script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380
[07:56:23] elukey: when you have a minute of brainpower, I'm interested in your comments on --^ (mostly the commit message)
[07:59:19] (03CR) 10Elukey: [C: 031] "thanks!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380 (owner: 10Joal)
[08:01:46] elukey: Thanks for having read my prose :)
[08:03:11] it makes a lot of sense!
[08:09:15] joal: added new metrics (sync latency, etc..)
[08:09:20] nothing that I can see
[08:09:21] elukey: awesome :)
[08:09:32] https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Metrics.html#JournalNode
[08:09:43] elukey: looking at NN metrics in a bit more detail - See that: https://grafana-admin.wikimedia.org/dashboard/db/analytics-hadoop?panelId=25&fullscreen&orgId=1&from=1524761996000&to=1524765578000
[08:09:45] do you have any idea why they give 3 aggregations for the sync latency?
[08:10:18] I don't even know what that actually means :S
[08:11:13] I think that the sync latency is how much time it takes for the journal node to fsync the edit log to disk
[08:11:19] elukey: I wonder if the NN has not been overwhelmed catching up from a node recovery at the error time
[08:12:43] but what does 3 aggregations mean in that case?
[08:13:14] I have no idea
[08:13:54] I plotted the 1min granularity
[08:14:03] that is the most interesting one I think
[08:20:29] also joal, something that I don't get - why the under replicated blocks dropped when the namenode went down?
[08:20:46] and then they didn't go up again
[08:21:14] plus GC timings for the namenode are horrible
[08:21:26] elukey: +1 on horrible GC times
[08:21:35] I am wondering if we'd need to test something like G1
[08:21:43] very possible
[08:22:48] elukey: have you added the numa metrics on an100[123]?
[08:22:55] nope
[08:22:57] elukey: I was having a wonder :)
[08:25:04] elukey: I think the pending-replication blocks moved to corrupt blocks on NN restart
[08:25:54] And actually corrupt blocks is still high on an1002
[08:26:20] ack it makes sense
[08:26:40] https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_command-line-installation/content/configuring-namenode-heap-size.html
[08:27:02] for our range (20-30) they suggest Xmx 16332m
[08:28:32] and also
[08:28:32] https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_hdfs-administration/content/ch_g1gc_garbage_collector_tech_preview.html
[08:30:49] elukey: growing the Xmx for the namenodes doesn't seem a bad idea to me
[08:31:27] the last jump was from 4 to 6, but GC timings are still not great
[08:31:33] agreed
[08:31:56] growing the heap size of course might exacerbate the issue if it is pathological, but it could also give some relief to CMS
[08:32:15] I am curious about the split between young and old gen
[08:34:41] wow we have also the pool metrics!
[08:35:57] * elukey adds graphs
[08:39:52] added them for the namenode!
[08:39:54] (s)
[08:41:16] oldgen ~4.3G, Eden ~1.2
[08:42:10] oldgen seems pretty stable, so I am wondering if raising Xmx would also mean more things in there to be promoted
[08:42:25] but it might also mean higher timings for full collections
[08:56:10] joal: if you are ok I'd reimage an1051 and an1053 (not 52, journal node)
[08:56:19] works for me elukey :)
[08:56:23] ack :)
[08:56:40] elukey: It'll be interesting to check NN behavior when you put nodes as active
[09:13:13] joal: I think I found a match for the underreplicated blocks
[09:13:37] the two increases match with the two batches of worker reimages done yesterday
[09:13:45] (4 hosts in total)
[09:14:13] so maybe we could be less aggressive in reimaging
[09:19:25] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4163367 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1051.eqiad.wmnet', 'an...
[09:31:05] 10Analytics: Varnishkafka does not play well with varnish 5.2 - https://phabricator.wikimedia.org/T177647#4163404 (10R4q3NWnUx2CEhVyr) I am trying to port it right now, and as always with software I discover new stuff as I go along. Need to review argument parsing (luckily you decided to be 1-1 compatible with v...
[10:00:02] So the hadoop JvmPauseMonitor docs say:
[10:00:02] Class which sets up a simple thread which runs in a loop sleeping for a short interval of time. If the sleep takes significantly longer than its target time, it implies that the JVM or host machine has paused processing, which may cause other problems. If such a pause is detected, the thread logs a message.
[10:04:19] so I am pretty sure that the outage was due to GC spending 20+s in collection, causing the timeout to fire even if everything was ok
[10:04:49] since we suffer at the moment from long GC pauses, I think that we should raise the timeout
[10:04:53] to something like a minute
[10:05:23] and then investigate if/how we can improve our GC timings
[10:15:01] elukey: 'the timeout' being the Journal-node timeout, right?
[10:16:36] yep
[10:19:34] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4163513 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1053.eqiad.wmnet', 'analytics1051.eqiad.wmnet'] ``` and were **ALL** su...
[10:20:10] two new workers ready :)
[10:20:57] joal: look at the namenode patterns https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&from=now-3h&to=now
[10:21:02] exactly what we were seeing
[10:21:18] also GC time went up https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=26&fullscreen&from=now-1h&to=now
[10:21:53] elukey: Exactly
[10:22:28] elukey: It's a bit unexpected though that there is a bump in GC count similar at 8:20 while the reimage happens at 9:20 ...
[10:22:31] Weird
[10:23:27] but the important bit now is gc time
[10:23:55] elukey: something else: I wonder about the corrupted-blocks absorption: how does that happen ?
[10:24:13] elukey: leaving for 1/2h more or less
[10:25:00] I am going out for an errand + lunch too, atm things look fine
[10:25:28] not sure about the corrupted blocks, but the less problematic ones might be temporary ? Not sure if there is an auto fsck or similar
[10:25:34] * elukey takes notes to investigate
[10:39:11] 10Analytics, 10EventBus, 10Wikimedia-Logstash, 10Services (watching): EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230#4163559 (10mobrovac) p:05Triage>03High
[10:55:50] 10Analytics, 10Analytics-Kanban: Add a --dry-run option to the sqoop script - https://phabricator.wikimedia.org/T188556#4163595 (10fdans) a:03fdans
[12:30:38] (03PS1) 10Fdans: Add --dry-run parameter to mediawiki sqoop script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556)
[12:43:42] 10Analytics, 10EventBus, 10Operations: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163743 (10elukey) p:05Triage>03High
[12:43:52] sigh --^
[12:45:56] * fdans hugs elukey
[12:46:28] 10Analytics, 10EventBus, 10Operations, 10Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163776 (10elukey)
[12:49:16] 10Analytics, 10EventBus, 10Operations, 10Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163779 (10Imarlier) @elukey yes, makes sense - I'll fix in a little bit. Sorry for the noise!
[12:49:29] 10Analytics, 10EventBus, 10Operations, 10Performance-Team: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163780 (10Imarlier) a:05elukey>03Imarlier
[12:57:30] (03PS1) 10Elukey: Add the possibility to specify the Kafka API version to KafkaConsumer [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429412 (https://phabricator.wikimedia.org/T193238)
[12:59:02] 10Analytics, 10EventBus, 10Operations, 10Performance-Team, 10Patch-For-Review: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4163799 (10elukey) Ah nice! I sent a code review as attempt to fix this, but I can abandon it if you have something ready, no problem!
[12:59:59] * elukey thanks fdans
[13:47:52] (03CR) 10Joal: [C: 04-1] "Not ready IMO for various reasons, see comments inline." (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[14:04:35] (03CR) 10Mforns: [C: 032] Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 (owner: 10Mforns)
[14:11:02] (03Merged) 10jenkins-bot: Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 (owner: 10Mforns)
[14:16:51] joal: I found another interesting thing
[14:16:57] ?
[14:17:21] so on analytics1001 we run, for the namenode daemon, PS MarkSweep as old gen algorightm
[14:17:26] hahaha sorry
[14:17:30] algorithm
[14:17:58] but that one is not the suggested one, namely concurrent mark and sweep
[14:18:14] (it is UseParallelOldGC)
[14:18:57] CMS is probably more suitable for our use case since it tries to reduce the total amount of time of stop of the world GC pauses
[14:20:07] elukey: Indeed good catch !@
[14:40:56] so https://www.cloudera.com/documentation/enterprise/5-10-x/topics/admin_nn_memory_config.html
[14:41:22] it says to use -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
[14:43:46] I am a bit unsure about the last two
[14:48:06] (03PS1) 10Mforns: Modify output defaults for EventLoggingSanitization.scala [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429427 (https://phabricator.wikimedia.org/T190202)
[15:02:26] (03PS2) 10Fdans: Add --dry-run parameter to mediawiki sqoop script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556)
[15:02:40] (03CR) 10Fdans: Add --dry-run parameter to mediawiki sqoop script (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[15:03:15] elukey: I don't know about the last two either, but the first two make sense
[15:04:26] (03CR) 10Nuria: "Let's make sure this gets throughly tested before we merge, please." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[15:04:43] "A concurrent collection also starts if the occupancy of the tenured generation exceeds an initiating occupancy (a percentage of the tenured generation). The default value for this initiating occupancy threshold is approximately 92%, but the value is subject to change from release to release. This value can be manually adjusted using the command-line option -XX:CMSInitiatingOccupancyFraction=<percent
(03CR) 10Joal: [C: 031] "Super better :) Let's wait for another review, but LGTM :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[15:04:49] >, where <percent> is an integral percentage (0 to 100) of the tenured generation size."
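The occupancy fraction in the quote above is simply a percentage of the tenured (old) generation. A quick sketch of the arithmetic, with a made-up old-gen size rather than the namenode's real one:

```python
# What -XX:CMSInitiatingOccupancyFraction=70 means per the Oracle doc quoted
# above: a concurrent cycle starts once old-gen occupancy crosses 70% of its
# capacity. The 8 GB capacity here is only an example value.
tenured_capacity_gb = 8.0
fraction = 70
print("CMS cycle starts near %.1f GB of old gen used"
      % (tenured_capacity_gb * fraction / 100))   # -> 5.6 GB
```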
[15:04:52] joal: --^
[15:05:15] so 70 seems good
[15:05:30] (I am reading https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/cms.html)
[15:05:38] but +CMSParallelRemarkEnabled is not mentioned
[15:05:39] elukey: My understanding of that is that it tries to GC more often
[15:06:39] joal: it may lead to more GC runs but it also avoids starting when the heap is almost full
[15:06:52] correct elukey
[15:09:40] joal: ah the last one seems to add concurrent steps to UseParNewGC
[15:09:51] yes elukey, I was reading on this
[15:10:05] The thing I read says it depends on workloads
[15:10:50] I would follow Cloudera's suggestions though, everything seems reasonable
[15:11:02] Works for me elukey
[15:11:31] maybe we could apply it to the standby namenode first, let it bake for a day and then apply it on 1001
[15:11:37] not now of course :D
[15:11:56] elukey: Let's do that on monday\/
[15:12:28] https://gerrit.wikimedia.org/r/#/c/429429/ is ready :)
[15:12:41] probably I'd also need to open a task to track this work
[15:14:14] (03PS1) 10Imarlier: statsv: Hardcode kafka api version [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238)
[15:15:55] (03CR) 10Elukey: "Ian, what do you think about https://gerrit.wikimedia.org/r/#/c/429412? So we could simply add a parameter to the systemd unit, and remove" [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:16:13] (03CR) 10Imarlier: "Elukey: Nothing wrong with the way that you solved this, just doesn't seem to be a reason to even have the command line option." [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:20:40] (03CR) 10Imarlier: "> Ian, what do you think about https://gerrit.wikimedia.org/r/#/c/429412?" [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:20:46] (03CR) 10Elukey: "I disagree, I think that it ensures that statsv can run on multiple versions of Kafka, but I don't have a strong opinion on this :)" [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:29:55] (03CR) 10Imarlier: "> I disagree, I think that it ensures that statsv can run on multiple" [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:36:45] (03CR) 10Elukey: [C: 031] "Let's do it, I'd really like not to see icinga screaming over the weekend :)" [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:43:57] (03CR) 10Imarlier: [C: 032] statsv: Hardcode kafka api version [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[15:44:21] (03Abandoned) 10Elukey: Add the possibility to specify the Kafka API version to KafkaConsumer [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429412 (https://phabricator.wikimedia.org/T193238) (owner: 10Elukey)
[15:50:52] (03CR) 10Elukey: [C: 031] statsv: Hardcode kafka api version (031 comment) [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[16:01:51] a-team: standup?
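For context on the statsv patches above: with the kafka-python client, pinning the API version means passing api_version to the consumer so it does not try to probe and negotiate the broker version. A minimal sketch, assuming hypothetical broker and topic names and an illustrative (0, 9) version rather than the values actually deployed:

```python
# Minimal kafka-python consumer with a pinned API version (sketch, not the
# actual statsv code). Passing api_version skips broker version negotiation.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'statsv',                                    # hypothetical topic name
    bootstrap_servers='kafka1001.example:9092',  # hypothetical broker
    api_version=(0, 9),                          # pinned instead of auto-detected
)
for message in consumer:
    print(message.value)
```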
[16:09:04] (03CR) 10Imarlier: [V: 032 C: 032] statsv: Hardcode kafka api version [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[16:11:28] elukey: regarding https://gerrit.wikimedia.org/r/#/c/429432/, it looks like maybe Zuul doesn't know about the statsv repo, so it doesn't get auto-merged...
[16:11:31] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Operations, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4164320 (10elukey)
[16:11:58] I'll be honest, I don't really know what the right thing to do about that is. V+2 and then submit? Merge locally and push?
[16:15:15] marlier: o/
[16:15:27] yes I think that +2 + submit is fine
[16:15:44] did you see my comment about the code removed?
[16:15:48] was it intended?
[16:15:48] (03CR) 10Imarlier: [V: 032 C: 032] statsv: Hardcode kafka api version [analytics/statsv] - 10https://gerrit.wikimedia.org/r/429432 (https://phabricator.wikimedia.org/T193238) (owner: 10Imarlier)
[16:16:23] Yep, it was intentional
[16:16:44] That whole thing with forcing the encoding is bad
[16:17:00] Used to be very common, but it can actually break things.
[16:17:43] More than you wanted to know: https://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script :-)
[16:18:44] okok I was triple checking, it looked weird indeed
[16:18:47] :)
[16:20:01] elukey: deployed!
[16:21:25] nice!
[16:42:30] (03CR) 10Joal: [C: 04-1] "Forgot about the _SUCCESS file writing! Please excuse me!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[16:43:26] (03PS3) 10Fdans: Add --dry-run parameter to mediawiki sqoop script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556)
[16:44:50] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Operations, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4164358 (10elukey) 05Open>03Resolved Changes deployed by @Imarlier, everything looks good now! Thanks!
[16:48:24] (03CR) 10Fdans: [V: 031] "Tested in beta, this is the log generated with the --verbose flag:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429410 (https://phabricator.wikimedia.org/T188556) (owner: 10Fdans)
[17:01:17] 10Analytics, 10Analytics-Kanban: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257#4164399 (10elukey) p:05Triage>03Normal
[17:02:45] joal,nuria_ --^
[17:03:02] elukey: sorry, on meeting
[17:04:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257#4164431 (10elukey)
[17:05:04] oh yes it was only a pointer to what we discussed in standup :)
[17:06:07] Thanks for the thorough doc elukey :)
[17:06:49] joal: let me know if I missed anything and/or if it makes sense :)
[17:07:36] all right, I think I am done for the week :)
[17:07:39] have a good weekend people!
[17:07:52] Bye elukey
[17:50:40] o.
[17:50:41] o/
[18:12:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257#4164522 (10Ottomata) +1 to all, thanks Luca!
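The encoding removal discussed at 16:16-16:18 refers to the old Python 2 trick of flipping the global default encoding (the sys.setdefaultencoding('utf-8') pattern the Stack Overflow link above is about), which can mask bytes-vs-text bugs and change the behaviour of unrelated code. A small sketch of the safer alternative, explicit decoding at the boundaries; the byte string below is just an example, not the statsv payload format:

```python
# Instead of changing the process-wide default encoding, decode and encode
# explicitly where bytes enter and leave the program.
raw = b'caf\xc3\xa9'              # e.g. bytes read off the wire
text = raw.decode('utf-8')        # explicit, local conversion to text
payload = text.encode('utf-8')    # back to bytes only where bytes are required
print(text, len(payload))
```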
[18:18:00] (03CR) 10Ottomata: [C: 031] Modify output defaults for EventLoggingSanitization.scala [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429427 (https://phabricator.wikimedia.org/T190202) (owner: 10Mforns)
[18:19:01] (03CR) 10Ottomata: [C: 031] Correct webrequest dataloss false-positive script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380 (owner: 10Joal)
[18:38:35] (03PS8) 10Amire80: WIP Analyzing failed ULS searches [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/413270 (https://phabricator.wikimedia.org/T190630)
[18:38:35] Gone for tonight a-team - Will see you next week
[18:38:41] byeeeeeee
[18:45:37] (03CR) 10Nuria: "Can we please add these docs to our oncall page?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380 (owner: 10Joal)
[18:57:48] byeee
[19:00:43] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#4164647 (10Ottomata) a:03Ottomata
[19:01:07] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#3910345 (10Ottomata)
[19:01:23] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#3910345 (10Ottomata)
[19:11:27] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#4164681 (10Ottomata)
[19:13:22] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#3910345 (10Ottomata)
[19:53:28] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Update ua-parser package. Both uap-java and uap-core - https://phabricator.wikimedia.org/T192464#4164761 (10Nuria)
[20:02:15] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services: Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#4164774 (10Ottomata) p:05Triage>03Normal
[20:07:45] 10Analytics, 10Analytics-Kanban: Define battery of smoke tests to run by hand before realease - https://phabricator.wikimedia.org/T190837#4164792 (10Nuria) 05Open>03Resolved
[20:09:44] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4164804 (10Nuria) I have added a brief TL;DR I think it looks good, the thing missing is documenting the dataset table just like we do for all datasets so users can know what data it contains (columns, types.. etc) ping @Neil...
[20:20:39] (03CR) 10Divec: "Looks good! You could make the string handling simpler and more consistent by choosing to use either `str` strings or `unicode` strings th" [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/413270 (https://phabricator.wikimedia.org/T190630) (owner: 10Amire80)
[21:51:19] (03PS1) 10Nuria: [WIP] UA parser specification changes for OS version [analytics/ua-parser/uap-java] (wmf) - 10https://gerrit.wikimedia.org/r/429527 (https://phabricator.wikimedia.org/T189230)
[22:01:15] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4165026 (10Neil_P._Quinn_WMF) I've already used this data for a quick calculation (T192514) and it was quite helpful! >>! In T191343#4164804, @Nuria wrote: >The thing missing is documenting the dataset table just like we do f...
[22:21:46] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4165113 (10Nuria) >Analytics/Data Lake/Edits/Geowiki seems like the right place and, as I just discovered, there's already some documentation there! Indeed! >I suggest renaming whole dataset to something like "Geoeditors", whi...