[07:23:57] moritzm: gooood morning! Something very weird happened to us yesterday, namely after a restart of the Hadoop Namenode JVMs we got a different garbage collector behavior.. https://grafana.wikimedia.org/dashboard/db/analytics-hadoop (first two graphs) [07:24:28] I rolled back my change (that was a no-op extra Xmx value) but we are still seeing the behavior.. [07:24:52] let me have a look [07:24:54] I checked the new jdk/jre changelog but I didn't find a lot [07:25:58] we are not seeing any horrible regression but the heap size went down to ~2GB and the GC started to collect Young Gen (and also Old, but only once) [07:26:07] that is normal behavior for a JVM [07:26:21] but still it is a completely different behavior [07:26:56] and I checked the max heap size, it should be more than 3GB [07:27:21] it's limited to the namenodes? [07:28:08] sounds like a problem best investigated with some tea, back in five [07:28:13] yeah.. the change that I rolled out was related to a new Xmx value for the Datanodes (you can see that they have changed in the graphs), and they were restarted [07:28:17] sure! [07:28:33] (I keep writing) [07:29:09] so basically the Cloudera distribution does something like this: you set environment variables, like HADOOP_HEAPSIZE, and some daemons will pick them up [07:30:00] I've set HADOOP_HEAPSIZE to 2048, and both datanodes and namenodes picked it up (-Xmx2048m). The namenode also uses other env variables to append JVM parameters, one of them adds -Xmx4096m [07:30:17] which, since it comes last, wins over the first (I checked with a simple test on analytics1001) [07:31:04] after I restarted all the JVMs I saw the weird namenode behavior and I thought that the new Xmx setting was somehow capping the heap size to 2GB [07:31:07] so I rolled back [07:31:13] ran puppet [07:31:22] and restarted only the JVMs of the namenodes [07:31:40] (so the datanodes are running with the new config, since we have an OOM issue) [07:31:54] (our Java daemons do not restart on config changes) [07:33:32] but the namenodes' JVMs are still behaving in the same way [07:33:55] that is not super weird, but it changed radically [07:38:04] I double-checked openjdk software changes, openjdk-7 was updated to 7u95 on the 9th of May and we did restart Hadoop after that, so it's not related to an update of java per se. however, [07:38:57] I remember that Oracle made JDK-related changes in openjdk-8, maybe these got backported to openjdk-7 with the update in May and recent Cloudera config changes now make use of that [07:39:07] I'll look whether I can find what I vaguely remember [07:41:12] sure, thanks! but don't waste too much time, just wanted your opinion to rule out quick paths of investigation.. [07:47:34] mhh, what I had in mind only provides new options for GC OOM handling in java 8 and I couldn't find that in the last java 7 update [07:47:53] this might rather be related to recent cdh changes, not sure [07:51:27] really strange, hopefully I'll come up with some plausible explanation :) [07:54:18] Hi elukey [07:54:37] o/ [07:55:42] which way did you set Xmx? docs say "Heap size specified via the HADOOP_CLIENT_OPTS -Xmx option overrides heap size specified via HADOOP_HEAPSIZE."
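A minimal way to reproduce the "simple test" mentioned above, assuming only that the HotSpot JVM honors the last -Xmx flag on the command line (the commands here are illustrative, not the ones actually run on analytics1001):

    # Ask the JVM which max heap it settles on when two -Xmx flags are passed;
    # MaxHeapSize is printed in bytes, so the second value (4096m) should show up.
    java -Xmx2048m -Xmx4096m -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w MaxHeapSize

    # And check what the running NameNode was actually started with:
    ps -ef | grep -i namenode | grep -o -- '-Xmx[0-9]*[mg]'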
[07:58:32] so we have set HADOOP_NAMENODE_OPTS to 4096 (so -Xmx4096m) a long time ago, meanwhile this time I've set HADOOP_HEAPSIZE to 2048 (that has been rolled back for namenodes) [07:59:08] but even now on analytics1001 I can see bla bla -Xmx1000m bla bla -Xmx4096m [08:00:14] hmm, probably best to reach out to the upstream mailing list I think [08:01:47] on a totally unrelated subject, aqs needs to be updated to the nodejs 4.4.6 security update at some point. it's already working fine for all of restbase, so I wouldn't expect any problems, but let's still do it in steps [08:02:15] moritzm: I'm on the go to test that :) [08:04:22] joal: ok :-) [08:10:48] joal: I can't find a good explanation for why the namenodes are behaving in this way [08:11:07] I mean, there is nothing alarming at the moment, the JVMs are behaving fine [08:11:28] regular minor GC collections and veeeery sporadic old gen ones [08:14:57] plus jconsole shows that the JVM can allocate ~3.7GB of heap size (not sure why it doesn't say 4G, but close enough) [08:31:22] Analytics, Services, cassandra, Patch-For-Review: Refactor the default cassandra monitoring into a separate class - https://phabricator.wikimedia.org/T137422#2436388 (elukey) Open>Resolved a:elukey Changes deployed successfully, thanks a lot @Nicko for all the work! [08:37:02] elukey: batcave for a brainstorm? [08:37:55] sure.. I've also upgraded node on aqs100[456] if you want to test it [08:38:13] elukey: that'll make it kind of easier :) [09:34:24] for the channel (cc moritzm): After a lot of debugging with joal we found a possible explanation, namely that ottomata was deleting tons of files right before I restarted the first namenode JVM (how lucky I am). We went from ~15M to ~6.2M files stored (https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=25&fullscreen) [09:35:11] a restart was probably needed to let the JVM flush out old garbage completely and start from a clean state [09:35:33] ah [09:35:51] it's still a good idea to blame Java first :-) [09:35:57] of course! [09:35:59] :P [09:37:00] joal: and Young Gen collection started exactly when the file number dropped, some minutes before the heap size drop. I think you were telling this to me earlier on but I was checking metrics and my brain can't do multi-tasking really well ;) [09:37:55] yes, definitely, since you were saying that there was a drop right before the major one in heap size [09:37:58] sorry :) [09:38:06] goooooood now I feel better [09:38:10] the cluster is healthy [09:38:15] rolling out again my change [09:39:03] :) [09:44:06] all right, puppet change done, will not force it since the datanodes haven't been restarted and they have the previous config [10:11:38] morning joal ! [10:11:52] Morning addshore :) [10:12:03] ready to pick up where we left off? :D [10:12:09] addshore: I finally had time to investigate a bit yesterday, I have some answers for you [10:12:15] oooh [10:13:34] addshore: Let me finish my aqs tests and I'll be with you more fully [10:13:38] okay! [10:17:09] elukey: eqi aqs1004 [10:17:13] oops:) [10:19:21] elukey: All my tests have been working for aqs on node 4.4.6 [10:19:52] elukey: the new cluster is healthy, my own tests are showing correct results [10:20:16] elukey: I think we are good to deploy on one of the three prod nodes (with your agreement of course) [10:21:13] sure, will do it after lunch.. maybe 1004 and then tomorrow 100[56]?
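A rough sketch, based on the description above, of how both -Xmx flags end up on the NameNode command line under the Cloudera-style env files (the exact file layout and variable wiring are an assumption; the values are the ones from this conversation):

    # e.g. in /etc/hadoop/conf/hadoop-env.sh (layout assumed)
    export HADOOP_HEAPSIZE=2048                                       # becomes the first -Xmx2048m
    export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -Xmx4096m"   # appended after it, so this -Xmx wins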
[10:21:37] hm, elukey, you mean 1001 and then 100[23] I guess [10:21:44] elukey: in that case, yes :D [10:23:15] addshore: So, yesterday your coordinator failed because of bad luck [10:23:21] haha! [10:23:28] addshore: you were testing at the same moment elukey was restarting nodes [10:23:37] okay! [10:23:37] joal: aaaaarrrgh yes [10:23:38] sorry [10:23:41] :D [10:23:45] brainfarting [10:23:48] it happens :) [10:24:15] now addshore, in finding that, I also found that there is another issue [10:25:03] addshore: We have not yet run any oozified spark job in dynamic allocation mode - You are paying that price [10:26:22] elukey: one confirmation please: the dynamic allocation flag set to true is defined for spark-shell and spark-submit, right? [10:27:54] addshore: I want to try to run your job with an explicit dynamic allocation setting (I think that's all it needs) [10:32:48] okay! *loads the docs page* [10:33:12] joal: is that -dynamic or something? ;) [10:33:54] addshore: a bit more complex, currently trying it :) spark.dynamicAllocation.enabled=true [10:42:08] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2436731 (mobrovac) >>! In T134502#2433938, @Ottomata wrote: > If it has meaning, I'm all for keeping it. In the current Ev... [10:49:52] elukey: question [10:50:52] elukey: how the heck do we have different hive conf on analytics nodes and stat nodes ???? [10:51:19] * elukey blames ottomata [10:51:21] addshore: discovering more weird things as I move forward with your patch ! [10:51:29] * elukey has no idea [10:51:29] elukey :D [10:51:50] elukey: I'd go for moritzm's way: always blame java first ;) [10:51:50] jokes aside, no idea.. [10:52:05] :D [10:52:05] I usually blame myself first, then Andrew [10:52:06] elukey: That's kind of a big issue [10:52:12] :P [10:52:44] elukey: let me explain: when spark runs in client mode, it works: it uses the hive-site file on the stat machine, fine [10:53:15] But in cluster mode, the spark application master can be on any node, and it uses the hive-site file over there [10:53:44] elukey: on analytics1044 for instance, the hive-site file still references analytics1015 as metastore :( [10:54:15] ah snap.. now I see [10:54:25] :( [10:54:48] * elukey looks into puppet [10:55:27] I hate analytics1015 [10:55:39] elukey: So do I [10:55:41] joal, path of hive-site? [10:56:27] elukey: I used both /etc/hive/conf.analytics-hadoop/hive-site.xml and /usr/lib/hive/conf/hive-site.xml [10:57:16] so there is no trace of 1015 in puppet, and the hive-site.xml.erb references a metastore_host that is defined only in puppet/hieradata/eqiad/cdh/hive.yaml (correctly as analytics1003) [10:57:26] so could it be that those files are old ones?
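A minimal sketch of what explicitly enabling dynamic allocation looks like on spark-submit for a YARN cluster-mode job; the class and jar names are placeholders, and spark.shuffle.service.enabled is included because dynamic allocation requires the external shuffle service:

    # class and jar names are placeholders; only the two --conf flags are the point
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --class org.example.SomeJob \
      some-job.jar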
[10:57:37] currently not versioned by puppet [10:58:02] elukey: possibly so [10:58:42] yes [10:58:55] so the hive-site.xml is defined only in hive.pp [10:59:00] file { "${config_directory}/hive-site.xml": [10:59:00] content => template($hive_site_template), [10:59:00] mode => $hive_site_mode, [10:59:00] owner => 'hive', [10:59:00] group => 'hive', [10:59:03] require => Package['hive'], [10:59:05] } [10:59:08] elukey: I absolutely don't know how the hive conf got copied [10:59:23] elukey: let's wait for Andrew and see [11:02:06] * elukey nods [11:02:20] addshore: in case you didn't follow --^ [11:02:46] :D [11:49:09] hey team :] [12:03:59] Hi mforns [12:04:51] HI joal [12:04:52] :] [12:05:37] mforns: so, checkpointing [12:06:31] joal, yes.. batcave? [12:06:37] mforns: sure, OMW [12:27:11] a-team: stepping afk for ~1.5hrs, will be back soon. [12:27:42] sure elukey [12:43:59] Hi ottomata [12:44:16] hiii [12:44:44] We have questions for you when elukey gets back :) [12:44:48] ottomata: --^ [12:47:43] ok! [12:47:44] i am ready [12:50:28] joal: just a random question unrelated to what we were doing before. But I'm guessing it only makes sense to make oozie jobs etc for data that is in hadoop? [12:50:45] addshore: correct : oozie relies on Hadoop to run [12:51:08] addshore: For regular queries on Mysql for instance, we use mforns' awesome ReportUpdater [12:51:24] ooooh, are there docs for that? [12:51:49] addshore: I bet there are, but I never used it (/shame on me/) [12:51:49] as I am doing a bunch of stuff that just uses db queries / dump scans / api queries right now running daily / weekly etc [12:51:54] addshore, https://wikitech.wikimedia.org/wiki/Analytics/Reportupdater :] [12:51:55] hehe [12:52:12] mforns: is this just for db related things really? [12:52:44] ahh yes, that's the thing that makes the tsv files and stuff :) [12:53:15] addshore, reportupdater periodically executes sql queries or scripts that generate sql-like output [12:53:31] addshore, and generates those report files, yes [12:54:33] okay! I need to try and take another look at https://gerrit.wikimedia.org/r/#/c/269467/ (may try and get it in my next sprint) [13:40:23] * elukey backkk [13:40:33] Yay, was waiting for you elukey :) [13:40:44] :) [13:41:04] ottomata: hellooooo [13:41:43] HIII [13:44:57] good news: we found what happened yesterday [13:45:17] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop [13:45:35] I added HDFS total files [13:46:28] oh nice :) [13:46:28] hah [13:46:30] yeah [13:46:31] makes sense [13:46:31] so when we deleted files the HDFS namenodes said YESSSS [13:46:45] but we did two things at the same time, aligning them perfectly [13:46:47] ahhhh [13:46:48] i see [13:46:50] WHAT ARE THE CHANCEEESSSS [13:46:51] so mem dropped [13:46:54] makes a lot of sense [13:46:54] yeah [13:46:57] nice [13:47:03] but only when I restarted the JVM [13:47:07] ottomata: log cleanup is actually good ;) [13:47:11] yes!! [13:47:12] yeah [13:47:16] nuria found the proper setting too [13:47:21] hadoop will take care of removing old aggregated log files [13:47:21] :) [13:47:27] so all good, I rolled out the change again [13:47:32] \o/ [13:47:34] ottomata: Wow, great ! [13:47:41] ottomata: What is the setting? [13:47:54] https://gerrit.wikimedia.org/r/#/c/297643/2/templates/hadoop/yarn-site.xml.erb [13:48:41] ottomata: great, thx :) [13:49:38] ottomata: other thing - joal discovered that hive-site.xml on the hadoop workers still references analytics1015 as hive metastore.. [13:50:29] HM [13:51:08] weird [13:51:09] ja?
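A quick way to confirm which metastore a given node's hive-site.xml still points at, assuming only the standard hive.metastore.uris property and the two paths quoted above:

    # Run on a worker (e.g. analytics1044) and on a stat box to compare:
    grep -A1 'hive.metastore.uris' /etc/hive/conf.analytics-hadoop/hive-site.xml
    grep -A1 'hive.metastore.uris' /usr/lib/hive/conf/hive-site.xml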
[13:51:10] hm [13:51:19] maybe hive-site.xml isn't rendered there by puppet anymore [13:52:20] yeah, or maybe it used to be and now it isn't? [13:52:29] yes, makes sense [13:52:32] looking made me remember [13:52:39] we used to include the hive client class on all workers [13:52:47] only so we could have hcatalog there [13:52:54] but, then we realized we could just require the hcatalog package [13:52:56] and not hive [13:53:03] so we should be able to remove /etc/hive [13:53:09] maybe purge hive client [13:53:21] I think joal needs it [13:53:27] ottomata: actually we need it if we want to use spark hiveContext [13:53:43] hm [13:53:47] ottomata: in shell or client mode, it uses the statX hive-site.xml [13:54:02] but in cluster mode it needs it on the workers? [13:54:06] this sounds familiar... [13:54:10] but when launched by oozie, in cluster mode, the app-master is on any node, and looks for the file [13:54:17] right [13:54:21] joal, back, do we have something more to talk about? [13:54:33] mforns: hm, don't think so [13:54:38] or can I continue with what we spoke about? [13:54:39] hm. [13:54:40] OK [13:54:41] yeah [13:54:43] ok. [13:54:43] mforns: if you want we can pair on duplicating the thing [13:54:44] so hm [13:55:00] yeah, and cdh::spark will create a symlink to hive-site.xml in its config directory if cdh::hive is included [13:55:08] hmmm [13:55:16] yeah, and it doesn't have it now [13:55:18] hm [13:55:29] joal: does spark hive in cluster mode work now? [13:55:33] i would think it wouldn't... [13:55:45] since the /etc/spark/conf/hive-site.xml symlink doesn't exist [13:55:53] joal, oh yea, Dan and I briefly spoke yesterday about the same stuff, he had the same idea we had yesterday, to collect and re-parallelize the rdds every once in a while [13:55:56] ottomata: I think it doesn't [13:56:06] joal, this would also be an option [13:56:18] joal: do you have an easy test case we can try real quick? [13:56:29] mforns: could be, but really I think the overhead comes from RDDs for small data [13:56:41] ottomata: Will find one, yes [13:56:47] ottomata: give me a minute [13:56:52] k [13:57:41] joal, ok [13:57:50] let's try duplicating the code [13:57:57] mforns: At least trying with lists will make us sure :) [13:58:02] aha [13:58:53] joal, so let me know when you have time for pairing [13:59:13] mforns: sure, need to find an example for ottomata, then pair :) [13:59:18] joal, ok [13:59:29] mforns: I'm sorry, every time I offer to pair, something else shows up ! [13:59:43] joal, no problem, do you want me to wait for you? [13:59:59] mforns: no thank you, please move forward, I'll catch up [14:00:06] ok [14:16:22] ottomata: I got proof :) [14:16:30] ja? :) [14:16:44] So, the job in client mode succeeded [14:17:57] then, the job in cluster mode (from stat1002) fails - It needs hive stuff (some jars and the hive-site file) [14:18:18] Then the job in cluster mode succeeds from stat1002 with the jars and hive file passed in [14:18:32] And the job fails from analytics1030 with the same files passed in [14:19:04] the last job is stuck with this last line in error.log: 16/07/07 14:14:19 INFO metastore: Trying to connect to metastore with URI thrift://analytics1015.eqiad.wmnet:9083 [14:19:09] ottomata: --^ [14:19:55] HM [14:20:02] joal you launched spark from analytics1030? [14:20:05] to do that? [14:20:11] correct ottomata [14:20:16] spark-submit [14:20:18] huh [14:20:18] hmmm [14:20:32] This is kinda what happens when launched from oozie [14:20:54] aye makes sense [14:20:58] ottomata: oozie launches stuff from anywhere (how are we supposed to catch them up after ...)
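A quick check for the symlink mentioned above, on any host you suspect (the paths are the ones from the discussion; whether they exist on a given node is exactly what is being debugged):

    ls -l /etc/spark/conf/hive-site.xml /etc/hive/conf/hive-site.xml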
[14:21:07] just surprised that it was even able to read hive-site.xml [14:21:11] since it isn't in spark/conf [14:21:25] oh ottomata I supplied the path [14:21:29] oh, you did? [14:21:32] indeed [14:21:34] what happens without /etc/hive/conf in the path? [14:21:38] it just fails differently? [14:21:41] I also need to do it in cluster mode from stat1002 [14:21:57] ottomata: batcave? [14:22:10] mmmmm, kinda hard atm, am at a cafe, lots of folks and noise [14:22:15] k [14:22:46] spark manages to grab the hive conf on its own when launched in client mode [14:22:56] when in cluster mode, it doesn't do it by default [14:24:57] hm [14:25:15] weird. hm. [14:25:27] so, wait, in client mode, it knows to look in hive-site.xml? [14:25:29] no. [14:25:36] it is in /etc/spark/conf on stat1002 [14:25:41] so it grabs it from there [14:26:08] but in cluster mode, it never knows where to look? [14:26:15] hm [14:26:16] ottomata: correct [14:26:26] how do you provide it? in spark or in the classpath? [14:26:29] ottomata: plus, it needs some hive jars (datanucleus) [14:26:35] hm [14:26:52] ok, well, the simple solution here is to re-include hive-client on the datanodes [14:26:58] ottomata: https://gist.github.com/jobar/7ccfccdb28bcb657df74f03b1ea37a6a [14:27:04] am a little worried, i feel like there was some conflict [14:27:09] but, maybe we can try on one [14:27:26] --files, crazy, hm [14:27:42] ottomata: I can even try to fake a hive-site with correct params from an analytics machine if you want [14:27:54] ottomata: This is the suggested way to do it [14:28:16] ottomata: However, the fact that oozie can do it from anywhere makes it even crazier [14:28:50] joal: perhaps you should grab the hive-site.xml from hdfs? [14:28:52] in cluster mode? [14:28:53] is that possible? [14:28:57] we deploy that with refinery. [14:29:14] hm, I didn't try - Will do it now [14:29:19] ottomata: --^ [14:29:20] ah, but you need the jars too? [14:29:21] hm [14:29:23] hm [14:29:47] hm, yeah if you need those datanucleus jars too, we should ensure they are on the potential appmasters too [14:29:48] hm [14:30:19] i mean, we could deploy those to hdfs too i guess...buuuut [14:30:19] hm [14:30:27] it seems better to just include the hive client class on workers [14:30:33] i'll try it on one and see what happens [14:31:27] ottomata: file from hdfs fails :( [14:31:46] k [14:31:50] well, i think you need the jars anyway. [14:31:55] so i'm going to try including the class on analytics1030 [14:32:13] oh actually I'm wrong, it worked ! [14:32:16] ottomata: --^ [14:32:29] ottomata: possibly we could have the jars on hdfs as well [14:32:34] ottomata: Will try that [14:32:41] yeah we could [14:32:42] buuut [14:32:46] sounds pretty annoying [14:32:59] ottomata: To maintain, you mean? [14:33:12] yeah [14:33:16] hmmm [14:33:18] except... [14:33:23] oozie does things like that [14:33:28] copies its lib jars to hdfs [14:33:31] hm [14:33:35] ottomata: Couldn't we puppetize the copy? [14:34:06] ottomata: like as part of the hive install, copy those files to hdfs? [14:34:10] hmm joal [14:34:12] they are already there! [14:34:15] as part of the oozie sharelib [14:34:23] /user/oozie/share/lib/lib_20160223160848/hive [14:34:31] or hive2 [14:34:32] ? [14:34:32] hm [14:34:40] oh [14:34:41] and spark!
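For reference, a rough sketch of the kind of cluster-mode spark-submit being tried here, shipping hive-site.xml and the datanucleus jars along with the job so the application master can reach the metastore from any worker (the gist above has the real invocation; the class name, jar names, and paths below are illustrative assumptions):

    # class, jar names, and paths are illustrative
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files /etc/hive/conf/hive-site.xml \
      --jars /usr/lib/hive/lib/datanucleus-core.jar,/usr/lib/hive/lib/datanucleus-api-jdo.jar,/usr/lib/hive/lib/datanucleus-rdbms.jar \
      --class org.example.SomeJob \
      some-job.jar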
[14:34:52] /user/oozie/share/lib/lib_20160223160848/spark [14:37:09] well ottomata, seems that everything is already there indeed (with hive-site in refinery) [14:37:28] ottomata: I'll try to remember that oozie has jars for everything on hdfs :) [14:37:41] ottomata: will try with those paths, many thanks ! [14:38:25] ok joal, cool [14:38:40] ottomata: Funny thing is, there is a (not big) version difference between the datanucleus in the hive folder and the one in the oozie shared libs [14:38:40] i need to clean up the hive client stuff on the worker nodes though then [14:38:45] yeah i see that [14:38:55] hopefully it's ok [14:39:06] hm, are the datanucleus jars not in /usr/lib/spark somewhere? [14:39:54] hm, nope [14:40:05] ottomata: makes sense they ask to include them ;) [14:40:31] hm, joal so, it might be annoying to hardcode links to that oozie sharelib dir though [14:40:33] HMMM [14:40:37] i think that is dynamic somehow... [14:41:10] ottomata: first thing, let's provide them manually through --jars [14:41:19] On startup, Oozie will look for the newest lib_ directory and use that. [14:43:06] http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/ [14:43:19] Below are the various ways to include a jar with your workflow: [14:43:19] 1. Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties. [14:43:30] • There is no need to ever point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties. [14:44:08] so joal, ja, i *think* for oozie, if we use hive-site.xml from hdfs [14:44:10] the rest will just work [14:44:17] since datanucleus is already in the oozie sharelib [14:44:26] makes sense ottomata !!! [14:44:50] joal: aqs1001 upgraded :) [14:44:58] ottomata: however it doesn't when we use spark-submit as-is in cluster mode (but who does) [14:45:04] great elukey !!! [14:45:06] Thanks :) [14:45:30] hm [14:45:34] but that would be nice too [14:45:35] yeah [14:45:44] i guess if you do that you have to manually specify the path in the oozie sharelib [14:45:46] buuut [14:45:46] hm [14:45:51] joal: i'm still looking at including hive client [14:45:55] ottomata: most of the time, when launching jobs, we do it in client mode, in order to get the logs [14:45:56] that may be a fine option too [14:45:58] checking [14:46:01] k [14:48:07] hm, joal for my test, i don't get any conflicts... [14:48:13] i'm going to merge this, and run on analytics1030 [14:48:17] i think it might be good to just do this [14:48:23] doesn't hurt to have hive client on worker nodes... [14:48:33] ottomata: ok [14:48:51] ottomata: currently checking without the jars and with the refinery hive-site on oozie [14:49:09] k [14:49:11] joal, man... the duplicated code is practically the same... but I cannot manage to write it as a single generic thing... [14:49:20] :D [14:49:39] mforns: will be with you in a few minutes [14:55:18] ottomata: works great :) [14:55:42] addshore: ottomata found a solution for us :) [14:55:49] :D [14:55:58] in a meeting now, and then going bouldering :/ [14:56:17] addshore: ooold analytics saying: When somethin's wrong, ask for ottomata [14:56:36] addshore: We'll check up tomorrow morning :) [14:56:45] haha! [14:56:46] addshore: Enjoy bouldering ! [14:56:55] I will :) Talk tomorrow! [14:57:00] Later addshore [14:57:06] mforns: Ready I am ! [14:57:10] Thanks for all the help (again) :) [14:57:17] to the batcave! [14:57:20] OMW !
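Putting the pieces quoted from the Cloudera post together, a hedged sketch of what this looks like from the job side (the property name is the standard Oozie one; the CLI calls assume OOZIE_URL is set, otherwise pass -oozie with the server URL):

    # job.properties fragment, so Oozie pulls the sharelib (spark, hive, datanucleus, ...) in automatically:
    #   oozie.use.system.libpath=true
    #
    # Inspecting what the sharelib actually contains:
    oozie admin -shareliblist spark
    oozie admin -shareliblist hive
    # or list the HDFS path quoted above directly:
    hdfs dfs -ls /user/oozie/share/lib/lib_20160223160848/spark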
[14:57:48] ottomata: so, just providing the hive-site.xml file from refinery works great [14:58:00] ottomata: no need for the jars [14:58:22] mforns: Am I in the alternate cave? [14:58:25] ok great [14:58:29] joal, mmm [14:58:50] I was [14:58:53] joal: i think we should leave it like that then...i can def include hive-client no problem..but i might have to do a little puppet rejiggering with roles to make it make sense [14:58:54] not sure. [14:59:08] ottomata: At least we have a way [14:59:08] hmm, but hmmm [14:59:17] so keep it easier for you for now ;) [14:59:19] hmmm [14:59:28] but if I don't rejigger...i have to clean up all the worker nodes [14:59:28] hehehh [14:59:33] :D [14:59:36] which is the path of least resistance?! [14:59:38] not sure yet... [14:59:39] Mess up or clean up ;) [14:59:41] i might rejigger! [15:02:22] hmmm, hey elukey, we should move the net-topology stuff into hiera [15:02:27] and use that for analytics_hadoop_hosts [15:05:39] definitely, it could be a good option [15:06:10] I went for the easiest and least impactful one while you were away, didn't want to break everything :) [15:06:35] aye cool [15:06:41] running home for meetings, back shortly [15:09:30] joal: I finally added analytics to the aqs notification groups [15:09:35] so we should get alarms in here too [15:09:42] elukey: YAY ! [15:09:48] elukey: Thanks [15:09:58] elukey: I just had a quick look, and nothing seems wrong [15:10:03] (PS2) Mforns: [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) [15:11:02] joal, it's in gerrit ^ and in cloud9, still with the final processing bug, but working until 100 iterations [15:11:14] (CR) jenkins-bot: [V: -1] [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) (owner: Mforns) [15:29:02] milimetric, mforns : let's do it in the batcave :) [15:29:04] batcave or eventbus? I guess batcave [15:29:08] ok :] [15:29:11] k [15:29:15] mforns: you read my mind :) [15:29:18] hehe [15:35:42] ottomata: hola. [15:35:50] holaaaa [15:36:00] ottomata: do you prefer me doing the changes for the logging in yarn_site_extra_properties? [15:36:16] hmm, not sure i have a huge preference [15:36:17] let's see [15:36:37] the cdh module has log aggregation on by default, and this is not (currently) a parameter [15:36:52] that's fine, as i can't think of a situation where you'd want log agg off [15:37:04] but if it is on, the rotation should def be on [15:37:09] so hm [15:37:12] right [15:37:16] so module then [15:37:18] or, at least a parameter [15:37:19] yeah [15:37:20] i think so [15:37:21] k [15:37:27] so ya, let's make it a module param [15:37:38] k [15:38:58] joal: fyi, i am including hive client [15:39:01] seems fine.
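For context on the "proper setting" discussed earlier, the rendered yarn-site.xml ends up carrying the standard YARN log-aggregation properties; the exact retention value is whatever the linked gerrit change uses, the 30-day figure below is only an example:

    <!-- yarn-site.xml fragment; property names are the standard YARN ones, values illustrative -->
    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>2592000</value> <!-- 30 days -->
    </property>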
[15:50:00] Analytics, GLAM-Tech: Track pageview stats for outreach.wikimedia.org - https://phabricator.wikimedia.org/T118987#2437769 (Sadads) [16:01:08] a-team: standddupppp [16:01:26] get up, stand up, don't give up the fight [16:18:00] (PS1) Milimetric: Upgrade Node version to 4.4.6 [analytics/aqs] - https://gerrit.wikimedia.org/r/297807 (https://phabricator.wikimedia.org/T139493) [16:57:52] Analytics: Make top pages for WP:MED articles - https://phabricator.wikimedia.org/T139324#2428005 (Milimetric) p:Triage>Normal [17:01:08] Analytics, Collaboration-Team-Interested, Community-Tech, Editing-Analysis, and 6 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416933 (Milimetric) Putting this on our Radar until the project is better defined (right now it's a bit mor... [17:07:58] Analytics: Varnishkafka should auto-reconnect to abandoned VSM - https://phabricator.wikimedia.org/T138747#2408827 (Nuria) What about sequence numbers? Now when varnish restarts we are restarting varnishkafka and thus sequence numbers go to zero. If varnishkafka reconnects to teh new varnish instance automa... [17:12:25] Analytics: Better identify varnish/vcl timeouts on camus - https://phabricator.wikimedia.org/T138511#2438115 (Milimetric) [17:17:23] Analytics: Better identify varnish/vcl timeouts on camus - https://phabricator.wikimedia.org/T138511#2438153 (Milimetric) p:Triage>Normal [17:20:17] Analytics: Split opera mini in proxy or turbo mode - https://phabricator.wikimedia.org/T138505#2438167 (Milimetric) p:Triage>Normal [17:20:54] Analytics: Better identify varnish/vcl timeouts on camus - https://phabricator.wikimedia.org/T138511#2438170 (Nuria) Unless camus has a proper end_dt it is going to asign one and it will likely be the current timestamp. We need to fix that issues if these records really correspond to true requests. Also I... 
[17:27:10] Analytics: Spike: Evaluate alternatives to varnishkafka: varnishevents - https://phabricator.wikimedia.org/T138426#2438203 (Milimetric) [17:28:44] Analytics: Create ops dashboard with info like ipv6 traffic split - https://phabricator.wikimedia.org/T138396#2399103 (Milimetric) p:Triage>Normal [17:29:17] hangouts died for me [17:29:19] Analytics: Capacity projections of pageview API document on wikitech - https://phabricator.wikimedia.org/T138318#2396838 (Milimetric) p:Triage>Normal [17:32:15] Analytics: Puppetize MirrorMaker - https://phabricator.wikimedia.org/T138267#2395161 (Milimetric) [17:32:17] Analytics, Analytics-Cluster: Puppetize and deploy MirrorMaker using confluent packages - https://phabricator.wikimedia.org/T134184#2438216 (Milimetric) [17:32:21] Analytics, Analytics-Cluster: Puppetize and deploy MirrorMaker using confluent packages - https://phabricator.wikimedia.org/T134184#2256991 (Milimetric) p:Triage>Normal [17:32:52] Analytics: Upgrade Kafka (non-analytics cluster) - https://phabricator.wikimedia.org/T138265#2395136 (Milimetric) p:Triage>Normal [17:33:29] Analytics: Set up dedicated Druid Zookeeper - https://phabricator.wikimedia.org/T138263#2438222 (Milimetric) p:Triage>Normal [17:33:55] Analytics: Puppetize pivot UI - https://phabricator.wikimedia.org/T138262#2438227 (Milimetric) p:Triage>Normal [17:34:37] Analytics: Productionize Druid Pageview Pipeline - https://phabricator.wikimedia.org/T138261#2438230 (Milimetric) [17:34:39] Analytics-Kanban: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2438232 (Milimetric) [17:34:59] Analytics: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2387140 (Milimetric) p:Triage>Normal [17:36:00] Analytics: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2382422 (Milimetric) p:Triage>Normal [17:38:36] Analytics, MediaWiki-API, User-bd808: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321#2364785 (Milimetric) Moving to radar for now, but when you prioritize and define this, we can help code the Oozie jobs that would get this done. Just let... [17:42:59] Analytics, MediaWiki-API, User-bd808: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321#2438254 (bd808) >>! In T137321#2438240, @Milimetric wrote: > Moving to radar for now, but when you prioritize and define this, we can help code the Oozie... 
[17:45:06] Analytics, Patch-For-Review: Implement a standard page title normalization algorithm (same as mediawiki) - https://phabricator.wikimedia.org/T126669#2020367 (Milimetric) p:High>Normal [17:49:35] Analytics: Investigate where Kafka records will almost all null fields are coming from - https://phabricator.wikimedia.org/T136844#2438311 (Milimetric) Open>Resolved a:Milimetric Haven't been able to reproduce, it might have been a fluke [17:54:51] Analytics: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2438329 (Milimetric) a:Nuria [17:55:03] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2064831 (Milimetric) [18:00:25] Analytics: Puppetize pivot UI - https://phabricator.wikimedia.org/T138262#2438361 (Milimetric) [18:00:27] Analytics: Set up a deployment repository for Pivot - https://phabricator.wikimedia.org/T136640#2438360 (Milimetric) [18:00:35] Analytics: Set up a deployment repository for Pivot - https://phabricator.wikimedia.org/T136640#2342416 (Milimetric) p:Triage>Normal [18:00:53] Analytics, Pageviews-API: Enable pageviews on nl/be.wikimedia.org {melc} - https://phabricator.wikimedia.org/T127804#2438363 (Milimetric) [18:00:55] Analytics, Pageviews-API: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2438365 (Milimetric) [18:01:09] Analytics, GLAM-Tech: Track pageview stats for outreach.wikimedia.org - https://phabricator.wikimedia.org/T118987#2438366 (Milimetric) [18:01:11] Analytics, Pageviews-API: Count pageviews for all wikis/systems behind varnish - https://phabricator.wikimedia.org/T130249#2130851 (Milimetric) [18:24:24] elukey: you still around? [19:16:37] ottomata: let me know if you think it's ok to restart the namenode with the new changes, cause i would like to learn how to do that [19:36:47] cool nuria_ [19:36:51] merged cdh change, looks good to me! [19:36:57] next is to update the cdh submodule in ops puppet [19:36:59] want to do that too? [19:37:11] ottomata: yes, talking to milimetric , back in a bit [19:37:21] k [20:27:20] ottomata: updating module [20:34:48] cool [20:38:47] ottomata: done [20:38:58] ottomata: let me know if you want to restart the node and such [20:55:16] nuria_: cool, let's do it tomorrow? [20:55:20] since it's almost end of day here for me [20:55:21] ja? [20:55:21] ottomata: yessir [20:55:26] ottomata: sounds great [20:55:37] k ping me, we do it together whenev you are ready tomorrow [20:55:46] ottomata: will do thnaks man [20:55:47] milimetric: still around? [20:55:49] *thanks [20:56:06] hey ottomata yeah [20:56:30] wanna brain bounce some v simple async js with a poor soul such as me? [20:56:43] omw to the cave [20:57:00] uh... google's being silly [21:29:46] ottomata: so, the librdkafka/confluent-kafka-python libs have balanced consuming by default, right? [21:32:26] as long as the consumers belong to a consumer group i suppose [21:46:40] yes [21:47:01] madhuvishy: it is built into librdkafka now [21:47:12] ottomata: yeah was just reading the docs [21:47:31] aye cool [21:47:42] i gotta run for the eve, be back later! [21:54:44] ottomata: okay! I reviewed [21:54:50] bye :) [23:32:04] Anyone able to check the schema of a table in labswiki?