[01:05:27] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2432256 (Legoktm) >>! In T115119#2430019, @Milimetric wrote: > Ok, @Legoktm, I thought you sa...
[01:42:22] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:41] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING
[04:28:58] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:31:27] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[06:44:10] Hi elukey
[06:54:32] o/
[06:54:42] going to the office, brb :)
[06:54:48] elukey: sure !
[08:09:07] Analytics-Cluster, Operations, Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2379016 (MoritzMuehlenhoff) It's a bit strange, the source package has also changed and it appears there's two versions of that source in Debian stretch by now: https:...
[08:33:57] * elukey is checking again hdfs' configs before increasing the datanode
[08:34:05] heap size
[08:34:07] :)
[08:34:13] okey
[08:36:01] joal: do you need me for aqs?
[08:36:57] no need, just feedback :)
[08:42:29] sure! anything in particular?
[08:42:37] how is the bulk loading working ?
[08:44:13] loading works great
[08:44:30] !
[08:45:29] elukey: It put quite a lot of pressure on AQS, but it loaded 1 month of data in 4 hours (2h to generate the SSTables, 2 hours to stream)
[08:50:19] wooooa
[08:50:22] \o/
[08:53:54] http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html
[08:54:01] HADOOP_HEAPSIZE sets the JVM heap size for all Hadoop project servers such as HDFS, YARN, and MapReduce. HADOOP_HEAPSIZE is an integer passed to the JVM as the maximum memory (Xmx) argument.
[08:54:13] HADOOP_NAMENODE_OPTS overrides the HADOOP_HEAPSIZE Xmx value for the NameNode.
[08:54:27] but what about all the other daemons?
[08:54:32] Yarn seems set
[08:54:40] but I am not sure about other ones
[08:54:41] grrr
[09:10:49] Analytics, Analytics-Cluster, EventBus, Operations, Services: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2432626 (MoritzMuehlenhoff) p:Triage>Normal
[09:45:23] (PS1) Joal: Update WikidataArticlePlaceholderMetrics params [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297566
[09:45:31] addshore: --^
[09:46:12] *looks*
[09:46:18] addshore: I'll add some comments in the oozie job based on the changes I suggest above
[09:47:03] awesome!
[09:59:11] (CR) Joal: [C: -1] "Still some errors, but not far :)" (11 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (owner: Addshore)
[10:42:55] joal: I have tested the heap size change in labs, basically the HADOOP_HEAPSIZE environment var will add an -Xmx2048m to both data and namenodes. HADOOP_NAMENODE_OPTS will be appended to the namenode's arguments adding an -Xmx4096m, so the JVM will pick up the last one
[10:43:12] that should be what the cdh documentation says
[10:43:16] does it make sense?
[10:43:24] the other daemons should work accordingly
[10:43:52] if this is true, I'll need to merge and then restart all the HDFS daemons to force the new change to be picked up by the JVMs
[10:44:38] I am going to wait for ottomata for a final chat, it is not that urgent :)
[10:45:15] * elukey lunch!
[10:54:49] (CR) Addshore: [C: 1] Update WikidataArticlePlaceholderMetrics params [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297566 (owner: Joal)
[11:25:28] (PS4) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[11:25:34] (CR) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job (11 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (owner: Addshore)
[11:59:27] a-team I'm AFK for a while
[12:01:19] o/
[13:26:40] Back !
[13:31:36] Hey addshore, looks like your code is ready to be tested :)
[13:34:44] ooooh
[13:41:22] addshore: I suspect you have never tested oozie :)
[13:41:30] addshore: let me know when is a good time for you
[13:41:30] nope!
[13:41:43] okay, just going to go and grab a quick bite to eat!
[13:41:56] addshore: take your time :)
[13:42:18] addshore: I have meetings 5:30 to 6:30 pm CET
[13:42:27] addshore: but apart from that, flexible :)
[13:42:48] okay!
[13:52:11] joal, do you have some minutes for scala help please?
[13:52:17] mforns: I do
[13:52:24] mforns: To the batcave !
[13:52:30] :]
[14:09:24] ottomata: aloha! puppet disable --ANALYTICS-CLUSTER && merge && selected puppet runs to be super sure
[14:09:28] ?
[14:09:44] hoyo!
[14:09:55] --ANALYTICS-CLUSTER
[14:09:56] hheh
[14:10:05] ja sounds good, the puppet runs won't restart anything though
[14:10:15] so, you might be able to just merge and then run puppet and restart selectively
[14:10:38] ah yes good point, always forget
[14:10:42] elukey: i saw a nodemanager flap alert last night, is it related?
[14:11:01] the nodemanager heap size is already set via YARN_HEAPSIZE, ja?
[14:11:02] to 2048?
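Elukey's labs test above (HADOOP_HEAPSIZE injecting `-Xmx2048m`, HADOOP_NAMENODE_OPTS appending `-Xmx4096m`, and the JVM keeping the last flag) can be sketched as a tiny checker. `effective_xmx_mb` is a hypothetical helper written for illustration, not part of any Hadoop tooling:

```python
# Sketch of the "last -Xmx wins" behaviour discussed above: when both
# HADOOP_HEAPSIZE and HADOOP_NAMENODE_OPTS end up on the JVM command
# line, HotSpot keeps the rightmost -Xmx value.

def effective_xmx_mb(jvm_args):
    """Return the max heap (in MiB) the JVM would use, or None."""
    xmx = None
    for arg in jvm_args:
        if arg.startswith("-Xmx"):
            value = arg[len("-Xmx"):]
            unit = value[-1].lower()
            number = int(value[:-1]) if unit in "kmg" else int(value)
            factor = {"k": 1 / 1024, "m": 1, "g": 1024}.get(unit, 1 / (1024 * 1024))
            xmx = number * factor  # later flags override earlier ones
    return xmx

# HADOOP_HEAPSIZE=2048 puts -Xmx2048m first; HADOOP_NAMENODE_OPTS
# appends -Xmx4096m afterwards, so the NameNode should get 4 GiB.
print(effective_xmx_mb(["-Xmx2048m", "-Dproc_namenode", "-Xmx4096m"]))  # → 4096
```

This is the assumption elukey goes on to double-check with `-XX:+PrintFlagsFinal` later in the log.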
[14:11:32] ah snap I didn't notice the Yarn failure
[14:11:37] I saw HDFS onyl
[14:11:40] *only
[14:11:45] haven't looked, looking
[14:11:52] 1032
[14:11:54] but yeah Yarn should have 2GB
[14:12:01] :/
[14:18:45] elukey: not finding much, i do see java.io.IOException: No space left on device in the recent .out file
[14:18:53] but, i can't really tell if that correlates with the time
[14:19:07] yeah sorry I was working on another thing, checking
[14:19:16] np
[14:19:26] could be from a few days ago when jo filled up some disks
[14:19:38] there are other exceptions there too
[14:19:43] ah 1032 for yarn and 1034 for hdfs
[14:19:44] weird
[14:20:51] don't see anything unusual in .log for that time
[14:21:59] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433186 (MoritzMuehlenhoff) p:Triage>Normal
[14:22:02] hm ¯\_(ツ)_/¯
[14:22:03] :)
[14:22:44] ahahhahah
[14:24:10] joal: I'm ready when you are!
[14:25:27] ottomata: not related but..
[14:25:28] org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /var/log/hadoop-yarn/apps/hdfs/logs is exceeded: limit=1048576 items=1048576
[14:25:32] :P
[14:25:48] hm
[14:25:49] oh
[14:25:51] my
[14:26:01] is that why i see those messages about log aggregation not working?
[14:26:41] no idea about what you are saying sorry :(
[14:26:59] 2016-07-06 01:56:10,223 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Log aggregation is not initialized for container_e25_1467197794735_19368_01_000020, did it fail to start?
[14:28:11] hahah
[14:28:12] sudo -u hdfs hdfs dfs -ls /var/log/hadoop-yarn/apps/hdfs/logs | head
[14:28:12] Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
[14:28:44] elukey: those are just aggregated log files after jobs finish...i'm going to just delete them all to clear them out
[14:28:49] i can't even hdfs dfs -ls them
[14:29:06] objection?
[14:29:28] nope
[14:33:48] joal: give me a poke if you become free :)
[14:34:06] addshore: in impromptu meeting, will ping you
[14:34:13] okay!
[14:34:50] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433216 (Gehel) I'm doing a release of jmxtrans right now. This comes with a few fixes to the stability of the graphite and statsd writers, including moving to a different res...
[14:36:06] hmm yikes
[14:36:06] sudo -u hdfs hdfs dfs -rm -R /var/log/hadoop-yarn/apps/hdfs/logs/*
[14:36:07] Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
[14:36:27] addshore: Ready I am !
[14:36:39] awesome!
[14:38:02] ottomata: heya!
[14:38:07] hi!
[14:38:15] ottomata: have already deleted the logs?
[14:38:33] no
[14:38:35] not able to!
[14:38:36] sorry addshore, another thing I want to mess with, will be with you in a minute
[14:38:40] maybe i deleted some
[14:38:42] okay!
[14:38:44] but it errored
[14:38:56] ottomata: yeah, my point was, let's delete logs from before 2016 ?
[14:39:05] That would be good already
[14:39:06] joal: i would if i could, but i can't even look into the dir
[14:39:14] ottomata: I can't find anything in the logs :/
[14:39:21] i might have to delete the whole logs directory and recreate it
[14:39:23] ottomata: I know !
It's kind of a known issue
[14:39:26] but -Dproc_nodemanager -Xmx2048m -Dhadoop.log.dir=/var/log/hadoop-yarn
[14:39:32] if i try to ls or * the dir it hangs and eventually OOMs
[14:39:44] elukey: aye
[14:40:04] all right merging the change
[14:40:06] ottomata: I have tried that before, I think bumping java memory for the client can make it work though
[14:40:08] that flap last night doesn't seem to be a node manager OOM
[14:40:13] ah ok
[14:40:34] ottomata: need to help addshore , let me know if you need me
[14:40:46] ottomata: But I'd rather not lose all our logs ;)
[14:41:16] addshore: Sooooo, testing oozie :)
[14:41:19] :D
[14:41:31] aye
[14:41:34] joal: will try that first
[14:41:43] addshore: you talk with elukey about that one, he's gonna love explaining to you how crappy the thing is :D
[14:41:59] addshore, elukey : just kidding ;)
[14:42:18] hah!
[14:42:45] So addshore, what you need is, on stat1002, the jar you will use (the one with the code I suggested, for parameters change)
[14:43:24] addshore: And, the version of the refinery code you will test (like clone the refinery repo on stat1002, and git pull the correct patch)
[14:43:36] 16:42 PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:44:34] okay, so I'll checkout that patch and build the jar first!
[14:45:05] is there anybody working on stat1002 now?
[14:45:19] I can't even do df -h :P
[14:45:27] im on it :O
[14:46:03] it seems veeeery slow
[14:46:43] elukey: ~80% CPU for about 10 minutes
[14:47:24] totally locked up for me now :P
[14:48:02] elukey: i think its me
[14:48:08] i lsed via /mnt/hdfs
[14:48:09] its hanging
[14:48:11] trying to kill it :/
[14:48:47] kinda knew that wouldn't work, but didn't think it would lock it up
[14:48:49] indeed ottomata : fuse_dfs --> 1600% !
[14:48:57] ja
[14:49:25] i umounted it
[14:49:27] i think better
[14:49:44] ja ok
[14:49:47] addshore: better now?
[14:49:51] yup!
[14:50:12] ottomata: seems that the process still exists
[14:50:56] the process?
[14:51:13] in top, I still see it sometimes
[14:51:46] hm, gone for now it seems
[14:51:47] the ls process?
[14:51:50] hm ok
[14:52:00] sorry for the noise ottomata
[14:53:42] okay joal repackaged and I have the jar! :)
[14:53:50] ok great
[14:54:15] addshore: then, refinery code is needed
[14:54:36] addshore: with the patch you want to apply
[14:54:51] okay *goes to clone that and check that out*
[14:55:22] addshore: on stat1002, obviously ;)
[14:55:57] joal: thanks for the heapsize increase tip, i can ls things. going to just try to remove old ones
[14:56:07] ottomata: Awesome
[14:56:41] joal: I have that!
[14:56:50] (also writing all of this down) ;)
[14:57:01] ottomata: You're currently working on that task : T139178
[14:57:01] T139178: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178
[14:57:18] ottomata: Xmx updated on analytics1032, restarted yarn and hdfs
[14:57:20] all good
[14:57:28] ottomata: shall I assign it to you, with 5 points, and move to in-progress?
[14:57:35] addshore: Great :)
[14:57:48] I am thinking of restarting the daemons on 1002
[14:57:56] to double check that everything is good
[14:58:03] any objections? joal, ottomata
[14:58:09] addshore: now you need to make the oozie folder of the refinery repo available from hadoop (meaning, copy it to hdfs)
[14:58:25] elukey: good for me (it's not master, correct?)
[14:58:42] nope :)
[14:58:45] I'll triple check
[14:59:05] addshore: My way of doing it is: I have a oozie folder in /user/joal that contains my WIP on oozie
[14:59:24] addshore: I regularly wipe it and recreate, but it's always useful for testing
[14:59:36] addshore: familiar with HDFS commands?
[14:59:41] nope!
[15:00:21] addshore: ok, hdfs dfs - usually works
[15:00:23] like:
[15:00:25] hdfs dfs -ls
[15:00:45] this should give you the content of your home folder on hdfs (should be /user/addshore)
[15:00:47] oooh, yep, okay
[15:01:14] addshore: notice the home folder on hdfs being /user/NAME, and not /home/NAME
[15:01:26] uhhhh joal. i just deleted them all
[15:01:29] :(
[15:01:30] sorry
[15:01:38] was scripting a thing to delete like 3/4 of them
[15:01:42] since there are no dates on the dir names
[15:02:04] but i just noticed that now there is only one dir in there...so somehow i just accidentally deleted them all
[15:02:19] ottomata: Done :)
[15:02:48] addshore: So, does hdfs dfs -ls work?
[15:02:53] yup
[15:03:06] ok, I was just checking you actually have a hdfs home ;)
[15:03:21] So, to copy files to your home:
[15:03:38] hdfs dfs -put /path/to/oozie /user/addshore
[15:03:55] or even leave the end empty (uses home by default)
[15:04:24] okay, and what is the path to oozie ?
[15:04:31] *reads up *
[15:04:58] addshore: the path to the oozie folder in the refinery repo
[15:05:40] ahh okay!
[15:06:19] addshore: the oozie folder contains the files oozie will look for when trying to run your code (based on the parameter we'll give it)
[15:07:13] awesome, so now i have an oozie folder!
[15:07:19] :)
[15:08:14] ottomata: while you're at it, I think it would be a good idea to clean the /var/log/hadoop-yarn/apps/USERNAME folders based on time watchathink?
[15:08:38] addshore: Now it's about launching an oozie job using the files that are in your folder on hdfs :)
[15:09:28] joal: maybe a good time to make a task to have proper yarn aggregated log rotation :)
[15:09:56] ottomata: We have the one thing already, maybe reuse that one?
[15:10:15] ottomata: You have deleted logs for the hdfs user, but not all the rest of us ;)
[15:10:15] we have one thing?
[15:10:20] oh, but all of them!
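The `hdfs dfs` basics walked through above, plus joal's earlier tip that the client itself is a JVM whose heap can be bumped via HADOOP_HEAPSIZE, could be wrapped like this sketch. `build_hdfs_cmd` is a hypothetical helper for illustration only:

```python
# Sketch: `hdfs dfs` is a JVM client, so listing a directory with ~1M
# entries can OOM the client itself; exporting HADOOP_HEAPSIZE before
# invoking it raises the client heap. build_hdfs_cmd() only assembles
# argv + environment; nothing here talks to a real cluster.
import os

def build_hdfs_cmd(args, client_heap_mb=None):
    """Build argv and environment for an `hdfs dfs` invocation."""
    env = dict(os.environ)
    if client_heap_mb is not None:
        # the hadoop/hdfs wrapper scripts read HADOOP_HEAPSIZE (in MB)
        env["HADOOP_HEAPSIZE"] = str(client_heap_mb)
    return ["hdfs", "dfs"] + list(args), env

# listing the huge aggregated-logs dir with a 4 GiB client heap
cmd, env = build_hdfs_cmd(["-ls", "/var/log/hadoop-yarn/apps/hdfs/logs"],
                          client_heap_mb=4096)
# copying a local oozie folder to the hdfs home, as in the walkthrough
put_cmd, _ = build_hdfs_cmd(["-put", "/path/to/oozie", "/user/addshore"])
# subprocess.run(cmd, env=env) would then actually execute it on stat1002
print(cmd)
```

The separation between building and running the command keeps the sketch testable without a Hadoop installation.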
[15:10:31] ottomata: That leaves us about 30Tb logs ;)
[15:10:31] the hdfs user has wayyy more apps than the rest of us
[15:10:39] about half of it
[15:10:40] jaaa
[15:10:52] hmm, ok ok ok ok ok
[15:10:56] gimme a few
[15:11:04] still making a task for proper rotation :)
[15:11:11] I looked at it a few days ago, and we created : T139178
[15:11:11] T139178: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178
[15:11:17] ottomata: REUSE !!!
[15:11:21] :D
[15:11:30] addshore: Have you had a look at the oozie page we have?
[15:11:38] I don't think so!
[15:11:57] addshore: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie
[15:12:08] Analytics, Analytics-Cluster: Rotate YARN aggregated logs in hdfs:///var/log/hadoop-yarn/apps/$username/logs - https://phabricator.wikimedia.org/T139470#2433286 (Ottomata)
[15:12:13] addshore: I'll let you have a glance, then come to me to actually launch the job :)
[15:12:27] oh wait, I have scanned through that one before!
[15:12:44] addshore: I was kinda sure I gave it to you once ;)
[15:14:12] addshore: Running an oozie job is as simple as oozie job -config /path/to/file.properties run
[15:14:25] addshore: BUT, we always want to override some properties :)
[15:15:07] addshore: And, there is a -dryrun mode that also helps catching some errors
[15:16:10] addshore: Overriding a property when launching oozie is done by -D name=value
[15:16:18] can the properties be copied from when we ran spark-submit
[15:16:21] ?
[15:16:29] ahh, okay, so not quite :D
[15:16:35] :)
[15:16:47] And, you won't run spark-submit here, oozie will do it for you
[15:17:17] addshore: The properties you want to override are the ones defined in your coordinator.properties file
[15:17:50] joal: i don't really have a good way to do this, the dirs are not done by date
[15:17:55] it was easy for hdfs user
[15:18:06] i was just going to delete 700K dirs
[15:18:21] this will take me a while to script, will have to parse output of hdfs ls and react
[15:18:21] etc.
[15:18:27] soooooo, not going to do right now :)
[15:18:36] ottomata: restarted all the daemons on 1002, if you want to double check
[15:18:46] on 1002?
[15:19:09] namenode has -Xmx4096m
[15:19:10] cool
[15:19:38] it has also -Xmx2048, but the last one should win..
[15:19:43] I am double checking this assumption :)
[15:19:48] also https://github.com/apache/kafka/pull/1497#issuecomment-230803562
[15:19:51] :)
[15:20:18] ottomata: folder names are id based: if you sort them by name it should be ok for time :)
[15:20:37] addshore: I'll have an example for you in a minute
[15:21:18] joal: is the path to the properties file a hdfs path or regular path?
[15:21:27] addshore: hdfs
[15:21:32] oh no sorry, regular
[15:21:36] :D
[15:21:50] hehe, elukey i just saw that too!
[15:21:51] :)
[15:21:56] so, at least oozie job -config ~/refinery/oozie/wikidata/articleplaceholder_metrics/coordinator.properties run -Dgraphite_namespace=daily.test.articleplaceholder
[15:21:57] addshore: please don't try it in run mode ;)
[15:22:17] joal: ja, but how many to delete?
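The launch pattern joal describes (point `-config` at a properties file, override individual properties with `-D name=value`, and prefer `-dryrun` before a real run) can be sketched as a small command builder. `oozie_cmd` is a hypothetical helper; the exact flag spellings (`-run` vs bare `run`) should be taken from the oozie CLI itself rather than from this sketch:

```python
# Sketch of assembling the oozie invocation from the walkthrough above.
# Nothing here submits a job; it only builds the argv, which keeps the
# override logic (-Dname=value tokens, -dryrun vs -run) testable.

def oozie_cmd(properties_file, overrides=None, dryrun=True):
    """Assemble an `oozie job` invocation with -D property overrides."""
    cmd = ["oozie", "job", "-config", properties_file]
    for name, value in (overrides or {}).items():
        cmd.append(f"-D{name}={value}")  # -Dname=value, as in the example above
    cmd.append("-dryrun" if dryrun else "-run")
    return cmd

print(oozie_cmd(
    "~/refinery/oozie/wikidata/articleplaceholder_metrics/coordinator.properties",
    overrides={"graphite_namespace": "daily.test.articleplaceholder"},
))
```

Adding an `oozie_directory` override to the `overrides` dict is what makes oozie read the coordinator/workflow XML from a personal test folder on hdfs instead of the deployed refinery.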
[15:22:26] addshore: Correct, but this would fail
[15:22:26] joal: not saying it can't be done
[15:22:33] ottomata: Right
[15:22:43] ottomata: Your call on timing :)
[15:22:49] ottomata: I stop bothering ;)
[15:23:01] easy to script if i always delete the same number, but to do it right we have to uhhhh do it right :)
[15:23:05] i think it isn't urgent
[15:23:23] ottomata: The reason we filed a task is because HDFS was 65% full
[15:23:31] ottomata: Having 60T logs
[15:23:32] oh what task?
[15:23:35] maybe i missed this
[15:23:41] i thought this was mostly about too many files in one dir
[15:23:43] The one I pasted two times for you already :-P
[15:23:46] ...>>>
[15:24:00] ottomata: T139178
[15:24:00] T139178: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178
[15:24:12] oh weird, haha, joal, i didn't see it because it wasn't joal
[15:24:17] my brain ignored stashbot
[15:24:21] huhuhu
[15:24:25] np ottomata :)
[15:24:58] ottomata: Agreed, too many files wasn't good, but 30Tb logs left is kind of a lot :)
[15:25:07] Analytics, Analytics-Cluster: Rotate YARN aggregated logs in hdfs:///var/log/hadoop-yarn/apps/$username/logs - https://phabricator.wikimedia.org/T139470#2433362 (Ottomata)
[15:25:09] Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2433364 (Ottomata)
[15:25:42] addshore: When looking in your coordinator.properties file, you define hdfs paths for the coordinator.xml and workflow.xml files (not only)
[15:25:58] addshore: For the moment, oozie wouldn't be able to find them at the place you told it
[15:26:29] addshore: The easiest way to test is to override oozie_directory, using the oozie folder you created on hdfs :)
[15:26:45] Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2421739 (Ottomata) This morning I (slightly accidentally) deleted all hdfs user aggregated logs. This brought us down to 30T.
We should script something to properly remove old logs for each user. Something like...
[15:26:47] joal: agree cool
[15:27:10] also addshore : Since you are testing, overriding start_time and end_time is kinda important, to only have a couple of days run
[15:28:22] ottomata, joal: rolling restart hdfs on all the analytics hosts (except 1001)
[15:28:29] elukey: k
[15:28:41] ahhh okay! and then that will run jobs that have not run before for that period of time?
[15:29:13] addshore: You will define from which point in time you start
[15:29:18] addshore: https://gist.github.com/jobar/de1adb9ac53ddd3e6be23199c14003a9
[15:29:39] addshore: I have a typo (two times start_date instead of end_date)
[15:29:51] addshore: But I think it should do it
[15:30:13] okay!
[15:30:24] addshore: the refinery_directory override is a trick to force oozie to use the latest version, even in case of broken deploy (shouldn't matter for you)
[15:30:27] and should I add -dryrun? and does -dryrun replace -run ?
[15:30:34] yessir !
[15:30:43] First, try with dryrun instead of run
[15:30:46] addshore: --^
[15:30:59] cool!
[15:31:08] Error: E1002 : E1002: Invalid coordinator application URI [/user/addshore/oozie/pageview_hourly/datasets.xml], path not existed : /user/addshore/oozie/pageview_hourly/datasets.xml: /user/addshore/oozie/pageview_hourly/datasets.xml
[15:31:17] mforns / joal: i'll brt
[15:31:21] elukey: cool proceed
[15:31:27] addshore: If oozie spits a big bunch of XML at you, it found no easily spottable errors
[15:31:33] addshore: Ah, errors ;)
[15:32:12] Analytics, Services, cassandra, Patch-For-Review: Refactor the default cassandra monitoring into a separate class - https://phabricator.wikimedia.org/T137422#2433396 (Eevans) p:Triage>Normal
[15:32:39] addshore: You see the bug?
[15:32:44] I didn't spot it before
[15:33:05] can't immediately spot it!
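The cleanup idea from T139178 above (app directory names are id-based, so joal's "sort them by name" hint approximates sorting by age) might look like this sketch. `dirs_to_delete` and the sample directory names are hypothetical; a real run would list the per-user folders with `hdfs dfs -ls` and remove the selected ones with `hdfs dfs -rm -R`:

```python
# Sketch: pick the oldest YARN aggregated-log dirs to delete, keeping
# the newest N per user. Only the selection logic is implemented; no
# HDFS access happens here.

def dirs_to_delete(app_dirs, keep_newest=1000):
    """Return the oldest aggregated-log dirs, keeping the newest N.

    Application ids sort oldest -> newest lexicographically as long as
    the trailing sequence numbers are equally padded; unpadded ids
    would need a numeric sort on the trailing counter instead.
    """
    ordered = sorted(app_dirs)
    return ordered[:max(0, len(ordered) - keep_newest)]

# synthetic example names modeled on application_<clusterTs>_<seq>
dirs = [f"application_1467197794735_{n:05d}" for n in (19368, 21162, 12, 500)]
print(dirs_to_delete(dirs, keep_newest=2))  # the two oldest dirs
```

Keeping a fixed count per user sidesteps the problem ottomata mentions, that the dir names carry no dates to filter on.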
:/
[15:33:06] path for dataset should be: /user/addshore/oozie/pageview/hourly/datasets.xml
[15:33:28] addshore: Normal, I'm kinda used to our system and folders :)
[15:34:05] ahhh, okay
[15:34:46] (PS5) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[15:36:36] addshore: This error is kinda ok, you don't have to re-copy the files to hdfs (since it's a property update)
[15:36:47] addshore: You could actually have tested it using -D :)
[15:37:22] ahhh, true!
[15:37:39] :)
[15:37:52] oooooh
[15:37:52] Error: E1002 : E1002: Invalid coordinator application URI [/user/addshore/oozie/pageview/hourly/datasets.xml], path not existed : /user/addshore/oozie/pageview/hourly/datasets.xml: /user/addshore/oozie/pageview/hourly/datasets.xml
[15:37:58] addshore: currently in meeting, will try to followup with you but I might take long to answer
[15:38:01] okay!
[15:38:38] addshore: actually, I'm bad, sorry, it was /user/addshore/oozie/pageview/datasets.xml
[15:38:39] /pageview/datasets.xml perhaps!
[15:38:44] addshore: without the hourly
[15:38:48] addshore: sorry
[15:39:07] (PS6) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[15:39:23] joal / mforns: uh... nvm, I'm running way late in my 1/1
[15:39:37] milimetric: ok, I'll reschedule :)
[15:39:41] ok
[15:46:08] addshore: I have time now
[15:46:14] addshore: How is the thing going?
[15:46:15] :D
[15:46:24] *switches back to the correct tabs*
[15:46:32] hmmm... Error: E1002 : E1002: Invalid coordinator application URI [/user/addshore/oozie/pageview/hourly/datasets.xml], path not existed : /user/addshore/oozie/pageview/hourly/datasets.xml: /user/addshore/oozie/pageview/hourly/datasets.xml
[15:46:43] oh wait, I didn't update it...
[15:46:44] joal: i got a non urgent q about revision_visibility_change whenever you have a brain context switch moment :)
[15:46:47] Right, easier would be to override
[15:46:53] yup
[15:47:05] ottomata: post-standup?
[15:47:12] sho
[15:47:46] cool ottomata
[15:48:20] joal, can a checkpoint be unpersisted?
[15:48:28] joal: awesome, that gives me loads of xml!
[15:48:41] mforns: I don't think so, you need to remove the parent folder (for instance)
[15:48:49] addshore: You're good to try in run mode :)
[15:48:56] addshore: Before
[15:49:02] addshore: Do you know hue ?
[15:49:32] addshore: And more precisely, the oozie monitoring part of hue (this one I'm pretty sure not ;)
[15:49:35] addshore: https://hue.wikimedia.org/oozie/list_oozie_coordinators/
[15:49:50] I have logged in a few times but never really done anything on it!
[15:49:54] addshore: When started, yours should come up in there
[15:50:20] okay!
[15:51:38] job: 0009634-160630131625562-oozie-oozi-C
[15:51:42] :)
[15:52:20] although it has not appeared ;)
[15:52:27] hue is a bit slow
[15:52:32] okay!
[15:53:32] addshore: another bug :)
[15:53:44] addshore: Have you found your coordinator on hue?
[15:53:58] addshore: if no, refresh ;)
[15:54:23] I don't see it :/
[15:54:30] addshore: hm
[15:54:44] addshore: https://hue.wikimedia.org/oozie/list_oozie_coordinators/
[15:54:48] I have it first in the list :(
[15:55:24] ahh yes, I see it! apparently I had clicked something which meant it wasn't appearing! I think I was offset or something!
[15:55:35] addshore: Arf
[15:55:42] addshore: So, click on it to get details
[15:55:51] addshore: You'll see two instances of actions to run
[15:55:52] 2-02 Jun 2016 00:00:00 Missing hdfs://analytics-hadoop/wmf/data/wmf/pageview_hourly/hourly/year=2016/month=6/day=2/hour=0/_SUCCESS
[15:55:59] Right
[15:56:06] In that path: /pageview_hourly/hourly/
[15:56:13] should be /pageview/hourly/
[15:56:19] addshore: Same error from me ...
[15:56:43] and that is from pageview_data_directory ?
[15:56:45] addshore: Meaning, kill this job (oozie job -kill OOZIE_ID)
[15:57:03] addshore: correct
[15:57:10] (PS7) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[15:58:30] addshore: quick one I see: alignment for error_emails_recipients property in property file, please :)
[15:58:36] okay, all updated code wise, just waiting for the job to die
[15:58:55] (PS8) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[15:59:01] and updated that alignment :)
[15:59:23] cool, so the coordinator has been killed, so now I can retry?
[16:00:13] addshore: indeed, please go ahead :)
[16:00:43] doing!
[16:00:48] addshore: I'm in meeting, so probably slow to answer
[16:00:51] okay!
[16:02:49] addshore: still path error :( Remove hourly from the dataset path (sorry I should have been more explicit)
[16:02:58] (PS9) Addshore: Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[16:02:59] yep, on it!
[16:06:24] joal: java.io.FileNotFoundException: File does not exist: hdfs://analytics-hadoop/wmf/refinery/2016-05-16T17.41.29Z--a85b77a-dirty/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.32.jar in the Log this time!
[16:06:52] addshore: maaan, told you that thing was hell
[16:07:12] addshore: So, I forgot to have you copy the jar file you wish to use to your hdfs folder
[16:07:22] :D
[16:07:36] copy it over, and override the path i guess?
[16:08:01] and override the spark_job_jar prop to /user/addshore/refinery
[16:08:05] -job-0.0.32.jar
[16:08:15] addshore: You actually don't need me
[16:08:18] :D
[16:16:17] hmmm libpath [/user/addshore/oozie/wikidata/articleplaceholder_metrics/lib] does not exist
[16:16:57] hm ... is it needed anywhere addshore?
[16:17:36] actually the exception is java.io.FileNotFoundException: File file:/var/lib/hadoop/data/j/yarn/local/usercache/addshore/appcache/application_1467197794735_21162/container_e25_1467197794735_21162_01_000001/= does not exist
[16:18:24] addshore: that doesn't look good, first time I see something like that
[16:18:43] addshore: After my meeting I'll be gone, is that ok if we settle that tomorrow?
[16:18:54] yup that's fine! :)
[16:18:59] ok cool
[16:19:05] I'll get on with some other things :)
[16:19:13] addshore: sorry for the back & forth with path stuff :S
[16:19:19] no worries :D
[16:25:54] ottomata: Oh, I didn't realise the script I pasted is ruby !
[16:26:27] ottomata: we could also use the hdfs python lib (works great, and having on stat1002 a4
[16:26:36] ottomata: we could also use the hdfs python lib (works great, and having on stat1002 and 4 could be nice)
[16:26:43] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode
[16:26:52] joal / mforns: I assume yall have to go tonight, right? I can chat for a bit and catch up on schemas otherwise
[16:26:55] WHAT
[16:27:04] milimetric: I need to run indeed
[16:27:25] milimetric: I rescheduled tomorrow (but you and mforns actually don't need me)
[16:27:28] milimetric, I have the interview in 5, but later we can talk, although joal will be off before that I think
[16:27:49] man, everyone writes faster than me...
[16:27:53] :]
[16:27:58] :D
[16:28:05] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2433581 (Ottomata) Hm, @Eevans and @mobrovac, got a Q. Over at https://github.com/wikimedia/mediawiki-extensions-EventBus/...
[16:28:07] WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal
[16:28:13] so one of the journals is not happy
[16:28:15] checking
[16:28:16] joal: ah cool
[16:28:16] ja
[16:28:34] milimetric: a better way to put it (for me at least), is that everybody makes typos faster than you ;)
[16:28:38] elukey: hmm
[16:28:38] k
[16:28:54] joal: you meant to ping mforns, but I think it proves your point even better
[16:28:58] ottomata: I used that lib when working with halfak
[16:29:16] :D
[16:29:22] nice, joal the snakebite one?
[16:29:26] xD
[16:29:27] Thanks milimetric ;)
[16:29:35] hm, lemme check ottomata
[16:29:40] k, we can just talk tomorrow, it's cool, I'm happy to stare at this insane page for a while longer
[16:29:49] * milimetric off to lunch
[16:30:00] https://github.com/spotify/snakebite
[16:30:08] ottomata: nope, the pypi one :)
[16:30:12] oh hm
[16:30:26] java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "analytics1001.eqiad.wmnet/10.64.36.118"; destination host is: "analytics1035.eqiad.wmnet":8485
[16:30:31] milimetric, if you want to talk about insane page history later, I'm for that
[16:30:35] this is weird, journal node is running on 1035
[16:30:38] hm
[16:30:38] and only one??
[16:30:54] ah no I got it
[16:30:58] my fault dammit
[16:31:15] I restarted two journals not waiting long enough
[16:31:17] :/
[16:31:18] ah
[16:31:18] ok
[16:31:19] sorry
[16:31:27] a-team: please let me know if any of you is interested in shadowing in the CTO interviews
[16:31:43] looks like 1002 took over ok though elukey
[16:31:52] yeah yeah
[16:32:00] my bad, really sorry
[16:32:01] :(
[16:32:04] np!
:)
[16:32:51] elukey: no harm done, things look fine
[16:32:57] gonna go to lunch
[16:34:16] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode
[16:35:37] nuria_, the candidate sent an email they are going to be late, asking to reschedule
[16:35:43] what?
[16:35:49] man...
[16:36:14] 1 minute ago... well
[16:36:17] hehe
[16:40:14] nuria_, I guess there will be no interview today, no?
[16:45:17] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2433634 (mobrovac) @Ottomata these two fields are redundant in a sense, but they exist because when you are viewing the his...
[16:57:21] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2433659 (Halfak) The sha1 field is very useful. I use it to detect reverts -- edits which fully restore a page to the prev...
[16:57:24] something is not right https://grafana.wikimedia.org/dashboard/db/analytics-hadoop
[16:57:40] it seems that the HDFS namenodes picked up the Xmx2048
[16:57:54] even if they should pick the last one, according to what the docs say
[17:02:25] milimetric, do you want to chat about page history?
[17:02:43] mmm elukey@analytics1001:~$ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx2048m -XX:+PrintFlagsFinal -Xmx4096m 2>/dev/null | grep MaxHeapSize
[17:02:50] uintx MaxHeapSize := 4294967296 {product}
[17:26:21] anybody there?
[17:27:11] what's up elukey ?
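Elukey's `-XX:+PrintFlagsFinal` check above can be parsed mechanically to confirm that the later `-Xmx4096m` won (4096 MiB = 4294967296 bytes). `parse_max_heap` is a hypothetical helper for the PrintFlagsFinal output format:

```python
# Sketch: extract MaxHeapSize from `java -XX:+PrintFlagsFinal` output,
# as piped through grep in the session above, and confirm it matches
# the last -Xmx flag on the command line.
import re

def parse_max_heap(flags_output):
    """Extract MaxHeapSize in bytes from -XX:+PrintFlagsFinal output."""
    m = re.search(r"MaxHeapSize\s*:?=\s*(\d+)", flags_output)
    return int(m.group(1)) if m else None

sample = "uintx MaxHeapSize := 4294967296 {product}"
assert parse_max_heap(sample) == 4096 * 1024 * 1024  # the last -Xmx flag won
print(parse_max_heap(sample))  # → 4294967296
```

This only shows what the launcher would use; the grafana graphs below suggest the running daemons nonetheless ended up with the 2 GiB value, which is the puzzle being debugged.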
[17:27:23] hello :) [17:27:38] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop does not look right [17:27:45] the namenodes are using 2GB now [17:27:56] but I checked and the JVM seems to pick up the last Xmx value [17:29:17] they started to do minor Young Gen collections [17:30:06] hm [17:30:30] elukey: Rollback for at least namenodes seems reasonable, doesn't it? [17:31:24] elukey: I'm afraid of cluster dying [17:33:09] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433874 (Gehel) jmxtrans 259 is now released: http://central.maven.org/maven2/org/jmxtrans/jmxtrans/259/ [17:33:42] joal: I have no idea why it is behaving in this way [17:34:00] elukey: can't say [17:34:28] mforns: yes, but I've got scrum of scrums and one more meeting after that [17:35:03] ok milimetric np [17:35:35] elukey: Can I let you handle that with ottomata when he gets back? [17:36:42] sure, I am rolling back [17:37:05] elukey: I'll help monitoring for a minute if you want [17:40:02] sure.. I am planning to restart the namenode on 1001 that is the current standby [17:40:07] and then 1002 [17:40:25] the rest of the cluster can go ahead with the new options so we'll test them [17:40:36] but tomorrow I'll need to check what the hell happened [17:40:41] elukey: correct [17:41:27] all right 1001 restarted [17:45:47] mmmm nothing has changed [17:46:15] elukey: :( [17:47:52] done also 1001 [17:48:03] elukey: you mean 1002? [17:48:12] yeah sure [17:48:20] now 1001 is active [17:48:22] mmmmm [17:48:45] so we had some jvm upgrades last month sitting there and waiting to be deployed, but security related [17:48:48] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433931 (fgiunchedi) thanks @Gehel ! this would likely allow us to fix {T97277} too [17:49:14] joal, checkpointing is speeding up the execution a bit, but not much... and...
strange is that the checkpoint files are very small. [17:49:17] mmm [17:49:30] mforns: hm [17:49:33] Analytics, Operations: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2433934 (Gehel) @fgiunchedi Yep, that fix has been merged. [17:50:48] mforns: let's discuss that tomorrow, I should already be gone ;) [17:50:52] Analytics-Kanban, EventBus, Patch-For-Review: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need - https://phabricator.wikimedia.org/T134502#2433938 (Ottomata) If it has meaning, I'm all for keeping it. In the current EventBus hook though, it is set to the same v... [17:50:56] ottomata: hiiiiiiiii [17:51:01] I'd need your brain for a sec [17:51:05] elukey: sorry I didn't help :( [17:51:09] elukey: Gone for now [17:51:22] thanks joal! [17:51:31] ottomata: so https://grafana-admin.wikimedia.org/dashboard/db/analytics-hadoop looks weird [17:51:46] Analytics: Upgrade AQS node version to 4.4.6 - https://phabricator.wikimedia.org/T139493#2433953 (Milimetric) [17:51:53] elukey: hiya, what's up? [17:51:55] the namenode's heap consumption went down like it had a 2GB limit [17:52:06] but I checked the JVM [17:52:07] hm, elukey probably just because it was restarted [17:52:08] no? [17:52:14] elukey@analytics1001:~$ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Xmx2048m -XX:+PrintFlagsFinal -Xmx4096m 2>/dev/null | grep MaxHeapSize [17:52:19] Analytics-Kanban: Upgrade AQS node version to 4.4.6 - https://phabricator.wikimedia.org/T139493#2433967 (Milimetric) p:Triage>High [17:52:21] and it looked good [17:52:33] hmm i see [17:52:35] it's staying at around 2G [17:52:36] hm [17:52:39] let's see.. [17:52:40] ottomata: yeah I thought so but it looks weird..
[17:52:47] now the weirder thing is that I just rolled back :P [17:53:33] elukey: rel-eng gave me this guide to pass on (for tomorrow): https://wikitech.wikimedia.org/wiki/Scap3/Migration_Guide [17:53:55] milimetric: thanks! [17:54:11] ottomata: now that I see -Xmx1000m and -Xmx4096m are living together [17:54:16] in the rollback scenario [17:54:27] so it is not the new setting [17:54:30] but something else [17:54:37] Moritz installed some upgrades a while ago [17:54:41] but mostly security [17:55:36] elukey: hm, maybe it really is just because it was restarted, might take a while to reach max heap used [17:56:04] hm looking at history maybe not [17:56:19] yeah that's what didn't convince me [17:56:48] moreover there are Young Gen collections [17:56:51] that were 0 before [17:57:13] hmmm elukey, i see Maximum heap size:  [17:57:13] 3,728,384 kbytes [17:57:16] in jmx [17:57:29] which is not what i'd expect for either -Xmx value... [17:57:55] I checked before doing the restart, it was that value [17:58:08] but not sure why [17:58:59] btw are you using jvisualvm? [17:59:16] na jconsole [17:59:21] ah okok [17:59:28] I used it too but sometimes it doesn't connect [17:59:29] weird [18:00:33] elukey: you saw the 3728...kbytes value before the restart too? [18:00:52] yeah iirc it was the same on a namenode not restarted [18:01:51] and we haven't changed the GC [18:02:43] ottomata: I am going to step away for ~30 mins, brb [18:02:56] ok [18:18:36] welll, elukey it is def above the 1000M and even above the 2048m mark, so it would seem that neither of those are the limit [18:19:05] probably it is either that 3.5G value or the 4g value we want it to be [18:19:24] dunno why it isn't jumping back up immediately though.
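The duplicate-flag behavior being tested above can be checked on any host with a JDK: when -Xmx appears more than once on the command line, HotSpot honors the last occurrence. A quick sketch of the check (the value below matches what elukey pasted; -version is added only so the JVM exits immediately):

```shell
# Pass -Xmx twice and ask HotSpot which value won; the JVM takes the
# last occurrence, so 4096m should apply even with 2048m earlier.
java -Xmx2048m -XX:+PrintFlagsFinal -Xmx4096m -version 2>/dev/null \
  | grep -i MaxHeapSize
# Expect MaxHeapSize := 4294967296 (4 GiB), i.e. the last -Xmx.
```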
[18:19:42] perhaps there is something with the jvm updates moritz did that makes it GC more often [18:20:04] it doesn't seem particularly worrying though [18:20:08] i guess let's just keep an eye on it [18:27:50] ottomata: here I am! I am thinking the same [18:28:04] we are seeing only minor GC collections (young gen) not major ones [18:28:21] so for the moment it seems a natural behavior [18:28:26] elukey: btw i modified your dashboard to use line size of 1 instead of 2, i hope you don't mind! [18:28:48] please change whatever you wish, thanks! [18:29:07] ok phew, wasn't sure if you had it on purpose, i much prefer the thin line, easier to read especially when surrounded by lots of other lines [18:30:46] yep yep :) [18:31:04] on the bright side: https://grafana-admin.wikimedia.org/dashboard/db/analytics-hadoop?panelId=1&fullscreen [18:31:11] the datanodes look good [18:31:33] ottomata: plan forward - I'll re-rollout the change tomorrow, and we're going to keep it monitored [18:32:05] I am not really seeing anything major going on [18:32:18] makes sense? [18:32:47] ok cool [18:32:48] yeah [18:32:51] sounds good elukey [18:32:56] the datanodes are still running with 2048, ja? [18:33:49] yep, didn't restart them [18:35:11] cool [18:35:14] ja thanks elukey, looks good [18:35:16] have a good eve! [18:35:33] you too! [18:35:37] Analytics-Dashiki, Analytics-Kanban: Dashiki should load stale data if new data is not available due to network conditions - https://phabricator.wikimedia.org/T138647#2434182 (Nuria) p:Triage>Lowest [18:36:44] ottomata: i have some time today and just did the bit of research on service workers i wanted to do, should i grab the issue with logs and hdfs? [18:37:04] nuria_: sure if you like!
:) [18:37:06] ottomata: I can look to see if there is a config setting if you have not looked into that yet [18:37:12] ja, that would be swell :) [18:37:35] ottomata: ok, you said you deleted some logs [18:37:44] nuria_: ja, i deleted the hdfs user logs [18:37:48] but no logs for other users [18:37:53] ottomata: and now we are looking for a way to schedule deletion? [18:38:05] ja, ideally we'd have normal log rotate [18:38:11] like [18:38:19] delete logs older than 6 months [18:38:20] or something [18:38:28] or 3 months [18:38:28] whatever [18:38:30] but, that is hard [18:38:42] because it's hdfs, you can't just look at file mtime with usual tools [18:38:48] right [18:38:51] and, these directories don't have any dates in their names [18:39:06] so, either, we parse the output (or use some hdfs client tool that just gives us) to get the dates on the dirs [18:39:07] or [18:39:09] we just make it dumber [18:39:10] and say [18:39:15] only keep the latest 1000 directories [18:39:28] the directories do have incremental job ids in them [18:39:35] so we can at least sort them [18:39:53] so, either way [18:39:55] doesn't really matter [18:40:05] it might be nice to have a smart time-based hdfs rotate script [18:40:09] probably not that hard to do [18:40:37] ottomata: i see, logs are under: /wmf/data/tmp? [18:41:05] no [18:41:16] /var/log/hadoop-yarn/apps/$username/logs [18:41:17] i think [18:41:18] in hdfs [18:41:26] i put that in the ticket i think [18:41:37] ottomata: ah yes, sorry.
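A time-based rotate script along the lines ottomata describes might look like the following sketch. It is hypothetical, not a deployed script: it assumes the spotify/snakebite client library and the /var/log/hadoop-yarn/apps layout mentioned above, and that snakebite reports modification times in milliseconds since the epoch; only the age filter is real logic.

```python
#!/usr/bin/env python
# Sketch of a time-based HDFS log cleanup (hypothetical, not deployed).
import time

def dirs_older_than(entries, max_age_days, now_ms=None):
    """Given (path, modification_time_ms) pairs, return the paths whose
    modification time is more than max_age_days in the past."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff = now_ms - max_age_days * 24 * 3600 * 1000
    return [path for path, mtime_ms in entries if mtime_ms < cutoff]

if __name__ == '__main__':
    # snakebite exposes per-entry metadata (including modification_time)
    # without hand-parsing `hdfs dfs -ls` output.
    from snakebite.client import Client
    client = Client('namenode.example', 8020)  # hypothetical namenode
    entries = [(e['path'], e['modification_time'])
               for e in client.ls(['/var/log/hadoop-yarn/apps'])
               if e['file_type'] == 'd']
    for path in dirs_older_than(entries, max_age_days=90):
        print('would delete: %s' % path)  # hook real deletion in here
```

Run from a plain cron, per the discussion below, this would need no oozie coordination at all.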
[18:42:01] Analytics-Cluster, Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2434265 (Nuria) [18:42:04] nuria_: hdfs dfs -stat does give you a date for a file without parsing [18:42:11] i betcha one of the hdfs python libs abstracts that nicely [18:42:22] this, https://hdfscli.readthedocs.io/en/latest/ [18:42:22] or [18:42:23] maybe [18:42:25] https://github.com/spotify/snakebite [18:42:59] https://snakebite.readthedocs.io/en/latest/client.html#snakebite.client.Client.stat [18:43:21] or maybe [18:43:22] https://hdfscli.readthedocs.io/en/latest/quickstart.html#exploring-the-file-system [18:43:52] that is, unless you find a nice config setting that just does this automatically [18:44:08] nuria, this is called something like 'application log aggregation' [18:44:09] ottomata: so, if i do not find a config setting [18:44:11] yarn log aggregation [18:44:12] or something [18:44:23] ottomata: the idea would be to run the python script via oozie [18:44:28] or a plain cron? [18:44:31] nuria_: probably not via oozie, just a cron [18:44:37] ottomata: ok [18:45:19] ottomata: which if i remember right is the way we run other things like the stats for refinery, correct? [18:45:38] the webrequest stats?
naw those are oozie [18:45:44] nuria_: the distinction is usually [18:45:55] if you want to launch based on data frequency and existence in hdfs, then oozie is great [18:46:02] if you just want to run something periodically, cron is easier [18:46:24] oozie would be a pain for this [18:46:27] you just want a simple script [18:46:48] k [18:46:49] that takes as args an hdfs path and a time period [18:46:56] or a number of files [18:46:58] whichever [18:47:07] and it will do the right thing [18:47:13] can run manually or via cron in the same way [18:47:58] Analytics-Cluster, Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2434296 (Nuria) Logs are at: /var/log/hadoop-yarn/apps/$username/logs Path to action: - try to see if there is a yarn level setting that enables some kind of log rotate - otherwise run a p... [18:48:10] ottomata: ok, let me do some digging to see if i find a config setting of some sort [18:48:16] k cool [18:48:18] thanks nuria_ [18:48:31] Analytics-Cluster, Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2434297 (Nuria) a:Nuria [18:49:36] mforns_away, milimetric : all done with the service workers research i wanted to do, let me know when you have 5 minutes and i can brain bounce with you. I have deprioritized the item but I will probably work on it on my free time. [18:59:45] milimetric: can you cancel retro 7/19? I will schedule an end of quarter retro next week. [18:59:59] sure [19:26:20] ottomata: let me know if this sounds like it could be it: [19:26:24] https://www.irccloud.com/pastebin/CZeecu7W/ [19:26:27] cc joal [19:27:09] nuria_, milimetric, I'm back in case you want to brain bounce [19:27:21] milimetric: yt? [19:27:36] yes, batcave?
sure [19:28:40] it will not take long [19:29:07] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434498 (Ottomata) OH man, I know exactly the problem. input [encoding=json] kafka t... [19:29:14] cc mforns batcave? [19:29:43] omw just a sec nuria_ [19:29:44] hmmm, nuria_ it might be, but i think from the description it is not [19:30:10] ottomata: " will delete the application's localized file directory [19:30:11] and log directory" [19:30:11] AHHH [19:30:12] but there is one! [19:30:17] yarn.log-aggregation.retain-seconds [19:30:22] How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node. [19:30:24] default is -1 [19:30:42] i saw that one on here: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml [19:30:50] ja! [19:30:52] that is it [19:37:40] https://www.irccloud.com/pastebin/2EBagDMb/ [19:37:53] ottomata: I think we have log aggregation enabled, let me see [19:38:05] ottomata: yes, we do: [19:38:07] 182 <property> [19:38:07] 183 <name>yarn.log-aggregation-enable</name> [19:38:08] 184 <value>true</value> [19:38:08] 185 </property> [19:38:20] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434548 (Jgreen) We also uncovered a second issue that deployed yesterday: https://g... [19:39:28] ottomata: why do you think the other one might not work? ah cause it talks about application logs and not user logs?
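For reference, the setting identified above would amount to something like the following in yarn-site.xml. The retention value here is illustrative, not what was deployed; the Hadoop default of -1 keeps aggregated logs forever.

```xml
<!-- yarn-site.xml sketch: log aggregation must be on for the
     retain-seconds setting to apply -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <!-- e.g. 90 days, expressed in seconds -->
  <value>7776000</value>
</property>
```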
nuria_: yes, we do [19:39:36] yes, they are different [19:39:38] so [19:39:40] when the job is running [19:39:55] each container across all different nodes logs things to the local fs [19:39:57] not to hdfs [19:40:14] that is what the first setting you mentioned is about...(maybe they are in hdfs too, but those refer to logs of running jobs) [19:40:34] log aggregation takes all of the logs from the different containers after the job has finished, and puts it into /var/log/hadoop-yarn/apps/... [19:40:43] ajam [19:40:47] in hdfs [19:40:57] we mostly use that for debugging jobs [19:41:02] it's really hard to get logs while a job is running [19:41:07] usually have to wait til it is finished, or it dies [19:41:12] ok, i see, will dig some more [19:41:18] so, nuria_, i think you have found it [19:41:20] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434555 (CCogdill_WMF) Wow, thanks for finding the answer! So I'm guessing this isn't... [19:41:28] ottomata: seems that we cannot use "yarn.log-aggregation.retain-seconds" as it requires aggregation being off [19:41:40] yarn.log-aggregation.retain-seconds is exactly what we want [19:41:45] no [19:41:47] wh? [19:41:50] ottomata: ya, see [19:42:05] ottomata: "Time in seconds to retain user logs.
Only applicable if log aggregation is disabled" [19:42:18] nuria that is [19:42:19] "yarn.nodemanager.log.retain-seconds" [19:42:19] not [19:42:26] "yarn.log-aggregation.retain-seconds" [19:42:32] we have log aggregation on [19:42:32] so [19:42:34] yarn.log-aggregation.retain-seconds will work [19:43:42] ajam [19:44:38] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434574 (Ottomata) @CCogdill_WMF it should be about that. Webrequest logs are sprea... [19:45:07] ottomata: totally, ok, updating puppet and crossing fingers [19:45:18] ottomata: sorry i got those two mixed up [19:46:28] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434578 (Ottomata) Well, let me caveat that. You have been receiving half of all of... [19:46:29] np, nice find nuria_ [19:46:39] ottomata: we will need to add two settings [19:46:44] i think that will take a cdh module update and an ops puppet update [19:46:48] ottomata: one for check and one for ttl [19:46:55] naw, you can leave check as it is i think [19:46:59] unless you really want to parameterize it [19:47:00] i think it's fine [19:47:04] we probably won't ever change it [19:47:10] all we care about is that old logs eventually get deleted [19:47:14] regularly [19:47:22] the period at which they are deleted isn't so important [19:47:34] ottomata: k, we will also need to restart the namenode for this change to take effect [19:47:46] ja that's fine, we can do whenever [20:14:29] did one of you call me by any chance?
I got a call from the office [20:15:08] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2434715 (awight) @Ottomata Fantastic (and horrific!), thanks for the info. That's go... [20:16:32] hm, not i [20:16:37] but i am not in the office! [20:24:46] ottomata: my you're looking well today [20:24:51] ottomata: have you been working out? [20:25:58] urandom: ? [20:26:00] :) [20:26:33] ottomata: how are you? :) [20:26:38] hah [20:26:46] am well indeed [20:26:47] how are you? [20:27:00] outstanding; thanks for asking! [20:27:06] though... [20:27:16] * urandom clears his throat [20:27:39] i could be quite nearly perfect, if i only i could get a bit of a merge [20:28:02] haha oh well i am so glad that I might be able to help one achieve perfection [20:28:12] oh! really!? [20:28:26] what a coincidence! [20:28:34] this I assume? [20:28:34] https://gerrit.wikimedia.org/r/#/c/297645/ [20:28:39] aye [20:28:45] gimme the risks! [20:28:49] 1009 is prod, ja? [20:28:53] i have no context here [20:28:54] heh, yeah [20:28:59] should i just do it? [20:29:03] well, the first one went REAL BAD [20:29:26] and so there was a lot of heroic effort to get things sorted [20:29:38] and then a bit of back-to-the-drawing-board [20:29:56] but all of them since have gone off flawlessly [20:30:10] 17 in total today [20:30:55] ah ok [20:31:04] worst-case, things go badly and we lose a node [20:31:12] but there is lots of redundancy [20:31:13] so, you've done this on other hosts already, and it is feeling pretty regular and normal by now? [20:31:19] yeah [20:31:24] * urandom knocks on wood, hard [20:31:52] yeah, this just enables the 2.2 config [20:32:05] i have puppet disabled, so it will do nothing until the trigger is pulled [20:32:20] and then i'll do each on that host, 1-by-1 [20:32:30] well... 
there are only 2 instances on this one [20:32:55] it's as routine as it's going to be, yeah [20:33:20] and it would be pretty weird to encounter a serious issue at this point [20:33:54] ok [20:33:55] merging [20:34:04] ottomata: thank you sir! [20:34:19] merged. [20:34:23] \o/ [21:18:45] nuria_: not sure you saw, but i commented on your change [21:45:31] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, and 2 others: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2435093 (DStrine)