[03:47:16] PROBLEM - Hadoop DataNode on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:00:47] RECOVERY - Hadoop DataNode on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[06:24:20] mmmmm
[06:24:50] this is the second one in the past two weeks
[07:06:05] Analytics, Editing-Analysis, Editing-Department, Flow, MediaWiki-Page-editing: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416933 (Amire80)
[07:07:30] elukey: thanks for rounding up the investigation of memcached. I'm sorry I haven't had a chance to reply yet.
[07:07:54] Analytics, Collaboration-Team-Interested, Community-Tech, Editing-Analysis, and 3 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416945 (Amire80) Tagging also Collaboration-Team and Community-Tech with the hope that it will interest the...
[07:08:21] Analytics, Collaboration-Team-Interested, Community-Tech, Editing-Analysis, and 3 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416947 (Amire80)
[07:08:52] ori: o/ hope that makes sense, huge delay from my side due to other things happening :) I noticed that 1.4.27 has bugs and 1.4.28 will be out too, so the logging functions might need a bit of time before being prod ready
[07:09:26] Analytics, Collaboration-Team-Interested, Community-Tech, Editing-Analysis, and 3 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416933 (Amire80)
[07:11:46] Analytics, Collaboration-Team-Interested, Community-Tech, Editing-Analysis, and 6 others: statistics about edit conflicts according to page type - https://phabricator.wikimedia.org/T139019#2416949 (Amire80) Finally, tagging TCB-Team, German-Community-Wishlist and Community-Wishlist-Survey, becaus...
[07:24:29] hallo
[07:24:33] o/
[07:24:43] is there a way to get yesterday's date in a beeline query?
[07:25:07] the big query at https://www.mediawiki.org/wiki/Universal_Language_Selector/Design/Interlanguage_links/metrics has "day = 10", which I have to change every day.
[07:25:13] I run it once a day.
[07:25:17] For the day before.
[07:25:41] If there's a function that gives me the number of yesterday's day, it would save me a few seconds every day.
[07:25:54] So today I had to change it to "day = 29", for example.
[07:26:57] I'd love to have something along the lines of "day = getdate( yesterday )".
[07:26:59] no idea, I don't usually work much with beeline.. but Joseph or Marcel should be online during the next 2/3 hours, if you're not in a hurry :)
[07:27:11] no hurry
[07:27:18] super, I'll ping them!
[07:27:33] another q: when is the reboot on stat1002?
[07:28:18] aharoni: we use oozie for date setup, but I'm sure you can make it work using some of Hive's functions
[07:28:28] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
[07:28:30] goooood morning :)
[07:28:38] elukey: o/
[07:29:00] network acls are in place, I am talking with Moritz to merge the ferm rules
[07:29:12] elukey: you rock :)
[07:29:20] elukey: at switch level as well?
[07:30:15] joal: ah, reading. one of these should help...
[07:30:16] yep these ones are in place, I only need to merge my change for Ferm
[07:30:37] awesome, I need to thank Alex :)
[07:35:21] what's the result wrt 9160, can we drop the rules again?
[07:38:50] joal, elukey - thanks, that page was helpful. What I need is day(date_sub(from_unixtime( unix_timestamp() ), 1)). A tad convoluted, but works.
[07:38:53] moritzm: I am about to clean up the AQS role, thrift has already been removed so nothing is listening on 9160
[07:40:01] ok
[08:06:59] is there a way in beeline to send a SELECT query's output into a file?
[08:10:26] aharoni: sorry I didn't answer - I am planning to reboot the hosts in one hour more or less
[08:10:37] I'll update the email thread when I start
[08:11:32] aharoni: http://stackoverflow.com/questions/14289846/hive-query-output-to-file, from https://www.google.fr/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=hive%20write%20file
[08:44:09] joal: thanks, but it doesn't seem to work:
[08:44:15] INSERT OVERWRITE DIRECTORY '/home/amire80/hello.txt' select 'hello';
[08:44:25] gives me "Error: Error while compiling statement: FAILED: NullPointerException null (state=42000,code=40000)"
[08:45:29] aharoni: INSERT OVERWRITE DIRECTORY writes to HDFS, use INSERT OVERWRITE LOCAL DIRECTORY for a local dir
[08:47:57] INSERT OVERWRITE LOCAL DIRECTORY '/home/amire80/hello.txt' select 'hello';
[08:48:01] gives me the same result.
[08:55:15] Another approach aharoni: beeline --outputformat=csv2 -e "SELECT 'hello';" > output.csv (from https://community.hortonworks.com/questions/25789/how-to-dump-the-output-from-beeline.html)
[08:57:27] joal:
[08:57:28] elukey@analytics1039:~$ telnet aqs1004-a.eqiad.wmnet 7000
[08:57:29] Trying 10.64.0.126...
[08:57:29] Connected to aqs1004-a.eqiad.wmnet.
[08:58:05] elukey: YAYYYY !
[08:58:10] elukey: Thanks a lot mate :)
[08:58:18] Need to update some code, then will try
[08:58:24] I am going to keep puppet disabled on aqs100[123] JUST IN CASE
[08:58:30] huhuh :)
[08:58:42] they are still running with the old ferm rules
[09:06:58] all right Thrift completely removed
[09:07:10] (except aqs100[123] since puppet is disabled)
[09:08:14] elukey: Maaaan ... I knew something was wrong in there
[09:08:59] elukey: cassandra-2.1.13 uses thrift for bulk preparation, but CQL for regular querying !!!!
[09:09:18] hm, I'll need to change some code in there
[09:12:28] elukey: What is our plan to migrate to 2.x?
[09:13:23] 2.x??
[09:13:30] 2.2.x sorry
[09:13:36] ahhh okok
[09:13:52] elukey: any time for a little batcave (for more details sharing and braindumping)
[09:13:54] not sure, I think that initially we wanted to postpone it until after the cluster migration
[09:13:56] ?
[09:13:59] sure!
[09:30:48] urandom: Hello !
[09:31:05] joal: hi!
[09:31:39] * elukey commutes to the office
[09:31:58] urandom: do you have some time today, or are you in team-offsite mode?
[09:32:11] yes?
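A sketch pulling the two beeline tips above into one command, for reference. The table and columns (wmf.webrequest, uri_host) are placeholders rather than what the interlanguage-links query actually uses; only the date arithmetic and the --outputformat=csv2 trick come from the discussion:

```bash
# Compute "yesterday" inside the query itself and dump the result as CSV.
beeline --outputformat=csv2 -e "
  SELECT uri_host, COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE year  = year(date_sub(from_unixtime(unix_timestamp()), 1))
    AND month = month(date_sub(from_unixtime(unix_timestamp()), 1))
    AND day   = day(date_sub(from_unixtime(unix_timestamp()), 1))
  GROUP BY uri_host
  LIMIT 20
" > yesterday.csv
```

One caveat: unix_timestamp() is evaluated at run time, so Hive may not prune partitions on these predicates and could end up scanning more data than a hard-coded day = 29 would.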
[09:32:14] :)
[09:32:29] huhu, a question answered by a question :)
[09:32:58] i might have some time, at some point, but am currently in team-offsite mode
[09:33:21] urandom: we're not going to bother you then, it can wait until you're back :)
[09:33:27] also, that depends on the amount of time :)
[09:33:39] urandom: questions about versions, migration, etc
[09:34:28] urandom: we realised with elukey that, since we want/need to reload the full dataset, maybe going straight to 2.2.6 could be wise, instead of having a migration later on
[09:35:00] urandom: this is basically the thinking :)
[09:35:29] urandom: So, it can wait until you're back :)
[09:35:58] joal: i have some thoughts on this, but yeah, let me collect them and ping you later
[09:36:09] so that i can do it justice
[09:36:16] urandom: please enjoy your time at the offsite, we'll wait :)
[09:58:08] joal: since I'll need to reboot analytics1003 it might be wise to stop oozie/camus
[09:58:16] do you have any preference since we are around the hour?
[09:58:35] should I wait a bit, stop, and leave the current jobs to finish?
[09:58:41] elukey: the earlier, the better :)
[09:59:00] elukey: peak for us is around 19:00 UTC
[09:59:09] between 17 and 19 UTC
[10:05:52] elukey: interestingly, now that we have closed thrift, sstableloader doesn't work ;)
[10:06:38] elukey: The 2.1.13 version is in between pure CQL and thrift ... Shame
[10:10:25] :/
[10:17:03] joal: ok so I am going to stop camus and oozie then
[10:21:32] camus and oozie stopped, will wait a bit for jobs to finish before restarting analytics1003.
[10:21:41] Going to proceed soon with stat1004
[10:25:56] stat1004 is up and running
[10:26:00] proceeding with stat1003
[10:28:16] aharoni: going to reboot stat1002, I noticed that you are still logged in
[10:28:19] need more time?
[10:28:32] Oh, np
[10:28:38] now is an excellent time
[10:28:53] super :)
[10:30:18] Rebooted stat1002
[10:40:09] ok stopped oozie, hive-* and mysql daemons on analytics1003
[10:40:13] host rebooting
[10:42:48] all up and running
[10:42:54] let's see how things start behaving
[10:44:36] hue didn't like the restart
[10:44:39] sigh
[10:44:41] checking
[10:57:10] so nothing in the logs, but if I log in with incognito I can see some mysql errors
[11:03:23] so the db on 1003 is up and running
[11:03:24] weird
[11:07:51] ahhhh maybe I got the problem
[11:15:37] no it is not db permissions
[11:16:30] I can connect from 1027 to 1003
[11:16:31] mmmm
[11:27:47] hallo
[11:27:53] is stat1002 ready for use?
[11:28:46] elukey: sorry was away for a bit
[11:28:52] elukey: how is hue going?
[11:29:25] hm elukey, have you restarted oozie?
[11:30:43] yeah
[11:30:58] stopped everything before that
[11:30:59] hm, weird, no oozie jobs on the cluster
[11:31:09] yeah I stopped all the bundles :)
[11:31:18] Arf, ok :)
[11:31:42] the weird thing is that I can see only a mysql error when I log in incognito
[11:31:48] and no relevant errors in the logs
[11:31:59] elukey: could it be the LDAP sync issue ?
[11:32:31] elukey: looks like I can't even get to the login portal
[11:33:38] yeah it says 500, but afaiu it seems to be because it can't retrieve any mysql data
[11:34:01] but I checked and I can connect from 1027 to 1003 with mysql -u hue -h analytics1003.eqiad.wmnet -p
[11:34:19] :(
[11:34:52] do you have more details about the LDAP sync issue that you were referring to?
[11:34:56] never heard of it
[11:35:33] elukey: It was an error we had before having the MySQL DB: data was stored in RAM, particularly user data (with imported ldap creds)
[11:35:46] But you could access hue (even if you couldn't log in)
[11:35:52] Now it seems a bit more radical
[11:36:18] elukey: Have you tried to restart the hue server on 1027?
[11:36:27] tons of times :P
[11:36:31] It might have died because of mysql being away for a bit
[11:36:43] Would have guessed so, but preferred to check
[11:38:15] set error.log to debug, let's see if something changes
[11:39:02] nothing
[11:40:14] elukey: looks like it doesn't try to connect to mysql
[11:40:30] elukey: batcave?
[11:41:34] give me a minute, I am going to try to caffeinate my brain
[11:41:38] might be the reason
[11:41:43] :)
[11:46:56] aharoni: all good :)
[12:42:09] elukey: thinking of that: it would be a good thing to send an email to analytics (and even engineering if we don't manage to solve the thing)
[12:42:23] yep sure
[12:42:43] I think that Jaime is out for lunch, should be back soon.. otherwise I'll ask
[12:54:29] joal: I am wondering how everything keeps going
[12:54:38] ?
[12:54:48] we have all the oozie, hive, etc. tables on this mysql instance
[12:54:53] no?
[12:55:02] correct
[12:55:14] I tested hive, seems queryable
[12:55:20] so oozie for example should use them a lot to store its state
[12:55:27] probably not writable, but at least queryable
[12:55:46] oozie is not to be restarted before we find a solution, that is correct
[12:56:10] This is why I suggested an email actually: this is a pretty major outage if we don't solve the thing reasonably fast
[12:56:56] :(
[12:57:10] hm hm ... Agreed ;(
[12:59:40] elukey: if there is one good thing about that: We don't receive misc-loading alerts anymore !!
[12:59:44] :D
[13:00:26] hm ... A bit of a forced laugh
[13:02:37] ahahaha
[13:02:37] elukey: however one not so bad thing is that camus is running
[13:02:58] Jaime is online, chatting with him
[13:03:08] we'll have loading-refine and next to catch up, but not the camus importation
[13:03:22] ok great
[13:06:52] joal: can we restart the mysql instance or should we stop oozie first?
[13:07:05] elukey: I think you should stop oozie and hue first
[13:07:16] all right, doing it
[13:13:23] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap
[13:17:06] hello https://hue.wikimedia.org/oozie/list_oozie_bundles/
[13:17:09] welcome back
[13:17:23] elukey: Yay :) Well done mate !
[13:17:44] elukey: so, mysql upgrade?
[13:18:14] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap
[13:19:00] joal: yes, Jaime fixed it, it was probably sitting there waiting for me to destroy it
[13:19:20] :D
[13:19:23] also the puppet code has read_only = 1 on restart
[13:19:33] Wow that's bad
[13:19:36] no?
[13:19:43] so we might want to fully productionize that DB
[13:19:50] +1
[13:24:39] joal: https://hue.wikimedia.org/oozie/list_oozie_bundles/ seems to say that all the bundles are suspended..
[13:24:45] shall we re-enable them from hue too?
[13:25:24] elukey: I couldn't re-enable them via the CLI due to mysql being read-only
[13:25:37] So yes, we should resume them :)
[13:26:12] done!
[13:28:52] Great
[14:03:39] hey europe-based folks :] have you been having ssh latency problems today?
[14:05:19] mforns: o/ to stats hosts by any chance?
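For reference, the read-only state Jaime found can be spotted from any client host; a minimal sketch, assuming the same hue credentials used in the connection test above (the real fix belongs on the DBA/puppet side):

```bash
# Returns 1 if the instance came back read-only after the reboot, in which
# case writes from oozie/hue fail; 0 means writes are allowed.
mysql -h analytics1003.eqiad.wmnet -u hue -p -e 'SELECT @@global.read_only;'

# The admin-side fix (better done by changing the puppet template so the
# daemon no longer starts with read_only = 1):
#   SET GLOBAL read_only = 0;
```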
[14:05:29] elukey, stat1003 yes
[14:05:40] I rebooted all of them :)
[14:06:21] are they still going through the reboot process?
[14:06:43] nope, they should be fine
[14:06:59] works fine for me
[14:07:15] might be a connectivity issue?
[14:07:25] elukey, ok, thanks! it's a bit slow for me, but I can manage
[14:07:40] yes... seems like that, that's why I asked, must be something local here
[14:08:04] thanks :]
[14:15:39] elukey: I managed to get better errors out of cassandra now
[14:15:48] Still not working, but better
[14:16:32] super :)
[14:23:09] joal, elukey: o/
[14:23:14] heya urandom :)
[14:23:33] i might have a min or two
[14:23:37] * urandom looks around
[14:23:46] huhuhu
[14:24:44] so urandom, we were thinking about the migration scheme with elukey
[14:25:25] yeah? you mean re: going straight to 2.2.6 now?
[14:25:32] urandom: on the new cluster
[14:25:37] right
[14:25:46] urandom: Idea is: We have the old cluster running 2.1.13
[14:25:55] And the new one running the same, but no link
[14:26:21] Since we plan to reload the full dataset onto the new one instead of integrating it slowly into the old one, maybe going straight to 2.2.6 makes sense?
[14:26:30] yeah, maybe
[14:26:45] you should know that we had some issues with 2.2.6 in restbase
[14:26:59] i think we are past that now, but... let me get the ticket
[14:27:13] urandom: arf ... Migration related, or restbase related?
[14:27:24] https://phabricator.wikimedia.org/T137419
[14:27:40] joal: well, to be honest, i haven't gotten to the bottom of it yet
[14:27:58] in 2.2.6, a change was made to allow mmap'd decompression reads
[14:28:23] and for reasons that i don't yet understand, that failed miserably
[14:28:48] now, we have a work-around, and that work-around is essentially to disable that one new feature
[14:29:14] so with that work-around, there is no regression
[14:29:40] that was one issue...
[14:30:00] the second issue: https://phabricator.wikimedia.org/T137474
[14:30:22] the second issue was more annoying; less serious
[14:30:46] basically, the histogram metrics were... not useful
[14:30:52] I don't get the second thing
[14:31:52] well, I see the issue, but don't understand how this can even happen
[14:31:55] so the quantile values in the histograms, say 99p, lacked any recency bias... you'd get the 99p as calculated across the entire time the service was up
[14:31:56] urandom: --^
[14:32:16] Ok, makes sense
[14:32:24] (kind of)
[14:32:52] you might find the referenced upstream ticket useful in understanding the issue, if you want to read more
[14:33:02] I have looked at that yes
[14:33:07] basically, someone made a stupid call (IMO)
[14:33:09] says resolved in 2.2.8
[14:33:10] it will be fixed
[14:33:16] oh?
[14:33:36] Actually, says unresolved, but fix version 2.2.8
[14:33:38] ah, someone attached a fix
[14:33:43] i should test that
[14:34:00] yeah, it will be fixed, and i have a patched Debian package in the meantime
[14:34:05] that is the TL;DR
[14:34:11] right
[14:34:18] https://people.wikimedia.org/~eevans/debian/
[14:34:54] so issue one is 'fixed' (worked-around) in config, already set up as the default in Puppet
[14:34:57] My between-the-lines understanding is that we should keep going with 2.1.13 until you've solved all the mysteries :)
[14:35:06] and the second issue is 'fixed', by installing my patched version
[14:35:26] joal: yeah, depends on your level of risk aversion
[14:35:54] the other way to look at it is that you've got the ability to test it now, before it goes into production
[14:35:54] urandom: I'm not too afraid of risk, but some of my opsy friends wouldn't like me to make decisions :)
[14:36:00] i didn't have that luxury :)
[14:36:15] urandom: makes complete sense
[14:36:22] and you might save yourself the upgrade later
[14:36:23] I think we should do that actually
[14:36:36] urandom: the upgrade was tough as well, was it?
[14:36:49] shouldn't be
[14:37:16] it was very tough for me :)
[14:37:20] hm, I have noticed a difference in sstable format for instance
[14:37:29] which makes me fear a bit
[14:37:30] yes
[14:38:32] that's probably more of a concern for an upgrade
[14:38:45] in that it complicates a rollback
[14:39:05] you can snapshot, and if things go horribly wrong, you can roll back to the snapshot
[14:39:05] right
[14:39:28] in your case, i guess writes happen once a day, so you'd have that long of a window to do so without data loss
[14:39:29] urandom: our idea is also to keep aqs100[123] (the old cluster) in sync for some time, to flip back in case of trouble
[14:39:32] shouldn't be too hard
[14:39:40] ah yeah, nice
[14:40:08] elukey: in any case, urandom is right in saying that it's a luxury to be able to test a new version outside of prod
[14:40:23] elukey: We should go for that test just for that reason
[14:41:03] i know you're also eager to get the new machines into production, so maybe the biggest risk is that things go badly and delay the roll-out
[14:44:39] urandom: in any case the loading will be long, so I'm ready to lose a week or two
[14:45:11] joal: ok, the import will take a week or two?
[14:45:41] joal: how did sstableloader work out, btw?
[14:51:01] joal: so we should test 2.2.6 on aqs100[456] + bulk load and decide later on?
[14:52:00] elukey, joal: we have exactly one host (3 instances) upgraded in restbase at this time
[14:52:14] and we paused to sort out the issues
[14:52:45] and then of course came wikimania + the services offsite, which pushed things back a bit
[14:52:57] but we're ready to move forward again
[14:53:07] and restbase1007 has been running well now for a while
[14:53:29] \o/
[14:53:39] so, TL;DR while it was a rougher ride than i would have hoped, i think we're in good shape now
[14:54:00] i don't anticipate you'd have any troubles
[14:54:23] of course, i didn't anticipate that we would have either, so what do i know? :)
[14:54:36] :)
[14:54:57] completely unrelated joal: no timeouts with varnishlog -g request -q 'VSL ~ timeout and not ReqHeader:Upgrade ~ Websocket and not ReqHeader:Upgrade ~ websocket' -n frontend -T 1800
[14:55:14] ok, i'm going afk for about 30 minutes whilst we relocate...brb
[14:55:17] elukey: YAY :)
[14:55:25] urandom: will give you updates
[14:55:30] joal: kk!
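A sketch of the snapshot safety net urandom describes, ahead of a 2.1.13 to 2.2.6 move. The keyspace name below is a placeholder, and with the multi-instance setup nodetool would have to be pointed at each instance's JMX port:

```bash
# Tag a snapshot on every aqs100[456] instance before upgrading; with one
# load per day there is roughly a one-day window to roll back without data loss.
nodetool snapshot -t pre-2.2.6 local_group_default_T_pageviews   # placeholder keyspace name

# Once the new version has proven itself, reclaim the space:
nodetool clearsnapshot -t pre-2.2.6
```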
[14:55:45] later urandom, thanks a lot for the advice
[14:55:54] elukey: So, what are your thoughts?
[14:56:10] elukey: I'm sure you can read mine, so better to ask for yours :D
[14:57:39] I would like to test 2.2.6 now to avoid a massive pain soon to upgrade, but I want to chat with Alex first to get his opinion
[14:57:50] elukey: agreed
[14:58:06] elukey: I'll let you manage the Alex-talk, but I'm willing to listen ;)
[14:59:41] probably an email is good :)
[14:59:47] will write one in a bit
[14:59:55] k great
[15:03:12] milimetric: o/
[15:03:20] hey
[15:03:31] milimetric: I have a question about druid loading productionisation
[15:03:41] ok, sure, chat in cave?
[15:03:45] sure :)
[15:04:21] hm, looks like hangout is dead
[15:04:25] milimetric: --^
[15:04:29] same for you ?
[15:04:32] no, I'm in
[15:04:33] weird
[15:04:36] appear.in?
[15:04:55] appear.in/wmf-batcave
[15:05:26] joal: ^ any luck with that?
[15:05:53] * elukey brb
[15:06:04] milimetric: nope
[15:06:30] ok, sounds like your compy / internets
[15:14:47] mforns: you wanna hang out with me for a bit on pageviews? appear.in/wmf-batcave
[15:15:19] (I'm curious how good appear.in and WebRTC have gotten since we last used them, but we can use the hangout if this breaks for you)
[15:15:38] milimetric, sure give me 2 mins
[15:15:47] ofc take your time
[15:19:49] milimetric, my internet is slow somehow, I'll try to reboot, brb
[15:19:54] k
[15:28:30] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[15:31:01] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[15:33:20] mmmm
[15:33:29] this happened this morning with 1039
[15:44:23] ah there you go
[15:44:25] 2016-06-30 15:22:59,761 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1552854784-10.64.21.110-1405114489661:blk_1150231688_76520718
[15:44:28] java.io.IOException: Premature EOF from inputStream
[15:45:19] 2016-06-30 15:07:42,892 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: analytics1034.eqiad.wmnet:50010:DataXceiver error processing unknown operation src: /10.64.36.132:33543 dst: /10.64.36.134:50010
[15:45:23] java.lang.OutOfMemoryError: Java heap space
[15:45:46] sad_trombone.wav
[15:47:34] :(
[15:48:11] tons of M&S pauses then
[15:48:12] 2016-06-29 20:56:38,162 ERROR org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/var/lib/hadoop/data/g/hdfs/dn, DS-74cf236d-5f3a-4cb3-b12a-29a3246c03f8) exiting because of exception
[15:48:17] java.lang.OutOfMemoryError: GC overhead limit exceeded
[15:48:58] and Yarn has -Xmx3276m
[15:56:39] Analytics, Analytics-Cluster: Java OOM errors kill Hadoop HDFS daemons on analytics* - https://phabricator.wikimedia.org/T139071#2418521 (elukey)
[15:56:49] created a task --^
[15:57:17] I can see a lot of long (seconds) GC pauses before the OOMs
[15:57:32] elukey: standard java behavior
[15:57:37] but only sometimes OOMs trigger alarms
[15:57:43] yeah
[15:57:49] File:sad_trombone.wav
[15:58:15] joal: might be worth prioritizing this, seems a bit bad
[15:58:18] elukey: I think it's the first time we've experienced that one (OOM on a datanode)
[15:58:29] elukey: agreed
[15:59:40] elukey: also, we are at 65% usage of hdfs ... just had a look: we should delete some logs and possibly ask some users
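A quick way to quantify how often this is happening across the workers and to confirm what heap the DataNode actually runs with; the log path is the usual CDH location and is an assumption here:

```bash
# Count OOMs and long-JVM-pause warnings in the DataNode log on a worker:
grep -c 'java.lang.OutOfMemoryError' /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
grep -c 'Detected pause in JVM or host machine' /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log

# Heap flags the running DataNode process was started with:
ps -ef | grep '[D]ataNode' | grep -o '\-Xmx[^ ]*'
```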
[16:00:01] elukey: hdfs cleaning should wait for andrew though
[16:00:48] elukey: let's do standup in appear.in/wmf-batcave
[16:01:35] related task: https://phabricator.wikimedia.org/T118501
[16:02:06] Analytics, Analytics-Cluster: Java OOM errors kill Hadoop HDFS daemons on analytics* - https://phabricator.wikimedia.org/T139071#2418550 (elukey) Related task https://phabricator.wikimedia.org/T118501
[16:10:08] Analytics, Analytics-Cluster: Java OOM errors kill Hadoop HDFS daemons on analytics* - https://phabricator.wikimedia.org/T139071#2418584 (elukey) Might be due to HDFS's datanode Xmx setting to 1GB? ``` hdfs 7994 11.9 1.0 1820504 699712 ? Sl 15:29 4:38 /usr/lib/jvm/java-1.7.0-openjdk-amd6...
[16:10:29] Xmx for the datanode is 1GB, may need to bump it up
[16:31:13] joal: spot checking https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen with analytics worker nodes I can see that we are not using tons of memory, maybe it is safe just to bump the datanode's Xmx to 2GB
[16:31:24] and see
[16:31:35] but it would be nice to track how many OOMs we get in a day
[16:32:31] anyhow, will check tomorrow :)
[16:32:34] o/
[16:32:37] elukey: I'd actually like to know how many OOMs we get, and maybe understand a bit of the reason
[16:32:39] going offline team! byyyeee
[16:32:54] well 1GB is not that much for the datanode
[16:33:01] But on the other hand, it's way cheaper to bump the heap space :)
[16:33:04] the Yarn ones have ~3GB each
[16:33:09] Yeah, we'll discuss that tomorrow :)
[16:33:13] Have a good evening !
[16:33:19] you too!
[16:33:22] byeeee
[16:33:25] Bye :)
[16:46:11] elukey, joal: I dunno if you guys are tracking the kibana 4 upgrade expected next week, but I created you a Cassandra dashboard here: https://kibana4.wmflabs.org/app/kibana#/dashboard/cassandra-aqs
[16:46:23] urandom_: We are not !
[16:46:25] hopefully it will work
[16:46:29] urandom_: thanks for the update
[16:46:42] can't test it until there are matching events
[16:46:50] urandom_: makes sense :)
[16:54:17] a-team, I'm gone for today
[16:54:23] See y'all tomorrow
[16:54:24] bye joal!
[16:54:29] see ya
[16:54:31] :]
[16:56:28] joal: ttfn!
[18:16:54] a-team: there's an eventlogging issue -- see https://grafana.wikimedia.org/dashboard/db/performance-metrics
[18:17:04] ori, looking
[18:18:22] hm, the EL graphs themselves look ok
[18:18:25] https://grafana.wikimedia.org/dashboard/db/EventLogging
[18:20:14] hmmm, yeah, I see navigation timing events
[18:21:04] the legacy zmq forwarder is using 36% of eventlog1001 memory, is this normal?
[18:21:44] doesn't sound normal, but it has been a long time since I looked at that
[18:22:24] what about a restart?
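Back on the DataNode heap thread from the afternoon: a sketch of where the knob lives, assuming the stock CDH layout. On this cluster hadoop-env.sh is templated by puppet, so the real change would go through puppet rather than a hand edit:

```bash
# Current setting; the DataNode falls back to HADOOP_HEAPSIZE unless
# HADOOP_DATANODE_OPTS overrides it.
grep -E 'HADOOP_HEAPSIZE|HADOOP_DATANODE_OPTS' /etc/hadoop/conf/hadoop-env.sh

# Bumping the DataNode from 1 GB to 2 GB would look like this in hadoop-env.sh:
#   export HADOOP_DATANODE_OPTS="-Xmx2048m ${HADOOP_DATANODE_OPTS}"
# followed by a rolling restart of the datanodes:
sudo service hadoop-hdfs-datanode restart
```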
[18:23:46] ori: so I am seeing both a lag in replication and an apparent problem replicating older events
[18:24:00] but it's not far enough behind to explain the weird graphs you sent
[18:24:10] right now we're about 30 minutes behind
[18:24:37] and 3% of events for some older timespans missing
[18:25:50] yeah, that would explain it, then
[18:26:04] coal accumulates events in a sliding window of 5 minutes
[18:26:10] anything outside of that just gets discarded
[18:27:28] also, no more validation errors than usual: https://logstash.wikimedia.org/#/dashboard/elasticsearch/eventlogging-errors
[18:28:19] yeah, it's the lag
[18:29:41] coal doesn't cope with that well; it keeps the sliding window by continuously pruning events that fall outside of its boundaries. But the boundaries are defined relative to the current time
[18:31:28] https://github.com/wikimedia/operations-puppet/blob/production/modules/coal/files/coal#L168-L174
[18:31:41] any idea why it's lagged?
[18:40:16] milimetric: could you look into that?
[18:42:46] ori: that could happen normally as dbs get rebooted, as that poor-man's replication script inserts, or any number of hiccups that jaime usually deals with. I wouldn't bet on the replication lag being perfect with this current setup.
[18:43:28] However, if you get your events out of Hadoop instead, that should be as up to date as the prod EL db
[18:43:30] coal subscribes to the zeromq stream; are the databases involved in that pipeline at all?
[18:43:45] i watch performance.wikimedia.org like a hawk and this has never been an issue before
[18:43:47] oh... that I have no idea how it works
[18:44:11] I thought you meant the db replication lag
[18:44:52] I actually thought we had turned zmq off
[18:45:07] the metrics are live again
[18:45:28] mforns: you rebooted the zmq forwarder?
[18:45:34] milimetric, no
[18:45:45] it is still using lots of mem
[18:45:53] hmmm
[18:52:21] milimetric, there are some logs in eventlogging_forwarder-legacy-zmq.log that indicate errors committing offsets to kafka
[18:52:28] in the last hour
[18:52:36] maybe the lag is in kafka
[18:53:50] that would make sense, and maybe explain the high mem too if it's hanging onto what it can't commit?
[18:54:38] but the kafka dashboard seems ok: https://grafana.wikimedia.org/dashboard/db/kafka
[18:54:47] milimetric, makes sense
[18:56:01] this would be a small blip on that dashboard, since that's the overall kafka cluster and this is like .2%
[18:56:08] (traffic wise)
[18:57:43] aha
[18:58:21] uh... it seems ok for now, I'll check on it from time to time.
[18:58:23] milimetric, the strange thing is that performance metrics are back but the forwarder seems stuck still
[18:58:28] right...
[18:58:42] should we restart EL?
[18:58:58] yeah, it shouldn't hurt anything
[18:59:10] we might lose the metrics that are buffered in memory
[18:59:21] but this is not critical I guess
[18:59:41] uh... could we just restart the legacy forwarder?
[18:59:53] kill it?
[19:00:05] I mean the metrics that are buffered in the legacy forwarder
[19:00:20] yeah, I meant let's not restart the whole thing so we only lose those
[19:00:22] the mysql consumer will be fine
[19:00:53] so you mean just kill 31438?
[19:01:05] mmmm I would restart everything
[19:01:33] how would the mysql consumer be fine? wouldn't what's in the buffer get lost? Or wait, you guys changed that?
[19:01:34] what do you think?
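As an aside on how the "36% of memory" figure above gets checked: a sketch using standard procps tools; the PID is the forwarder's from the log, everything else is generic:

```bash
# Largest resident-memory processes on eventlog1001:
ps -eo pid,rss,pmem,etime,args --sort=-rss | head -n 15

# Or just the legacy zmq forwarder, by PID:
grep -E 'VmRSS|VmSwap' /proc/31438/status
```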
[19:02:07] milimetric, well if the process gets restarted with the restart script, it is programmed to flush the buffered events
[19:02:13] oh cool
[19:02:16] yeah, I'll do that
[19:02:21] eventloggingctl restart
[19:02:36] I don't know what would happen if the mysql consumer gets killed by other causes.
[19:02:40] cool
[19:03:04] hm, that seemed to have done nothing :)
[19:03:24] maybe restart's not implemented, I'll just do stop / start
[19:03:35] milimetric, oh! yes
[19:03:40] restart does not work
[19:03:41] ok, that worked
[19:03:59] ok
[19:04:14] how'd you check the memory of the zmq forwarder?
[19:05:11] top, sort by memory, get the biggest process, ps aux it to check who it is
[19:05:13] dunno
[19:05:35] the log is OK now: Forwarding kafka:///kafka1013.eqiad.wmnet...
[19:06:00] thanks for investigating
[19:06:14] * ori uses https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py btw
[19:07:34] cool
[19:09:18] ori, milimetric, I think the problem we saw is related to: https://phabricator.wikimedia.org/T133779
[19:10:10] oh yeah, right, I forgot we bounced all the kafka boxes today
[19:10:15] I thought it was only analytics1***
[19:43:15] hallo
[19:43:47] I've been running a query in beeline, and it has been stuck at Stage-1 map = 36% for a few minutes
[19:43:58] so I was wondering whether it's still alive
[19:45:10] a few minutes is normal on the cluster
[19:45:24] if it's stuck there for close to an hour, it might be time to kill it and try again
[19:45:36] but usually those kinds of problems are due to the cluster hiccuping and not something in the query
[19:46:50] aharoni: ^ (and also, if it helps, the cluster has been hiccuping a bit today)
[19:47:10] hmmmm
[19:47:15] I'll wait a few more minutes, I guess
[19:54:22] milimetric: heh. it was worth the wait. it suddenly jumped to 71% after about half an hour.
[19:54:27] this server is weird-ish.
[19:54:46] progress bars: one of the hardest problems in CS :)
[19:54:51] (to me, I only started using it recently. I guess you are accustomed to its quirks?..)
[19:56:56] ... and stuck there again.
[19:57:24] I mean, we're working on better ways for people to access this kind of data
[19:57:41] Hadoop and MapReduce are pretty high level and complicated
[20:17:23] see you tomorrow a-team! logging off.. :]
[20:17:32] nite!
[20:18:34] milimetric: ... and to 86% after half an hour or so more ... so weird.
[20:19:15] aharoni: heh, yes, but you know, weirder things have happened :)
[20:25:25] milimetric: more weird: "Stage-1 map = 89%, reduce = 0%"
[20:25:39] in the previous row it was "Stage-1 map = 89%, reduce = 2%"
[20:25:52] for some reason I imagined that the % only goes up
[20:25:56] aharoni: no offense or anything, but it's ok, this is normal, as I said the progress bars here are weird
[20:26:04] it's *very* complicated under the hood
[20:26:13] and I have a lot of other things to work on
[20:26:45] it's also not entirely unexpected if the job fails; you can restart it
[20:27:07] it does sound like a pretty heavy job, so it might be a good idea to sample or look at less data, or think of other ways to make it faster
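On that closing advice about sampling: a sketch of two easy ways to make this kind of query cheaper while iterating. Table, columns, and partition values are placeholders:

```bash
# 1) Restrict to a single hour partition instead of a whole day:
beeline -e "
  SELECT uri_host, COUNT(*) AS hits
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 6 AND day = 29 AND hour = 12
  GROUP BY uri_host"

# 2) Or aggregate over a random ~1/64 row sample (on a non-bucketed table this
#    still reads all the data, but it cuts the shuffle/aggregation work):
beeline -e "
  SELECT uri_host, COUNT(*) * 64 AS approx_hits
  FROM wmf.webrequest TABLESAMPLE (BUCKET 1 OUT OF 64 ON rand()) w
  WHERE webrequest_source = 'text'
    AND year = 2016 AND month = 6 AND day = 29
  GROUP BY uri_host"
```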