[03:31:16] average_drifter: the todo for clicktracking is to kill it, it suffers from fundamental design problems [03:31:20] we're hoping to do that this week [03:33:00] ori-l: that's interesting [03:33:06] ori-l: so a rewrite is imminent ? [03:34:11] yes and no. it'll take some time before functionality is at parity with clicktracking. some of the features it implements are useful and don't currently have full-featured alternatives [03:34:42] but yes, in that lots of its basic functionality is obviated by things in E3Experiments [03:35:45] ori-l: so E3Experiments is a group of projects ? [03:35:46] what's the nature of your interest? are you interested in hacking on feature testing code? (if so, yay, there's lots to do!) or are you the administrator of a mediawiki deployment that is using it? [03:36:08] a little bit, yeah [03:36:22] ori-l: I'm just a guy interested in the clicktracking extension, I've done some stuff with clicktracking, but more simplistic (heatmaps, in Perl) [03:37:13] cool! happy to answer questions [03:37:23] ori-l: I wonder if heatmaps is one of the objectives of clicktracking.. I know for example Erik Zachte likes visualization a lot, I saw loads of stuff on his blog about visualizing data [03:37:33] ori-l: so let's go back to something concrete, you mentioned testing code [03:37:57] ori-l: would writing some QUnit tests for the clicktracking extension be a good thing ? [03:38:44] again, i think it'll be disabled this week, so i'm not sure how useful that would be, but E3Experiments could certainly use more testing [03:38:45] ori-l: that would basically just cover the modules/jQuery.clicktracking.js and the stuff in modules/ [03:39:49] you're free to contribute tests to clicktracking too if you want, of course, but if you give me a better feel for what kind of stuff you're interested in i could help identify opportunities in the codebase that are likely to see wide deployment if implemented [03:41:01] ori-l: well first of all, let's start off with what I know, so basically I know Javascript and Perl and C/C++ [03:41:28] ori-l: these are my main things. I can read PHP but I'm not proficient in it, however due to my Perl background I could somehow transfer that to PHP [03:41:37] my wife and i are just putting my son to bed so i have to go for a bit, but if you can bear to wait, i'll respond [03:41:39] ori-l: about what I'm interested in, I like TDD a lot [03:41:51] ori-l: no problem, I'll be here [03:41:59] thanks! and sorry to nip out in the middle of a conversation like that [03:42:13] alright [03:43:32] so as I was saying, I like TDD a lot and if it's possible to write tests for the ClickTracking extension and if you know some ideas where that would fit in, I'd be glad to give that a try [03:43:44] but before that, I'd like to ask what a UserBucket is ? [03:44:01] also, why does E3Experiments extension depend on the Clicktracking extension ? [04:48:38] average_drifter: sorry for the delay [04:49:21] so, if you look at the SkelJS extension on gerrit (under mediawiki/extensions/SkelJS) you'll see sample code for qunit tests [04:50:56] because most javascript code has dependencies on the mediawiki environment, you can't run them using just qunit. 
there's a standard way to declare test modules to mediawiki, though, and that'll cause your tests to be included when the test suite is run, using Special:JavaScriptTest or something like that [04:51:29] i think we have a couple in the E3Experiments extension too, but we haven't exactly been shining exemplars of TDD [04:52:27] if you want to write tests for E3, i'd follow the pattern laid out by the couple of tests we have already. that'd be a huge service, really, and i'd be personally grateful [04:52:47] if you prefer to target the clicktracking extension, i'd follow the example laid out by SkelJS to add qunit tests [04:53:11] the docs at http://www.mediawiki.org/wiki/Manual:JavaScript_unit_testing are pretty good [04:53:51] i'd be happy to answer any questions, review code, or help out. [05:15:36] ok, cool, thanks ! how about tests on server-side ? [13:25:33] morning otto [13:26:41] morning! [13:31:08] howdy [13:31:14] i am up early, and mucking about. [13:31:38] woa that is early [13:31:45] yes well. [13:31:57] it is due to my amazing dedication to the cause and/or periodic insomnia. [13:32:15] so~ [13:32:20] zo! [13:32:25] afaict, all nodes are up. [13:32:42] here are a few useful commands: [13:33:20] oooo [13:33:21] ssh an01 -- nodetool -h localhost info [13:33:21] dsetool ring [13:33:22] yay! [13:33:24] ssh an01 -- nodetool -h localhost ring [13:33:31] oh, they wrapped it? [13:33:34] i will look into that. [13:33:34] yup [13:33:42] probably just adds hadoop-related commands [13:33:42] same output from dsetool ring [13:33:44] looks great [13:33:47] yeah [13:33:49] awesome [13:33:54] did you do anything to make them happy? [13:34:08] i was getting some reported down or missing or something, and sometimes nodetool / dsetool would timeout [13:34:37] it looks like around 8p UTC, an07 wedged itself. [13:34:57] the jsvc processes were still up, but it wouldn't respond to anything [13:35:30] i asked them politely to quit, they didn't; i asked them rather impolitely to fuck off, they did; i restarted the service; all was well [13:36:05] from past experience, i am pretty sure i know exactly why you saw churn when you started it up. [13:36:22] have you done any reading about the seeds? [13:36:35] basically, the whole system runs on gossip, so there's no single point of failure [13:37:06] but when a node joins a cluster for the first time, there need to be some well-known points of contact for it to bootstrap its knowledge [13:37:30] that's what the seeds are. (apparently this is now pluggable with a service call, which is neat, but not really relevant for us) [13:37:59] hmm [13:38:07] seeds as in the IPs it starts with? [13:38:14] they should all be seeded with all 10 IPs [13:38:30] so right now, we have all 10 there [13:38:33] which is bad [13:38:35] oh [13:38:43] because it creates a race. [13:38:54] everybody is trying to contact everybody else, in order [13:39:01] and everybody is failing, because, well, everybody is down. [13:39:20] eventually, some number of unlucky nodes will throttle and wait [13:39:41] morning guys [13:39:47] and probably a small kernel of gossip will start between (i'd guess) a few early nodes, and a few late nodes [13:39:55] hmmmm [13:40:00] so you'd see an01 come up and an08 come up, something like that [13:40:01] anyway [13:40:05] so we should have what, 1 seed per 3 nodes or something? [13:40:06] the solution! [13:40:27] you usually have 1/rack, 2-3 per dc [13:40:31] oh [13:40:38] so maybe we only need two right now then? [13:40:41] yes. [13:40:42] an01 and05? 
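A rough sketch of what trimming the seed list down to a pair of nodes looks like on disk; the cassandra.yaml path is an assumption (DSE's packaged layout), and the IPs are inferred from the an0N -> 10.64.21.10N numbering that shows up later in this log:

    # DSE's cassandra.yaml (path assumed: /etc/dse/cassandra/cassandra.yaml);
    # the seed list shrinks from all ten nodes to just the chosen pair
    grep -A4 'seed_provider:' /etc/dse/cassandra/cassandra.yaml
    # seed_provider:
    #     - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    #       parameters:
    #           - seeds: "10.64.21.101,10.64.21.106"   # e.g. an01, an06 (addresses assumed)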
[13:40:46] sure. [13:40:49] the other thing [13:40:49] an01 an06 [13:40:50] yeah [13:40:52] is to bring them up in order [13:41:02] what order? [13:41:03] seeds first? [13:41:07] let an01 and an05 come up, bootstrap, and chat up each other first [13:41:09] aye [13:41:13] then you can bring up the rest all at once. [13:41:31] ok cool, can change the seeds easy [13:41:53] since typical operation strives to avoid 100% down more than anything else, this case isn't exactly one of the main concerns. [13:42:24] you'll notice everything settled down eventually. i only had to restart one node, and i have no idea why it was wedged. [13:42:36] (logs showed nothing) [13:44:02] aye, i noticed 07 being weird too [13:44:22] lmk if you happen to notice anything else there [13:44:33] kinda worrisome, since there's absolutely nothing going on. [13:46:55] ottomata, you wanna make a user named "opscenter" on an01 only? [13:47:24] (or maybe show me how?) [13:47:43] (i am assuming `sudo adduser opscenter` is not an acceptable answer) [13:48:02] hm, why? [13:50:31] it runs as root otherwise? [13:53:15] dse? [13:53:18] doesn't it run as cassandra? [13:53:29] it does not. [13:53:41] not according to /etc/init.d/opscenterd, anyway [13:53:56] hmm, yup running as root, [13:54:00] let's make it run as cassandra [13:54:07] do as you will. [13:54:08] cassandra already exists and owns the data directories [13:54:15] well [13:54:20] this is just a monitoring webapp [13:54:23] what is opscenterd [13:54:24] ? [13:54:27] it'll connect an agent to each node [13:54:28] ohohh [13:54:29] opscenter [13:54:34] i thought you meant dse/cassandra [13:54:37] and collect instrumenting/perf/monitoring data [13:54:42] no no. [13:55:04] ah [13:55:16] is that running now? [13:55:44] the dashboard is, i think. [13:55:51] aye [13:55:53] http://analytics1001.wikimedia.org:8888/ [13:55:54] can I see it? [13:55:57] though that doesn't respond for me. [13:56:01] you have iptables running? [13:57:00] yeah you are supposed to proxy all web stuff on the cluster through an01:8085 [13:57:05] but it isn't working for me right now either.. [13:57:10] this prompt is about as confidence-building as "curl ... | bash": [13:57:11] http://www.datastax.com/docs/_images/agent_install_credentials2.png [13:57:34] can we open 8888 for now? [13:57:41] er. [13:57:42] fine [13:57:46] i'll set up a proxy. [13:57:58] i just dedicate opera to my cluster browser [13:58:00] and make it always proxy [13:58:06] :P [13:58:29] i don't get anything when I curl locally though [13:58:34] curl http://analytics1001.wikimedia.org:8888/ [13:58:42] 504 Gateway Time-out [13:58:47] hm. [13:58:48] oh i know [13:59:01] hm [13:59:01] HEAD http://analytics1001.wikimedia.org:8888/ [13:59:01] 500 Can't connect to analytics1001.wikimedia.org:8888 (connect: Connection refused) [13:59:51] hmm, i don't know [14:01:20] ack [14:01:21] hm [14:02:30] ok the proxy is working [14:02:43] :8888 is not [14:02:58] i just added a vhost to proxy 8086 to 8888... [14:03:43] well, the an01 8085 should just proxy everything nicely [14:03:51] so if you configure your browser to always use 8085 [14:04:01] then you should be able to enter any address and get it proxied through an01 [14:04:14] the cluster hosts all allow traffic from an01 [14:04:16] uh. [14:04:20] are we running a squid here? [14:04:33] no [14:04:36] i'm doing it with apache [14:04:43] also [14:04:49] you're logged into an01, right? 
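The kind of checks being poked at here boil down to a few one-liners (a sketch; only the 8888 port and the hostname come from the conversation):

    # who is opscenterd running as?
    ps -eo user,pid,args | grep [o]pscenterd
    # what address is it actually listening on?
    sudo netstat -tlnp | grep :8888
    # does it answer locally, and on the public name?
    curl -sI http://localhost:8888/
    curl -sI http://analytics1001.wikimedia.org:8888/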
[14:04:52] i can't HEAD or curl that port locally [14:04:57] 8888 [14:04:57] curl --url http://localhost:8888/ [14:05:11] you'll get a very ... interesting ... response [14:05:16] ah now it is working [14:05:19] well [14:05:20] HEAE is [14:05:22] HEAD is [14:05:26] buh? [14:05:26] curl shows nothing [14:06:46] i think maybe the python thing is not bound to the public IP [14:06:54] opscenter is not bound to public IP [14:06:59] cause I get 200 on localhost [14:07:05] 500 on analytics1001.wikimedia.org [14:07:10] tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN [14:07:14] hm [14:07:22] that does not look very public to me. [14:07:24] one sec. [14:07:35] aye [14:08:41] opscenter.conf [14:08:42] eh? [14:08:47] where? [14:09:07] /etc/opscenter [14:09:17] [webserver] [14:09:17] port = 8888 [14:09:18] interface = 127.0.0.1 [14:09:22] ...how [14:09:24] did i miss that [14:09:55] btw: service opscenterd restart [14:10:03] whenever you're done :) [14:10:34] i still want http://analytics.wikimedia.org/ :( [14:11:24] http://analytics1001.wikimedia.org:8888/opscenter/index.html [14:11:31] with browser 8085 proxy [14:11:45] should I go through this or do you want to? [14:11:52] i'm happy to [14:12:07] hm? [14:12:26] you mean, "dave, shut up and set up a socks proxy" [14:13:25] haha [14:13:25] no [14:13:33] the opscenter is asking questions [14:13:37] what are our hosts? [14:13:39] oh [14:13:46] however, i cannot get to it [14:13:58] i assume because i need to shut up and set up a socks proxy :) [14:15:03] it also wants to know our names [14:15:07] and about our first kisses [14:15:13] who are our parents? [14:15:20] why do we do things? [14:15:28] okay, "stfu-socks-proxy" running. [14:15:32] i don't see this wizard. [14:15:39] those are all excellent questions. [14:16:44] i need to eat a food, i think. man cannot live on espresso alone. [14:17:15] hopefully you'll be done recounting your romantic history and moral failings to ops center by the time i return. [14:17:32] ha, except [14:17:35] opera things look funny [14:17:38] I cannot see what I type [14:18:17] you have to be wearing opera glasses [14:18:33] or maybe you lack sufficient self-loathing to run that browser? [14:18:35] dunno. [14:18:38] brb foodz [14:18:41] mk [14:25:24] those *are* very good questions ;) [14:37:50] back [14:38:13] aye, i'm currently searching for a way to puppetize opscenter-agent [14:38:37] "0 of 10 agents connected" [14:39:03] right, the opscenter-agent is not running anywhere [14:39:10] i can do this through the web gui [14:39:15] ja [14:39:16] if I give it my private key [14:39:18] but i'd rather puppetize it [14:39:33] oo! [14:39:34] opscenter-agent.deb [14:39:41] is included in /usr/share/opscenter [14:39:56] yeah, i was gonna say... [14:40:05] also: http://www.datastax.com/docs/opscenter/agent/agent_index [14:40:16] aye yeah i'm reading all that [14:40:59] ah [14:41:02] this is the one we want http://www.datastax.com/docs/opscenter/agent/agent_manual [14:42:20] yup [14:42:35] i shall trust in your inordinate expertise :) [14:45:19] i don't like the fact that I have to run a script... [14:45:26] i just want to install a deb, and maybe modify a config file [14:45:28] but i'll get it [14:45:41] what's the script dooo? [14:46:21] create directories, copy some files around, install the .deb [14:46:24] i can do it with puppet [14:46:39] word. [14:46:40] ergh, i might not bother right now though... [14:46:48] heh [14:46:56] you sound like me all of a sudden... 
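A sketch of the corresponding fix; the filename is the one quoted above (some installs name it opscenterd.conf), and binding to 0.0.0.0 rather than a specific address is a choice, not something taken from the conversation:

    # point the [webserver] interface at something reachable, then bounce the daemon
    sudo sed -i 's/^interface = 127\.0\.0\.1/interface = 0.0.0.0/' /etc/opscenter/opscenter.conf
    sudo service opscenterd restart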
[14:47:51] btw: [14:47:57] jconsole-proxy analytics1001 7199 7199 analytics1001.wikimedia.org [14:48:01] works as expected. [14:51:33] oh. [14:51:44] i just realized why we were getting that weird squid page from the shell. [14:52:01] http_proxy=http://brewster.wikimedia.org:8080 [15:06:58] aye yeah [15:07:08] that should not be set on an01, but I think it is right now [15:07:19] man, i've got opscenter-agent up and running on 6 of the nodes [15:07:25] but still am seeing nothing in opscenter gui [15:09:35] hm. [15:10:05] trying gui installer... [15:11:29] want me to look first? [15:13:08] psh, that worked [15:14:06] snerk. [15:14:18] did you start the agent service everywhere afterward? [15:15:10] yeah [15:18:23] milimetric: you have mail? [15:18:27] what's his TZ? [15:18:44] mail? [15:18:50] i guess he's up because he said good morning! [15:19:02] milimetric: yes. the fast version not the snail version [15:19:27] I have a wikimedia.org account, is that what you mean [15:19:30] ottomata, is iptables allowing 80? [15:19:42] dandreescu@wikimedia.org [15:19:44] milimetric: have you checked your inbox recently? ;) [15:20:37] only from cluster IPs, i think [15:20:41] so you have to proxy [15:20:54] 8085 is the only port allowed from outside of cluster [15:21:07] jeremyb: what's my token? [15:21:12] milimetric: empty string [15:22:54] cool, thanks [15:24:12] ottomata: boo, even 80? [15:24:49] yup [15:25:01] why you got a prob? [15:25:13] just thinking about ways to make our lives easier. [15:33:48] hey jeremyb, why couldn't you create an account for stefan? [15:34:37] drdee: i asked sumanah about the case of "also wants an svn acct" and she said to leave it and she'd do the whole thing herself [15:34:54] k [15:34:56] ty [15:58:03] changing locations, eating food, be back on for standup [16:19:42] * milimetric lunch [16:25:40] hi average_drifter [16:25:49] I've created Stefan's Labs account but the Subversion account process requires an ssh key https://www.mediawiki.org/wiki/Developer_access/Subversion#Requesting_commit_access [16:47:25] average_drifter ^^^ [16:47:30] sumanah: thanks [16:56:46] https://plus.google.com/hangouts/_/747c161e540f46a13aa312b2f1ab701b9ec441b3 [16:56:56] https://plus.google.com/hangouts/_/747c161e540f46a13aa312b2f1ab701b9ec441b3 [17:02:10] https://plus.google.com/hangouts/_/747c161e540f46a13aa312b2f1ab701b9ec441b3 [17:02:25] milimetric ^^ erosen [17:21:04] sorry about that [17:21:10] got caught up in a meeting with jessie [17:22:08] she can get out of hand at times. [17:22:10] wild, even. [17:28:37] ok, dschoon, should I go ahead and push the seed change and restart cassandras? [17:28:40] i'll stop them all [17:28:44] and then start an01 and an06 [17:28:46] sure. [17:28:46] let them chat [17:28:49] then start each one [17:28:53] sounds great. [17:29:29] i'd like to try walking through the "automated deployment" myself, just to see wtf it says [17:29:52] the puppet thing? [17:29:54] yeah you should do it [17:30:01] well, that too, someday. [17:30:08] but at the moment, i was referring to the "fix" button [17:30:11] in the DSE UI [17:30:18] oh [17:30:27] it doesn't say much [17:30:36] (i'll also note that an01 has succeeded in finally finding its way to its own public interface, and its agent now shows up) [17:30:37] it asks for your username and private SSH key [17:30:46] (i have a key that I generated on an01 that I used) [17:30:54] i'd like to diff the conf files it generates [17:31:00] just curious how much of it comes from gossip. 
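Spelled out as a script, that restart plan looks roughly like this (a sketch; the dsh group name and the dse init script are borrowed from commands that appear later in this log, and the sleep is just an arbitrary settling period):

    # stop DSE/cassandra everywhere
    dsh -g k 'sudo service dse stop'
    # bring the two seed nodes up first and let them gossip with each other
    for h in an01 an06; do ssh "$h" 'sudo service dse start'; done
    sleep 120
    ssh an01 'nodetool -h localhost ring'
    # then start the remaining nodes
    for h in an02 an03 an04 an05 an07 an08 an09 an10; do ssh "$h" 'sudo service dse start'; done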
[17:31:10] hm [17:31:21] i noticed there's an opscenter_agent_monitor daemon [17:31:24] should I stop the opscenter-agents too? [17:31:24] which is kinda "buh?" [17:31:27] yes. [17:31:31] ok, how about [17:31:33] I stop everything [17:31:37] start up the cassandra cluster [17:31:40] and then let you mess with opscenter? [17:39:24] did the wifi in SF die? [17:40:52] ha, must have [17:50:53] what happened? network out? [17:53:33] ok dschoon, all cassandras restarted and reporting in [17:53:43] do you want to play with the gui installer? [17:53:49] for opscenter? [17:54:06] actually, so, when I had originally started opscenter [17:54:13] I had entered the IPs of each node in the gui [17:54:21] but then I wanted to see if I could make it display hostnames [17:54:26] so i edited the cluster and put in the hostnames [17:54:29] didn't seem to make a difference [17:54:30] so I left it [17:54:35] but maybe that is confusing things [17:54:37] I will change it back to IPs [17:55:40] dschoon, lemme know if you want to hit the fix button (if you don't I will :p ) [17:58:28] oook i'm doing it :p [18:01:09] go ottomata, go~ [18:17:54] interesting, dschoon, now that I've made it so that only an01 and an06 are listed as seeds [18:17:59] those are the only machines that are connected in opscenter [18:23:13] yargh blarg [18:27:17] lol [18:27:31] an10 cassandra was reporting as down [18:27:33] dunno why [18:27:35] restarting it now [18:27:46] and you are right about disabling ssl [18:27:48] i did that again [18:27:55] and log messages about opscenter are happier [18:28:01] even though it still doesn't seem to work [18:28:24] (hooray internet) [18:28:24] hokay, lemee look. [18:28:34] oh more are connecting now? [18:28:34] hmm [18:28:37] 4 of 10 connected [18:28:53] weird [18:29:49] btw, now that my ssh agent keys are better [18:29:52] dsh is awesome [18:30:19] dsh -g k 'sudo tail -f /var/log/{cassandra,opscenter,opscenter-agent}/*' [18:30:30] i think the entirety of the office is tethered over Ryan_Lane's mifi [18:30:35] hahhaa [18:30:43] yeah, i don't think the office can handle that right now [18:30:48] it's having trouble with ... jpg [18:32:04] maybe opscenter agents just take forever to start working together [18:32:09] now there are 5/10 [18:32:22] i think there's a polling threshold [18:33:38] sigh. http://www.datastax.com/docs/opscenter/configure/configure_opscenter_adv#stat-reporter-interval [18:33:51] the only documentation about this property is that you can set it to 0 to disable it? [18:33:52] really? [18:34:13] wha!? [18:34:17] it phones home to datastax?! [18:34:24] to ops-center :P [18:34:31] oh! [18:34:36] By default, OpsCenter periodically sends usage metrics about your cluster to DataStax Support [18:34:38] yeah. [18:34:42] i think to Datastax [18:34:42] that is, in fact, what it says. [18:34:43] booo [18:34:45] i think you are right. [18:34:51] TIME TO SET THAT TO ZERO [18:35:46] yeah, I will puppetize all of this once we get it [18:35:52] um, maybe we need to set cassandra seed_hosts? [18:35:53] no hm [18:36:02] that probably has nothing to do with opscenter-agent problems [18:36:03] right? [18:36:10] there are two ways opscenterd is getting its data [18:36:15] one is by talking to cassandra gossip stuff [18:36:18] and it looks like that is all working [18:36:22] since it reports 10 nodes [18:36:28] and the other is by talking to opscenter-agent [18:36:31] which is only half work [18:36:31] i cannot imagine they're related. [18:36:32] working [18:36:40] also: you only listed 2 seeds, right? 
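For the phone-home setting, the page linked above boils down to something like the following (a sketch; the [stat_reporter] section name follows that docs URL, but exactly which opscenter conf file it belongs in is an assumption):

    # disable periodic usage reporting to DataStax, then restart opscenterd
    sudo tee -a /etc/opscenter/opscenter.conf >/dev/null <<'EOF'
    [stat_reporter]
    interval = 0
    EOF
    sudo service opscenterd restart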
[18:36:46] yet we see 4 agents. [18:36:48] for cassandra, yes [18:36:54] not for opscenter [18:36:57] er [18:37:04] i didn't list any nodes for opscenter, except in the gui [18:37:08] where I listed them all [18:38:11] lemee look [18:38:29] (god iterm2 is great) [18:38:38] in the gui, I listed the IPs in Edit Cluster... [18:38:44] btw, do you know the watch command [18:38:45] it's super nice [18:38:55] ja [18:38:58] i'm doing this in a bit of my terminal [18:38:58] watch -d -n 3 "sudo nodetool -hlocalhost ring | sed -e '1,2d' | awk '{print \$1 \"\t\" \$4}' | sort" [18:39:11] prob don't need sudo [18:39:12] but whatever [18:39:25] (you don't) [18:39:33] what is this? [18:39:33] an06: ERROR [Jetty] 2012-09-18 18:39:10,381 Exception running Jetty, restarting: java.net.BindException: Address already in use [18:39:39] that is from an opscenter agent log [18:39:54] sooo... yeah. /etc/opscenter/clusters/KrakenAnalytics.conf has all 10 in seed_hosts [18:40:29] that file is put in place by the gui [18:40:47] *nod* [18:40:53] i think i agree with that at this point. [18:40:56] let's see if it changes :) [18:41:05] if we change it? [18:41:08] manually? [18:41:19] i deleted 09 [18:41:21] from the GUI [18:41:25] ok [18:41:28] yeah its gone [18:41:31] in the file [18:41:58] still says 10 active nodes in dashboard though [18:41:59] so that is good [18:42:02] hilariously [18:42:03] yes. [18:42:04] that. [18:42:08] take all of them out but an01 and an06 [18:42:16] although I doubt this will do anything to opscenter agent prob [18:42:29] done. [18:42:53] now let's give it a while. [18:43:03] so, i have no evidence [18:43:06] but i have a suspicion. [18:43:46] eh? [18:43:48] that suspicion involves jsvc and the ability for different applications to run in the same VM. [18:43:57] because the agent *needs* to run in the same VM. [18:44:02] hm [18:44:04] otherwise it cannot access the information it wants. [18:44:07] ok... [18:44:12] and it isn't? [18:44:21] this is why the agents MUST come up after dse does. [18:44:27] well, it would explain why they're "running" [18:44:44] but not sending anything to opctr [18:44:53] and therefore it thinks they're down. [18:45:01] hm [18:45:15] the only node that I restarted cassandra on since I started opscenter-agent was an10 [18:45:19] so I will restart opscenter-agent there [18:46:01] so there's this opscenter_agent_monitor process somewhere [18:46:02] how can you tell if they are running in the same vm? [18:46:11] i think that is restarted by the same init script, no? [18:46:22] i'm trying to figure that out now. i've not worked with jsvc before [18:47:16] oh, all opscenter monitor is doing [18:47:22] is monitoring the opscenter-agent proc [18:47:27] and restarting it if it is not running [18:48:10] basically: [18:48:23] while [ 0 ]; do [18:48:23] sleep 30 [18:48:23] [18:48:23] /etc/init.d/opscenter-agent start &>/dev/null [18:48:51] hm, anyway, iunno, let's watch this for a bit [18:48:53] fuck. [18:48:57] i bet you anything [18:49:01] oh? [18:49:03] it has to do with goddamn process privileges [18:49:11] and it can't attach or something [18:49:20] hm, you think? [18:49:20] 'cause they want us to run it as a certain user [18:49:22] all this crap [18:49:27] why would it work on some but not others then?
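Filled out, the watchdog loop paraphrased above amounts to something like this (a sketch of the idea, not the actual monitor script):

    # keep opscenter-agent alive; the init script should be a no-op when the agent is already running
    while true; do
        /etc/init.d/opscenter-agent start &>/dev/null
        sleep 30
    done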
[18:49:37] i'm just trying to think of things that might be different about setup [18:49:53] perhaps there's a race for the pidfile [18:49:55] they're all running as root right now anyway [18:50:01] no [18:50:04] dse forks [18:50:07] and becomes cassandra [18:50:09] yeah as cassandra [18:50:09] yeah [18:50:17] but i bet the pidfile is owned by root or something [18:50:19] cassandra seems to be fine though [18:50:24] and it depends on which one of them writes it [18:50:38] easy enough to test [18:51:06] -rw------- 1 root root 5 2012-09-18 18:27 /var/run/dse.pid [18:52:02] hrm. [18:52:31] all owned by root. [18:52:34] i see no pattern. [18:53:20] yeah, which is fine [18:53:23] i think [18:53:31] since the daemon is managed by root, but the proc runs as cassandra [18:53:40] *nod* [18:53:59] it was an ok hypothesis [18:54:49] you restart anything on 9? [18:54:51] it came back up. [18:55:29] it was down? [18:55:30] and no [18:55:39] i mean to say, its agent appears now [18:55:58] ah yeah [18:55:59] hm [18:56:00] weird [18:56:03] nope, didn't touch it [18:56:09] ok i unno [18:56:13] maybe opscenter is just slow to start up [18:56:19] mind if I start playing with hadoop? [18:56:22] cassandra seems fine [18:59:00] brb [19:06:20] weird, an05 cass is showing as down again [19:06:23] having trouble restarting it [19:06:24] hm [19:08:41] Hey - I have a stats.wikimedia question, to which I'm not sure if anyone in here will know the answer, but I thought I'd check :) [19:09:06] All the historical numbers for PTWP (http://stats.wikimedia.org/EN/TablesWikipediaPT.htm) changed today, and I'm not really sure why… [19:18:08] dschoon [19:18:13] so, two things [19:18:28] 1 i think hadoop/cassandra consistency level is set to something weird by default [19:18:35] because for some reason, nodes keep flapping [19:18:40] and with even 1 down [19:18:44] hadoop fs doesn't work: [19:18:49] ls: could not get get listing for 'cfs:/user/otto/logs' : UnavailableException() [19:19:15] 2 [19:19:19] why do they keep flapping!? [19:19:23] they look fine to me [19:19:45] the procs are still running, and there are no logs on the downed machines that show anything wrong [19:19:51] but they are gossiping as down [19:43:01] hokay. [19:43:05] back from a forced lunch. [19:44:16] i thought about the flapping during lunch, ottomata [19:44:22] yeah? [19:44:36] (btw i am running a pig job right now :) ) [19:44:38] i'm gonna do some jconsole archeology [19:46:37] what's the URL for the namenode? [19:51:07] WAAHHHh they are down again [19:51:09] ummmmmmmmm [19:51:11] good question [19:51:16] DSE abstracts that crap [19:51:24] job tracker is on an02 it looks like [19:53:00] okay, so. [19:53:05] this is definitely bunk. [19:53:14] *something* is actively wrong. [19:53:23] and i think we might have to think systemically about it [19:53:30] and test each brick in the wall [19:53:36] yeah [19:53:52] at the moment, i don't think it's dse [19:54:20] network being weird? [19:54:22] other than the problem bringing up the nodes in the veeeery beginning, i haven't seen any issues. [19:54:24] *nod* [19:54:34] current leading candidates are: [19:54:59] - the init crap that launches jsvc [19:55:06] - opscenter itself [19:55:08] - network [19:55:15] - os configuration [19:55:28] hm, well there are probs with cass too [19:55:29] not just opscenter [19:55:32] which is more worrisome [19:55:33] oh? [19:55:39] yeah, cass nodes keep flapping [19:55:42] at least they gossip as down [19:55:45] how do you know? 
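One way to answer that from every node's own point of view at once (a sketch reusing the dsh group seen earlier; -M just prefixes each output line with the host name):

    # each node reports which peers *it* currently gossips as Down
    dsh -g k -M 'nodetool -h localhost ring | grep Down'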
[19:55:51] i mean, opscenter reports them as down [19:55:53] i'm watching logs and nodetool ring [19:55:58] but i run nodetool ring, and they seem ok. [19:56:04] an08 was just down for about 3 mins [19:56:07] hm. [19:56:11] okay. that's a good lead. [19:56:12] right now, all up [19:56:21] ah [19:56:22] two other places to look. [19:56:23] an05 is down now [19:56:41] and, since there are no logs on the offending hosts about problems [19:56:42] there's the #1 suspect with any intermittent java problem: jvm config [19:56:45] namely: gc [19:56:47] hm [19:56:49] really? [19:56:51] yes. [19:56:56] gc is life or death. [19:56:56] i dunno, because this doesn't seem to actually go down [19:56:58] it stops the world. [19:56:59] as far as I can tell [19:57:02] the process keeps running [19:57:04] which means no network [19:57:05] no logs on the offending host [19:57:08] no anything happens. [19:57:13] it seems more like gossip just reports them as down [19:57:19] sure, the *process* keeps running [19:57:20] gossip is being too sensitive [19:57:21] or [19:57:27] there are network problems [19:57:27] but the jvm spends all its time cleaning up garbage [19:57:31] this is why tuning is so important [19:57:42] but! these nodes are idle [19:57:45] what, you think the process is just timing out because the jvm is busy? [19:57:50] so we'll take a look, but that seems REALLY unlikely. [19:57:50] so it's not reporting to gossip? [19:58:01] but it's never wise to ignore GC settings [19:58:02] an03 is down atm [19:58:07] and i've not looked closely at them [19:58:08] hm. [19:58:11] okok, one sec [19:58:14] one last idea [19:58:26] an01: INFO [HintedHandoff:1] 2012-09-18 19:58:04,131 HintedHandOffManager.java (line 284) Endpoint /10.64.21.103 died before hint delivery, aborting [19:58:32] who are you watching? [19:58:34] just an01? [19:58:41] all logs with dsh [19:58:48] dsh -g k 'tail -f /var/log/cassandra/*' [19:58:49] does everyone report that, or just an01? [19:58:59] more than just an01 [19:59:09] scrolling back [19:59:11] hm. and you're running a pig job right now? [19:59:22] yeah, but i can't imagine it is still working [19:59:27] whatever replication/consistency settings are the default [19:59:30] if any nodes are down [19:59:31] anyway -- i would also hazard that since we're seeing intermittent problems on every box, we need to consider things that touch every box. [19:59:34] hadoop fs fails [19:59:37] so any shared confs or crons [19:59:43] but in my console it looks like it is still running [19:59:51] yeah, it's every box [19:59:58] shared confs? [19:59:59] crons? [20:00:06] i doubt there *are* any [20:00:11] yeah don't think so [20:00:16] but i'm saying we're seeing periodic slowdown and dc [20:00:24] across all nodes [20:01:38] which either means a systemic problem: network; something faulty in the rack (power?); and places where they have the same configuration or software [20:02:25] so i guess we both know what to do. [20:02:33] first order of business is to start shutting things off [20:02:34] simplifying [20:02:38] reducing variables. [20:02:51] let's turn off opscenter and all the agents [20:02:54] and rerun the pig job [20:03:57] ok [20:04:02] shutting down opscenter [20:04:10] at least then we know it's not that stuff. [20:04:34] oops, i just stopped cassandra too [20:04:37] haha [20:04:39] s'ok. 
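Ruling the GC theory in or out is cheap to check; a sketch (the jsvc process lookup, the availability of a JDK's jstat, and the log path are assumptions):

    # watch GC activity on one node; a long jump in full-GC time would line up with a "down" report
    pid=$(pgrep -f jsvc | head -1)
    sudo -u cassandra jstat -gcutil "$pid" 1000
    # cassandra also logs long collections itself
    grep -i GCInspector /var/log/cassandra/system.log | tail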
[20:04:39] oh well, good way to start :p [20:04:44] a clean up/down is fine [20:04:48] probably desirable [20:04:51] yeah [20:05:34] i think if we still see the problem, we should next shut down all the nodes but two [20:05:36] and see if it happens [20:05:43] then keep adding one until we see it again. [20:06:51] actually. [20:07:01] if the test fails with only dse running [20:07:15] i think we should next try shutting off iptables everywhere [20:07:15] man, 3 of the cassandras won't stop [20:07:18] with service script [20:07:24] ok [20:07:27] give em time [20:07:38] hadoop takes forever to shut down. [20:07:50] i also worry about us using an01 as a bastion [20:07:55] yeah the logs i'm seeing are hadoop messages [20:07:57] i feel like we should take it out of the ring. [20:08:08] why? [20:08:19] i mean, we're streaming megs and megs of logs (each!) from everywhere, through there, then to us [20:08:30] oh hm, yeah true [20:08:32] it's noise that could be affecting the outcome [20:08:36] ok, i'm fine with that [20:08:38] schroedinger's box! [20:08:43] can we start over from the beginning then? [20:08:47] right now the clusters all know about an01 [20:08:48] because we observe it, it happens :) [20:08:50] how do you take it out? [20:08:54] nodetool [20:09:00] after the others have started up? [20:09:05] do we have to start up an01 and then take it out? [20:09:10] i have loaded a single sampled log in [20:09:11] well [20:09:14] easier solution [20:09:18] take everyone down [20:09:23] update the conf in puppet [20:09:25] and change the cluster name [20:09:29] hm [20:09:42] it means nobody will have a cached copy of anything [20:09:50] so they'll make no assumptions [20:09:59] they'll all figure out new ring positions, etc [20:10:08] assign new replicas [20:10:26] those are the two things that are affected by an01 in the cluster [20:10:30] hmmmmmmmmmmmmmmmm, ok [20:10:47] it also allows us to preserve this [busted] cluster [20:10:50] for forensics [20:10:53] hmm [20:11:07] since the new cluster will just live alongside the old one. [20:11:21] fyi, if we really wanted to take an01 out of the cluster, you'd run http://www.datastax.com/docs/1.1/references/nodetool#nodetool-decomission [20:11:29] $cassandra_cluster_name = "KrakenAnalytics2" [20:11:32] $cassandra_seeds = [ [20:11:32] "10.64.21.102", # an02 [20:11:33] "10.64.21.106", # an06 [20:11:33] ] [20:11:33] eh? [20:11:49] word [20:11:57] looks good. [20:13:18] still waiting for an07 to shut down [20:13:25] go go hadoop! [20:13:32] can I just kill it? [20:13:34] yes. [20:13:37] cassandra won't care. [20:13:45] hadoop might be pissy, but whatever. [20:14:04] ok, running puppet on 02 and 06 first, to start up seeds [20:14:05] (they are like the most temperamentally mismatched married couple ever) [20:14:29] ...wait, does puppet start things? [20:14:36] yes, but not restart [20:14:36] are we SURE it doesn't just run randomly and ruin things? [20:14:41] i turned that off [20:14:49] but ops-puppet does? [20:14:53] ? [20:15:00] ops puppet doesn't know about cassandra or dse [20:15:01] so no [20:15:02] but [20:15:04] currently our puppet [20:15:05] (don't we still have ops-puppet as well as anal-puppet?) [20:15:06] if you run puppet and cassandra is off [20:15:09] it will start dse [20:15:11] yes [20:15:11] heh [20:15:14] how nice of it. [20:15:15] okay. [20:15:20] we can turn that off too, but i thought that was fine [20:15:21] let's monitor that, just to be safe. 
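For the record, the decommission route linked a little earlier would look roughly like this (a sketch; it assumes an01 is 10.64.21.101, following the an0N numbering used above):

    # run on the node leaving the ring (an01), with the rest of the cluster up;
    # it streams its ranges to the remaining replicas before dropping out
    nodetool -h localhost decommission
    # afterwards, no other node should still list it
    dsh -g k 'nodetool -h localhost ring' | grep 10.64.21.101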
[20:15:28] previously the service was subscribed to the config files [20:15:30] so if you changed a config [20:15:34] puppet would restart dse [20:15:37] i don't like that [20:15:41] that's cool for things like virtual hosts [20:15:43] but not for databases [20:15:43] i mean, we have -Xms=8G or something, right? [20:16:13] hehe [20:16:14] an06: org.apache.cassandra.config.ConfigurationException: Saved cluster name KrakenAnalytics != configured name KrakenAnalytics2 [20:16:34] so if puppet was randomly trying to start another copy, that copy would first allocate 8G of RAM, write a bunch of files, and then attempt to grab interfaces, which would fail, and then it'd die. [20:16:47] that sounds like a good way to get giant, random pauses [20:16:56] well, our puppet is not running unless we tell it to [20:16:59] yeah. [20:17:08] so let's... monitor :) [20:17:10] ok [20:17:17] just you know, check the touch times every once in a while? [20:17:19] also it doesn't restart if the proc is running [20:17:23] (there's a lastrun file, right?) [20:17:25] of the .pid file? [20:17:48] dunno. is there a way to know when *our* puppet last ran? [20:18:56] (i guess maybe i should learn about puppet, eh?) [20:20:03] /var/lib/puppet.analytics/state [20:20:17] last_run_summary.yaml [20:20:33] anyway, i can't start this cluster with a different name [20:20:59] sweet. [20:21:01] also: sweet. [20:21:09] where are you? i'll come look. [20:21:14] um, an02 [20:21:19] also, tail the cassandra logs [20:21:40] sudo tail -f /var/log/cassandra/* [20:22:33] an02: org.apache.cassandra.config.ConfigurationException: Saved cluster name KrakenAnalytics != configured name KrakenAnalytics2 [20:22:39] word. [20:22:51] can we just keep the same name and decom an01? [20:22:58] it might take a while. [20:23:00] how much data is there? [20:23:00] hm [20:23:04] 2.5 G [20:23:06] is all I loaded [20:23:11] hm. [20:23:12] let's find out! [20:23:13] it's a fun task! [20:23:13] ha, ok [20:23:14] okok [20:23:15] step 1. [20:23:18] changing name back to orig [20:23:22] yes. [20:23:30] and let's both get some jconsole action pointed at an01 [20:23:34] (once it's up) [20:24:01] ok, i'm going to run puppet on an01 [20:24:09] this will start cass [20:24:26] you update the cluster name everywhere? [20:25:02] not really, but cass hasn't started there yet [20:25:05] whenev pup runs it will [20:25:09] mk. [20:25:51] ok, cass running on an01 [20:25:58] should I start an02 and an06? [20:26:10] we need to start them all to decom an01, right? [20:26:43] jc analytics1001 7199 7199 analytics1001.wikimedia.org & [20:26:54] yeah, you need them all up. [20:27:17] ok jconsole on [20:27:48] ok starting 02 then 06 [20:28:17] with updated conf? [20:28:26] actually. [20:28:27] wait. [20:28:31] why are we doing this? [20:28:40] ? [20:28:43] so we can decom 01 [20:28:44] ? [20:28:54] why don't we just wipe the state? [20:28:56] how? [20:29:05] you just delete the files. [20:29:08] they're in ... [20:29:13] the data? the commitlog? [20:29:19] /var/lib/cassandra [20:29:19] ? [20:29:36] yep! [20:29:40] i'm looking now. [20:29:57] awww [20:29:59] i wanna decom! [20:30:03] come oooon lets do it [20:30:06] it'll be fun [20:30:14] haha [20:30:16] fine. [20:30:55] ok starting 06 [20:33:03] waiting for 06 to show as up in ring [20:35:10] grrr [20:35:18] still shows as down :( [20:35:21] who does 6 know about? 
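That question maps fairly directly onto nodetool (a sketch; gossipinfo dumps every endpoint a node has heard about along with the status it believes they advertise):

    # from an06: which endpoints does it know, and what state does it think they're in?
    ssh an06 'nodetool -h localhost gossipinfo'
    ssh an06 'nodetool -h localhost ring'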
[20:35:24] cassandra is an exercise in patience [20:35:25] oh good q [20:35:43] (honestly, this stuff was all near-instant with an empty DB back when i did it last) [20:35:47] (it's all about good configuration) [20:36:01] well, you didn't do it with DSE, right? [20:36:06] so who knows what defaults they changed [20:36:16] also, i just realized we are running a slightly outdated version of DSE [20:36:23] heh [20:36:25] since I installed DSE originally back in may or whenever that was [20:36:31] god, i'd die if that was the reason [20:36:46] hmmm [20:36:46] nodetool -hlocalhost ring [20:36:47] Error connection to remote JMX agent! [20:36:50] java.net.SocketTimeoutException: Read timed out] [20:37:00] should I try to restart it [20:37:02] ? [20:37:03] cass? [20:37:17] no. [20:37:20] just chill for a bit. [20:38:27] was this one of the ones you forced down? [20:38:36] hmm, naw [20:38:38] i only forced 07 down [20:38:44] it did take a while to go down I think though [20:38:46] but I didn't force it [20:39:04] dstat says nothing is going on. [20:39:05] doo dee doo [20:39:05] 10.64.21.106 Down [20:39:15] lemme restart cass again [20:39:18] cmooon [20:39:21] impatience [20:39:23] give in [20:39:35] may i? [20:39:38] yup [20:40:04] huh [20:40:10] netstat says jconsole is up [20:40:12] i wanna look first. [20:43:04] k [20:44:03] no dice, it seems. [20:44:11] can't connect. [20:44:47] restart it! [20:45:53] yeah, that jvm is fucked. [20:47:03] thumbs are a-twiddlin [20:47:11] i'ma kill -9 it [20:48:42] jaaaaaaaaa do it [20:51:44] killed, then ran service dse start, and now we're at the same place as before. [20:51:49] yup [20:51:55] can we wipe em now? [20:52:00] fiiiine [20:52:09] aiight. [20:53:05] you wipin? [20:53:09] i dunno what to wipe [20:53:22] first we have to shut them all down. [20:53:46] hehe [20:53:53] just 1 and 2 up [20:53:54] ah [20:53:57] i see you shutdown 1 [20:53:58] hehe [20:53:58] dsh -g kk sudo killall -9 jsvc [20:54:02] oh nastay [20:54:08] crash only! [20:54:26] an03: jsvc: no process found [20:54:26] an07: jsvc: no process found [20:54:26] an04: jsvc: no process found [20:54:27] an09: jsvc: no process found [20:54:29] an08: jsvc: no process found [20:54:31] an05: jsvc: no process found [20:54:33] an10: jsvc: no process found [20:54:35] ;) [20:54:39] mmmmk [20:54:46] so what [20:55:02] k [20:55:04] rm -rf /var/lib/cassandra/commitlog/* /var/lib/cassandra/data/*/* [20:55:05] ? [20:55:05] they're all off [20:55:18] i was planning: [20:55:20] rm -rf /var/lib/cassandra/* [20:55:24] no don't do that [20:55:26] i checked -- no jars [20:55:27] they are mounted partitions [20:55:28] no config [20:55:30] oh. [20:55:32] LAME. [20:55:41] /dev/sde1 276G 368M 261G 1% /var/lib/cassandra/commitlog [20:55:41] (won't it just tell me to fuck off?) [20:55:41] /dev/sdf1 280G 5.1M 280G 1% /var/lib/cassandra/data/f [20:55:41] /dev/sdg1 280G 5.4M 280G 1% /var/lib/cassandra/data/g [20:55:41] /dev/sdh1 280G 4.3M 280G 1% /var/lib/cassandra/data/h [20:55:41] /dev/sdi1 280G 4.3M 280G 1% /var/lib/cassandra/data/i [20:55:41] /dev/sdj1 280G 4.3M 280G 1% /var/lib/cassandra/data/j [20:55:45] maybe? [20:55:46] (i *can't* do that, right?) [20:55:57] i dunno [20:56:01] just do [20:56:05] frumple. [20:56:06] fine fine. [20:56:16] rm -rfv /var/lib/cassandra/{commitlog,data/*/*} [20:56:26] or you! :) [20:56:27] o [20:56:29] ok i'll do it [20:56:37] i love rm -v [20:56:40] it's so ... satisfying [20:57:28] k done [20:57:29] dsh is great! 
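A quick way to confirm the wipe left the mounts empty across the board (a sketch, reusing the same dsh group):

    # data and commitlog mounts should report next to nothing used
    dsh -g k -M 'du -sh /var/lib/cassandra/commitlog /var/lib/cassandra/data/*'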
[20:58:08] I <3 it so much [20:58:17] i should fix dsync [20:58:20] ok [20:58:24] so if we start an02 and an06 now [20:58:27] they should be brand new? [20:58:31] and only know about each other? [20:59:32] yep. [20:59:39] ok, trying that [20:59:41] provided we've updated the conf everywhere [20:59:45] yes [20:59:51] well, not everywhere [20:59:54] but I am starting via puppet [21:00:02] so puppet will update confs before starting dse [21:00:17] the only conf change is that we are using an02 and an06 as seeds [21:00:17] right? [21:00:21] still the same cluster name [21:00:28] er [21:00:31] hm, is this a problem? [21:00:31] an02: INFO [main] 2012-09-18 21:00:19,906 Mx4jTool.java (line 72) Will not load MX4J, mx4j-tools.jar is not in the classpath [21:00:32] also remove an01 from the class [21:00:38] of machines that run dse [21:00:41] yeah [21:00:43] no, that's fine. [21:00:45] ok [21:00:56] i'm curious if it's on every machine though [21:01:57] brb [21:02:11] 06 did it [21:02:15] mx4j-tools.jar is not in the classpath [21:02:28] right. [21:02:34] i read that recently. [21:02:51] hmmmmmm [21:02:58] an02 and an06 are not talking to each other [21:03:09] give em a sec [21:03:15] this is the slowest part. [21:03:22] they're being careful not to overwhelm the network. [21:03:30] btw, do we have multicast enabled? [21:03:39] for? [21:03:40] cass? [21:03:45] iunno [21:04:23] i'm just thinking that knowing our people [21:04:28] everything is off by default [21:04:32] no matter how unreasonable [21:04:46] it's now up [21:04:52] yay cool they know about each other [21:04:53] cool [21:05:01] they see each other [21:05:06] they are friends! [21:05:21] should I start another? [21:05:29] you can start the other 7, sure. [21:05:33] k [21:05:35] errybdy but 01 [21:05:39] aye [21:07:59] hmm, an02 logs say an03 came up and is now part of cluster [21:08:02] ring doesn't show it [21:08:03] waiting... [21:08:22] who did you ask? [21:08:33] 2 and 3 [21:08:39] even 3 doesn't show itself...? [21:09:37] 3 isn't up [21:11:45] i think you are right sir [21:12:05] the proc is running though [21:12:07] grrr i don't like this [21:12:09] killing proc [21:14:06] hm, yeah still [21:14:08] what's up with 03? [21:15:03] let me look into it. [21:15:25] k, i'm probably heading out purty soon [21:16:45] jaaaa, i gotsta go [21:16:48] good luck fiddling [21:16:52] lemme know what you figure out [21:16:59] aiight. [21:17:05] i'm proally gonna call it quits soon. [21:17:11] and think. [21:17:20] thinking is often better than grinding with these things :) [21:17:37] mk cool [21:17:44] maybe i should just reinstall the whole thing [21:17:46] upgrade DSE anyway [21:18:44] i think that's probably going to be Thing #1 tomorrow [21:20:37] yeah [21:20:39] will do that then [21:20:40] oook bye [23:50:27] hello [23:50:34] had to crash, have a bad cold [23:50:39] drdee: you there ?