[00:00:07] it'll show up on the project's page, and in the combined server admin log
[00:00:10] Ryan_Lane: what's my project?
[00:00:10] !project etherpad
[00:00:10] https://labsconsole.wikimedia.org/wiki/Nova_Resource:etherpad
[00:00:16] right :)
[01:08:49] RoanKattouw: gerrit isn't on Formey anymore, is it?
[01:09:17] I ... don't know, ask Ryan_Lane|away
[01:09:36] yes/no
[01:09:38] * Krinkle knew Ryan was gonna move it, and knows he did, but wonders to what, for wikitech docs
[01:09:45] replication still on
[01:10:06] master is on manganese, slave is on formey
[01:10:13] ok
[01:10:15] manganese is a new one?
[01:10:25] Yeah. It's in eqiad
[01:10:28] ok
[01:16:07] Ryan_Lane: https://wikitech.wikimedia.org/view/Manganese , https://wikitech.wikimedia.org/view/Formey
[01:16:59] gah
[01:16:59] https://labsconsole.wikimedia.org/wiki/Category:Projects_used_in_production
[01:17:33] I get why the documentation pages are there, but why are most of the projects missing?
[01:21:48] ugh. is it really going to make me empty-edit all of the project pages?
[01:21:57] why didn't it stick that crap into the job queue?
[01:25:40] Ryan_Lane: Could testswarm.wmflabs.org be restored? Hashar and/or I are going to work on it this weekend (to use Special:JavaScriptTest as developed in the JSTesting branch instead of ./tests/qunit/index.html), but we need a public pointer for clients to join the swarm and test it
[01:26:30] was it taken away?
[01:26:44] Not sure, I see log msgs about some VMs being killed
[01:27:15] no problem, nothing important was on there, but can a new one be created based on the production integration machine and the puppet config for testswarm?
[01:27:32] Hm.. I see a 'miniswarm' instance, wonder what that is about
[01:28:42] you had deleted the testswarm instances, right?
[01:28:48] I haven't done anything
[01:28:51] you need to work out the issue with it filling up /
[01:29:00] I guess hashar did
[01:29:12] the live machine isn't doing that, right?
[01:29:25] no, because it has enough space in /
[01:29:30] instances do not
[01:30:13] what do you mean exactly by /, is it creating new directories in the root of the drive en masse?
[01:31:40] ah, you don't have an IP in testswarm anymore
[01:31:47] Krinkle: no, the / filesystem
[01:31:50] ok
[01:32:03] anyway, I won't run the normal every-commit fetcher, only manual updates
[01:32:10] no need to have it do that, we know that works
[01:32:23] seems you still had 1 in your quota, I allocated an IP for you
[01:34:58] ok. added the hostname back too
[01:35:10] you'll need to make an instance, and get testswarm going, then associate the IP
[01:35:39] okay, 'making an instance'. Let's see
[01:36:32] I can do that from labs console?
[01:37:22] yep
[01:37:28] "Manage Instances" in the sidebar
[01:37:33] been there
[01:37:42] didn't see a way to add, let's see again
[01:38:09] Do I need more user rights?
[01:38:31] "Actions" column is empty everywhere, except for deployment-prep instances
[01:38:53] oh. lemme see
[01:38:57] there's also [Add instance] for deployment-prep, but not for the testswarm project or its instances
[01:39:11] you probably don't have sysadmin rights
[01:39:22] indeed hashar does, not you
[01:39:26] ok
[01:39:34] ok. you're good now
[01:41:04] Ryan_Lane: Any defaults I should change in the "Instance information" section?
[01:41:49] oh, yeah I think I'll need more than 0GB storage :)
[01:47:54] Ryan_Lane: okay, I think I'm hitting the wall of my openstack knowledge.
I need an instance that can basically do what the live production machine does (git, svn, mysql, php, apache). I was assuming the list of packages the live machine has is public somewhere, is it?
[01:48:04] I can't find a manifest for integration in operations/puppet
[01:48:40] like the 'database' group has some that I need for sure, but no idea which. And under 'continuous integration' I see "testswarm" and "packages", do I need "packages"?
[01:49:15] (aside from "testswarm" obviously)
[01:51:02] instance name "swarm_jstesting" is not allowed? "Bad resource name provided. Resource names start with a-z, and can only contain a-z, 0-9, -, and _ characters."
[01:54:14] !log testswarm created instance "swarm-specialpage" for test/development of testswarm/qunit using Special:JavaScriptTest instead of ./tests/qunit/index.html
[01:54:15] Logged the message, Master
[01:56:24] !log testswarm associated IP for testswarm.wmflabs.org with IP of "swarm-specialpage" instance
[01:56:25] Logged the message, Master
[01:56:58] most of what you need likely isn't puppetized properly
[01:57:08] well, actually
[01:57:10] that's not true
[01:57:18] the testswarm stuff was puppetized in labs
[01:57:19] what about php5, apache, testswarm, db::core, mysql-something?
[01:57:21] so that stuff should be ok
[01:57:33] db::core doesn't do what you think
[01:57:40] the apache stuff is a serious mess
[01:57:40] or is the rest automatic when I click 'testswarm'
[01:57:55] I'd have to look at the manifests to see what testswarm does
[01:58:07] right now I didn't tick any, per the warning that it might fail building
[01:58:13] right
[01:58:16] you do it later
[01:59:17] the testswarm stuff doesn't install the dependencies, it seems
[01:59:29] where can you see that
[01:59:53] I looked in the manifests
[02:00:04] manifests/misc/contint.pp
[02:00:08] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=manifests; ?
[02:00:12] class testswarm
[02:00:16] I have a checkout
[02:00:30] contint, what a name :D
[02:00:33] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/misc/contint.pp;h=fe37720f88a99a39afc7213c21b46bfe0da0d87b;hb=HEAD
[02:00:33] I'd never have found that
[02:00:41] I grepped for testswarm
[02:00:46] right
[02:01:34] Ryan_Lane: So why do I only see contint::testswarm and contint::packages in labs console?
[02:01:42] not e.g. jenkins, which is also in here
[02:02:06] not that I need that, just trying to understand it
[02:03:00] because it's the only one added
[02:03:13] oh, it's manually maintained
[02:03:14] ok
[02:03:23] you can add puppet groups to your project, and once you verify they work, we can move them globally
[02:03:50] if I tick the test::testswarm checkbox for my instance on labs, can I change some of the manifest (e.g. remove the files {} and the cronjob that does checkouts every minute)?
[02:03:54] PROBLEM Current Load is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:04:34] PROBLEM Current Users is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:05:14] PROBLEM Disk Space is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:05:31] if you run puppet, those nrpe errors will go away
[02:05:38] sudo to root, then: puppetd --test
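[Editor's note: a minimal sketch of that run sequence. The commands are the ones given above; `puppetd --test` is the Puppet 2.x-era agent invocation (on modern Puppet the equivalent is `puppet agent --test`).]

```bash
# Become root on the instance, then trigger a one-off foreground puppet run.
# This applies the catalog immediately instead of waiting for the next
# scheduled agent run, and sets up the NRPE bits that the CHECK_NRPE
# errors above are complaining about.
sudo -i
puppetd --test
```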
[02:05:57] to change the manifest you'll need to push a change for review
[02:06:04] PROBLEM Free ram is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:06:08] we don't have branch-per-project puppet yet
[02:06:11] I've been working on it
[02:06:59] but how can I test the puppet change before I push it to gerrit?
[02:07:17] that's the problem
[02:07:18] can I change the manifest on the instance and apply it to itself somehow?
[02:07:24] PROBLEM Total Processes is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:07:27] not until we have branch-per-project
[02:07:44] when you push it in for review, ops can tell you if something is wrong
[02:08:03] when we have branch-per-project, you'll be able to test it before pushing it in for review
[02:08:08] ok
[02:08:14] PROBLEM dpkg-check is now: CRITICAL on swarm-specialpage swarm-specialpage output: CHECK_NRPE: Error - Could not complete SSL handshake.
[02:08:16] it's in the plans :)
[02:08:39] (ran puppetd --test)
[02:09:34] RECOVERY Current Users is now: OK on swarm-specialpage swarm-specialpage output: USERS OK - 1 users currently logged in
[02:10:00] Ryan_Lane: okay, if I tick that checkbox on labsconsole it'll apply that manifest you linked to. Is it periodically enforced, or one-time?
[02:10:13] e.g. can I modify those php files it fetches from puppet, and the crontab?
[02:10:14] RECOVERY Disk Space is now: OK on swarm-specialpage swarm-specialpage output: DISK OK
[02:10:17] enforced when puppet runs
[02:10:30] every 30 minutes
[02:10:53] if you want it run once, you check it, force a run, then uncheck it
[02:11:04] RECOVERY Free ram is now: OK on swarm-specialpage swarm-specialpage output: OK: 86% free memory
[02:11:05] right
[02:11:08] is that common?
[02:11:12] (for labs)
[02:11:22] right now, probably
[02:11:51] 'cause on labs I can imagine people getting to work based on the production puppet, building on that until it's stable, and then puppetizing it back up
[02:12:24] RECOVERY Total Processes is now: OK on swarm-specialpage swarm-specialpage output: PROCS OK: 85 processes
[02:12:40] yeah
[02:12:50] when we switch to the new model, you can keep it checked
[02:12:57] and it'll keep applying what you are working on
[02:13:07] so for now, I need apache serving the subdomain you assigned me, and testswarm needs php and mysql.
[02:13:14] RECOVERY dpkg-check is now: OK on swarm-specialpage swarm-specialpage output: All packages OK
[02:13:32] I just checked "misc::contint::test::testswarm" on labsconsole
[02:13:33] well, the subdomain is on the public IP
[02:13:44] puppetd --test <--
[02:13:50] right
[02:13:50] that'll apply it now
[02:13:54] RECOVERY Current Load is now: OK on swarm-specialpage swarm-specialpage output: OK - load average: 0.00, 0.15, 0.32
[02:14:04] you add the IP to the instance via "Manage addresses"
[02:14:05] what's --test for?
[02:14:10] that page takes a while to open
[02:14:15] I added the IP already
[02:14:21] --test is just how puppet runs. it's stupid
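[Editor's note: `--test` is shorthand for a bundle of agent flags rather than a dry run. The expansion below is from the Puppet documentation of roughly that era; treat the exact flag set as approximate.]

```bash
# 'puppetd --test' expands to approximately:
puppetd --onetime --verbose --ignorecache --no-daemonize \
        --no-usecacheonfailure --detailed-exitcodes --no-splay
```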
[02:14:25] ok
[02:14:29] there's no reasoning with puppet
[02:14:34] :)
[02:14:48] err: Failed to apply catalog: Could not find dependency File[/etc/apache2/sites-available/integration.mediawiki.org] for Exec[update-testswarm-publish-checkout] at /etc/puppet/manifests/misc/contint.pp:280
[02:15:47] win
[02:15:56] need apache first?
[02:16:08] no, you need all of the jenkins crap, it seems
[02:16:47] see how gallium.wikimedia.org is configured
[02:17:00] add those to a puppet group for your project via "Manage puppet groups"
[02:17:06] you don't need the sudo crap
[02:17:14] you can see how gallium is configured via site.pp
[02:17:23] you also don't need the variables
[02:17:34] or the install_certificate line
[02:17:36] I'd rather just go for the manual way then
[02:17:40] * Ryan_Lane nods
[02:17:42] likely easier
[02:17:49] just copy the few lines from testswarm I need
[02:17:53] someone needs to clean up the contint stuff
[02:18:04] and make it not depend so much on all the other stuff
[02:18:22] I know how, did that locally once. All that's left that I don't know how to do is get mysql, php5 and apache on it
[02:18:52] you recommend not using the puppet packages from labs console?
[02:20:13] well, the mysql classes don't work
[02:20:27] you might be able to use apache2::php or whatever it's called
[02:20:36] our apache classes are fucking stupid
[02:20:47] testswarm ran on labs before
[02:20:53] with public IP and all
[02:21:07] yeah
[02:21:09] just anything to get it going :)
[02:21:09] it was all manual
[02:21:12] okay
[02:21:16] so apt-get ish?
[02:21:19] yep
[02:21:36] and at some point someone should clean up all the puppet config
[02:22:00] if I apt-get mysql, php and apache, will that 'work' right away or do I need to do other stuff?
[02:22:11] asking now instead of when I get to it, since I suspect you might be leaving soon-ish?
[02:22:22] well, you'll need to configure apache
[02:22:26] and make a database in mysql
[02:22:51] mysql I can figure out
[02:23:01] configure apache, is that much work?
[02:23:05] not really
[02:23:19] you can see how it's currently done in puppet
[02:23:27] ok
[02:23:37] probably a template or a file
[02:25:47] Ryan_Lane: just 'apache' or 'apache2' or 'libapache2-mod-php5'?
[02:25:53] apache2
[02:25:56] sorry, I'm really a noob at this point
[02:26:02] okay
[02:26:07] and php5, php5-mysql, php-apc
[02:29:01] okay, done
[02:29:08] "It works!" :)
[02:29:58] heh
[02:30:20] so php5-mysql installs mysql itself as well?
[02:30:24] nice
[02:30:25] no
[02:30:27] oh
[02:30:36] mysql-server does, I think
[02:30:42] and mysql-client you'll probably want too
[02:30:46] apt-get search mysql
[02:31:12] Hm.. "E: Invalid operation search"
[02:31:17] oh
[02:31:18] right
[02:31:22] apt-cache search
[02:31:24] aha
[02:31:32] * Ryan_Lane thinks apt is stupid
[02:32:06] contint.pp also lists php5-curl, mysql-server
[02:32:12] and the testswarm class "curl"
[02:32:27] since testswarm uses curl we'll need that too. So both 'curl' and 'php-curl'
[02:32:40] php5-curl ;)
[02:32:44] right
[02:32:51] and then 'curl' separately as well?
[02:32:55] umm
[02:33:01] I guess
[02:33:04] you may not need it
[02:33:07] ok
[02:33:09] that's for the actual curl command
[02:33:15] it's likely installed already anyway
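[Editor's note: a sketch consolidating the manual route discussed above. The package names are the ones named in the conversation (Ubuntu Lucid era); the vhost file, docroot, and database name are hypothetical placeholders, since the real configuration lived in puppet templates.]

```bash
# Install the stack: apache2 + mod_php, the php5 extensions mentioned,
# and the mysql server/client pair.
sudo apt-get update
sudo apt-get install apache2 libapache2-mod-php5 php5 php5-mysql php-apc \
    php5-curl curl mysql-server mysql-client

# Note: 'apt-get search' is not a valid operation; apt-cache is the search tool.
apt-cache search mysql

# "make a database in mysql" (hypothetical database name):
mysql -u root -p -e "CREATE DATABASE testswarm;"

# "configure apache": a minimal Apache 2.2-style vhost with a placeholder docroot.
sudo tee /etc/apache2/sites-available/testswarm >/dev/null <<'EOF'
<VirtualHost *:80>
    ServerName testswarm.wmflabs.org
    DocumentRoot /var/www/testswarm
</VirtualHost>
EOF
sudo a2ensite testswarm
sudo service apache2 reload
```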
[02:34:33] mysql-client also flashed by when I installed mysql-server
[02:34:47] so that's fine I guess
[02:35:07] yeah, curl is available now from the CLI
[02:35:45] "include generic::packages::git-core" I guess that's a symbolic name for a group of packages?
[02:35:48] (from the manifest)
[02:36:36] yeah
[02:36:39] it's git-core ;)
[02:36:46] oh
[02:36:55] Why wasn't that just listed in the other lists?
[02:37:01] it's safest to wrap packages in classes, because puppet doesn't complain when you include a class twice, but it does when you declare a package twice
[02:37:05] which is just stupid
[02:37:35] ok
[02:37:38] I'd really like to wrap every package in a class
[02:38:00] we often run into issues with things conflicting, because of packages that different things want to add
[02:42:47] ok. heading out
[02:42:49] * Ryan_Lane waves
[02:43:16] bye, and thanks for the help!
[13:19:13] New patchset: Mark Bergsma; "Revert "added in simply mysql server call for labs testing of blog"" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/2915
[13:19:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2915
[13:19:35] Change merged: Mark Bergsma; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/2915
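[Editor's note: the package-wrapping pattern Ryan describes at 02:37 looks roughly like the sketch below. This is a hypothetical reconstruction, not the actual generic::packages::git-core source; note also that hyphens in class names were legal in the Puppet of that era but are rejected by later versions.]

```bash
# Puppet errors if two manifests both declare package { 'git-core': },
# but tolerates any number of 'include' statements for the same class.
# Wrapping the package in a class makes it safely includable from
# multiple places. Written to a scratch file to keep the sketch
# self-contained; --noop compiles and reports without applying.
cat > /tmp/git-core-example.pp <<'EOF'
class generic::packages::git-core {
    package { 'git-core':
        ensure => latest,
    }
}

# Both includes are fine; puppet evaluates the class only once.
include generic::packages::git-core
include generic::packages::git-core
EOF
puppet apply --noop /tmp/git-core-example.pp
```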
[14:25:12] 03/02/2012 - 14:25:12 - Updating keys for ialex
[15:24:27] can someone start a mediawiki instance to run trunk code?
[15:35:55] liangent: I'm interested in the answer to that as well.
[19:37:12] I'm having issues with bastion1.pmtpa.wmflabs
[19:37:31] it spits bytes at me when I attempt to log in, rendering it rather useless
[19:38:15] (esp. as an ssh proxy)
[19:38:46] I see no Ryan_Lane, so... not sure who to poke.
[19:52:12] 03/02/2012 - 19:52:11 - Updating keys for overlordq
[20:12:00] !log bastion rebooting bastion1
[20:13:08] :o
[20:13:18] it faulted
[20:13:24] Yeah, we need to move that bot, bots-2 is a pile of poo
[20:13:47] Don't we have 2 bastion instances now?
[20:14:00] not yet
[20:14:01] brb
[20:15:11] I love mushrooms
[20:17:49] PROBLEM host: bastion1 is DOWN address: bastion1 CRITICAL - Host Unreachable (bastion1)
[20:21:31] well fuck
[20:21:40] it seems to be broken
[20:23:16] !log bastion rebooting bastion1 via nova
[20:23:18] Redeploy from puppet!
[20:23:29] Also the bot isn't here
[20:23:45] I'd fix it but you broke bastion :P
[20:23:53] * Ryan_Lane sighs
[20:23:57] It'll invalidate the key
[20:31:05] \o/
[20:32:26] !log bastion restarted instance by restarting libvirt-bin service on virt4, then running "virsh destroy instance-000000ba", then rebooting via nova.
[20:32:59] ok. that's much better
[20:33:19] RECOVERY host: bastion1 is UP address: bastion1 PING OK - Packet loss = 0%, RTA = 10.47 ms
[20:36:56] * Damianz paws labs-home-wm
[20:37:24] Come on bots-2...
[20:37:29] * Damianz throws some shit at it
[20:39:14] Meh, I might just go back to watching fosdem videos for an hour while it gives me an ssh session.
[20:39:44] is it dead?
[20:39:48] or just really overloaded?
[20:41:08] It's alive enough to respond to ping, but not to give me an ssh session after waiting 5min D:
[20:41:16] Might just reboot it
[20:43:06] !log bots restarted bots-2
[20:43:09] ..
[20:43:18] !log bots restarted bots-2
[20:43:21] Oh screw you
[20:44:23] !log bots restarted bots-2
[20:44:29] No such file.. ok
[20:44:43] !log bots restarted bots-2
[20:44:44] Logged the message, Master
[20:45:00] !log bots Also created /var/run/adminbot, the lack of which was breaking bot startup - owner please fix
[20:45:02] Logged the message, Master
[20:45:16] * Damianz puts it back under init
[20:45:49] The weird thing is the instance is laggy as hell with a 0 load... could you check the node it's on?
[20:45:54] And he left :P
[20:52:25] heh.
[20:52:28] I'll try again.
[21:19:20] Ryan_Lane: Could you check how laggy the node bots-2 is on is? The instance is uber laggy with a load of 0 and hardly any disk io =/
[21:21:47] hm
[21:22:31] I wonder if the nfs server is overloaded
[21:22:50] bots-2 isn't laggy for me
[21:23:16] maybe the bot stopped running?
[21:24:58] I really need to move the labs bots to the new instance
[21:25:15] hmm
[21:26:01] It's weird because sometimes I can get straight in and other times it just won't give me a bash session... might try logging in with /bin/sh next time it does that and see what it's doing.
[21:27:31] hm. that makes no sense
[21:27:37] I'm getting no lag at all
[21:28:11] I *really* need to get that new node in
[21:28:32] all of the hosts are swapping like crazy
[21:28:39] See, it's letting me right in atm, but before it was like bleh
[21:29:02] Mmm, servers running VMs that are swapping isn't a good thing
[21:29:11] bleh: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Virtualization+cluster+pmtpa&h=virt4.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[21:29:32] Ouch
[21:29:57] That's swapping quite a bit
[21:30:21] yeah
[21:30:28] wait io is low, overall
[21:30:52] it's not a major problem to swap if it isn't causing wait time
[21:31:01] since it's likely swapping unused pages
[21:32:21] spam
[21:32:51] ?
[21:33:08] 5 lines :P
[21:33:58] 5 lines?
[21:34:13] Of you dying, connecting, changing hosts/nick etc
[21:34:17] I got disconnected, I may have missed what you are talking about
[21:34:18] oh.
[21:34:19] heh
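[Editor's note: the recovery logged at 20:32 reconstructs roughly as below. The hostname and instance ID are verbatim from the !log entry; the exact nova client invocation is not shown in the log, so that line is an assumption.]

```bash
# On the compute node (virt4): revive the wedged libvirt daemon,
# hard-stop the stuck domain, then reboot the instance via OpenStack.
service libvirt-bin restart
virsh destroy instance-000000ba   # "destroy" is a hard power-off, not deletion
nova reboot bastion1              # assumed CLI form; the log only says "rebooting via nova"
```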
[21:34:53] Hmm
[21:35:21] * Damianz goes to argue with glance
[21:38:34] heh
[21:39:10] This is the point where I find all my puppet manifests that don't have ubuntu support :(
[21:45:50] seems memory usage has increased overall in the last month, but not swap usage, really: http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[21:46:00] it's likely just swapping out unused pages
[21:46:37] that's a good trend. shows KVM isn't a total POS :)
[21:51:29] CPU usage is basically nothing
[21:51:57] I find KVM to be quite friendly to servers, I run like 5 VMs on my desktop and hardly notice.
[21:53:28] yeah. it seems we can continue to grow
[21:53:37] iowait seems to be the biggest problem
[21:54:15] that may be gluster's fault
[21:54:37] this is kind of bad: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Virtualization+cluster+pmtpa&m=load_one&s=by+name&mc=2&g=cpu_report
[21:55:29] That looks like it's slowly trending upwards as well
[21:58:32] yeah
[21:58:35] which is bad
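[Editor's note: the swap-versus-iowait distinction drawn above (swapping is mostly harmless unless it causes wait time) can be checked with standard tools. These commands are not from the log; they are just the usual way to see the same numbers on a host.]

```bash
# si/so columns show pages actually swapped in/out per second; the 'wa'
# column shows CPU time stalled on I/O. Cold pages being swapped out
# show up as used swap with si/so near zero and low wa.
vmstat 5
# Per-device utilization and await times, to pin iowait on a device:
iostat -x 5
# Point-in-time memory and swap totals:
free -m
```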