[00:05:37] marktraceur, Reedy, I take it we have reason to believe that etherpad.wmflabs.org could access the outside sometime in the recent past? [00:07:54] No idea [00:16:16] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:21:00] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [00:22:48] Reedy: When marktraceur sayd 'Last known working date...' was he talking about the last time he could reach the outside from the etherpad instance? [00:25:28] http://deployment.wikimedia.beta.wmflabs.org/ 404 error [00:25:35] http://beta.wmflabs.org [00:25:54] yeah [00:29:40] RECOVERY Current Users is now: OK on bastion-restricted1 i-0000019b output: USERS OK - 5 users currently logged in [00:41:11] andrewbogott: Yes, Thursday 2012-06-21 at 14:40 GMT-7 [00:41:20] ok, email sent. [00:41:29] I saw! Thanks for that [00:41:48] That list needs a little action anyway :) [00:50:00] !log integration created instance integration-apache1 to use as sandbox for setting up TestSwarm+BrowserStack [00:50:01] Logged the message, Master [00:53:37] wtf? [00:53:38] krinkle is not allowed to run sudo on i-000002eb. This incident will be reported. [00:53:44] LOL [00:53:45] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:54:01] <^demon> It sends Ryan an SMS ;-) [00:54:25] PROBLEM Current Users is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:05] PROBLEM Disk Space is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:13] ^demon: How come I can't root? I own this instance [00:55:16] Krinkle: that will go on your permanent record [00:55:42] I'm following the documentation xD [00:55:45] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:47] <^demon> Krinkle: No clue. [00:56:30] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - Socket timeout after 10 seconds [00:57:35] PROBLEM Total Processes is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:58:15] PROBLEM dpkg-check is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:59:05] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [01:10:55] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [01:18:55] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 17% free memory [01:20:54] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [01:28:29] did andrewbogott get sorted? [01:31:43] marktraceur: andrewbogott: did ya'll get sorted? i'm still catching up [01:32:27] jeremyb: Not sorted, but moved the discussion to the labs list. [01:32:45] reading there now [01:34:06] andrewbogott: can you maybe try to find it broken on an instance I already have access to? or give me access to one? ;) [01:34:45] and what's an example test? did you try my curl? [01:35:10] damn you > No Nova credentials found for your account. 
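The "example test" jeremyb asks about is never quoted in the log, so the following is only a minimal sketch of the kind of outbound-connectivity checks being traded here; the hostnames are illustrative, and the telnet line is the one andrewbogott says he ran:

    # run from inside the suspect instance (e.g. the etherpad box)
    curl -sI http://www.google.com/ | head -n1     # can we complete an HTTP request at all?
    telnet google.com 80                           # the check andrewbogott mentions trying
    nc -vz gerrit.wikimedia.org 29418              # probe a single arbitrary TCP port
    host wikipedia.org                             # rule out a DNS-only failure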
[01:35:23] maybe i have to go review ryan's patch now ;) [01:35:39] jeremyb: Are you in the testlabs project? [01:35:46] i think not [01:37:32] 06/27/2012 - 01:37:32 - Created a home directory for jeremyb in project(s): testlabs [01:37:41] andrewbogott: https://labsconsole.wikimedia.org/wiki/User:Jeremyb has a list ;) [01:37:45] or that works [01:37:54] which instance now? [01:37:56] well... now you are. Have a look at utils-abogott [01:38:33] 06/27/2012 - 01:38:33 - User jeremyb may have been modified in LDAP or locally, updating key in project(s): testlabs [01:38:52] huh? [01:38:58] labs-home-wm is weird [01:39:47] does each project have it's own private netblock? not that it matters. just wondering [01:40:54] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [01:41:12] andrewbogott: 27 01:34:45 < jeremyb> and what's an example test? did you try my curl? [01:41:52] hah, > The last Puppet run was at Thu Mar 22 21:40:58 UTC 2012 (138480 minutes ago). [01:45:54] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [01:45:59] jeremyb: I tried 'telnet google.com 80' and various other things. [01:46:04] andrewbogott: have we tried just booting a problem instance? [01:46:25] I tried rebooting utils-abogott [01:46:36] up less than 2 hrs [01:47:12] did you use the web's reboot? [01:47:20] vs. executing `reboot` [01:50:20] yeah, rebooted via labsconsole. [01:51:45] huh [01:51:58] so, nc: connect to gerrit.wikimedia.org port 22 (tcp) failed: No route to host [01:52:07] but also, Connection to gerrit.wikimedia.org 29418 port [tcp/*] succeeded! [01:52:24] odd ;) [01:52:33] (i guess intentional?) [01:52:41] didn't used to be i think [01:52:45] anyway, still digging [01:53:38] andrewbogott: also, have we tried making a brand new instance? [01:55:07] damnit, Thehelpfulone ping [01:55:28] now every list is named starting with "A " (for sample set=2) [01:55:39] and some clients assume that A is a first name [01:55:55] (or is that mut ante's doing?) [01:56:05] * jeremyb didn't look into it *that* much yet [01:57:54] jeremyb: Haven't tried making a new one. [01:58:13] jeremyb: I suspect that what we're seeing is intentional, so might be best to wait for Ryan or Faidon to confirm or deny. [01:59:17] right, but why? one obvious suspect is that you haven't run puppet in so long [01:59:24] but there's others [01:59:38] anyway, have you checked stuff like iptables? [01:59:42] can you? 
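jeremyb's iptables question is answered only briefly below; "checking stuff like iptables" on an instance would normally mean something like the following, all standard tools rather than anything labs-specific:

    sudo iptables -L -n -v        # filter rules plus packet and byte counters
    sudo iptables -t nat -L -n    # NAT table, in case traffic is being rewritten locally
    ip route                      # is there a default route at all?
    cat /etc/resolv.conf          # resolver configuration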
[02:00:03] iptables is wide open on the gerrit box [02:00:29] * jeremyb tries making a new box [02:04:44] labsconsole is sloooooow [02:06:34] one difference i see is you're on oneiric [02:14:03] PROBLEM Current Users is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:14:33] PROBLEM Disk Space is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:15:23] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:15:43] PROBLEM Total Processes is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:16:40] PROBLEM dpkg-check is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:16:40] PROBLEM Total Processes is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:17:00] PROBLEM dpkg-check is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:18:20] PROBLEM Current Load is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:18:35] labs-nagios-wm: quiet [02:18:36] ;) [02:18:46] Emw: getting settled in? [02:19:28] huh [02:19:29] jeremyb: gradually [02:20:22] wow, i forgot how long it takes to boot a host! [02:21:33] i want serial console! [02:25:40] PROBLEM Current Users is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:25:40] PROBLEM Free ram is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:25:40] PROBLEM Current Load is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:26:00] PROBLEM Disk Space is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:26:13] i've got a new extension that i'm trying to get set up on labs. i just got developer access. reading over https://labsconsole.wikimedia.org/wiki/Help:Contents, it seems like the high-level workflow to getting an instance/project properly set up might be 1. create project, 2. create security group, 3. create instance. is that the gist? if not, what is? [02:26:43] you probably don't have to deal with the security group part [02:26:50] but otherwise yes [02:26:57] unless it fits into an existing project [02:27:07] what's the extension? [02:27:18] where's it already in use? [02:28:37] RECOVERY Current Load is now: OK on network-test1 i-000002ec output: OK - load average: 0.79, 1.56, 1.33 [02:29:07] RECOVERY Current Users is now: OK on network-test1 i-000002ec output: USERS OK - 1 users currently logged in [02:29:37] RECOVERY Disk Space is now: OK on network-test1 i-000002ec output: DISK OK [02:29:46] jeremyb: it's a new media handling extension that that will allow users with webgl-enabled browsers to manipulate 3D models of large biological molecules. more here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html [02:30:17] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [02:30:20] it's not already in use (just on my machine, and a random other machine) [02:30:42] andrewbogott: so, network-test1 is precise and it gets out to the world fine [02:31:05] andrewbogott: can you check iptables? [02:31:33] (iptables -L) [02:31:37] RECOVERY Total Processes is now: OK on network-test1 i-000002ec output: PROCS OK: 82 processes [02:31:46] jeremyb: On which machine? 
[02:31:53] andrewbogott: your util [02:32:07] RECOVERY dpkg-check is now: OK on network-test1 i-000002ec output: All packages OK [02:32:10] i'm still booting oneiric [02:33:53] Nothing interesting... [02:34:02] It's also getting late here, so I'm not likely to be of much help. [02:34:03] that's yeah, exactly [02:34:18] hey, i think you're an hour earlier than me! ;P [02:34:34] Emw: coming to NYC? DC? [02:34:44] Emw: know anyone up in boston with a spare JTAG? [02:35:04] (for guruplug/dreamplug) [02:35:22] i'll be in DC 7/9 through 7/15 [02:36:47] sorry, boston is out of JTAGs [02:37:03] hah [02:39:27] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [02:40:39] andrewbogott: what's the current state of puppet? we still have a test branch? [02:40:49] how do i request a new project? my basic goal is to get this extension working on a staging environment that's close to wikimedia's. it involves uploading files, which involves some processing upon completing the upload to transform a plaintext file into a static image (not sure how relevant that is). this would make a new project appropriate, right? [02:40:56] (i.e. what are my boxen building off of) [02:41:06] Emw: maybe andrewbogott can make one if he's not asleep [02:41:34] Emw: pick a name. what's the extension called? [02:41:50] PDBHandler [02:42:08] then that's the name of your new project [02:42:33] fantastic [02:48:35] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM dpkg-check is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:42] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:22] ldap being slow? [02:51:43] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 9.00, 11.85, 7.69 [02:53:16] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.86, 7.18, 6.51 [02:53:17] RECOVERY dpkg-check is now: OK on grail i-000002c6 output: All packages OK [02:55:37] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:45] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:57:24] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:29] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:34] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.71, 3.05, 4.87 [02:58:34] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.24, 2.44, 2.36 [02:58:34] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [02:58:34] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 70% free memory [02:58:34] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [02:58:35] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [02:58:35] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 112 processes [02:59:07] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:09] * jeremyb wonders who can make Emw a project... Reedy ? [02:59:14] or else i guess europe [03:00:55] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [03:00:55] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [03:00:55] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 93% free memory [03:00:55] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.48, 2.99, 2.31 [03:00:55] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 83 processes [03:02:15] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 0.16, 1.47, 1.21 [03:02:15] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 103 processes [03:02:20] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 80% free memory [03:02:20] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [03:02:20] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [03:03:35] jeremyb, based on that description i gave, this would be a 'specific' project and not a 'global' one, right? just looking over https://labsconsole.wikimedia.org/wiki/Help:Terminology [03:03:46] huh? [03:03:49] *click* [03:03:55] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [03:04:29] errr, maybe that's how things are supposed to be. 
certainly it's not reality ;) [03:04:43] anyway, yes you would be specific [03:06:25] PROBLEM Free ram is now: UNKNOWN on network-test2 i-000002ed output: NRPE: Unable to read output [03:06:25] RECOVERY Current Users is now: OK on network-test2 i-000002ed output: USERS OK - 1 users currently logged in [03:06:25] RECOVERY Current Load is now: OK on network-test2 i-000002ed output: OK - load average: 0.13, 1.20, 0.85 [03:08:05] RECOVERY Total Processes is now: OK on network-test2 i-000002ed output: PROCS OK: 73 processes [03:08:10] RECOVERY Disk Space is now: OK on network-test2 i-000002ed output: DISK OK [03:11:25] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.54, 1.08, 3.36 [03:16:31] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.11, 1.00, 2.66 [03:30:48] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:32:32] network-test1: [ 4979.188031] BUG: soft lockup - CPU#0 stuck for 47s! [gmond:23238] [03:34:57] PROBLEM Disk Space is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Free ram is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Current Users is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Total Processes is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:02] PROBLEM Current Load is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:02] PROBLEM dpkg-check is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:57] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 14% free memory [03:36:36] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:33] RECOVERY Disk Space is now: OK on redis1 i-000002b6 output: DISK OK [03:39:33] RECOVERY Current Users is now: OK on redis1 i-000002b6 output: USERS OK - 0 users currently logged in [03:39:33] RECOVERY Free ram is now: OK on redis1 i-000002b6 output: OK: 85% free memory [03:39:33] RECOVERY Total Processes is now: OK on redis1 i-000002b6 output: PROCS OK: 88 processes [03:39:38] RECOVERY Current Load is now: OK on redis1 i-000002b6 output: OK - load average: 0.39, 2.72, 2.19 [03:39:38] RECOVERY dpkg-check is now: OK on redis1 i-000002b6 output: All packages OK [03:40:08] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [03:41:18] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory [03:42:18] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 78 MB (5% inode=53%): [03:46:55] it's a Jamesofur! [03:47:03] It's a jeremyb [03:47:05] ! 
[03:47:05] There are multiple keys, refine your input: !, $realm, $site, :), access, account, account-questions, accountreq, addresses, afk, alert, amend, ask, b, bang, bastion, blehlogging, blueprint-dns, bot, bots, broken, bug, bz, change, console, credentials, cs, damianz, damianz's-reset, db, demon, deployment-prep, docs, documentation, domain, epad, etherpad, extension, gerrit, gerrit-wm, ghsh, git, git-puppet, gitweb, group, hashar, help, hexmode, hyperon, info, initial-login, instance, instance-json, instancelist, instanceproject, keys, labs, labsconf, labsconsole.wiki, labs-home-wm, labs-morebots, labs-nagios-wm, labs-project, leslie's-reset, link, linux, load, load-all, logbot, mac, magic, manage-projects, meh, monitor, morebots, nagios, nagios.wmflabs.org, nagios-fix, newgrp, new-labsuser, new-ldapuser, nova-resource, openstack-manager, origin/test, os-change, osm-bug, pageant, password, pastebin, pathconflict, petan, ping, pl, pong, port-forwarding, project-access, project-discuss, projects, puppet, puppetmaster::self, puppet-variables, putty, pxe, queue, quilt, report, requests, resource, revision, rights, rt, Ryan, ryanland, sal, SAL, security, security-groups, sexytime, socks-proxy, ssh, start, stucked, sudo, terminology, test, Thehelpfulone, unicorn, whatIwant, whitespace, wiki, wikitech, windows, wl, wm-bot, [03:47:13] have you come to heal all the opsen? [03:47:19] they need tea! [03:48:26] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory [03:51:52] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [03:54:26] jeremyb: thanks for all the help (i'm off for the night) [03:54:48] * jeremyb too momentarily [03:54:49] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [03:54:54] giving up on andrewbogott [03:55:04] eh? [03:55:22] Emw: see labs-l. most recent msg there i think [03:55:28] was trying to figure it out [03:55:33] ah, gotcha [03:55:39] you should subscribe if you're not on it [03:56:09] i just tried to subscribe and got notified that i apparently already am [03:56:21] hah [04:06:42] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:06:52] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:08:22] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [04:10:31] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [04:11:51] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:11:51] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:13:21] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:14:51] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:19:51] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [05:15:20] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [05:50:30] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [06:27:13] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:27:13] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:09] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [06:32:18] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 13% free memory [06:32:18] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 5.13, 5.63, 3.33 [06:32:18] PROBLEM Total Processes is now: WARNING on deployment-thumbproxy i-0000026b output: PROCS WARNING: 155 processes [06:32:28] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [06:33:07] PROBLEM Current Load is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Disk Space is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Current Users is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Total Processes is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:38] PROBLEM Free ram is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:37:57] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.49, 7.35, 4.78 [06:37:57] RECOVERY Current Load is now: OK on grail i-000002c6 output: OK - load average: 0.43, 1.95, 1.51 [06:37:57] RECOVERY Disk Space is now: OK on grail i-000002c6 output: DISK OK [06:37:57] RECOVERY Current Users is now: OK on grail i-000002c6 output: USERS OK - 0 users currently logged in [06:37:58] RECOVERY Total Processes is now: OK on grail i-000002c6 output: PROCS OK: 101 processes [06:38:03] RECOVERY Free ram is now: OK on grail i-000002c6 output: OK: 85% free memory [06:41:02] PROBLEM Free ram is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:22] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:54] RECOVERY Total Processes is now: OK on deployment-thumbproxy i-0000026b output: PROCS OK: 150 processes [06:43:03] PROBLEM Current Users is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:03] PROBLEM dpkg-check is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:15] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:55] PROBLEM Disk Space is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:55] PROBLEM Current Load is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:44:33] RECOVERY Disk Space is now: OK on deployment-transcoding i-00000105 output: DISK OK [06:45:23] PROBLEM Total Processes is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:27] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:38] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:38] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:46:38] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:36] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [06:48:41] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:37] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:34] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [06:58:49] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:55] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:14] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:15] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:29] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:40] PROBLEM Total Processes is now: WARNING on deployment-thumbproxy i-0000026b output: PROCS WARNING: 151 processes [07:01:15] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:01:15] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:08] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 78 MB (5% inode=53%): [07:02:19] PROBLEM Free ram is now: CRITICAL on gluster-2 i-000002e0 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Total Processes is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:45] PROBLEM dpkg-check is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:45] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Disk Space is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Current Users is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Current Load is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:02:50] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [07:03:29] RECOVERY Disk Space is now: OK on gluster-4 i-000002e4 output: DISK OK [07:03:29] PROBLEM Current Load is now: WARNING on gluster-4 i-000002e4 output: WARNING - load average: 3.99, 6.66, 6.30 [07:03:39] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 17.00, 19.96, 13.14 [07:03:54] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:05:55] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 3.34, 6.24, 5.88 [07:05:56] RECOVERY Total Processes is now: OK on gluster-4 i-000002e4 output: PROCS OK: 85 processes [07:06:01] PROBLEM Free ram is now: UNKNOWN on gluster-4 i-000002e4 output: NRPE: Unable to read output [07:06:01] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [07:06:01] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 113 processes [07:06:06] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 69% free memory [07:06:06] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [07:06:06] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 5.02, 5.21, 4.23 [07:06:06] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:07:00] PROBLEM Free ram is now: UNKNOWN on gluster-2 i-000002e0 output: NRPE: Unable to read output [07:07:00] RECOVERY Current Users is now: OK on gluster-4 i-000002e4 output: USERS OK - 0 users currently logged in [07:07:00] RECOVERY dpkg-check is now: OK on gluster-4 i-000002e4 output: All packages OK [07:07:29] PROBLEM host: integration-apache1 is DOWN address: i-000002eb PING CRITICAL - Packet loss = 100% [07:08:44] RECOVERY Current Load is now: OK on gluster-4 i-000002e4 output: OK - load average: 0.16, 2.51, 4.58 [07:08:44] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:08:45] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:08:45] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 90 processes [07:08:54] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:08:59] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 2.88, 5.45, 5.07 [07:09:05] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:10] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:10:26] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.84, 3.43, 4.75 [07:11:55] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.26, 3.07, 3.98 [07:11:55] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [07:11:55] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [07:11:55] RECOVERY Total Processes is now: OK on network-test1 i-000002ec output: PROCS OK: 85 processes [07:12:00] RECOVERY dpkg-check is now: OK on network-test1 i-000002ec output: All packages OK [07:12:00] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 76 processes [07:12:05] RECOVERY Disk Space is now: OK on network-test1 i-000002ec output: DISK OK [07:12:05] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory [07:12:05] RECOVERY Current Load is now: OK on network-test1 i-000002ec output: OK - load average: 0.65, 3.26, 3.82 [07:12:05] RECOVERY Current Users is now: OK on network-test1 i-000002ec output: USERS OK - 0 users currently logged in [07:12:05] RECOVERY host: integration-apache1 is UP address: i-000002eb PING OK - Packet loss = 0%, RTA = 0.74 ms [07:13:55] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.04, 2.01, 3.67 [07:13:55] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.33, 2.79, 3.58 [07:13:55] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 83% free memory [07:13:55] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK [07:13:55] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in [07:13:56] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 105 processes [07:14:00] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK [07:21:05] good morning [07:23:20] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.15, 0.68, 4.21 [07:32:00] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.57, 0.83, 3.75 [07:50:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:21:57] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:27:18] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 12% free memory [08:32:18] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [08:47:36] i-000002eb [08:47:36] hashar: ok :) [08:47:38] https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-000002eb [08:47:55] btw, it is good to see you in the morning :-] [08:48:37] yeah [08:48:42] Also on https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=integration&instanceid=i-000002eb [08:48:47] so hmm [08:48:48] There is a huge flood of this mess everlasting: [08:48:52] I am not part of the integration project :-] [08:48:55] Jun 27 07:37:48 i-000002eb dhclient: DHCPREQUEST of 10.4.0.102 on eth0 to 10.4.0.1 port 67 [08:48:55] Jun 27 07:37:48 i-000002eb dhclient: DHCPACK of 10.4.0.102 from 10.4.0.1 [08:48:56] Jun 27 07:37:48 i-000002eb dhclient: bound to 10.4.0.102 -- renewal in 43 seconds. [08:48:58] Jun 27 07:38:19 i-000002eb nrpe[3463]: Host 10.4.0.34 is not allowed to talk to us! 
[08:48:59] Jun 27 07:38:20 i-000002eb nrpe[3465]: Host 10.4.0.34 is not allowed to talk to us! [08:48:59] Jun 27 07:38:20 i-000002eb nrpe[3467]: Host 10.4.0.34 is not allowed to talk to us! [08:49:00] etc. [08:49:05] yeah dhclient is a mess [08:49:12] we should make syslog to redirect that to another file :-( [08:49:35] dhcp is made every minute or so to make sure instances acquires IP asap (I think) [08:49:46] but it doesn't work? [08:50:02] well it does! bound to 10.4.0.102 -- renewal in 43 seconds. [08:50:03] hashar: you are a member of the project, actually [08:50:04] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration [08:50:10] ahhh [08:50:10] "s not allowed to talk to us!" [08:50:14] that is the damn project filter again [08:50:17] set to false by default [08:50:21] indeed, same here [08:50:28] also missed it the first time [08:50:39] need to change the default :-] [08:51:52] so, did you get root meanwhile? [08:51:55] Krinkle? [08:52:01] no [08:52:16] mutante: what, you mean on the labs instnace? [08:52:28] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:52:42] Krinkle: you are a sysadmin :/ [08:53:06] Krinkle: yes, i meant the sudo issue [08:53:12] nope, not solved yet [08:53:13] I can't sudo either [08:53:33] i heard that one from hundfred a little while ago [08:53:37] on a fresh instance [08:53:43] maybe puppet did not work fully on that instance and is missing the labs sudoer file? [08:53:54] he said he could sudo, and then it stopped working [08:54:01] and i dont think he has been removed from groups [08:54:16] I know we have a conflict for two puppet class which try to both install /etc/sudoers [08:54:24] one is provided by base:: the other by some apache:: package [08:54:32] so that could prevent installing the correct shudders file [08:54:45] or sudo::labs_project is broken? [08:54:52] ew..I don't have any puppet stuff enabled yet on that instance. [08:54:55] should I? [08:54:59] nop indeed [08:55:03] (to get "it" working) [08:55:19] it might be something general with security groups as well [08:55:19] classes: base -- ldap::client::wmf-test-cluster -- exim::simple-mail-sender -- sudo::labs_projec [08:55:39] also see labs-l, mail from Andrew Bogott [08:55:42] ohh [08:55:45] talking about security groups [08:55:47] we might need a sudo policy https://labsconsole.wikimedia.org/wiki/Special:NovaSudoer [08:56:05] doing that [08:56:31] !log integration Created ALL/ALL sudo policy for Krinkle and I [08:56:33] Logged the message, Master [08:56:39] * Krinkle was just about to [08:56:48] I did select "default" during instance creation [08:56:52] looks like that didn't work [08:56:54] that sounds like default policies have been removed [08:57:00] for sudo and firewall rules [08:57:22] root@integration-apache1:~# [08:57:24] I love our labs [08:57:34] somehow the change has been applied instantly [08:58:01] oh it uses ldap [08:58:02] damn [08:58:06] that project is really awesome [08:58:29] yea, LDAP [09:03:44] hashar: So here is the plan I have for now: Set up TestSwarm+BrowserStack (right from github and npm) without gerrit/jenkins. [09:03:46] Get that working and then either A) puppetize is for live (likely never gonna happen), B) have production jenkins submit testswarm to that instead. I know B) sounds crazy, but it makes sense because it doesn't rely yet (one-way pushes) on it and its better than what we have now. [09:03:46] it doesn't need the mw-fetcher because TestSwarm doesn't run tests, it distributes them. 
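hashar's suggestion above about moving the dhclient chatter out of the main syslog is not acted on in this log; a sketch of how it could be done with an rsyslog property filter follows. The drop-in path and file name are assumptions, and the legacy "& ~" discard syntax should be checked against the rsyslog version on the instance:

    cat <<'EOF' | sudo tee /etc/rsyslog.d/30-dhclient.conf
    # route dhclient messages to their own file and stop processing them further
    :programname, isequal, "dhclient"    /var/log/dhclient.log
    & ~
    EOF
    sudo service rsyslog restart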
[09:03:46] urls all point to clones on int.mw.org [09:03:46] B) is exactly what I thought about [09:03:46] but you definitely want to use puppet [09:03:47] that is a great to document the setup [09:03:49] sure [09:03:51] and let you easily reinstall a machine if it is screwed beyond repair [09:03:55] I can definitely help there [09:03:57] as well as ops [09:04:09] yep [09:04:22] can you easily install testswarm from source? [09:04:31] this way we could just get rid of the testswarm debian package [09:04:32] Yes, 3 commands and its done [09:04:44] so we will want a github puppet class :-] [09:04:54] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [09:05:00] which can just be given a tag / sha1 and the github project name [09:05:13] there is probably one existing in puppet forge [09:05:18] hashar: Right now I want to avoid puppet so I can work freely, and face the real problems, solve them, and then document it. [09:05:38] you could do all of that with puppetmaster::self [09:05:39] hashar: so right now I want to get php, mysql, apache stuff running. Where do I start? apt-get or is there a default puppet thing I can use from labs (that doesn't mess up the ability to install non-puppet stuff) [09:05:44] let you have a local puppet repo on the instance [09:05:55] so you can just edit the .pp file, save it, then run puppetd -tv [09:06:20] Krinkle: you can select puppet classes for php, mysql, apache from labsconsole and apply them, and it would not mean it removes software you still install with apt manually [09:06:25] Also, I'll need a public-IP, labsconsole gives me a failure message when I try to allocate one [09:06:32] the puppet class for testswarm should be in manifests/misc/contint.pp [09:06:35] that is because we are out of IPs [09:06:42] :( [09:06:45] I removed 3 others from retired proejcts [09:06:54] should give me at least one? [09:06:56] let me check the quota for your project [09:07:04] ah, there is a quota per project [09:07:05] what is the project name again? [09:07:06] makes sens :) [09:07:12] mutante: "integration" [09:07:19] probably has 0 [09:08:00] yea, i shall raise it to 1 in a moment, but independent of that i heard that we ran out of IPs in the labs range from Ryan recently [09:08:41] unsure about the "giving back removed ones", good question [09:10:41] Krinkle: yes, integration has "floating_ips: 0" quota, upping [09:11:30] I think Ryan talked about setting up a shared reverse proxy for labs [09:11:30] !log integration - raised floating_ips quota from 0 to 1 [09:11:31] Logged the message, Master [09:11:59] that would help saving up public IP [09:13:09] fyi how this works: https://labsconsole.wikimedia.org/wiki/Help:Nova-manage [09:14:10] !log deployment-prep upgrading deployment-transcoding [09:14:11] Logged the message, Master [09:15:51] mutante: do I / should I have the ability to raise floating_ips myself for next time? [09:16:16] i.e. 
what access does it require to use those commands [09:16:59] !log deployment-prep -transcoding : dpkg --purge linux-image-2.6.32-37-virtual linux-image-2.6.32-318-ec2 linux-image-2.6.32-34-virtua [09:17:00] Logged the message, Master [09:17:12] the old kernels are taking too much space for the little / ;-D [09:17:24] RECOVERY Disk Space is now: OK on deployment-transcoding i-00000105 output: DISK OK [09:17:26] Krinkle: it requires root on virt1, so i think the quota itself is just ops, but you should be able to use Special:NovaAddress to actually assign IPs and remove old ones [09:17:45] yeah quota is root only [09:17:49] Krinkle: wanna try again now? [09:17:50] mutante: still gives me "Failed to allocate new public IP address." - takes a few minutes to take effect? [09:18:11] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=allocate&project=integration [09:18:45] yea, i get "Netadmin required", as intended, but i can add myself... [09:20:33] 06/27/2012 - 09:20:32 - Created a home directory for dzahn in project(s): integration [09:21:10] !log integration Enabling group puppet "puppetmaster::self" on integration-apache1 [09:21:11] Logged the message, Master [09:21:32] Krinkle: do you still know the IPs you used before on other projects but then removed? [09:21:33] 06/27/2012 - 09:21:33 - User dzahn may have been modified in LDAP or locally, updating key in project(s): integration [09:21:39] mutante: no [09:22:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [09:23:08] hashar: So puppetmaster::self is the only puppet thing I do from labsconsole and the rest locally? https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource:I-000002eb&curid=1410&diff=4498&oldid=4485 [09:23:28] should be [09:23:31] k [09:23:36] I think the puppet repo is somewhere in /var/lib/git [09:23:47] and there should be some symbolic links in /etc/puppet too [09:24:28] http://lists.wikimedia.org/pipermail/labs-l/2012-June/000239.html --> https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [09:24:42] I have never used that though [09:26:44] PROBLEM host: deployment-transcoding is DOWN address: i-00000105 CRITICAL - Host Unreachable (i-00000105) [09:26:51] I rebooted that one [09:27:45] mutante: the labs nagios can not connect to my apaches for some reason http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail [09:27:57] mutante: do you have any knowledge about the nagios in labs? [09:28:24] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.95, 5.75, 5.25 [09:29:03] yes,if it is about puppet freshness monitoring, yes if it is general Nagios, no if it is the "how do nagios configs get auto-created there" [09:29:18] hehe [09:29:30] apache is definitely running on deployment-apache30 [09:29:52] that might be a firewall rule, but there is a web security policy which allow port 80/443 from 0.0.0.0/0 [09:29:58] hashar: "sudo puppetd -tv" is failing [09:30:09] Krinkle: dpaste.org the output ? 
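hashar's description of the puppetmaster::self setup above boils down to an edit-and-apply loop on the instance itself. A rough sketch, with the repository path an assumption based on his "somewhere in /var/lib/git" recollection:

    # after enabling puppetmaster::self from labsconsole and letting it converge
    cd /var/lib/git/operations/puppet      # assumed location of the local puppet clone; verify on the instance
    sudoedit manifests/misc/contint.pp     # the file hashar says the testswarm class belongs in
    sudo puppetd -tv                       # apply the local change and watch the run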
[09:30:14] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 71 MB (5% inode=57%): [09:30:15] indeed, from a glance an Nagios, it looks like firewalling / security group changes [09:30:27] hashar: http://cl.ly/1t2u042W1J1W2i0n1B1W [09:30:36] hmm [09:30:41] that URL looks like a virus [09:31:14] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [09:31:35] hashar: looks like firewalling / security group, yea [09:31:54] hashar: why? [09:32:07] not much different that an average bit.ly link [09:32:15] hashar: socket timeout, not actually refused / down [09:33:04] Krinkle: the code snippet is scope.lookupvar('puppetmaster::config').sort. [09:33:12] hashar: and this one on apache31: NRPE: Command 'check_ram' not defined , is a bit weird, because that should be defined everywhere by default (and usually is) [09:33:21] Krinkle: so it looks like the variable `puppetmaster::config` is not set [09:33:48] mutante: that might because puppet is broken on my boxes. I got two puppet class conflicting on installing /etc/sudoers [09:33:59] mutante: need Ryan to review a change I made [09:34:07] hashar: well, I didn't create any of those files. I just enabled the checkbox and ran the command. Odd that this has a parse error in it? [09:34:13] hashar: ok, makes sense [09:34:51] Krinkle: not really a parse error. Just that the object does not have a sort method because that is a null object [09:35:06] (I think) [09:35:16] I don't know. I just want it to "Just work" :P [09:36:03] so anyway, looks like I'll be puppetizing later. Gonna get going now (one way or another) - ow, I still need the public IP for later though (other wise I can't connect to it from browserstack);. [09:36:24] RECOVERY host: deployment-transcoding is UP address: i-00000105 PING OK - Packet loss = 0%, RTA = 0.55 ms [09:36:36] mutante: Any luck trying from your account? [09:37:30] Krinkle: not yet, i got the same error and now looking at the list of all IPs in the pool [09:37:39] ok [09:38:46] there is a whole new range already being prepared but not assigned to pool yet [09:39:05] but there might be reasons for that, i know there was work on it the other day [09:41:00] !log bastion - re-adding unassigned IP 208.80.153.222 to virt2 pool [09:41:01] mutante: do you have wiki-rights to see deleted contributions? [09:41:02] Logged the message, Master [09:41:13] mutante: if so, you can find out the IPs I disallocated [09:41:13] !log integration - Allocated new public IP address: 208.80.153.222 [09:41:14] Logged the message, Master [09:41:17] Krinkle: ^:) [09:41:24] https://labsconsole.wikimedia.org/wiki/Special:Contributions/Krinkle [09:41:43] I can't see it myself though [09:41:45] mutante: thx [09:42:20] Krinkle: i dont have wiki powers, no staff flag yet afaict [09:42:23] !log deployment-prep deleted deployment-thumbproxy instance. We are not going to replicate the production thumbnailing architecture [09:42:25] Logged the message, Master [09:42:41] mutante: staff shouldn't apply to labsconsole though, its not in wmf-centralauth [09:42:49] oh, on labsconsole. never mind, yea [09:42:59] mutante: can I safely remove the old "phabricator.wmflabs.org" hostname associated with it? [09:43:50] oh... 
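For the nagios symptoms discussed above (socket timeouts against the deployment apaches, and "check_ram not defined" on apache31), a quick way to separate a firewalling/security-group problem from a missing NRPE command is to run the check by hand from the monitoring host. The plugin path is the usual Ubuntu one and the hostname is illustrative:

    /usr/lib/nagios/plugins/check_nrpe -H deployment-apache31                # prints the NRPE version if the daemon is reachable
    /usr/lib/nagios/plugins/check_nrpe -H deployment-apache31 -c check_ram   # reaches the daemon, fails if nrpe.cfg lacks the command

A timeout or refused connection on the first call points at the network or security groups; an immediate "Command 'check_ram' not defined" means the connection is fine and the target's nrpe.cfg is simply missing that definition.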
[09:43:57] I can't associate the IP to my instance apparently [09:44:18] hrmm, i could just see that it was not assigned to any instance, unlike most other IPs [09:44:41] we gotta check on phabricator [09:44:51] but it was definitely not in use [09:44:57] per nova-manage that is [09:45:22] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [09:45:43] o hai. [09:47:29] mutante: Yeah, looks like something wasn't clean up in the db and shined through when the IP was assigned to integration. [09:47:43] phabricator is no longer active since 2012-01, all instances were removed [09:49:30] ok, so i guess its safe to remove the DNS name [09:49:38] and reuse the IP [09:50:02] PROBLEM host: deployment-thumbproxy is DOWN address: i-0000026b check_ping: Invalid hostname/address - i-0000026b [09:50:25] deployment-thumbproxy <--- I have deleted that one [09:52:17] mutante: okay, done [09:52:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [09:52:35] mutante: still can't associate the instance though. getting an error that has no interface message created for it even [09:52:48] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=associate&ip=208.80.153.222&project=integration [09:52:48] Krinkle: ok, cool, i just added myself to sysadmin (yes, not inherited from netadmin) to try and configure your instance [09:53:08] hrmm [09:53:14] submit that and get the error [09:54:23] ack, i see it. hrmm, at this point i would like to ask Ryan and/or create Bugzilla ticket [09:54:41] sounds like an issue with giving back / re-using IPs [09:54:52] that had other names associated with them before [09:56:14] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [09:56:19] i may have an associated problem... [09:56:27] or maybe not [09:56:45] Jens_WMDE: what is it? [09:57:02] on i-00000225.pmtpa.wmflabs... wikidata-dev-3 [09:57:24] no way to reach itself using the external ip. [09:57:36] no ping, http, nothing. [09:57:44] everything else works. [09:58:00] most likely you need to add a security group [09:58:10] and/or your default group is not in effect anymore [09:58:24] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [09:58:27] root@i-00000225:~# ping wikidata-dev-client.wikimedia.de [09:58:28] PING wikidata-dev.wikimedia.de (208.80.153.239) 56(84) bytes of data. [09:58:28] ^C [09:58:28] --- wikidata-dev.wikimedia.de ping statistics --- [09:58:28] 9 packets transmitted, 0 received, 100% packet loss, time 8005ms [09:59:43] mutante: here? https://labsconsole.wikimedia.org/wiki/Special:NovaSecurityGroup [09:59:49] Jens_WMDE: check /Special:NovaSecurityGroup , yes [10:00:02] for example to allow http i have [10:00:23] 80 80 tcp 0.0.0.0/0 [10:00:30] and 443 [10:01:34] and then there is "-1 -1 icmp" which would allow ping [10:01:58] but that is from "source group" default, and http is not, i added it [10:03:49] Jens_WMDE: or i think i should say "add a rule to your existing security group" as opposed to "add new group", [10:03:58] mutante: can you spot anything wrong here? 
https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-00000225 [10:04:48] Jens_WMDE: i cant tell from that page, gotta add myself to project roles etc, it is all "per project", hold on [10:05:13] mutante: thank you very much [10:07:29] !log deployment-prep updating packages on deployment-cache-bits [10:07:31] Logged the message, Master [10:07:31] Jens_WMDE: oh, but http already works for me! [10:07:43] Jens_WMDE: without changes, but http://wikidata-dev-repo.wikimedia.de/wiki/Main_Page works for me [10:08:24] mutante: i know. it works. [10:08:53] mutante: the problem is only when i access wikidata-dev-repo.wikimedia.de on the instance. [10:09:04] eh, but < Jens_WMDE> no ping, http, nothing. http://208.80.153.239 [10:09:12] ohh [10:09:15] got you wrong [10:09:20] i.e. when i refer to the instance by its external IP/name [10:09:49] my wording was confusing :) [10:09:55] but then the problem is [10:10:21] well, dont you just want to access localhost then? [10:10:34] i quick-fixed it by adding "10.4.0.23 wikidata-dev-repo.wikimedia.de" to /etc/hosts [10:10:44] i was about to suggest that next :p [10:10:55] but why connect from instance to external IP of itself? [10:10:57] but still... [10:11:14] well. [10:11:34] so we're doing this thing, wikidata :) [10:11:45] heh! yea, i got a flyer [10:11:50] i'm testing the API/refilling the database [10:12:12] the script i have hammers the wikidata instance with api calls. [10:12:47] the thing is it all used to work last week :( [10:13:13] i mean, i can work around it, but... [10:13:25] something like .. hmm .. curl -I -H 'Host:wikidata-dev-repo.wikimedia.de' --url "http://wikidata-dev-repo.wikimedia.de" localhost ? [10:13:48] by which i mean "talk to localhost to reach Apache , yet be able to request the right virtual host" [10:13:50] mutante: a little more, but basically yes. [10:14:18] mutante: yeah, but what's the point... [10:14:38] mutante: i mean... i can refer to www.google.com on the instance. [10:14:55] i can refer to wikidata-dev-repo.wikimedia.de from the outside. [10:15:17] but i can't refer to wikidata-dev-repo.wikimedia.de from the instance. and that changed last week or so. [10:15:26] well, just not needing to open something if it can also stay internal is the instinct when talking about firewall rules [10:16:08] hm yes. [10:16:10] but if you open the same port for users anyways, i guess .. shrug, whatever works [10:16:49] please also ask on list about the recent change, yea [10:17:09] there is a recent mail on it, a thread where it fits in [10:17:15] okay okay [10:17:20] will have a look [10:17:22] Security groups & outside access [10:17:25] it fits exactly [10:17:25] thanks anyway [10:18:09] thank you very much, it's really appreciated :) [10:18:46] no problem, just that i also dont know that much about the most recent changes yet [10:20:41] I love the beta project [10:20:42] "some instances which recently had access to the outside Internet no longer have this access." [10:20:47] that is it for sure [10:21:03] just discovered that memcached is online instance of 64MB :-) [10:21:08] on a 8 GB machine [10:21:24] PROBLEM Total Processes is now: CRITICAL on deployment-mc i-0000021b output: Connection refused by host [10:21:27] !log deployment-prep rebooting deployment-mc for kernel upgrade [10:21:28] Logged the message, Master [10:22:44] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [10:23:57] morning Ryan [10:24:08] morning :) [10:24:08] morning [10:24:10] feeling better? 
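For the "instance cannot reach its own public IP" problem discussed here, the two workarounds already traded in the channel are to talk to the local web server while presenting the right virtual host, or to pin the public name to the internal address. Cleaned up:

    # ask the local Apache for the intended vhost without touching the public IP
    curl -I -H 'Host: wikidata-dev-repo.wikimedia.de' http://localhost/wiki/Main_Page

    # or Jens_WMDE's quick fix: map the public name to the instance's internal address
    echo '10.4.0.23 wikidata-dev-repo.wikimedia.de' | sudo tee -a /etc/hosts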
[10:25:14] a little [10:25:26] i tried to re-assign an unused IP to the pool. .153.222 to give it to "integration"/Krinkle [10:25:34] because i heard we ran out of IPs [10:25:55] and that was "phabricator" before, yet i could tell it was not assigned to anything and phabricator has been closed [10:25:58] !log deployment-prep removed memcached from deployment-nfs-memc , it is running on deployment-mc nowadays. [10:25:59] Logged the message, Master [10:26:18] you need to unallocate it from the project [10:26:24] before it can be re-used [10:26:40] yet, we still get an openstackmanager-associatedaddressfailed> [10:26:43] it is listed under integration now and I could ad a hostname to it [10:26:44] PROBLEM host: deployment-mc is DOWN address: i-0000021b CRITICAL - Host Unreachable (i-0000021b) [10:26:53] it only fails when trying to associate it with an instance [10:27:07] hm [10:27:42] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=associate&ip=208.80.153.222&project=integration [10:28:12] seems you aren't authorized [10:28:17] Ryan_Lane: what i did: nova-manage floating list, see it is not assigned to any instance nor says "virt2" as pool, then "nova-manage floating create 208.80.153.222", then i could assignt that to the project, unlike before, but we got the new error above when assign to instance [10:28:22] are you in the netadmin group in the project? [10:28:36] mutante: it's in the project [10:28:38] i added myself to it and could confirm the error [10:28:44] (netadmin role) [10:29:40] it felt like that is caused by it being re-used and the old DNS name still being somewhere [10:30:10] hm [10:30:10] Krinkle could remove the "phabricator" name from it though [10:30:23] database says its still assigned to phabricator project [10:31:43] wow it's in both [10:31:49] hmm, well, using "floating create" changed it from not letting me assign to project to assigning it to project [10:31:54] uggghhhhhh [10:31:59] you did floating create? [10:32:00] Ryan_Lane: possibly related to the current trouble with public IPs: we can't access our public IPs from *inside* of labs, not even from the instance itself: https://bugzilla.wikimedia.org/show_bug.cgi?id=37985 [10:32:12] the address exists twice now [10:32:17] Ryan_Lane: yes , it appeared to me like i had to [10:32:29] Ryan_Lane: sorry, but it did not say it was in virt2 pool [10:32:40] mutante: the correct way to do this, is to go into the phabricator project, then unallocate it [10:32:52] then go into the integration project and allocate it [10:33:18] ok, though phabricator has been closed a while ago i think [10:33:23] looking [10:33:46] Daniel_WMDE: yes, that's a known issue. it's because it's using NAT [10:33:55] internally you must use internal IPs [10:34:07] mutante: there's no such thing as closed [10:34:15] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [10:34:18] and now there's two addresses [10:34:18] Ryan_Lane: but it used to work?... [10:34:21] so things are fucked up [10:34:35] Daniel_WMDE: I'd be highly surprised if that was the case [10:34:45] well, be surprised then :) [10:34:51] Ryan_Lane: hm. I actually saw it work. [10:34:51] Daniel_WMDE: if the instances are on different hosts, its possible [10:34:59] if they are on the same host, then it won't work [10:35:07] Ryan_Lane: ok, just closed as in "all instances deleted". [10:35:08] which means, you should never rely on it [10:35:09] Ryan_Lane: it'S the *same* instance, talking to itself using it's public IP. 
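Ryan's point about the duplicate address amounts to: reclaim floating IPs through the console (disassociate, then unallocate from the old project, then allocate in the new one), and keep nova-manage for inspection or for genuinely new ranges. As a sketch of the command-line side only, following the usage that appears above (run as root on the nova controller; the range below is an RFC 5737 documentation prefix, purely illustrative):

    nova-manage floating list                      # each address, the instance it is bound to, and its pool
    nova-manage floating create 203.0.113.0/28     # add genuinely new addresses only; re-running create for an
                                                   # address that already exists is what produced the duplicate row here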
[10:35:18] mutante: yeah, that doesn't mean it's closed ;) [10:35:32] Daniel_WMDE: that absolutely won't work [10:35:35] RECOVERY host: deployment-mc is UP address: i-0000021b PING OK - Packet loss = 0%, RTA = 0.83 ms [10:35:41] NAT makes that impossible [10:35:45] Ryan_Lane: it did, apparently... [10:35:48] well. /etc/hosts to the rescue [10:36:00] but but but ... [10:36:04] ah well. [10:36:05] just use internal IPs [10:36:05] Ryan_Lane: howdy [10:36:24] you're going to get nothing but trouble using the public ones [10:36:25] RECOVERY Total Processes is now: OK on deployment-mc i-0000021b output: PROCS OK: 133 processes [10:36:27] paravoid: howdy [10:37:01] how's stuff? [10:37:13] mutante: btw, when an IP says its in a certain pool, it means it's assigned to a specific host [10:37:23] mutante: that doesn't really mean anything [10:37:32] that has to do with the nova-network service [10:37:33] Ryan_Lane: project without an instance has "Allocate IP" but not "Disallocate", and "Disassociate" is per instance [10:37:37] 06/27/2012 - 10:37:36 - Created a home directory for dzahn in project(s): phabricator [10:37:58] that's not correct [10:38:31] 06/27/2012 - 10:38:31 - User dzahn may have been modified in LDAP or locally, updating key in project(s): phabricator [10:38:34] things are broken now that there are two addresses [10:39:20] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 6 users currently logged in [10:39:22] oh..hmm, sry, just describing what i see now when trying to remove. [10:39:31] I deleted both addresses [10:39:40] then re-added one [10:39:49] Oh Ryan_Lane ,you're here [10:39:49] Great :) [10:40:45] !log deployment-prep made deployment-mc to use 'memcached' puppet class. Now uses 2000MB apparently [10:40:47] Logged the message, Master [10:40:58] ok fixed [10:41:04] I associated it with the instance [10:41:05] Ryan_Lane: thanks! how about the .155. network i see there. should we use that instead next [10:41:12] no [10:41:15] that shouldn't be used [10:41:18] ok [10:41:25] that's leslie's network [10:41:30] 06/27/2012 - 10:41:30 - Created a home directory for laner in project(s): integration [10:41:32] she already allocated all of them [10:41:42] we do have an additional block to use, though [10:42:11] we should see which projects have IPs allocated but not associated [10:42:13] and take them back [10:42:32] 06/27/2012 - 10:42:32 - User laner may have been modified in LDAP or locally, updating key in project(s): integration [10:42:41] ok, i was going to make sure re-using IPs like that is ok in general, as long as disassociating it first of course [10:42:49] paravoid: it's ok. I'm not sick anymore, so that's a plus [10:43:00] oh, didn't know you were sick :/ [10:43:03] yes. must disassociate them first [10:43:10] yeah, that's why I wasn't on yesterday [10:43:12] sorry [10:43:19] last week in Berlin too, isn't it? [10:43:22] yep [10:43:42] were you in Berlin for the wikidata project? [10:44:20] Daniel_WMDE: wait, your instance can't talk to the outside world? [10:44:44] now *that's* a problem [10:44:55] Krinkle-away: you got the IP now, Ryan fixed it. [10:44:58] Ryan_Lane: it can talk to the outside world EXCEPT to the outside IP associated with instances [10:45:02] Yep [10:45:08] well, that's normal [10:45:23] but I'm getting reports that people can't talk to the outside world [10:45:34] and indeed, bastion-restricted cant [10:45:48] Ryan_Lane: I'll be away starting next week [10:45:57] (btw) [10:46:01] oh? where you going? 
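The practical upshot of the NAT point above, as a sketch: the floating IP will not hairpin back into labs, so anything running inside labs should target the private 10.4.0.x address instead (the IPs below are the ones mentioned earlier in this log).

    # from inside labs, the public address is a dead end...
    curl -m 5 -sI http://208.80.153.239/ || echo "no route via the floating IP"

    # ...so use the instance's internal address instead
    curl -sI http://10.4.0.23/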
[10:46:08] Nicaragua [10:46:13] vac + debconf [10:46:24] well, vac + debcamp + debconf, 2 weeks [10:48:08] Ryan_Lane: so, I pinned python-eventlet in puppet, upgraded it & restarted nova on virt6-8 [10:48:13] so that's ready too [10:48:16] how shall we proceed? [10:48:19] migrate a VM there perhaps? [10:48:41] ah. cool. have fun in Nicaragua [10:48:58] you can create a test on [10:48:59] *one [10:49:26] how would I create it in one of these nodes specifically? [10:49:32] it has no instances [10:49:35] so it'll be preferred [10:49:42] if I enable it you mean [10:49:46] yes [10:49:49] curl 'http://localhost' works from within the instance shell. Can't get at it from the outside yet though. [10:50:00] takes a while for DNS to take effect? Any estimate? [10:50:19] dns should work right now [10:50:28] localhost? [10:50:30] .... [10:50:31] Hm.. DNS shouldn't be an issue, I"m using the public IP directly in another tab, same result [10:50:36] that's 127.0.0.1 [10:50:42] Ryan_Lane: can we test it without affecting the service? [10:50:50] PROBLEM Free ram is now: CRITICAL on deployment-squid i-000000dc output: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:50] paravoid: nope [10:50:55] you must enable it to testit [10:50:56] (mostly out of curiosity) [10:51:10] Ryan_Lane: Yes, that's serving /var/www/index.html as it should [10:51:18] uh, okay... [10:51:29] lunch time! [10:51:34] Krinkle-away: do your security groups allow port 80 into that instance? [10:51:35] Ryan_Lane: Just testing that apache works. http://integration.wmflabs.org/ / http://208.80.153.222/ is not responding yet though [10:51:53] because dig shows the address properly from the outside world [10:51:59] right [10:52:11] so. likely security groups [10:52:21] ping workd [10:52:23] *works [10:52:50] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [10:52:57] Krinkle-away: see backlog with Daniel_WMDE a little while ago, we talked about security groups to allow 80 and curl [10:53:38] we did? [10:53:40] eh, sorry, Jens_WMDE i meant to say [10:53:45] ah :) [10:54:17] anyway, security group rules are inbound only [10:54:24] there's no outbound rules [10:54:35] Krinkle-away: but, in your case, you need an inbound rule ;) [10:54:44] I'm copying rules from resourcelaoder2 now for group 'http' [10:55:23] you can't modify an instance's groups once the instance is created [10:55:35] Then I better get it right :) [10:55:38] so, if the instance isn't in the http group, that's not going to work [10:55:49] you can modify the group all you want, and the settings will apply properly [10:55:50] uh? [10:56:16] but, if the instance is only in the default group, you can't add it to the http group [10:56:30] But I shouldn't modify the default group, right? [10:56:39] well, you can [10:56:40] that made me say "< mutante> Jens_WMDE: or i think i should say "add a rule to your existing security group" as opposed to "add new group"" [10:56:50] it's not the preferred way of going about [10:56:53] about it [10:57:01] but if you need to, it's fine [10:57:15] the issue with doing that is that it applies project wide [10:57:33] so, if you add another instance where you want to hide port 80 it isn't possible [10:58:09] Jens_WMDE: ^ [10:58:11] Ryan_Lane: np for now. Ahm. so what CIDR range should I use? I see two common patterns in other project of their "web" 80 rule [10:58:12] Begin: Running /scripts/init-bottom ... done. [10:58:12] cloud-init start-local running: Wed, 27 Jun 2012 10:56:18 +0000. 
up 1.68 seconds [10:58:16] no instance data found in start-local [10:58:17] last lines from the log [10:58:20] cloud-init-nonet waiting 120 seconds for a network device. [10:58:20] PROBLEM host: deployment-squid is DOWN address: i-000000dc CRITICAL - Host Unreachable (i-000000dc) [10:59:03] Ryan_Lane: any ideas? [10:59:22] likely doesn't have networking [10:59:26] which host is this on? [10:59:29] yes, it doesn't [10:59:32] it got allocated to virt6 [10:59:47] Krinkle-away: 0.0.0.0/0 if you want access from the outside world [10:59:49] Ryan_Lane: 0.0.0.0/0 or 10.4.0.0/24 (for the port 80 rule in integration/default) [10:59:51] paravoid: lemme take a look [10:59:53] Ah, I see [10:59:59] Ryan_Lane: wait [11:00:02] 10.4.0.0/24 is labs only [11:00:20] ok [11:00:23] I wasn't sure in which direction the CIDR was for [11:01:50] !log integration Added rule for port 80 (from outside world) to integration/default security group [11:01:51] Logged the message, Master [11:04:50] Ryan_Lane: so, eth1.103 exists; br103 wasn't. the VM starts up, br103 gets created but eth1.103 doesn't get added to the bridge [11:04:54] Krinkle-away: all rules are ingress [11:05:02] paravoid: heh. lame [11:05:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002ee CRITICAL - Host Unreachable (i-000002ee) [11:05:16] so... [11:05:38] if you create the bridge and add the device, nova doesn't object [11:05:57] it's perfectly happy with things pre-configured [11:06:03] and you prefer it that way too, right? :) [11:06:17] nova-compute.log shows addbr, ip link set etc. but not addif [11:06:21] Ryan_Lane: btw, I noticed that the Nova_Resource properties are often outdated (especially "Instance State" and "Public IP" properties) - which is odd since the edit page for that ("Configure") has no input fields for those. They just update when doing a null-edit to Configure. [11:06:26] like https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource%3AI-000001d7&diff=4513&oldid=2869 [11:06:31] well, I prefer things to be consistent more than my own way :) [11:06:31] I did a null-edit on the Configure page [11:06:51] and I'm fine leaving things as they are if they work too :) [11:06:54] paravoid: agreed. this is a regression in nova [11:06:58] Great http://integration.wmflabs.org/ "It works!" [11:07:01] because in cactus this worked properly [11:07:17] Krinkle-away: yeah. known bug [11:07:19] how this works in virt1-5? [11:07:22] k, np [11:07:28] paravoid: who says it does? :) [11:07:29] Thanks mutante and Ryan_Lane - stuff is working now :) [11:07:35] I have steps for rebooting hosts [11:07:39] oh? 
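Tying together the security-group exchange above, a sketch of the checks once the port-80 rule (source 0.0.0.0/0) was added to integration/default; the hostname and floating IP are the ones from this log.

    # inside the instance: is Apache itself answering?
    curl -sI http://localhost/

    # from outside labs: does the new ingress rule let port 80 through?
    curl -sI http://integration.wmflabs.org/      # or http://208.80.153.222/

    # note: a source CIDR of 10.4.0.0/24 would have limited the rule to
    # labs-internal traffic; 0.0.0.0/0 opens it to the outside world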
[11:07:53] http://wikitech.wikimedia.org/view/OpenStack#Rebooting_hosts [11:08:03] it's done so rarely that I have been too lazy to fix it [11:08:33] RECOVERY host: deployment-squid is UP address: i-000000dc PING OK - Packet loss = 0%, RTA = 0.44 ms [11:08:53] I planned on doing so with the new hosts [11:08:59] !log integration Installing bunch of stuff on integratin-apache1 - experimentation right now, documenting steps on [[Nova_Resource:Integration/Setup]] (subject to retroactively change at any time) [11:09:00] Logged the message, Master [11:09:13] so that I could reboot without consequences [11:10:15] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.009 second response time [11:10:45] RECOVERY Free ram is now: OK on deployment-squid i-000000dc output: OK: 89% free memory [11:13:15] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [11:19:15] cloud-init running: Wed, 27 Jun 2012 11:16:46 +0000. up 63.01 seconds [11:19:16] waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id 11:16:46 [ 1/100]: url error [[Errno 101] Network is unreachable] [11:19:19] ?!? [11:22:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [11:23:11] paravoid: soo…. [11:23:28] I'm tcpdumping on the host [11:23:31] paravoid: if eth1.103 is being used, doesn't that mean that eth1's link needs to be up? [11:23:35] it does DHCP requests but noone replies [11:23:49] it is [11:24:04] 3: eth1: mtu 1500 qdisc mq state DOWN qlen 1000 [11:24:14] no-carrier? [11:24:15] oops [11:24:18] heh [11:24:28] # mii-tool eth1 [11:24:28] eth1: no link [11:24:34] noone cabled those? :( [11:24:36] well, that's surely a problem [11:25:20] same with virt7/8 [11:25:40] didn't even think of looking for a link, I took that for granted :/ [11:27:02] * Ryan_Lane sighs [11:27:14] I'll open a ticket for Chris to look at [11:27:31] anything more specific than "do the same as virt1-5"? [11:29:25] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 i-0000019b output: USERS CRITICAL - 14 users currently logged in [11:33:03] hmm, 14 may sound a lot for -restricted, but its just users having multiple connections [11:35:45] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:35:55] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [11:53:25] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 5.98, 4.79, 4.95 [11:55:55] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 0.37, 0.27, 0.24 [12:01:25] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.79, 6.40, 5.66 [12:05:45] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:14:28] so hmm puppetmaster::self does not work indeed :-D [12:17:06] paravoid: are you around? I have applied puppetmaster::self and got an error Failed to parse template puppet/puppet.conf.d/20-master.conf.erb: undefined method `sort' for :undefined:Symbol at /etc/puppet/manifests/puppetmaster.pp:76 [12:17:16] it looks like puppetmaster::config is not defined [12:17:35] hmm? 
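Circling back to the virt6 bridging issue from earlier in this hour: a sketch of the manual pre-configuration paravoid describes (nova doesn't object if the bridge is already there), plus the link check that exposed the real problem; the interface and bridge names are the ones from that conversation.

    # pre-create the bridge and enslave the VLAN interface so nova finds it ready
    brctl addbr br103
    brctl addif br103 eth1.103
    ip link set eth1.103 up
    ip link set br103 up

    # none of that helps if the NIC has no carrier, which is what it came down to
    mii-tool eth1        # "eth1: no link" => not cabled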
[12:18:22] my instance is deployment-cache-bits [12:18:23] 20-master shouldn't be used [12:18:29] I know krinkle ad the same issue this morning [12:21:56] merge between puppetmaster::self and a3ee38be broke it [12:21:57] :( [12:22:37] heh [12:23:00] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [12:23:02] no one seemed to check anything after my merge ;) [12:24:19] gaaaah puppetmaster.pp [12:28:30] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3498 bytes in 4.085 second response time [12:35:48] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:40:10] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [12:40:35] hashar: http://integration.wmflabs.org/ wee [12:42:18] !!!!!!! [12:42:24] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration/Setup [12:43:20] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 14800 bytes in 0.124 second response time [12:47:11] PROBLEM host: php5builds is DOWN address: i-00000192 check_ping: Invalid hostname/address - i-00000192 [12:56:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [12:57:27] Krinkle-away: you might want to use a subdirectory to let room for jenkins [12:57:55] hashar: Yeah, will do [12:58:03] hashar: Do you know what's up with curl cli ? [12:58:11] Looks like it times out on a simple request [12:58:13] what do you ask? [12:58:15] from the browser it woks fine [12:58:15] oh [12:58:24] e.g. curl -v http://integration.wmflabs.org/api.php?action=cleanup [12:58:47] ahh it uses the public IP [12:58:59] I don't think NAT would let you do that [12:59:29] just setup an entry in /etc/hosts : 127.0.0.1 integration.wmflabs.org [13:00:14] how about using localhost? [13:00:24] or just localhost yes [13:00:31] hm.. might not work due to virthualhost config [13:00:42] you could pass the Host as a curl header [13:01:07] curl --proxy localhost:80 [13:01:24] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [13:01:37] curl --proxy localhost:80 -v http://integration.wmflabs.org/api.php?action=cleanup [13:01:45] interesting [13:01:50] so no /etc/hosts file needed? [13:01:52] nop [13:01:57] curl contact the proxy [13:02:06] then send the http:// request crafted from the URL [13:02:39] since localhost (the proxy) has a virtual host for integration.wmflabs.org, it happily honor Host: integration.wmflabs.org [13:02:51] I think the curl --proxy trick comes from domas [13:03:25] could do the same with: curl --header 'Host: integration.wmflabs.org' http://localhost/api.php?action=cleanup [13:05:54] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:06:29] < mutante> something like .. hmm .. 
curl -I -H 'Host:wikidata-dev-repo.wikimedia.de' --url "http://wikidata-dev-repo.wikimedia.de" localhost [13:11:24] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [13:16:24] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 2.22, 3.06, 4.53 [13:18:49] hashar: moved into subdir; updated /Setup [13:19:04] mod rewrite was disabled by default [13:19:48] had to run a2enmod rewrite ; working now http://integration.wmflabs.org/testswarm/info [13:25:24] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [13:27:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [13:33:32] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 71 MB (5% inode=57%): [13:34:42] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:36:42] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:22] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 8.32, 7.69, 6.18 [13:46:26] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [13:50:11] PROBLEM dpkg-check is now: CRITICAL on deployment-cache-bits i-00000264 output: DPKG CRITICAL dpkg reports broken packages [13:56:02] !log integration Running `puppetd -tv` on integration-apache1. puppetmaster::self has been fixed by ops [13:56:03] Logged the message, Master [13:56:26] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [13:57:46] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:01:26] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [14:07:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:09:18] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [14:18:49] Is it just me or is gerrit really slow? Like the header loads then...wait...page loads [14:19:51] <^demon> It's a tad slow, but works. [14:20:04] Damianz: we had recurring slowness for the past days [14:20:25] Gerrit makes me sad [14:21:52] Someone needs to make an awesome git command wrapper thing to pull comments in and vote on things... like hub but for gerrit [14:22:25] <^demon> There's command line interfaces for the review functions [14:22:34] <^demon> Any user can use them. [14:22:49] Hmmm [14:22:52] * Damianz googles [14:22:56] * Damianz hopes it's not in java [14:22:56] <^demon> Why google? [14:23:08] <^demon> There's a nice big "Documentation" link at the top of gerrit [14:23:12] <^demon> People should click that more often. 
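Summarizing the two equivalent curl workarounds hashar and Krinkle settle on above, neither of which needs an /etc/hosts edit; the URL is the one being tested in this log.

    # point curl at the local Apache as an HTTP proxy: the request line still
    # carries the full URL, so the matching virtual host answers
    curl --proxy localhost:80 -v "http://integration.wmflabs.org/api.php?action=cleanup"

    # same effect, done with an explicit Host header against localhost
    curl --header 'Host: integration.wmflabs.org' "http://localhost/api.php?action=cleanup"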
[14:23:22] I would if the page loaded :P [14:23:27] <^demon> https://gerrit.wikimedia.org/r/Documentation/index.html [14:23:50] <^demon> Specifically, you're looking for https://gerrit.wikimedia.org/r/Documentation/cmd-index.html#_a_id_user_commands_a_user_commands [14:24:46] Hmm interesting [14:24:49] <^demon> Basic syntax is `ssh -p 29418 gerrit.wikimedia.org gerrit [command]` [14:25:04] SSH could be kinda neat as it doesn't require any extra tokens but also sluggish maybe [14:25:07] * Damianz will play later [14:25:30] make that a function in your bash [14:26:25] Damianz: https://github.com/hashar/alix/blob/master/shell_functions [14:27:05] :) [14:27:28] example: $ gerrit query is:open owner:hashar [14:27:30] Might actually write some git commands so I can use it directly but that's a nice starting point [14:28:08] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:28:10] gerrit has options format either text or json. There is probably a way to make a list having one line per change [14:28:34] I am interested in any change you could make :-] [14:29:01] I tried grabbing json from the web api once, got into a load of jetty errors and then gave up and just copied the list manually. [14:29:01] ok. going to upgrade gluster [14:29:26] gerrit query is:open owner:hashar --format=json ;-] [14:29:47] there is a --format=text (default), probably over formats too but it is hard to tell since there is no online help for format [14:30:08] RECOVERY dpkg-check is now: OK on deployment-cache-bits i-00000264 output: All packages OK [14:32:04] [--format {TEXT | JSON}] [14:33:48] Json sorta looks useable, wonder if the id related to anything referencable from knowing a repo and sha1 [14:37:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:53:59] !log integration integration-apache1 now uses puppetmaster:self (see /var/lib/git/ ) [14:54:00] Logged the message, Master [14:55:34] !log integration created psm-lucid instance to test out bootstrapping of puppetmaster::self from scratch [14:55:35] Logged the message, Master [14:55:49] paravoid: ^^ going to try the installation procedure on a fresh instance :-] [14:58:14] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:58:59] hashar: sec [15:04:25] PROBLEM Current Load is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:05:05] PROBLEM Current Users is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:05:42] hashar: try it now [15:05:45] PROBLEM Disk Space is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:06:15] PROBLEM Free ram is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:07:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:08:14] * jeremyb wonders if there's a way to know which host your instance is on? [15:08:52] jeremyb: physical host you mean ? [15:08:55] yes [15:09:02] did you see labs-l about the SNAT? 
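A minimal sketch of the wrapper idea above, along the lines of hashar's shell_functions; the function name is just illustrative, and it assumes the SSH key for your gerrit account is loaded.

    # drop into ~/.bashrc or similar
    gerrit() {
        ssh -p 29418 gerrit.wikimedia.org gerrit "$@"
    }

    # the queries from the discussion above
    gerrit query is:open owner:hashar
    gerrit query is:open owner:hashar --format=JSON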
[15:09:07] paravoid: I am running puppet right now then will add the puppetmaster::self class to it [15:09:15] nop [15:09:25] RECOVERY Current Load is now: OK on psm-lucid i-000002f1 output: OK - load average: 1.39, 1.58, 1.04 [15:10:14] RECOVERY Current Users is now: OK on psm-lucid i-000002f1 output: USERS OK - 1 users currently logged in [15:10:44] RECOVERY Disk Space is now: OK on psm-lucid i-000002f1 output: DISK OK [15:11:14] PROBLEM Free ram is now: UNKNOWN on psm-lucid i-000002f1 output: NRPE: Unable to read output [15:11:35] jeremyb: well, there's *kind of* a way to know which host you're on [15:11:41] but it's not terribly accurate right now [15:11:49] the instances pages list the host [15:12:07] but if we live-migrate an instance, it's wrong [15:12:16] righhhhht [15:12:28] when we move to essex, andrewbogott wrote a plugin for openstack to update mediawiki [15:12:41] so, whenever something changes, the pages will be correct [15:12:49] ahh, rings a bell but i didn't know it needed a certain version [15:13:00] no plugins till essex [15:13:02] is that a SMW specific plugin? [15:13:15] it's a plugin for openstack [15:13:17] what are we on now? [15:13:17] not for mediawiki [15:13:44] and openstack didn't have a plugin framework till andrew wrote it for the essex release [15:13:52] we're on diablo [15:13:54] essex is out [15:14:04] hahah, we had to write it?! [15:14:04] we'll be upgrading at some point [15:14:22] well, openstack is still fairly young ;) [15:14:57] we wrote a number of things for openstack [15:15:04] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [15:15:09] it's nice that we can, and it gets accepted easily enough [15:15:19] yeah, sure [15:15:30] my contributions were mostly patches. I rewrote the LDAP support in nova, though. [15:15:44] i'm not seeing where it says the host name [15:16:00] instance host [15:17:12] paravoid: so puppetmaster::self runs with out error on a fresh instance. congrats! [15:17:30] heh [15:18:46] !log integration deleting psm-lucid instance. puppetmaster::self does run without on a fresh instance! [15:18:46] Logged the message, Master [15:19:16] without on ? [15:19:54] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [15:19:56] so.... what's a normal amount of time to expect a fresh host just created with no extra puppet classes to take to reach an idle, stable, final state? [15:19:57] yeah that s my english [15:20:00] I tend to skip words [15:20:03] hah [15:20:06] jeremyb: a few minutes [15:20:18] seemed to take forever for both precise and oneiric last night [15:20:27] jeremyb: "without error" [15:20:28] are they not fully supported yet? (vs. lucid) [15:20:37] hashar: ahh [15:20:41] lets try on Precise [15:20:52] * hashar loves lab [15:21:19] !log integration created psm-precise to test puppetmaster::self on a Precise box [15:21:20] Logged the message, Master [15:22:29] Ryan_Lane: i see the host now. i was looking at the instance list not an individual instance page. (i left out a word when reading what you wrote! compensating for hashar? ;P) [15:22:51] ah [15:22:53] heh [15:23:09] any idea about initial boot time? 
[15:24:22] would be nice to be able to tell labsconsole to use a different initial puppetmaster (a ::self one from the same project maybe) and to get serial access over SSH (not just rarely updated read only access over the web) [15:24:37] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:02] jeremyb: Which would mostly be useless without the root pass (which no one knows)? [15:25:06] also, ryan, you have a few patchsets waiting for review in case you didn't notice [15:25:42] * Ryan_Lane nods [15:25:47] PROBLEM SSH is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - Socket timeout after 10 seconds [15:25:57] I'll review them when I get back to the states [15:26:15] * jeremyb hasn't a clue when that trip is ;P [15:26:23] friday [15:26:31] ahh, k [15:26:36] I don't want to make many changes before then [15:26:53] sure. you could +1 and then +2 later. but whatever ;) [15:27:31] Damianz: err... so, also give them a custom root pass? we could allow custom root pass on anything created for <12 hr lifespan (just picked the number out of thin air... can be changed) and require global root for longer lived nodes? [15:28:17] anyway, even just to view the console but not login it would be useful [15:28:17] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [15:29:27] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 4.39, 4.62, 4.96 [15:34:35] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 84% free memory [15:35:45] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [15:36:46] so Precise just returns me : [15:36:47] err: /Stage[main]/Certificates::Wmf_ca/Exec[/bin/ln -s /etc/ssl/certs/wmf-ca.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem).0]/returns: change from notrun to 0 failed: /bin/ln -s /etc/ssl/certs/wmf-ca.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:198 [15:36:47] err: /Stage[main]/Certificates::Wmf_labs_ca/Exec[/bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-labs.pem).0]/returns: change from notrun to 0 failed: /bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-labs.pem).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:219 [15:36:58] I guess that is not much of an issue anyway [15:37:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:37:23] !log integration applying puppetmaster::self to psm-precise [15:37:24] Logged the message, Master [15:40:35] well, it needs to be fixed [15:40:48] did not happen on lucid [15:41:04] yeah [15:41:57] ohh remembers me of /etc/sudoers [15:43:32] paravoid: so puppetmaster::self run fine on Precise :-] [15:44:38] Ryan_Lane: I have a conflict for /etc/sudoers trying to get installed by two puppet class. 
Any idea who could review / merge that beside you (since you are leaving soon) [15:44:40] https://gerrit.wikimedia.org/r/#/c/12178/ [15:46:07] I gave it a +1 [15:46:29] since it can break production, I guess you want to schedule that change for a later deployement so [15:47:00] i'd prefer not to do it right now [15:47:03] someone else can [15:47:11] fine thanks [15:47:11] any ops person [15:47:23] will poke Faidon or Daniel tomorrow ;) [15:55:14] I can even test my change using puppetmaster::self [16:02:35] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: DPKG CRITICAL dpkg reports broken packages [16:02:45] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [16:03:45] PROBLEM HTTP is now: CRITICAL on psm-precise i-000002f2 output: Connection refused [16:07:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:08:45] RECOVERY HTTP is now: OK on psm-precise i-000002f2 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.003 second response time [16:12:35] RECOVERY dpkg-check is now: OK on psm-precise i-000002f2 output: All packages OK [16:16:45] PROBLEM HTTP is now: CRITICAL on psm-precise i-000002f2 output: Connection refused [16:33:04] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [16:36:54] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 133 processes [16:37:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:44:11] everything looks broken [16:45:12] <^demon> Everything? We'd better get to work. [16:58:15] It seems back now [17:03:04] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [17:03:55] bad channel to report outages, unless it's a labs outage ;) [17:04:27] We have outages? I'd never noticed [17:04:27] :P [17:05:28] Ryan_Lane: Did you piss off china last time you where there? :P [17:07:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:08:33] the sever of WikiLabs isn't down, but seems all other servers of WMF is down [17:09:15] Labs isn't affected, production apache boxes are [17:29:40] http://mobile-geo.wmflabs.org/ looks unreachable [17:33:05] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [17:35:48] gotta love bot announcements like that... [17:37:35] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:40:29] <^demon> Ryan_Lane: I want this shirt http://www.zazzle.com/2_lgtm_tee_shirt-235862924096653439 :) [17:40:50] <^demon> Or http://www.zazzle.com/i_would_prefer_t_shirts-235689879400316433 [17:41:25] heh [17:41:55] it doesn't have a -2 [17:46:38] lol, those are cool [17:47:26] <^demon> Ryan_Lane: Is downtime over enough that I can restart gerrit? [17:47:33] yes [17:50:51] <^demon> Argh, it won't let me. Must've not gotten my sudo setup right. [17:52:02] Damn, let me edit the config file. [18:02:58] <^demon> Ryan_Lane: Mind giving it a kick for me? [18:03:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [18:08:11] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:12:42] ^demon: kick it in what way? [18:12:44] I ran puppt [18:12:46] *puppet [18:12:55] <^demon> Was just looking for a restart. 
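Circling back to the certificates errors hashar pasted at 15:36: a hedged guess is that the hashed symlink already exists on the Precise image (for example from an earlier run or from update-ca-certificates), so the bare ln -s exits 1; a guarded variant of the same command would be idempotent. This is speculation from the error text alone, not a confirmed fix.

    # hypothetical guarded form of the Exec from certs.pp
    hash=$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem)
    [ -e "/etc/ssl/certs/${hash}.0" ] || \
        /bin/ln -s /etc/ssl/certs/wmf-ca.pem "/etc/ssl/certs/${hash}.0"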
[18:12:56] and nothing happened [18:13:00] why a restart? [18:13:14] <^demon> That was #gerrit's bright idea to fix our commits stuck in "Submitted, Merge Pending" [18:13:18] ugh [18:13:34] well, I can't do it now [18:13:38] they're about to start a deploy [18:13:57] <^demon> We'll live. [18:13:59] <^demon> Well, did the puppet run force a restart? [18:15:39] no [18:15:48] because no changes were done to the configs [18:15:54] <^demon> *nod* Ok. [18:16:02] <^demon> I'll just wait. [18:16:10] <^demon> "Restart gerrit" was kind of a silly solution anyway. [18:33:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [18:38:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:39:34] 06/27/2012 - 18:39:33 - User orion may have been modified in LDAP or locally, updating key in project(s): bastion,swift [18:39:41] 06/27/2012 - 18:39:41 - Updating keys for orion at /export/keys/orion [19:03:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [19:08:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:09:29] Ryan_Lane: apparently one of the ganeti folks has a presentation at Google I/O [19:09:32] about ganeti among other things [19:09:36] oh/ [19:09:36] https://developers.google.com/events/io/sessions/gooio2012/1203/ [19:09:38] cool [19:09:39] not sure if you care [19:09:49] nah, I'm interested [19:09:57] good to see what's in other systems [19:10:00] they do use it for something similar to labs otoh [19:10:08] internally [19:10:39] also of interest to me/you: https://developers.google.com/events/io/sessions/gooio2012/1201/ [19:10:50] ganeti is pretty cool for what it does [19:10:58] "SPDY: It's here!" [19:11:01] yeah, I saw a SPDY talk at velocity [19:11:07] was a really interesting talk [19:11:14] ah there was streaming? [19:11:16] damn [19:11:25] no [19:11:27] last year [19:11:34] oh [19:11:38] I'm going to miss this SPDY talk [19:11:42] I'll be on a plane [19:11:48] heh :) [19:12:12] I really hope I'm not sick for my flight [19:12:29] Hope you're not sick at all [19:12:49] I've been sick the past two days [19:12:57] I wonder if redhat con is being video'd... love to catch the selinux talk [19:13:03] Ryan_Lane: :( Get better soon [19:13:06] ty [19:22:10] why http://mobile-geo.wmflabs.org/ is unaccessible? [19:33:15] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [19:35:31] Probably because the host is not responding to anything on port 80. [19:36:32] Ryan_Lane: Labs console seems down [19:36:38] Still playing with the networking stuff? [19:36:47] wfm [19:36:48] Or paravoid [19:37:04] <^demon> Ryan_Lane: lack of a real git library for php is pissing me off :( [19:37:06] oh derp [19:37:15] Can we make wmflabs.org point there for idiots like me? [19:37:26] i'd prefer we don't [19:37:32] maybe a landing page would be ok [19:37:43] but I don't want a redirect from wmflabs to labsconsole [19:37:53] <^demon> A landing page saying "Welcome to labs, you probably want labsconsole" would be easy :) [19:38:00] Be nice if we could get labsconsole in a good place in google... ruddy old wikis [19:38:14] !nagios [19:38:14] http://208.80.153.210/nagios3 http://nagios.wmflabs.org/nagios3 [19:38:15] MaxSem: A couple of days ago we had a similar problem with a different host, and clicking the 'Reassociate IP' button resolved some things. 
If you are netadmin for that project, it's easy: https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&showmsg=setfilter [19:38:20] of course, that said, wmflabs.org doesn't have an A record pointing anywhere ;) [19:39:02] I was gonna look at it but apaprently that host isn't in the project list I thought it was so I can't eh, back to steak [19:39:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:46:52] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 77.46, 42.34, 17.40 [19:50:41] andrewbogott, thanks, but it didn't help [19:51:11] MaxSem: ok, most likely the problem is internal to that instance then. [19:51:38] meh, nobpdy admits to doing anything:) [19:51:52] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.93, 16.19, 12.87 [20:03:22] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [20:09:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:11:52] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.69, 0.75, 3.83 [20:33:22] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [20:39:33] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:51:35] 06/27/2012 - 20:51:35 - User owen may have been modified in LDAP or locally, updating key in project(s): bastion [20:51:43] 06/27/2012 - 20:51:43 - Updating keys for owen at /export/keys/owen [21:03:23] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [21:08:34] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [21:08:39] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:09:43] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:13:37] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [21:13:37] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 13% free memory [21:33:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [21:39:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:03:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [22:11:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:16:14] paravoid: I'm working with alpha_ori on trying out https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs [22:16:32] okay [22:16:33] puppet's complaining that it won't install a package because the repo is unsigned. [22:16:44] do you know the right way to tell apt that it's all cool? [22:17:06] make a key, sign the packages with it, and use apt-key to trust it [22:18:02] Ryan_Lane: in the context of labs (and a cluster of machines using a local file:/// repo), will that work? 
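A sketch of the "make a key ... and use apt-key to trust it" step Ryan describes above, assuming the key lives on the box that builds the local repo; the key name is made up for illustration.

    # on the repo host: generate a signing key (interactive; skip the passphrase
    # if unattended signing is needed)
    gpg --gen-key

    # on each client: trust the public half
    gpg --armor --export "Labs local repo" | sudo apt-key add -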
[22:18:15] also, I want it to work for packages that I don't build (and so can't sign) [22:18:25] you don't sign packages, you sign repositories [22:18:32] I'm not sure if signatures work with file:/// though [22:18:49] you can disable signature verification but you can't do it on a per project basis [22:19:00] even with selfhostedpuppet? [22:19:53] hmm... [22:20:24] paravoid: on what basis can you do it? (host or global?) [22:20:34] global [22:20:43] you disable apt's signature verification completely [22:20:47] which is bad obviously [22:21:37] also, I'm confused by the combination of your and ryan's statements. "use apt-key to sign the package" and "you don't sign packages you sign repositories." [22:21:49] he's right [22:21:55] it's the repo, not the packages [22:22:18] the description in the man page for apt-key says "Packages which have been authenticated using these keys will be [22:22:18] considered trusted." [22:22:37] is it trying to say "packages from repositories which have been authenticated..."? [22:22:42] maplebed: man apt-secure [22:22:45] Yeah, but AFAIK you sign the repo and apt-key add the key. [22:22:58] maplebed: yes, that's what it's trying to say :) [22:23:15] hrmph. I should file a bug. that's confusing. [22:23:29] well, it's not wrong per se [22:23:43] it says "authenticated" not "signed" [22:24:29] Though I think you can sign packages so you can distribute trusted packages without a repo. [22:24:43] there was debsigs but it never took off [22:24:53] it's not officially supported and can create problems in various scenarios [22:25:06] so in the context of https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs, [22:25:14] apt doesn't check it so it will still warn; dpkg doesn't care about signatures either way [22:25:20] paravoid: Ah. [22:25:30] it sounds like I need to create a Releases file, [22:25:47] and ... [22:26:02] I'm not sure what the right way is to get the key into a place where puppet will like it and it's still labs-only [22:26:04] you can pipe the output of gpg --recv-keys right into apt-key [22:26:08] yes, a Release file [22:26:38] grumble. [22:26:48] also http://wiki.debian.org/SecureApt [22:26:51] i've been thinking if we want a keyserver [22:27:01] what I want is to allow unsigned file:/// repos. [22:27:04] :P [22:27:15] Unsigned files is sorta bad [22:28:04] well, he's right though; asking signatures for file:/// it's a tad too much [22:28:05] within labs, the signature doesn't actually do anything useful. [22:29:11] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596498 [22:30:40] wow, didn't know that [22:30:46] so, >= precise :-) [22:33:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [22:36:16] paravoid: tomorrow I'll try and follow http://wiki.debian.org/SecureApt and see if it'll work, [22:36:43] but if you have any advice (or help) to make https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs have a set of instructions that'll work, I'd love it. [22:36:58] maplebed: for precise you can you [trusted=yes] [22:37:12] right, but I'm trying to do this for the current swift clutser, which is not on precise. [22:37:26] and I don't watn to put upgrading to precise in between here and the newer verson of swift. [22:37:59] yes, make sense :) [22:38:19] I just watn to use labs to test upgrading software! 
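Following the SecureApt outline linked above, a sketch of what the file:/// repo would need so apt (and therefore puppet) stops warning; the repo path is a placeholder, and the [trusted=yes] shortcut only applies on precise or newer, as noted.

    cd /srv/localrepo                                # hypothetical repo location
    apt-ftparchive packages . > Packages
    gzip -c Packages > Packages.gz
    apt-ftparchive release . > Release
    gpg --armor --detach-sign --output Release.gpg Release

    # clients then trust the key via apt-key (see the earlier sketch); on precise
    # and newer the whole dance can be skipped with a sources.list entry like:
    #   deb [trusted=yes] file:///srv/localrepo ./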
[22:41:06] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [22:41:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:03:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [23:07:57] who should i ask about having a new project created for an extension i'd like to get onto labs? [23:11:10] emw: it's a cool project [23:11:31] GChriss: thanks! [23:11:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:11:48] lmk if you find out -- I have an outstanding project request listed on https://labsconsole.wikimedia.org/wiki/Requests [23:17:07] PROBLEM Total Processes is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:17:32] PROBLEM dpkg-check is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:18:52] PROBLEM Current Load is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:19:09] Emw: also, consider displaying HTML5 video "fallback" in lieu of a static image [23:19:11] http://thekyle.tk/post/6039657612/look-what-i-can-do-with-html5-javascript-and [23:19:32] PROBLEM Current Users is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:20:01] to at least display full rotation, even if non-interactive [23:20:42] PROBLEM Free ram is now: UNKNOWN on testforx i-000002f3 output: NRPE: Unable to read output [23:20:53] (and for everyone just joining: re: "[Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models" @ http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html ) [23:22:02] RECOVERY Total Processes is now: OK on testforx i-000002f3 output: PROCS OK: 78 processes [23:22:32] RECOVERY dpkg-check is now: OK on testforx i-000002f3 output: All packages OK [23:23:29] GChriss: interesting. for a closer non-webgl fallback, i think i'd try projecting 3d onto 2d by using the non-webgl html5 canvas (or microsoft's VML, which would support interactive models in ie8) [23:23:52] RECOVERY Current Load is now: OK on testforx i-000002f3 output: OK - load average: 0.05, 0.76, 0.60 [23:24:32] RECOVERY Current Users is now: OK on testforx i-000002f3 output: USERS OK - 0 users currently logged in [23:25:13] pre-rendered video prob. isn't the best option, but something to keep in mind [23:27:19] indeed. i think this extension that deals specifically with one niche file type might be a good starting point for enabling interactive 3d models in general on wikipedia [23:30:41] any particular modeling in mind (besides molecules)? [23:31:36] buildings and planets come to mind initially [23:32:07] maps [23:33:05] stuff on http://sketchup.google.com/3dwarehouse/, though licensing might be an issue [23:33:40] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [23:36:40] I could see traction w/ physics articles and makerbot 3d models [23:38:22] the top answer here indicates that there's fairly high-quality free and open source software that converts photographs to 3d models: http://superuser.com/questions/30053/is-there-any-free-open-source-software-that-converts-photos-to-3d-models [23:38:29] that'd be neat [23:39:58] Google Body (we just need to find an MRI and a wikivolunteer) [23:40:10] ha! 
true [23:40:34] ya, anatomy would be a good use [23:41:01] i think there's a lot of pedagogical potential for incorporating 3d models in wikipedia's content [23:41:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:41:51] agreed [23:42:21] also see http://cinema.elphel.com/en/stereo3d