[00:05:37] marktraceur, Reedy, I take it we have reason to believe that etherpad.wmflabs.org could access the outside sometime in the recent past? [00:07:54] No idea [00:16:16] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [00:21:00] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [00:22:48] Reedy: When marktraceur sayd 'Last known working date...' was he talking about the last time he could reach the outside from the etherpad instance? [00:25:28] http://deployment.wikimedia.beta.wmflabs.org/ 404 error [00:25:35] http://beta.wmflabs.org [00:25:54] yeah [00:29:40] RECOVERY Current Users is now: OK on bastion-restricted1 i-0000019b output: USERS OK - 5 users currently logged in [00:41:11] andrewbogott: Yes, Thursday 2012-06-21 at 14:40 GMT-7 [00:41:20] ok, email sent. [00:41:29] I saw! Thanks for that [00:41:48] That list needs a little action anyway :) [00:50:00] !log integration created instance integration-apache1 to use as sandbox for setting up TestSwarm+BrowserStack [00:50:01] Logged the message, Master [00:53:37] wtf? [00:53:38] krinkle is not allowed to run sudo on i-000002eb. This incident will be reported. [00:53:44] LOL [00:53:45] PROBLEM Current Load is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:54:01] <^demon> It sends Ryan an SMS ;-) [00:54:25] PROBLEM Current Users is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:05] PROBLEM Disk Space is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:13] ^demon: How come I can't root? I own this instance [00:55:16] Krinkle: that will go on your permanent record [00:55:42] I'm following the documentation xD [00:55:45] PROBLEM Free ram is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:55:47] <^demon> Krinkle: No clue. [00:56:30] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: CRITICAL - Socket timeout after 10 seconds [00:57:35] PROBLEM Total Processes is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:58:15] PROBLEM dpkg-check is now: CRITICAL on integration-apache1 i-000002eb output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:59:05] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:55] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [01:10:55] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [01:18:55] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 17% free memory [01:20:54] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [01:28:29] did andrewbogott get sorted? [01:31:43] marktraceur: andrewbogott: did ya'll get sorted? i'm still catching up [01:32:27] jeremyb: Not sorted, but moved the discussion to the labs list. [01:32:45] reading there now [01:34:06] andrewbogott: can you maybe try to find it broken on an instance I already have access to? or give me access to one? ;) [01:34:45] and what's an example test? did you try my curl? [01:35:10] damn you > No Nova credentials found for your account. 
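The "example test" jeremyb asks about is never quoted in the log, so the following is only a minimal sketch of the kind of outbound-connectivity checks being traded here; the hostnames are illustrative, and the telnet line is the one andrewbogott says he ran:

    # run from inside the suspect instance (e.g. the etherpad box)
    curl -sI http://www.google.com/ | head -n1     # can we complete an HTTP request at all?
    telnet google.com 80                           # the check andrewbogott mentions trying
    nc -vz gerrit.wikimedia.org 29418              # probe a single arbitrary TCP port
    host wikipedia.org                             # rule out a DNS-only failure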
[01:35:23] maybe i have to go review ryan's patch now ;) [01:35:39] jeremyb: Are you in the testlabs project? [01:35:46] i think not [01:37:32] 06/27/2012 - 01:37:32 - Created a home directory for jeremyb in project(s): testlabs [01:37:41] andrewbogott: https://labsconsole.wikimedia.org/wiki/User:Jeremyb has a list ;) [01:37:45] or that works [01:37:54] which instance now? [01:37:56] well... now you are. Have a look at utils-abogott [01:38:33] 06/27/2012 - 01:38:33 - User jeremyb may have been modified in LDAP or locally, updating key in project(s): testlabs [01:38:52] huh? [01:38:58] labs-home-wm is weird [01:39:47] does each project have it's own private netblock? not that it matters. just wondering [01:40:54] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [01:41:12] andrewbogott: 27 01:34:45 < jeremyb> and what's an example test? did you try my curl? [01:41:52] hah, > The last Puppet run was at Thu Mar 22 21:40:58 UTC 2012 (138480 minutes ago). [01:45:54] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [01:45:59] jeremyb: I tried 'telnet google.com 80' and various other things. [01:46:04] andrewbogott: have we tried just booting a problem instance? [01:46:25] I tried rebooting utils-abogott [01:46:36] up less than 2 hrs [01:47:12] did you use the web's reboot? [01:47:20] vs. executing `reboot` [01:50:20] yeah, rebooted via labsconsole. [01:51:45] huh [01:51:58] so, nc: connect to gerrit.wikimedia.org port 22 (tcp) failed: No route to host [01:52:07] but also, Connection to gerrit.wikimedia.org 29418 port [tcp/*] succeeded! [01:52:24] odd ;) [01:52:33] (i guess intentional?) [01:52:41] didn't used to be i think [01:52:45] anyway, still digging [01:53:38] andrewbogott: also, have we tried making a brand new instance? [01:55:07] damnit, Thehelpfulone ping [01:55:28] now every list is named starting with "A " (for sample set=2) [01:55:39] and some clients assume that A is a first name [01:55:55] (or is that mut ante's doing?) [01:56:05] * jeremyb didn't look into it *that* much yet [01:57:54] jeremyb: Haven't tried making a new one. [01:58:13] jeremyb: I suspect that what we're seeing is intentional, so might be best to wait for Ryan or Faidon to confirm or deny. [01:59:17] right, but why? one obvious suspect is that you haven't run puppet in so long [01:59:24] but there's others [01:59:38] anyway, have you checked stuff like iptables? [01:59:42] can you? 
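jeremyb's iptables question is answered only briefly below; "checking stuff like iptables" on an instance would normally mean something like the following, all standard tools rather than anything labs-specific:

    sudo iptables -L -n -v        # filter rules plus packet and byte counters
    sudo iptables -t nat -L -n    # NAT table, in case traffic is being rewritten locally
    ip route                      # is there a default route at all?
    cat /etc/resolv.conf          # resolver configuration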
[02:00:03] iptables is wide open on the gerrit box [02:00:29] * jeremyb tries making a new box [02:04:44] labsconsole is sloooooow [02:06:34] one difference i see is you're on oneiric [02:14:03] PROBLEM Current Users is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:14:33] PROBLEM Disk Space is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:15:23] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:15:43] PROBLEM Total Processes is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:16:40] PROBLEM dpkg-check is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:16:40] PROBLEM Total Processes is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:17:00] PROBLEM dpkg-check is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:18:20] PROBLEM Current Load is now: CRITICAL on network-test1 i-000002ec output: Connection refused by host [02:18:35] labs-nagios-wm: quiet [02:18:36] ;) [02:18:46] Emw: getting settled in? [02:19:28] huh [02:19:29] jeremyb: gradually [02:20:22] wow, i forgot how long it takes to boot a host! [02:21:33] i want serial console! [02:25:40] PROBLEM Current Users is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:25:40] PROBLEM Free ram is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:25:40] PROBLEM Current Load is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:26:00] PROBLEM Disk Space is now: CRITICAL on network-test2 i-000002ed output: Connection refused by host [02:26:13] i've got a new extension that i'm trying to get set up on labs. i just got developer access. reading over https://labsconsole.wikimedia.org/wiki/Help:Contents, it seems like the high-level workflow to getting an instance/project properly set up might be 1. create project, 2. create security group, 3. create instance. is that the gist? if not, what is? [02:26:43] you probably don't have to deal with the security group part [02:26:50] but otherwise yes [02:26:57] unless it fits into an existing project [02:27:07] what's the extension? [02:27:18] where's it already in use? [02:28:37] RECOVERY Current Load is now: OK on network-test1 i-000002ec output: OK - load average: 0.79, 1.56, 1.33 [02:29:07] RECOVERY Current Users is now: OK on network-test1 i-000002ec output: USERS OK - 1 users currently logged in [02:29:37] RECOVERY Disk Space is now: OK on network-test1 i-000002ec output: DISK OK [02:29:46] jeremyb: it's a new media handling extension that that will allow users with webgl-enabled browsers to manipulate 3D models of large biological molecules. more here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html [02:30:17] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [02:30:20] it's not already in use (just on my machine, and a random other machine) [02:30:42] andrewbogott: so, network-test1 is precise and it gets out to the world fine [02:31:05] andrewbogott: can you check iptables? [02:31:33] (iptables -L) [02:31:37] RECOVERY Total Processes is now: OK on network-test1 i-000002ec output: PROCS OK: 82 processes [02:31:46] jeremyb: On which machine? 
[02:31:53] andrewbogott: your util [02:32:07] RECOVERY dpkg-check is now: OK on network-test1 i-000002ec output: All packages OK [02:32:10] i'm still booting oneiric [02:33:53] Nothing interesting... [02:34:02] It's also getting late here, so I'm not likely to be of much help. [02:34:03] that's yeah, exactly [02:34:18] hey, i think you're an hour earlier than me! ;P [02:34:34] Emw: coming to NYC? DC? [02:34:44] Emw: know anyone up in boston with a spare JTAG? [02:35:04] (for guruplug/dreamplug) [02:35:22] i'll be in DC 7/9 through 7/15 [02:36:47] sorry, boston is out of JTAGs [02:37:03] hah [02:39:27] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [02:40:39] andrewbogott: what's the current state of puppet? we still have a test branch? [02:40:49] how do i request a new project? my basic goal is to get this extension working on a staging environment that's close to wikimedia's. it involves uploading files, which involves some processing upon completing the upload to transform a plaintext file into a static image (not sure how relevant that is). this would make a new project appropriate, right? [02:40:56] (i.e. what are my boxen building off of) [02:41:06] Emw: maybe andrewbogott can make one if he's not asleep [02:41:34] Emw: pick a name. what's the extension called? [02:41:50] PDBHandler [02:42:08] then that's the name of your new project [02:42:33] fantastic [02:48:35] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM dpkg-check is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:35] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:42] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [02:51:22] ldap being slow? [02:51:43] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 9.00, 11.85, 7.69 [02:53:16] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.86, 7.18, 6.51 [02:53:17] RECOVERY dpkg-check is now: OK on grail i-000002c6 output: All packages OK [02:55:37] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:37] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:45] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:57:24] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:24] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:57:29] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:58:34] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.71, 3.05, 4.87 [02:58:34] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 0.24, 2.44, 2.36 [02:58:34] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [02:58:34] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 70% free memory [02:58:34] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [02:58:35] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [02:58:35] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 112 processes [02:59:07] PROBLEM Free ram is now: CRITICAL on wikistats-history-01 i-000002e2 output: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:09] * jeremyb wonders who can make Emw a project... Reedy ? [02:59:14] or else i guess europe [03:00:55] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [03:00:55] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [03:00:55] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 93% free memory [03:00:55] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.48, 2.99, 2.31 [03:00:55] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 83 processes [03:02:15] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 0.16, 1.47, 1.21 [03:02:15] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 103 processes [03:02:20] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 80% free memory [03:02:20] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [03:02:20] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [03:03:35] jeremyb, based on that description i gave, this would be a 'specific' project and not a 'global' one, right? just looking over https://labsconsole.wikimedia.org/wiki/Help:Terminology [03:03:46] huh? [03:03:49] *click* [03:03:55] PROBLEM Free ram is now: UNKNOWN on wikistats-history-01 i-000002e2 output: NRPE: Unable to read output [03:04:29] errr, maybe that's how things are supposed to be. 
certainly it's not reality ;) [03:04:43] anyway, yes you would be specific [03:06:25] PROBLEM Free ram is now: UNKNOWN on network-test2 i-000002ed output: NRPE: Unable to read output [03:06:25] RECOVERY Current Users is now: OK on network-test2 i-000002ed output: USERS OK - 1 users currently logged in [03:06:25] RECOVERY Current Load is now: OK on network-test2 i-000002ed output: OK - load average: 0.13, 1.20, 0.85 [03:08:05] RECOVERY Total Processes is now: OK on network-test2 i-000002ed output: PROCS OK: 73 processes [03:08:10] RECOVERY Disk Space is now: OK on network-test2 i-000002ed output: DISK OK [03:11:25] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.54, 1.08, 3.36 [03:16:31] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.11, 1.00, 2.66 [03:30:48] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:32:32] network-test1: [ 4979.188031] BUG: soft lockup - CPU#0 stuck for 47s! [gmond:23238] [03:34:57] PROBLEM Disk Space is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Free ram is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Current Users is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:34:57] PROBLEM Total Processes is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:02] PROBLEM Current Load is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:02] PROBLEM dpkg-check is now: CRITICAL on redis1 i-000002b6 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:57] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 14% free memory [03:36:36] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [03:39:33] RECOVERY Disk Space is now: OK on redis1 i-000002b6 output: DISK OK [03:39:33] RECOVERY Current Users is now: OK on redis1 i-000002b6 output: USERS OK - 0 users currently logged in [03:39:33] RECOVERY Free ram is now: OK on redis1 i-000002b6 output: OK: 85% free memory [03:39:33] RECOVERY Total Processes is now: OK on redis1 i-000002b6 output: PROCS OK: 88 processes [03:39:38] RECOVERY Current Load is now: OK on redis1 i-000002b6 output: OK - load average: 0.39, 2.72, 2.19 [03:39:38] RECOVERY dpkg-check is now: OK on redis1 i-000002b6 output: All packages OK [03:40:08] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [03:41:18] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory [03:42:18] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 78 MB (5% inode=53%): [03:46:55] it's a Jamesofur! [03:47:03] It's a jeremyb [03:47:05] ! 
[03:47:05] There are multiple keys, refine your input: !, $realm, $site, :), access, account, account-questions, accountreq, addresses, afk, alert, amend, ask, b, bang, bastion, blehlogging, blueprint-dns, bot, bots, broken, bug, bz, change, console, credentials, cs, damianz, damianz's-reset, db, demon, deployment-prep, docs, documentation, domain, epad, etherpad, extension, gerrit, gerrit-wm, ghsh, git, git-puppet, gitweb, group, hashar, help, hexmode, hyperon, info, initial-login, instance, instance-json, instancelist, instanceproject, keys, labs, labsconf, labsconsole.wiki, labs-home-wm, labs-morebots, labs-nagios-wm, labs-project, leslie's-reset, link, linux, load, load-all, logbot, mac, magic, manage-projects, meh, monitor, morebots, nagios, nagios.wmflabs.org, nagios-fix, newgrp, new-labsuser, new-ldapuser, nova-resource, openstack-manager, origin/test, os-change, osm-bug, pageant, password, pastebin, pathconflict, petan, ping, pl, pong, port-forwarding, project-access, project-discuss, projects, puppet, puppetmaster::self, puppet-variables, putty, pxe, queue, quilt, report, requests, resource, revision, rights, rt, Ryan, ryanland, sal, SAL, security, security-groups, sexytime, socks-proxy, ssh, start, stucked, sudo, terminology, test, Thehelpfulone, unicorn, whatIwant, whitespace, wiki, wikitech, windows, wl, wm-bot, [03:47:13] have you come to heal all the opsen? [03:47:19] they need tea! [03:48:26] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory [03:51:52] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [03:54:26] jeremyb: thanks for all the help (i'm off for the night) [03:54:48] * jeremyb too momentarily [03:54:49] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 16% free memory [03:54:54] giving up on andrewbogott [03:55:04] eh? [03:55:22] Emw: see labs-l. most recent msg there i think [03:55:28] was trying to figure it out [03:55:33] ah, gotcha [03:55:39] you should subscribe if you're not on it [03:56:09] i just tried to subscribe and got notified that i apparently already am [03:56:21] hah [04:06:42] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [04:06:52] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:08:22] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 4% free memory [04:10:31] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [04:11:51] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:11:51] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:13:21] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 96% free memory [04:14:51] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory [04:19:51] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [05:15:20] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [05:50:30] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [06:27:13] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:27:13] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:09] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [06:32:18] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 13% free memory [06:32:18] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 5.13, 5.63, 3.33 [06:32:18] PROBLEM Total Processes is now: WARNING on deployment-thumbproxy i-0000026b output: PROCS WARNING: 155 processes [06:32:28] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [06:33:07] PROBLEM Current Load is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Disk Space is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Current Users is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:32] PROBLEM Total Processes is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:38] PROBLEM Free ram is now: CRITICAL on grail i-000002c6 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:37:57] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.49, 7.35, 4.78 [06:37:57] RECOVERY Current Load is now: OK on grail i-000002c6 output: OK - load average: 0.43, 1.95, 1.51 [06:37:57] RECOVERY Disk Space is now: OK on grail i-000002c6 output: DISK OK [06:37:57] RECOVERY Current Users is now: OK on grail i-000002c6 output: USERS OK - 0 users currently logged in [06:37:58] RECOVERY Total Processes is now: OK on grail i-000002c6 output: PROCS OK: 101 processes [06:38:03] RECOVERY Free ram is now: OK on grail i-000002c6 output: OK: 85% free memory [06:41:02] PROBLEM Free ram is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:22] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:54] RECOVERY Total Processes is now: OK on deployment-thumbproxy i-0000026b output: PROCS OK: 150 processes [06:43:03] PROBLEM Current Users is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:03] PROBLEM dpkg-check is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:15] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:55] PROBLEM Disk Space is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:55] PROBLEM Current Load is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:44:33] RECOVERY Disk Space is now: OK on deployment-transcoding i-00000105 output: DISK OK [06:45:23] PROBLEM Total Processes is now: CRITICAL on gluster-4 i-000002e4 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:27] PROBLEM Total Processes is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:38] PROBLEM Current Users is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:46:38] PROBLEM Free ram is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:46:38] PROBLEM dpkg-check is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:36] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%): [06:48:41] PROBLEM Free ram is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:37] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:34] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [06:58:49] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:55] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:14] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:15] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:29] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:00:40] PROBLEM Total Processes is now: WARNING on deployment-thumbproxy i-0000026b output: PROCS WARNING: 151 processes [07:01:15] PROBLEM Current Load is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:01:15] PROBLEM Disk Space is now: CRITICAL on mwreview i-000002ae output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:08] PROBLEM Disk Space is now: WARNING on deployment-transcoding i-00000105 output: DISK WARNING - free space: / 78 MB (5% inode=53%): [07:02:19] PROBLEM Free ram is now: CRITICAL on gluster-2 i-000002e0 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:40] PROBLEM Total Processes is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:45] PROBLEM dpkg-check is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:45] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Disk Space is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Current Users is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:50] PROBLEM Current Load is now: CRITICAL on network-test1 i-000002ec output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:02:50] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [07:03:29] RECOVERY Disk Space is now: OK on gluster-4 i-000002e4 output: DISK OK [07:03:29] PROBLEM Current Load is now: WARNING on gluster-4 i-000002e4 output: WARNING - load average: 3.99, 6.66, 6.30 [07:03:39] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 17.00, 19.96, 13.14 [07:03:54] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:05:55] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 3.34, 6.24, 5.88 [07:05:56] RECOVERY Total Processes is now: OK on gluster-4 i-000002e4 output: PROCS OK: 85 processes [07:06:01] PROBLEM Free ram is now: UNKNOWN on gluster-4 i-000002e4 output: NRPE: Unable to read output [07:06:01] RECOVERY Current Users is now: OK on mwreview i-000002ae output: USERS OK - 0 users currently logged in [07:06:01] RECOVERY Total Processes is now: OK on mwreview i-000002ae output: PROCS OK: 113 processes [07:06:06] RECOVERY Free ram is now: OK on mwreview i-000002ae output: OK: 69% free memory [07:06:06] RECOVERY dpkg-check is now: OK on mwreview i-000002ae output: All packages OK [07:06:06] RECOVERY Current Load is now: OK on mwreview i-000002ae output: OK - load average: 5.02, 5.21, 4.23 [07:06:06] RECOVERY Disk Space is now: OK on mwreview i-000002ae output: DISK OK [07:07:00] PROBLEM Free ram is now: UNKNOWN on gluster-2 i-000002e0 output: NRPE: Unable to read output [07:07:00] RECOVERY Current Users is now: OK on gluster-4 i-000002e4 output: USERS OK - 0 users currently logged in [07:07:00] RECOVERY dpkg-check is now: OK on gluster-4 i-000002e4 output: All packages OK [07:07:29] PROBLEM host: integration-apache1 is DOWN address: i-000002eb PING CRITICAL - Packet loss = 100% [07:08:44] RECOVERY Current Load is now: OK on gluster-4 i-000002e4 output: OK - load average: 0.16, 2.51, 4.58 [07:08:44] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:08:45] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:08:45] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 90 processes [07:08:54] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:08:59] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 2.88, 5.45, 5.07 [07:09:05] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:05] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:10] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:10:26] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 1.84, 3.43, 4.75 [07:11:55] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.26, 3.07, 3.98 [07:11:55] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [07:11:55] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [07:11:55] RECOVERY Total Processes is now: OK on network-test1 i-000002ec output: PROCS OK: 85 processes [07:12:00] RECOVERY dpkg-check is now: OK on network-test1 i-000002ec output: All packages OK [07:12:00] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 76 processes [07:12:05] RECOVERY Disk Space is now: OK on network-test1 i-000002ec output: DISK OK [07:12:05] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory [07:12:05] RECOVERY Current Load is now: OK on network-test1 i-000002ec output: OK - load average: 0.65, 3.26, 3.82 [07:12:05] RECOVERY Current Users is now: OK on network-test1 i-000002ec output: USERS OK - 0 users currently logged in [07:12:05] RECOVERY host: integration-apache1 is UP address: i-000002eb PING OK - Packet loss = 0%, RTA = 0.74 ms [07:13:55] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.04, 2.01, 3.67 [07:13:55] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 0.33, 2.79, 3.58 [07:13:55] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 83% free memory [07:13:55] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK [07:13:55] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in [07:13:56] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 105 processes [07:14:00] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK [07:21:05] good morning [07:23:20] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.15, 0.68, 4.21 [07:32:00] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.57, 0.83, 3.75 [07:50:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:21:57] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:27:18] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 12% free memory [08:32:18] PROBLEM Free ram is now: UNKNOWN on network-test1 i-000002ec output: NRPE: Unable to read output [08:47:36] i-000002eb [08:47:36] hashar: ok :) [08:47:38] https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-000002eb [08:47:55] btw, it is good to see you in the morning :-] [08:48:37] yeah [08:48:42] Also on https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=integration&instanceid=i-000002eb [08:48:47] so hmm [08:48:48] There is a huge flood of this mess everlasting: [08:48:52] I am not part of the integration project :-] [08:48:55] Jun 27 07:37:48 i-000002eb dhclient: DHCPREQUEST of 10.4.0.102 on eth0 to 10.4.0.1 port 67 [08:48:55] Jun 27 07:37:48 i-000002eb dhclient: DHCPACK of 10.4.0.102 from 10.4.0.1 [08:48:56] Jun 27 07:37:48 i-000002eb dhclient: bound to 10.4.0.102 -- renewal in 43 seconds. [08:48:58] Jun 27 07:38:19 i-000002eb nrpe[3463]: Host 10.4.0.34 is not allowed to talk to us! 
[08:48:59] Jun 27 07:38:20 i-000002eb nrpe[3465]: Host 10.4.0.34 is not allowed to talk to us! [08:48:59] Jun 27 07:38:20 i-000002eb nrpe[3467]: Host 10.4.0.34 is not allowed to talk to us! [08:49:00] etc. [08:49:05] yeah dhclient is a mess [08:49:12] we should make syslog to redirect that to another file :-( [08:49:35] dhcp is made every minute or so to make sure instances acquires IP asap (I think) [08:49:46] but it doesn't work? [08:50:02] well it does! bound to 10.4.0.102 -- renewal in 43 seconds. [08:50:03] hashar: you are a member of the project, actually [08:50:04] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration [08:50:10] ahhh [08:50:10] "s not allowed to talk to us!" [08:50:14] that is the damn project filter again [08:50:17] set to false by default [08:50:21] indeed, same here [08:50:28] also missed it the first time [08:50:39] need to change the default :-] [08:51:52] so, did you get root meanwhile? [08:51:55] Krinkle? [08:52:01] no [08:52:16] mutante: what, you mean on the labs instnace? [08:52:28] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [08:52:42] Krinkle: you are a sysadmin :/ [08:53:06] Krinkle: yes, i meant the sudo issue [08:53:12] nope, not solved yet [08:53:13] I can't sudo either [08:53:33] i heard that one from hundfred a little while ago [08:53:37] on a fresh instance [08:53:43] maybe puppet did not work fully on that instance and is missing the labs sudoer file? [08:53:54] he said he could sudo, and then it stopped working [08:54:01] and i dont think he has been removed from groups [08:54:16] I know we have a conflict for two puppet class which try to both install /etc/sudoers [08:54:24] one is provided by base:: the other by some apache:: package [08:54:32] so that could prevent installing the correct shudders file [08:54:45] or sudo::labs_project is broken? [08:54:52] ew..I don't have any puppet stuff enabled yet on that instance. [08:54:55] should I? [08:54:59] nop indeed [08:55:03] (to get "it" working) [08:55:19] it might be something general with security groups as well [08:55:19] classes: base -- ldap::client::wmf-test-cluster -- exim::simple-mail-sender -- sudo::labs_projec [08:55:39] also see labs-l, mail from Andrew Bogott [08:55:42] ohh [08:55:45] talking about security groups [08:55:47] we might need a sudo policy https://labsconsole.wikimedia.org/wiki/Special:NovaSudoer [08:56:05] doing that [08:56:31] !log integration Created ALL/ALL sudo policy for Krinkle and I [08:56:33] Logged the message, Master [08:56:39] * Krinkle was just about to [08:56:48] I did select "default" during instance creation [08:56:52] looks like that didn't work [08:56:54] that sounds like default policies have been removed [08:57:00] for sudo and firewall rules [08:57:22] root@integration-apache1:~# [08:57:24] I love our labs [08:57:34] somehow the change has been applied instantly [08:58:01] oh it uses ldap [08:58:02] damn [08:58:06] that project is really awesome [08:58:29] yea, LDAP [09:03:44] hashar: So here is the plan I have for now: Set up TestSwarm+BrowserStack (right from github and npm) without gerrit/jenkins. [09:03:46] Get that working and then either A) puppetize is for live (likely never gonna happen), B) have production jenkins submit testswarm to that instead. I know B) sounds crazy, but it makes sense because it doesn't rely yet (one-way pushes) on it and its better than what we have now. [09:03:46] it doesn't need the mw-fetcher because TestSwarm doesn't run tests, it distributes them. 
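hashar's suggestion above about moving the dhclient chatter out of the main syslog is not acted on in this log; a sketch of how it could be done with an rsyslog property filter follows. The drop-in path and file name are assumptions, and the legacy "& ~" discard syntax should be checked against the rsyslog version on the instance:

    cat <<'EOF' | sudo tee /etc/rsyslog.d/30-dhclient.conf
    # route dhclient messages to their own file and stop processing them further
    :programname, isequal, "dhclient"    /var/log/dhclient.log
    & ~
    EOF
    sudo service rsyslog restart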
[09:03:46] urls all point to clones on int.mw.org [09:03:46] B) is exactly what I thought about [09:03:46] but you definitely want to use puppet [09:03:47] that is a great to document the setup [09:03:49] sure [09:03:51] and let you easily reinstall a machine if it is screwed beyond repair [09:03:55] I can definitely help there [09:03:57] as well as ops [09:04:09] yep [09:04:22] can you easily install testswarm from source? [09:04:31] this way we could just get rid of the testswarm debian package [09:04:32] Yes, 3 commands and its done [09:04:44] so we will want a github puppet class :-] [09:04:54] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [09:05:00] which can just be given a tag / sha1 and the github project name [09:05:13] there is probably one existing in puppet forge [09:05:18] hashar: Right now I want to avoid puppet so I can work freely, and face the real problems, solve them, and then document it. [09:05:38] you could do all of that with puppetmaster::self [09:05:39] hashar: so right now I want to get php, mysql, apache stuff running. Where do I start? apt-get or is there a default puppet thing I can use from labs (that doesn't mess up the ability to install non-puppet stuff) [09:05:44] let you have a local puppet repo on the instance [09:05:55] so you can just edit the .pp file, save it, then run puppetd -tv [09:06:20] Krinkle: you can select puppet classes for php, mysql, apache from labsconsole and apply them, and it would not mean it removes software you still install with apt manually [09:06:25] Also, I'll need a public-IP, labsconsole gives me a failure message when I try to allocate one [09:06:32] the puppet class for testswarm should be in manifests/misc/contint.pp [09:06:35] that is because we are out of IPs [09:06:42] :( [09:06:45] I removed 3 others from retired proejcts [09:06:54] should give me at least one? [09:06:56] let me check the quota for your project [09:07:04] ah, there is a quota per project [09:07:05] what is the project name again? [09:07:06] makes sens :) [09:07:12] mutante: "integration" [09:07:19] probably has 0 [09:08:00] yea, i shall raise it to 1 in a moment, but independent of that i heard that we ran out of IPs in the labs range from Ryan recently [09:08:41] unsure about the "giving back removed ones", good question [09:10:41] Krinkle: yes, integration has "floating_ips: 0" quota, upping [09:11:30] I think Ryan talked about setting up a shared reverse proxy for labs [09:11:30] !log integration - raised floating_ips quota from 0 to 1 [09:11:31] Logged the message, Master [09:11:59] that would help saving up public IP [09:13:09] fyi how this works: https://labsconsole.wikimedia.org/wiki/Help:Nova-manage [09:14:10] !log deployment-prep upgrading deployment-transcoding [09:14:11] Logged the message, Master [09:15:51] mutante: do I / should I have the ability to raise floating_ips myself for next time? [09:16:16] i.e. 
what access does it require to use those commands [09:16:59] !log deployment-prep -transcoding : dpkg --purge linux-image-2.6.32-37-virtual linux-image-2.6.32-318-ec2 linux-image-2.6.32-34-virtua [09:17:00] Logged the message, Master [09:17:12] the old kernels are taking too much space for the little / ;-D [09:17:24] RECOVERY Disk Space is now: OK on deployment-transcoding i-00000105 output: DISK OK [09:17:26] Krinkle: it requires root on virt1, so i think the quota itself is just ops, but you should be able to use Special:NovaAddress to actually assign IPs and remove old ones [09:17:45] yeah quota is root only [09:17:49] Krinkle: wanna try again now? [09:17:50] mutante: still gives me "Failed to allocate new public IP address." - takes a few minutes to take effect? [09:18:11] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=allocate&project=integration [09:18:45] yea, i get "Netadmin required", as intended, but i can add myself... [09:20:33] 06/27/2012 - 09:20:32 - Created a home directory for dzahn in project(s): integration [09:21:10] !log integration Enabling group puppet "puppetmaster::self" on integration-apache1 [09:21:11] Logged the message, Master [09:21:32] Krinkle: do you still know the IPs you used before on other projects but then removed? [09:21:33] 06/27/2012 - 09:21:33 - User dzahn may have been modified in LDAP or locally, updating key in project(s): integration [09:21:39] mutante: no [09:22:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [09:23:08] hashar: So puppetmaster::self is the only puppet thing I do from labsconsole and the rest locally? https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource:I-000002eb&curid=1410&diff=4498&oldid=4485 [09:23:28] should be [09:23:31] k [09:23:36] I think the puppet repo is somewhere in /var/lib/git [09:23:47] and there should be some symbolic links in /etc/puppet too [09:24:28] http://lists.wikimedia.org/pipermail/labs-l/2012-June/000239.html --> https://labsconsole.wikimedia.org/wiki/Help:SelfHostedPuppet [09:24:42] I have never used that though [09:26:44] PROBLEM host: deployment-transcoding is DOWN address: i-00000105 CRITICAL - Host Unreachable (i-00000105) [09:26:51] I rebooted that one [09:27:45] mutante: the labs nagios can not connect to my apaches for some reason http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail [09:27:57] mutante: do you have any knowledge about the nagios in labs? [09:28:24] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.95, 5.75, 5.25 [09:29:03] yes,if it is about puppet freshness monitoring, yes if it is general Nagios, no if it is the "how do nagios configs get auto-created there" [09:29:18] hehe [09:29:30] apache is definitely running on deployment-apache30 [09:29:52] that might be a firewall rule, but there is a web security policy which allow port 80/443 from 0.0.0.0/0 [09:29:58] hashar: "sudo puppetd -tv" is failing [09:30:09] Krinkle: dpaste.org the output ? 
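hashar's description of the puppetmaster::self setup above boils down to an edit-and-apply loop on the instance itself. A rough sketch, with the repository path an assumption based on his "somewhere in /var/lib/git" recollection:

    # after enabling puppetmaster::self from labsconsole and letting it converge
    cd /var/lib/git/operations/puppet      # assumed location of the local puppet clone; verify on the instance
    sudoedit manifests/misc/contint.pp     # the file hashar says the testswarm class belongs in
    sudo puppetd -tv                       # apply the local change and watch the run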
[09:30:14] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 71 MB (5% inode=57%): [09:30:15] indeed, from a glance an Nagios, it looks like firewalling / security group changes [09:30:27] hashar: http://cl.ly/1t2u042W1J1W2i0n1B1W [09:30:36] hmm [09:30:41] that URL looks like a virus [09:31:14] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 18% free memory [09:31:35] hashar: looks like firewalling / security group, yea [09:31:54] hashar: why? [09:32:07] not much different that an average bit.ly link [09:32:15] hashar: socket timeout, not actually refused / down [09:33:04] Krinkle: the code snippet is scope.lookupvar('puppetmaster::config').sort. [09:33:12] hashar: and this one on apache31: NRPE: Command 'check_ram' not defined , is a bit weird, because that should be defined everywhere by default (and usually is) [09:33:21] Krinkle: so it looks like the variable `puppetmaster::config` is not set [09:33:48] mutante: that might because puppet is broken on my boxes. I got two puppet class conflicting on installing /etc/sudoers [09:33:59] mutante: need Ryan to review a change I made [09:34:07] hashar: well, I didn't create any of those files. I just enabled the checkbox and ran the command. Odd that this has a parse error in it? [09:34:13] hashar: ok, makes sense [09:34:51] Krinkle: not really a parse error. Just that the object does not have a sort method because that is a null object [09:35:06] (I think) [09:35:16] I don't know. I just want it to "Just work" :P [09:36:03] so anyway, looks like I'll be puppetizing later. Gonna get going now (one way or another) - ow, I still need the public IP for later though (other wise I can't connect to it from browserstack);. [09:36:24] RECOVERY host: deployment-transcoding is UP address: i-00000105 PING OK - Packet loss = 0%, RTA = 0.55 ms [09:36:36] mutante: Any luck trying from your account? [09:37:30] Krinkle: not yet, i got the same error and now looking at the list of all IPs in the pool [09:37:39] ok [09:38:46] there is a whole new range already being prepared but not assigned to pool yet [09:39:05] but there might be reasons for that, i know there was work on it the other day [09:41:00] !log bastion - re-adding unassigned IP 208.80.153.222 to virt2 pool [09:41:01] mutante: do you have wiki-rights to see deleted contributions? [09:41:02] Logged the message, Master [09:41:13] mutante: if so, you can find out the IPs I disallocated [09:41:13] !log integration - Allocated new public IP address: 208.80.153.222 [09:41:14] Logged the message, Master [09:41:17] Krinkle: ^:) [09:41:24] https://labsconsole.wikimedia.org/wiki/Special:Contributions/Krinkle [09:41:43] I can't see it myself though [09:41:45] mutante: thx [09:42:20] Krinkle: i dont have wiki powers, no staff flag yet afaict [09:42:23] !log deployment-prep deleted deployment-thumbproxy instance. We are not going to replicate the production thumbnailing architecture [09:42:25] Logged the message, Master [09:42:41] mutante: staff shouldn't apply to labsconsole though, its not in wmf-centralauth [09:42:49] oh, on labsconsole. never mind, yea [09:42:59] mutante: can I safely remove the old "phabricator.wmflabs.org" hostname associated with it? [09:43:50] oh... 
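For the nagios symptoms discussed above (socket timeouts against the deployment apaches, and "check_ram not defined" on apache31), a quick way to separate a firewalling/security-group problem from a missing NRPE command is to run the check by hand from the monitoring host. The plugin path is the usual Ubuntu one and the hostname is illustrative:

    /usr/lib/nagios/plugins/check_nrpe -H deployment-apache31                # prints the NRPE version if the daemon is reachable
    /usr/lib/nagios/plugins/check_nrpe -H deployment-apache31 -c check_ram   # reaches the daemon, fails if nrpe.cfg lacks the command

A timeout or refused connection on the first call points at the network or security groups; an immediate "Command 'check_ram' not defined" means the connection is fine and the target's nrpe.cfg is simply missing that definition.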
[09:43:57] I can't associate the IP to my instance apparently [09:44:18] hrmm, i could just see that it was not assigned to any instance, unlike most other IPs [09:44:41] we gotta check on phabricator [09:44:51] but it was definitely not in use [09:44:57] per nova-manage that is [09:45:22] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [09:45:43] o hai. [09:47:29] mutante: Yeah, looks like something wasn't clean up in the db and shined through when the IP was assigned to integration. [09:47:43] phabricator is no longer active since 2012-01, all instances were removed [09:49:30] ok, so i guess its safe to remove the DNS name [09:49:38] and reuse the IP [09:50:02] PROBLEM host: deployment-thumbproxy is DOWN address: i-0000026b check_ping: Invalid hostname/address - i-0000026b [09:50:25] deployment-thumbproxy <--- I have deleted that one [09:52:17] mutante: okay, done [09:52:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [09:52:35] mutante: still can't associate the instance though. getting an error that has no interface message created for it even [09:52:48] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=associate&ip=208.80.153.222&project=integration [09:52:48] Krinkle: ok, cool, i just added myself to sysadmin (yes, not inherited from netadmin) to try and configure your instance [09:53:08] hrmm [09:53:14] submit that and get the error [09:54:23] ack, i see it. hrmm, at this point i would like to ask Ryan and/or create Bugzilla ticket [09:54:41] sounds like an issue with giving back / re-using IPs [09:54:52] that had other names associated with them before [09:56:14] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [09:56:19] i may have an associated problem... [09:56:27] or maybe not [09:56:45] Jens_WMDE: what is it? [09:57:02] on i-00000225.pmtpa.wmflabs... wikidata-dev-3 [09:57:24] no way to reach itself using the external ip. [09:57:36] no ping, http, nothing. [09:57:44] everything else works. [09:58:00] most likely you need to add a security group [09:58:10] and/or your default group is not in effect anymore [09:58:24] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [09:58:27] root@i-00000225:~# ping wikidata-dev-client.wikimedia.de [09:58:28] PING wikidata-dev.wikimedia.de (208.80.153.239) 56(84) bytes of data. [09:58:28] ^C [09:58:28] --- wikidata-dev.wikimedia.de ping statistics --- [09:58:28] 9 packets transmitted, 0 received, 100% packet loss, time 8005ms [09:59:43] mutante: here? https://labsconsole.wikimedia.org/wiki/Special:NovaSecurityGroup [09:59:49] Jens_WMDE: check /Special:NovaSecurityGroup , yes [10:00:02] for example to allow http i have [10:00:23] 80 80 tcp 0.0.0.0/0 [10:00:30] and 443 [10:01:34] and then there is "-1 -1 icmp" which would allow ping [10:01:58] but that is from "source group" default, and http is not, i added it [10:03:49] Jens_WMDE: or i think i should say "add a rule to your existing security group" as opposed to "add new group", [10:03:58] mutante: can you spot anything wrong here? 
https://labsconsole.wikimedia.org/wiki/Nova_Resource:I-00000225 [10:04:48] Jens_WMDE: i cant tell from that page, gotta add myself to project roles etc, it is all "per project", hold on [10:05:13] mutante: thank you very much [10:07:29] !log deployment-prep updating packages on deployment-cache-bits [10:07:31] Logged the message, Master [10:07:31] Jens_WMDE: oh, but http already works for me! [10:07:43] Jens_WMDE: without changes, but http://wikidata-dev-repo.wikimedia.de/wiki/Main_Page works for me [10:08:24] mutante: i know. it works. [10:08:53] mutante: the problem is only when i access wikidata-dev-repo.wikimedia.de on the instance. [10:09:04] eh, but < Jens_WMDE> no ping, http, nothing. http://208.80.153.239 [10:09:12] ohh [10:09:15] got you wrong [10:09:20] i.e. when i refer to the instance by its external IP/name [10:09:49] my wording was confusing :) [10:09:55] but then the problem is [10:10:21] well, dont you just want to access localhost then? [10:10:34] i quick-fixed it by adding "10.4.0.23 wikidata-dev-repo.wikimedia.de" to /etc/hosts [10:10:44] i was about to suggest that next :p [10:10:55] but why connect from instance to external IP of itself? [10:10:57] but still... [10:11:14] well. [10:11:34] so we're doing this thing, wikidata :) [10:11:45] heh! yea, i got a flyer [10:11:50] i'm testing the API/refilling the database [10:12:12] the script i have hammers the wikidata instance with api calls. [10:12:47] the thing is it all used to work last week :( [10:13:13] i mean, i can work around it, but... [10:13:25] something like .. hmm .. curl -I -H 'Host:wikidata-dev-repo.wikimedia.de' --url "http://wikidata-dev-repo.wikimedia.de" localhost ? [10:13:48] by which i mean "talk to localhost to reach Apache , yet be able to request the right virtual host" [10:13:50] mutante: a little more, but basically yes. [10:14:18] mutante: yeah, but what's the point... [10:14:38] mutante: i mean... i can refer to www.google.com on the instance. [10:14:55] i can refer to wikidata-dev-repo.wikimedia.de from the outside. [10:15:17] but i can't refer to wikidata-dev-repo.wikimedia.de from the instance. and that changed last week or so. [10:15:26] well, just not needing to open something if it can also stay internal is the instinct when talking about firewall rules [10:16:08] hm yes. [10:16:10] but if you open the same port for users anyways, i guess .. shrug, whatever works [10:16:49] please also ask on list about the recent change, yea [10:17:09] there is a recent mail on it, a thread where it fits in [10:17:15] okay okay [10:17:20] will have a look [10:17:22] Security groups & outside access [10:17:25] it fits exactly [10:17:25] thanks anyway [10:18:09] thank you very much, it's really appreciated :) [10:18:46] no problem, just that i also dont know that much about the most recent changes yet [10:20:41] I love the beta project [10:20:42] "some instances which recently had access to the outside Internet no longer have this access." [10:20:47] that is it for sure [10:21:03] just discovered that memcached is online instance of 64MB :-) [10:21:08] on a 8 GB machine [10:21:24] PROBLEM Total Processes is now: CRITICAL on deployment-mc i-0000021b output: Connection refused by host [10:21:27] !log deployment-prep rebooting deployment-mc for kernel upgrade [10:21:28] Logged the message, Master [10:22:44] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [10:23:57] morning Ryan [10:24:08] morning :) [10:24:08] morning [10:24:10] feeling better? 
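For the "instance cannot reach its own public IP" problem discussed here, the two workarounds already traded in the channel are to talk to the local web server while presenting the right virtual host, or to pin the public name to the internal address. Cleaned up:

    # ask the local Apache for the intended vhost without touching the public IP
    curl -I -H 'Host: wikidata-dev-repo.wikimedia.de' http://localhost/wiki/Main_Page

    # or Jens_WMDE's quick fix: map the public name to the instance's internal address
    echo '10.4.0.23 wikidata-dev-repo.wikimedia.de' | sudo tee -a /etc/hosts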
[10:25:14] a little [10:25:26] i tried to re-assign an unused IP to the pool. .153.222 to give it to "integration"/Krinkle [10:25:34] because i heard we ran out of IPs [10:25:55] and that was "phabricator" before, yet i could tell it was not assigned to anything and phabricator has been closed [10:25:58] !log deployment-prep removed memcached from deployment-nfs-memc , it is running on deployment-mc nowadays. [10:25:59] Logged the message, Master [10:26:18] you need to unallocate it from the project [10:26:24] before it can be re-used [10:26:40] yet, we still get an openstackmanager-associatedaddressfailed> [10:26:43] it is listed under integration now and I could ad a hostname to it [10:26:44] PROBLEM host: deployment-mc is DOWN address: i-0000021b CRITICAL - Host Unreachable (i-0000021b) [10:26:53] it only fails when trying to associate it with an instance [10:27:07] hm [10:27:42] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&action=associate&ip=208.80.153.222&project=integration [10:28:12] seems you aren't authorized [10:28:17] Ryan_Lane: what i did: nova-manage floating list, see it is not assigned to any instance nor says "virt2" as pool, then "nova-manage floating create 208.80.153.222", then i could assignt that to the project, unlike before, but we got the new error above when assign to instance [10:28:22] are you in the netadmin group in the project? [10:28:36] mutante: it's in the project [10:28:38] i added myself to it and could confirm the error [10:28:44] (netadmin role) [10:29:40] it felt like that is caused by it being re-used and the old DNS name still being somewhere [10:30:10] hm [10:30:10] Krinkle could remove the "phabricator" name from it though [10:30:23] database says its still assigned to phabricator project [10:31:43] wow it's in both [10:31:49] hmm, well, using "floating create" changed it from not letting me assign to project to assigning it to project [10:31:54] uggghhhhhh [10:31:59] you did floating create? [10:32:00] Ryan_Lane: possibly related to the current trouble with public IPs: we can't access our public IPs from *inside* of labs, not even from the instance itself: https://bugzilla.wikimedia.org/show_bug.cgi?id=37985 [10:32:12] the address exists twice now [10:32:17] Ryan_Lane: yes , it appeared to me like i had to [10:32:29] Ryan_Lane: sorry, but it did not say it was in virt2 pool [10:32:40] mutante: the correct way to do this, is to go into the phabricator project, then unallocate it [10:32:52] then go into the integration project and allocate it [10:33:18] ok, though phabricator has been closed a while ago i think [10:33:23] looking [10:33:46] Daniel_WMDE: yes, that's a known issue. it's because it's using NAT [10:33:55] internally you must use internal IPs [10:34:07] mutante: there's no such thing as closed [10:34:15] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [10:34:18] and now there's two addresses [10:34:18] Ryan_Lane: but it used to work?... [10:34:21] so things are fucked up [10:34:35] Daniel_WMDE: I'd be highly surprised if that was the case [10:34:45] well, be surprised then :) [10:34:51] Ryan_Lane: hm. I actually saw it work. [10:34:51] Daniel_WMDE: if the instances are on different hosts, its possible [10:34:59] if they are on the same host, then it won't work [10:35:07] Ryan_Lane: ok, just closed as in "all instances deleted". [10:35:08] which means, you should never rely on it [10:35:09] Ryan_Lane: it'S the *same* instance, talking to itself using it's public IP. 
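Ryan's point about the duplicate address amounts to: reclaim floating IPs through the console (disassociate, then unallocate from the old project, then allocate in the new one), and keep nova-manage for inspection or for genuinely new ranges. As a sketch of the command-line side only, following the usage that appears above (run as root on the nova controller; the range below is an RFC 5737 documentation prefix, purely illustrative):

    nova-manage floating list                      # each address, the instance it is bound to, and its pool
    nova-manage floating create 203.0.113.0/28     # add genuinely new addresses only; re-running create for an
                                                   # address that already exists is what produced the duplicate row here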
[10:35:18] mutante: yeah, that doesn't mean it's closed ;) [10:35:32] Daniel_WMDE: that absolutely won't work [10:35:35] RECOVERY host: deployment-mc is UP address: i-0000021b PING OK - Packet loss = 0%, RTA = 0.83 ms [10:35:41] NAT makes that impossible [10:35:45] Ryan_Lane: it did, apparently... [10:35:48] well. /etc/hosts to the rescue [10:36:00] but but but ... [10:36:04] ah well. [10:36:05] just use internal IPs [10:36:05] Ryan_Lane: howdy [10:36:24] you're going to get nothing but trouble using the public ones [10:36:25] RECOVERY Total Processes is now: OK on deployment-mc i-0000021b output: PROCS OK: 133 processes [10:36:27] paravoid: howdy [10:37:01] how's stuff? [10:37:13] mutante: btw, when an IP says its in a certain pool, it means it's assigned to a specific host [10:37:23] mutante: that doesn't really mean anything [10:37:32] that has to do with the nova-network service [10:37:33] Ryan_Lane: project without an instance has "Allocate IP" but not "Disallocate", and "Disassociate" is per instance [10:37:37] 06/27/2012 - 10:37:36 - Created a home directory for dzahn in project(s): phabricator [10:37:58] that's not correct [10:38:31] 06/27/2012 - 10:38:31 - User dzahn may have been modified in LDAP or locally, updating key in project(s): phabricator [10:38:34] things are broken now that there are two addresses [10:39:20] PROBLEM Current Users is now: WARNING on bastion-restricted1 i-0000019b output: USERS WARNING - 6 users currently logged in [10:39:22] oh..hmm, sry, just describing what i see now when trying to remove. [10:39:31] I deleted both addresses [10:39:40] then re-added one [10:39:49] Oh Ryan_Lane ,you're here [10:39:49] Great :) [10:40:45] !log deployment-prep made deployment-mc to use 'memcached' puppet class. Now uses 2000MB apparently [10:40:47] Logged the message, Master [10:40:58] ok fixed [10:41:04] I associated it with the instance [10:41:05] Ryan_Lane: thanks! how about the .155. network i see there. should we use that instead next [10:41:12] no [10:41:15] that shouldn't be used [10:41:18] ok [10:41:25] that's leslie's network [10:41:30] 06/27/2012 - 10:41:30 - Created a home directory for laner in project(s): integration [10:41:32] she already allocated all of them [10:41:42] we do have an additional block to use, though [10:42:11] we should see which projects have IPs allocated but not associated [10:42:13] and take them back [10:42:32] 06/27/2012 - 10:42:32 - User laner may have been modified in LDAP or locally, updating key in project(s): integration [10:42:41] ok, i was going to make sure re-using IPs like that is ok in general, as long as disassociating it first of course [10:42:49] paravoid: it's ok. I'm not sick anymore, so that's a plus [10:43:00] oh, didn't know you were sick :/ [10:43:03] yes. must disassociate them first [10:43:10] yeah, that's why I wasn't on yesterday [10:43:12] sorry [10:43:19] last week in Berlin too, isn't it? [10:43:22] yep [10:43:42] were you in Berlin for the wikidata project? [10:44:20] Daniel_WMDE: wait, your instance can't talk to the outside world? [10:44:44] now *that's* a problem [10:44:55] Krinkle-away: you got the IP now, Ryan fixed it. [10:44:58] Ryan_Lane: it can talk to the outside world EXCEPT to the outside IP associated with instances [10:45:02] Yep [10:45:08] well, that's normal [10:45:23] but I'm getting reports that people can't talk to the outside world [10:45:34] and indeed, bastion-restricted cant [10:45:48] Ryan_Lane: I'll be away starting next week [10:45:57] (btw) [10:46:01] oh? where you going? 
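The practical upshot of the NAT point above, as a sketch: the floating IP will not hairpin back into labs, so anything running inside labs should target the private 10.4.0.x address instead (the IPs below are the ones mentioned earlier in this log).

    # from inside labs, the public address is a dead end...
    curl -m 5 -sI http://208.80.153.239/ || echo "no route via the floating IP"

    # ...so use the instance's internal address instead
    curl -sI http://10.4.0.23/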
[10:46:08] Nicaragua [10:46:13] vac + debconf [10:46:24] well, vac + debcamp + debconf, 2 weeks [10:48:08] Ryan_Lane: so, I pinned python-eventlet in puppet, upgraded it & restarted nova on virt6-8 [10:48:13] so that's ready too [10:48:16] how shall we proceed? [10:48:19] migrate a VM there perhaps? [10:48:41] ah. cool. have fun in Nicaragua [10:48:58] you can create a test on [10:48:59] *one [10:49:26] how would I create it in one of these nodes specifically? [10:49:32] it has no instances [10:49:35] so it'll be preferred [10:49:42] if I enable it you mean [10:49:46] yes [10:49:49] curl 'http://localhost' works from within the instance shell. Can't get at it from the outside yet though. [10:50:00] takes a while for DNS to take effect? Any estimate? [10:50:19] dns should work right now [10:50:28] localhost? [10:50:30] .... [10:50:31] Hm.. DNS shouldn't be an issue, I"m using the public IP directly in another tab, same result [10:50:36] that's 127.0.0.1 [10:50:42] Ryan_Lane: can we test it without affecting the service? [10:50:50] PROBLEM Free ram is now: CRITICAL on deployment-squid i-000000dc output: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:50] paravoid: nope [10:50:55] you must enable it to testit [10:50:56] (mostly out of curiosity) [10:51:10] Ryan_Lane: Yes, that's serving /var/www/index.html as it should [10:51:18] uh, okay... [10:51:29] lunch time! [10:51:34] Krinkle-away: do your security groups allow port 80 into that instance? [10:51:35] Ryan_Lane: Just testing that apache works. http://integration.wmflabs.org/ / http://208.80.153.222/ is not responding yet though [10:51:53] because dig shows the address properly from the outside world [10:51:59] right [10:52:11] so. likely security groups [10:52:21] ping workd [10:52:23] *works [10:52:50] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [10:52:57] Krinkle-away: see backlog with Daniel_WMDE a little while ago, we talked about security groups to allow 80 and curl [10:53:38] we did? [10:53:40] eh, sorry, Jens_WMDE i meant to say [10:53:45] ah :) [10:54:17] anyway, security group rules are inbound only [10:54:24] there's no outbound rules [10:54:35] Krinkle-away: but, in your case, you need an inbound rule ;) [10:54:44] I'm copying rules from resourcelaoder2 now for group 'http' [10:55:23] you can't modify an instance's groups once the instance is created [10:55:35] Then I better get it right :) [10:55:38] so, if the instance isn't in the http group, that's not going to work [10:55:49] you can modify the group all you want, and the settings will apply properly [10:55:50] uh? [10:56:16] but, if the instance is only in the default group, you can't add it to the http group [10:56:30] But I shouldn't modify the default group, right? [10:56:39] well, you can [10:56:40] that made me say "< mutante> Jens_WMDE: or i think i should say "add a rule to your existing security group" as opposed to "add new group"" [10:56:50] it's not the preferred way of going about [10:56:53] about it [10:57:01] but if you need to, it's fine [10:57:15] the issue with doing that is that it applies project wide [10:57:33] so, if you add another instance where you want to hide port 80 it isn't possible [10:58:09] Jens_WMDE: ^ [10:58:11] Ryan_Lane: np for now. Ahm. so what CIDR range should I use? I see two common patterns in other project of their "web" 80 rule [10:58:12] Begin: Running /scripts/init-bottom ... done. [10:58:12] cloud-init start-local running: Wed, 27 Jun 2012 10:56:18 +0000. 
up 1.68 seconds [10:58:16] no instance data found in start-local [10:58:17] last lines from the log [10:58:20] cloud-init-nonet waiting 120 seconds for a network device. [10:58:20] PROBLEM host: deployment-squid is DOWN address: i-000000dc CRITICAL - Host Unreachable (i-000000dc) [10:59:03] Ryan_Lane: any ideas? [10:59:22] likely doesn't have networking [10:59:26] which host is this on? [10:59:29] yes, it doesn't [10:59:32] it got allocated to virt6 [10:59:47] Krinkle-away: 0.0.0.0/0 if you want access from the outside world [10:59:49] Ryan_Lane: 0.0.0.0/0 or 10.4.0.0/24 (for the port 80 rule in integration/default) [10:59:51] paravoid: lemme take a look [10:59:53] Ah, I see [10:59:59] Ryan_Lane: wait [11:00:02] 10.4.0.0/24 is labs only [11:00:20] ok [11:00:23] I wasn't sure in which direction the CIDR was for [11:01:50] !log integration Added rule for port 80 (from outside world) to integration/default security group [11:01:51] Logged the message, Master [11:04:50] Ryan_Lane: so, eth1.103 exists; br103 wasn't. the VM starts up, br103 gets created but eth1.103 doesn't get added to the bridge [11:04:54] Krinkle-away: all rules are ingress [11:05:02] paravoid: heh. lame [11:05:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002ee CRITICAL - Host Unreachable (i-000002ee) [11:05:16] so... [11:05:38] if you create the bridge and add the device, nova doesn't object [11:05:57] it's perfectly happy with things pre-configured [11:06:03] and you prefer it that way too, right? :) [11:06:17] nova-compute.log shows addbr, ip link set etc. but not addif [11:06:21] Ryan_Lane: btw, I noticed that the Nova_Resource properties are often outdated (especially "Instance State" and "Public IP" properties) - which is odd since the edit page for that ("Configure") has no input fields for those. They just update when doing a null-edit to Configure. [11:06:26] like https://labsconsole.wikimedia.org/w/index.php?title=Nova_Resource%3AI-000001d7&diff=4513&oldid=2869 [11:06:31] well, I prefer things to be consistent more than my own way :) [11:06:31] I did a null-edit on the Configure page [11:06:51] and I'm fine leaving things as they are if they work too :) [11:06:54] paravoid: agreed. this is a regression in nova [11:06:58] Great http://integration.wmflabs.org/ "It works!" [11:07:01] because in cactus this worked properly [11:07:17] Krinkle-away: yeah. known bug [11:07:19] how this works in virt1-5? [11:07:22] k, np [11:07:28] paravoid: who says it does? :) [11:07:29] Thanks mutante and Ryan_Lane - stuff is working now :) [11:07:35] I have steps for rebooting hosts [11:07:39] oh? 
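Tying together the security-group exchange above, a sketch of the checks once the port-80 rule (source 0.0.0.0/0) was added to integration/default; the hostname and floating IP are the ones from this log.

    # inside the instance: is Apache itself answering?
    curl -sI http://localhost/

    # from outside labs: does the new ingress rule let port 80 through?
    curl -sI http://integration.wmflabs.org/      # or http://208.80.153.222/

    # note: a source CIDR of 10.4.0.0/24 would have limited the rule to
    # labs-internal traffic; 0.0.0.0/0 opens it to the outside world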
[11:07:53] http://wikitech.wikimedia.org/view/OpenStack#Rebooting_hosts [11:08:03] it's done so rarely that I have been too lazy to fix it [11:08:33] RECOVERY host: deployment-squid is UP address: i-000000dc PING OK - Packet loss = 0%, RTA = 0.44 ms [11:08:53] I planned on doing so with the new hosts [11:08:59] !log integration Installing bunch of stuff on integratin-apache1 - experimentation right now, documenting steps on [[Nova_Resource:Integration/Setup]] (subject to retroactively change at any time) [11:09:00] Logged the message, Master [11:09:13] so that I could reboot without consequences [11:10:15] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.009 second response time [11:10:45] RECOVERY Free ram is now: OK on deployment-squid i-000000dc output: OK: 89% free memory [11:13:15] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [11:19:15] cloud-init running: Wed, 27 Jun 2012 11:16:46 +0000. up 63.01 seconds [11:19:16] waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id 11:16:46 [ 1/100]: url error [[Errno 101] Network is unreachable] [11:19:19] ?!? [11:22:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [11:23:11] paravoid: soo…. [11:23:28] I'm tcpdumping on the host [11:23:31] paravoid: if eth1.103 is being used, doesn't that mean that eth1's link needs to be up? [11:23:35] it does DHCP requests but noone replies [11:23:49] it is [11:24:04] 3: eth1: mtu 1500 qdisc mq state DOWN qlen 1000 [11:24:14] no-carrier? [11:24:15] oops [11:24:18] heh [11:24:28] # mii-tool eth1 [11:24:28] eth1: no link [11:24:34] noone cabled those? :( [11:24:36] well, that's surely a problem [11:25:20] same with virt7/8 [11:25:40] didn't even think of looking for a link, I took that for granted :/ [11:27:02] * Ryan_Lane sighs [11:27:14] I'll open a ticket for Chris to look at [11:27:31] anything more specific than "do the same as virt1-5"? [11:29:25] PROBLEM Current Users is now: CRITICAL on bastion-restricted1 i-0000019b output: USERS CRITICAL - 14 users currently logged in [11:33:03] hmm, 14 may sound a lot for -restricted, but its just users having multiple connections [11:35:45] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [11:35:55] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:52:55] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [11:53:25] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 5.98, 4.79, 4.95 [11:55:55] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 0.37, 0.27, 0.24 [12:01:25] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 6.79, 6.40, 5.66 [12:05:45] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:14:28] so hmm puppetmaster::self does not work indeed :-D [12:17:06] paravoid: are you around? I have applied puppetmaster::self and got an error Failed to parse template puppet/puppet.conf.d/20-master.conf.erb: undefined method `sort' for :undefined:Symbol at /etc/puppet/manifests/puppetmaster.pp:76 [12:17:16] it looks like puppetmaster::config is not defined [12:17:35] hmm? 
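Circling back to the virt6 bridging issue from earlier in this hour: a sketch of the manual pre-configuration paravoid describes (nova doesn't object if the bridge is already there), plus the link check that exposed the real problem; the interface and bridge names are the ones from that conversation.

    # pre-create the bridge and enslave the VLAN interface so nova finds it ready
    brctl addbr br103
    brctl addif br103 eth1.103
    ip link set eth1.103 up
    ip link set br103 up

    # none of that helps if the NIC has no carrier, which is what it came down to
    mii-tool eth1        # "eth1: no link" => not cabled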
[12:18:22] my instance is deployment-cache-bits [12:18:23] 20-master shouldn't be used [12:18:29] I know krinkle ad the same issue this morning [12:21:56] merge between puppetmaster::self and a3ee38be broke it [12:21:57] :( [12:22:37] heh [12:23:00] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [12:23:02] no one seemed to check anything after my merge ;) [12:24:19] gaaaah puppetmaster.pp [12:28:30] PROBLEM HTTP is now: CRITICAL on integration-apache1 i-000002eb output: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 3498 bytes in 4.085 second response time [12:35:48] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [12:40:10] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [12:40:35] hashar: http://integration.wmflabs.org/ wee [12:42:18] !!!!!!! [12:42:24] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Integration/Setup [12:43:20] RECOVERY HTTP is now: OK on integration-apache1 i-000002eb output: HTTP OK: HTTP/1.1 200 OK - 14800 bytes in 0.124 second response time [12:47:11] PROBLEM host: php5builds is DOWN address: i-00000192 check_ping: Invalid hostname/address - i-00000192 [12:56:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [12:57:27] Krinkle-away: you might want to use a subdirectory to let room for jenkins [12:57:55] hashar: Yeah, will do [12:58:03] hashar: Do you know what's up with curl cli ? [12:58:11] Looks like it times out on a simple request [12:58:13] what do you ask? [12:58:15] from the browser it woks fine [12:58:15] oh [12:58:24] e.g. curl -v http://integration.wmflabs.org/api.php?action=cleanup [12:58:47] ahh it uses the public IP [12:58:59] I don't think NAT would let you do that [12:59:29] just setup an entry in /etc/hosts : 127.0.0.1 integration.wmflabs.org [13:00:14] how about using localhost? [13:00:24] or just localhost yes [13:00:31] hm.. might not work due to virthualhost config [13:00:42] you could pass the Host as a curl header [13:01:07] curl --proxy localhost:80 [13:01:24] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [13:01:37] curl --proxy localhost:80 -v http://integration.wmflabs.org/api.php?action=cleanup [13:01:45] interesting [13:01:50] so no /etc/hosts file needed? [13:01:52] nop [13:01:57] curl contact the proxy [13:02:06] then send the http:// request crafted from the URL [13:02:39] since localhost (the proxy) has a virtual host for integration.wmflabs.org, it happily honor Host: integration.wmflabs.org [13:02:51] I think the curl --proxy trick comes from domas [13:03:25] could do the same with: curl --header 'Host: integration.wmflabs.org' http://localhost/api.php?action=cleanup [13:05:54] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:06:29] < mutante> something like .. hmm .. 
curl -I -H 'Host:wikidata-dev-repo.wikimedia.de' --url "http://wikidata-dev-repo.wikimedia.de" localhost [13:11:24] RECOVERY Current Users is now: OK on bastion1 i-000000ba output: USERS OK - 5 users currently logged in [13:16:24] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 2.22, 3.06, 4.53 [13:18:49] hashar: moved into subdir; updated /Setup [13:19:04] mod rewrite was disabled by default [13:19:48] had to run a2enmod rewrite ; working now http://integration.wmflabs.org/testswarm/info [13:25:24] RECOVERY Disk Space is now: OK on ipv6test1 i-00000282 output: DISK OK [13:27:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [13:33:32] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 71 MB (5% inode=57%): [13:34:42] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:36:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [13:36:42] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:22] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 8.32, 7.69, 6.18 [13:46:26] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [13:50:11] PROBLEM dpkg-check is now: CRITICAL on deployment-cache-bits i-00000264 output: DPKG CRITICAL dpkg reports broken packages [13:56:02] !log integration Running `puppetd -tv` on integration-apache1. puppetmaster::self has been fixed by ops [13:56:03] Logged the message, Master [13:56:26] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory [13:57:46] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:01:26] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory [14:07:08] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:09:18] PROBLEM Current Users is now: WARNING on bastion1 i-000000ba output: USERS WARNING - 6 users currently logged in [14:18:49] Is it just me or is gerrit really slow? Like the header loads then...wait...page loads [14:19:51] <^demon> It's a tad slow, but works. [14:20:04] Damianz: we had recurring slowness for the past days [14:20:25] Gerrit makes me sad [14:21:52] Someone needs to make an awesome git command wrapper thing to pull comments in and vote on things... like hub but for gerrit [14:22:25] <^demon> There's command line interfaces for the review functions [14:22:34] <^demon> Any user can use them. [14:22:49] Hmmm [14:22:52] * Damianz googles [14:22:56] * Damianz hopes it's not in java [14:22:56] <^demon> Why google? [14:23:08] <^demon> There's a nice big "Documentation" link at the top of gerrit [14:23:12] <^demon> People should click that more often. 
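Summarizing the two equivalent curl workarounds hashar and Krinkle settle on above, neither of which needs an /etc/hosts edit; the URL is the one being tested in this log.

    # point curl at the local Apache as an HTTP proxy: the request line still
    # carries the full URL, so the matching virtual host answers
    curl --proxy localhost:80 -v "http://integration.wmflabs.org/api.php?action=cleanup"

    # same effect, done with an explicit Host header against localhost
    curl --header 'Host: integration.wmflabs.org' "http://localhost/api.php?action=cleanup"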
[14:23:22] I would if the page loaded :P [14:23:27] <^demon> https://gerrit.wikimedia.org/r/Documentation/index.html [14:23:50] <^demon> Specifically, you're looking for https://gerrit.wikimedia.org/r/Documentation/cmd-index.html#_a_id_user_commands_a_user_commands [14:24:46] Hmm interesting [14:24:49] <^demon> Basic syntax is `ssh -p 29418 gerrit.wikimedia.org gerrit [command]` [14:25:04] SSH could be kinda neat as it doesn't require any extra tokens but also sluggish maybe [14:25:07] * Damianz will play later [14:25:30] make that a function in your bash [14:26:25] Damianz: https://github.com/hashar/alix/blob/master/shell_functions [14:27:05] :) [14:27:28] example: $ gerrit query is:open owner:hashar [14:27:30] Might actually write some git commands so I can use it directly but that's a nice starting point [14:28:08] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:28:10] gerrit has options format either text or json. There is probably a way to make a list having one line per change [14:28:34] I am interested in any change you could make :-] [14:29:01] I tried grabbing json from the web api once, got into a load of jetty errors and then gave up and just copied the list manually. [14:29:01] ok. going to upgrade gluster [14:29:26] gerrit query is:open owner:hashar --format=json ;-] [14:29:47] there is a --format=text (default), probably over formats too but it is hard to tell since there is no online help for format [14:30:08] RECOVERY dpkg-check is now: OK on deployment-cache-bits i-00000264 output: All packages OK [14:32:04] [--format {TEXT | JSON}] [14:33:48] Json sorta looks useable, wonder if the id related to anything referencable from knowing a repo and sha1 [14:37:10] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [14:53:59] !log integration integration-apache1 now uses puppetmaster:self (see /var/lib/git/ ) [14:54:00] Logged the message, Master [14:55:34] !log integration created psm-lucid instance to test out bootstrapping of puppetmaster::self from scratch [14:55:35] Logged the message, Master [14:55:49] paravoid: ^^ going to try the installation procedure on a fresh instance :-] [14:58:14] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [14:58:59] hashar: sec [15:04:25] PROBLEM Current Load is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:05:05] PROBLEM Current Users is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:05:42] hashar: try it now [15:05:45] PROBLEM Disk Space is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:06:15] PROBLEM Free ram is now: CRITICAL on psm-lucid i-000002f1 output: Connection refused by host [15:07:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:08:14] * jeremyb wonders if there's a way to know which host your instance is on? [15:08:52] jeremyb: physical host you mean ? [15:08:55] yes [15:09:02] did you see labs-l about the SNAT? 
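A minimal sketch of the wrapper idea above, along the lines of hashar's shell_functions; the function name is just illustrative, and it assumes the SSH key for your gerrit account is loaded.

    # drop into ~/.bashrc or similar
    gerrit() {
        ssh -p 29418 gerrit.wikimedia.org gerrit "$@"
    }

    # the queries from the discussion above
    gerrit query is:open owner:hashar
    gerrit query is:open owner:hashar --format=JSON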
[15:09:07] paravoid: I am running puppet right now then will add the puppetmaster::self class to it [15:09:15] nop [15:09:25] RECOVERY Current Load is now: OK on psm-lucid i-000002f1 output: OK - load average: 1.39, 1.58, 1.04 [15:10:14] RECOVERY Current Users is now: OK on psm-lucid i-000002f1 output: USERS OK - 1 users currently logged in [15:10:44] RECOVERY Disk Space is now: OK on psm-lucid i-000002f1 output: DISK OK [15:11:14] PROBLEM Free ram is now: UNKNOWN on psm-lucid i-000002f1 output: NRPE: Unable to read output [15:11:35] jeremyb: well, there's *kind of* a way to know which host you're on [15:11:41] but it's not terribly accurate right now [15:11:49] the instances pages list the host [15:12:07] but if we live-migrate an instance, it's wrong [15:12:16] righhhhht [15:12:28] when we move to essex, andrewbogott wrote a plugin for openstack to update mediawiki [15:12:41] so, whenever something changes, the pages will be correct [15:12:49] ahh, rings a bell but i didn't know it needed a certain version [15:13:00] no plugins till essex [15:13:02] is that a SMW specific plugin? [15:13:15] it's a plugin for openstack [15:13:17] what are we on now? [15:13:17] not for mediawiki [15:13:44] and openstack didn't have a plugin framework till andrew wrote it for the essex release [15:13:52] we're on diablo [15:13:54] essex is out [15:14:04] hahah, we had to write it?! [15:14:04] we'll be upgrading at some point [15:14:22] well, openstack is still fairly young ;) [15:14:57] we wrote a number of things for openstack [15:15:04] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [15:15:09] it's nice that we can, and it gets accepted easily enough [15:15:19] yeah, sure [15:15:30] my contributions were mostly patches. I rewrote the LDAP support in nova, though. [15:15:44] i'm not seeing where it says the host name [15:16:00] instance host [15:17:12] paravoid: so puppetmaster::self runs with out error on a fresh instance. congrats! [15:17:30] heh [15:18:46] !log integration deleting psm-lucid instance. puppetmaster::self does run without on a fresh instance! [15:18:46] Logged the message, Master [15:19:16] without on ? [15:19:54] RECOVERY dpkg-check is now: OK on mobile-testing i-00000271 output: All packages OK [15:19:56] so.... what's a normal amount of time to expect a fresh host just created with no extra puppet classes to take to reach an idle, stable, final state? [15:19:57] yeah that s my english [15:20:00] I tend to skip words [15:20:03] hah [15:20:06] jeremyb: a few minutes [15:20:18] seemed to take forever for both precise and oneiric last night [15:20:27] jeremyb: "without error" [15:20:28] are they not fully supported yet? (vs. lucid) [15:20:37] hashar: ahh [15:20:41] lets try on Precise [15:20:52] * hashar loves lab [15:21:19] !log integration created psm-precise to test puppetmaster::self on a Precise box [15:21:20] Logged the message, Master [15:22:29] Ryan_Lane: i see the host now. i was looking at the instance list not an individual instance page. (i left out a word when reading what you wrote! compensating for hashar? ;P) [15:22:51] ah [15:22:53] heh [15:23:09] any idea about initial boot time? 
[15:24:22] would be nice to be able to tell labsconsole to use a different initial puppetmaster (a ::self one from the same project maybe) and to get serial access over SSH (not just rarely updated read only access over the web) [15:24:37] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:02] jeremyb: Which would mostly be useless without the root pass (which no one knows)? [15:25:06] also, ryan, you have a few patchsets waiting for review in case you didn't notice [15:25:42] * Ryan_Lane nods [15:25:47] PROBLEM SSH is now: CRITICAL on mobile-testing i-00000271 output: CRITICAL - Socket timeout after 10 seconds [15:25:57] I'll review them when I get back to the states [15:26:15] * jeremyb hasn't a clue when that trip is ;P [15:26:23] friday [15:26:31] ahh, k [15:26:36] I don't want to make many changes before then [15:26:53] sure. you could +1 and then +2 later. but whatever ;) [15:27:31] Damianz: err... so, also give them a custom root pass? we could allow custom root pass on anything created for <12 hr lifespan (just picked the number out of thin air... can be changed) and require global root for longer lived nodes? [15:28:17] anyway, even just to view the console but not login it would be useful [15:28:17] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [15:29:27] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 4.39, 4.62, 4.96 [15:34:35] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 84% free memory [15:35:45] PROBLEM Free ram is now: UNKNOWN on psm-precise i-000002f2 output: NRPE: Unable to read output [15:36:46] so Precise just returns me : [15:36:47] err: /Stage[main]/Certificates::Wmf_ca/Exec[/bin/ln -s /etc/ssl/certs/wmf-ca.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem).0]/returns: change from notrun to 0 failed: /bin/ln -s /etc/ssl/certs/wmf-ca.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:198 [15:36:47] err: /Stage[main]/Certificates::Wmf_labs_ca/Exec[/bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-labs.pem).0]/returns: change from notrun to 0 failed: /bin/ln -s /etc/ssl/certs/wmf-labs.pem /etc/ssl/certs/$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-labs.pem).0 returned 1 instead of one of [0] at /etc/puppet/manifests/certs.pp:219 [15:36:58] I guess that is not much of an issue anyway [15:37:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [15:37:23] !log integration applying puppetmaster::self to psm-precise [15:37:24] Logged the message, Master [15:40:35] well, it needs to be fixed [15:40:48] did not happen on lucid [15:41:04] yeah [15:41:57] ohh remembers me of /etc/sudoers [15:43:32] paravoid: so puppetmaster::self run fine on Precise :-] [15:44:38] Ryan_Lane: I have a conflict for /etc/sudoers trying to get installed by two puppet class. 
Any idea who could review / merge that beside you (since you are leaving soon) [15:44:40] https://gerrit.wikimedia.org/r/#/c/12178/ [15:46:07] I gave it a +1 [15:46:29] since it can break production, I guess you want to schedule that change for a later deployement so [15:47:00] i'd prefer not to do it right now [15:47:03] someone else can [15:47:11] fine thanks [15:47:11] any ops person [15:47:23] will poke Faidon or Daniel tomorrow ;) [15:55:14] I can even test my change using puppetmaster::self [16:02:35] PROBLEM dpkg-check is now: CRITICAL on psm-precise i-000002f2 output: DPKG CRITICAL dpkg reports broken packages [16:02:45] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [16:03:45] PROBLEM HTTP is now: CRITICAL on psm-precise i-000002f2 output: Connection refused [16:07:15] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:08:45] RECOVERY HTTP is now: OK on psm-precise i-000002f2 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.003 second response time [16:12:35] RECOVERY dpkg-check is now: OK on psm-precise i-000002f2 output: All packages OK [16:16:45] PROBLEM HTTP is now: CRITICAL on psm-precise i-000002f2 output: Connection refused [16:33:04] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [16:36:54] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 133 processes [16:37:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [16:44:11] everything looks broken [16:45:12] <^demon> Everything? We'd better get to work. [16:58:15] It seems back now [17:03:04] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [17:03:55] bad channel to report outages, unless it's a labs outage ;) [17:04:27] We have outages? I'd never noticed [17:04:27] :P [17:05:28] Ryan_Lane: Did you piss off china last time you where there? :P [17:07:24] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:08:33] the sever of WikiLabs isn't down, but seems all other servers of WMF is down [17:09:15] Labs isn't affected, production apache boxes are [17:29:40] http://mobile-geo.wmflabs.org/ looks unreachable [17:33:05] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [17:35:48] gotta love bot announcements like that... [17:37:35] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [17:40:29] <^demon> Ryan_Lane: I want this shirt http://www.zazzle.com/2_lgtm_tee_shirt-235862924096653439 :) [17:40:50] <^demon> Or http://www.zazzle.com/i_would_prefer_t_shirts-235689879400316433 [17:41:25] heh [17:41:55] it doesn't have a -2 [17:46:38] lol, those are cool [17:47:26] <^demon> Ryan_Lane: Is downtime over enough that I can restart gerrit? [17:47:33] yes [17:50:51] <^demon> Argh, it won't let me. Must've not gotten my sudo setup right. [17:52:02] Damn, let me edit the config file. [18:02:58] <^demon> Ryan_Lane: Mind giving it a kick for me? [18:03:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [18:08:11] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:12:42] ^demon: kick it in what way? [18:12:44] I ran puppt [18:12:46] *puppet [18:12:55] <^demon> Was just looking for a restart. 
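Circling back to the certificates errors hashar pasted at 15:36: a hedged guess is that the hashed symlink already exists on the Precise image (for example from an earlier run or from update-ca-certificates), so the bare ln -s exits 1; a guarded variant of the same command would be idempotent. This is speculation from the error text alone, not a confirmed fix.

    # hypothetical guarded form of the Exec from certs.pp
    hash=$(/usr/bin/openssl x509 -hash -noout -in /etc/ssl/certs/wmf-ca.pem)
    [ -e "/etc/ssl/certs/${hash}.0" ] || \
        /bin/ln -s /etc/ssl/certs/wmf-ca.pem "/etc/ssl/certs/${hash}.0"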
[18:12:56] and nothing happened [18:13:00] why a restart? [18:13:14] <^demon> That was #gerrit's bright idea to fix our commits stuck in "Submitted, Merge Pending" [18:13:18] ugh [18:13:34] well, I can't do it now [18:13:38] they're about to start a deploy [18:13:57] <^demon> We'll live. [18:13:59] <^demon> Well, did the puppet run force a restart? [18:15:39] no [18:15:48] because no changes were done to the configs [18:15:54] <^demon> *nod* Ok. [18:16:02] <^demon> I'll just wait. [18:16:10] <^demon> "Restart gerrit" was kind of a silly solution anyway. [18:33:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [18:38:14] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [18:39:34] 06/27/2012 - 18:39:33 - User orion may have been modified in LDAP or locally, updating key in project(s): bastion,swift [18:39:41] 06/27/2012 - 18:39:41 - Updating keys for orion at /export/keys/orion [19:03:11] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [19:08:41] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:09:29] Ryan_Lane: apparently one of the ganeti folks has a presentation at Google I/O [19:09:32] about ganeti among other things [19:09:36] oh/ [19:09:36] https://developers.google.com/events/io/sessions/gooio2012/1203/ [19:09:38] cool [19:09:39] not sure if you care [19:09:49] nah, I'm interested [19:09:57] good to see what's in other systems [19:10:00] they do use it for something similar to labs otoh [19:10:08] internally [19:10:39] also of interest to me/you: https://developers.google.com/events/io/sessions/gooio2012/1201/ [19:10:50] ganeti is pretty cool for what it does [19:10:58] "SPDY: It's here!" [19:11:01] yeah, I saw a SPDY talk at velocity [19:11:07] was a really interesting talk [19:11:14] ah there was streaming? [19:11:16] damn [19:11:25] no [19:11:27] last year [19:11:34] oh [19:11:38] I'm going to miss this SPDY talk [19:11:42] I'll be on a plane [19:11:48] heh :) [19:12:12] I really hope I'm not sick for my flight [19:12:29] Hope you're not sick at all [19:12:49] I've been sick the past two days [19:12:57] I wonder if redhat con is being video'd... love to catch the selinux talk [19:13:03] Ryan_Lane: :( Get better soon [19:13:06] ty [19:22:10] why http://mobile-geo.wmflabs.org/ is unaccessible? [19:33:15] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [19:35:31] Probably because the host is not responding to anything on port 80. [19:36:32] Ryan_Lane: Labs console seems down [19:36:38] Still playing with the networking stuff? [19:36:47] wfm [19:36:48] Or paravoid [19:37:04] <^demon> Ryan_Lane: lack of a real git library for php is pissing me off :( [19:37:06] oh derp [19:37:15] Can we make wmflabs.org point there for idiots like me? [19:37:26] i'd prefer we don't [19:37:32] maybe a landing page would be ok [19:37:43] but I don't want a redirect from wmflabs to labsconsole [19:37:53] <^demon> A landing page saying "Welcome to labs, you probably want labsconsole" would be easy :) [19:38:00] Be nice if we could get labsconsole in a good place in google... ruddy old wikis [19:38:14] !nagios [19:38:14] http://208.80.153.210/nagios3 http://nagios.wmflabs.org/nagios3 [19:38:15] MaxSem: A couple of days ago we had a similar problem with a different host, and clicking the 'Reassociate IP' button resolved some things. 
If you are netadmin for that project, it's easy: https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaAddress&showmsg=setfilter [19:38:20] of course, that said, wmflabs.org doesn't have an A record pointing anywhere ;) [19:39:02] I was gonna look at it but apaprently that host isn't in the project list I thought it was so I can't eh, back to steak [19:39:05] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [19:46:52] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 77.46, 42.34, 17.40 [19:50:41] andrewbogott, thanks, but it didn't help [19:51:11] MaxSem: ok, most likely the problem is internal to that instance then. [19:51:38] meh, nobpdy admits to doing anything:) [19:51:52] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.93, 16.19, 12.87 [20:03:22] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [20:09:32] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:11:52] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.69, 0.75, 3.83 [20:33:22] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [20:39:33] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [20:51:35] 06/27/2012 - 20:51:35 - User owen may have been modified in LDAP or locally, updating key in project(s): bastion [20:51:43] 06/27/2012 - 20:51:43 - Updating keys for owen at /export/keys/owen [21:03:23] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [21:08:34] PROBLEM Free ram is now: CRITICAL on configtest-main i-000002dd output: CHECK_NRPE: Socket timeout after 10 seconds. [21:08:39] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [21:09:43] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [21:13:37] PROBLEM Free ram is now: UNKNOWN on configtest-main i-000002dd output: NRPE: Unable to read output [21:13:37] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 13% free memory [21:33:34] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [21:39:44] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:03:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [22:11:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [22:16:14] paravoid: I'm working with alpha_ori on trying out https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs [22:16:32] okay [22:16:33] puppet's complaining that it won't install a package because the repo is unsigned. [22:16:44] do you know the right way to tell apt that it's all cool? [22:17:06] make a key, sign the packages with it, and use apt-key to trust it [22:18:02] Ryan_Lane: in the context of labs (and a cluster of machines using a local file:/// repo), will that work? 
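A sketch of the "make a key ... and use apt-key to trust it" step Ryan describes above, assuming the key lives on the box that builds the local repo; the key name is made up for illustration.

    # on the repo host: generate a signing key (interactive; skip the passphrase
    # if unattended signing is needed)
    gpg --gen-key

    # on each client: trust the public half
    gpg --armor --export "Labs local repo" | sudo apt-key add -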
[22:18:15] also, I want it to work for packages that I don't build (and so can't sign) [22:18:25] you don't sign packages, you sign repositories [22:18:32] I'm not sure if signatures work with file:/// though [22:18:49] you can disable signature verification but you can't do it on a per project basis [22:19:00] even with selfhostedpuppet? [22:19:53] hmm... [22:20:24] paravoid: on what basis can you do it? (host or global?) [22:20:34] global [22:20:43] you disable apt's signature verification completely [22:20:47] which is bad obviously [22:21:37] also, I'm confused by the combination of your and ryan's statements. "use apt-key to sign the package" and "you don't sign packages you sign repositories." [22:21:49] he's right [22:21:55] it's the repo, not the packages [22:22:18] the description in the man page for apt-key says "Packages which have been authenticated using these keys will be [22:22:18] considered trusted." [22:22:37] is it trying to say "packages from repositories which have been authenticated..."? [22:22:42] maplebed: man apt-secure [22:22:45] Yeah, but AFAIK you sign the repo and apt-key add the key. [22:22:58] maplebed: yes, that's what it's trying to say :) [22:23:15] hrmph. I should file a bug. that's confusing. [22:23:29] well, it's not wrong per se [22:23:43] it says "authenticated" not "signed" [22:24:29] Though I think you can sign packages so you can distribute trusted packages without a repo. [22:24:43] there was debsigs but it never took off [22:24:53] it's not officially supported and can create problems in various scenarios [22:25:06] so in the context of https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs, [22:25:14] apt doesn't check it so it will still warn; dpkg doesn't care about signatures either way [22:25:20] paravoid: Ah. [22:25:30] it sounds like I need to create a Releases file, [22:25:47] and ... [22:26:02] I'm not sure what the right way is to get the key into a place where puppet will like it and it's still labs-only [22:26:04] you can pipe the output of gpg --recv-keys right into apt-key [22:26:08] yes, a Release file [22:26:38] grumble. [22:26:48] also http://wiki.debian.org/SecureApt [22:26:51] i've been thinking if we want a keyserver [22:27:01] what I want is to allow unsigned file:/// repos. [22:27:04] :P [22:27:15] Unsigned files is sorta bad [22:28:04] well, he's right though; asking signatures for file:/// it's a tad too much [22:28:05] within labs, the signature doesn't actually do anything useful. [22:29:11] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=596498 [22:30:40] wow, didn't know that [22:30:46] so, >= precise :-) [22:33:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [22:36:16] paravoid: tomorrow I'll try and follow http://wiki.debian.org/SecureApt and see if it'll work, [22:36:43] but if you have any advice (or help) to make https://labsconsole.wikimedia.org/wiki/Help:Using_debs_in_labs have a set of instructions that'll work, I'd love it. [22:36:58] maplebed: for precise you can you [trusted=yes] [22:37:12] right, but I'm trying to do this for the current swift clutser, which is not on precise. [22:37:26] and I don't watn to put upgrading to precise in between here and the newer verson of swift. [22:37:59] yes, make sense :) [22:38:19] I just watn to use labs to test upgrading software! 
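Following the SecureApt outline linked above, a sketch of what the file:/// repo would need so apt (and therefore puppet) stops warning; the repo path is a placeholder, and the [trusted=yes] shortcut only applies on precise or newer, as noted.

    cd /srv/localrepo                                # hypothetical repo location
    apt-ftparchive packages . > Packages
    gzip -c Packages > Packages.gz
    apt-ftparchive release . > Release
    gpg --armor --detach-sign --output Release.gpg Release

    # clients then trust the key via apt-key (see the earlier sketch); on precise
    # and newer the whole dance can be skipped with a sources.list entry like:
    #   deb [trusted=yes] file:///srv/localrepo ./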
[22:41:06] PROBLEM Puppet freshness is now: CRITICAL on wikistats-01 i-00000042 output: Puppet has not run in last 20 hours [22:41:46] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:03:36] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [23:07:57] who should i ask about having a new project created for an extension i'd like to get onto labs? [23:11:10] emw: it's a cool project [23:11:31] GChriss: thanks! [23:11:47] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:11:48] lmk if you find out -- I have an outstanding project request listed on https://labsconsole.wikimedia.org/wiki/Requests [23:17:07] PROBLEM Total Processes is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:17:32] PROBLEM dpkg-check is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:18:52] PROBLEM Current Load is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:19:09] Emw: also, consider displaying HTML5 video "fallback" in lieu of a static image [23:19:11] http://thekyle.tk/post/6039657612/look-what-i-can-do-with-html5-javascript-and [23:19:32] PROBLEM Current Users is now: CRITICAL on testforx i-000002f3 output: Connection refused by host [23:20:01] to at least display full rotation, even if non-interactive [23:20:42] PROBLEM Free ram is now: UNKNOWN on testforx i-000002f3 output: NRPE: Unable to read output [23:20:53] (and for everyone just joining: re: "[Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models" @ http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html ) [23:22:02] RECOVERY Total Processes is now: OK on testforx i-000002f3 output: PROCS OK: 78 processes [23:22:32] RECOVERY dpkg-check is now: OK on testforx i-000002f3 output: All packages OK [23:23:29] GChriss: interesting. for a closer non-webgl fallback, i think i'd try projecting 3d onto 2d by using the non-webgl html5 canvas (or microsoft's VML, which would support interactive models in ie8) [23:23:52] RECOVERY Current Load is now: OK on testforx i-000002f3 output: OK - load average: 0.05, 0.76, 0.60 [23:24:32] RECOVERY Current Users is now: OK on testforx i-000002f3 output: USERS OK - 0 users currently logged in [23:25:13] pre-rendered video prob. isn't the best option, but something to keep in mind [23:27:19] indeed. i think this extension that deals specifically with one niche file type might be a good starting point for enabling interactive 3d models in general on wikipedia [23:30:41] any particular modeling in mind (besides molecules)? [23:31:36] buildings and planets come to mind initially [23:32:07] maps [23:33:05] stuff on http://sketchup.google.com/3dwarehouse/, though licensing might be an issue [23:33:40] PROBLEM host: incubator-apache is DOWN address: i-00000211 CRITICAL - Host Unreachable (i-00000211) [23:36:40] I could see traction w/ physics articles and makerbot 3d models [23:38:22] the top answer here indicates that there's fairly high-quality free and open source software that converts photographs to 3d models: http://superuser.com/questions/30053/is-there-any-free-open-source-software-that-converts-photos-to-3d-models [23:38:29] that'd be neat [23:39:58] Google Body (we just need to find an MRI and a wikivolunteer) [23:40:10] ha! 
true [23:40:34] ya, anatomy would be a good use [23:41:01] i think there's a lot of pedagogical potential for incorporating 3d models in wikipedia's content [23:41:50] PROBLEM host: nginx-dev2 is DOWN address: i-000002f0 CRITICAL - Host Unreachable (i-000002f0) [23:41:51] agreed [23:42:21] also see http://cinema.elphel.com/en/stereo3d