[00:05:22] PROBLEM Free ram is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:06:12] PROBLEM SSH is now: CRITICAL on deployment-web deployment-web output: CRITICAL - Socket timeout after 10 seconds
[00:06:22] PROBLEM Current Load is now: CRITICAL on deployment-web deployment-web output: CRITICAL - load average: 70.74, 38.41, 15.97
[00:08:12] PROBLEM dpkg-check is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:08:12] PROBLEM Disk Space is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:09:32] PROBLEM HTTP is now: CRITICAL on deployment-web deployment-web output: CRITICAL - Socket timeout after 10 seconds
[00:11:02] RECOVERY SSH is now: OK on deployment-web deployment-web output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[00:13:01] * Damianz watches the server implode
[00:13:02] RECOVERY Disk Space is now: OK on deployment-web deployment-web output: DISK OK
[00:13:02] RECOVERY dpkg-check is now: OK on deployment-web deployment-web output: All packages OK
[00:15:12] PROBLEM Free ram is now: WARNING on deployment-web deployment-web output: Warning: 6% free memory
[00:18:41] yay
[00:19:12] hexmode: you made a huge traffic there
[00:19:20] people read blog :D
[00:20:22] PROBLEM Free ram is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:45] * Damianz sends petan to find a haproxy box and a few replicated webservers
[00:23:32] PROBLEM Total Processes is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:22] PROBLEM Current Users is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:52] PROBLEM SSH is now: CRITICAL on deployment-web deployment-web output: CRITICAL - Socket timeout after 10 seconds
[00:26:12] PROBLEM Disk Space is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:12] PROBLEM dpkg-check is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:50] hrm... why it no work?
[01:08:15] Monkeys
[01:09:38] Apparently the server died, I blame the guy who blogged about it ;P
[01:10:23] * hexmode hangs his head in shame
[01:10:41] where is Ryan_Lane when you need a savior?
[01:11:09] * Damianz points at Ryan_Lane
[01:11:12] in which way did it die?
[01:11:21] OOM?
[01:11:24] nagios alerted for everything and now is timed out
[01:11:27] Ryan_Lane: I can't ssh into it
[01:11:33] Go find someone with console access to it :P
[01:11:33] check the console log
[01:11:41] and if it's OOM'd, then reboot it
[01:11:52] there's no console access to instances right now
[01:13:10] rebooting
[01:13:28] !log deployment-prep oom reboot -web
[01:13:29] Logged the message, Master
[01:14:18] True but I can't see the console log either :P As I'm only in the bots group.
[01:14:28] yep
[01:14:30] * Damianz thinks about going doing some work
[01:14:36] have to be a sysadmin in the project to see the consol
[01:14:40] or maybe just a member
[01:14:43] likely just a member
[01:15:21] I can see the console logs for bots stuff but can't rebooted them or anything as that requires the sysadmin role.
[01:15:52] k, it is back now
[01:16:18] you guys probably need more -web boxes
[01:16:22] and more squids
[01:16:36] would likely help if I finished sanitizing the squid config
[01:16:45] I'll ask Reedy tomorrow
[01:16:48] or email him now
[01:16:48] Can not haz web
[01:16:58] the web stuff needs to be puppetized
[01:17:02] RECOVERY Current Load is now: OK on deployment-web deployment-web output: OK - load average: 0.17, 0.20, 0.09
[01:17:05] So does the bot stuff :P
[01:17:11] it's very likely 90%+ puppetized
[01:18:12] got fp, but now it is spining for http://labs.wikimedia.beta.wmflabs.org/wiki/Problem_reports
[01:18:19] spinning even
[01:18:19] For the stuff can't it just use the production puppet configs with the ip stuff scrapped out?
[01:18:22] RECOVERY Total Processes is now: OK on deployment-web deployment-web output: PROCS OK: 108 processes
[01:18:26] works for me
[01:18:39] your browser may have a hung connection
[01:19:12] RECOVERY Current Users is now: OK on deployment-web deployment-web output: USERS OK - 1 users currently logged in
[01:19:22] RECOVERY HTTP is now: OK on deployment-web deployment-web output: HTTP OK: HTTP/1.1 302 Found - 553 bytes in 0.007 second response time
[01:19:51] yeah, I'll email reedy info and leave this for now
[01:20:01] * Ryan_Lane nods
[01:20:50] He's probably sleeping
[01:23:24] probably
[01:23:29] I would hope ;)
[06:54:05] PROBLEM Current Load is now: WARNING on incubator-bots2 incubator-bots2 output: WARNING - load average: 8.16, 7.66, 6.32
[06:58:45] PROBLEM Current Load is now: WARNING on incubator-nfs incubator-nfs output: WARNING - load average: 8.55, 9.17, 7.77
[06:59:55] PROBLEM Current Load is now: WARNING on bots-sql3 bots-sql3 output: WARNING - load average: 1.43, 6.18, 5.51
[07:04:55] RECOVERY Current Load is now: OK on bots-sql3 bots-sql3 output: OK - load average: 0.17, 2.49, 4.09
[07:13:45] RECOVERY Current Load is now: OK on incubator-nfs incubator-nfs output: OK - load average: 4.13, 3.94, 4.92
[07:24:05] RECOVERY Current Load is now: OK on incubator-bots2 incubator-bots2 output: OK - load average: 4.98, 4.65, 4.99
[07:32:25] PROBLEM Current Load is now: WARNING on incubator-bots2 incubator-bots2 output: WARNING - load average: 4.99, 4.99, 5.07
[10:00:14] PROBLEM Total Processes is now: CRITICAL on incubator-bots3 incubator-bots3 output: Connection refused by host
[10:01:34] PROBLEM dpkg-check is now: CRITICAL on incubator-bots3 incubator-bots3 output: CHECK_NRPE: Socket timeout after 10 seconds.
[10:02:24] PROBLEM Disk Space is now: CRITICAL on incubator-bots3 incubator-bots3 output: Connection refused or timed out
[10:09:24] PROBLEM host: incubator-bots3 is DOWN address: incubator-bots3 CRITICAL - Host Unreachable (incubator-bots3)
[10:20:53] PROBLEM Current Load is now: WARNING on incubator-bots2 incubator-bots2 output: WARNING - load average: 5.74, 5.30, 5.11
[11:35:53] RECOVERY Current Load is now: OK on incubator-bots2 incubator-bots2 output: OK - load average: 5.14, 4.58, 4.93
[11:44:13] PROBLEM Current Load is now: WARNING on incubator-bots2 incubator-bots2 output: WARNING - load average: 4.79, 5.13, 5.09
[13:53:33] petan, you there?
[13:53:55] or petan|wk, you there?
[13:55:44] nevermind
[14:33:53] !log incubator Merged webaccess group with default, avoiding the hassle of having to click an extra button
[14:33:55] Logged the message, Master
[15:19:08] Grrr why does friggin enwp.org/Template:WMFLabsBot rank higher in Google than anything labs related itself when searching for 'wmf labs'
[15:19:26] http://www.google.nl/search?q=wmf+labs
[15:19:26] http://www.google.com/search?q=wmf+labs
[15:28:20] 01/29/2012 - 15:28:20 - Creating a home directory for platonides at /export/home/incubator/platonides
[15:28:52] !log incubator Temporarily adding platonides into the project to debug issues
[15:28:54] Logged the message, Master
[15:29:20] 01/29/2012 - 15:29:20 - Updating keys for platonides
[15:37:11] !log incubator Removing platonides from the project and isolated the error
[15:37:12] Logged the message, Master
[16:38:45] PROBLEM Free ram is now: WARNING on prefixexport prefixexport output: Warning: 19% free memory
[17:18:45] RECOVERY Free ram is now: OK on prefixexport prefixexport output: OK: 74% free memory
[20:51:02] PROBLEM Current Load is now: CRITICAL on incubator-nfs incubator-nfs output: CHECK_NRPE: Socket timeout after 10 seconds.
[20:55:52] PROBLEM Current Load is now: WARNING on incubator-nfs incubator-nfs output: WARNING - load average: 5.62, 6.51, 5.72
[20:57:02] PROBLEM Current Load is now: WARNING on incubator-bots2 incubator-bots2 output: WARNING - load average: 5.86, 5.80, 5.32
[21:02:02] RECOVERY Current Load is now: OK on incubator-bots2 incubator-bots2 output: OK - load average: 2.42, 4.48, 4.95
[21:05:52] RECOVERY Current Load is now: OK on incubator-nfs incubator-nfs output: OK - load average: 4.83, 4.20, 4.80