[00:12:53] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 1.24, 7.97, 5.18
[00:18:02] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 1.66, 3.46, 3.94
[00:18:15] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[00:48:15] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[00:51:37] Ryan_Lane: So that labs database thing, is that something that I will have by lunchtime this Friday?
[00:51:48] (Cause that's when my Berlin tutorial rehearsal is)
[00:51:59] sure. let's set it up on thursday
[00:53:00] OK, cool
[00:59:09] RECOVERY Total Processes is now: OK on demo-deployment1 i-00000276 output: PROCS OK: 91 processes
[00:59:39] RECOVERY dpkg-check is now: OK on demo-deployment1 i-00000276 output: All packages OK
[01:00:31] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds
[01:01:01] RECOVERY Current Load is now: OK on demo-deployment1 i-00000276 output: OK - load average: 0.22, 0.46, 0.22
[01:01:25] RECOVERY Current Users is now: OK on demo-deployment1 i-00000276 output: USERS OK - 1 users currently logged in
[01:02:01] RECOVERY Disk Space is now: OK on demo-deployment1 i-00000276 output: DISK OK
[01:02:58] RECOVERY Free ram is now: OK on demo-deployment1 i-00000276 output: OK: 92% free memory
[01:03:10] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds
[01:03:30] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds
[01:04:05] PROBLEM HTTP is now: CRITICAL on demo-deployment1 i-00000276 output: CRITICAL - Socket timeout after 10 seconds
[01:05:30] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.011 second response time
[01:07:48] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.010 second response time
[01:08:18] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.012 second response time
[01:18:24] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[01:28:22] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[01:29:42] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: DISK CRITICAL - free space: / 29 MB (2% inode=43%):
[01:34:41] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 52 MB (3% inode=43%):
[01:48:24] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:18:26] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:29:51] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory
[02:42:20] 05/16/2012 - 02:42:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:43:26] Raelly need to revising my reading of laner as lamer :D
[02:48:39] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:51:09] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[02:51:53] Reedy: willing to do some review ? ;D
[02:52:06] Reedy: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/mediawiki-config,n,z
[02:53:15] Wouldn't it make more sense to make that config a template then use the puppet vars which are garunteed?
[02:53:17] 05/16/2012 - 02:53:17 - Updating keys for laner at /export/home/deployment-prep/laner
[02:54:29] well mediawiki-config is not deployed by puppet :-D
[02:54:44] yet
[02:55:17] eh... oh you're not doing puppet stuff, really should update my mind that svn is diying...slowly
[02:55:25] * Damianz goes away again
[02:55:32] RECOVERY dpkg-check is now: OK on gerrit-bots i-00000272 output: All packages OK
[02:55:32] RECOVERY Current Load is now: OK on gerrit-bots i-00000272 output: OK - load average: 1.49, 1.33, 0.72
[02:55:35] RECOVERY Current Users is now: OK on gerrit-bots i-00000272 output: USERS OK - 0 users currently logged in
[02:55:35] RECOVERY Current Users is now: OK on build-precise1 i-00000273 output: USERS OK - 0 users currently logged in
[02:56:25] RECOVERY Disk Space is now: OK on build-precise1 i-00000273 output: DISK OK
[02:56:25] RECOVERY Disk Space is now: OK on gerrit-bots i-00000272 output: DISK OK
[02:56:36] RECOVERY Free ram is now: OK on build-precise1 i-00000273 output: OK: 87% free memory
[02:57:58] RECOVERY Free ram is now: OK on gerrit-bots i-00000272 output: OK: 89% free memory
[02:58:57] RECOVERY Total Processes is now: OK on build-precise1 i-00000273 output: PROCS OK: 79 processes
[02:59:12] RECOVERY Total Processes is now: OK on gerrit-bots i-00000272 output: PROCS OK: 79 processes
[03:00:26] RECOVERY dpkg-check is now: OK on build-precise1 i-00000273 output: All packages OK
[03:00:26] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 0.01, 0.08, 0.07
[03:08:23] 05/16/2012 - 03:08:23 - Updating keys for laner at /export/home/deployment-prep/laner
[03:08:56] ^ 3 times
[03:11:49] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 15.32, 13.50, 7.20
[03:15:36] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[03:19:17] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[03:20:18] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:07] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 4.61, 7.26, 7.09
[03:25:42] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Current Load is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:47] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Users is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:57] PROBLEM dpkg-check is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:30:46] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[03:31:17] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:31:18] PROBLEM Disk Space is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:31:18] PROBLEM Total Processes is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:32:49] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 5.07, 5.68, 5.07
[03:32:49] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[03:32:49] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[03:32:49] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 78 processes
[03:32:54] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory
[03:33:10] errr... is this a bad time?
[03:35:14] RECOVERY Disk Space is now: OK on gluster-devstack i-00000274 output: DISK OK
[03:35:14] RECOVERY Total Processes is now: OK on gluster-devstack i-00000274 output: PROCS OK: 117 processes
[03:35:34] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:34] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:39] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:39] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:36:09] RECOVERY Current Load is now: OK on gluster-devstack i-00000274 output: OK - load average: 0.49, 4.10, 4.62
[03:36:09] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 94% free memory
[03:36:09] RECOVERY Free ram is now: OK on gluster-devstack i-00000274 output: OK: 88% free memory
[03:36:09] RECOVERY Current Users is now: OK on gluster-devstack i-00000274 output: USERS OK - 0 users currently logged in
[03:36:09] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 77 processes
[03:36:14] RECOVERY dpkg-check is now: OK on gluster-devstack i-00000274 output: All packages OK
[03:36:24] wonderfuckingful...
[03:36:34] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in
[03:36:35] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 4.36, 5.88, 5.29
[03:36:35] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK
[03:36:35] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK
[03:36:35] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[03:36:35] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 83 processes
[03:36:39] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 94% free memory
[03:38:00] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.91, 3.58, 4.41
[03:39:26] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[03:39:26] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 82 processes
[03:39:31] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 90% free memory
[03:39:31] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[03:39:31] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 1.09, 5.35, 5.65
[03:40:56] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.12, 2.46, 4.00
[03:42:46] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[03:43:46] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.02, 2.00, 4.11
[03:45:06] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 0.52, 8.86, 7.90
[03:45:16] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:49:56] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:50:45] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory
[03:52:20] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[03:55:06] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.54, 1.92, 4.56
[04:05:43] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:05:56] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory
[04:06:09] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory
[04:09:29] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds
[04:12:59] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.07, 1.62, 3.58
[04:14:16] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.007 second response time
[04:15:26] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:16:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:17:56] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.18, 0.89, 2.75
[04:21:33] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:09] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[04:25:55] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 11% free memory
[04:32:01] ahh
[04:36:30] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 4% free memory
[04:41:32] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[04:41:37] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:53:10] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[04:55:00] !log deployment-prep bug 36871 - deleting bz-dev instance
[04:55:06] Logged the message, Master
[05:23:36] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[05:53:54] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:10:35] New patchset: Hashar; "class to install the ack-grep utility" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6467
[06:10:51] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6467
[06:13:14] New patchset: Hashar; "class to install the 'tree' utility" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6468
[06:13:29] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6468
[06:14:40] New review: Hashar; "rebased" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6468
[06:24:10] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:42:04] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:14] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:39] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.84, 11.94, 8.22
[06:42:39] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 8.39, 10.93, 6.48
[06:42:44] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:44] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:44] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:49] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:02] !log rebooting labs-nfs1
[06:43:02] rebooting is not a valid project.
[06:43:53] !log testlabs rebooting labs-nfs1
[06:43:54] Logged the message, Master
[06:47:44] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.39, 3.68, 3.65
[06:47:44] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[06:47:44] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[06:47:44] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 80 processes
[06:47:49] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory
[06:53:50] hm. still can't ssh into labs-nfs1
[06:55:15] ok, wtf: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&s=by+name&c=Virtualization%2520cluster%2520pmtpa&tab=m&vn=
[06:57:48] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:58:13] PROBLEM Current Load is now: CRITICAL on labs-nfs1 i-0000005d output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:33] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 9.82, 7.05, 3.89
[06:59:58] PROBLEM HTTP is now: CRITICAL on resourceloader2-apache i-000001d7 output: CRITICAL - Socket timeout after 10 seconds
[06:59:58] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:59] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:03] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:04] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:08] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:21] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 4.06, 4.86, 5.42
[07:10:02] Ryan_Lane: there is some problem eh
[07:10:06] yes
[07:10:13] labs-nfs1 is having issues
[07:10:14] nagios crashed
[07:10:17] aha
[07:10:27] but that shouldn't have anything common
[07:12:42] rebooted
[07:12:43] well, load is going insane
[07:12:49] don't reboot anything
[07:12:51] oh
[07:12:52] late
[07:12:54] it may make things worse
[07:13:05] the machine was sort of down anyway
[07:13:17] yeah, but it's going to cause more IO on boot
[07:13:21] ok
[07:14:46] @search load
[07:14:54] !ping
[07:14:54] Results (found 2): load, load-all,
[07:14:55] pong
[07:15:01] wow
[07:15:05] wm-bot is lagged
[07:15:09] !load-all
[07:15:12] http://ganglia.wikimedia.org/2.2.0/?c=Virtualization%20cluster%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[07:16:56] !load
[07:16:56] http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?h=virt2.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1327006829&g=load_report&z=large&c=Virtualization%20cluster%20pmtpa
[07:18:21] is there anywhere a monitor for gluster storage we use
[07:18:30] for the project storage, yes
[07:18:43] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&s=by+name&c=Glusterfs%2520cluster%2520pmtpa&tab=m&vn=
[07:19:04] the instance storage is on the virt hosts, so it with the virtual cluster
[07:19:38] right
[07:19:38] Current Load Avg (15, 5, 1m):
[07:19:38] 1782%, 2005%, 2011%
[07:19:48] :o
[07:19:55] that's a problem I think
[07:19:58] yeah
[07:20:07] I assume it's unix load * 100
[07:20:15] so right now 17
[07:20:44] how many cpu these things have
[07:20:48] a lot
[07:20:50] it's due to IO
[07:20:53] ah
[07:21:10] maybe some array fail?
[07:21:26] I mean disk array ofc
[07:21:55] dmesg doesn't indicate that
[07:22:06] hm. they have hardware raid, though
[07:22:12] lemme check the hardware controllr
[07:22:23] ah
[07:22:50] but that probably wouldn't affect all of them in same time
[07:22:57] nagios would tell me about that as well
[07:23:04] well, it's using gluster
[07:23:07] I don't see such a check in nagios
[07:23:09] and it freaks ou
[07:23:11] *out
[07:23:19] there's a raid check in nagios
[07:23:23] you don't have disk checks in nagios
[07:23:27] http://nagios.wikimedia.org/
[07:23:28] yes, we do
[07:23:33] I don't see it
[07:23:40]
[07:23:45] NTP
[07:23:54] puppet shh
[07:23:57] ssh *
[07:24:01] nothing else
[07:24:01] /usr/bin/check-raid.py
[07:24:09] via nrpe
[07:24:13] http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=1&host=virt1
[07:24:23] only 3 checks
[07:25:02] ah
[07:25:05] virt1 is degraded
[07:25:07] look at that
[07:25:36] right
[07:29:21] petrb@bastion1:~$ ls
[07:29:21] ls: cannot open directory .: Permission denied
[07:30:08] petrb@bastion1:~$ ls /home -l | grep petr
[07:30:08] ls: cannot access /home/preilly: Permission denied
[07:30:08] ls: cannot access /home/petrb: Permission denied
[07:30:08] d????????? ? ? ? ? ? petrb
[07:30:08] likely because it isn't mounted
[07:30:11] ah
[07:30:13] right
[07:30:25] because the stupid nfs server isn't working for some reason
[07:30:30] I think that's really the cause of this
[07:30:30] eh
[07:30:35] right
[07:31:03] the degraded raid needs to be fixed too, though
[07:31:36] nagios is back
[07:32:25] does some disk died due to all the I/O on labs? :-/
[07:32:47] maybe :o
[07:32:53] no. the disk died because its a bad disk
[07:32:57] hehe
[07:34:05] Ryan_Lane: there was no message from nagios regarding that
[07:34:11] in -tech
[07:34:19] so that check doesn't work
[07:34:23] seems like it
[07:34:36] it might be degraded few days
[07:36:57] could be
[07:39:32] load is dropping
[07:39:49] no idea why
[07:39:49] or why it went up to begin with
[07:40:03] of course things are still broken
[07:40:48] nfs is working again somehow
[07:41:01] I had to restart all of the rpc processes
[07:41:04] I hate nfs
[07:41:43] I can log into the bastion host again
[07:42:02] yeah, this is all likely due to the nfs server rebooting and coming up improperly
[07:42:18] really badly need to get away from that nfs instance
[07:51:12] what's going on?
[07:51:36] paravoid: labs-nfs1 rebooted
[07:51:37] reading backlog
[07:51:47] and when it came back up its nfs services had issues
[07:51:53] it caused a cascading failure
[07:52:05] on another note, virt1 has a degraded raid
[07:52:06] "yay"
[07:52:10] yep
[07:52:18] I really want to kill off the nfs instance
[07:52:28] but we need reliable gluster before we can do that
[07:52:38] so, we need to attempt to upgrade again at some point
[07:54:04] we also had some issues in production
[07:54:20] likely due to a bad deploy or broken code
[07:54:54] load is still kind of high on the virt cluster
[07:55:20] I'm sure the degraded raidset isn't helping this
[07:56:00] ok. must pack
[07:58:03] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.69, 0.75, 4.13
[07:58:25] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK
[08:00:45] RECOVERY Current Load is now: OK on gerrit-bots i-00000272 output: OK - load average: 0.07, 0.98, 3.85
[08:00:45] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[08:01:19] Ryan_Lane: anything I can do?
[08:03:15] RECOVERY Current Load is now: OK on deployment-nfs-memc i-000000d7 output: OK - load average: 2.48, 2.97, 4.27
[08:05:58] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 0.41, 0.90, 3.94
[08:08:15] RECOVERY Current Load is now: OK on aggregator-test2 i-0000024e output: OK - load average: 1.60, 1.16, 4.59
[08:09:44] hm
[08:10:05] I don't know. load is basically back to normal
[08:10:22] oh, can you add an rt ticket for the degraded raid?
[08:10:25] it's virt1
[08:10:31] it goes into the pmtpa queue
[08:10:49] bonus points if you can figure out which disk is the bad one
[08:10:50] heh
[08:11:15] Pull them out one by one! Filesystem russian roulette!
[08:11:25] well, the raid command should say
[08:11:36] it's /usr/bin/MegaCli64
[08:11:54] Mmm megacli, my perfered kinda raid card.
[08:12:06] will do
[08:12:10] fucking raid commands need to have the most user unfriendly commandlines ever
[08:12:23] Lol
[08:12:32] seriously run: /usr/bin/MegaCli64 help
[08:12:39] Perc cards are the worst, thanks dell. LSI cards are more friendly.
[08:12:40] and just wait for your brain to melt
[08:13:29] It's the ones that want like /c0/u0 command or '[c0] [u0]' command that really blow my brains
[08:15:48] ok, back to packing
[08:15:59] why did I come in this room again? i know it was for a reason
[08:20:53] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.80, 0.90, 3.20
[08:23:14] Ryan_Lane: Because you love us so dearly?
[08:24:05] Also, enjoy your trip accross country.
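[Editor's note: the "figure out which disk is the bad one" hunt above is usually done by scanning the controller's physical-drive listing. A minimal sketch follows; the binary path (/usr/bin/MegaCli64) comes from the log itself, the `-PDList -aALL` invocation is the commonly used listing command, and the sample output below is illustrative, not captured from virt1.]

```python
# Sketch: pick the non-Online drives out of MegaCli physical-drive output.
# Typically fed from:  /usr/bin/MegaCli64 -PDList -aALL

def find_bad_drives(pdlist_output):
    """Return (enclosure, slot, state) tuples for drives whose
    'Firmware state' is not an Online state."""
    bad = []
    enclosure = slot = None
    for line in pdlist_output.splitlines():
        line = line.strip()
        if line.startswith("Enclosure Device ID:"):
            enclosure = line.split(":", 1)[1].strip()
        elif line.startswith("Slot Number:"):
            slot = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware state:"):
            state = line.split(":", 1)[1].strip()
            if not state.startswith("Online"):
                bad.append((enclosure, slot, state))
    return bad

# Illustrative sample output (two drives, one failed):
sample = """\
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Failed
"""
print(find_bad_drives(sample))  # [('32', '1', 'Failed')]
```

In practice the function would be fed `subprocess.check_output` of the MegaCli command run as root on virt1.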
[08:25:11] will do :)
[08:25:43] Oh god, you have a scary gtalk picture
[08:25:49] And pidgen makes scary noises
[08:25:53] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.35, 1.40, 2.78
[08:25:55] * Damianz fixes his config
[08:26:12] hahaha
[08:26:22] I *still* don't understand how I have that pic for gtalk
[08:26:32] I've changed it everywhere and it still pops up somehow
[08:26:51] Lol, I'm pretty sure mine is a dog or something but my profile pic is totally different.
[08:30:52] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[08:33:42] ok. I'm off
[08:33:44] * Ryan_Lane waves
[08:33:54] * Damianz nods and waves
[09:01:02] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[09:31:12] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[10:00:43] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[10:01:38] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[10:08:47] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[10:31:41] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[11:01:54] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[11:04:24] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[11:34:27] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:05:21] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:08:39] ACKNOWLEDGEMENT host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:47:39] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[14:13:19] hello
[15:00:44] and here I am again
[15:02:22] paravoid: if you are wiling to have some fun, we can review my puppet changes in test branch :-D
[15:02:23] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:test+topic:test,n,z
[15:02:47] will do
[15:03:58] gah, gerrit is so hard to use
[15:04:40] the good news is that you can do most of the stuff through the CLI :D
[15:05:20] Gerrit sucks UX wise
[15:05:31] {{POV}} !!!
[15:21:38] 05/16/2012 - 15:21:38 - Creating a home directory for aude at /export/home/wikidata-dev/aude
[15:22:37] 05/16/2012 - 15:22:36 - Updating keys for aude at /export/home/wikidata-dev/aude
[15:30:34] New review: Hashar; "That indeed solved the issue. dbdump has some archives in /home/wikipedia/logs" [operations/puppet] (test); V: 1 C: 1; - https://gerrit.wikimedia.org/r/7746
[15:30:48] paravoid: can you merge that one please? https://gerrit.wikimedia.org/r/#/c/7746/
[15:31:04] log rotate does not create olddir, that change makes puppet creates it for us :)
[15:32:23] in the utilities class?
[15:32:31] ah, no
[15:33:03] <^demon> hashar: Did you take a look at that thing I mentioned yesterday?
[15:33:12] I find it dumb that logrotate does not create the olddir :-(
[15:33:25] ^demon: no sorry. Woke up like 1 hour and a half ago :(
[15:33:31] ^demon: can you send me the link again?
[15:33:38] <^demon> Sure :)
[15:33:49] <^demon> https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#_a_id_changemerge_a_section_changemerge
[15:34:27] ohhh
[15:34:40] so that will prevents us from submitting a change that can not be merged?
[15:34:43] am I correct?
[15:34:45] <^demon> So the idea is "Submit" will be hidden if a dry-merge fails.
[15:34:46] <^demon> Yep
[15:35:01] I guess it is harmless
[15:35:27] hashar: eh?!
[15:35:30] how does that work?
[15:35:35] it needs ensure => directory, doesn't it? [15:35:38] <^demon> I asked #gerrit. There's a minor performance hit when you first load a change, but the result of that merge (based on sha1) is held indefinitely, so only the first load is slower. [15:35:46] ^demon: isn't there a [Rebase] button to? [15:36:24] plus, 0755 should be avoided with puppet, 0644 is "preferred" (puppet converts that to 0755 for directories) [15:36:27] paravoid: indeed :-D [15:36:44] paravoid: yeah I already had that discussion about 0755 versus 0644 [15:36:51] <^demon> hashar: 2.4 I believe. Haven't tested. [15:36:54] <^demon> rc1 came out today [15:37:03] <^demon> Might be master/2.5 though :\ [15:37:26] paravoid: since they are directories, it seems more logical to me to use the interpolated mode (aka 0755) on directories [15:37:39] <^demon> I'm afraid hiding the button is going to confuse people though. Maybe we could float the idea on wikitech-l. [15:37:57] <^demon> I personally think it'd be less annoying than merging only to have gerrit yell at you, but better get consensus :) [15:38:08] the reasoning behind adding +x on 0644 directories is that you can recurse [15:38:42] you can say /foo/bar, mode => 0644, ensure => present, recurse => true and copy all of the contents with 0644 for files and 0755 for dirs/subdirs [15:40:05] ^demon: when the Submit button is hidden, is there an informative message saying something like "change need rebase before submission" ? [15:40:13] ^demon: else it is going to confuse me a lot :-] [15:40:34] paravoid: yeah I got the recursion point to. Then I am sometime asked to not use recursion hehe [15:40:35] <^demon> Don't know. We can enable it on gerrit-dev and find out [15:40:40] paravoid: I will switch to 0644 :-] [15:42:06] New patchset: Hashar; "(bug 36872) logrotate require archive dir to be created!" 
[operations/puppet] (test) - https://gerrit.wikimedia.org/r/7746 [15:42:15] paravoid: updated [15:42:21] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7746 [15:42:33] ^demon: or be bold and wait for people to complain :-]]]]]]]]]]]]]]]]] [15:42:46] <^demon> Noooooooo, then I'll get people yelling at me ;-) [15:42:51] <^demon> And I don't like being yelled at [15:43:08] that is why we have project managers [15:43:12] tech staff break stuff [15:43:18] and customers yell at the project manager :-] [15:43:28] then everyone is happy [15:44:50] <^demon> I'm the project manager on git migration too. [15:44:56] <^demon> So I get to wear that hat and get yelled at :) [15:45:04] * Damianz shoots ^demon's hat [15:45:16] :hashar- hello hashar [15:45:42] bonjour tparveen [15:45:51] I really should get a toolserver account hmm, this is going to be painful to do using the api -> sees days to weeks lead time on new accounts, thinks api it is [15:45:58] ^demon: so you will want to mail wikitech-l with the proposal :-D [15:46:12] :hashar- :) wanted to check how labs is doing [15:46:12] <^demon> Yeah I'm about halfway through an e-mail. [15:46:17] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7746 [15:46:19] ^demon: the only drawback I can see is that some people will wonder why they are no longer able to submit [15:46:19] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7746 [15:46:45] <^demon> !log gerrit set changeMerge.test to true in gerrit.config. Testing fun features :) [15:46:46] Logged the message, Master [15:46:54] <^demon> No bots? Lame. [15:47:27] !log deployment-pre Running puppet on dbdump for https://gerrit.wikimedia.org/r/7746 [15:47:27] deployment-pre is not a valid project. [15:47:29] hashar: I'm kinda lost with gerrit, what other changes do you have for me?
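The two puppet points in this review thread (logrotate will not create olddir itself, so puppet has to ensure the directory; and using mode 0644 with recurse, which puppet promotes to 0755 on directories) could be sketched roughly like this. The class name and path are illustrative, not the actual contents of change 7746:

```puppet
# Hypothetical sketch of the fix discussed above: logrotate's olddir
# must already exist, so have puppet manage the archive directory.
class logrotate::archivedir {
    file { '/home/wikipedia/logs/archive':
        ensure  => directory,
        owner   => 'root',
        group   => 'root',
        # 0644 is the "preferred" mode; when recursing, puppet applies
        # 0644 to files and adds +x (0755) to directories automatically.
        mode    => '0644',
        recurse => true,
    }
}
```

With this in place, a logrotate rule pointing its `olddir` at that path no longer fails on a fresh instance.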
[15:47:34] hashar: I see two [15:48:06] owner:hashar project:operations/puppet branch:test status:open [15:48:09] https://gerrit.wikimedia.org/r/#/q/owner:hashar+project:operations/puppet+branch:test+status:open,n,z [15:48:21] there are a few others [15:48:39] the "class to install the {tree,ack-grep}" has been reviewed by mutante IIRC [15:48:55] oh not the ack-grep [15:49:10] <^demon> I still don't know why you need a class for that. You can just copy the standalone to ~/bin :\ [15:49:16] <^demon> re: ack-grep [15:49:54] I have been asked to use classes to install packages :-D [15:49:59] so I create classes! [15:50:16] <^demon> I'm saying skip the class and the package. [15:50:20] <^demon> No package needed ;-) [16:02:40] New patchset: Hashar; "puppet exec enforces full paths; that's cool." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7794 [16:02:54] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7794 [16:03:09] New review: Hashar; "Trying out cherry-pick from production." [operations/puppet] (test); V: 0 C: 1; - https://gerrit.wikimedia.org/r/7794 [16:08:30] yeah blank page http://commons.wikimedia.beta.wmflabs.org/ [16:08:36] and I have ZERO error log :D [16:14:49] solved [16:15:27] what was it? [16:18:26] I am editing the CommonSettings.php file [16:18:37] I have enabled the multi version system on labs although it is broken [16:18:41] thus triggered white pages [16:18:56] multi version is what lets us support several wiki versions on the cluster [16:19:10] it is replaced by a hack on labs, I have no idea why [16:35:51] stupid NFS [16:36:00] makes things so slow :-] [16:37:27] Anyone know the api better than me? [16:38:31] Damianz, which API? MediaWiki's? [16:38:39] mhm [16:38:50] Probably, though I'm not sure how well you know it [16:40:13] If I have a page id (and title, if that matters) and a revid how hard is it to pull the previous rev info?
i.e. I have a user's contribs, they reverted someone, and I want the reverted edit's info. Can't figure out how to pull the rev id without pulling the whole article's history and looping over it edit by edit until I hit my known id then shifting one back in the array (which would be painful). [16:41:20] Aha, there's an interesting question [16:41:27] I assume you've been through the docs? [16:43:09] Flicking through randomly, I just got the user contribs stuff working so I've not looked at revs in detail as I find the docs sorta confusing. [16:43:16] Damianz, from http://www.mediawiki.org/wiki/API:Undelete it looks like the revid might be sequential [16:43:20] There's like 5 ways to do something depending on exactly what you want. [16:44:55] Damianz, I think you'd be better served in #mediawiki, and by waiting a little while until some more people get into work :) [16:45:32] That's a good point, I forget about that channel as I don't tend to do api stuff (fixing a bot which broke its databases somewhat due to a bug). [16:45:36] Thanks for the pointer though :D [16:45:58] Absolutely! [16:46:16] !log deployment-prep cleaned up more of CommonSettings.php today. Moved some hacks to disable features as settings in InitialiseSettingsDeploy.php . See git log. [16:46:17] Logged the message, Master [16:46:19] and I am out [16:46:25] maybe [16:51:23] Actually, it seems http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Aquiline%20nose&rvstartid=492880920&rvlimit=2 works. Little hacky but should do. [16:58:33] Change abandoned: Hashar; "will merge that in test later." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7794 [17:05:05] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7255 [17:06:58] FFS [17:07:07] Thanks ubuntu for just overwriting that file without asking me [17:08:27] Ubuntu: Just sit back, we'll trash your filesystem for you.
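The rvstartid/rvlimit=2 trick Damianz settles on above works because prop=revisions returns revisions newest-to-oldest starting at the given id, so the second entry is the edit that got reverted. A rough Python sketch; the helper names and the stubbed response are mine, not from the bot:

```python
def prev_revision_params(title, revid):
    """Build MediaWiki API params fetching a revision plus its predecessor.

    With rvstartid=<revid> and rvlimit=2 (default direction: older),
    the API returns the known revision first and the one before it second.
    """
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvstartid": revid,
        "rvlimit": 2,
        "format": "json",
    }

def previous_revision(response):
    """Pull the predecessor revision out of a decoded query response."""
    page = next(iter(response["query"]["pages"].values()))
    revs = page["revisions"]
    return revs[1] if len(revs) > 1 else None  # None: no earlier revision

# Example against a stubbed response shaped like the live API output:
fake = {"query": {"pages": {"123": {"revisions": [
    {"revid": 492880920, "user": "A"},
    {"revid": 492880100, "user": "B"},
]}}}}
print(previous_revision(fake)["revid"])  # -> 492880100
```

Sending those params to `/w/api.php` (e.g. with urllib) and feeding the decoded JSON to `previous_revision` avoids walking the whole page history.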
[17:08:53] Yeah, whatever happened to DUDE THAT FILE EXISTS, SURE YOU WANT TO OVERWRITE IT [17:08:58] 45min of work trashed ;( [17:27:38] 05/16/2012 - 17:27:38 - Creating a home directory for vrandezo at /export/home/wikidata-dev/vrandezo [17:28:39] 05/16/2012 - 17:28:39 - Updating keys for vrandezo at /export/home/wikidata-dev/vrandezo [17:33:14] 05/16/2012 - 17:33:14 - Creating a home directory for vrandezo at /export/home/bastion/vrandezo [17:34:15] 05/16/2012 - 17:34:14 - Updating keys for vrandezo at /export/home/bastion/vrandezo [17:34:22] is it intentional that anyone can add members to the bastion project? [17:34:49] Daniel_WMDE: eh? [17:38:09] paravoid: "eh?" to which part? [17:38:55] Ryan_Lane: ^ ? [17:39:06] Daniel_WMDE: sounds wrong to me... [17:39:06] yes [17:39:15] but Ryan would know better [17:39:27] ryan: ok then :) [17:39:36] though it's something I've considered changing [17:40:28] and likely will before making labsconsole open registration [17:40:40] ic [17:41:01] Ryan_Lane: denny (vrandezo) is having problems logging into bastion (connection closed) [17:41:03] any idea? [17:41:14] is he in the project? [17:41:18] is his key valid? [17:41:32] Ryan_Lane: "likely"? :-) [17:41:42] i added him to the project (hence my question) [17:41:47] paravoid: not too sure yet [17:42:06] ah. heh. it just logged to the channel that his home directory is there now :) [17:42:15] he has a valid key uploaded, and is using ssh agent [17:42:25] we are not checking that it's actually the correct key :) [17:42:30] oh [17:42:34] I have a feeling I know why [17:42:54] <^demon> Ryan_Lane: Thanks for reviewing + merging that hook refactoring Monday. [17:42:54] have him try now [17:42:58] ^demon: yw [17:43:08] ^demon: and fixing it too, you mean, right? :D [17:43:11] what was it? [17:43:19] nscd negative cache [17:43:29] he tried logging in before he was in the project-bastion group [17:43:37] Ryan_Lane: works now, thanks!
[17:43:38] so, his account was cached [17:43:51] ah [17:43:56] well, I guess not technically negative cache :) [17:44:15] yeah, this happened to me 1-2 weeks ago or so [17:44:22] the positive cache is about an hour [17:44:28] hm [17:44:30] for passwd [17:44:35] 6 hours for group [17:44:51] that's a really long TTL [17:49:34] <^demon> release notes for 2.4 are up (rc1 is out): http://gerrit-documentation.googlecode.com/svn/ReleaseNotes/ReleaseNotes-2.4.html [17:52:19] That's quick [17:52:25] 2.3 is still fairly recent [17:52:37] 05/16/2012 - 17:52:37 - Updating keys for vrandezo at /export/home/wikidata-dev/vrandezo [17:52:56] <^demon> 2.3 sat in rc status for far too long :) [17:52:59] issue 1035 Add rebase button to the change screen [17:53:01] Woot [17:53:07] Asynchronously send email so it does not block the UI [17:53:13] * RoanKattouw sighs deeply [17:53:15] 05/16/2012 - 17:53:15 - Updating keys for vrandezo at /export/home/bastion/vrandezo [18:48:35] PROBLEM Puppet freshness is now: CRITICAL on deployment-syslog i-00000269 output: Puppet has not run in last 20 hours [18:48:37] Ryan_Lane: can you create a labs account for Dan Foy [18:48:54] did he fill out the developer access page? [18:49:05] I need the info requested there [18:49:42] Ryan_Lane: this page http://www.mediawiki.org/w/index.php?title=Developer_access&action=edit&section=new&preload=Template:Developer_access_request_preload&editintro=Template:Developer_access_request_editnotice [18:50:46] preilly: ok. gimme a sec [18:50:52] Ryan_Lane: okay, I've sent it to him [18:51:01] oh [18:51:02] yeah [18:51:03] that one [18:51:12] Ryan_Lane: once he fills it out I'll let you know [18:51:20] !log wikistats - updating mw versions for wmf wikis (heh, while deployment is going on) [18:51:22] Logged the message, Master [18:51:22] ok [19:13:12] PROBLEM Current Load is now: CRITICAL on demo-deployment1 i-00000276 output: CHECK_NRPE: Socket timeout after 10 seconds.
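The nscd behaviour being described (positive lookups for passwd cached about an hour, group entries for six) comes from the TTL settings in /etc/nscd.conf; a fragment matching those numbers would look like this. This is illustrative, not the actual labs configuration:

```
# /etc/nscd.conf (fragment) - times are in seconds
positive-time-to-live  passwd  3600    # cached passwd hits live 1 hour
negative-time-to-live  passwd  20      # "no such user" only cached briefly
positive-time-to-live  group   21600   # cached group hits live 6 hours
negative-time-to-live  group   60
```

This also explains the failure mode above: an account looked up before it existed in the project group can stay wrong in the cache until the TTL expires or nscd is restarted.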
[19:16:03] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 8.53, 15.70, 9.54 [19:16:37] what the fuck. just doing a copy of a repo spikes load? must. get. off. gluster. [19:18:13] RECOVERY Current Load is now: OK on demo-deployment1 i-00000276 output: OK - load average: 7.17, 5.68, 3.06 [19:22:10] Ryan_Lane: Ceph has a sexy site and irc now :D [19:22:26] we're considering ceph [19:22:47] <^demon> How about nfs? [19:22:49] * ^demon hides [19:23:34] ^demon: How about tcp over carrier pigeon [19:23:40] ^demon: heh [19:36:12] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.77, 2.19, 4.28 [19:36:50] paravoid: any way we'll have per-project puppet by Berlin? [19:38:40] PROBLEM host: gerrit-bots is DOWN address: i-00000272 check_ping: Invalid hostname/address - i-00000272 [19:39:12] Gerrit can haz no bots [19:39:30] * Damianz wonders about ganglia stuff [20:02:14] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [20:07:00] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.008 second response time [20:30:21] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 2.09, 11.08, 6.96 [20:40:21] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.12, 1.62, 3.70 [20:49:24] Thehelpfulone: you around? [20:49:31] yep [20:51:08] I have clearance now to run any and all of Fbot from the toolserver or labs [20:51:17] so now I need your help setting things up [20:52:25] ok [20:52:31] and just to get things clear, what's Fbot? [20:52:47] Fastily [20:52:50] 's bot [20:52:57] does file namespace work on Wikipedia [20:53:01] right, so what have you done so far in terms of labs setup?
[20:53:06] I run copies of some tasks from Svenbot [20:54:14] Sven_Manguard: have you done anything to do with labs setup so far, or are we starting from scratch? [20:54:30] I have an account [20:54:41] you told me you were going to walk me through something [20:54:52] err with SHA? SSH? SHH? [20:54:55] yep ok, so did you install the software I told you to? [20:55:34] No! [20:55:43] That would require... err... competance [20:55:43] ok, let's start from the beginning then :) [20:56:29] Totally just raping mysql right now [21:02:02] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [21:19:06] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [21:29:08] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [21:35:09] 05/16/2012 - 21:35:09 - Creating a home directory for svenmanguard at /export/home/bots/svenmanguard [21:35:25] !? [21:36:11] 05/16/2012 - 21:36:11 - Updating keys for svenmanguard at /export/home/bots/svenmanguard [21:36:20] !? [21:36:25] that's fine Sven_Manguard [21:36:27] Thehelpfulone: wha? [21:36:47] I'll be back in 5 mins Thehelpfulone [21:36:52] ok :) [22:17:15] 05/16/2012 - 22:17:15 - Creating a home directory for svenmanguard at /export/home/bastion/svenmanguard [22:17:32] Thehelpfulone: !? [22:17:43] that was me [22:17:53] don't worry, it means I was giving you access to be able to login to the server [22:18:00] aww sweet [22:18:14] 05/16/2012 - 22:18:14 - Updating keys for svenmanguard at /export/home/bastion/svenmanguard [22:18:29] soon I will be able to use labs to do evil things - like run approved bots and hold the door open for the elderly [22:18:41] and download porn [22:18:49] and feed homeless kittens [22:18:54] Did he sign the contract which states we can perform experiments on him while he's alive yet? 
[22:19:14] Damianz: Too late, I'm a lich [22:19:39] I can and will freeze your testicles off and market them to the serfs as meat popsicles if you annoy me [22:20:32] Testicles are not required [22:21:34] You also assume he's male [22:22:01] I assume his name is Damian, which is a male name Reedy [22:22:37] Could be his surname [22:23:12] Might be female in another country/culture [22:23:55] The Internet: where men are men, women are men, and children are FBI agents [22:24:25] Reedy: or he could actually be Damian Zaremba [22:24:47] I could, or I could be named BobTheBuilder and just choose to have that mask. [22:25:23] Damianz: hard to pull off a fake mask, fake social networking profiles, and everything else [22:25:36] Not really [22:25:52] I could sign up for a facebook, twitter account, create a blog and wp account then request the mask. [22:26:01] Or I could just go social engineer Reedy and login as him [22:26:01] :D [22:26:39] I convinced Logan that my name was Shaniqua Kwon once, so it's not too hard, I suppose [22:27:56] Reedy, he is in the internet [22:28:02] In it? [22:28:03] Wow. [22:28:10] there are no women there ;) [22:28:22] not true [22:28:29] In it? we all sat in lcarr's ports again? :D [22:28:31] Sue goes on IRC sometimes [22:42:10] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.015 second response time [22:47:51] http://postimage.org/image/e4koh1b0z/6050fc0d/ [22:47:56] Thehelpfulone: that ^ [22:48:15] oh :P [22:48:19] you say yes to that [22:48:26] Say yes to everything [22:48:35] Including the button that sends us your credit card details [22:48:57] Also, imgur.com > random slow site [22:49:01] new error [22:49:14] PuTTy fatal error [22:49:24] screenshot? 
[22:49:34] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [22:49:37] "Server unexpectedly closed network connection" [22:49:38] that's it [22:49:54] ok, so try to reconnect again [22:50:02] using bastion.wmflabs.org as your host name [22:52:00] and the same other settings? [22:52:23] yes [22:52:31] login as: [22:52:33] ! [22:52:42] so now you get to experiment [22:52:45] try svenmanguard first [22:53:20] gave me a shitton of text [22:53:34] yep that's fine [22:53:48] won't let me Control+C it, it instead says svenmanguard@bastion1:~$ ^C [22:54:32] ok [22:54:44] so now backspace to remove the ^C bit [22:54:47] and type: [22:54:52] ssh -A bots-3 [22:55:18] when it asks you another question about a new host or something, you say yes again [22:55:21] Why forward the agent to bots-3? [22:55:47] authenticity not established, key is (yes/no) [22:55:50] so I say yes? [22:56:03] Sven_Manguard: you say yes [22:56:33] ah yes sorry you don't need the -A [22:56:35] just ssh bots-3 [22:58:13] so type no [22:58:18] then ssh bots-3 [22:58:25] type yes [22:58:30] then don't type anything else :) [22:58:34] I typed no [22:58:38] ok [22:58:42] because I had the -A in it [22:58:48] so now type ssh bots-3 [22:58:51] You know, someone should really screenshot this from the point of a total idiot and put it on the wiki [22:58:51] then type yes [22:59:20] Damianz: are you volunteering? ;) [22:59:43] Damianz: are you insulting me? I can't tell because I'm a total idiot [22:59:44] Sure, because I'm awesome at writing documentation for idiots and I totally have a windows box. [23:00:00] also: fuck off [23:00:15] Thehelpfulone: back to PM please [23:00:25] Sven_Manguard: woah, he's just trying to prove a point that there's no clear documentation at the moment [23:00:31] It's only insulting if you're insulted, though in this case I was just thinking more of the 4 thousand times this gets repeated.
[23:00:48] Lack of documentation = having to repeat instructions = painful for all parties. [23:04:11] Reading recent backscroll. Please remember to be friendly, everyone. [23:05:19] * Damianz notes -> substitute 'idiots' for 'users' so sentences have the same meaning while idio...users don't feel insulted. :D [23:30:13] Damianz, how do you run .bats on labs? [23:31:02] if you go to /home/svenmanguard/bot on bots-3 Sven is trying to run that bot [23:32:01] bat is a windows batch file usually [23:32:10] which likely won't run [23:32:10] yes [23:32:13] unless you wine it [23:32:13] that's what I thought [23:32:15] * Reedy barfs [23:32:15] fuck [23:32:17] That's windows commands [23:32:21] The bot is java though [23:32:26] Can easily port that to a shell script [23:32:28] aye [23:32:31] yeah, so how do you make it run? [23:32:33] java fooba.jar [23:32:40] Pastebin the content of the batch file? [23:32:42] Damianz: I don't have the source code because Sigma found it on the internet and stole it so Fastily nuked everything [23:33:01] http://pastebin.com/AALEAgZ7 [23:33:10] Sven_Manguard: that .jar is the source code [23:33:13] Of what? jars are a) decompilable and b) cross platform [23:33:22] java -jar Fbot.jar -z [23:33:27] just use that [23:33:49] okay, there's a second problem [23:33:57] no java installed? [23:33:57] anyone here know Java? [23:33:59] that doesn't look good Reedy, http://pastebin.com/z1aAxzTp [23:34:10] Class error [23:34:20] yeah, so it's not installed? [23:34:24] it's broken?
[23:34:26] missing libraries [23:34:33] I think that's an openjdk vs sun library thing [23:34:45] Reedy: it uses the big library that a lot of the bots use [23:34:50] fuck yeah, proprietary [23:34:52] Someone with an M made it [23:35:56] * Damianz waits [23:36:29] !log bots Installed default-jre on bots-3 for svenmanguard's bot [23:36:30] Logged the message, Master [23:36:49] looks like it needs X - you'll have to set up X forwarding to your local machine [23:38:12] How the hell do some revisions have no user or ip associated *mind blown* [23:38:44] X forwarding Damianz? That's new to me [23:40:25] Does the bot have some form of gui thing? [23:40:41] I don't really know java that well but it looks like it's trying to throw a gui window up for some login details. [23:40:46] yea [23:40:49] it did that [23:40:54] enter password field [23:41:03] lo. [23:41:05] l. [23:41:15] He did it for my benefit you arse [23:41:30] I have older versions too, but they don't work with Fastily's on-wiki pages gone [23:41:40] you want those Thehelpfulone? [23:41:52] On Wiki pages can always be recreated [23:41:53] Giving a window isn't any easier than a CLI asking for the same details [23:42:34] * Damianz votes you decompile the jar, edit the source and profit [23:42:44] "decompile" -> "unzip" [23:42:50] But it should work, if you have xforwarding setup [23:42:51] Sven_Manguard: you take things far too seriously and personally :) [23:43:13] Reedy: Well yeah, but un-doing byte compiled code is easy too if needed. [23:43:16] Thehelpfulone: sorry, I'm still raw over Fastily being chased off the project by assholes with ANI threads [23:43:25] Thehelpfulone: Did you not get the memo? The world is out to get you personally.
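For anyone not on PuTTY, the two-hop login being walked through here (bastion, then ssh to bots-3, with X forwarding so the bot's Java login window can display locally) looks roughly like this with OpenSSH. The Host aliases are illustrative, not an official labs config:

```
# ~/.ssh/config (fragment)
Host bastion
    HostName bastion.wmflabs.org
    ForwardAgent yes     # lets the second hop reuse your local key

Host bots-3
    ProxyJump bastion    # OpenSSH >= 7.3; older: ProxyCommand ssh -W %h:%p bastion
    ForwardX11 yes       # GUI windows from the instance display locally
```

With that in place, `ssh bots-3` followed by `java -jar Fbot.jar -z` should let the bot's password prompt open on the local desktop (an X server must be running locally, e.g. Xming on Windows).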
[23:43:53] Damianz: oh my memoserv must be full, thanks for the heads-up :P [23:43:56] <3 ani [23:46:04] * Damianz think's it's nearly 1am and he should go to bed before work [23:46:31] eh [23:46:33] work is overrated [23:46:53] True but we're doing datacentre audit stuff tomorrow which is useful. [23:47:12] Best wishes - hope it goes well. [23:51:10] so Thehelpfulone now what? [23:51:37] ??? [23:51:43] I'm stumped. I don't know what x forwarding is. Reedy? [23:52:07] http://lmgtfy.com/?q=x-forwarding [23:54:59] oh thanks. I'm in the nearly 1am group right now, can't handle that search. :P What needs to be changed in the source? [23:55:11] Nothing really [23:55:32] depending on the os on your host machine, it's going to be more/less complex to setup [23:55:40] s/host/local/ [23:56:10] ok so I think you want http://www.math.umn.edu/systems_guide/putty_xwin32.html Sven_Manguard [23:56:17] I'll download and install it too [23:57:34] Thehelpfulone: okay, so let's be clear on this before I go any further [23:58:05] If I jump through all of these hoops I will be able to run the Fbot tasks on a timer, and not have to have my computer on as they run. Correct or no? [23:58:12] Because one task takes 16 hours [23:58:26] and my laptop fan died in Beijing of... well... Beijing air [23:59:40] it's the number two killer of laptops in Beijing, behind shockingly high levels of malware