[00:12:53] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 1.24, 7.97, 5.18
[00:18:02] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 1.66, 3.46, 3.94
[00:18:15] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[00:48:15] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[00:51:37] Ryan_Lane: So that labs database thing, is that something that I will have by lunchtime this Friday?
[00:51:48] (Cause that's when my Berlin tutorial rehearsal is)
[00:51:59] sure. let's set it up on thursday
[00:53:00] OK, cool
[00:59:09] RECOVERY Total Processes is now: OK on demo-deployment1 i-00000276 output: PROCS OK: 91 processes
[00:59:39] RECOVERY dpkg-check is now: OK on demo-deployment1 i-00000276 output: All packages OK
[01:00:31] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds
[01:01:01] RECOVERY Current Load is now: OK on demo-deployment1 i-00000276 output: OK - load average: 0.22, 0.46, 0.22
[01:01:25] RECOVERY Current Users is now: OK on demo-deployment1 i-00000276 output: USERS OK - 1 users currently logged in
[01:02:01] RECOVERY Disk Space is now: OK on demo-deployment1 i-00000276 output: DISK OK
[01:02:58] RECOVERY Free ram is now: OK on demo-deployment1 i-00000276 output: OK: 92% free memory
[01:03:10] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds
[01:03:30] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds
[01:04:05] PROBLEM HTTP is now: CRITICAL on demo-deployment1 i-00000276 output: CRITICAL - Socket timeout after 10 seconds
[01:05:30] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.011 second response time
[01:07:48] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.010 second response time
[01:08:18] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.012 second response time
[01:18:24] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[01:28:22] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[01:29:42] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: DISK CRITICAL - free space: / 29 MB (2% inode=43%):
[01:34:41] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 52 MB (3% inode=43%):
[01:48:24] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:18:26] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:29:51] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory
[02:42:20] 05/16/2012 - 02:42:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:43:26] Raelly need to revising my reading of laner as lamer :D
[02:48:39] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[02:51:09] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[02:51:53] Reedy: willing to do some review ? ;D
[02:52:06] Reedy: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/mediawiki-config,n,z
[02:53:15] Wouldn't it make more sense to make that config a template then use the puppet vars which are garunteed?
[02:53:17] 05/16/2012 - 02:53:17 - Updating keys for laner at /export/home/deployment-prep/laner
[02:54:29] well mediawiki-config is not deployed by puppet :-D
[02:54:44] yet
[02:55:17] eh... oh you're not doing puppet stuff, really should update my mind that svn is diying...slowly
[02:55:25] * Damianz goes away again
[02:55:32] RECOVERY dpkg-check is now: OK on gerrit-bots i-00000272 output: All packages OK
[02:55:32] RECOVERY Current Load is now: OK on gerrit-bots i-00000272 output: OK - load average: 1.49, 1.33, 0.72
[02:55:35] RECOVERY Current Users is now: OK on gerrit-bots i-00000272 output: USERS OK - 0 users currently logged in
[02:55:35] RECOVERY Current Users is now: OK on build-precise1 i-00000273 output: USERS OK - 0 users currently logged in
[02:56:25] RECOVERY Disk Space is now: OK on build-precise1 i-00000273 output: DISK OK
[02:56:25] RECOVERY Disk Space is now: OK on gerrit-bots i-00000272 output: DISK OK
[02:56:36] RECOVERY Free ram is now: OK on build-precise1 i-00000273 output: OK: 87% free memory
[02:57:58] RECOVERY Free ram is now: OK on gerrit-bots i-00000272 output: OK: 89% free memory
[02:58:57] RECOVERY Total Processes is now: OK on build-precise1 i-00000273 output: PROCS OK: 79 processes
[02:59:12] RECOVERY Total Processes is now: OK on gerrit-bots i-00000272 output: PROCS OK: 79 processes
[03:00:26] RECOVERY dpkg-check is now: OK on build-precise1 i-00000273 output: All packages OK
[03:00:26] RECOVERY Current Load is now: OK on build-precise1 i-00000273 output: OK - load average: 0.01, 0.08, 0.07
[03:08:23] 05/16/2012 - 03:08:23 - Updating keys for laner at /export/home/deployment-prep/laner
[03:08:56] ^ 3 times
[03:11:49] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 15.32, 13.50, 7.20
[03:15:36] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[03:19:17] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[03:20:18] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:19] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:07] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 4.61, 7.26, 7.09
[03:25:42] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Current Load is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:42] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:47] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Users is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:52] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:25:57] PROBLEM dpkg-check is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:30:46] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[03:31:17] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:31:18] PROBLEM Disk Space is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:31:18] PROBLEM Total Processes is now: CRITICAL on gluster-devstack i-00000274 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:32:49] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 5.07, 5.68, 5.07
[03:32:49] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[03:32:49] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[03:32:49] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 78 processes
[03:32:54] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory
[03:33:10] errr... is this a bad time?
[03:35:14] RECOVERY Disk Space is now: OK on gluster-devstack i-00000274 output: DISK OK
[03:35:14] RECOVERY Total Processes is now: OK on gluster-devstack i-00000274 output: PROCS OK: 117 processes
[03:35:34] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:34] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:39] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:39] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:36:09] RECOVERY Current Load is now: OK on gluster-devstack i-00000274 output: OK - load average: 0.49, 4.10, 4.62
[03:36:09] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 94% free memory
[03:36:09] RECOVERY Free ram is now: OK on gluster-devstack i-00000274 output: OK: 88% free memory
[03:36:09] RECOVERY Current Users is now: OK on gluster-devstack i-00000274 output: USERS OK - 0 users currently logged in
[03:36:09] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 77 processes
[03:36:14] RECOVERY dpkg-check is now: OK on gluster-devstack i-00000274 output: All packages OK
[03:36:24] wonderfuckingful...
[03:36:34] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in
[03:36:35] PROBLEM Current Load is now: WARNING on worker1 i-00000208 output: WARNING - load average: 4.36, 5.88, 5.29
[03:36:35] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK
[03:36:35] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK
[03:36:35] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in
[03:36:35] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 83 processes
[03:36:39] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 94% free memory
[03:38:00] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.91, 3.58, 4.41
[03:39:26] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[03:39:26] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 82 processes
[03:39:31] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 90% free memory
[03:39:31] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[03:39:31] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 1.09, 5.35, 5.65
[03:40:56] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.12, 2.46, 4.00
[03:42:46] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[03:43:46] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.02, 2.00, 4.11
[03:45:06] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 0.52, 8.86, 7.90
[03:45:16] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:49:56] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:50:45] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 15% free memory
[03:52:20] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[03:55:06] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.54, 1.92, 4.56
[04:05:43] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:05:56] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory
[04:06:09] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory
[04:09:29] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds
[04:12:59] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.07, 1.62, 3.58
[04:14:16] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.007 second response time
[04:15:26] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:16:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:17:56] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.18, 0.89, 2.75
[04:21:33] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds.
[04:23:09] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[04:25:55] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 11% free memory
[04:32:01] ahh
[04:36:30] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 4% free memory
[04:41:32] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[04:41:37] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:53:10] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[04:55:00] !log deployment-prep bug 36871 - deleting bz-dev instance
[04:55:06] Logged the message, Master
[05:23:36] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[05:53:54] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:10:35] New patchset: Hashar; "class to install the ack-grep utility" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6467
[06:10:51] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6467
[06:13:14] New patchset: Hashar; "class to install the 'tree' utility" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6468
[06:13:29] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6468
[06:14:40] New review: Hashar; "rebased" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6468
[06:24:10] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:42:04] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:14] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:39] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.84, 11.94, 8.22
[06:42:39] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 8.39, 10.93, 6.48
[06:42:44] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:44] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:44] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:46] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:49] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:02] !log rebooting labs-nfs1
[06:43:02] rebooting is not a valid project.
[06:43:53] !log testlabs rebooting labs-nfs1
[06:43:54] Logged the message, Master
[06:47:44] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.39, 3.68, 3.65
[06:47:44] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in
[06:47:44] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK
[06:47:44] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 80 processes
[06:47:49] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 94% free memory
[06:53:50] hm. still can't ssh into labs-nfs1
[06:55:15] ok, wtf: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&s=by+name&c=Virtualization%2520cluster%2520pmtpa&tab=m&vn=
[06:57:48] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[06:58:13] PROBLEM Current Load is now: CRITICAL on labs-nfs1 i-0000005d output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:33] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 9.82, 7.05, 3.89
[06:59:58] PROBLEM HTTP is now: CRITICAL on resourceloader2-apache i-000001d7 output: CRITICAL - Socket timeout after 10 seconds
[06:59:58] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:58] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:59] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:03] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:04] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:08] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:21] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 4.06, 4.86, 5.42
[07:10:02] Ryan_Lane: there is some problem eh
[07:10:06] yes
[07:10:13] labs-nfs1 is having issues
[07:10:14] nagios crashed
[07:10:17] aha
[07:10:27] but that shouldn't have anything common
[07:12:42] rebooted
[07:12:43] well, load is going insane
[07:12:49] don't reboot anything
[07:12:51] oh
[07:12:52] late
[07:12:54] it may make things worse
[07:13:05] the machine was sort of down anyway
[07:13:17] yeah, but it's going to cause more IO on boot
[07:13:21] ok
[07:14:46] @search load
[07:14:54] !ping
[07:14:54] Results (found 2): load, load-all,
[07:14:55] pong
[07:15:01] wow
[07:15:05] wm-bot is lagged
[07:15:09] !load-all
[07:15:12] http://ganglia.wikimedia.org/2.2.0/?c=Virtualization%20cluster%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[07:16:56] !load
[07:16:56] http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?h=virt2.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1327006829&g=load_report&z=large&c=Virtualization%20cluster%20pmtpa
[07:18:21] is there anywhere a monitor for gluster storage we use
[07:18:30] for the project storage, yes
[07:18:43] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&s=by+name&c=Glusterfs%2520cluster%2520pmtpa&tab=m&vn=
[07:19:04] the instance storage is on the virt hosts, so it with the virtual cluster
[07:19:38] right
[07:19:38] Current Load Avg (15, 5, 1m):
[07:19:38] 1782%, 2005%, 2011%
[07:19:48] :o
[07:19:55] that's a problem I think
[07:19:58] yeah
[07:20:07] I assume it's unix load * 100
[07:20:15] so right now 17
[07:20:44] how many cpu these things have
[07:20:48] a lot
[07:20:50] it's due to IO
[07:20:53] ah
[07:21:10] maybe some array fail?
[07:21:26] I mean disk array ofc
[07:21:55] dmesg doesn't indicate that
[07:22:06] hm. they have hardware raid, though
[07:22:12] lemme check the hardware controllr
[07:22:23] ah
[07:22:50] but that probably wouldn't affect all of them in same time
[07:22:57] nagios would tell me about that as well
[07:23:04] well, it's using gluster
[07:23:07] I don't see such a check in nagios
[07:23:09] and it freaks ou
[07:23:11] *out
[07:23:19] there's a raid check in nagios
[07:23:23] you don't have disk checks in nagios
[07:23:27] http://nagios.wikimedia.org/
[07:23:28] yes, we do
[07:23:33] I don't see it
[07:23:40]
[07:23:45] NTP
[07:23:54] puppet shh
[07:23:57] ssh *
[07:24:01] nothing else
[07:24:01] /usr/bin/check-raid.py
[07:24:09] via nrpe
[07:24:13] http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=1&host=virt1
[07:24:23] only 3 checks
[07:25:02] ah
[07:25:05] virt1 is degraded
[07:25:07] look at that
[07:25:36] right
[07:29:21] petrb@bastion1:~$ ls
[07:29:21] ls: cannot open directory .: Permission denied
[07:30:08] petrb@bastion1:~$ ls /home -l | grep petr
[07:30:08] ls: cannot access /home/preilly: Permission denied
[07:30:08] ls: cannot access /home/petrb: Permission denied
[07:30:08] d????????? ? ? ? ? ? petrb
[07:30:08] likely because it isn't mounted
[07:30:11] ah
[07:30:13] right
[07:30:25] because the stupid nfs server isn't working for some reason
[07:30:30] I think that's really the cause of this
[07:30:30] eh
[07:30:35] right
[07:31:03] the degraded raid needs to be fixed too, though
[07:31:36] nagios is back
[07:32:25] does some disk died due to all the I/O on labs? :-/
[07:32:47] maybe :o
[07:32:53] no. the disk died because its a bad disk
[07:32:57] hehe
[07:34:05] Ryan_Lane: there was no message from nagios regarding that
[07:34:11] in -tech
[07:34:19] so that check doesn't work
[07:34:23] seems like it
[07:34:36] it might be degraded few days
[07:36:57] could be
[07:39:32] load is dropping
[07:39:49] no idea why
[07:39:49] or why it went up to begin with
[07:40:03] of course things are still broken
[07:40:48] nfs is working again somehow
[07:41:01] I had to restart all of the rpc processes
[07:41:04] I hate nfs
[07:41:43] I can log into the bastion host again
[07:42:02] yeah, this is all likely due to the nfs server rebooting and coming up improperly
[07:42:18] really badly need to get away from that nfs instance
[07:51:12] what's going on?
[07:51:36] paravoid: labs-nfs1 rebooted
[07:51:37] reading backlog
[07:51:47] and when it came back up its nfs services had issues
[07:51:53] it caused a cascading failure
[07:52:05] on another note, virt1 has a degraded raid
[07:52:06] "yay"
[07:52:10] yep
[07:52:18] I really want to kill off the nfs instance
[07:52:28] but we need reliable gluster before we can do that
[07:52:38] so, we need to attempt to upgrade again at some point
[07:54:04] we also had some issues in production
[07:54:20] likely due to a bad deploy or broken code
[07:54:54] load is still kind of high on the virt cluster
[07:55:20] I'm sure the degraded raidset isn't helping this
[07:56:00] ok. must pack
[07:58:03] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.69, 0.75, 4.13
[07:58:25] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK
[08:00:45] RECOVERY Current Load is now: OK on gerrit-bots i-00000272 output: OK - load average: 0.07, 0.98, 3.85
[08:00:45] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[08:01:19] Ryan_Lane: anything I can do?
[08:03:15] RECOVERY Current Load is now: OK on deployment-nfs-memc i-000000d7 output: OK - load average: 2.48, 2.97, 4.27
[08:05:58] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 0.41, 0.90, 3.94
[08:08:15] RECOVERY Current Load is now: OK on aggregator-test2 i-0000024e output: OK - load average: 1.60, 1.16, 4.59
[08:09:44] hm
[08:10:05] I don't know. load is basically back to normal
[08:10:22] oh, can you add an rt ticket for the degraded raid?
[08:10:25] it's virt1
[08:10:31] it goes into the pmtpa queue
[08:10:49] bonus points if you can figure out which disk is the bad one
[08:10:50] heh
[08:11:15] Pull them out one by one! Filesystem russian roulette!
[08:11:25] well, the raid command should say
[08:11:36] it's /usr/bin/MegaCli64
[08:11:54] Mmm megacli, my perfered kinda raid card.
[08:12:06] will do
[08:12:10] fucking raid commands need to have the most user unfriendly commandlines ever
[08:12:23] Lol
[08:12:32] seriously run: /usr/bin/MegaCli64 help
[08:12:39] Perc cards are the worst, thanks dell. LSI cards are more friendly.
[08:12:40] and just wait for your brain to melt
[08:13:29] It's the ones that want like /c0/u0 command or '[c0] [u0]' command that really blow my brains
[08:15:48] ok, back to packing
[08:15:59] why did I come in this room again? i know it was for a reason
[08:20:53] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.80, 0.90, 3.20
[08:23:14] Ryan_Lane: Because you love us so dearly?
[08:24:05] Also, enjoy your trip accross country.
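[Editor's note: the "figure out which disk is the bad one" hunt above is usually done by scanning the controller's physical-drive listing. A minimal sketch follows; the binary path (/usr/bin/MegaCli64) comes from the log itself, the `-PDList -aALL` invocation is the commonly used listing command, and the sample output below is illustrative, not captured from virt1.]

```python
# Sketch: pick the non-Online drives out of MegaCli physical-drive output.
# Typically fed from:  /usr/bin/MegaCli64 -PDList -aALL

def find_bad_drives(pdlist_output):
    """Return (enclosure, slot, state) tuples for drives whose
    'Firmware state' is not an Online state."""
    bad = []
    enclosure = slot = None
    for line in pdlist_output.splitlines():
        line = line.strip()
        if line.startswith("Enclosure Device ID:"):
            enclosure = line.split(":", 1)[1].strip()
        elif line.startswith("Slot Number:"):
            slot = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware state:"):
            state = line.split(":", 1)[1].strip()
            if not state.startswith("Online"):
                bad.append((enclosure, slot, state))
    return bad

# Illustrative sample output (two drives, one failed):
sample = """\
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Failed
"""
print(find_bad_drives(sample))  # [('32', '1', 'Failed')]
```

In practice the function would be fed `subprocess.check_output` of the MegaCli command run as root on virt1.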
[08:25:11] will do :)
[08:25:43] Oh god, you have a scary gtalk picture
[08:25:49] And pidgen makes scary noises
[08:25:53] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.35, 1.40, 2.78
[08:25:55] * Damianz fixes his config
[08:26:12] hahaha
[08:26:22] I *still* don't understand how I have that pic for gtalk
[08:26:32] I've changed it everywhere and it still pops up somehow
[08:26:51] Lol, I'm pretty sure mine is a dog or something but my profile pic is totally different.
[08:30:52] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[08:33:42] ok. I'm off
[08:33:44] * Ryan_Lane waves
[08:33:54] * Damianz nods and waves
[09:01:02] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[09:31:12] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[10:00:43] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[10:01:38] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[10:08:47] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[10:31:41] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[11:01:54] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[11:04:24] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK
[11:34:27] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:05:21] PROBLEM host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:08:39] ACKNOWLEDGEMENT host: gluster-devstack-brick1 is DOWN address: i-00000275 CRITICAL - Host Unreachable (i-00000275)
[12:47:39] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 71 MB (5% inode=40%):
[14:13:19] hello
[15:00:44] and here I am again
[15:02:22] paravoid: if you are wiling to have some fun, we can review my puppet changes in test branch :-D
[15:02:23] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:test+topic:test,n,z
[15:02:47] will do
[15:03:58] gah, gerrit is so hard to use
[15:04:40] the good news is that you can do most of the stuff through the CLI :D
[15:05:20] Gerrit sucks UX wise
[15:05:31] {{POV}} !!!
[15:21:38] 05/16/2012 - 15:21:38 - Creating a home directory for aude at /export/home/wikidata-dev/aude
[15:22:37] 05/16/2012 - 15:22:36 - Updating keys for aude at /export/home/wikidata-dev/aude
[15:30:34] New review: Hashar; "That indeed solved the issue. dbdump has some archives in /home/wikipedia/logs" [operations/puppet] (test); V: 1 C: 1; - https://gerrit.wikimedia.org/r/7746
[15:30:48] paravoid: can you merge that one please? https://gerrit.wikimedia.org/r/#/c/7746/
[15:31:04] log rotate does not create olddir, that change makes puppet creates it for us :)
[15:32:23] in the utilities class?
[15:32:31] ah, no
[15:33:03] <^demon> hashar: Did you take a look at that thing I mentioned yesterday?
[15:33:12] I find it dumb that logrotate does not create the olddir :-(
[15:33:25] ^demon: no sorry. Woke up like 1 hour and a half ago :(
[15:33:31] ^demon: can you send me the link again?
[15:33:38] <^demon> Sure :)
[15:33:49] <^demon> https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#_a_id_changemerge_a_section_changemerge
[15:34:27] ohhh
[15:34:40] so that will prevents us from submitting a change that can not be merged?
[15:34:43] am I correct?
[15:34:45] <^demon> So the idea is "Submit" will be hidden if a dry-merge fails.
[15:34:46] <^demon> Yep
[15:35:01] I guess it is harmless
[15:35:27] hashar: eh?!
[15:35:30] how does that work?
[15:35:35] it needs ensure => directory, doesn't it? [15:35:38] <^demon> I asked #gerrit. There's a minor performance hit when you first load a change, but the result of that merge (based on sha1) is held indefinitely, so only the first load is slower. [15:35:46] ^demon: isn't there a [Rebase] button to? [15:36:24] plus, 0755 should be avoided with puppet, 0644 is "preferred" (puppet converts that to 0755 for directories) [15:36:27] paravoid: indeed :-D [15:36:44] paravoid: yeah I already had that discussion about 0755 versus 0644 [15:36:51] <^demon> hashar: 2.4 I believe. Haven't tested. [15:36:54] <^demon> rc1 came out today [15:37:03] <^demon> Might be master/2.5 though :\ [15:37:26] paravoid: since they are directories, it seems more logical to me to use the interpolated mode (aka 0755) on directories [15:37:39] <^demon> I'm afraid hiding the button is going to confuse people though. Maybe we could float the idea on wikitech-l. [15:37:57] <^demon> I personally think it'd be less annoying than merging only to have gerrit yell at you, but better get consensus :) [15:38:08] the reasoning behind adding +x on 0644 directories is that you can recurse [15:38:42] you can say /foo/bar, mode => 0644, ensure => present, recurse => true and copy all of the contents with 0644 for files and 0755 for dirs/subdirs [15:40:05] ^demon: when the Submit button is hidden, is there an informative message saying something like "change need rebase before submission" ? [15:40:13] ^demon: else it is going to confuse me a lot :-] [15:40:34] paravoid: yeah I got the recursion point to. Then I am sometime asked to not use recursion hehe [15:40:35] <^demon> Don't know. We can enable it on gerrit-dev and find out [15:40:40] paravoid: I will switch to 0644 :-] [15:42:06] New patchset: Hashar; "(bug 36872) logrotate require archive dir to be created!" 
[operations/puppet] (test) - https://gerrit.wikimedia.org/r/7746 [15:42:15] paravoid: updated [15:42:21] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7746 [15:42:33] ^demon: or be bold and wait for people to complain :-]]]]]]]]]]]]]]]]] [15:42:46] <^demon> Noooooooo, then I'll get people yelling at me ;-) [15:42:51] <^demon> And I don't like being yelled at [15:43:08] that is why we have project managers [15:43:12] tech staff break stuff [15:43:18] and customers yell at the project manager :-] [15:43:28] then everyone is happy [15:44:50] <^demon> I'm the project manager on git migration too. [15:44:56] <^demon> So I get to wear that hat and get yelled at :) [15:45:04] * Damianz shoots ^demon's hat [15:45:16] :hashar- hello hashar [15:45:42] bonjour tparveen [15:45:51] I really should get a toolserver account hmm, this is going to be painful to do using the api -> sees days to weeks lead time on new accounts, thinks api it is [15:45:58] ^demon: so you will want to mail wikitech-l with the proposal :-D [15:46:12] :hashar- :) wanted to check how labs is doing [15:46:12] <^demon> Yeah I'm about halfway through an e-mail. [15:46:17] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7746 [15:46:19] ^demon: the only drawback I can see is that some people will wonder why they are no longer able to submit [15:46:19] Change merged: Faidon; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7746 [15:46:45] <^demon> !log gerrit set changeMerge.test to true in gerrit.config. Testing fun features :) [15:46:46] Logged the message, Master [15:46:54] <^demon> No bots? Lame. [15:47:27] !log deployment-pre Running puppet on dbdump for https://gerrit.wikimedia.org/r/7746 [15:47:27] deployment-pre is not a valid project. [15:47:29] hashar: I'm kinda lost with gerrit, what other changes do you have for me?
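The two puppet points in this review thread (logrotate will not create olddir itself, so puppet has to ensure the directory; and using mode 0644 with recurse, which puppet promotes to 0755 on directories) could be sketched roughly like this. The class name and path are illustrative, not the actual contents of change 7746:

```puppet
# Hypothetical sketch of the fix discussed above: logrotate's olddir
# must already exist, so have puppet manage the archive directory.
class logrotate::archivedir {
    file { '/home/wikipedia/logs/archive':
        ensure  => directory,
        owner   => 'root',
        group   => 'root',
        # 0644 is the "preferred" mode; when recursing, puppet applies
        # 0644 to files and adds +x (0755) to directories automatically.
        mode    => '0644',
        recurse => true,
    }
}
```

With this in place, a logrotate rule pointing its `olddir` at that path no longer fails on a fresh instance.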
[15:47:34] hashar: I see two [15:48:06] owner:hashar project:operations/puppet branch:test status:open [15:48:09] https://gerrit.wikimedia.org/r/#/q/owner:hashar+project:operations/puppet+branch:test+status:open,n,z [15:48:21] there are a few others [15:48:39] the "class to install the {tree,ack-grep}" has been reviewed by mutante IIRC [15:48:55] oh not the ack-grep [15:49:10] <^demon> I still don't know why you need a class for that. You can just copy the standalone to ~/bin :\ [15:49:16] <^demon> re: ack-grep [15:49:54] I have been asked to use classes to install packages :-D [15:49:59] so I create classes! [15:50:16] <^demon> I'm saying skip the class and the package. [15:50:20] <^demon> No package needed ;-) [16:02:40] New patchset: Hashar; "puppet exec enforces full paths; that's cool." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7794 [16:02:54] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7794 [16:03:09] New review: Hashar; "Trying out cherry-pick from production." [operations/puppet] (test); V: 0 C: 1; - https://gerrit.wikimedia.org/r/7794 [16:08:30] yeah blank page http://commons.wikimedia.beta.wmflabs.org/ [16:08:36] and I have ZERO error log :D [16:14:49] solved [16:15:27] what was it? [16:18:26] I am editing the CommonSettings.php file [16:18:37] I have enabled the multi version system on labs although it is broken [16:18:41] thus triggered white pages [16:18:56] multi version is what lets us support several wiki versions on the cluster [16:19:10] it is replaced by a hack on labs, I have no idea why [16:35:51] stupid NFS [16:36:00] makes things so slow :-] [16:37:27] Anyone know the api better than me? [16:38:31] Damianz, which API? MediaWiki's? [16:38:39] mhm [16:38:50] Probably, though I'm not sure how well you know it [16:40:13] If I have a page id (and title, if that matters) and a revid how hard is it to pull the previous rev info?
i.e. I have a user's contribs, they reverted someone, and I want the reverted edit's info. Can't figure out how to pull the rev id without pulling the whole article's history and looping over it edit by edit until I hit my known id then shifting one back in the array (which would be painful). [16:41:20] Aha, there's an interesting question [16:41:27] I assume you've been through the docs? [16:43:09] Flicking through randomly, I just got the user contribs stuff working so I've not looked at revs in detail as I find the docs sorta confusing. [16:43:16] Damianz, from http://www.mediawiki.org/wiki/API:Undelete it looks like the revid might be sequential [16:43:20] There's like 5 ways to do something depending on exactly what you want. [16:44:55] Damianz, I think you'd be better served in #mediawiki, and by waiting a little while until some more people get into work :) [16:45:32] That's a good point, I forget about that channel as I don't tend to do api stuff (fixing a bot which broke its databases somewhat due to a bug). [16:45:36] Thanks for the pointer though :D [16:45:58] Absolutely! [16:46:16] !log deployment-prep cleaned up more of CommonSettings.php today. Moved some hacks to disable features as settings in InitialiseSettingsDeploy.php . See git log. [16:46:17] Logged the message, Master [16:46:19] and I am out [16:46:25] maybe [16:51:23] Actually, it seems http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Aquiline%20nose&rvstartid=492880920&rvlimit=2 works. Little hacky but should do. [16:58:33] Change abandoned: Hashar; "will merge that in test later." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7794 [17:05:05] New review: Faidon; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7255 [17:06:58] FFS [17:07:07] Thanks ubuntu for just overwriting that file without asking me [17:08:27] Ubuntu: Just sit back, we'll trash your filesystem for you.
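The rvstartid/rvlimit=2 trick Damianz settles on above works because prop=revisions returns revisions newest-to-oldest starting at the given id, so the second entry is the edit that got reverted. A rough Python sketch; the helper names and the stubbed response are mine, not from the bot:

```python
def prev_revision_params(title, revid):
    """Build MediaWiki API params fetching a revision plus its predecessor.

    With rvstartid=<revid> and rvlimit=2 (default direction: older),
    the API returns the known revision first and the one before it second.
    """
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvstartid": revid,
        "rvlimit": 2,
        "format": "json",
    }

def previous_revision(response):
    """Pull the predecessor revision out of a decoded query response."""
    page = next(iter(response["query"]["pages"].values()))
    revs = page["revisions"]
    return revs[1] if len(revs) > 1 else None  # None: no earlier revision

# Example against a stubbed response shaped like the live API output:
fake = {"query": {"pages": {"123": {"revisions": [
    {"revid": 492880920, "user": "A"},
    {"revid": 492880100, "user": "B"},
]}}}}
print(previous_revision(fake)["revid"])  # -> 492880100
```

Sending those params to `/w/api.php` (e.g. with urllib) and feeding the decoded JSON to `previous_revision` avoids walking the whole page history.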
[17:08:53] Yeah, whatever happened to DUDE THAT FILE EXISTS, SURE YOU WANT TO OVERWRITE IT [17:08:58] 45min of work trashed ;( [17:27:38] 05/16/2012 - 17:27:38 - Creating a home directory for vrandezo at /export/home/wikidata-dev/vrandezo [17:28:39] 05/16/2012 - 17:28:39 - Updating keys for vrandezo at /export/home/wikidata-dev/vrandezo [17:33:14] 05/16/2012 - 17:33:14 - Creating a home directory for vrandezo at /export/home/bastion/vrandezo [17:34:15] 05/16/2012 - 17:34:14 - Updating keys for vrandezo at /export/home/bastion/vrandezo [17:34:22] is it intentional that anyone can add members to the bastion project? [17:34:49] Daniel_WMDE: eh? [17:38:09] paravoid: "eh?" to which part? [17:38:55] Ryan_Lane: ^ ? [17:39:06] Daniel_WMDE: sounds wrong to me... [17:39:06] yes [17:39:15] but Ryan would know better [17:39:27] ryan: ok then :) [17:39:36] though it's something I've considered changing [17:40:28] and likely will before making labsconsole open registration [17:40:40] ic [17:41:01] Ryan_Lane: denny (vrandezo) is having problems logging into bastion (connection closed) [17:41:03] any idea? [17:41:14] is he in the project? [17:41:18] is his key valid? [17:41:32] Ryan_Lane: "likely"? :-) [17:41:42] i added him to the project (hence my question) [17:41:47] paravoid: not too sure yet [17:42:06] ah. heh. it just logged to the channel that his home directory is there now :) [17:42:15] he has a valid key uploaded, and is using ssh agent [17:42:25] we are not checking that it's actually the correct key :) [17:42:30] oh [17:42:34] I have a feeling I know why [17:42:54] <^demon> Ryan_Lane: Thanks for reviewing + merging that hook refactoring Monday. [17:42:54] have him try now [17:42:58] ^demon: yw [17:43:08] ^demon: and fixing it too, you mean, right? :D [17:43:11] what was it? [17:43:19] nscd negative cache [17:43:29] he tried logging in before he was in the project-bastion group [17:43:37] Ryan_Lane: works now, thanks!
[17:43:38] so, his account was cached [17:43:51] ah [17:43:56] well, I guess not technically negative cache :) [17:44:15] yeah, this happened to me 1-2 weeks ago or so [17:44:22] the positive cache is about an hour [17:44:28] hm [17:44:30] for passwd [17:44:35] 6 hours for group [17:44:51] that's a really long TTL [17:49:34] <^demon> release notes for 2.4 are up (rc1 is out): http://gerrit-documentation.googlecode.com/svn/ReleaseNotes/ReleaseNotes-2.4.html [17:52:19] That's quick [17:52:25] 2.3 is still fairly recent [17:52:37] 05/16/2012 - 17:52:37 - Updating keys for vrandezo at /export/home/wikidata-dev/vrandezo [17:52:56] <^demon> 2.3 sat in rc status for far too long :) [17:52:59] issue 1035 Add rebase button to the change screen [17:53:01] Woot [17:53:07] Asynchronously send email so it does not block the UI [17:53:13] * RoanKattouw sighs deeply [17:53:15] 05/16/2012 - 17:53:15 - Updating keys for vrandezo at /export/home/bastion/vrandezo [18:48:35] PROBLEM Puppet freshness is now: CRITICAL on deployment-syslog i-00000269 output: Puppet has not run in last 20 hours [18:48:37] Ryan_Lane: can you create a labs account for Dan Foy [18:48:54] did he fill out the developer access page? [18:49:05] I need the info requested there [18:49:42] Ryan_Lane: this page http://www.mediawiki.org/w/index.php?title=Developer_access&action=edit&section=new&preload=Template:Developer_access_request_preload&editintro=Template:Developer_access_request_editnotice [18:50:46] preilly: ok. gimme a sec [18:50:52] Ryan_Lane: okay, I've sent it to him [18:51:01] oh [18:51:02] yeah [18:51:03] that one [18:51:12] Ryan_Lane: once he fills it out I'll let you know [18:51:20] !log wikistats - updating mw versions for wmf wikis (heh, while deployment is going on) [18:51:22] Logged the message, Master [18:51:22] ok [19:13:12] PROBLEM Current Load is now: CRITICAL on demo-deployment1 i-00000276 output: CHECK_NRPE: Socket timeout after 10 seconds.
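The nscd behaviour being described (positive lookups for passwd cached about an hour, group entries for six) comes from the TTL settings in /etc/nscd.conf; a fragment matching those numbers would look like this. This is illustrative, not the actual labs configuration:

```
# /etc/nscd.conf (fragment) - times are in seconds
positive-time-to-live  passwd  3600    # cached passwd hits live 1 hour
negative-time-to-live  passwd  20      # "no such user" only cached briefly
positive-time-to-live  group   21600   # cached group hits live 6 hours
negative-time-to-live  group   60
```

This also explains the failure mode above: an account looked up before it existed in the project group can stay wrong in the cache until the TTL expires or nscd is restarted.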
[19:16:03] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 8.53, 15.70, 9.54 [19:16:37] what the fuck. just doing a copy of a repo spikes load? must. get. off. gluster. [19:18:13] RECOVERY Current Load is now: OK on demo-deployment1 i-00000276 output: OK - load average: 7.17, 5.68, 3.06 [19:22:10] Ryan_Lane: Ceph has a sexy site and irc now :D [19:22:26] we're considering ceph [19:22:47] <^demon> How about nfs? [19:22:49] * ^demon hides [19:23:34] ^demon: How about tcp over carrier pigeon [19:23:40] ^demon: heh [19:36:12] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.77, 2.19, 4.28 [19:36:50] paravoid: any way we'll have per-project puppet by Berlin? [19:38:40] PROBLEM host: gerrit-bots is DOWN address: i-00000272 check_ping: Invalid hostname/address - i-00000272 [19:39:12] Gerrit can haz no bots [19:39:30] * Damianz wonders about ganglia stuff [20:02:14] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [20:07:00] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.008 second response time [20:30:21] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 2.09, 11.08, 6.96 [20:40:21] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.12, 1.62, 3.70 [20:49:24] Thehelpfulone: you around? [20:49:31] yep [20:51:08] I have clearance now to run any and all of Fbot from the toolserver or labs [20:51:17] so now I need your help setting things up [20:52:25] ok [20:52:31] and just to get things clear, what's Fbot? [20:52:47] Fastily [20:52:50] 's bot [20:52:57] does file namespace work on Wikipedia [20:53:01] right, so what have you done so far in terms of labs setup?
[20:53:06] I run copies of some tasks from Svenbot [20:54:14] Sven_Manguard: have you done anything to do with labs setup so far, or are we starting from scratch? [20:54:30] I have an account [20:54:41] you told me you were going to walk me through something [20:54:52] err with SHA? SSH? SHH? [20:54:55] yep ok, so did you install the software I told you to? [20:55:34] No! [20:55:43] That would require... err... competance [20:55:43] ok, let's start from the beginning then :) [20:56:29] Totally just raping mysql right now [21:02:02] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [21:19:06] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [21:29:08] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [21:35:09] 05/16/2012 - 21:35:09 - Creating a home directory for svenmanguard at /export/home/bots/svenmanguard [21:35:25] !? [21:36:11] 05/16/2012 - 21:36:11 - Updating keys for svenmanguard at /export/home/bots/svenmanguard [21:36:20] !? [21:36:25] that's fine Sven_Manguard [21:36:27] Thehelpfulone: wha? [21:36:47] I'll be back in 5 mins Thehelpfulone [21:36:52] ok :) [22:17:15] 05/16/2012 - 22:17:15 - Creating a home directory for svenmanguard at /export/home/bastion/svenmanguard [22:17:32] Thehelpfulone: !? [22:17:43] that was me [22:17:53] don't worry, it means I was giving you access to be able to login to the server [22:18:00] aww sweet [22:18:14] 05/16/2012 - 22:18:14 - Updating keys for svenmanguard at /export/home/bastion/svenmanguard [22:18:29] soon I will be able to use labs to do evil things - like run approved bots and hold the door open for the elderly [22:18:41] and download porn [22:18:49] and feed homeless kittens [22:18:54] Did he sign the contract which states we can perform experiments on him while he's alive yet? 
[22:19:14] Damianz: Too late, I'm a lich [22:19:39] I can and will freeze your testicles off and market them to the serfs as meat popsicles if you annoy me [22:20:32] Testicles are not required [22:21:34] You also assume he's male [22:22:01] I assume his name is Damian, which is a male name Reedy [22:22:37] Could be his surname [22:23:12] Might be female in another country/culture [22:23:55] The Internet: where men are men, women are men, and children are FBI agents [22:24:25] Reedy: or he could actually be Damian Zaremba [22:24:47] I could, or I could be named BobTheBuilder and just choose to have that mask. [22:25:23] Damianz: hard to pull off a fake mask, fake social networking profiles, and everything else [22:25:36] Not really [22:25:52] I could sign up for a facebook, twitter account, create a blog and wp account then request the mask. [22:26:01] Or I could just go social engineer Reedy and login as him [22:26:01] :D [22:26:39] I convinced Logan that my name was Shaniqua Kwon once, so it's not too hard, I suppose [22:27:56] Reedy, he is in the internet [22:28:02] In it? [22:28:03] Wow. [22:28:10] there are no women there ;) [22:28:22] not true [22:28:29] In it? we all sat in lcarr's ports again? :D [22:28:31] Sue goes on IRC sometimes [22:42:10] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.015 second response time [22:47:51] http://postimage.org/image/e4koh1b0z/6050fc0d/ [22:47:56] Thehelpfulone: that ^ [22:48:15] oh :P [22:48:19] you say yes to that [22:48:26] Say yes to everything [22:48:35] Including the button that sends us your credit card details [22:48:57] Also, imgur.com > random slow site [22:49:01] new error [22:49:14] PuTTy fatal error [22:49:24] screenshot? 
[22:49:34] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [22:49:37] "Server unexpectedly closed network connection" [22:49:38] that's it [22:49:54] ok, so try to reconnect again [22:50:02] using bastion.wmflabs.org as your host name [22:52:00] and the same other settings? [22:52:23] yes [22:52:31] login as: [22:52:33] ! [22:52:42] so now you get to experiment [22:52:45] try svenmanguard first [22:53:20] gave me a shitton of text [22:53:34] yep that's fine [22:53:48] won't let me Control+C it, it instead says svenmanguard@bastion1:~$ ^C [22:54:32] ok [22:54:44] so now backspace to remove the ^C bit [22:54:47] and type: [22:54:52] ssh -A bots-3 [22:55:18] when it asks you another question about a new host or something, you say yes again [22:55:21] Why forward the agent to bots-3? [22:55:47] authenticity not established, key is (yes/no) [22:55:50] so I say yes? [22:56:03] Sven_Manguard: you say yes [22:56:33] ah yes sorry you don't need the -A [22:56:35] just ssh bots-3 [22:58:13] so type no [22:58:18] then ssh bots-3 [22:58:25] type yes [22:58:30] then don't type anything else :) [22:58:34] I typed no [22:58:38] ok [22:58:42] because I had the -A in it [22:58:48] so now type ssh bots-3 [22:58:51] You know, someone should really screenshot this from the point of a total idiot and put it on the wiki [22:58:51] then type yes [22:59:20] Damianz: are you volunteering? ;) [22:59:43] Damianz: are you insulting me? I can't tell because I'm a total idiot [22:59:44] Sure, because I'm awesome at writing documentation for idiots and I totally have a windows box. [23:00:00] also: fuck off [23:00:15] Thehelpfulone: back to PM please [23:00:25] Sven_Manguard: woah, he's just trying to prove a point that there's no clear documentation at the moment [23:00:31] It's only insulting if you're insulted, though in this case I was just thinking more of the 4 thousand times this gets repeated.
[23:00:48] Lack of documentation = having to repeat instructions = painful for all parties. [23:04:11] Reading recent backscroll. Please remember to be friendly, everyone. [23:05:19] * Damianz notes -> substitute 'idiots' for 'users' so sentences have the same meaning while idio...users don't feel insulted. :D [23:30:13] Damianz, how do you run .bats on labs? [23:31:02] if you go to /home/svenmanguard/bot on bots-3 Sven is trying to run that bot [23:32:01] bat is a windows batch file usually [23:32:10] which likely won't run [23:32:10] yes [23:32:13] unless you wine it [23:32:13] that's what I thought [23:32:15] * Reedy barfs [23:32:15] fuck [23:32:17] That's windows commands [23:32:21] The bot is java though [23:32:26] Can easily port that to a shell script [23:32:28] aye [23:32:31] yeah, so how do you make it run? [23:32:33] java fooba.jar [23:32:40] Pastebin the content of the batch file? [23:32:42] Damianz: I don't have the source code because Sigma found it on the internet and stole it so Fastily nuked everything [23:33:01] http://pastebin.com/AALEAgZ7 [23:33:10] Sven_Manguard: that .jar is the source code [23:33:13] Of what? jars are a) decompilable and b) cross platform [23:33:22] java -jar Fbot.jar -z [23:33:27] just use that [23:33:49] okay, there's a second problem [23:33:57] no java installed? [23:33:57] anyone here know Java? [23:33:59] that doesn't look good Reedy, http://pastebin.com/z1aAxzTp [23:34:10] Class error [23:34:20] yeah, so it's not installed? [23:34:24] it's broken?
[23:34:26] missing libraries [23:34:33] I think that's an openjdk vs sun library thing [23:34:45] Reedy: it uses the big library that a lot of the bots use [23:34:50] fuck yeah, proprietary [23:34:52] Someone with an M made it [23:35:56] * Damianz waits [23:36:29] !log bots Installed default-jre on bots-3 for svenmanguard's bot [23:36:30] Logged the message, Master [23:36:49] looks like it needs X - you'll have to set up X forwarding to your local machine [23:38:12] How the hell do some revisions have no user or ip associated *mind blown* [23:38:44] X forwarding Damianz? That's new to me [23:40:25] Does the bot have some form of gui thing? [23:40:41] I don't really know java that well but it looks like it's trying to throw a gui window up for some login details. [23:40:46] yea [23:40:49] it did that [23:40:54] enter password field [23:41:03] lo. [23:41:05] l. [23:41:15] He did it for my benefit you arse [23:41:30] I have older versions too, but they don't work with Fastily's on-wiki pages gone [23:41:40] you want those Thehelpfulone? [23:41:52] On Wiki pages can always be recreated [23:41:53] Giving a window isn't any easier than a CLI asking for the same details [23:42:34] * Damianz votes you decompile the jar, edit the source and profit [23:42:44] "decompile" -> "unzip" [23:42:50] But it should work, if you have xforwarding setup [23:42:51] Sven_Manguard: you take things far too seriously and personally :) [23:43:13] Reedy: Well yeah, but un-doing byte compiled code is easy too if needed. [23:43:16] Thehelpfulone: sorry, I'm still raw over Fastily being chased off the project by assholes with ANI threads [23:43:25] Thehelpfulone: Did you not get the memo? The world is out to get you personally.
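For anyone not on PuTTY, the two-hop login being walked through here (bastion, then ssh to bots-3, with X forwarding so the bot's Java login window can display locally) looks roughly like this with OpenSSH. The Host aliases are illustrative, not an official labs config:

```
# ~/.ssh/config (fragment)
Host bastion
    HostName bastion.wmflabs.org
    ForwardAgent yes     # lets the second hop reuse your local key

Host bots-3
    ProxyJump bastion    # OpenSSH >= 7.3; older: ProxyCommand ssh -W %h:%p bastion
    ForwardX11 yes       # GUI windows from the instance display locally
```

With that in place, `ssh bots-3` followed by `java -jar Fbot.jar -z` should let the bot's password prompt open on the local desktop (an X server must be running locally, e.g. Xming on Windows).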
[23:43:53] Damianz: oh my memoserv must be full, thanks for the heads-up :P [23:43:56] <3 ani [23:46:04] * Damianz think's it's nearly 1am and he should go to bed before work [23:46:31] eh [23:46:33] work is overrated [23:46:53] True but we're doing datacentre audit stuff tomorrow which is useful. [23:47:12] Best wishes - hope it goes well. [23:51:10] so Thehelpfulone now what? [23:51:37] ??? [23:51:43] I'm stumped. I don't know what x forwarding is. Reedy? [23:52:07] http://lmgtfy.com/?q=x-forwarding [23:54:59] oh thanks. I'm in the nearly 1am group right now, can't handle that search. :P What needs to be changed in the source? [23:55:11] Nothing really [23:55:32] depending on the os on your host machine, it's going to be more/less complex to setup [23:55:40] s/host/local/ [23:56:10] ok so I think you want http://www.math.umn.edu/systems_guide/putty_xwin32.html Sven_Manguard [23:56:17] I'll download and install it too [23:57:34] Thehelpfulone: okay, so let's be clear on this before I go any further [23:58:05] If I jump through all of these hoops I will be able to run the Fbot tasks on a timer, and not have to have my computer on as they run. Correct or no? [23:58:12] Because one task takes 16 hours [23:58:26] and my laptop fan died in Beijing of... well... Beijing air [23:59:40] it's the number two killer of laptops in Beijing, behind shockingly high levels of malware