[00:11:35] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory
[01:16:35] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 20% free memory
[01:21:25] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 19% free memory
[02:01:55] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0 output: Warning: 19% free memory
[02:41:55] RECOVERY Free ram is now: OK on aggregator2 i-000002c0 output: OK: 20% free memory
[02:49:55] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0 output: Warning: 19% free memory
[03:28:45] PROBLEM Free ram is now: WARNING on mwreview-test2 i-000002cd output: Warning: 8% free memory
[03:34:35] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory
[03:35:05] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 12% free memory
[03:37:58] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[03:49:28] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 16% free memory
[03:53:48] RECOVERY Free ram is now: OK on mwreview-test2 i-000002cd output: OK: 41% free memory
[03:55:08] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 4% free memory
[03:57:58] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 3% free memory
[03:59:58] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 13% free memory
[04:00:08] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[04:02:58] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:09:28] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory
[04:14:28] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:14:58] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 4% free memory
[04:19:58] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 94% free memory
[04:37:58] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 12% free memory
[04:42:58] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory
[06:36:29] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:38:09] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:40:37] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 19% free memory
[06:41:37] PROBLEM Free ram is now: WARNING on incubator-bot1 i-00000251 output: Warning: 17% free memory
[06:42:47] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK
[06:48:38] PROBLEM Current Load is now: CRITICAL on jenkins2 i-00000102 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:52:48] PROBLEM Disk Space is now: CRITICAL on ipv6test1 i-00000282 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:06] RECOVERY Current Load is now: OK on jenkins2 i-00000102 output: OK - load average: 0.53, 3.29, 4.43
[06:55:32] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:58:47] PROBLEM Disk Space is now: WARNING on ipv6test1 i-00000282 output: DISK WARNING - free space: / 70 MB (5% inode=57%):
[06:59:31] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:31] PROBLEM Free ram is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:31] PROBLEM Disk Space is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:40] PROBLEM Current Users is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:41] PROBLEM Current Load is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:41] PROBLEM Total Processes is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:13] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0 output: Warning: 17% free memory
[07:00:13] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 5.13, 10.59, 7.44
[07:00:13] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 3.16, 5.04, 5.53
[07:00:52] PROBLEM dpkg-check is now: CRITICAL on incubator-bot0 i-00000296 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:52] PROBLEM Total Processes is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:59] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:59] PROBLEM Current Users is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:59] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:04] PROBLEM Free ram is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:04] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:04] PROBLEM Current Load is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:04] PROBLEM Disk Space is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:04] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:05] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:05] PROBLEM Free ram is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:06] PROBLEM Current Users is now: CRITICAL on e3 i-00000291 output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:01:51] PROBLEM Current Users is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:18] PROBLEM Current Load is now: WARNING on deployment-jobrunner05 i-0000028c output: WARNING - load average: 9.68, 8.15, 5.31
[07:03:23] PROBLEM Free ram is now: CRITICAL on dumps-incr i-000002bb output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:23] PROBLEM dpkg-check is now: CRITICAL on dumps-incr i-000002bb output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:23] PROBLEM Current Load is now: CRITICAL on dumps-incr i-000002bb output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:23] PROBLEM Total Processes is now: CRITICAL on dumps-incr i-000002bb output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:28] PROBLEM Current Users is now: CRITICAL on dumps-incr i-000002bb output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:28] PROBLEM Disk Space is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:03:44] RECOVERY dpkg-check is now: OK on aggregator2 i-000002c0 output: All packages OK
[07:04:24] PROBLEM Free ram is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:24] PROBLEM Current Users is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:24] PROBLEM dpkg-check is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:24] PROBLEM Total Processes is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:29] PROBLEM Current Load is now: CRITICAL on mobile-wlm i-000002bc output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:05:14] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 2.23, 3.36, 4.67
[07:05:14] RECOVERY Total Processes is now: OK on e3 i-00000291 output: PROCS OK: 95 processes
[07:05:19] PROBLEM Current Load is now: WARNING on e3 i-00000291 output: WARNING - load average: 3.84, 7.01, 5.48
[07:05:19] RECOVERY Disk Space is now: OK on e3 i-00000291 output: DISK OK
[07:05:19] RECOVERY dpkg-check is now: OK on e3 i-00000291 output: All packages OK
[07:05:19] RECOVERY Free ram is now: OK on e3 i-00000291 output: OK: 93% free memory
[07:05:19] RECOVERY Current Users is now: OK on e3 i-00000291 output: USERS OK - 0 users currently logged in
[07:05:24] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000234 output: OK - load average: 6.03, 5.97, 3.97
[07:05:24] RECOVERY Current Users is now: OK on pediapress-ocg2 i-00000234 output: USERS OK - 0 users currently logged in
[07:05:24] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000234 output: DISK OK
[07:05:24] RECOVERY Total Processes is now: OK on pediapress-ocg2 i-00000234 output: PROCS OK: 107 processes
[07:05:30] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK
[07:05:30] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 78% free memory
[07:06:24] RECOVERY Current Users is now: OK on tw-next i-0000027e output: USERS OK - 0 users currently logged in
[07:08:14] RECOVERY Free ram is now: OK on dumps-incr i-000002bb output: OK: 93% free memory
[07:08:14] RECOVERY dpkg-check is now: OK on dumps-incr i-000002bb output: All packages OK
[07:08:14] RECOVERY Current Load is now: OK on dumps-incr i-000002bb output: OK - load average: 0.32, 1.91, 1.69
[07:08:14] RECOVERY Total Processes is now: OK on dumps-incr i-000002bb output: PROCS OK: 91 processes
[07:08:19] RECOVERY Current Users is now: OK on dumps-incr i-000002bb output: USERS OK - 1 users currently logged in
[07:08:19] RECOVERY Disk Space is now: OK on mobile-wlm i-000002bc output: DISK OK
[07:08:30] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner05 i-0000028c output: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:19] RECOVERY Current Users is now: OK on mobile-wlm i-000002bc output: USERS OK - 0 users currently logged in
[07:09:19] RECOVERY Free ram is now: OK on mobile-wlm i-000002bc output: OK: 80% free memory
[07:09:19] RECOVERY Total Processes is now: OK on mobile-wlm i-000002bc output: PROCS OK: 100 processes
[07:09:24] RECOVERY dpkg-check is now: OK on mobile-wlm i-000002bc output: All packages OK
[07:09:24] RECOVERY Current Load is now: OK on mobile-wlm i-000002bc output: OK - load average: 2.12, 4.10, 3.41
[07:10:14] RECOVERY Current Load is now: OK on e3 i-00000291 output: OK - load average: 0.46, 2.74, 4.03
[07:13:14] RECOVERY Current Load is now: OK on deployment-jobrunner05 i-0000028c output: OK - load average: 0.50, 2.25, 3.60
[07:15:14] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.64, 0.89, 3.12
[07:20:14] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.26, 0.49, 2.33
[10:50:30] !accountreq
[10:50:30] case you want to have an account on labs please read here: https://labsconsole.wikimedia.org/wiki/Help:Access#Access_FAQ
[11:09:53] Thehelpfulone: there?
[11:11:32] Hydriz: just popping out, will be back in 15 minutes
[11:11:57] ping me when you are back :)
[11:12:06] sure
[11:12:11] got to discuss about the interwiki.py instances you are running
[11:12:12] :P
[11:46:11] Hydriz: back
[11:46:17] oh great :)
[11:46:20] what trouble am I in? ;)
[11:46:26] not trouble heh
[11:46:41] but did anyone assign you a bots project server that you can/should use?
[11:46:57] nope, I was told to use bots-3
[11:47:04] hmm... interesting
[11:47:06] but Ryan_Lane said if that gets too clogged I can use bots-4
[11:47:18] cos its down to 500MB left
[11:47:18] do you think we should be assigned an instance?
[11:47:32] not assigned though
[11:47:39] just when a server is free, just use it
[11:47:54] and I was thinking if it would be better if you use bots-4
[11:47:56] its empty
[11:47:58] well I'd rather just stay on one instance
[11:48:09] ok I can move
[11:49:06] do I just need to copy my crontab from one to the other?
[11:49:19] because of the auto mount everything else (my files) should be there right?
[11:49:24] yep
[11:49:30] not sure about the crontab though
[11:49:37] * Hydriz does not use cron at all :P
[11:49:44] or anything related to it
[11:49:47] screens are per instance too right?
[11:49:51] Yep
[11:49:57] See this: http://ganglia.wmflabs.org/latest/?c=bots
[11:50:02] ok so I need to recreate those
[11:50:08] how much am I using of bots-3?
[11:50:35] not sure though lol
[11:52:58] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[11:52:59] 10709 beetstra 20 0 90532 26m 1964 R 40.5 1.3 4077:55 perl
[11:53:00] 23686 beetstra 20 0 99364 36m 2692 S 21.6 1.8 34:42.96 perl
[11:53:01] 23149 beetstra 20 0 96228 33m 2684 S 19.6 1.6 30:13.21 perl
[11:53:01] 5631 beetstra 20 0 95356 32m 2684 R 17.6 1.6 57:56.86 perl
[11:53:06] it's not me that's using up all the memory :P
[11:53:21] which instance is that?
[11:53:33] bots-3
[11:53:48] is there again one that has eaten memory .. will check
[11:54:08] but people should start using bots-4...
[11:54:53] -> "2057692k total, 1465296k used, 592396k free," <- that is still 25% memory free
[11:55:49] I do have to find the memory leak in unblockbot .. but that one is not massive yet
[11:56:37] afaik interwiki.py leaks lots of memory
[11:56:47] that's not my stuff
[11:56:58] One of my perl-bots has a problem, but I don't know where
[11:57:22] Generally it is an array that gets pushed and popped unequally, but I don't see which one
[11:58:48] how do I view my processes?
[11:59:05] top -c?
[13:48:42] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 23% free memory
[14:13:22] PROBLEM Free ram is now: WARNING on incubator-bot2 i-00000252 output: Warning: 19% free memory
[14:26:00] Amgine: heading underground now...
[14:34:46] Tanvir: so you can SSH into bastion fine but when you try to SSH into a bots instance it doesn't like it?
[14:34:48] Hi all. I am having problem to login to bots. Successfully made to Bastion, but when I try ssh bots-1 or ssh bots-4 ; it returns to "Permission denied (publickey)". Any clue how to solve this?
[14:34:57] Oops, same time. lol
[14:34:59] heh
[14:35:07] you're in the bots group
[14:35:19] Aye, I see that.
[14:36:04] have you enabled agent forwarding?
[14:36:30] No I think. How to do that?
[14:37:20] what are you using to SSH?
[14:39:00] https://labsconsole.wikimedia.org/wiki/Access#Using_agent_forwarding
[14:40:31] Barebone:
[14:40:33] [15:37:18] what are you using to SSH?
[14:40:34] [15:38:58] https://labsconsole.wikimedia.org/wiki/Access#Using_agent_forwarding
[14:44:21] Sorry was disconnected Thehelpfulone. Last message: " what are you using to SSH?". And I don't understand what are saying, sorry. /me peferct no0b.
[14:44:51] heh ok, what's your SSH client?
[14:45:48] Barebone: there's also some instructions on agent forwarding at https://labsconsole.wikimedia.org/wiki/Access#Using_agent_forwarding
[14:46:24] Thehelpfulone, openSSH.
[14:46:50] Thehelpfulone, I did that forwarding thing I think.
[14:46:52] How to check if that works?
[14:48:08] hmm, not sure - did you follow the instructions linked above?
[14:58:40] Thehelpfulone, I did I think and I just did it again.
[15:01:07] Thehelpfulone, there? Can you add Hoo to the bots temporarily to check things?
[15:01:31] He asked for it to figure out the problem..
[15:01:50] yeah I can add him permanently
[15:01:57] but I think it's something on your end
[15:02:09] I think so too. Just don't know what it is..
[15:02:40] from bastion, try ssh -A bots-4
[15:03:13] Same thing. :-/
[15:05:34] meh I'm not so good at troubleshooting this, I think we should wait for someone who's done it more
[15:06:36] Okay.
[17:08:55] PROBLEM Total Processes is now: CRITICAL on zeromq1 i-000002b7 output: CHECK_NRPE: Socket timeout after 10 seconds.
[17:13:45] RECOVERY Total Processes is now: OK on zeromq1 i-000002b7 output: PROCS OK: 81 processes
[18:05:50] New patchset: Andrew Bogott; "Log output for mediawiki install.sh on failure." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10762
[18:06:07] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/10762
[18:09:41] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10762
[18:09:44] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/10762
[18:13:55] PROBLEM Current Load is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:14:25] PROBLEM Current Users is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:15:06] PROBLEM Disk Space is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:15:47] PROBLEM Free ram is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:16:55] PROBLEM Total Processes is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:17:35] PROBLEM dpkg-check is now: CRITICAL on mwrevew-test3 i-000002ce output: Connection refused by host
[18:55:05] RECOVERY Disk Space is now: OK on mwrevew-test3 i-000002ce output: DISK OK
[18:55:45] RECOVERY Free ram is now: OK on mwrevew-test3 i-000002ce output: OK: 92% free memory
[18:56:55] RECOVERY Total Processes is now: OK on mwrevew-test3 i-000002ce output: PROCS OK: 85 processes
[18:57:35] RECOVERY dpkg-check is now: OK on mwrevew-test3 i-000002ce output: All packages OK
[18:58:35] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 28% free memory
[18:58:45] RECOVERY Current Load is now: OK on mwrevew-test3 i-000002ce output: OK - load average: 0.38, 0.29, 0.17
[18:59:25] RECOVERY Current Users is now: OK on mwrevew-test3 i-000002ce output: USERS OK - 2 users currently logged in
[19:38:25] PROBLEM host: mwrevew-test3 is DOWN address: i-000002ce check_ping: Invalid hostname/address - i-000002ce
[19:43:45] PROBLEM Current Load is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[19:44:25] PROBLEM Current Users is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[19:45:05] PROBLEM Disk Space is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[19:45:45] PROBLEM Free ram is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[19:46:55] PROBLEM Total Processes is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[19:47:35] PROBLEM dpkg-check is now: CRITICAL on mwreview-test4 i-000002d1 output: Connection refused by host
[20:27:35] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af output: Warning: 19% free memory
[20:39:35] PROBLEM Free ram is now: WARNING on deployment-squid i-000000dc output: Warning: 19% free memory
[21:05:35] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf output: Critical: 5% free memory
[21:45:01] is it possible to have private files on labs?
[22:00:40] PROBLEM Free ram is now: WARNING on aggregator-test1 i-000002bf output: Warning: 6% free memory
[22:35:40] PROBLEM Free ram is now: CRITICAL on aggregator-test1 i-000002bf output: Critical: 5% free memory