[00:04:23] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:05:13] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:05:43] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:06:13] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:07:33] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:08:18] PROBLEM dpkg-check is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused by host
[00:11:13] RECOVERY Free ram is now: OK on ipv6test1 i-00000282 output: OK: 22% free memory
[00:19:13] PROBLEM Free ram is now: WARNING on ipv6test1 i-00000282 output: Warning: 17% free memory
[00:22:39] RECOVERY dpkg-check is now: OK on aggregator-test3 i-00000293 output: All packages OK
[00:23:49] RECOVERY Current Load is now: OK on aggregator-test3 i-00000293 output: OK - load average: 0.56, 0.81, 0.81
[00:24:31] RECOVERY Current Users is now: OK on aggregator-test3 i-00000293 output: USERS OK - 1 users currently logged in
[00:25:49] RECOVERY Free ram is now: OK on aggregator-test3 i-00000293 output: OK: 95% free memory
[00:25:49] RECOVERY Disk Space is now: OK on aggregator-test3 i-00000293 output: DISK OK
[00:26:59] RECOVERY Total Processes is now: OK on aggregator-test3 i-00000293 output: PROCS OK: 95 processes
[00:32:19] New patchset: Sara; "Install ganglia-webfrontend deb on labs ganglia::web server." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8988
[00:32:34] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8988
[00:34:57] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8988
[00:35:00] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8988
[00:37:09] New patchset: Andrew Bogott; "Not configuring user with generic_my.cnf.erb" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8989
[00:37:24] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8989
[00:38:29] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8989
[00:38:31] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8989
[00:45:35] PROBLEM dpkg-check is now: CRITICAL on aggregator-test3 i-00000293 output: DPKG CRITICAL dpkg reports broken packages
[01:07:07] ssmollett: Do you know where the git repo for the puppet master's class files live?
[01:07:25] I assume on virt0 someplace, but I can't find the actual files. Just trying to merge in a change...
[01:13:45] PROBLEM Current Load is now: WARNING on aggregator-test3 i-00000293 output: WARNING - load average: 1.73, 11.55, 7.80
[01:26:11] ssmollett: nm, I think I sorted it out.
[01:29:51] New patchset: Andrew Bogott; "Fixing generic_my.cnf again." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8992
[01:30:06] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8992
[01:30:08] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8992
[01:30:25] New review: Andrew Bogott; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8992
[01:30:28] Change merged: Andrew Bogott; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8992
[01:47:36] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 4.54, 5.07, 5.02
[01:48:14] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[01:48:55] PROBLEM host: mwr-proto is DOWN address: i-00000292 check_ping: Invalid hostname/address - i-00000292
[01:54:25] PROBLEM Current Load is now: CRITICAL on mwreview-proto i-00000295 output: Connection refused by host
[01:55:05] PROBLEM Current Users is now: CRITICAL on mwreview-proto i-00000295 output: Connection refused by host
[01:55:45] PROBLEM Disk Space is now: CRITICAL on mwreview-proto i-00000295 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:56:25] PROBLEM Free ram is now: CRITICAL on mwreview-proto i-00000295 output: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:57:35] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 5.01, 5.02, 5.00
[01:59:25] RECOVERY Current Load is now: OK on mwreview-proto i-00000295 output: OK - load average: 0.39, 0.87, 0.62
[02:00:14] RECOVERY Current Users is now: OK on mwreview-proto i-00000295 output: USERS OK - 1 users currently logged in
[02:00:44] RECOVERY Disk Space is now: OK on mwreview-proto i-00000295 output: DISK OK
[02:01:24] RECOVERY Free ram is now: OK on mwreview-proto i-00000295 output: OK: 86% free memory
[02:30:46] RECOVERY Free ram is now: OK on deployment-squid i-000000dc output: OK: 59% free memory
[02:38:30] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 21% free memory
[02:39:21] 05/26/2012 - 02:39:21 - Updating keys for laner at /export/home/deployment-prep/laner
[02:41:11] PROBLEM dpkg-check is now: CRITICAL on e3 i-00000291 output: DPKG CRITICAL dpkg reports broken packages
[02:42:20] 05/26/2012 - 02:42:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:43:20] 05/26/2012 - 02:43:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:46:01] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 17% free memory
[02:46:18] 05/26/2012 - 02:46:18 - Updating keys for laner at /export/home/deployment-prep/laner
[02:48:01] RECOVERY Puppet freshness is now: OK on bots-4 i-000000e8 output: puppet ran at Sat May 26 02:47:48 UTC 2012
[02:53:20] 05/26/2012 - 02:53:19 - Updating keys for laner at /export/home/deployment-prep/laner
[02:55:20] 05/26/2012 - 02:55:20 - Updating keys for laner at /export/home/deployment-prep/laner
[02:56:11] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 7.19, 6.31, 5.46
[02:58:01] RECOVERY Total Processes is now: OK on maps-test3 i-0000028f output: PROCS OK: 96 processes
[02:58:06] RECOVERY Disk Space is now: OK on tutorial-mysql i-0000028b output: DISK OK
[02:58:11] RECOVERY dpkg-check is now: OK on maps-test3 i-0000028f output: All packages OK
[02:58:11] RECOVERY Total Processes is now: OK on tutorial-mysql i-0000028b output: PROCS OK: 96 processes
[02:58:16] RECOVERY Free ram is now: OK on tutorial-mysql i-0000028b output: OK: 92% free memory
[02:58:41] RECOVERY dpkg-check is now: OK on tutorial-mysql i-0000028b output: All packages OK
[02:59:11] RECOVERY Current Load is now: OK on maps-test3 i-0000028f output: OK - load average: 0.14, 0.10, 0.03
[02:59:11] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK
[02:59:21] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 81% free memory
[02:59:31] RECOVERY Current Users is now: OK on maps-test3 i-0000028f output: USERS OK - 0 users currently logged in
[03:00:31] RECOVERY Disk Space is now: OK on maps-test3 i-0000028f output: DISK OK
[03:01:31] RECOVERY Free ram is now: OK on maps-test3 i-0000028f output: OK: 93% free memory
[03:01:31] RECOVERY Current Load is now: OK on tutorial-mysql i-0000028b output: OK - load average: 0.09, 0.08, 0.03
[03:01:51] RECOVERY Current Users is now: OK on tutorial-mysql i-0000028b output: USERS OK - 0 users currently logged in
[03:03:29] RECOVERY Current Load is now: OK on maps-tilemill1 i-00000294 output: OK - load average: 3.13, 3.31, 2.56
[03:03:29] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 1 users currently logged in
[03:03:29] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 113 processes
[03:03:44] RECOVERY dpkg-check is now: OK on maps-tilemill1 i-00000294 output: All packages OK
[03:06:17] RECOVERY dpkg-check is now: OK on e3 i-00000291 output: All packages OK
[03:31:11] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 2.95, 3.77, 4.61
[03:38:51] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory
[03:46:31] PROBLEM Total Processes is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:52:21] PROBLEM SSH is now: CRITICAL on maps-tilemill1 i-00000294 output: CRITICAL - Socket timeout after 10 seconds
[03:52:21] PROBLEM Disk Space is now: CRITICAL on maps-tilemill1 i-00000294 output: CHECK_NRPE: Socket timeout after 10 seconds.
[03:52:21] PROBLEM Free ram is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused or timed out
[03:52:21] PROBLEM Current Load is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused or timed out
[03:52:21] PROBLEM Current Users is now: CRITICAL on maps-tilemill1 i-00000294 output: Connection refused or timed out
[03:53:41] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 12% free memory
[03:53:55] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory
[03:57:11] RECOVERY SSH is now: OK on maps-tilemill1 i-00000294 output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[03:57:11] RECOVERY Disk Space is now: OK on maps-tilemill1 i-00000294 output: DISK OK
[03:57:21] RECOVERY Free ram is now: OK on maps-tilemill1 i-00000294 output: OK: 87% free memory
[03:57:31] RECOVERY Current Users is now: OK on maps-tilemill1 i-00000294 output: USERS OK - 0 users currently logged in
[03:58:51] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:01:31] RECOVERY Total Processes is now: OK on maps-tilemill1 i-00000294 output: PROCS OK: 121 processes
[04:04:00] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory
[04:08:54] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 14% free memory
[04:13:52] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:23:51] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 3% free memory
[04:24:01] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory
[04:28:51] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory
[04:29:01] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 3% free memory
[04:34:01] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory
[05:40:15] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 5.02, 5.19, 5.05
[05:42:53] !log deployment-prep hashar: squid was killed by linux OOM!
[05:42:59] Logged the message, Master
[05:45:15] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.97, 4.56, 4.82
[05:59:32] !log deployment-prep hashar: Edited squid.conf to limit memory to 1G and restarted squid
[05:59:33] Logged the message, Master
[06:18:16] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 4.74, 5.23, 5.08
[06:30:15] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[06:54:10] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 4.86, 8.31, 5.72
[06:56:37] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:47] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:47] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:57] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.25, 5.37, 5.41
[07:01:37] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 79% free memory
[07:01:37] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 95 processes
[07:01:42] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK
[07:04:07] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.00, 2.17, 3.66
[07:04:57] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.15, 2.65, 4.21
[07:09:10] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 1.12, 1.39, 2.89
[07:28:40] PROBLEM Current Load is now: WARNING on deployment-sql i-000000d0 output: WARNING - load average: 5.93, 5.78, 5.27
[08:17:05] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory
[08:25:05] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory
[11:20:09] PROBLEM Puppet freshness is now: CRITICAL on deployment-jobrunner05 i-0000028c output: Puppet has not run in last 20 hours
[11:48:11] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[14:03:47] What is up with labs atm?
[14:07:19] eh, whats up with it?
[14:08:34] Sorry! This site is experiencing technical difficulties.
[14:08:43] Try waiting a few minutes and reloading.
[14:08:57] as it told you, reload :P
[14:09:10] Uhm, you didn't tell me to do that
[14:09:25] It's what I've been doing but it keeps happening, and at least afaik it shouldn't
[14:09:38] do you mean the deployment wikis?
[14:09:46] ...
[14:09:47] No
[14:10:03] I mean the beta wikis
[14:10:06] on labs
[14:10:18] eh, its the same...
[14:10:27] anyway, it looks like its the mysql error thing
[14:10:41] I've always recognised "deployment" wikis as the live wikis
[14:11:06] (Can't contact the database server: Host '[some host is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' (deployment-sql))
[14:11:11] yep
[14:11:15] I am getting that too
[14:11:19] but refreshing helps
[14:11:26] It does, yes
[14:11:33] But then it happens almost straight after
[14:11:56] hmm, its a problem no doubt
[14:12:37] Now it's happening literally every time I click on a link
[14:13:05] I'm away now anyway, just thought it's something the devs/sysadmins/whatever would want to know
[16:31:14] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[17:09:22] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 5.24, 5.38, 5.09
[17:14:22] RECOVERY Current Load is now: OK on bots-sql2 i-000000af output: OK - load average: 4.18, 4.81, 4.93
[18:14:02] !log deployment-prep hashar: killed webtranscode job on commons
[18:14:03] RECOVERY Current Load is now: OK on deployment-sql i-000000d0 output: OK - load average: 0.13, 2.04, 3.90
[18:14:05] Logged the message, Master
[21:21:07] PROBLEM Puppet freshness is now: CRITICAL on deployment-jobrunner05 i-0000028c output: Puppet has not run in last 20 hours
[21:39:37] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 7.57, 6.67, 5.55
[21:49:07] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[23:14:43] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 2.48, 3.44, 4.43