[00:03:44] PROBLEM dpkg-check is now: CRITICAL on deployment-thumbproxy i-0000026b output: Connection refused by host [00:05:08] PROBLEM Current Load is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:05:44] PROBLEM Current Users is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:06:19] PROBLEM Disk Space is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:06:54] PROBLEM Free ram is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Error - Could not complete SSL handshake. [00:08:44] RECOVERY dpkg-check is now: OK on deployment-thumbproxy i-0000026b output: All packages OK [00:10:04] RECOVERY Current Load is now: OK on deployment-thumbproxy i-0000026b output: OK - load average: 0.11, 0.91, 0.97 [00:10:44] RECOVERY Current Users is now: OK on deployment-thumbproxy i-0000026b output: USERS OK - 1 users currently logged in [00:11:14] RECOVERY Disk Space is now: OK on deployment-thumbproxy i-0000026b output: DISK OK [00:11:54] RECOVERY Free ram is now: OK on deployment-thumbproxy i-0000026b output: OK: 84% free memory [00:18:08] drdee: reportcard isn't working for some reason [00:34:19] New patchset: Hashar; "stop apache when having nginx thumb proxy" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7253 [00:34:34] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7253 [00:41:24] New patchset: Hashar; "ability to change thumbnail server name" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7255 [00:41:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7255 [00:50:55] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 17% free memory [00:56:45] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [01:00:29] !deployement-prep moving upload.beta.wmflabs.org from the non working instances back to the main entry point [01:04:45] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 612 MB (3% inode=80%): [01:05:55] PROBLEM Free ram is now: CRITICAL on bots-3 i-000000e5 output: Critical: 5% free memory [01:13:33] hashar_: s/deploye/deploy/ [01:15:55] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 60% free memory [01:20:15] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [01:23:38] RoanKattouw: thx :)D [01:23:45] RoanKattouw: we really need to switch to french [01:23:52] !deployment-prep moving upload.beta.wmflabs.org from the non working instances back to the main entry point [01:23:52] deployment-prep is a project to test mediawiki at beta.wmflabs.org before putting it to prod [01:24:02] !log deployment-prep moving upload.beta.wmflabs.org from the non working instances back to the main entry point [01:24:05] Logged the message, Master [01:28:15] !log deployment-prep Created upload2.beta.wmflabs.org to be the entry point for the "new" thumbnailing infrastructure [01:28:16] Logged the message, Master [01:34:44] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [02:27:02] PROBLEM Disk Space is now: CRITICAL on deployment-apache09 i-0000025e output: Connection refused or timed out [02:28:26] !log deployment-prep deleted all remaining deployment-apache instances : : You don't have enough free space in /var/cache/apt/archives/. 
So we really want to use m1.large , not m1.tiny pretending to save disk space :-D [02:28:28] Logged the message, Master [02:28:31] good [02:28:33] danke [02:43:33] !deployment-prep setting up "apache20" instance by using only puppet. We will see what happens :-D [02:43:33] deployment-prep is a project to test mediawiki at beta.wmflabs.org before putting it to prod [02:44:14] RECOVERY Current Load is now: OK on exim-test i-00000265 output: OK - load average: 0.92, 0.50, 0.18 [02:44:45] RECOVERY Disk Space is now: OK on exim-test i-00000265 output: DISK OK [02:44:45] RECOVERY Current Users is now: OK on exim-test i-00000265 output: USERS OK - 0 users currently logged in [02:45:04] PROBLEM Current Users is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [02:45:44] PROBLEM Disk Space is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [02:45:54] RECOVERY Free ram is now: OK on exim-test i-00000265 output: OK: 88% free memory [02:46:34] PROBLEM Free ram is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [02:46:54] RECOVERY Total Processes is now: OK on exim-test i-00000265 output: PROCS OK: 81 processes [02:46:59] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused [02:48:04] RECOVERY dpkg-check is now: OK on exim-test i-00000265 output: All packages OK [02:48:14] PROBLEM Total Processes is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [02:48:27] it is installing [02:48:28] hopefully [02:48:30] :-D [02:49:24] PROBLEM Current Load is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [02:49:44] PROBLEM dpkg-check is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused by host [03:06:58] RECOVERY HTTP is now: OK on deployment-apache20 i-0000026c output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.004 second response time [03:09:28] RECOVERY 
Current Load is now: OK on deployment-apache20 i-0000026c output: OK - load average: 1.34, 2.31, 2.39 [03:09:48] RECOVERY dpkg-check is now: OK on deployment-apache20 i-0000026c output: All packages OK [03:10:35] RECOVERY Current Users is now: OK on deployment-apache20 i-0000026c output: USERS OK - 2 users currently logged in [03:10:45] RECOVERY Disk Space is now: OK on deployment-apache20 i-0000026c output: DISK OK [03:11:35] RECOVERY Free ram is now: OK on deployment-apache20 i-0000026c output: OK: 93% free memory [03:11:45] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 601 MB (3% inode=80%): [03:13:15] RECOVERY Total Processes is now: OK on deployment-apache20 i-0000026c output: PROCS OK: 138 processes [03:13:45] PROBLEM Current Load is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:14:25] PROBLEM Current Users is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:14:55] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.005 second response time [03:15:05] PROBLEM Disk Space is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:16:01] PROBLEM Free ram is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:16:34] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused [03:17:43] PROBLEM Total Processes is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:18:28] PROBLEM dpkg-check is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused by host [03:21:21] RECOVERY HTTP is now: OK on deployment-apache21 i-0000026d output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.003 second response time [03:26:10] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK [03:30:08] petan: are you 
the correct person to ping for setting stuff on beta? [03:34:00] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 78 MB (5% inode=40%): [03:38:11] RECOVERY Current Load is now: OK on kripke i-00000268 output: OK - load average: 0.55, 0.34, 0.17 [03:38:11] RECOVERY Disk Space is now: OK on kripke i-00000268 output: DISK OK [03:38:41] RECOVERY dpkg-check is now: OK on kripke i-00000268 output: All packages OK [03:39:01] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory [03:39:21] RECOVERY Current Users is now: OK on kripke i-00000268 output: USERS OK - 0 users currently logged in [03:39:21] RECOVERY Free ram is now: OK on kripke i-00000268 output: OK: 96% free memory [03:39:21] RECOVERY Total Processes is now: OK on kripke i-00000268 output: PROCS OK: 217 processes [03:40:21] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 16% free memory [03:41:03] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory [03:47:42] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [03:47:42] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 15% free memory [03:55:39] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 595 MB (3% inode=80%): [03:56:11] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [03:59:17] PROBLEM Disk Space is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:59:17] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:00:23] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [04:04:06] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [04:05:36] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:06:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:12:36] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 4% free memory [04:17:36] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:37:44] !log deployment-prep [~2 hrs ago] 02:43:33 < hashar_> !deployment-prep setting up "apache20" instance by using only puppet. We will see what happens :-D [04:37:45] Logged the message, Master [04:39:47] the new instances are quite large [04:44:06] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 78 MB (5% inode=40%): [04:45:45] PROBLEM Free ram is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [04:46:55] PROBLEM Total Processes is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [04:47:35] PROBLEM dpkg-check is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [04:48:45] PROBLEM Current Load is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [04:52:05] PROBLEM Disk Space is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [04:52:25] PROBLEM Current Users is now: CRITICAL on dumps-8 i-0000026e output: Connection refused by host [05:04:53] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 8% free memory [06:02:02] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.34, 9.17, 6.85 [06:04:52] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory [06:12:15] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load 
average: 0.72, 1.77, 3.86 [06:31:18] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 6.50, 5.92, 3.21 [06:34:36] PROBLEM Current Users is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:36] PROBLEM Total Processes is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:14] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:15] PROBLEM Current Users is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:15] PROBLEM Total Processes is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:33] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [06:38:24] PROBLEM Current Load is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:24] PROBLEM dpkg-check is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:24] PROBLEM Disk Space is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:29] PROBLEM Current Load is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [06:38:29] PROBLEM Disk Space is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:39:54] PROBLEM Free ram is now: CRITICAL on ve-nodejs i-00000245 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:41:25] PROBLEM Current Load is now: WARNING on nova-essex-test i-000001f9 output: WARNING - load average: 8.12, 9.61, 6.33 [06:41:26] RECOVERY Current Users is now: OK on nova-essex-test i-000001f9 output: USERS OK - 0 users currently logged in [06:41:26] RECOVERY Total Processes is now: OK on nova-essex-test i-000001f9 output: PROCS OK: 126 processes [06:41:31] PROBLEM Disk Space is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:52] PROBLEM Total Processes is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:42:12] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 12.04, 7.25, 4.64 [06:43:31] RECOVERY Disk Space is now: OK on nova-essex-test i-000001f9 output: DISK OK [06:44:44] PROBLEM Current Load is now: WARNING on nova-production1 i-0000007b output: WARNING - load average: 7.22, 7.92, 5.84 [06:44:45] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 46 MB (3% inode=43%): [06:45:14] PROBLEM Free ram is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:15] PROBLEM Current Users is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:15] PROBLEM Disk Space is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:30] PROBLEM Current Users is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:30] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:55] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK [06:47:00] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:47:09] PROBLEM dpkg-check is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [06:47:09] PROBLEM SSH is now: CRITICAL on mobile-enwp i-000000ce output: CRITICAL - Socket timeout after 10 seconds [06:48:22] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:22] PROBLEM Current Users is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:23] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:06] PROBLEM Disk Space is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:46] PROBLEM Current Users is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:47] PROBLEM Total Processes is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:55] PROBLEM dpkg-check is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:56] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:05] PROBLEM Free ram is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:05] PROBLEM Current Load is now: CRITICAL on maps-test2 i-00000253 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:50] PROBLEM Current Load is now: WARNING on bots-3 i-000000e5 output: WARNING - load average: 6.01, 6.69, 5.68 [06:53:12] PROBLEM Total Processes is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:53:32] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 703 MB (4% inode=80%): [06:53:32] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 12.31, 10.56, 7.36 [06:54:33] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 78 MB (5% inode=40%): [06:54:34] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000234 output: All packages OK [06:54:34] RECOVERY Current Load is now: OK on maps-test2 i-00000253 output: OK - load average: 6.72, 6.84, 4.18 [06:54:34] RECOVERY Current Users is now: OK on maps-test2 i-00000253 output: USERS OK - 0 users currently logged in [06:54:34] RECOVERY Total Processes is now: OK on maps-test2 i-00000253 output: PROCS OK: 101 processes [06:54:38] RECOVERY dpkg-check is now: OK on maps-test2 i-00000253 output: All packages OK [06:54:43] RECOVERY Free ram is now: OK on ve-nodejs i-00000245 output: OK: 80% free memory [06:54:43] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000234 output: OK: 90% free memory [06:54:43] RECOVERY Total Processes is now: OK on pediapress-ocg2 i-00000234 output: PROCS OK: 83 processes [06:55:53] PROBLEM Current Load is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:53] PROBLEM Free ram is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:53] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:56:19] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg1 i-00000233 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:56:49] PROBLEM dpkg-check is now: CRITICAL on en-wiki-db-precise i-0000023c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:56:49] PROBLEM Disk Space is now: CRITICAL on en-wiki-db-precise i-0000023c output: CHECK_NRPE: Socket timeout after 10 seconds. [06:56:50] PROBLEM Current Users is now: CRITICAL on en-wiki-db-precise i-0000023c output: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:35] PROBLEM Current Load is now: CRITICAL on nova-essex-test i-000001f9 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:44] PROBLEM Current Load is now: WARNING on pediapress-ocg2 i-00000234 output: WARNING - load average: 5.12, 6.95, 5.85 [06:57:44] RECOVERY Current Users is now: OK on pediapress-ocg2 i-00000234 output: USERS OK - 0 users currently logged in [06:57:44] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000234 output: DISK OK [06:58:45] PROBLEM Current Load is now: CRITICAL on nova-precise1 i-00000236 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:34] RECOVERY Current Users is now: OK on pediapress-ocg1 i-00000233 output: USERS OK - 0 users currently logged in [07:00:22] PROBLEM Current Load is now: WARNING on aggregator-test i-0000024d output: WARNING - load average: 7.64, 23.33, 16.21 [07:01:16] RECOVERY Current Load is now: OK on bots-3 i-000000e5 output: OK - load average: 6.13, 4.55, 4.88 [07:01:16] PROBLEM Disk Space is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:01:16] PROBLEM Free ram is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:01:51] !deployment-prep del [07:01:51] You are not autorized to perform this, sorry [07:01:55] meh [07:02:17] hey [07:02:18] PROBLEM HTTP is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused [07:02:24] PROBLEM Disk Space is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:02:29] if someone again ping regarding beta tell them to use bz plix [07:02:30] PROBLEM Free ram is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:02:48] PROBLEM 
Current Users is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:02:49] PROBLEM Current Load is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:03:03] where is hashar [07:03:05] PROBLEM Current Users is now: CRITICAL on fr-wiki-db-precise i-0000023e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:05] PROBLEM Free ram is now: CRITICAL on fr-wiki-db-precise i-0000023e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:05] RECOVERY Total Processes is now: OK on mobile-enwp i-000000ce output: PROCS OK: 114 processes [07:03:13] PROBLEM Current Load is now: CRITICAL on pediapress-ocg2 i-00000234 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:34] PROBLEM Total Processes is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:03:34] PROBLEM HTTP is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused [07:03:34] PROBLEM Current Load is now: WARNING on nova-precise1 i-00000236 output: WARNING - load average: 10.57, 10.25, 8.72 [07:03:59] RECOVERY Disk Space is now: OK on maps-test2 i-00000253 output: DISK OK [07:04:24] PROBLEM dpkg-check is now: CRITICAL on deployment-apache22 i-0000026f output: Connection refused by host [07:04:24] PROBLEM Total Processes is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:04:34] PROBLEM Total Processes is now: CRITICAL on fr-wiki-db-precise i-0000023e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:04:40] PROBLEM dpkg-check is now: CRITICAL on fr-wiki-db-precise i-0000023e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:04:40] RECOVERY Current Users is now: OK on mobile-enwp i-000000ce output: USERS OK - 2 users currently logged in [07:04:40] RECOVERY Disk Space is now: OK on mobile-enwp i-000000ce output: DISK OK [07:04:40] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 28% free memory [07:04:54] PROBLEM dpkg-check is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:05:24] PROBLEM Current Users is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:05:24] PROBLEM Current Load is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused by host [07:06:14] PROBLEM Current Load is now: WARNING on pediapress-ocg1 i-00000233 output: WARNING - load average: 4.75, 6.08, 5.87 [07:06:14] RECOVERY Disk Space is now: OK on pediapress-ocg1 i-00000233 output: DISK OK [07:06:14] RECOVERY Free ram is now: OK on pediapress-ocg1 i-00000233 output: OK: 84% free memory [07:06:26] RECOVERY SSH is now: OK on mobile-enwp i-000000ce output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:06:26] RECOVERY dpkg-check is now: OK on mobile-enwp i-000000ce output: All packages OK [07:06:26] RECOVERY Total Processes is now: OK on pediapress-ocg1 i-00000233 output: PROCS OK: 89 processes [07:06:34] RECOVERY dpkg-check is now: OK on pediapress-ocg1 i-00000233 output: All packages OK [07:07:05] RECOVERY Disk Space is now: OK on en-wiki-db-precise i-0000023c output: DISK OK [07:07:05] RECOVERY dpkg-check is now: OK on en-wiki-db-precise i-0000023c output: All packages OK [07:07:05] RECOVERY Current Users is now: OK on en-wiki-db-precise i-0000023c output: USERS OK - 0 users currently logged in [07:07:50] RECOVERY Total Processes is now: OK on nova-precise1 i-00000236 output: PROCS OK: 125 processes [07:08:09] RECOVERY Disk Space is now: OK on nova-precise1 i-00000236 output: DISK OK [07:08:09] RECOVERY Current Users is now: OK on fr-wiki-db-precise i-0000023e output: USERS OK - 0 users 
currently logged in [07:08:09] RECOVERY Free ram is now: OK on fr-wiki-db-precise i-0000023e output: OK: 78% free memory [07:08:09] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 7.12, 3.04, 3.82 [07:08:55] RECOVERY Current Load is now: OK on ve-nodejs i-00000245 output: OK - load average: 0.50, 3.59, 4.91 [07:09:03] RECOVERY dpkg-check is now: OK on ve-nodejs i-00000245 output: All packages OK [07:09:03] RECOVERY Disk Space is now: OK on ve-nodejs i-00000245 output: DISK OK [07:09:25] RECOVERY Total Processes is now: OK on fr-wiki-db-precise i-0000023e output: PROCS OK: 83 processes [07:09:30] RECOVERY dpkg-check is now: OK on fr-wiki-db-precise i-0000023e output: All packages OK [07:11:08] RECOVERY Current Load is now: OK on pediapress-ocg1 i-00000233 output: OK - load average: 0.12, 2.53, 4.42 [07:11:20] RECOVERY Current Users is now: OK on ve-nodejs i-00000245 output: USERS OK - 0 users currently logged in [07:11:20] RECOVERY Total Processes is now: OK on ve-nodejs i-00000245 output: PROCS OK: 81 processes [07:14:28] RECOVERY Current Load is now: OK on nova-production1 i-0000007b output: OK - load average: 0.05, 1.45, 4.29 [07:19:01] PROBLEM Current Load is now: WARNING on mobile-enwp i-000000ce output: WARNING - load average: 4.56, 7.14, 18.19 [07:25:29] RECOVERY Current Load is now: OK on aggregator-test i-0000024d output: OK - load average: 0.25, 0.63, 4.04 [07:30:02] Hydriz: I don't see any possible reason to need 8 instances to upload dumps [07:30:19] thats going to be 100 wikis per instance [07:30:23] 100*5 [07:30:29] upload slower [07:30:47] you're eating a ton of IO [07:31:05] heh [07:31:31] then, give me some desired figures, and I can adapt to it [07:31:43] I mean, I don't know what is possible [07:31:48] so I need feedback [07:32:04] use a couple instances [07:32:32] 4? [07:33:26] ok [07:33:47] you are uploading directly off the gluster share now, right? [07:34:25] yep [07:34:29] ok. 
cool [07:34:33] no wait [07:34:38] which gluster share? [07:34:46] the nfs one, sorry [07:34:51] /dumps-project or /publicdata-project [07:34:58] the latter [07:35:13] oh, just doing initial tests [07:35:33] but its a little out of date the other time I checked [07:35:34] (a few days ago) [07:35:49] I'll check to see why [07:35:59] * Hydriz checks again [07:36:09] what is the interval for updating this directory? [07:36:26] I could have broken it somehow recently [07:36:30] it should be consistent [07:37:15] yeah, what did you set the interval for updating this directory to be? [07:37:35] if its once a day, then its really out of date [07:37:51] I didn't [07:37:53] ariel did [07:38:01] looks like its once every two days then :) [07:38:13] May 09 is there [07:38:17] but not 10 and 11 [07:39:27] yep [07:39:28] I broke it [07:39:38] no, it constantly updates [07:39:42] but I broke it monday [07:39:59] so, I just fixed it [07:40:03] heh [07:40:04] it'll start updating again [07:40:31] I am figuring out how to extract the dates of the dump in python [07:40:52] the script for uploading is almost done, just lacking this and a date range check [07:44:36] PROBLEM Current Load is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [07:44:51] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK [07:45:01] PROBLEM Free ram is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [07:46:10] PROBLEM Total Processes is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. 
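Hydriz mentions above that the upload script still lacks two pieces: extracting the dump date and a date range check. A rough Python sketch of that step is given here. It is not taken from the actual script. It assumes that dump directory or file names carry the date as an eight-digit YYYYMMDD run (for example `20120509`), and the names `dump_date` and `in_range` are hypothetical.

```python
import re
from datetime import date, datetime

# Assumption: the dump date appears as exactly eight digits (YYYYMMDD)
# somewhere in the directory or file name.
DATE_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def dump_date(name):
    """Return the date embedded in a dump name, or None if there is none."""
    match = DATE_RE.search(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


def in_range(day, start, end):
    """Date range check: True if start <= day <= end (inclusive)."""
    return start <= day <= end
```

With these helpers, a name such as `enwiki-20120509-pages-articles.xml.bz2` yields 9 May 2012, and a directory named `latest` yields `None`, so it can be skipped before the range check.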
[07:51:35] RECOVERY HTTP is now: OK on deployment-apache22 i-0000026f output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.028 second response time [07:53:35] RECOVERY HTTP is now: OK on deployment-apache23 i-00000270 output: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 0.005 second response time [08:00:38] RECOVERY Total Processes is now: OK on deployment-apache21 i-0000026d output: PROCS OK: 149 processes [08:00:58] RECOVERY Free ram is now: OK on deployment-apache21 i-0000026d output: OK: 96% free memory [08:00:58] RECOVERY Disk Space is now: OK on deployment-apache21 i-0000026d output: DISK OK [08:01:28] RECOVERY Current Load is now: OK on deployment-apache21 i-0000026d output: OK - load average: 0.19, 0.27, 0.11 [08:01:28] RECOVERY dpkg-check is now: OK on deployment-apache21 i-0000026d output: All packages OK [08:01:28] RECOVERY Current Users is now: OK on deployment-apache21 i-0000026d output: USERS OK - 0 users currently logged in [09:08:44] RECOVERY Current Load is now: OK on deployment-apache23 i-00000270 output: OK - load average: 0.35, 0.21, 0.15 [09:08:44] RECOVERY Current Users is now: OK on deployment-apache23 i-00000270 output: USERS OK - 0 users currently logged in [09:09:50] !log deployment-prep petrb: fixed nrpe on boxes where it was failing, we need to insert motd to puppet [09:09:53] Logged the message, Master [09:10:04] RECOVERY Disk Space is now: OK on deployment-apache23 i-00000270 output: DISK OK [09:10:04] RECOVERY dpkg-check is now: OK on deployment-apache22 i-0000026f output: All packages OK [09:10:04] RECOVERY Free ram is now: OK on deployment-apache23 i-00000270 output: OK: 93% free memory [09:10:44] RECOVERY Disk Space is now: OK on deployment-apache22 i-0000026f output: DISK OK [09:10:44] RECOVERY Free ram is now: OK on deployment-apache22 i-0000026f output: OK: 94% free memory [09:11:14] RECOVERY Current Load is now: OK on deployment-apache22 i-0000026f output: OK - load average: 0.33, 0.33, 0.18 [09:11:27] RECOVERY Current Users is now: OK 
on deployment-apache22 i-0000026f output: USERS OK - 1 users currently logged in [09:11:34] RECOVERY Total Processes is now: OK on deployment-apache22 i-0000026f output: PROCS OK: 129 processes [09:12:34] RECOVERY Total Processes is now: OK on deployment-apache23 i-00000270 output: PROCS OK: 127 processes [09:12:54] RECOVERY dpkg-check is now: OK on deployment-apache23 i-00000270 output: All packages OK [09:27:44] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 78 MB (5% inode=40%): [09:33:46] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [09:37:39] PROBLEM HTTP is now: WARNING on deployment-apache21 i-0000026d output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.006 second response time [09:47:47] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 681 MB (3% inode=80%): [11:20:28] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [11:56:15] mutante: can you add a file to puppet so that all servers on deployment have it in /etc/motd.tail [11:56:25] I don't know how to do that [11:57:34] do you know the right class already? [11:57:44] yes there is a class for apache servers [11:57:46] or puppet group [11:58:56] apaches::service [11:59:42] ok, if you are in that file, and you scroll down a little to apaches::pybal-check [12:00:06] you can see a line with file { [12:00:32] that block puts 2 files in place [12:00:54] one is a directory, "ensure => directory" in there [12:01:24] and the other one is a file and gets the content from a variable (content => ..) [12:01:52] then scrolling down further to class apaches::syslog ... [12:02:23] you can see another one, which uses source => instead, and you can see that it points to puppet:///files/.... 
[12:02:42] that refers to the paths in the git repo [12:05:54] so you put the file you want somewhere in ./files/ in the git repo and type "git add ", then add a file { block like that in the class; by saying "ensure => present;" you tell puppet to make sure it exists [12:12:44] ok now another problem: one of the boxes I use to connect doesn't resolve for some reason, any idea how to fix it? [12:12:58] I can't access git because of that [12:13:01] :o [12:13:25] I can't resolve any ip for some reason, is it possible to restart the resolver? [12:13:32] or whatever service [12:14:01] I am connected on office pc to one of my servers and from there to labs [12:15:11] will try #ubuntu [12:15:39] petan|wk: does not look like a labs problem, worksforme. temp workaround: put something in /etc/hosts (just don't forget to remove it once resolving works again) [12:16:29] mutante: no it's not a labs problem, I just thought you might know that [12:16:41] oh, you meant local DNS cache? /etc/init.d/nscd restart ? [12:17:07] nscd is the caching daemon [12:17:44] no such service on that box [12:18:11] I tried to use other dns servers but none of them work :o [12:18:14] but I can ping them [12:18:43] like pinging one of the opendns servers works, but when I try to resolve and dns it fails [12:18:53] any [12:19:06] firewalling port 53? [12:19:21] no, it used to work in the past, I didn't change anything on the firewall [12:19:27] I think the service just died [12:19:33] but I don't know which service it is [12:19:34] which one [12:19:39] the one which resolves dns [12:19:44] where are you? you meant wmf office? 
[12:20:34] eh, no I am in office in my work, then I connect to one of my servers which is in entirely different place and from there I connect to bastion (I can't connect to bastion from office because of firewall here) [12:21:16] I just wanted to install some packages I need to upload that patch to gerrit [12:21:23] but aptitude can't resolve the repo [12:21:32] :/ [12:22:07] if only I know how the linux resolve stuff [12:22:25] I believe there is some service which needs to be restarted [12:22:27] first it looks in /etc/nsswitch.conf [12:22:40] to see where it is supposed to check (files, DNS,..) [12:23:20] if there is (also) "files" in there, it looks up names in /etc/hosts [12:23:26] ok, but when you type ping blabla does it contact some local service to resolve it? or use some api? [12:23:43] otherwise it looks in /etc/resolv.conf [12:24:02] which tells it which nameserver to ask [12:24:07] "it" is a program I called or a service which resolve it? [12:24:24] "it" is like the answer to "how does linux resolve stuff" [12:24:27] I mean if I call some program, is it that program which directly contact dns server or some service [12:24:56] I guess it call some api [12:25:04] which does all the stuff you described [12:25:18] or it could just query some local service which would do that stuff [12:25:34] that would allow caching of dns records etc [12:26:19] my question is if there is any service which if I restart could make it work [12:26:32] because it just stopped working for no reason [12:26:56] I didn't change anything on that box, it's running almost 8 months with no restart [12:26:57] afaik what you meant with api here is glibc [12:27:30] re: caching, that would be the nscd you did not have installed though [12:27:52] you may ask the work admin first to confirm there is no issue with local DNS server (likely "bind")? 
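Note on the lookup order described above (/etc/nsswitch.conf first, then /etc/hosts if "files" is listed, then the nameservers in /etc/resolv.conf): a minimal Python sketch that prints what a given box is configured to do. The helper names `hosts_sources`, `nameservers` and `read_or_empty` are made up for this illustration; the only assumption about the system is that the two config files use their usual format.

```python
import socket
from pathlib import Path


def hosts_sources(nsswitch_text: str) -> list[str]:
    """Return the lookup sources on the 'hosts:' line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            # Drop action items such as [NOTFOUND=return].
            tokens = line.split(":", 1)[1].split()
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def nameservers(resolv_text: str) -> list[str]:
    """Return the 'nameserver' addresses listed in resolv.conf text."""
    servers = []
    for line in resolv_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def read_or_empty(path: str) -> str:
    """Read a config file, or return '' if it does not exist."""
    p = Path(path)
    return p.read_text() if p.is_file() else ""


if __name__ == "__main__":
    print("lookup order:", hosts_sources(read_or_empty("/etc/nsswitch.conf")))
    print("nameservers:", nameservers(read_or_empty("/etc/resolv.conf")))
    try:
        # getaddrinfo is answered by the C library resolver, not by a daemon.
        print(socket.getaddrinfo("localhost", None)[0][4][0])
    except socket.gaierror as err:
        print("resolution failed:", err)
```

If "files" comes before "dns" in the first list, an /etc/hosts entry wins, which is why the temporary /etc/hosts workaround mentioned earlier works even when every nameserver is unreachable.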
[12:28:13] actually I don't use local dns, in resolv.conf I have another dns server [12:28:15] opendns [12:28:22] I can ping it [12:28:30] but resolving doesn't work [12:28:47] I thought their service is down, but changing dns servers didn't fix it [12:29:50] that's weird I will try ubuntu guys [12:30:57] ok [12:39:35] mutante: found out what the problem was [12:40:06] all possible dns servers I tried were down, maybe some hackers are trying to get internet down, hehe [12:40:15] finally I found one which is up [12:40:54] when in situations like that: 8.8.8.8 is google [12:47:54] mutante: which file is apache class hashar used in [12:48:17] I know the name of class only [12:48:54] I think I found that [14:01:09] PROBLEM dpkg-check is now: CRITICAL on deployment-apache20 i-0000026c output: DPKG CRITICAL dpkg reports broken packages [14:23:02] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused [14:26:02] RECOVERY dpkg-check is now: OK on deployment-apache20 i-0000026c output: All packages OK [14:26:42] PROBLEM HTTP is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused [14:28:02] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.013 second response time [14:31:42] PROBLEM HTTP is now: WARNING on deployment-apache23 i-00000270 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.018 second response time [16:20:18] Soulparadox gets cookies [16:33:22] hello [16:37:28] !log deployment-prep updated MediaWiki up to 05e656a (aka master) [16:37:31] Logged the message, Master [16:39:56] !log deployment-prep cloning mediawiki/extensions.git which has all extensions as submodules [16:39:57] Logged the message, Master [16:49:55] PROBLEM HTTP is now: WARNING on deployment-apache22 i-0000026f output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.012 second response time [16:55:52] RECOVERY Free ram is now: OK on mobile-enwp 
i-000000ce output: OK: 75% free memory [16:55:52] RECOVERY Current Load is now: OK on mobile-enwp i-000000ce output: OK - load average: 2.78, 1.43, 0.56 [16:57:12] RECOVERY Total Processes is now: OK on mobile-enwp i-000000ce output: PROCS OK: 118 processes [16:57:52] RECOVERY Disk Space is now: OK on deployment-feed i-00000118 output: DISK OK [17:05:49] PROBLEM Disk Space is now: WARNING on deployment-feed i-00000118 output: DISK WARNING - free space: / 77 MB (5% inode=40%): [17:30:59] PROBLEM Disk Space is now: CRITICAL on bz-dev i-000001db output: DISK CRITICAL - free space: / 35 MB (2% inode=43%): [17:31:20] 05/11/2012 - 17:31:19 - Updating keys for laner [17:32:19] 05/11/2012 - 17:32:19 - Updating keys for laner [17:33:32] !log deployment-prep restarted squid several time to fix some minor typos in conf [17:33:35] Logged the message, Master [17:33:59] PROBLEM Current Load is now: WARNING on mobile-enwp i-000000ce output: WARNING - load average: 11.37, 9.86, 6.36 [17:35:59] PROBLEM Disk Space is now: WARNING on bz-dev i-000001db output: DISK WARNING - free space: / 45 MB (3% inode=43%): [17:38:20] 05/11/2012 - 17:38:19 - Updating keys for laner [17:39:42] 05/11/2012 - 17:39:42 - Creating a home directory for laner at /export/home/analytics/laner [17:39:42] 05/11/2012 - 17:39:42 - Creating a home directory for diederik at /export/home/analytics/diederik [17:39:42] 05/11/2012 - 17:39:42 - Creating a home directory for nimishg at /export/home/analytics/nimishg [17:39:42] 05/11/2012 - 17:39:42 - Creating a home directory for otto at /export/home/analytics/otto [17:39:42] 05/11/2012 - 17:39:42 - Creating a home directory for declerambaul at /export/home/analytics/declerambaul [17:40:40] 05/11/2012 - 17:40:40 - Updating keys for declerambaul [17:40:40] 05/11/2012 - 17:40:40 - Updating keys for diederik [17:40:40] 05/11/2012 - 17:40:40 - Updating keys for otto [17:40:40] 05/11/2012 - 17:40:40 - Updating keys for laner [17:40:40] 05/11/2012 - 17:40:40 - Updating keys 
for nimishg [17:49:20] 05/11/2012 - 17:49:19 - Updating keys for laner [17:50:09] PROBLEM HTTP is now: CRITICAL on deployment-apache22 i-0000026f output: CRITICAL - Socket timeout after 10 seconds [17:50:20] 05/11/2012 - 17:50:20 - Updating keys for laner [17:53:19] 05/11/2012 - 17:53:19 - Updating keys for laner [17:55:17] PROBLEM HTTP is now: WARNING on deployment-apache22 i-0000026f output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.029 second response time [18:04:19] hi guys [18:04:32] dschoon just created a new labs instances for the analytics team [18:04:36] i think we've set everything up [18:04:39] but i can't log in [18:04:42] so close though [18:05:06] my ssh client is getting [18:05:06] Connection closed by UNKNOWN [18:05:18] and auth.log on the instance says May 11 18:02:54 i-00000268 sshd[12687]: pam_access(sshd:account): access denied for user `otto' from `i-000000ba.pmtpa.wmflabs' [18:06:17] 05/11/2012 - 18:06:17 - Updating keys for laner [18:06:23] any ideas? i'm pretty sure we've set up the access in labs properly [18:07:51] PROBLEM HTTP is now: CRITICAL on deployment-apache23 i-00000270 output: Connection refused [18:07:56] Ryan_Lane: Did... did I just see you on Ellen ? [18:07:58] (dancing behind someone in an electronics store) [18:08:05] hahahaha [18:08:05] no [18:08:09] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [18:08:43] They have this section where they dance behind people, record it and send it in. A dutch program highlighting a collection of a few of the best. I'd bet I saw you in one of them. [18:08:46] Okay, then. [18:08:48] :D [18:09:18] 05/11/2012 - 18:09:18 - Updating keys for laner [18:12:11] PROBLEM HTTP is now: WARNING on deployment-apache23 i-00000270 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.301 second response time [18:15:25] Reedy: https://gerrit.wikimedia.org/r/7298 [18:15:32] Reedy: really lame change for ya to review :- [18:15:40] Reedy: this is showing in orange ? 
:-] [18:15:48] mmmm [18:16:02] PROBLEM Disk Space is now: CRITICAL on nagios 127.0.0.1 output: DISK CRITICAL - free space: /home/dzahn 614 MB (3% inode=80%): [18:16:21] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: Connection refused [18:18:19] Ryan_Lane: I let apt update pam on a labs instance. It says config files are out of date, but have local changes. It wants me to run `pam-auth-update --force`. Is this a good idea? [18:18:32] o.O [18:18:45] that was my response also [18:18:54] what pam stuff did you install? [18:19:04] or is this just on apt-get update? [18:19:06] err [18:19:08] apt-get upgrade [18:19:10] apt was all "there are security updates!!!" [18:19:12] ah [18:19:18] so i let it install them [18:19:21] don't run tnat [18:19:32] you told it to not use the distro's pam version, right? [18:19:32] and now everything is ruined and ottomata can't log in [18:19:35] heh [18:19:38] force run puppet [18:19:46] i ran sudo puppetd --test [18:19:48] poor ottomata :-( [18:19:49] hm [18:19:51] same result [18:20:01] it definitely updated a ton of pam files [18:20:06] but functionally speaking, still fucked [18:20:20] which instance? [18:20:30] kripke.pmtpa.wmflabs [18:20:48] http://en.wikipedia.org/wiki/Saul_Kripke [18:20:51] :) [18:21:41] !log deployment-prep Replaced extensions with a fresh clone of mediawiki/extensions.git [18:21:43] Logged the message, Master [18:22:48] petan: are you using tcpdump somewhere? [18:24:01] or not. file just says that [18:24:22] petan: you have .nfs files in your home directory under the production directory [18:24:24] they are huge [18:24:28] and they are error files [18:25:32] petan: I deleted some since they were eating up 5 GB of space [18:25:50] dschoon: hm [18:25:55] I don't see why this would fail for him [18:25:57] Ryan_Lane: hm [18:25:58] RECOVERY Disk Space is now: OK on nagios 127.0.0.1 output: DISK OK [18:26:14] which project is this? 
[18:26:28] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: Connection refused [18:26:31] analytics [18:26:40] he's not in the project [18:26:45] i'm not? [18:26:47] no [18:26:53] is too [18:26:56] i added him this morning. [18:26:56] really? [18:26:58] really. [18:27:08] ah ha [18:27:11] nscd cache [18:27:18] RECOVERY Disk Space is now: OK on labs-nfs1 i-0000005d output: DISK OK [18:27:23] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Analytics [18:27:25] nscd -i passwd; nscd -i group [18:27:38] do i need to run that? [18:27:41] I just did [18:27:45] hokay. [18:27:52] ottomata should try again now? [18:27:59] !log deployment-prep deleting symlinks in /home/wikipedia to /data/project : breaks logging [18:28:00] dschoon: pam_access requires that a user be in a specific group [18:28:00] Logged the message, Master [18:28:02] yes [18:28:22] dschoon: if he tries to log in before he's added to the group, he gets added into the nscd cache [18:28:26] which means no group [18:28:28] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.70, 7.25, 6.48 [18:28:33] then next time he tries to log in.... [18:28:37] fail [18:29:10] i'm in! [18:29:15] \o/ [18:29:18] 05/11/2012 - 18:29:18 - Updating keys for laner [18:29:24] o.O [18:29:30] hahaha [18:29:33] that's an odd message [18:29:44] where did it change my key? [18:29:45] danke Ryan_Lane! [18:30:09] dschoon: did you add me to the project? [18:30:18] Yes. 
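Note on the nscd fix above: `nscd -i passwd; nscd -i group` invalidates the cached passwd and group tables so that the next login sees the fresh group membership. A small Python sketch of the same two commands follows; the function names and the `dry_run` switch are made up for illustration, and actually running the commands assumes nscd is installed and the caller is root.

```python
import subprocess


def nscd_invalidate_commands(tables=("passwd", "group")):
    """Build one 'nscd -i TABLE' command per cache table."""
    return [["nscd", "-i", table] for table in tables]


def invalidate(tables=("passwd", "group"), dry_run=True):
    """Print the commands; run them only when dry_run is False (needs root)."""
    for cmd in nscd_invalidate_commands(tables):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    invalidate()  # dry run: only prints the two commands
```

The dry run is the default on purpose, so pasting the sketch on a box without nscd does nothing worse than print two lines.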
[18:30:20] ah ha [18:30:30] that message is scary with no context [18:32:54] dschoon: I can ssh into projects without being added to them ;) [18:33:01] *shrug* [18:33:28] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.26, 2.86, 4.78 [18:35:00] !log deployment-prep restarting udp2-log on dbdump [18:35:01] Logged the message, Master [18:40:20] 05/11/2012 - 18:40:20 - Updating keys for laner [18:41:19] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.008 second response time [18:41:29] PROBLEM HTTP is now: WARNING on deployment-apache21 i-0000026d output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.006 second response time [18:46:00] ottomata, dschoon: did you guys add a sudo file called reportcard? [18:46:23] i'm sure one of us did [18:46:33] i didn't on kripke, but might have on reportcard2 instance [18:46:48] wtf... [18:46:52] ? [18:46:57] this is a *really* bad sudoers file [18:47:03] haha [18:47:17] I mean your instance is going to get seriously owned kind of sudoers file [18:47:24] this is incredibly insecure [18:47:43] dschoon? [18:47:48] oh he's getting a muffin [18:47:49] :) [18:47:54] while you are looking at that [18:48:00] sudo: parse error in /etc/sudoers.d/ops near line 6 [18:48:02] I'm going to go ahead and remove that.... [18:48:13] which instance is saying that? [18:48:16] kripke [18:48:25] and i get a big stack trace when i try to sudo [18:48:46] New patchset: Hashar; "syslog-server requires /home/wikipedia/syslog" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6546 [18:48:49] I'm not getting an errror [18:49:01] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6546 [18:49:05] works fine for me. 
[18:49:13] I just sudoed without error [18:49:28] ah [18:49:32] it's because I removed reportcard [18:49:34] Ryan_Lane: can you possibly merge the already reviewed change https://gerrit.wikimedia.org/r/#/c/6546/ ? :-D [18:49:59] ja sudo is fine now [18:50:10] New review: Ryan Lane; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6546 [18:50:13] Change merged: Ryan Lane; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6546 [18:50:18] 05/11/2012 - 18:50:18 - Updating keys for laner [18:50:21] danke! [18:50:30] chown/chmod should never be added to sudo for a user [18:50:43] it's the equivalent of giving them root [18:51:00] if you are going to do that, you may as well give them full root [18:51:17] dschoon: ^^ [18:52:18] yeah, that is to the www-deploy group [18:52:22] which is me and dsc and fabian [18:52:24] who already have root [18:54:16] oh [18:54:21] PROBLEM Current Load is now: CRITICAL on mobile-enwp i-000000ce output: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:23] I thought it was www-data [18:54:34] no. [18:54:36] it is not. [18:54:36] :D [18:54:38] ok [18:55:10] btw, you can manage sudo via labsconsole now [18:55:15] and it'll affect all instances in a project [18:55:22] oh, sweet [18:55:41] you aren't subscribed to labs-l, are you? :D [18:55:50] probably. i merely don't read it. [18:55:54] * Ryan_Lane nods [18:56:27] New patchset: Hashar; "stop apache when having nginx thumb proxy" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7253 [18:56:42] New patchset: Hashar; "ability to change thumbnail server name" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7255 [18:56:56] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7253 [18:56:56] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7255 [18:57:17] dschoon, drdee, robla: I talked with jos (is that how his name is spelled?) yesterday at UDS, and he has a log collection system that appears to be more efficient than ours [18:57:28] implemented using nodejs [18:57:42] everyone in the universe has a log collection system more efficient than ours. [18:57:46] he's at a startup that's doing real-time analytics [18:58:05] heh [18:58:06] this will be a non-issue when, you know, the analytics team has hardware. [18:58:15] I'm talking about udp2log [18:58:20] *nod* [18:58:28] it will get phased out. [18:58:47] how do you plan on collecting the udp messages, then? [18:59:09] PROBLEM Current Load is now: WARNING on mobile-enwp i-000000ce output: WARNING - load average: 6.14, 7.39, 7.24 [18:59:28] we'll have a discussion with you guys about that. there are quite a few frameworks in use at places with traffic comparable to ours [18:59:43] both scribe and flume are good candidates. [18:59:48] * Ryan_Lane nods [19:00:13] ottomata was running scribe at couchsurfing, and one log server could handle several hundred producers [19:00:29] with the same number of requests? [19:00:42] I think we likely get an order of magnitude more requests than them [19:01:02] we do. so there will definitely be some load testing. [19:01:03] ja for sure, but scribe is built for huge stuff [19:01:04] facebook wrote it [19:01:07] but yeah [19:01:10] but yeah. what ottomata just said. [19:01:13] * Ryan_Lane nods [19:01:21] it's designed for horizontal scaling. as is flume. [19:01:25] ah. great [19:01:30] and kafka [19:01:32] we might need TWO log servers?!!1 [19:01:37] which looks really nice too [19:01:39] yeah. [19:01:56] as long as we don't plan on needing to switch to rsyslog to deliver the messages ;) [19:01:56] (though i think we need to look more into the ecosystem there. i don't know many people using it.) 
[19:02:01] heh [19:02:05] aye heh [19:02:09] but that's what wikia is doing?!!! [19:02:14] are they? [19:02:16] yes. [19:02:16] !log deployment-prep Removing misc::mediawiki-logger from dbdump, it is on 'feed' [19:02:19] syslog is a bad fucking idea [19:02:19] Logged the message, Master [19:02:23] i agree. [19:02:32] but we get more requests per minute than they get per day [19:02:37] yeah [19:03:04] I've heard issues of syslog blocking and causing cascading failures [19:03:45] which seems odd to me [19:05:18] jos also mentioned they wrap their log format in json [19:05:25] which seems like a sane idea [19:12:37] Ryan_Lane, how do public IPs in labs work? [19:12:44] NAT [19:13:05] we've got an IP and a dns record pointing at our instance [19:13:12] but i can't seem to reach that instance via the public IP [19:13:46] !log deployment-prep Removed OnlineStatusBar extension. It is not in Gerrit / WMF [19:13:48] Logged the message, Master [19:15:03] PROBLEM HTTP is now: WARNING on deployment-apache22 i-0000026f output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.009 second response time [19:18:03] Ryan_Lane, example, nginx is listening on port 80 there [19:18:23] from another labs instance, I can telnet into 80, and curl successfully to the eth0 ip [19:18:25] ottomata: security groups [19:18:31] they are firewall rules [19:18:36] on the instance? [19:18:42] for the project [19:18:48] which security groups is the instance in? [19:18:59] analytics, looking... [19:19:06] well looky thar! [19:19:08] you need to add a firewall rule to the group to allow 80 from 0.0.0.0/0 [19:19:11] !security [19:19:11] https://labsconsole.wikimedia.org/wiki/Help:Security_Groups [19:21:05] New patchset: Hashar; "/home/wikipedia/logs on mediawiki-logger service" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/7304 [19:21:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/7304 [19:29:56] Ryan_Lane, still problems [19:30:06] i got it to add a rule for port 80 [19:30:13] but the CIDR range I entered is blank [19:30:17] when I try to add a rule for 443 [19:30:21] I get [19:30:21] Failed to add rule. [19:30:29] hm [19:30:31] lemme try [19:30:35] this is for analytics? [19:30:38] and with my currently added port 80 rule, still no response [19:30:39] ja [19:32:12] I deleted the old rule [19:32:27] and added it and it worked for me [19:32:34] had you specified a group? [19:34:35] click the check box on save? [19:34:41] if so, then yes [19:34:49] i clicked tha analytics: web one [19:34:51] ah [19:35:04] groups and cidr ranges are mutually exclusive [19:35:32] ? [19:35:49] when adding a group to a rule it allows all traffic from that other group [19:35:56] it's to link two projects together [19:36:03] I need to hide the group stuff by default [19:36:13] like I do with puppet config on instance creation [19:36:23] oh i thought that had to be checked to add it to my new group web [19:36:43] ok, still no luck on port 80 though [19:36:50] do I need to wait for somethign to happen? [19:38:07] hm [19:38:14] well, i mean, that rule is defined in the reportcard : web group [19:38:18] but there is no edit rule [19:38:40] to add those rules (or group?) to my analytics uhhh meta(?)e group [19:39:08] RECOVERY Current Load is now: OK on mobile-enwp i-000000ce output: OK - load average: 4.01, 4.25, 4.80 [19:40:28] ping Ryan_Lane :) [19:42:46] he's gone to look for om nom noms [19:43:24] ayyeeee danke [19:46:52] * Damianz omnomnoms on Ryan [20:47:08] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:59] ottomata: ? 
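Note on the security group rules above: a rule takes either a CIDR range or a source group, not both, and "allow 80 from 0.0.0.0/0" means any IPv4 source may reach port 80. The Python below is a toy model of that constraint only. The `Rule` class is hypothetical and is not how labsconsole or nova implement rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    port: int
    cidr: Optional[str] = None          # e.g. "0.0.0.0/0" for any IPv4 source
    source_group: Optional[str] = None  # e.g. another project's group

    def __post_init__(self):
        # Exactly one of the two sources must be given.
        if (self.cidr is None) == (self.source_group is None):
            raise ValueError("give exactly one of cidr or source_group")
        if self.cidr is not None:
            ipaddress.ip_network(self.cidr)  # raises ValueError on a bad range

    def allows(self, src_ip: str, port: int) -> bool:
        """True if a CIDR rule admits src_ip on port.

        Group rules always return False here because group membership
        is not modelled in this sketch.
        """
        if port != self.port or self.cidr is None:
            return False
        return ipaddress.ip_address(src_ip) in ipaddress.ip_network(self.cidr)


if __name__ == "__main__":
    web = Rule(port=80, cidr="0.0.0.0/0")
    print(web.allows("198.51.100.7", 80))
    print(web.allows("198.51.100.7", 443))
```

A rule only has an effect on instances that are members of the group it belongs to, which this sketch does not model.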
[20:56:06] yup [20:56:29] ja, still no access on port 80 [20:56:31] externally [20:56:38] not sure if there is anything else I need to do [20:57:00] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.84, 18.23, 18.99 [21:02:49] ottomata: hm [21:03:28] ah [21:03:35] kripke isn't in the web security group [21:03:38] no fixing that now [21:03:52] ottomata: add that security group rule to the default group [21:07:56] got it thank you! [21:08:53] yeehaw, yeah much better [21:21:09] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [21:21:58] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.81, 0.75, 4.24 [21:34:33] paravoid: deployment-thumbproxy <--- that is the dumb instance I have been working on [21:34:57] paravoid: you probably want to start over though. It has the media-storage::thumbs-handler media-storage::thumbs-server