[01:34:30] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [01:43:29] sure [01:45:30] i half expected you were asleep and logged in accidentally. ;-P [01:45:36] * jeremyb pokes mailman [01:46:32] I'm off to sleep soon but I guess I can leave you to poke at the poor mailman [01:49:55] Thehelpfulone: try now [01:50:14] ah there we go [01:50:26] will puppet reset that? [01:50:33] also you've got a PM jeremyb [01:52:05] puppet may. i stopped puppet for now but it's not the first time i stopped puppet and then it got started again [01:52:11] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.009 second response time [02:04:26] Thehelpfulone: try now [02:04:39] more stuff broken by puppet [02:05:09] ok [02:05:18] yep, works now [02:05:39] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds [02:05:39] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [02:05:39] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds [02:05:39] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds [02:07:29] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [02:08:25] Thehelpfulone: -> channel? ;) [02:08:35] anyway, so: sudo mmsitepass -c lakjsdflkajsdflkajsdlkfjalksdfj [02:09:31] and if you do it over an existing one, it will overwrite it? [02:09:36] yes [02:09:58] but you can have a site and a creator at the same time. they don't overwrite eachother [02:10:07] yep [02:10:29] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.014 second response time [02:10:29] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.017 second response time [02:10:29] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.018 second response time [02:10:29] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.007 second response time [02:12:39] !log mailman stopped puppet again. copied /etc/mailman/mm_cfg.py{.bak,} /etc/lighttpd/conf-available/50-mailman.conf{.bak,}. booted lighttpd. [02:13:06] and list creator can only do that, crete lists? [02:13:09] create* [02:13:21] i guess? [02:13:34] what do the docs say? ;) [02:14:15] yep seems to be just that [02:17:01] !log mailman [mailman-01] stopped puppet again. copied /etc/mailman/mm_cfg.py{.bak,} /etc/lighttpd/conf-available/50-mailman.conf{.bak,}. booted lighttpd. [02:17:03] Logged the message, Master [02:17:07] danke [02:22:34] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [02:24:42] !log bots [bots-2] labs-morebots was running but working. $ sudo service adminbot status; * logslogbot is running; $ sudo service adminbot restart; * Restarting IRC Logging bot for WMF labs logslogbot; ...done. 
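For reference on the exchange above: Mailman 2.1's mmsitepass sets the site-wide administrator password when run bare and the separate list-creator password with -c, which is why the two don't overwrite each other, and the creator password only grants list creation. A minimal sketch of the steps logged here, with placeholder passwords; the lighttpd restart stands in for the "booted lighttpd" step:

    # set the site-wide administrator password (placeholder value)
    sudo mmsitepass 'SITE-PASSWORD-PLACEHOLDER'
    # set the list-creator password; that role can only create lists
    sudo mmsitepass -c 'CREATOR-PASSWORD-PLACEHOLDER'
    # restore the hand-edited configs that puppet had overwritten
    # (brace expansion: cp FILE.bak FILE)
    sudo cp /etc/mailman/mm_cfg.py{.bak,}
    sudo cp /etc/lighttpd/conf-available/50-mailman.conf{.bak,}
    sudo service lighttpd restart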
[02:24:43] Logged the message, Master [02:26:55] !log mailman [mailman-01] Thehelpfulone and I both have the site and list creator passwords [02:26:56] Logged the message, Master [02:27:41] !log bots [bots-2] then investigated further (after the restart) and it turns out there were 3 adminlogbot.py procs (including the new one that had just been started). the other 2 were from May 9 and May 12. killed them all and started again from scratch [02:27:43] Logged the message, Master [02:28:15] !log bots [bots-2] could use some lockfiles... either in wrapper or in python itself [02:28:16] Logged the message, Master [02:28:25] !log bots [bots-2] should find out what prod uses [02:28:26] Logged the message, Master [02:35:33] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [02:46:20] 05/21/2012 - 02:46:20 - Updating keys for laner at /export/home/deployment-prep/laner [02:53:18] 05/21/2012 - 02:53:18 - Updating keys for laner at /export/home/deployment-prep/laner [02:56:19] 05/21/2012 - 02:56:19 - Updating keys for laner at /export/home/deployment-prep/laner [03:02:35] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [03:02:35] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds [03:02:35] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds [03:02:35] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds [03:04:18] i wonder what's up with ryan's key syncing ^^^ [03:04:35] (once is enough? and why not do all of his projects at once?) [03:05:50] computers suck [03:06:41] i mean, i've seen him do it in the past and it did *all* of them (which is a lot) [03:07:02] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.19, 5.59, 5.34 [03:09:25] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 10.43, 11.07, 6.93 [03:12:47] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 1.20, 2.88, 4.23 [03:14:14] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds [03:16:43] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 10.83, 9.68, 7.34 [03:20:00] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:00] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.011 second response time [03:20:26] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:20:26] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:26] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:27] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:21:53] PROBLEM Current Load is now: WARNING on deployment-nfs-memc i-000000d7 output: WARNING - load average: 9.42, 9.71, 7.66 [03:22:04] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
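A lockfile in the wrapper, as the !log entries above suggest, would keep stale adminlogbot.py processes like the May 9 and May 12 ones from piling up. A minimal sketch using flock from util-linux; the script and lockfile paths are assumptions, not the actual adminbot init script:

    #!/bin/bash
    # Hypothetical wrapper: flock -n exits non-zero right away if another
    # copy already holds the lock, so a second bot never starts.
    LOCKFILE=/var/run/adminlogbot.lock
    exec flock -n "$LOCKFILE" /usr/bin/python /usr/local/bin/adminlogbot.py "$@"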
[03:22:04] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:45] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:59] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK [03:25:30] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 5.08, 6.11, 4.21 [03:25:31] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 81 processes [03:25:41] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:41] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:41] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:41] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:55] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 86% free memory [03:25:55] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [03:26:54] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [03:28:29] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 144 processes [03:29:56] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 1.08, 4.66, 4.48 [03:29:56] RECOVERY Disk Space is now: OK on mobile-testing i-00000271 output: DISK OK [03:29:56] RECOVERY Current Users is now: OK on mobile-testing i-00000271 output: USERS OK - 0 users currently logged in [03:29:56] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 85% free memory [03:31:19] 05/21/2012 - 03:31:19 - Updating keys for laner at /export/home/deployment-prep/laner [03:32:06] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK [03:41:00] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.75, 1.61, 3.91 [03:41:11] PROBLEM Current Load is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:04] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.63, 4.40, 4.84 [03:47:05] RECOVERY Current Load is now: OK on deployment-nfs-memc i-000000d7 output: OK - load average: 1.74, 2.39, 4.52 [03:52:16] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 17% free memory [03:57:06] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 6.81, 5.89, 4.39 [03:59:58] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:07:35] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 14% free memory [04:15:34] PROBLEM Current Load is now: WARNING on deployment-nfs-memc i-000000d7 output: WARNING - load average: 9.04, 8.61, 6.63 [04:15:35] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 3% free memory [04:15:35] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:35] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:15:39] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:40] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:40] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:40] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:40] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:41] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:41] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:45] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:45] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:46] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:46] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:46] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:46] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:15:46] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:17:23] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:18:14] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory [04:18:24] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:19:07] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:19:07] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:19:07] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:19:07] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:19:23] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [04:19:23] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [04:19:23] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 3.58, 5.19, 4.34 [04:19:23] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [04:19:23] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [04:19:23] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 5.59, 5.76, 3.75 [04:19:24] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [04:19:24] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 87 processes [04:19:28] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [04:19:28] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 81 processes [04:19:50] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 83% free memory [04:21:06] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:22:41] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 4% free memory [04:22:41] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 6% free memory [04:24:00] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 4.93, 5.01, 3.43 [04:24:00] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [04:24:00] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 64% free memory [04:24:00] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [04:24:20] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK [04:24:20] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in [04:24:20] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 90 processes [04:24:25] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 2.06, 4.87, 4.35 [04:24:25] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 92% free memory [04:27:23] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory [04:28:09] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [04:28:34] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: CHECK_NRPE: Socket timeout after 10 seconds. [04:33:18] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 7% free memory [04:36:49] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [04:36:55] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [04:36:55] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [04:36:55] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [04:36:55] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:48:28] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 89 processes [04:48:33] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 3.81, 5.88, 5.03 [04:48:33] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [04:48:33] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [04:48:33] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 91% free memory [04:49:46] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 1.80, 14.38, 11.19 [04:50:12] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:12] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:12] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:12] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:12] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:12] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:17] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:17] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:22] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:22] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:22] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [04:53:48] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.56, 2.21, 3.64 [05:04:58] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.98, 1.36, 4.58 [05:09:10] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 1.95, 2.46, 3.72 [05:12:23] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 14% free memory [05:18:18] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 1.94, 23.26, 19.10 [05:43:29] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.81, 1.42, 4.65 [05:52:28] PROBLEM Free ram is now: WARNING on deployment-squid i-000000dc output: Warning: 19% free memory [06:33:20] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:20] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:20] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:20] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:20] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:33:25] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:25] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:25] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:27] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:09] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds [06:43:09] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:43:09] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:44:04] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.005 second response time [07:00:07] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 15.85, 13.94, 8.17 [07:00:23] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 18.98, 23.28, 13.46 [07:01:28] Err .. !help ..? [07:01:42] Bastion refuses my login ... [07:01:46] HELP [07:01:51] Probably iolaggedout again [07:01:54] Happens [07:01:58] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:02:02] * Beetstra kicks bastion [07:02:03] Tends to be everything alerts at the same time ^ [07:02:49] hmm [07:02:52] :-( [07:05:39] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 9.10, 14.28, 14.10 [07:05:44] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 11.57, 12.61, 11.54 [07:05:49] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 7.11, 9.76, 10.59 [07:06:04] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:09] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:09] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:09] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [07:06:09] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:09] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:10] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:10] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:10] PROBLEM Disk Space is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:10] PROBLEM Free ram is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:10] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:06:15] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:15] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:06:16] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:09:27] PROBLEM Current Load is now: WARNING on mobile-feeds i-000000c1 output: WARNING - load average: 4.90, 5.23, 5.34 [07:09:27] PROBLEM Current Load is now: WARNING on jenkins2 i-00000102 output: WARNING - load average: 5.58, 6.37, 6.05 [07:09:27] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 6.45, 7.28, 7.57 [07:09:27] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 6.39, 6.77, 6.41 [07:09:27] PROBLEM Current Load is now: WARNING on ganglia-collector i-000000b7 output: WARNING - load average: 3.75, 5.24, 5.16 [07:09:27] PROBLEM Current Load is now: WARNING on deployment-apache23 i-00000270 output: WARNING - load average: 5.21, 5.30, 5.08 [07:09:28] PROBLEM Current Load is now: WARNING on deployment-imagescaler01 i-0000025a output: WARNING - load average: 13.91, 13.79, 12.05 [07:09:28] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 6.83, 7.15, 6.56 [07:09:37] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 15.50, 12.32, 7.42 [07:09:37] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK [07:09:37] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in [07:09:37] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 28% free memory [07:09:37] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 6.54, 5.81, 5.24 [07:09:37] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 6.99, 6.53, 6.32 [07:09:38] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [07:18:15] PROBLEM Current Load is now: WARNING on mailman-01 i-00000235 output: WARNING - load average: 10.21, 8.40, 7.31 [07:18:15] PROBLEM Current Load is now: WARNING on swift-be3 i-000001c9 output: WARNING - load average: 7.60, 7.22, 5.90 [07:18:15] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:15] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:15] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:15] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:20] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:18:20] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:22] PROBLEM HTTP is now: CRITICAL on mailman-01 i-00000235 output: CRITICAL - Socket timeout after 10 seconds [07:18:25] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:25] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:25] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:25] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:25] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:25] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:26] PROBLEM Current Load is now: CRITICAL on labs-lvs1 i-00000057 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:18:30] PROBLEM Current Load is now: WARNING on precise-test i-00000231 output: WARNING - load average: 8.82, 10.06, 10.29 [07:18:30] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 100 processes [07:18:35] PROBLEM Current Load is now: WARNING on migration1 i-00000261 output: WARNING - load average: 7.10, 6.96, 8.03 [07:18:35] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [07:18:35] RECOVERY Disk Space is now: OK on migration1 i-00000261 output: DISK OK [07:18:35] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 99 processes [07:18:40] RECOVERY Free ram is now: OK on migration1 i-00000261 output: OK: 83% free memory [07:18:40] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [07:18:50] PROBLEM Current Load is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - load average: 34.58, 22.06, 16.48 [07:18:50] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 77% free memory [07:26:35] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:41] RECOVERY Disk Space is now: OK on incubator-bot2 i-00000252 output: DISK OK [07:26:41] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 91% free memory [07:26:41] RECOVERY Free ram is now: OK on incubator-bot2 i-00000252 output: OK: 63% free memory [07:26:41] RECOVERY dpkg-check is now: OK on incubator-bot2 i-00000252 output: All packages OK [07:26:41] RECOVERY Current Users is now: OK on incubator-bot2 i-00000252 output: USERS OK - 0 users currently logged in [07:26:41] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 11.19, 12.46, 12.02 [07:26:41] PROBLEM Current Load is now: WARNING on labs-lvs1 i-00000057 output: WARNING - load average: 1.40, 3.65, 5.04 [07:29:01] RECOVERY Disk Space is now: OK on reportcard2 i-000001ea output: DISK OK [07:29:02] RECOVERY Total Processes is now: OK on reportcard2 i-000001ea output: PROCS OK: 98 processes [07:29:07] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 3.48, 5.63, 6.26 [07:29:07] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.99, 4.29, 15.53 [07:29:24] RECOVERY Current Users is now: OK on reportcard2 i-000001ea output: USERS OK - 0 users currently logged in [07:29:24] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: Connection refused or timed out [07:29:24] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:32:34] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 213 processes [07:32:39] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 14.26, 12.82, 14.07 [07:32:44] RECOVERY Current Load is now: OK on deployment-apache23 i-00000270 output: OK - load average: 4.10, 2.89, 3.84 [07:32:44] RECOVERY Current Load is now: OK on ganglia-collector i-000000b7 output: OK - load average: 2.86, 3.56, 4.64 [07:32:44] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 3.28, 3.47, 4.51 [07:32:44] RECOVERY Current Load is now: OK on mobile-feeds i-000000c1 output: OK - load average: 1.19, 2.72, 4.82 [07:32:44] PROBLEM Current Load is now: WARNING on upload-wizard i-0000021c output: WARNING - load average: 4.00, 4.65, 6.13 [07:32:44] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 3.79, 4.21, 4.91 [07:32:49] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 7.94, 7.22, 7.03 [07:32:49] PROBLEM Current Load is now: WARNING on gerrit i-000000ff output: WARNING - load average: 6.92, 5.72, 5.38 [07:32:54] PROBLEM Current Load is now: WARNING on deployment-transcoding i-00000105 output: WARNING - load average: 9.87, 9.12, 8.00 [07:32:54] PROBLEM Current Load is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [07:34:30] RECOVERY Current Load is now: OK on mailman-01 i-00000235 output: OK - load average: 0.32, 1.34, 3.89 [07:34:30] RECOVERY Current Load is now: OK on swift-be3 i-000001c9 output: OK - load average: 0.93, 3.26, 4.97 [07:35:24] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:36:13] RECOVERY Current Load is now: OK on labs-lvs1 i-00000057 output: OK - load average: 0.71, 1.44, 3.29 [07:36:14] PROBLEM Current Load is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:14] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [07:37:34] RECOVERY Current Load is now: OK on labs-nfs1 i-0000005d output: OK - load average: 0.65, 1.63, 3.61 [07:37:53] PROBLEM Current Load is now: WARNING on bots-sql2 i-000000af output: WARNING - load average: 9.86, 11.04, 10.27 [07:37:53] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 100 processes [07:38:09] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [07:38:09] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [07:39:11] RECOVERY Current Load is now: OK on swift-be2 i-000001c8 output: OK - load average: 1.17, 1.57, 3.63 [07:39:11] RECOVERY Current Load is now: OK on swift-be4 i-000001ca output: OK - load average: 1.67, 1.66, 3.95 [07:39:11] RECOVERY Current Load is now: OK on gerrit i-000000ff output: OK - load average: 0.79, 2.54, 4.09 [07:39:11] RECOVERY Current Load is now: OK on jenkins2 i-00000102 output: OK - load average: 0.40, 2.03, 4.18 [07:39:11] PROBLEM Puppet freshness is now: CRITICAL on localpuppet1 i-0000020b output: Puppet has not run in last 20 hours [07:39:11] PROBLEM Current Load is now: WARNING on aggregator-test2 i-0000024e output: WARNING - load average: 5.06, 5.29, 5.51 [07:41:04] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:04] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:04] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:04] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:41:14] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:41:57] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 11.37, 8.25, 6.91 [07:41:57] PROBLEM Current Load is now: WARNING on robh2 i-000001a2 output: WARNING - load average: 2.01, 4.17, 5.28 [07:43:21] PROBLEM Current Load is now: WARNING on rds i-00000207 output: WARNING - load average: 6.48, 6.64, 6.83 [07:43:21] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [07:43:21] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [07:43:21] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [07:43:21] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 85 processes [07:43:25] helllo [07:43:26] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [07:43:26] PROBLEM Disk Space is now: CRITICAL on reportcard2 i-000001ea output: Connection refused or timed out [07:43:27] PROBLEM Current Users is now: CRITICAL on reportcard2 i-000001ea output: Connection refused or timed out [07:43:27] PROBLEM Total Processes is now: CRITICAL on reportcard2 i-000001ea output: Connection refused or timed out [07:43:32] PROBLEM dpkg-check is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:37] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:44:51] paravoid: Hi :-D [07:45:05] paravoid: I believe that is the cronjobs that kills the lab [07:45:37] PROBLEM Current Users is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:37] PROBLEM Disk Space is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:37] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:42] RECOVERY Current Load is now: OK on deployment-imagescaler01 i-0000025a output: OK - load average: 0.12, 0.63, 3.78 [07:47:13] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 86% free memory [07:47:13] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK [07:47:13] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 0 users currently logged in [07:47:13] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK [07:47:18] RECOVERY Current Load is now: OK on robh2 i-000001a2 output: OK - load average: 0.15, 1.71, 3.91 [07:47:19] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [07:48:44] or the bots [07:48:45] :-D [07:49:27] !bots [07:49:27] http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots [07:49:38] It's around the same time very morning though [07:50:14] RECOVERY Current Load is now: OK on deployment-transcoding i-00000105 output: OK - load average: 0.84, 1.91, 4.44 [07:50:16] hashar: hi [07:50:19] PROBLEM Current Load is now: CRITICAL on aggregator-test2 i-0000024e output: CRITICAL - load average: 33.83, 13.69, 8.17 [07:50:24] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:52:59] Damianz: should be that [07:53:33] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory [07:53:33] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [07:53:34] RECOVERY Free ram is now: OK on mobile-testing i-00000271 output: OK: 79% free memory [07:53:34] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 1.54, 5.80, 11.82 [07:53:42] all ubuntu instances have their cron jobs kicking in at 6:25 UTC (aka 1 hour and a half ago) [07:54:06] The root cause is an I/O issue though [07:54:19] ?? [07:54:48] The random lag is I/O, gluster-wise apparently, and is probably what causes the total drop out [07:54:53] PROBLEM Current Load is now: WARNING on aggregator-test2 i-0000024e output: WARNING - load average: 1.84, 6.43, 6.53 [07:55:47] and gluster goes wild because everyone starts moving their huge log files at the same minute :-D [07:56:05] Hence why my logs rotate based on size :D [07:56:11] <3 supervisord [07:56:21] yeahh http://ganglia.wmflabs.org/latest/?c=bots&h=bots-cb&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F21%2F2012%2004%3A00%20&ce=05%2F21%2F2012%2008%3A00%20 [07:56:32] bots-cb had like 120 procs :) [07:56:39] Only 120? [07:56:50] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 103 processes [07:57:18] ah, better reporting of ganglia now :) [07:57:28] Ganglia is shiny [07:57:45] do any of you know about the bot infrastructure ? [07:58:05] bots-apache1 is 100% CPU since yesterday [07:58:08] Interestingly bots-cb doesn't use project storage or much disk when running yet it still dies [07:58:30] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 0.25, 1.37, 3.85 [07:58:36] I know about it somewhat, really busy with work atm though. [07:58:44] I can understand :-D [07:58:54] will ask petan whenever he is back around [08:00:00] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 2.55, 3.03, 4.32 [08:00:00] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.63, 1.38, 3.84 [08:08:06] PROBLEM Free ram is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:06] PROBLEM Disk Space is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:06] PROBLEM Current Users is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:08:06] PROBLEM dpkg-check is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds.
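Rotating on size, as Damianz describes above, keeps every instance from rewriting its big logs on gluster at the same minute. A sketch of a logrotate drop-in for a hypothetical bot log (path and sizes are made up); note that files under /etc/logrotate.d are still only evaluated when logrotate itself runs, so this spreads the write load rather than removing the daily run:

    # hypothetical drop-in: rotate when the log exceeds 50M rather than daily
    sudo tee /etc/logrotate.d/mybot <<'EOF'
    /var/log/mybot/*.log {
        size 50M
        rotate 5
        compress
        missingok
        notifempty
    }
    EOF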
[08:08:56] meh [08:09:04] hashar: hi [08:11:51] Here is the bug https://bugzilla.wikimedia.org/36993 -- Labs cluster dies daily at roughly 6:30 UTC [08:11:51] ;) [08:12:00] petan|wk: looks like some "bots" instances are in trouble [08:12:07] apache1 is 100% CPU since yesterday [08:12:09] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 106 processes [08:12:14] bots-1 & bots-2 are both 100% cpu [08:12:52] http://ganglia.wmflabs.org/latest/?r=custom&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00+&m=cpu_report&s=by+name&c=bots&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [08:12:58] aha [08:14:51] cpu doesn't matter to me, only load [08:15:09] problem is that load might just be some I/O wait ;-D [08:15:22] And if you're wanked on cpu then your load is going to climb [08:15:32] I will probably ask to add some disk usage metrics in Ganglia [08:15:41] so we find out who is heavily writing / reading from "disks" [08:15:46] Damianz: simple english? [08:15:48] :D [08:16:18] which kind of disks? [08:16:28] /data/project or others? [08:17:37] I guess any disk [08:17:51] since everything is going to hit the same disk array isn't it? [08:18:04] hmm, looks like me then... [08:18:11] New patchset: Dzahn; "adjust path to update file now in /usr/lib/..." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8307 [08:18:26] /data/* is a seperate gluster cluster [08:18:26] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8307 [08:18:37] Everything else is images on gluster on the virt nodes [08:18:50] New review: Dzahn; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8307 [08:18:50] if its *that* cluster then ping me :) [08:18:53] Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8307 [08:19:23] 05/21/2012 - 08:19:23 - Updating keys for petrb at /export/home/deployment-prepbackup/petrb [08:19:28] so I just need more details though [08:19:28] 05/21/2012 - 08:19:28 - Updating keys for petrb at /export/home/upload-wizard/petrb [08:19:34] 05/21/2012 - 08:19:33 - Updating keys for petrb at /export/home/hugglewa/petrb [08:19:40] 05/21/2012 - 08:19:39 - Updating keys for petrb at /export/home/gareth/petrb [08:20:13] 05/21/2012 - 08:20:13 - Updating keys for petrb at /export/home/openstack/petrb [08:20:15] 05/21/2012 - 08:20:15 - Updating keys for petrb at /export/home/turnkey-mediawiki/petrb [08:20:17] 05/21/2012 - 08:20:17 - Updating keys for petrb at /export/home/bots/petrb [08:20:18] 05/21/2012 - 08:20:17 - Updating keys for petrb at /export/home/nagios/petrb [08:20:22] 05/21/2012 - 08:20:21 - Updating keys for petrb at /export/home/bastion/petrb [08:20:23] 05/21/2012 - 08:20:23 - Updating keys for petrb at /export/home/huggle/petrb [08:20:27] 05/21/2012 - 08:20:26 - Updating keys for petrb at /export/home/search/petrb [08:20:27] 05/21/2012 - 08:20:26 - Updating keys for petrb at /export/home/deployment-prep/petrb [08:20:33] * Beetstra is afraid that he is guilty of massive data transfer with linkwatcher.pl ... [08:20:55] It does several mysql queries a second .. [08:21:29] PROBLEM Free ram is now: WARNING on aggregator-test i-0000024d output: Warning: 19% free memory [08:22:12] !ping [08:22:12] pong [08:22:50] !log deployment-prep installing iotop on deployment-nfs-memc [08:22:52] Logged the message, Master [08:23:18] With 80 external links added per minute, that is 1.33 per second .. 
one insert, and a handful of COUNTs and SELECTs per link .. [08:23:37] hashar: bots-1 load is 0.1 [08:23:46] I don't think it's overloaded hm [08:24:08] also if it was, nagios would be alarming us [08:24:19] I don't seem guilty :) [08:24:20] @search log [08:24:20] Results (found 11): morebots, labs-morebots, credentials, logging, terminology, newgrp, initial-login, requests, hyperon, logs, hashar, [08:24:26] !logs [08:24:26] http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs&query=$1 [08:25:07] Beetstra: where does it run [08:25:13] which sql server it uses [08:25:25] linkwatcher is sole bot on bots-2, storing on bots-sql2 now [08:25:45] I am using large caches, so there should be no problem making a lot of sql queries [08:25:48] (it stuffed bots-sql3 already ..) [08:25:57] it wouldn't use disk anyway probably [08:26:11] OK .. if that is not it [08:26:23] I know that linkwatcher.pl is a beast for the machines it runs on [08:26:42] bots-2 is a bit loaded but that's normal [08:26:50] The previous box could not do it .. and now bots-2 is pretty loaded [08:27:19] !load-all [08:27:19] http://ganglia.wikimedia.org/2.2.0/?c=Virtualization%20cluster%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [08:27:24] !load [08:27:24] http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?h=virt2.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1327006829&g=load_report&z=large&c=Virtualization%20cluster%20pmtpa [08:27:46] peace at virt5 [08:29:28] Hmm .. they all seem to peak about 1.5 hours ago [08:29:52] I added some info at https://bugzilla.wikimedia.org/36993 [08:30:05] 6:25am UTC (it is 8:25 UTC now), aka 2 hours ago [08:30:15] that is the default time for Ubuntu daily cron [08:30:18] 05/21/2012 - 08:30:18 - Updating keys for laner at /export/home/deployment-prep/laner [08:30:27] We can probably stagger them via puppet, methinks. [08:30:55] what I would love is to find out which process causes the massive I/O load :-] [08:31:24] What about running all the cron tasks again in 30 minutes, and looking? [08:31:28] one way would be to manually change the time of the daily cronjob on some instances to bisect the issue [08:31:33] hehe [08:31:39] * Beetstra feels brilliant [08:31:59] well if the issue is a job that processes a day of logs [08:32:14] it will only process 2 hours' worth of logs and might not actually kill the cluster [08:32:15] :-D [08:32:37] Still you might be able to see that one does more work than others [08:32:42] probably [08:32:58] And normally it peaks for an hour .. it must be significant even after 2 hours [08:33:54] bots-cb has a nice user/system CPU spike at that time : http://ganglia.wmflabs.org/latest/?r=custom&cs=05%2F21%2F2012+5%3A00+&ce=05%2F21%2F2012+9%3A00+&m=cpu_report&s=by+name&c=bots&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [08:34:17] though load happens before that [08:34:19] grr [08:34:35] bots-cb is a bad example - it will be spiky all the time due to the way it forks [08:34:45] But around half 8 it does lag out and sometimes crash [08:35:03] the instances (at least bots-2) are slooowww in reacting to commands [08:35:12] I/O is generally slow [08:35:21] I can push a bot to a load of 20 by running ls [08:35:25] s/bot/box/ [08:38:06] yeah loaded again :D [08:39:47] While talking about slowness .. I am moving a HUGE table in 10 pieces from bots-sql3 to bots-sql2 - downloaded .sql files, SOURCE-ing them manually into sql - it takes 5-30 seconds to source ~150 records ..
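On the slow table move just described: streaming the dump straight from bots-sql3 into bots-sql2 avoids both the per-statement round trips of SOURCE and the phpmyadmin imports that die after a few million rows. A sketch only; database, table and user names are placeholders, and credentials would come from ~/.my.cnf rather than the command line:

    # dump on the old host and load on the new one in a single pipeline
    mysqldump -h bots-sql3 -u USER SOMEDB linkwatcher_table \
      | mysql -h bots-sql2 -u USER SOMEDB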
[08:40:20] (copying them through phpmyadmin fails constantly for some reason - just stops after a couple of million records) [08:41:30] Query OK, 145 rows affected (50.85 sec) [08:41:30] Records: 145 Duplicates: 0 Warnings: 0 [08:41:38] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:38] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [08:42:50] oh god no [08:42:50] hashar: that's a problem with gluster, dunno where [08:42:54] maybe some disk failed again [08:43:04] it shouldn't really happen [08:43:42] if there is really a time factor in this, then it's someone being funny or something [08:44:19] well bots are having fun again http://ganglia.wmflabs.org/latest/?c=bots&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [08:44:35] I am not sure why bots-cb has so many procs though [08:44:57] Because it creates a fork for each request it processes [08:45:09] Really should use thread pools but not got time to re-write bits atm [08:45:11] !nagios [08:45:11] http://nagios.wmflabs.org/nagios3 [08:45:24] hashar: nagios probably has more accurate data and it seems ok there [08:45:53] 107 processes on -cb [08:45:56] that's ok [08:46:45] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [08:49:03] !log deployment-prep rebooting deployment-feed [08:49:05] Logged the message, Master [08:49:57] this doesn't look good... [08:51:01] yep, it's blown [08:51:14] it ? [08:51:19] what are you referring to? [08:51:36] no activity from labs [08:51:37] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [08:51:37] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:10] EXT3-fs: INFO: recovery required on readonly filesystem. [08:53:12] yeah!!! [08:53:23] !log deployment-prep Looks like -feed is dead : EXT3-fs: INFO: recovery required on readonly filesystem. [08:53:25] Logged the message, Master [08:53:37] hashar: hi, what's up [08:53:43] i did not read all the backlog yet, but i can report something [08:53:45] hello :-] [08:53:53] on the weekend i got this behaviour: [08:54:06] changing a file in /var/www on an instance, saving it [08:54:16] reloading in browser and DANG --> 404 [08:54:24] waiting / reloading a while.. and back it is [08:54:29] yeah I/O can be very slow [08:54:35] like after every save, it is "gone" for a while [08:54:58] vim !! ;-D [08:55:13] because it deletes the file, then writes a new one [08:55:14] but it's not just slow to apply changes.. the webserver thinks it is actually not there at all (for a while) [08:55:22] ah, yea, makes sense [08:55:23] so with high I/O latency, at one point, the file just does not exist anymore [08:55:25] @vim [08:55:39] ack, /me nods [08:55:51] that causes some blank pages on deployment-prep (stuff like: PHP Fatal Error - CommonSettings.php does not exist ) [08:56:22] yea, like i edited my global config at one point... and bam [08:56:38] btw, have you ever seen such an error: EXT3-fs: INFO: recovery required on readonly filesystem.
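On the EXT3 "recovery required on readonly filesystem" error just quoted: ext3 drops to read-only after write errors when the filesystem is mounted with errors=remount-ro, and the journal recovery then runs on the next mount, which can take a while. A quick sketch for confirming the mount option and forcing a full fsck on the next boot; whether /forcefsck is still honoured depends on the Ubuntu release on the instance, so treat that part as an assumption:

    # is the root filesystem set to go read-only on errors?
    grep 'errors=' /etc/fstab
    mount | grep ' on / '
    # ask for a full fsck on the next boot, then reboot
    sudo touch /forcefsck
    sudo reboot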
[08:56:46] that is on the deployment-feed instance [08:56:52] looks like its FS is corrupted [08:57:00] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=deployment-prep&instanceid=i-00000118 [08:57:01] console log [08:57:02] yes, well, sounds like you want fsck [08:57:13] ohh [08:57:25] oh can I attach to the console? *grin* [08:57:32] :-] [08:57:44] you could just reboot, and it should trigger it.. usually [08:58:02] !log deployment-prep rerebooting deployment-feed [08:58:03] Logged the message, Master [08:58:19] Failed to reboot instance. [08:58:22] unlucky [08:58:23] ouch [08:58:36] well, I tried to reboot it a few minutes ago [08:58:48] and the box eventually hung at that message while rebooting [08:58:55] oh wait, it is just mounted readonly anyways [08:58:59] guess now it is locked in some "wait for user to fix it" state [08:59:07] hrmm, ok [08:59:23] bah [08:59:25] it is rebooting [08:59:27] ahah [08:59:38] ah, good sign, the delay may have been the "fsck" run :) [08:59:41] still hangs at the same NFS error message [08:59:58] [ 1.574317] EXT3-fs: INFO: recovery required on readonly filesystem. [08:59:58] [ 1.579836] EXT3-fs: write access will be enabled during recovery. [09:00:16] maybe the fsck is actually running [09:00:21] and the console output does not show it [09:00:25] s output [09:00:40] yea [09:01:06] PROBLEM Current Load is now: WARNING on swift-be2 i-000001c8 output: WARNING - load average: 10.85, 12.46, 9.70 [09:01:06] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 5.47, 5.41, 5.76 [09:01:50] hashar: so i think this is related. NFS write errors in general. and then ext3 decides to switch to read-only if it encounters too many errors [09:03:50] if there is something like errors=remount-ro in /etc/fstab [09:04:59] hoho [09:05:05] PROBLEM dpkg-check is now: CRITICAL on bots-cb i-0000009e output: Connection refused or timed out [09:05:06] PROBLEM Free ram is now: CRITICAL on bots-cb i-0000009e output: Connection refused or timed out [09:05:28] seems like something bad is escalating? [09:06:58] mutante: yeah it recovered!!! [09:07:05] hashar: pheew :) [09:07:19] hashar: /var/log/fsck ? [09:07:19] http://dpaste.org/peMRM/ [09:07:28] took 240 sec to recover the FS [09:07:36] looks like the kernel got upgraded meanwhile [09:07:43] fairly normal that it takes a while [09:08:10] !log deployment-prep -feed took like 240 sec to recover and apparently upgraded its kernel [09:08:32] PROBLEM HTTP is now: CRITICAL on hugglewiki i-000000aa output: CRITICAL - Socket timeout after 10 seconds [09:09:21] hashar: so a bug is that it tells you "failed" too early? [09:09:58] the bug is that the console does not give a clue about the recovery progress [09:10:27] anyway that one is solved [09:10:50] mutante: do you have any nova tool to find out how much data an instance is trying to write to disk ? [09:10:55] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 3.30, 5.35, 11.51 [09:10:57] the labs cluster is dying again [09:11:07] http://ganglia.wmflabs.org/latest/ [09:11:37] hrmm, i doubt nova-manage can .. checking [09:12:16] load average: 463.90, 463.10, 442.89 [09:12:18] on virt1 [09:12:50] I am pretty sure one of the instances is running a heavy I/O processing job [09:13:04] which might even be triggered daily at 6:25am by the Ubuntu daily cronjob [09:13:09] that happened this morning [09:27:57] mutante: have you found anything ?
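While mutante checks whether nova-manage can report per-instance writes: iotop (installed on deployment-nfs-memc earlier) shows per-process disk I/O from inside an instance, and the 6:25 cron hypothesis can be bisected by shifting cron.daily's start time on a few instances at a time, as suggested this morning. A sketch; the replacement hour is arbitrary:

    # Stock Ubuntu /etc/crontab starts cron.daily at 06:25:
    #   25 6  * * *  root  test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
    # Shift it an hour later on one instance at a time to see whose daily
    # jobs cause the I/O storm.
    sudo sed -i 's/^25 6\(.*cron\.daily.*\)$/25 7\1/' /etc/crontab
    # meanwhile, watch per-process disk I/O live (-o shows only active tasks)
    sudo iotop -o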
[09:29:37] hashar: I don't know if there is something to be found [09:29:47] links about metering in openstack and stuff, but not really something we already have to just execute [09:29:56] wrote a mail to ryan though asking about it [09:30:15] we should probably start some monitoring daemon on gluster to check which files are being accessed so much etc [09:30:23] i also found there is an openstack "planet" [09:30:29] http://planet.openstack.org/ [09:30:53] planets are great [09:30:56] http://wiki.openstack.org/EfficientMetering [09:31:01] though I already receive too many notifications [09:31:37] http://wiki.openstack.org/SystemUsageData [09:31:50] <--this looks like what we already use in wiki RC [09:32:13] but does not talk about resource usage [09:32:42] the link on metering does, they want to even implement a billing system based on usage [09:33:12] PROBLEM Current Load is now: WARNING on labs-nfs1 i-0000005d output: WARNING - load average: 8.41, 7.57, 6.83 [09:33:12] PROBLEM Current Load is now: WARNING on bots-2 i-0000009c output: WARNING - load average: 13.95, 12.86, 10.81 [09:33:12] PROBLEM Total Processes is now: WARNING on nagios 127.0.0.1 output: PROCS WARNING: 303 processes [09:33:12] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 7.95, 7.20, 7.29 [09:35:10] * hashar looks for gmetrics [09:36:18] mutante: right now it seems that gluster is down [09:36:29] !ping [09:36:35] pong [09:36:40] everything what writes to disk is lagged [09:36:51] or read [09:37:02] !blah is xx [09:37:03] Key was added! [09:37:05] !blag [09:37:06] !blah [09:37:07] xx [09:37:17] !blah [09:37:21] xx [09:37:23] !blah del [09:37:25] Successfully removed blah [09:37:32] right, it seems that bot can write to disk [09:37:37] but anyway [09:37:44] I can't even ssh to bastion now [09:37:53] !sexytime is xxx [09:37:53] Key was added! [09:37:55] :D [09:38:29] PROBLEM Current Load is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:34] PROBLEM Current Load is now: CRITICAL on deployment-syslog i-00000269 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:35] PROBLEM Disk Space is now: CRITICAL on deployment-syslog i-00000269 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:35] PROBLEM Total Processes is now: CRITICAL on deployment-syslog i-00000269 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:38:40] PROBLEM HTTP is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - Socket timeout after 10 seconds [09:38:47] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds [09:39:10] PROBLEM Total Processes is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:05] PROBLEM Total Processes is now: CRITICAL on nagios 127.0.0.1 output: PROCS CRITICAL: 481 processes [09:40:40] PROBLEM dpkg-check is now: CRITICAL on deployment-syslog i-00000269 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:45] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. 
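On getting disk metrics into Ganglia: gmetric can push arbitrary values from any instance, so a small cron'd script reading /proc/diskstats would show which hosts are hammering their disks. A sketch; the device name (vda) and the metric names are assumptions, and the values are cumulative counters since boot, so the graphs show totals rather than rates:

    #!/bin/bash
    # /proc/diskstats: field 3 = device, 6 = sectors read, 10 = sectors written
    rsect=$(awk '$3=="vda" {print $6}'  /proc/diskstats)
    wsect=$(awk '$3=="vda" {print $10}' /proc/diskstats)
    gmetric --name disk_sectors_read    --value "$rsect" --type double --units sectors
    gmetric --name disk_sectors_written --value "$wsect" --type double --units sectors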
[09:41:00] RECOVERY Free ram is now: OK on bots-cb i-0000009e output: OK: 65% free memory [09:41:10] PROBLEM Current Load is now: WARNING on hugglewa-db i-00000188 output: WARNING - load average: 4.09, 5.15, 6.25 [09:41:15] PROBLEM Current Load is now: WARNING on swift-be3 i-000001c9 output: WARNING - load average: 6.06, 5.98, 6.38 [09:41:15] PROBLEM Current Load is now: WARNING on swift-fe1 i-000001d2 output: WARNING - load average: 5.97, 5.89, 6.03 [09:41:15] PROBLEM Current Load is now: WARNING on publicdata-administration i-0000019e output: WARNING - load average: 5.99, 5.68, 5.38 [09:41:21] mutante: we have problems ^ [09:41:21] PROBLEM Current Load is now: WARNING on bots-4 i-000000e8 output: WARNING - load average: 5.34, 6.00, 6.57 [09:41:21] PROBLEM Current Load is now: WARNING on deployment-transcoding i-00000105 output: WARNING - load average: 6.33, 6.06, 5.85 [09:41:30] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 12.81, 9.88, 8.09 [09:41:40] PROBLEM Current Users is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:41] PROBLEM Disk Space is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:41] PROBLEM SSH is now: CRITICAL on bots-apache1 i-000000b0 output: CRITICAL - Socket timeout after 10 seconds [09:41:55] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:01] PROBLEM Disk Space is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:01] PROBLEM Total Processes is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:23] I think nagios died [09:43:43] at some point it would be nice to have nagios on separate server [09:43:43] I think we're fucked on I/O [09:43:54] And yes, nagios on a seperate server would be good. [09:43:59] heh [09:44:25] My bots are still running but REALLY slowly. [09:44:44] !ping [09:44:44] pong [09:44:49] my is running fine [09:44:52] :D [09:47:53] PROBLEM dpkg-check is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:53] PROBLEM Free ram is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:53] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:21] PROBLEM Free ram is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:22] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:28] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:28] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:28] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:28] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:56] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:56] PROBLEM dpkg-check is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:48:56] PROBLEM Total Processes is now: CRITICAL on bots-apache1 i-000000b0 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:38] Eh, by I/O, do you people mean the network usage is high? [09:50:15] Disk but never wise it would affect disk [09:50:21] Just throw things at lcarr network wise :D [09:50:52] oh, so you mean write operations to the disks? [09:51:02] or read [09:51:06] and not network like downloading/uploading [09:51:14] try running ls on some nodes thoughout the day [09:51:17] they are sloooooo [09:51:19] wwww [09:51:28] oh thank... nevermind [09:51:31] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:35] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:35] PROBLEM Free ram is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:35] PROBLEM SSH is now: CRITICAL on bots-sql2 i-000000af output: CRITICAL - Socket timeout after 10 seconds [09:51:45] though I must say I am dominating the network graphs on wmflabs [09:51:47] :) [09:51:57] Porn downloads? [09:52:05] zzz [09:52:09] http://pornfortheblind.org/ < :D [09:52:11] Hydriz: in this case, I guess I/O is referring to disk activity [09:52:18] ahah [09:52:30] lol then its still not me haha [09:52:48] its designed to be running quite slowly :) [09:53:22] compare http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=dumps&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1337593974&g=network_report&z=large&c=dumps and http://ganglia.wmflabs.org/latest/graph_all_periods.php?me=wmflabs&m=load_one&r=hour&s=by%20name&hc=4&mc=2&g=network_report&z=large [09:53:29] quite similar haha [09:54:48] PROBLEM Current Load is now: WARNING on grail i-0000021e output: WARNING - load average: 6.87, 7.75, 6.90 [09:54:48] PROBLEM Current Load is now: WARNING on robh2 i-000001a2 output: WARNING - load average: 5.91, 5.66, 6.08 [09:54:48] PROBLEM Current Load is now: WARNING on wikidata-dev-3 i-00000225 output: WARNING - load average: 5.99, 6.06, 5.92 [09:54:48] PROBLEM Current Load is now: WARNING on deployment-web3 i-00000219 output: WARNING - load average: 8.90, 8.08, 7.54 [09:54:48] PROBLEM Current Load is now: WARNING on venus i-000000ea output: WARNING - load average: 2.79, 4.27, 5.21 [09:54:48] PROBLEM Current Load is now: WARNING on demo-web1 i-00000255 output: WARNING - load average: 4.69, 5.54, 5.76 [09:54:49] PROBLEM Current Load is now: WARNING on deployment-apache20 i-0000026c output: WARNING - load average: 6.34, 6.03, 5.85 [09:54:49] PROBLEM Current Load is now: WARNING on ee-prototype i-0000013d output: WARNING - load average: 5.55, 7.91, 7.21 [09:54:50] RECOVERY HTTP is now: OK on hugglewiki i-000000aa output: HTTP OK: HTTP/1.1 200 OK - 901 bytes in 0.034 second response time [09:55:04] oh yes, does openstackmanager have a delete project feature? [09:55:19] Should do. [09:55:27] I see [09:55:41] there seems to be some projects that doesn't seem to be used at all [09:55:50] though I can't really list any now... 
[09:56:48] I am pretty sure we do not deletes projects [09:56:53] Ryan wrote about that on some bug report [09:57:32] looks like its an issue on the server side then [09:57:33] here it is : https://bugzilla.wikimedia.org/show_bug.cgi?id=36241#c5 [09:58:38] I see [09:58:50] "You should use larger instances" [09:59:01] seems to refer more to the bots cluster :) [09:59:03] PROBLEM Total Processes is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Current Users is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Disk Space is now: CRITICAL on deployment-thumbproxy i-0000026b output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Free ram is now: CRITICAL on dumps-2 i-00000257 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Current Load is now: CRITICAL on dumps-6 i-00000266 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Free ram is now: CRITICAL on dumps-6 i-00000266 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:43] PROBLEM Total Processes is now: CRITICAL on dumps-6 i-00000266 output: CHECK_NRPE: Socket timeout after 10 seconds. [09:59:48] PROBLEM dpkg-check is now: CRITICAL on dumps-6 i-00000266 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:00:21] now dumps project is affected [10:00:30] but somehow the files are still being uploaded nicely though [10:01:56] and it just stopped 1 minute ago :( [10:02:18] and back, so it should probably not affect network I guess... [10:05:47] @log sql2 [10:08:56] Hydriz: actually when you download / upload, that also produce disk I/O [10:09:36] but it has a delay in every request [10:09:42] RECOVERY Total Processes is now: OK on nagios 127.0.0.1 output: PROCS OK: 159 processes [10:09:51] RECOVERY Current Load is now: OK on swift-be2 i-000001c8 output: OK - load average: 2.67, 3.00, 4.27 [10:09:51] RECOVERY Current Load is now: OK on hugglewa-db i-00000188 output: OK - load average: 0.80, 0.63, 1.65 [10:09:51] RECOVERY Current Load is now: OK on swift-fe1 i-000001d2 output: OK - load average: 2.57, 1.97, 2.88 [10:09:51] RECOVERY Current Load is now: OK on publicdata-administration i-0000019e output: OK - load average: 0.51, 0.68, 1.65 [10:09:51] RECOVERY Current Load is now: OK on bots-4 i-000000e8 output: OK - load average: 0.95, 0.81, 2.18 [10:09:51] RECOVERY Disk Space is now: OK on bots-apache1 i-000000b0 output: DISK OK [10:09:52] RECOVERY Current Users is now: OK on bots-apache1 i-000000b0 output: USERS OK - 0 users currently logged in [10:10:01] RECOVERY dpkg-check is now: OK on deployment-syslog i-00000269 output: All packages OK [10:10:06] RECOVERY SSH is now: OK on bots-apache1 i-000000b0 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:10:06] RECOVERY Free ram is now: OK on bots-apache1 i-000000b0 output: OK: 84% free memory [10:10:10] RECOVERY Total Processes is now: OK on bots-apache1 i-000000b0 output: PROCS OK: 113 processes [10:10:11] RECOVERY dpkg-check is now: OK on bots-apache1 i-000000b0 output: All packages OK [10:10:43] though I hope that Wikimedia can just directly give me a physical server on their network and allow me to access it so that I can just upload straight from dataset1001 [10:12:00] * Damianz implants some cat7 into Hydriz and offers him up for connection [10:12:42] * Hydriz mounts Damianz as an NFS server :) [10:13:51] * Damianz lags on on I/O and causes Hydriz loads of issues 
[10:14:26] PROBLEM Current Load is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:14:26] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:14:36] PROBLEM Current Load is now: WARNING on deployment-imagescaler01 i-0000025a output: WARNING - load average: 21.99, 21.25, 18.44 [10:14:40] PROBLEM Current Load is now: WARNING on mailman-01 i-00000235 output: WARNING - load average: 6.69, 6.54, 6.62 [10:14:40] RECOVERY Current Load is now: OK on dumps-6 i-00000266 output: OK - load average: 1.66, 1.64, 2.84 [10:14:40] RECOVERY Free ram is now: OK on dumps-6 i-00000266 output: OK: 84% free memory [10:14:40] RECOVERY Total Processes is now: OK on dumps-6 i-00000266 output: PROCS OK: 93 processes [10:14:53] RECOVERY dpkg-check is now: OK on dumps-6 i-00000266 output: All packages OK [10:14:53] RECOVERY Current Load is now: OK on grail i-0000021e output: OK - load average: 0.36, 0.83, 2.66 [10:14:53] RECOVERY Current Load is now: OK on wikidata-dev-3 i-00000225 output: OK - load average: 1.78, 3.80, 4.85 [10:14:53] RECOVERY Current Load is now: OK on deployment-web3 i-00000219 output: OK - load average: 2.35, 2.74, 4.56 [10:14:53] RECOVERY Current Load is now: OK on venus i-000000ea output: OK - load average: 1.09, 0.85, 1.97 [10:14:53] RECOVERY Current Load is now: OK on demo-web1 i-00000255 output: OK - load average: 0.71, 1.21, 3.09 [10:14:54] RECOVERY Current Users is now: OK on deployment-thumbproxy i-0000026b output: USERS OK - 0 users currently logged in [10:14:54] RECOVERY Disk Space is now: OK on deployment-thumbproxy i-0000026b output: DISK OK [10:14:55] RECOVERY Total Processes is now: OK on deployment-feed i-00000118 output: PROCS OK: 106 processes [10:14:58] RECOVERY Free ram is now: OK on dumps-2 i-00000257 output: OK: 92% free memory [10:14:59] \o/ everything is back after Damianz replaces labstore1 :) [10:15:12] :D [10:15:12] Hydriz: are you loading some database dump actually ? [10:15:15] nope [10:15:40] rather, what are you referring to? [10:16:04] well we are trying to find out what is generating a lot of I/O disk activity and basically killing glusterfs [10:16:08] ;-D [10:16:59] * Hydriz is hands-free from this :P [10:18:08] PROBLEM Current Users is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:18:08] PROBLEM Free ram is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:18:08] PROBLEM dpkg-check is now: CRITICAL on deployment-feed i-00000118 output: CHECK_NRPE: Socket timeout after 10 seconds. 
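For the guest side of the same hunt (what is generating the I/O that is killing gluster), a sketch along these lines could be run inside a suspect instance. It assumes the sysstat and iotop packages are installed there, which may not be the case.

    # Hedged sketch for inside an instance -- assumes sysstat (iostat/pidstat) and iotop are installed.
    iostat -dxk 5 3                                      # per-device throughput, await and %util
    pidstat -d 5 3                                       # per-process read/write kB/s
    sudo iotop -o -b -n 3 | head -30                     # only the processes actually doing I/O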
[10:19:41] :-( : [10:19:46] Query OK, 142 rows affected (30 min 11.64 sec) [10:19:49] Records: 142 Duplicates: 0 Warnings: 0 [10:22:18] RECOVERY Disk Space is now: OK on deployment-syslog i-00000269 output: DISK OK [10:22:18] RECOVERY Current Load is now: OK on deployment-syslog i-00000269 output: OK - load average: 0.00, 0.02, 0.04 [10:22:18] RECOVERY Total Processes is now: OK on deployment-syslog i-00000269 output: PROCS OK: 84 processes [10:22:23] RECOVERY Current Load is now: OK on deployment-thumbproxy i-0000026b output: OK - load average: 3.02, 3.59, 3.92 [10:22:23] RECOVERY Total Processes is now: OK on deployment-thumbproxy i-0000026b output: PROCS OK: 158 processes [10:27:17] PROBLEM Current Load is now: WARNING on deployment-feed i-00000118 output: WARNING - load average: 6.42, 7.44, 6.66 [10:27:22] RECOVERY Current Load is now: OK on wep i-000000c2 output: OK - load average: 0.43, 1.39, 3.23 [10:27:22] RECOVERY Current Users is now: OK on deployment-feed i-00000118 output: USERS OK - 0 users currently logged in [10:27:22] RECOVERY Free ram is now: OK on deployment-feed i-00000118 output: OK: 66% free memory [10:27:27] PROBLEM Current Load is now: CRITICAL on deployment-imagescaler01 i-0000025a output: CRITICAL - load average: 23.89, 22.68, 20.74 [10:27:27] RECOVERY Current Load is now: OK on mailman-01 i-00000235 output: OK - load average: 1.03, 2.26, 4.30 [10:27:27] RECOVERY Current Load is now: OK on deployment-apache20 i-0000026c output: OK - load average: 1.52, 1.76, 3.50 [10:27:27] RECOVERY dpkg-check is now: OK on deployment-feed i-00000118 output: All packages OK [10:27:27] RECOVERY Current Load is now: OK on ee-prototype i-0000013d output: OK - load average: 1.59, 3.48, 4.55 [10:27:27] RECOVERY HTTP is now: OK on bots-apache1 i-000000b0 output: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.609 second response time [10:37:54] PROBLEM Current Load is now: WARNING on ganglia-test2 i-00000250 output: WARNING - load average: 6.59, 7.77, 7.82 [10:41:50] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [10:49:12] Good morning. Where can i find the Commons/UploadWizard instance on labs? The project and instances links on https://labsconsole.wikimedia.org/wiki/Main_Page return nothing [10:49:34] mutante: Probably more on topic here ;-) [10:49:47] deployment-prep is also down for some read [10:49:50] reason* [10:49:52] PROBLEM Current Users is now: CRITICAL on bots-sql2 i-000000af output: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:02] PROBLEM Current Users is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:03] PROBLEM Disk Space is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:18] Hydriz: deployment-prep is down :( [10:50:30] my bots have serious problems as well [10:50:46] okay... lol [10:50:57] it should have been recovered though [10:51:18] And no-response on bots-2, got 'broken pipe'-d on bots-sql2 ... 
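A 30-minute query for 142 rows points at storage latency rather than MySQL itself. A crude, hedged probe like the one below could confirm that raw writes are the bottleneck; the target path is arbitrary and assumes /tmp sits on the instance's disk rather than on tmpfs.

    # Hedged sketch -- a rough write-latency probe; /tmp is assumed to be on-disk, not tmpfs.
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 oflag=direct 2>&1 | tail -1
    rm -f /tmp/ddtest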
[10:53:05] RECOVERY Current Load is now: OK on deployment-feed i-00000118 output: OK - load average: 0.81, 2.09, 3.99 [10:53:05] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [10:53:10] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 31% free memory [10:53:10] RECOVERY SSH is now: OK on bots-sql2 i-000000af output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:53:11] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 1.52, 2.10, 4.08 [10:53:42] multichill: there is an instance on http://commons.wikimedia.beta.wmflabs.org/ though it is broken right now [11:01:17] I called it. We're using https://test.wikipedia.org/ [11:02:12] mutante: Please don't shut down the prototypes right away. We probably want to recover some configurations before it gets shutdown [11:03:33] multichill: i am not. ryan might though. want me to just copy/paste that line to the list thread? [11:04:11] i replied about people from Wiki Loves Monuments "were/are" still using it [11:06:10] http://lists.wikimedia.org/pipermail/wikilovesmonuments/2012-May/002911.html [11:06:28] mutante: [11:06:51] you could reply to that mail and says that http://commons.wikimedia.beta.wmflabs.org/ is being completely rebuild :-] [11:07:16] I have been breaking it heavily for the last two weeks [11:08:38] well [11:08:38] I am out [11:08:42] multichill: ok, i meant the engineering thread about shutting down all those VMs.. how about this. i do the engineering list and you keep the wlm list informed [11:08:57] I am migrating to the coworking space and will do some perl this afternoon ;-] [11:08:59] PROBLEM Total Processes is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:09:10] and i need to install more analytics servers [11:09:20] <3 perl [11:09:37] We'll just use test. That's more suitable anyway. The testing is all end user. [11:09:37] I like perl too [11:09:58] TCL is great too [11:10:03] Ewwww [11:10:06] TCL is horrid :( [11:10:13] python would be the best if indentation was not part of the syntax [11:10:20] Python is awesome [11:10:38] I guess some of you had to learn Ruby since you moved to Puppet [11:11:07] TCL has some nice uses cases when it comes to text processing [11:11:15] hashar: TCL = the language people know from eggdrop :) [11:11:21] multichill: need any rights on test? [11:11:26] bbl [11:11:30] mutante: exactly :-]]] [11:11:38] mutante: and I loved eggdrop too [11:11:43] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:11:55] Thehelpfulone: Bureaucrat would be helpful so I can assign the rights to the right people [11:11:56] sstraddle: well puppet is some language build on top of ruby [11:11:59] hashar: yes, in a large botnet, chatting on partyline:) [11:12:02] sure [11:12:08] sstraddle: it is not really ruby though [11:12:24] mutante: or using it as a primitive p2p network using DCC :) [11:12:26] done [11:12:30] that was the good old days [11:12:37] ty [11:12:40] hashar: .bottree had to look nice :) [11:12:54] hashar: I see. [11:13:01] ok, out for a bit [11:13:18] sstraddle: there is a term for it. Cant remember [11:13:27] RECOVERY Current Load is now: OK on bots-2 i-0000009c output: OK - load average: 6.21, 4.94, 4.70 [11:13:51] ahh DSL [11:14:03] hashar: ralsh? 
[11:14:04] sstraddle: so yeah puppet is a Domain-specific language http://en.wikipedia.org/wiki/Domain-specific_language [11:14:17] build on top of ruby [11:14:37] hashar: replace some perl bots with TCL scripts all running on the same stable eggdrop package?:) /me runs [11:17:18] mutante: good idea :-D [11:17:28] * hashar replaces wikibugs & ircecho by eggdrop [11:18:01] * Damianz dies [11:18:13] Lets write a push based messaging framework and do crazy shit [11:20:00] mutante: setting up egg drop bots on the cluster will open lot of security flaws though :-D [11:20:20] I am out for now [11:20:30] going to H&M [11:20:37] then coworking place [11:21:15] I've read http://blog.wikimedia.org/2012/04/16/introduction-to-wikimedia-labs/ and I'm hoping to be able to help around Deployment-prep. Any mailing lists I should subscribe to to get up to speed? The Labs list looks like it's low traffic. [11:21:43] It is, but if you post someone will answer [11:23:16] Damianz: OK, sounds good, any other resources I can use? [11:23:29] !account-request [11:23:34] sstraddle: I became the lead dev for the deployment-prep project [11:23:38] !account-req [11:23:38] fuck what's that thing [11:23:41] !account [11:23:42] in order to get an access to labs, please type !account-questions and ask Ryan, or someone who is in charge of creating account on labs [11:23:43] There's a page to request acocunt [11:23:47] go there, fill in details [11:23:50] !account-questions [11:23:55] I need the following info from you: 1. Your preferred wiki user name. This will also be your git username, so if you'd prefer this to be your real name, then provide your real name. 2. Your preferred email address. 3. Your SVN account name, or your preferred shell account name, if you do not have SVN access. [11:23:56] Once you have an account bug people in the project about it [11:24:16] sstraddle: so I guess you can ask me for details. I am heading out to do some shopping and then join a coworking place. So I should be back in a bit more than an hour [11:24:28] sstraddle: OR use hashar @ free dot fr [11:24:31] Damianz: I've already emailed Sumanah for a private IP because I don't want my details public. [11:24:39] cya [11:24:57] s/IP/account/ [11:24:59] That works also [11:25:19] tell us your password :P [11:25:28] Sure Hydriz [11:25:30] Once you have an account basically bug hashar, petan, me or someone for access/info on projects - prefrably not me as I tend to just fix stuff :P [11:25:30] :) [11:25:34] * sstraddle 's afk [11:25:59] Will do Damianz, excuse me, BRB [11:26:41] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 19% free memory [11:28:04] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [11:28:04] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [11:28:04] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 96 processes [11:30:49] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [11:35:26] RECOVERY Total Processes is now: OK on test3 i-00000093 output: PROCS OK: 77 processes [11:40:19] RECOVERY Current Load is now: OK on swift-be4 i-000001ca output: OK - load average: 3.65, 2.76, 3.49 [11:40:59] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [11:41:00] PROBLEM Free ram is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:41:00] PROBLEM Current Users is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:00] PROBLEM Disk Space is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [11:41:00] PROBLEM Total Processes is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [11:45:24] 05/21/2012 - 11:45:24 - Creating a home directory for oren at /export/home/translation-memory/oren [11:46:26] 05/21/2012 - 11:46:26 - Updating keys for oren at /export/home/translation-memory/oren [11:54:29] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 104 processes [12:11:15] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 3.45, 7.66, 9.67 [12:11:15] RECOVERY SSH is now: OK on bots-cb i-0000009e output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:11:25] RECOVERY Free ram is now: OK on reportcard2 i-000001ea output: OK: 85% free memory [12:11:25] RECOVERY dpkg-check is now: OK on reportcard2 i-000001ea output: All packages OK [12:21:51] PROBLEM Free ram is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [12:22:04] mutante: willing to reboot the virt* instances ? :-D [12:23:04] :O [12:24:13] he is busy ;-D [12:25:19] but don't :P [12:25:34] unless its really hopeless [12:25:45] well it is currently fully scrwed [12:25:50] really? [12:25:52] !ping [12:26:09] pong [12:26:09] hm [12:26:09] true [12:26:09] :D [12:26:09] I am just not going to work on deployment-prep today [12:26:11] wow [12:26:19] lol [12:26:19] !ping [12:26:19] pong [12:26:23] 5 sec [12:26:31] !ping [12:26:31] pong [12:26:38] 8 this time... [12:26:58] but at least its not *fully* screwed [12:27:09] heh [12:27:23] or my work will really get screwed [12:28:36] @whoami [12:28:39] You are admin identified by name .*@wikimedia/Petrb [12:28:45] it's getting faster [12:28:50] meh [12:28:55] !nagios | hashar [12:29:05] hmm, doesn't seem like anything is wrong though [12:29:06] hashar: http://nagios.wmflabs.org/nagios3 [12:29:16] hashar: nagios works, and there is still lot of green [12:29:23] at least *my* labs wikis work well [12:29:55] http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [12:30:03] most instances are waiting for I/O access :-D [12:30:11] not a big deal though [12:30:15] will work on it tomorrow [12:30:19] if it get fixed overnight [12:30:21] ;-) [12:30:35] 23 instances :( [12:31:14] * Hydriz wishes for access to deployment-prep :) [12:31:56] PROBLEM Current Load is now: WARNING on deployment-web3 i-00000219 output: WARNING - load average: 6.65, 6.62, 6.79 [12:32:46] 05/21/2012 - 12:32:45 - Updating keys for laner at /export/home/deployment-prep/laner [12:33:08] !ping [12:33:08] pong [12:33:12] 1 sec [12:34:11] 05/21/2012 - 12:33:19 - Updating keys for laner at /export/home/deployment-prep/laner [12:34:58] hashar: let's remove old apaches for now? [12:35:09] maybe it speed up labs a bit [12:35:18] petan|wk: no [12:35:18] ok [12:35:26] petan|wk: will do that once I am able to access the machine and actually watch the log [12:35:32] ah right [12:35:34] you can't? 
[12:35:45] petan|wk: I also suspect two of the apaches to be sent some traffic from the squid [12:35:51] I can't log in labs instances [12:35:57] been trying the whole morning ;) [12:36:06] ok [12:36:14] I create a ticket [12:36:18] so I will remove the web* instances tomorrow IF the cluster is fixed [12:36:36] petan|wk: a ticket for ? [12:36:57] petan|wk: for labs dieing I did : https://bugzilla.wikimedia.org/show_bug.cgi?id=36993 [12:37:11] hashar: for "now" [12:37:16] right now labs are down [12:37:22] that is blocker [12:37:26] needs to be fixed asap [12:37:36] and I guess people from ops do not even know that [12:37:50] I can't ssh to bastion as well [12:38:10] I have talked to the ops people [12:38:12] basically [12:38:26] ha! I accessed nicely :) [12:38:26] wait for Ryan Lane [12:39:10] bastion loads nicely too [12:40:46] Hydriz: I typed ls in bastion and it died [12:40:52] hmm? [12:41:01] should I be evil? [12:41:01] there is huge load on bastion as well [12:41:14] this needs to be fixed, no matter if it looks ok to you or not ;) [12:42:03] I just did an ls on dumps-1 [12:42:11] in a directory with rather lot of files [12:42:30] :P [12:43:59] hashar: I managed to login to dbdump [12:44:04] PROBLEM Current Load is now: WARNING on swift-be4 i-000001ca output: WARNING - load average: 3.25, 5.77, 8.06 [12:44:43] okok, maybe I give some space for breathing first... [12:44:57] let me free up something... [12:49:55] petan|wk: also, deployment-prep IRC feeds need to go elsewhere [12:49:57] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 116 processes [12:50:21] Thehelpfulone: that's what -feed is for [12:50:22] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds [12:50:22] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [12:50:22] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds [12:50:22] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds [12:50:36] petan|wk: I mean on irc.wikimedia.org it goes to the production channels [12:50:45] that shouldn't [12:50:48] since when [12:50:57] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:51:00] well I noticed it a couple of days ago [12:51:03] btw Ryan told me there is a firewall preventing that to happen heh [12:51:22] I guess his firewall has a hole [12:53:32] there is some wgRC2UDP variable [12:53:38] which is not override in labs [12:53:57] somehow settings in InitialiseSettingsDeploy.php are not applied ;-D [12:54:01] something I need to fix [12:55:08] hashar: did you modify InitaliseSettings.php? [12:55:50] specifically CommonsSettings contains the code that override it [12:56:12] I don't know if someone did play with the file or why it started to happen [12:56:19] let me check [12:56:33] RECOVERY Disk Space is now: OK on test3 i-00000093 output: DISK OK [12:56:33] RECOVERY Current Users is now: OK on test3 i-00000093 output: USERS OK - 0 users currently logged in [12:56:38] RECOVERY dpkg-check is now: OK on test3 i-00000093 output: All packages OK [12:57:29] @search gluster [12:57:29] No results found! 
:| [12:58:13] btw Reedy I temporarily switched repository for wm-bot source, wikimedia svn didn't work to me last friday and I needed to go home [12:58:16] petan|wk: I did lot of tweaks and updates to the php files last week [12:58:39] where is bots-3 running from? [12:58:44] sorry wm-bot* [12:58:51] Reedy: here https://github.com/benapetr/wikimedia-bot [12:59:02] Thehelpfulone: bots-1 [12:59:15] ok, why can't I access that instance? [12:59:21] RECOVERY Free ram is now: OK on bots-sql2 i-000000af output: OK: 77% free memory [12:59:21] RECOVERY Current Users is now: OK on bots-sql2 i-000000af output: USERS OK - 0 users currently logged in [12:59:22] RECOVERY Total Processes is now: OK on bots-sql2 i-000000af output: PROCS OK: 103 processes [12:59:23] hm... :) [12:59:26] RECOVERY Disk Space is now: OK on bots-sql2 i-000000af output: DISK OK [12:59:46] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:47] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:47] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:47] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:56] PROBLEM Current Load is now: WARNING on deployment-web i-00000217 output: WARNING - load average: 9.03, 8.37, 7.44 [12:59:56] PROBLEM Current Load is now: WARNING on deployment-web4 i-00000214 output: WARNING - load average: 8.95, 8.26, 7.36 [12:59:56] PROBLEM Current Load is now: WARNING on deployment-web5 i-00000213 output: WARNING - load average: 8.92, 8.25, 7.35 [13:00:13] Thehelpfulone: question is if there is a need for people to access it, given that it's kind of "full" there are 3 live bots running and I don't know if it's needed for more people to run bots there [13:00:23] !log dumps Killed the uploader process on all instances (except dumps-2) to see if Labs would get resolved. [13:00:34] Thehelpfulone: but if you want I can make it accessible, of course [13:00:38] what bots are running on it? [13:00:54] AFC bot, wm-bot and czech wikipedia talk page archive bot [13:01:03] I'm just thinking we don't need to restrict things unless absolutely necessary [13:01:14] Thehelpfulone: Ryan has different opinion [13:01:27] in fact his idea of production bots is that everything is restricted [13:01:56] but didn't he setup labs to allow sudo access for all? [13:01:57] this bots project is just a temporary solution [13:02:06] although I guess he also did setup the per project sudo policy [13:02:16] temporary? what would be permanent? [13:02:22] Thehelpfulone: no, sudo was never meant to be allowed to all [13:02:27] !bots | Thehelpfulone [13:02:28] Thehelpfulone: http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots [13:02:37] that describes the permanent [13:02:45] Production servers will have no access [13:02:47] god damn dumps-2, no response... [13:02:49] yeh [13:03:24] This space is designed for development work and packaging the bots and puppetizing them to push into production... this runs production stuff only because production cluster isn't done yet. 
[13:03:33] PROBLEM Current Load is now: WARNING on mobile-feeds i-000000c1 output: WARNING - load average: 4.21, 5.51, 5.64 [13:03:33] PROBLEM Current Load is now: WARNING on mailman-01 i-00000235 output: WARNING - load average: 5.91, 5.94, 5.36 [13:03:33] PROBLEM Current Load is now: WARNING on bastion1 i-000000ba output: WARNING - load average: 0.70, 3.12, 5.15 [13:03:33] PROBLEM Current Load is now: WARNING on wikistream-1 i-0000016e output: WARNING - load average: 10.77, 8.88, 6.99 [13:03:33] PROBLEM Current Load is now: WARNING on jenkins2 i-00000102 output: WARNING - load average: 5.68, 5.72, 5.40 [13:03:58] PROBLEM Free ram is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:59] PROBLEM dpkg-check is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [13:03:59] ok that page makes sense - it's more like the toolserver [13:04:08] not really [13:04:14] toolserver doesn't have puppet [13:04:27] eh, does rebooting on labsconsole recommended now? [13:04:27] Hydriz: it's never [13:04:27] !b 36997 [13:04:27] just do it [13:04:27] https://bugzilla.wikimedia.org/36997 [13:04:28] and pray [13:04:43] and pray? that's optimistic :P [13:05:00] yeah, I got to stop dumps-2 from continuing the upload [13:05:00] RECOVERY dpkg-check is now: OK on bots-sql2 i-000000af output: All packages OK [13:05:00] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 18% free memory [13:05:00] and see if Labs can get fixed [13:05:09] petan|wk: so where is http://bots.wmflabs.org/~petrb stored? [13:05:16] doing a nano is already taking very long [13:05:21] /mnt/public_html [13:05:27] everything's still screwed up I presume? [13:05:51] !project Dumps [13:05:51] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Dumps [13:06:23] * Hydriz prays [13:06:50] Thehelpfulone: it's stored on bots-nfs which is stored on project storage which is stored on gluster [13:06:58] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:07:05] no [13:07:31] it's stored on instance storage which is virtual hdd mounted from gluster [13:08:42] great, I have saved all work, its OK for virt* to be restarted for me :) [13:08:43] Thehelpfulone: if you want to become admin of bots project you just have to ask that's all [13:09:00] that's what I wrote in that email regarding the new setup [13:09:06] etc [13:09:20] * Hydriz asks for becoming an admin of bots project (with a cup of warm tea) [13:09:33] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [13:09:38] PROBLEM SSH is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - Socket timeout after 10 seconds [13:09:38] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:10:12] petan|wk: ok, yeah can you make me an admin then please [13:10:37] Hydriz: restarting virt*? 
you sure it help us [13:10:43] ok done for both, don't break stuff [13:10:48] :) [13:11:03] I am pretty sure we have problem with gluster, not virt [13:11:07] at least what it was raised before or something [13:11:14] anyway, it feels great to have saved work [13:11:25] hmm can't seem to SSH into bastion either [13:11:41] yep [13:11:53] If you reboot any virt nodes you'll probably cause more issues [13:12:02] Because gluster will try to heal its self and go batshit on io [13:12:10] I'm not planning to do that :) [13:12:23] at least I am not involved haha [13:12:46] the admin logs as some hints on April 30th [13:12:58] seems like each virt box need to be rebooted one after the other [13:13:07] but I guess we all agree we would prefer Ryan to handle that ;-D [13:14:48] PROBLEM Current Load is now: WARNING on wikidata-dev-3 i-00000225 output: WARNING - load average: 5.40, 6.14, 6.54 [13:14:51] stopping after uploading for a week of dumps [13:14:54] feels weird [13:15:22] Yeah... but if you reboot one you might end up VERY quickly in a place where you have to take everything down to get the nodes under control if gluster is still dodgy... which IIRC we're not running the latest 'fixed' version due to auth issues. [13:15:30] fcking gluster [13:15:49] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:59] PROBLEM HTTP is now: CRITICAL on hugglewiki i-000000aa output: CRITICAL - Socket timeout after 10 seconds [13:17:14] Gluster is awesome.... when it works and a real PITA when it breaks. [13:17:24] I've never seen it work properly [13:17:44] even when things were working around here, I/O was lagging quite a lot considering the workload and the hardware [13:17:50] I have it working sorta-properly on a magento cluster I built but it has IO thoughput issues reading small files :( [13:18:07] Considering shifting to Ceph because it seems to be progressing towards a better community now. [13:18:31] PROBLEM dpkg-check is now: CRITICAL on test3 i-00000093 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:19:50] hmm [13:20:12] RECOVERY Current Load is now: OK on wikistream-1 i-0000016e output: OK - load average: 0.23, 1.45, 4.12 [13:20:29] PROBLEM Current Load is now: WARNING on deployment-nfs-memc i-000000d7 output: WARNING - load average: 8.66, 8.48, 9.23 [13:20:29] PROBLEM Current Load is now: WARNING on wep i-000000c2 output: WARNING - load average: 4.76, 6.91, 6.42 [13:21:51] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:51] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:51] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:51] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. 
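Before touching any virt node, a quick health read of gluster along the lines below would be prudent. It assumes the gluster CLI is available on the storage hosts and uses no real volume names; commands newer than the deployed 3.x release (such as `volume heal ... info`) are deliberately avoided here.

    # Hedged sketch, run on a gluster server -- assumes the gluster CLI is present there.
    sudo gluster peer status                             # are all peers still connected?
    sudo gluster volume info                             # volumes, brick layout, options
    sudo tail -n 50 /var/log/glusterfs/*.log             # recent client/brick errors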
[13:22:36] PROBLEM HTTP is now: CRITICAL on deployment-apache21 i-0000026d output: CRITICAL - Socket timeout after 10 seconds [13:22:41] PATH="/usr/local/python/bin/:$PATH" PYTHONPATH="/usr/local/python/" LD_LIBRARY_PATH="/usr/local/python/lib:/usr/local/python/:$LD_LIBRARY_PATH" PKG_CONFIG_PATH=/usr/local/cairo/lib/pkgconfig:$PKG_CONFIG_PATH /usr/local/python/bin/python waf configure --prefix=/usr/local/python// [13:22:47] RECOVERY dpkg-check is now: OK on bots-cb i-0000009e output: All packages OK [13:22:47] Errr, wrong window [13:24:00] RECOVERY Current Load is now: OK on mobile-feeds i-000000c1 output: OK - load average: 0.16, 1.51, 3.74 [13:24:00] RECOVERY Current Load is now: OK on bastion1 i-000000ba output: OK - load average: 0.51, 3.24, 4.50 [13:24:00] RECOVERY Current Load is now: OK on jenkins2 i-00000102 output: OK - load average: 0.28, 2.24, 4.38 [13:24:00] RECOVERY Disk Space is now: OK on bots-cb i-0000009e output: DISK OK [13:24:00] RECOVERY Current Users is now: OK on bots-cb i-0000009e output: USERS OK - 0 users currently logged in [13:24:00] RECOVERY Total Processes is now: OK on bots-cb i-0000009e output: PROCS OK: 113 processes [13:24:05] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.017 second response time [13:24:05] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.020 second response time [13:24:05] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.023 second response time [13:24:05] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.014 second response time [13:24:10] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 4.69, 8.18, 8.89 [13:24:10] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in [13:24:10] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK [13:24:10] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 114 processes [13:24:15] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 96% free memory [13:24:15] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [13:24:19] o_O [13:24:27] eh?! 
[13:24:46] nagios can't check stuff because nrpe is lagged because of load [13:24:53] so that it throw random errors [13:25:02] I get that, the "OK" is what I don't get [13:25:17] nope, incubator-bot1 is still giving SSH troubles [13:25:30] load get better and nrpe start responding, that's why it say OK [13:25:41] yes, things seem to be recovering [13:25:44] unless nagios has exclusive access :P [13:25:58] that happens over all day, I am sure in few mins it get worse again [13:25:59] \o/ incubator-bot1 loads :) [13:26:06] one of the cascading problems is that instances mount off labs-nfs1, which is an instance [13:26:24] paravoid: since morning it is getting worse and then better and again [13:26:45] load average: 1.99, 5.52, 7.77 (incubator-bot1) [13:27:04] but its decreasing steadily :) [13:27:33] RECOVERY Current Load is now: OK on deployment-web i-00000217 output: OK - load average: 0.12, 0.99, 3.91 [13:27:33] RECOVERY Current Load is now: OK on deployment-web4 i-00000214 output: OK - load average: 0.00, 0.95, 3.88 [13:27:33] RECOVERY Current Load is now: OK on deployment-web5 i-00000213 output: OK - load average: 0.00, 0.96, 3.90 [13:27:36] that's why hashar gave up and we are so sad [13:27:36] PROBLEM HTTP is now: WARNING on deployment-apache21 i-0000026d output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 8.499 second response time [13:28:50] things looks good (now) [13:33:56] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 0.16, 1.36, 4.87 [13:38:56] RECOVERY Current Load is now: OK on deployment-nfs-memc i-000000d7 output: OK - load average: 0.51, 1.64, 4.79 [13:42:26] RECOVERY Current Load is now: OK on ganglia-test2 i-00000250 output: OK - load average: 0.34, 0.96, 3.96 [13:47:36] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.51, 0.46, 3.77 [13:51:20] 05/21/2012 - 13:51:20 - Updating keys for laner at /export/home/deployment-prep/laner [13:52:36] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.72, 0.39, 2.82 [14:02:36] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.22, 0.39, 17.14 [14:22:36] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.40, 0.41, 5.00 [14:48:36] petan, also any idea why deployment-prep is always so slow? [14:49:50] It doesn't have 400 servers and 2 data centers behind it. It tries to simulate dozens of servers into one virtual computer. It could be faster though, probably. [14:49:53] (hopefully) [14:50:24] maybe some caching layers haven't been implemented yet into the beta puppet that the production cluster does have [14:51:50] Thehelpfulone: depends when, sometimes is very fast, sometimes it's caused because of IO and such [14:52:17] we don't want to use different caches than what we have on prod now [14:52:41] ok [14:52:43] but in fact the servers are usually loaded because labs are having problems [14:52:51] and if a change is made on production, how do you sync it to labs? 
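To tell "instance is down" apart from "NRPE is just starved for I/O", the same check can be re-run by hand from the nagios host with a longer timeout. The plugin path and the bots-cb hostname below are assumptions.

    # Hedged sketch from the nagios host -- plugin path and target hostname are assumptions.
    /usr/lib/nagios/plugins/check_nrpe -H bots-cb -c check_load -t 30
    echo $?                                              # 0/1/2/3 = OK/WARNING/CRITICAL/UNKNOWN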
[14:53:00] puppet [14:53:10] that's how is it meant to work [14:53:20] there are two branches, changes aren't going to be done on prod [14:53:22] does puppet run by itself or do you puppetd -tv [14:53:31] in future we will do changes on labs and then, push to prod [14:53:43] so labs will be first place to try all changes to prod [14:53:56] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 18% free memory [14:54:04] puppet run by itself of course [14:54:08] that's why nagios check it [14:55:26] so that request on bugzilla for aawiki to be added to labs, puppet doesn't add the wikis? [14:55:42] right now it must be done using other way [14:55:55] puppet isn't going to manage wikis, because they need to be inserted to sql [14:56:24] in past we had a script for that which doesn't work now [15:00:03] 05/21/2012 - 15:00:02 - Updating keys for andrew at /export/home/testlabs/andrew [15:00:06] 05/21/2012 - 15:00:06 - Updating keys for andrew at /export/home/gluster/andrew [15:00:08] 05/21/2012 - 15:00:08 - Updating keys for andrew at /export/home/openstack/andrew [15:00:16] 05/21/2012 - 15:00:16 - Updating keys for andrew at /export/home/bastion/andrew [15:00:20] 05/21/2012 - 15:00:20 - Updating keys for andrew at /export/home/deployment-prep/andrew [15:37:33] does emailing work on deployment-prep petan? [15:38:03] no [15:38:15] is it going to be turned on? [15:39:24] also did you manage to find a fix so that the IRC feed doesn't go to the production channels? [15:44:57] what we need is a different "server" [15:45:22] I began to make it, but it still allowed you to talk :s [15:47:35] btw, how is deployment-prep updated? [15:54:48] morning Ryan_Lane [15:54:53] morning [15:54:57] bug https://bugzilla.wikimedia.org/show_bug.cgi?id=37002 [15:54:58] I see there were labs issues [15:55:15] the IRC feed for labs is going to production server / channels [15:55:31] Platonides: apparently it's through puppet [15:55:46] this happens when people pull in the InitilizeSettings and CommonSettings from production [15:55:57] Ryan_Lane: hashar is working on that [15:56:00] * Ryan_Lane nods [15:56:07] I can't wait to get off of gluster [15:56:10] today labs were really broken [15:56:15] petan|wk: I am not :-D [15:56:21] Ryan_Lane: he is not [15:56:22] :D [15:56:25] Ryan_Lane: morning... [15:56:31] well, in like a week and a half we'll get hardware in [15:56:32] hashar: you're not? [15:56:35] and we'll move off of gluster [15:56:39] petan|wk: well I will fix the beta logs flushing to IRC whenever labs is fixed ;-D [15:56:44] Ryan_Lane: so, gluster was /probably/ the primary issue [15:56:51] yes [15:56:59] the secondary issue that labs-nfs1 got unresponsive (probably because of gluster) [15:57:00] chrismcmahon: the labs cluster had some issue for the whole day that prevented any access to it [15:57:02] though all the instances doing IO at the same time is a bad idea ;) [15:57:04] ahhhh [15:57:08] that damn NFS server [15:57:09] and then all other instances got fcked up [15:57:13] hashar: I see, got it. [15:57:21] no home dirs for starters, not being able to log into bastion hosts etc. [15:57:25] lemme hop onto the gluster channel and see if I can get someone to make a package of the newer qa builds [15:57:30] hashar: you self-promoted to... 
eh, lead dev or something, howcome you don't work on that XD [15:57:54] petan|wk: well that is my 20% day today, so I happily escaped the labs for some wikibugs loving ;-D [15:58:13] ok guess I will be lead dev for rest of day [15:58:16] Ryan_Lane: any ideas on how to get rid of labs-nfs1 without getting blocked by the gluster replacement? [15:58:27] replace it with something else? [15:58:49] oh rly? :) [15:58:55] heh [15:59:00] well, I don't know what else to say ;) [15:59:15] we have to have shared home directories per-project [15:59:23] could we move labs-nfs1 to local storage for example? [15:59:30] it is on local storage [15:59:32] oh [15:59:34] is it? [15:59:37] I think it's on gluster [15:59:38] you mean not gluster [15:59:40] yes [15:59:42] nope [15:59:50] can we employ a physical server for this then? [15:59:59] not unless we have hardware for it ;) [16:00:07] we don't need a Cisco for that [16:00:10] which we will in about a week and a half [16:00:19] I'd prefer not to add in a piece of hardware for a week and a half [16:00:29] Ryan_Lane: why we need shared home directories per project [16:00:35] I guess we're pretty unstable right now [16:00:36] Ryan_Lane: why they are not in /data/home [16:00:47] we had several hours of downtime that resulted into several people not being able to work though [16:01:01] petan|wk: because the current version of gluster we're on isn't reliable [16:01:07] we have to do /something/, even if it's for a week and a half I'd say... [16:01:11] it has a locking issue that can lead to data corruption [16:01:14] ok why we need to have shared home directories [16:01:18] paravoid: indeed. ask rob about a misc server [16:01:24] preferably a high-perf one [16:01:36] petan|wk: because you need ssh keys to log in [16:01:37] I would prefer to have /home/petrb on local fs and /home/petrb/share on nfs [16:01:52] puppet can update your key on all instances [16:01:56] no it can't [16:01:57] or whatever else [16:02:20] puppet doesn't know anything about your ssh key [16:02:32] ok, something update it now, that thing could do that [16:02:51] where would it run? [16:02:53] on every instance? [16:03:00] yes [16:03:02] it's a pretty expensive script [16:03:13] that update of production apaches does run on each apache as well [16:03:15] the way things are done now make sense [16:03:37] it's not so fast but reliable [16:03:43] Thehelpfulone: petan|wk : email do not work on deployment-prep. I need to have exim configuration in puppet updated so we can choose a different SMTP server on labs [16:03:45] when nfs is down your home isn't gone [16:03:52] it's usefull to have a home [16:03:55] Thehelpfulone: petan|wk : feel free to log a bug. I guess that is low priority for now. [16:04:04] ideally your storage would never go away [16:04:44] the gluster storage is reliable enough for home directories from the availability perspective, but the locking bug issue is a problem [16:05:18] paravoid: Ryan_Lane the bug I opened about this morning (UTC) outage is https://bugzilla.wikimedia.org/show_bug.cgi?id=36993 [16:05:51] * Ryan_Lane nods [16:08:00] how can rotating a log bring the machine to its knees? [16:08:12] Platonides: if there are many logs [16:08:14] it should be renaming a file, not copying to a new one [16:08:18] Platonides: not really [16:08:28] sometimes you need to gzip [16:08:28] yeah, it gzips [16:08:32] hmm... gzips... 
[16:08:35] and gluster's IO is *really* crappy [16:08:45] also, it could be doing GBs of IO [16:09:45] petan: by the way, betawikis reporting to production irc, that is not just config issue. It shouldn't be possible to do that from outside the cluster [16:10:15] at least not this easy [16:10:46] well, you can connect to the server from the outside [16:10:51] if one would be to setup an irc client that logs on the irc server and has permission to talk, then okay. But in this case labs isn't doing that. It is just broadcasting to the production UDP feed it seems. [16:10:54] (that's the point!) [16:10:56] Platonides: Sure [16:11:06] but one shouldn't be able to connect to the UPD feed from the outside [16:11:13] maybe the problem is that the config contained the irc service password ? [16:11:16] which the ircbot is listening to to send to the irc server [16:11:23] No, I don't think so [16:11:27] could be, but I don't think so [16:11:31] mediawiki isn't logging into irc [16:11:55] a separate mediawiki-independent script is running that logs into irc and sends messages that it receives from UDP [16:12:02] mediawiki just sends into UDP, not to IRC directly [16:12:05] (afaik) [16:12:14] yes [16:12:24] Krinkle: I know [16:12:29] it's problem of firewall as well [16:12:32] Ryan_Lane: Can you confirm that this is an issue with the labs/production firewall ? [16:12:35] okay [16:12:38] nvm, then :) [16:15:15] !log deployment-prep petrb: replaced wgRC2UDP with localhost in Initialise, needs to be fixed permanently using override from wmf-config/InitialiseSettingsDeploy.php [16:17:22] we're going to see if we can easily block Labs from the update stream [16:18:41] I have tried to inject a message, but doesn't seem to pass [16:24:06] jeremyb: looks like mailman is down again, evil puppet? [16:24:21] surely [16:24:57] ok, I've found now how is it being injected [16:28:14] petan|wk: make sure to git commit your change :-D [16:28:34] jeremyb: what's the permanent solution to fix it? [16:28:42] petan|wk: once /home/wikipedia/common is good enough, I will replace it with mediawiki-config from production [16:28:48] smacking the servers is out of the question I think [16:28:48] still need to port the labs overriding stuff though [16:29:23] Thehelpfulone: get me a puppetmaster that I can play with directly or merge access to the test repo or per project puppet. one option is to run your own master [16:30:01] Ryan_Lane: ^ which one of those is coming soonest? [16:30:12] how do i invoke labs-logs-bottie? [16:30:59] the current solution of "fix stuff manually and turn off puppet" is a hack [16:31:16] we can try to hunt down how puppet is getting started again and make sure it doesn't [16:31:21] but it's still a hack [16:33:34] maybe it's upstart's fault [16:33:37] idk how that works [16:35:36] huh, do we even use upstart? [16:35:51] of course we do [16:41:29] Thehelpfulone: ask paravoid :) [16:41:45] he's pretty close to having this done, I belive [16:41:51] localpuppet? [16:41:57] yes, the plan was to have it today [16:42:01] \o/ [16:42:14] the labs downtime didn't help though [16:43:33] Thehelpfulone: ok, fixed [16:43:51] Thehelpfulone: i think i tracked down what needs to be disabled to stop puppet from starting again [16:43:57] ok [16:44:01] paravoid: whoa, nice ;) [16:44:12] let's see it working first [16:44:27] paravoid: but i still don't know what will be in it. are there some specs i can read? 
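Two hedged checks that follow from the discussion above: since MediaWiki only emits a UDP datagram and a separate relay speaks IRC, a raw packet is enough to probe whether labs traffic can still reach the production relay, and a grep confirms whether the wgRC2UDP override actually took effect. The host, port, config path and the exact line format expected by the relay are placeholders or assumptions, not real values.

    # Hedged sketch -- <relay-host>/<relay-port> are placeholders; the relay's expected line format is not shown here.
    printf 'test injection\n' | nc -u -w1 <relay-host> <relay-port>

    # Did the labs override land? The wmf-config path is an assumption; adjust to the local checkout.
    grep -rn 'wgRC2UDP' /home/wikipedia/common/wmf-config/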
[16:44:28] don't celebrate just yet [16:44:38] no, sorry [16:44:41] just a local puppetmaster [16:44:47] so that you can try manifests before pushing them in [16:44:56] nothing too rocket-science-y [16:46:10] ok. what about more widespread access to make stuff live on test branch without hoops? [16:46:34] or without even review at all beyond "this worked on the local puppet" [16:47:06] idk even who has merge access there atm [16:47:22] and what does merge access mean? does the puppetmaster pull automatically or is it manual? [16:47:40] PROBLEM HTTP is now: WARNING on mailman-01 i-00000235 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 498 bytes in 0.014 second response time [16:47:54] it means that merge requests will not be just single-commit merges "let's see if this works" [16:47:58] but actual patchsets [16:48:07] Ryan_Lane: what is manage-exports? [16:48:17] and why does it eat 30% cpu on labs-nfs1? [16:48:26] paravoid: it's the expensive script I was talking about [16:48:59] right, agreed [16:49:14] what's it doing? updating keys? [16:50:08] it's doing all kinds of crap [16:50:16] it's a script I wrote like 6 years ago [16:50:37] it manages home directories, mostly [16:50:44] but also does per-project nfs now, too [17:39:10] PROBLEM Puppet freshness is now: CRITICAL on localpuppet1 i-0000020b output: Puppet has not run in last 20 hours [18:22:22] Hey Ryan_Lane [18:22:30] howdy [18:22:43] What's up with the databases-in-labs thing? [18:23:18] <^demon> It's Ryan! [18:23:28] oh yeah [18:23:35] forgot about that [18:23:44] RoanKattouw: didn't I make a project for you for this? :D [18:24:09] No, we tried to come up with a name and then it never happened [18:24:32] <^demon> Ryan_Lane: Is ldapsupportlib.py something we maintain, or is that an upstream package? I was trying to do a modify-ldap-group over the weekend and I got a stacktrace :( [18:25:11] It's something I wrote ages ago [18:25:42] <^demon> http://p.defau.lt/?bR4tuvPF6vuvEjo7VjIvTA :\ [18:26:15] you need to run that via sudo [18:26:42] <^demon> Prompts me for password though? [18:27:30] it does? [18:27:33] on formey? [18:27:50] <^demon> Yup [18:28:11] ah [18:28:17] you don't have permissions to use that [18:28:19] lemme add him [18:28:43] <^demon> Ok thanks. Guess we'll need to add it to the list of scripts I can sudo for :) [18:28:47] done [18:28:49] <^demon> Easy enough. [18:29:04] hmm. yeah, if I can do it for this specific group [18:29:20] it would obviously be problematic if you could add yourself to ops, for instance ;) [18:29:31] though you're also an admin in gerrit, so meh [18:29:50] I wish we could make it so that people could only create repos and not touch others [18:30:03] ACLs for gerrit admins would be nice [18:30:04] <^demon> We can hand out create permissions only since 2.3 [18:30:31] <^demon> https://gerrit.wikimedia.org/r/#/admin/projects/All-Projects,access - under "Global Capabilities" [18:34:30] awesome [18:34:42] so, I can take away your admin privs? :D [18:35:04] <^demon> Nooo, I do lots of gerrit stuffs. [18:36:20] heh [18:36:39] it's a loophole in rights for you to have that ;) [18:37:28] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [18:45:28] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [19:18:05] <^demon> Ryan_Lane: I added you as reviewer on a couple of other gerrit fixes. Should be much easier than the big rewrite I did. [19:18:16] heh [19:18:17] ok [19:20:28] ^demon, why would gerrit fail a merge which works on the cli? 
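Once the local puppetmaster described above exists, trying a manifest before pushing it could look roughly like this; the master hostname is a placeholder and --noop keeps the dry run from changing anything. The last_run_summary.yaml check is one way to verify the "Puppet freshness" alerts by hand, assuming this puppet version writes that file.

    # Hedged sketch -- the puppetmaster hostname is a placeholder for whatever local master gets set up.
    sudo puppetd --test --noop --server localpuppet1.pmtpa.wmflabs   # dry-run a manifest against the local master
    sudo puppetd --test                                              # then a real run against the normal master
    cat /var/lib/puppet/state/last_run_summary.yaml                  # when/what puppet last applied (if present)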
[19:20:37] <^demon> Example? [19:20:51] It does weird shit from time to time [19:20:59] rebase, repush, approve [19:21:04] https://gerrit.wikimedia.org/r/8017 [19:21:18] I saw you rebased it [19:21:20] <^demon> Most likely solution is different merge tool on local install vs. gerrit. [19:21:26] <^demon> eg: diff3 vs. something else [19:21:26] well, sure [19:21:32] <^demon> *cause [19:21:32] I don't have a gerrit server locally :) [19:21:34] <^demon> Not solution [19:21:35] it was 16 lines or so out IIRC [19:21:43] but still, trivial [19:22:02] Install all of the diff tools! [19:22:12] * Platonides blames java [19:22:14] <^demon> Yeah, it's merged so I'm not gonna fret right now. [19:22:19] <^demon> Way bigger fish to fry. [19:22:25] ^demon, reminder, I'd love ownership of wikimedia/orgchart when you get a chance [19:22:39] <^demon> Ryan_Lane added you to the group earlier :) [19:22:43] <^demon> You should be good to go now. [19:22:45] Woot! [19:41:04] ^demon, how do I push my current commits to the repository? It doesn't like fast-forward references ("can not update the reference as a fast forward") [19:41:25] <^demon> Are you pushing for review? [19:41:30] <^demon> Oh, existing work. [19:41:34] *nod* [19:41:40] <^demon> We'll need to force the push. Let me tweak the permissions [19:41:47] <^demon> (They're disabled by default, for good reason) [19:41:55] Agreed [19:43:19] <^demon> Ok, try `git push -f [remote-for-gerrit] [branch]` [19:43:39] Fun! [19:43:42] Thanks much [19:44:15] Is it simple enough to merge between branches? [19:44:18] <^demon> https://gerrit.wikimedia.org/r/gitweb?p=wikimedia%2Forgchart.git;a=shortlog;h=refs%2Fheads%2Fmaster looks good :) [19:44:33] <^demon> I don't know. I haven't done much merging. [19:45:09] Hm, I'll have to look it up [19:50:18] jeremyb: is it possible on mailman to separate the http://mailman.wmflabs.org/mailman/listinfo - lists? so that we have one set of "public" lists and then one set of "closed subscription" lists? [19:50:50] public / private is unrelated to closed / open subscription [19:55:45] i think there's no way to give a listing for private lists at all [19:55:51] you just have to already know the name [19:56:04] Thehelpfulone [19:56:32] ok [19:56:53] Funny, I can't clone master [19:57:21] <^demon> Hmm, HEAD seems to be pointing at the wrong place. [19:57:28] <^demon> Manually specifying -b master works though [19:58:21] <^demon> marktraceur_: ^ [19:58:29] <^demon> `git clone -b master` should work [19:58:37] Right, got it [19:58:42] <^demon> HEAD is pointing at refs/meta/config, which is wrong. [19:58:45] <^demon> I'll have to fix that. [19:59:20] <^demon> Fix one thing, 8 more things break. Story of my so-called git life. [20:00:23] Welcome to the glorious git future. [20:00:26] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [20:01:21] I for one welcome our git overlords. [20:30:39] !log deployment-prep changed default assignee in Bugzilla [20:39:16] hashar, https://bugzilla.wikimedia.org/show_bug.cgi?id=37007 [20:40:41] hashar: default assignee? do you mean the default CC? [20:41:23] I think the practice is to keep the default assignee as wikibugs, then add yourself to the default CC list. [20:41:32] I have changed the default assignee [20:41:34] let me rewrite [20:41:52] oh you changed it back [20:42:44] !log deployment-prep In Bugzilla, I have removed Petr Bena as a default assignee of bugs opened for "deployment-prep (beta)" component. Default is now "Nobody", Petr is on CC. That will make bug triage a bit easier.
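On the listinfo question above: Mailman 2 has a per-list "advertised" flag, and lists with it switched off are simply left out of the /mailman/listinfo overview and can only be reached by name. It can be flipped from the list's Privacy options in the web admin; a small withlist sketch does the same thing, and the invocation shown is an assumption about the local package layout.

    # hide_list.py -- sketch for Mailman 2's withlist; turns off the per-list
    # "advertised" flag so the list no longer appears on /mailman/listinfo.
    # Invocation would be something like (path depends on the package layout):
    #   sudo withlist -l -r hide_list <listname>
    def hide_list(mlist):
        mlist.advertised = 0   # hide from the public listinfo index
        mlist.Save()           # withlist -l takes care of locking/unlocking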
[20:43:33] hashar, could you add me to the CC, too= [20:43:34] ? [20:43:58] !log deployment-prep Adding Faidon and Platonides to default CC list of "deployment-prep (beta)" component [20:44:18] Platonides: done :-D [20:44:21] uh-oh :) [20:44:44] Platonides: changed uid to 33? [20:44:50] Platonides: what where why how ? ;-D [20:45:19] the apache UID is set by ubuntu when installing the package, so we should really not modify it in /etc/passwd :-] [20:46:34] $ grep apache /etc/passwd [20:46:34] apache:x:33:33::/var/www:/sbin/nologin [20:46:35] oh [20:46:40] on -web3 [20:46:47] anyone else want to be added to default CCs for any bugzilla components? [20:46:59] * Thehelpfulone is offering free default CC's for a limited time only [20:47:05] hehe [20:47:18] well, and others [20:47:38] precisely to overcome that issue of files owned by www-data [20:47:39] Platonides: the uid 33 seems to be on the -web* instances which I am going to destroy [20:47:50] poor instances [20:52:07] !log deployment-prep On deployment-nfs-memc : added apache (uid 48) entry in /etc/passwd [20:52:42] !log deployment-prep deployment-nfs-memc : fix user rights for upload6 : chown apache /mnt/export/upload6 [20:52:55] where the hell is that stupid bot [20:54:46] Platonides: Thehelpfulone are you able to restart bots on [bots-2] ? [20:54:56] should be : sudo service adminbot restart [20:55:08] should be able to [20:55:25] I need to get added to that project so I can restart bots myself hehe [20:55:38] I can add you if you like hashar [20:56:34] please do :-D [20:56:42] I promise I am not going to mess with them [20:56:51] just restart them whenever they die [20:57:11] sure, I'll add you now [20:57:30] hmm Ryan_Lane you there? [20:57:38] https://labsconsole.wikimedia.org/wiki/Special:NovaProject is not showing me any projects [20:58:00] log out from the wiki and log back in? [20:58:17] yep. do that [20:58:25] it's the same "no credentials for your account" bug [20:58:41] <^demon> That bug? Super annoying :( [20:58:42] I keep hitting that silly bug :( [20:59:10] 05/21/2012 - 20:59:10 - Creating a home directory for hashar at /export/home/bots/hashar [20:59:34] hmm that's odd Ryan_Lane, petan added Hydriz and me as admins on bots but it doesn't seem to have gone through? [21:00:12] 05/21/2012 - 21:00:11 - Updating keys for hashar at /export/home/bots/hashar [21:00:15] well, maybe he didn't actually add them? :) [21:00:37] would it show up somewhere when you add someone? like it says "updating keys..." [21:00:48] nope [21:00:58] it also doesn't show in recent changes :( [21:01:00] I need to fix that [21:04:22] ^demon: is it going to be possible for devs to create git repos? [21:04:31] or will we need to bother you with every new repo in future? [21:04:39] <^demon> I'm going to write a special page that automates the process. [21:04:43] k [21:04:47] <^demon> I don't want to give out permissions to more peeps. [21:04:52] petan|wk: did you remember to press save when adding me as an admin on bots? ;) [21:04:59] yes [21:05:01] why [21:05:24] It doesn't show that I've been added - https://labsconsole.wikimedia.org/wiki/Special:NovaProject [21:05:33] it's not on that page [21:05:45] you should be able to sudo on all instances now [21:06:04] oh! you did it through the sudo group policy [21:06:08] yes
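The uid confusion above comes from two Apache conventions colliding on a shared export: www-data is uid 33 on Ubuntu, while uid 48 is the traditional "apache" user on Red Hat-style systems, so files on the export can show up owned by a bare number on hosts that lack a matching passwd entry. A quick check, using the path from the !log entry:

    # Quick check: who owns the upload export, and does that uid resolve locally?
    # Catches the uid-33 (www-data on Ubuntu) vs uid-48 ("apache") mismatch
    # discussed above. The path is the one from the !log entry.
    import os
    import pwd

    path = "/mnt/export/upload6"
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = "<no passwd entry on this host>"
    print("%s is owned by uid %d (%s)" % (path, st.st_uid, owner))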
[21:11:01] !log bots restarted labs-morebots : root@bots-2:~# service adminbot restart [21:11:03] Logged the message, Master [21:11:20] !log deployment-prep In Bugzilla, I have removed Petr Bena as a default assignee of bugs opened for "deployment-prep (beta)" component. Default is now "Nobody", Petr is on CC. That will make bug triage a bit easier. [21:11:22] Logged the message, Master [21:11:30] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory [21:11:33] !log deployment-prep Adding Faidon and Platonides to default CC list of "deployment-prep (beta)" component [21:11:35] Logged the message, Master [21:11:43] !log deployment-prep On deployment-nfs-memc : added apache (uid 48) entry in /etc/passwd [21:11:45] Logged the message, Master [21:11:50] !log deployment-prep deployment-nfs-memc : fix user rights for upload6 : chown apache /mnt/export/upload6 [21:11:51] Logged the message, Master [21:18:00] Thehelpfulone: thanks for adding me as a bot restarter [21:19:34] no problem ;) [21:41:21] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [21:56:31] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 20% free memory [22:23:13] PROBLEM Puppet freshness is now: CRITICAL on deployment-apache21 i-0000026d output: Puppet has not run in last 20 hours [22:33:16] PROBLEM Puppet freshness is now: CRITICAL on labs-relay i-00000103 output: Puppet has not run in last 20 hours [22:39:32] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 19% free memory