[00:07:16] o.0 [00:07:33] Did mediawiki suddently get pretty colours on diffs [00:07:42] I swear that was some horrid yellowy colour before not blue. [00:08:07] Yes, that changed in 1.19 [00:08:08] Seems it did [00:08:18] :D [00:08:42] I probably should upgrade works wiki thinking about it.... [00:19:05] New review: Ryan Lane; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/2157 [00:19:14] hm [00:20:14] :D [00:21:06] Don't the instances have ganglia deamon on already? [00:21:19] kind of [00:21:49] ah ha [00:22:03] I found why the verification logs aren't being sent :) [00:22:17] or did i? [00:22:19] maybe I didn't [00:22:20] heh [00:22:30] Magic [00:23:10] no. I didn't :( [00:23:20] Ganglia has a mucher nicer interface than I remember it having. [00:24:05] it's the newer version, I believe [00:34:11] Yeah speaking of which [00:34:23] Why does it show load avg graphs by default now, as opposed to CPU graphs? [00:34:37] Dunno :P [00:34:44] ( Ryan_Lane ) [00:36:04] I have no clue [01:05:39] !account-questions [01:05:40] I need the following info from you: 1. Your preferred wiki user name. This will also be your git username, so if you'd prefer this to be your real name, then provide your real name. 2. Your preferred email address. 3. Your SVN account name, or your preferred shell account name, if you do not have SVN access. [01:15:18] hullo. [01:20:17] howdy [01:27:11] !initial-login | abartov [01:27:12] abartov: https://labsconsole.wikimedia.org/wiki/Access#Initial_log_in [01:51:14] Ryan_Lane: can you help me troubleshoot? [01:51:25] what are you trying to troubleshoot? [01:51:39] Ryan_Lane: password issue, in mailing list [01:51:56] I typed my username and password in a text editor and tried to log in by copy&paste [01:52:08] and still labsconsole accepts but gerrit doesn't [01:52:28] ah [01:53:13] hm [01:53:52] can you try a slightly shorter password? [01:54:44] I don't actually see you authenticating to labsconsole [01:54:58] can you log out, then log back in for me? [01:55:07] Failed to bind as uid=liangent,ou=people,dc=wikimedia,dc=org [01:55:22] it doesn't look like it's working to me.... [01:55:30] if( $user == llangent ) mail( 'ryan', $_POST['password'] ); // pwnd [01:55:32] yep. isn't working [01:55:40] Ryan_Lane: hmm [01:55:44] Damianz: I have control of the server [01:55:56] if I wanted the password, I'd just make mediawiki print it out to me :) [01:56:07] :P [01:56:18] it accepted my password on special:userlogin when I'm logged in [01:56:18] Easier than brute forcing a hash out of ldap xD [01:56:31] LDAP and mediawiki do strange things. [01:56:36] liangent: it isn't even trying [01:56:43] liangent: because you are already logged in [01:57:04] Most annoying is http auth and ldap under apache... which just doesn't retry for random amounts of time. [01:57:05] send a password reset to yourself and set a new password [01:57:16] Damianz: I don't use that :) [01:57:29] I'm thinking of installing simplesamlphp [01:57:35] and using it for SSO everywhere [01:57:43] it also supports openid [01:57:48] and oauth [01:57:53] Ryan_Lane: Login error You have made too many recent login attempts. Please wait before trying again. [01:57:56] It's useful for some things, like wanting to secure up a whole load of stuff for certain people and not going out to sort krb. [01:57:57] heh [01:58:00] can you reset it? [01:58:20] I'm not sure I know how to clear that error [01:58:31] Ryan_Lane: memcached? [01:58:35] not memcache [02:00:46] oh. 
it is memcache [02:01:10] gimme a sec [02:01:55] liangent: ok, try now [02:02:03] also, don't try to log in, just reset your password [02:17:17] Ryan_Lane: sorry but I had some issues about my network connection [02:17:46] everything now seems working. let me try more [02:18:07] you likely shouldn't keep trying that password [02:18:11] just reset your password [02:18:13] have it mail you one [02:19:08] I already reset one [02:19:26] and it isn't working? [02:22:45] it's working [02:22:50] oh. good :) [02:43:29] PROBLEM Free ram is now: WARNING on puppet-lucid puppet-lucid output: Warning: 12% free memory [03:03:29] PROBLEM Free ram is now: CRITICAL on puppet-lucid puppet-lucid output: Critical: 3% free memory [03:06:49] RECOVERY Total Processes is now: OK on wikistream-1 wikistream-1 output: PROCS OK: 80 processes [03:07:29] RECOVERY dpkg-check is now: OK on wikistream-1 wikistream-1 output: All packages OK [03:08:59] RECOVERY Current Users is now: OK on wikistream-1 wikistream-1 output: USERS OK - 0 users currently logged in [03:09:39] RECOVERY Disk Space is now: OK on wikistream-1 wikistream-1 output: DISK OK [03:09:39] RECOVERY Current Load is now: OK on wikistream-1 wikistream-1 output: OK - load average: 0.02, 0.08, 0.03 [03:10:19] RECOVERY Free ram is now: OK on wikistream-1 wikistream-1 output: OK: 61% free memory [03:13:02] Ryan_Lane: thanks! I'll take a look when I'm home. [03:28:29] RECOVERY dpkg-check is now: OK on backport backport output: All packages OK [03:28:29] RECOVERY Free ram is now: OK on puppet-lucid puppet-lucid output: OK: 20% free memory [03:29:09] RECOVERY Current Load is now: OK on backport backport output: OK - load average: 0.06, 0.03, 0.00 [03:29:49] RECOVERY Current Users is now: OK on backport backport output: USERS OK - 0 users currently logged in [03:30:29] RECOVERY Disk Space is now: OK on backport backport output: DISK OK [03:31:19] RECOVERY Free ram is now: OK on backport backport output: OK: 92% free memory [03:32:39] RECOVERY Total Processes is now: OK on backport backport output: PROCS OK: 94 processes [08:39:19] 03/01/2012 - 08:39:19 - Updating keys for hydriz [08:40:09] 03/01/2012 - 08:40:09 - Updating keys for hydriz [08:45:09] 03/01/2012 - 08:45:09 - Updating keys for hydriz [08:45:12] 03/01/2012 - 08:45:12 - Updating keys for hydriz [08:45:21] 03/01/2012 - 08:45:21 - Updating keys for hydriz [08:45:26] 03/01/2012 - 08:45:26 - Creating a project directory for dumps [08:45:26] 03/01/2012 - 08:45:26 - Creating a home directory for hydriz at /export/home/dumps/hydriz [08:45:26] 03/01/2012 - 08:45:26 - Creating a home directory for laner at /export/home/dumps/laner [08:46:27] 03/01/2012 - 08:46:27 - Updating keys for hydriz [08:46:27] 03/01/2012 - 08:46:27 - Updating keys for laner [08:48:14] !log dumps New project created for uploading of Wikimedia Dumps to the Internet Archive. [08:48:31] oh I hate this bot [09:04:26] PROBLEM dpkg-check is now: CRITICAL on dumps-1 dumps-1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:05:06] PROBLEM Current Load is now: CRITICAL on dumps-1 dumps-1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:05:51] PROBLEM Current Users is now: CRITICAL on dumps-1 dumps-1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:06:36] PROBLEM Disk Space is now: CRITICAL on dumps-1 dumps-1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:07:16] PROBLEM Free ram is now: CRITICAL on dumps-1 dumps-1 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
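The string of "CHECK_NRPE: Error - Could not complete SSL handshake" alerts above almost always means the monitoring host reached port 5666 but the NRPE daemon on the freshly built instance refused the connection, typically because the instance is still being provisioned or because its allowed_hosts list does not yet include the nagios server. A minimal check, assuming the stock Debian/Ubuntu NRPE layout (paths and service name are assumptions; the host name is taken from the alerts):

    # from the nagios host: can NRPE on the new instance be reached at all?
    /usr/lib/nagios/plugins/check_nrpe -H dumps-1

    # on the instance itself: is the daemon up, and is the monitoring host allowed?
    sudo service nagios-nrpe-server status
    grep ^allowed_hosts /etc/nagios/nrpe.cfg

The fact that everything flips to RECOVERY a few minutes later fits the "instance still booting" explanation.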
[09:08:46] PROBLEM Total Processes is now: CRITICAL on dumps-1 dumps-1 output: Connection refused by host [09:23:58] PROBLEM Current Load is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:24:38] PROBLEM Current Users is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:25:18] PROBLEM Disk Space is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:26:08] PROBLEM Free ram is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:27:28] PROBLEM Total Processes is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:28:18] PROBLEM dpkg-check is now: CRITICAL on dumps-nfs1 dumps-nfs1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [09:30:18] RECOVERY Disk Space is now: OK on dumps-nfs1 dumps-nfs1 output: DISK OK [09:31:08] RECOVERY Free ram is now: OK on dumps-nfs1 dumps-nfs1 output: OK: 87% free memory [09:31:38] RECOVERY Disk Space is now: OK on dumps-1 dumps-1 output: DISK OK [09:32:18] RECOVERY Free ram is now: OK on dumps-1 dumps-1 output: OK: 95% free memory [09:32:28] RECOVERY Total Processes is now: OK on dumps-nfs1 dumps-nfs1 output: PROCS OK: 83 processes [09:33:18] RECOVERY dpkg-check is now: OK on dumps-nfs1 dumps-nfs1 output: All packages OK [09:33:38] RECOVERY Total Processes is now: OK on dumps-1 dumps-1 output: PROCS OK: 97 processes [09:33:58] RECOVERY Current Load is now: OK on dumps-nfs1 dumps-nfs1 output: OK - load average: 0.01, 0.17, 0.32 [09:34:28] RECOVERY dpkg-check is now: OK on dumps-1 dumps-1 output: All packages OK [09:34:38] RECOVERY Current Users is now: OK on dumps-nfs1 dumps-nfs1 output: USERS OK - 1 users currently logged in [09:35:08] RECOVERY Current Load is now: OK on dumps-1 dumps-1 output: OK - load average: 0.00, 0.03, 0.16 [09:35:48] RECOVERY Current Users is now: OK on dumps-1 dumps-1 output: USERS OK - 1 users currently logged in [10:04:25] 03/01/2012 - 10:04:25 - Updating keys for edsu [10:05:10] 03/01/2012 - 10:05:10 - Updating keys for edsu [10:15:38] PROBLEM dpkg-check is now: CRITICAL on wikistream-1 wikistream-1 output: DPKG CRITICAL dpkg reports broken packages [10:33:48] RECOVERY Current Users is now: OK on prefixexport prefixexport output: USERS OK - 0 users currently logged in [10:34:08] RECOVERY Total Processes is now: OK on prefixexport prefixexport output: PROCS OK: 109 processes [10:34:13] RECOVERY SSH is now: OK on prefixexport prefixexport output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:34:13] RECOVERY Current Load is now: OK on prefixexport prefixexport output: OK - load average: 1.14, 0.36, 0.13 [10:34:13] RECOVERY Free ram is now: OK on prefixexport prefixexport output: OK: 92% free memory [10:34:13] RECOVERY Disk Space is now: OK on prefixexport prefixexport output: DISK OK [10:37:38] RECOVERY HTTP is now: OK on prefixexport prefixexport output: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.027 second response time [10:37:38] RECOVERY dpkg-check is now: OK on prefixexport prefixexport output: All packages OK [10:45:38] RECOVERY dpkg-check is now: OK on wikistream-1 wikistream-1 output: All packages OK [11:27:38] PROBLEM Free ram is now: CRITICAL on deployment-web4 deployment-web4 output: Critical: 4% free memory [11:27:38] PROBLEM Free ram is now: CRITICAL on deployment-web2 deployment-web2 output: Critical: 2% free memory [11:27:48] 
PROBLEM Free ram is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:28:28] PROBLEM Current Load is now: CRITICAL on deployment-web3 deployment-web3 output: CRITICAL - load average: 62.93, 30.56, 12.14 [11:29:08] PROBLEM Free ram is now: CRITICAL on deployment-web3 deployment-web3 output: Critical: 2% free memory [11:29:48] PROBLEM Current Load is now: CRITICAL on deployment-web2 deployment-web2 output: CRITICAL - load average: 35.67, 23.01, 10.21 [11:29:58] PROBLEM Current Load is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:30:08] PROBLEM Current Load is now: CRITICAL on deployment-web4 deployment-web4 output: CRITICAL - load average: 42.74, 26.36, 11.67 [11:32:38] RECOVERY Free ram is now: OK on deployment-web4 deployment-web4 output: OK: 39% free memory [11:32:48] PROBLEM Disk Space is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:48] PROBLEM SSH is now: CRITICAL on deployment-web deployment-web output: CRITICAL - Socket timeout after 10 seconds [11:32:48] PROBLEM Current Users is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:48] PROBLEM Total Processes is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:58] PROBLEM dpkg-check is now: CRITICAL on deployment-web deployment-web output: CHECK_NRPE: Socket timeout after 10 seconds. [11:34:18] PROBLEM Total Processes is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:34:53] PROBLEM dpkg-check is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:34:58] PROBLEM Current Users is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:35:08] PROBLEM Current Load is now: WARNING on deployment-web4 deployment-web4 output: WARNING - load average: 1.15, 13.71, 10.82 [11:35:53] PROBLEM Disk Space is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:18] PROBLEM Current Users is now: CRITICAL on deployment-web3 deployment-web3 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:36:18] PROBLEM SSH is now: CRITICAL on deployment-web2 deployment-web2 output: CRITICAL - Socket timeout after 10 seconds [11:37:08] PROBLEM SSH is now: CRITICAL on deployment-web3 deployment-web3 output: CRITICAL - Socket timeout after 10 seconds [11:37:48] PROBLEM Disk Space is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:37:48] PROBLEM Total Processes is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. [11:37:58] PROBLEM dpkg-check is now: CRITICAL on deployment-web2 deployment-web2 output: CHECK_NRPE: Socket timeout after 10 seconds. 
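The deployment-web alerts above (load averages of 30 to 60 with 2 to 4% free memory) are the usual signature of Apache forking more children than the instance can hold in RAM. A rough way to size the limit by hand, assuming the Apache 2.2 prefork setup these lucid instances appear to run (an assumption):

    # average resident size of one Apache child, in MB
    ps -C apache2 -o rss= | awk '{sum += $1; n++} END { if (n) print sum / n / 1024 }'

    # memory actually available on the box
    free -m

    # where the current cap lives
    grep -ri maxclients /etc/apache2/

Free memory divided by the per-child figure gives an upper bound for MaxClients; anything above that and the box swaps, then hits the OOM killer, which is what the Free ram CRITICAL lines describe.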
[11:38:30] labs-home-wm: yeah, yeah we get the message [11:49:48] PROBLEM host: deployment-web is DOWN address: deployment-web CRITICAL - Host Unreachable (deployment-web) [11:50:15] RECOVERY Current Load is now: OK on deployment-web4 deployment-web4 output: OK - load average: 0.01, 0.75, 4.15 [11:54:35] PROBLEM Total Processes is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [11:55:15] PROBLEM dpkg-check is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [11:56:05] PROBLEM Current Load is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [11:56:45] PROBLEM Current Users is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [11:57:25] PROBLEM Disk Space is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [11:58:15] PROBLEM Free ram is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [12:20:44] PROBLEM host: deployment-web is DOWN address: deployment-web CRITICAL - Host Unreachable (deployment-web) [12:31:04] PROBLEM host: deployment-web2 is DOWN address: deployment-web2 CRITICAL - Host Unreachable (deployment-web2) [12:51:44] PROBLEM host: deployment-web is DOWN address: deployment-web CRITICAL - Host Unreachable (deployment-web) [12:55:24] PROBLEM host: deployment-web3 is DOWN address: deployment-web3 CRITICAL - Host Unreachable (deployment-web3) [13:01:14] PROBLEM host: deployment-web2 is DOWN address: deployment-web2 CRITICAL - Host Unreachable (deployment-web2) [13:17:35] lol one host left for beta.wmflabs.org wikis [13:22:45] PROBLEM host: deployment-web is DOWN address: deployment-web CRITICAL - Host Unreachable (deployment-web) [13:24:05] PROBLEM Current Load is now: CRITICAL on dumps-2 dumps-2 output: Connection refused by host [13:24:45] PROBLEM Current Users is now: CRITICAL on dumps-2 dumps-2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:25] PROBLEM Disk Space is now: CRITICAL on dumps-2 dumps-2 output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:25:45] PROBLEM host: deployment-web3 is DOWN address: deployment-web3 CRITICAL - Host Unreachable (deployment-web3) [13:29:05] RECOVERY Current Load is now: OK on dumps-2 dumps-2 output: OK - load average: 0.25, 0.71, 0.54 [13:29:45] RECOVERY Current Users is now: OK on dumps-2 dumps-2 output: USERS OK - 0 users currently logged in [13:30:25] RECOVERY Disk Space is now: OK on dumps-2 dumps-2 output: DISK OK [13:31:15] PROBLEM host: deployment-web2 is DOWN address: deployment-web2 CRITICAL - Host Unreachable (deployment-web2) [13:52:45] PROBLEM host: deployment-web is DOWN address: deployment-web CRITICAL - Host Unreachable (deployment-web) [13:54:07] !log deployment-prep rebooting servers [13:55:30] petan|wk: No log bot :( [13:55:45] PROBLEM host: deployment-web3 is DOWN address: deployment-web3 CRITICAL - Host Unreachable (deployment-web3) [13:55:57] BTW where are the log bot files located? 
It would be good if anyone can just reboot it if it dies out [13:57:35] RECOVERY Disk Space is now: OK on deployment-web2 deployment-web2 output: DISK OK [13:57:35] RECOVERY SSH is now: OK on deployment-web deployment-web output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:57:45] RECOVERY host: deployment-web2 is UP address: deployment-web2 PING OK - Packet loss = 0%, RTA = 0.74 ms [13:57:45] RECOVERY host: deployment-web is UP address: deployment-web PING OK - Packet loss = 0%, RTA = 0.54 ms [13:57:45] RECOVERY SSH is now: OK on deployment-webs1 deployment-webs1 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:57:45] RECOVERY Free ram is now: OK on deployment-web2 deployment-web2 output: OK: 90% free memory [13:57:45] RECOVERY Total Processes is now: OK on deployment-web2 deployment-web2 output: PROCS OK: 104 processes [13:57:50] RECOVERY Free ram is now: OK on deployment-web deployment-web output: OK: 95% free memory [13:57:50] RECOVERY Current Users is now: OK on deployment-webs1 deployment-webs1 output: USERS OK - 0 users currently logged in [13:57:50] RECOVERY Current Users is now: OK on deployment-web deployment-web output: USERS OK - 0 users currently logged in [13:57:50] RECOVERY Total Processes is now: OK on deployment-webs1 deployment-webs1 output: PROCS OK: 104 processes [13:57:55] RECOVERY Total Processes is now: OK on deployment-web deployment-web output: PROCS OK: 99 processes [13:58:00] RECOVERY dpkg-check is now: OK on deployment-web2 deployment-web2 output: All packages OK [13:58:00] RECOVERY Disk Space is now: OK on deployment-webs1 deployment-webs1 output: DISK OK [13:58:00] RECOVERY dpkg-check is now: OK on deployment-web deployment-web output: All packages OK [13:58:00] RECOVERY host: deployment-webs1 is UP address: deployment-webs1 PING OK - Packet loss = 0%, RTA = 1.27 ms [13:58:20] RECOVERY dpkg-check is now: OK on deployment-webs1 deployment-webs1 output: All packages OK [13:59:20] RECOVERY Current Load is now: OK on deployment-webs1 deployment-webs1 output: OK - load average: 0.37, 0.17, 0.06 [13:59:50] RECOVERY Current Users is now: OK on deployment-web2 deployment-web2 output: USERS OK - 0 users currently logged in [13:59:50] RECOVERY Current Load is now: OK on deployment-web2 deployment-web2 output: OK - load average: 0.18, 0.08, 0.02 [13:59:50] RECOVERY Current Load is now: OK on deployment-web deployment-web output: OK - load average: 0.31, 0.22, 0.09 [13:59:50] RECOVERY dpkg-check is now: OK on deployment-web3 deployment-web3 output: All packages OK [14:00:00] RECOVERY host: deployment-web3 is UP address: deployment-web3 PING OK - Packet loss = 0%, RTA = 0.42 ms [14:00:05] bots-labs down... [14:00:49] RECOVERY Disk Space is now: OK on deployment-web3 deployment-web3 output: DISK OK [14:00:49] RECOVERY Free ram is now: OK on deployment-webs1 deployment-webs1 output: OK: 75% free memory [14:01:19] RECOVERY SSH is now: OK on deployment-web2 deployment-web2 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:01:19] RECOVERY Current Users is now: OK on deployment-web3 deployment-web3 output: USERS OK - 0 users currently logged in [14:01:59] RECOVERY SSH is now: OK on deployment-web3 deployment-web3 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:02:39] RECOVERY Disk Space is now: OK on deployment-web deployment-web output: DISK OK [14:03:29] RECOVERY Current Load is now: OK on deployment-web3 deployment-web3 output: OK - load average: 0.11, 0.16, 0.07 [14:03:41] Hydriz: Define down? 
[14:03:51] see nagios [14:03:59] PROBLEM Current Load is now: CRITICAL on deployment-web5 deployment-web5 output: Connection refused by host [14:04:03] !nagio [14:04:05] !nagios [14:04:05] http://nagios.wmflabs.org/nagios3 [14:04:09] RECOVERY Total Processes is now: OK on deployment-web3 deployment-web3 output: PROCS OK: 107 processes [14:04:14] RECOVERY Free ram is now: OK on deployment-web3 deployment-web3 output: OK: 85% free memory [14:04:19] yeah, back up again lulz [14:04:27] Oh I see [14:04:36] I was like hmm all the bots server arn't down [14:04:39] PROBLEM Current Users is now: CRITICAL on deployment-web5 deployment-web5 output: Connection refused by host [14:04:41] Would explain the lack of lgobot [14:05:07] but anyone has any idea how to actually set up a central location of files? [14:05:14] like, /usr/local/apache/common/live [14:05:18] how do we use it? [14:05:19] PROBLEM Disk Space is now: CRITICAL on deployment-web5 deployment-web5 output: Connection refused by host [14:05:32] For the deployment mw install? No idea [14:05:40] I know it's slow as hell and one of the webservers is playing up [14:05:53] yeah [14:05:58] I am hoping to know how [14:06:19] PROBLEM Free ram is now: CRITICAL on deployment-web5 deployment-web5 output: CHECK_NRPE: Socket timeout after 10 seconds. [14:06:24] but I can't seem to find where morebots reside in [14:06:28] Hmm [14:06:33] It doesn't seem to be on bots labs [14:06:36] Maybe it's still on 2 [14:06:59] Ryan will you PLEASE FIX OSM [14:07:06] I checked bots-1, bots-2 and bots-labs [14:07:12] but I can't find it [14:07:15] hmm [14:07:25] unless its in someone's home [14:07:30] bots-2 seems laggy as hell [14:07:32] */home [14:07:33] It might be in petan's home [14:07:39] PROBLEM Total Processes is now: CRITICAL on deployment-web5 deployment-web5 output: Connection refused or timed out [14:07:41] Thought it had it's own user though [14:08:06] sigh [14:08:18] If I knew where it was I could have restarted it long time ago [14:08:19] PROBLEM dpkg-check is now: CRITICAL on deployment-web5 deployment-web5 output: Connection refused or timed out [14:08:24] and log lots of messages [14:08:40] bots-2 seems borked [14:08:55] Either that or my connection to bastion died [14:10:50] yeah, its in petrb [14:10:53] 's home [14:12:29] !log . [14:13:45] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [14:14:54] * Damianz pokes petan|wk [14:15:02] petan|wk: Where is the bot located? [14:15:06] bots-2 [14:15:08] Is that on labs or can it be moved to labs? [14:15:14] should be [14:15:17] if u can move it [14:15:21] I don't where the package is [14:15:27] hyperon: ^ [14:15:31] he knows [14:15:42] !hyperon is admin of logs [14:15:42] Key was added! [14:15:43] where of bots-2? [14:15:52] huh [14:15:55] I don't know [14:15:57] somewhere [14:16:26] god damn [14:16:27] we need someone to purge dns [14:23:04] grrr where is the bot... [14:23:55] !log dumps New project created for uploading of Wikimedia Dumps to the Internet Archive. [14:23:56] Logged the message, Master [14:23:58] * Damianz sends Hydriz to fix it [14:24:08] its on bots-2, right? [14:24:14] but I just can't find it [14:24:45] now bots-2 is hanging... 
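For the bot hunt above, a sweep over the candidate instances is quicker than guessing; a sketch, assuming the bot's process name and files contain the string "morebots" and that its checkout lives somewhere under /home (both assumptions; per the discussion it eventually turned up in petrb's home directory):

    for h in bots-1 bots-2 bots-labs; do
        echo "== $h =="
        ssh "$h" 'pgrep -fl morebots; find /home -maxdepth 3 -iname "*morebots*" 2>/dev/null'
    done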
[14:24:49] I can't even login to bots-2 [14:24:55] RECOVERY host: deployment-web5 is UP address: deployment-web5 PING OK - Packet loss = 0%, RTA = 0.78 ms [14:25:02] Really need to sort out the bots servers D: [14:25:39] !log bots fixed you, logbot [14:25:40] Logged the message, Master [14:25:55] yeah, fixed, but for now [14:26:28] we need a more permanent solution [14:28:07] YES found it [14:30:22] Damianz: we need to balance load [14:30:34] some of instances are 0 loaded and some are too much [14:33:55] PROBLEM host: deployment-web5 is DOWN address: deployment-web5 CRITICAL - Host Unreachable (deployment-web5) [14:34:08] lulz down again [14:38:01] no [14:38:05] problem with dns only [14:40:35] petan: Yeah we could do with writing up some schedular and just having 3/4 nodes with all the bots installed. [14:40:42] yup [14:40:57] problem is that you can't easily move one bot to another instance while it's running [14:41:15] some bots are running all the time [14:41:23] like irc bots [14:41:35] but other could be solved using that [14:41:52] we should create a static instances and dynamic [14:41:59] on static would be running bots which run all time [14:42:08] on dynamic would be tasks [14:42:18] task would be scheduled and started on machine with lowest load [14:47:33] Yeah [14:54:53] Damianz: you know an open source cluster scheduler for this [14:55:05] which could start the tasks over all instances? [14:55:59] I don't [14:56:39] Not off the top of my head that would perfectlaly suit this [14:56:44] heh [15:04:05] RECOVERY Free ram is now: OK on deployment-web5 deployment-web5 output: OK: 80% free memory [15:04:15] RECOVERY host: deployment-web5 is UP address: deployment-web5 PING OK - Packet loss = 0%, RTA = 3.94 ms [15:04:15] RECOVERY dpkg-check is now: OK on deployment-web5 deployment-web5 output: All packages OK [15:04:35] RECOVERY Total Processes is now: OK on deployment-web5 deployment-web5 output: PROCS OK: 89 processes [15:04:40] RECOVERY Current Users is now: OK on deployment-web5 deployment-web5 output: USERS OK - 0 users currently logged in [15:04:55] RECOVERY Current Load is now: OK on deployment-web5 deployment-web5 output: OK - load average: 0.25, 0.18, 0.06 [15:08:25] RECOVERY Disk Space is now: OK on deployment-web5 deployment-web5 output: DISK OK [15:08:47] how many web servers do I need to make to handle load [15:08:48] damn [15:11:08] Once we have ganglia we could do some cool auto spinning up stuff based on load metrics. [15:11:17] hm... 
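A very small version of the "start the task on the machine with the lowest load" idea from the scheduler discussion above needs nothing beyond ssh and /proc/loadavg; a sketch, with the host list and the dispatched command as placeholders:

    #!/bin/bash
    # run-on-idle.sh TASK... : run a one-off task on the least-loaded host
    hosts="bots-1 bots-2 bots-3"
    best=$(for h in $hosts; do
        load=$(ssh -o ConnectTimeout=5 "$h" cat /proc/loadavg | cut -d' ' -f1)
        [ -n "$load" ] && echo "$load $h"
    done | sort -n | awk 'NR == 1 { print $2 }')
    [ -n "$best" ] && exec ssh "$best" "$@"

This only covers the "dynamic" one-off tasks from that discussion; the always-on IRC bots would still have to stay pinned to the static instances.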
[15:11:21] I can see in nagios the load [15:11:36] I use it instead of ganglia to track it :P [15:11:55] it's ok most of time but then all instances go to load over 30 at same time [15:12:14] and go oom [15:12:30] Them ooming is the main outage from what I've seen on deployment [15:12:32] interesting is that squid has 0 load in that time [15:12:49] like the requests are not even for cached content [15:13:09] that could be either some hacker or someone doing a lot of edits [15:14:32] I don't believe so many people are using deployment site [15:14:45] we likely have more web server than users [15:15:15] rofl [15:15:43] the squid is fascinating it never eats a lot of memory neither load [15:15:53] it looks to me that it just proxy the web [15:16:02] I am wondering if it actually cache anything [15:21:41] If it's never over a load of 0 and the apache servers are falling over it's probably not [15:22:05] Squid is a lot nicer to the servers than apache though :D [15:31:49] the load is not always 0 but I think it's some background services eating [15:32:19] I would need to talk to someone who understand squid but I found it hard [15:32:41] I find varnish with multiple backends and a director easier to configure [15:32:43] because there are almost no people who do [15:32:52] does it have a documentation heh? [15:33:03] I mean some proper heh [15:33:22] because I would like to switch to anything else what does [15:33:35] however we should follow the prod config [15:33:50] unless we switch to varnish on prod we shouldn't do that on labs [15:34:07] petan, I remember I checked the squid load with you [15:34:09] also it would be nice if squid was in puppet and kept synced with prod [15:34:15] I don't remember the conclusion, though [15:34:26] conclusion was that we didn't find out if it works or not [15:34:39] headers were "cache hiy [15:34:42] cache hit [15:34:50] then it was being cached [15:34:51] however apache wrote to log that it created the page [15:34:54] in same time [15:35:01] so even if it was cached it loaded apache [15:35:01] was it? [15:35:08] yes [15:35:10] oh, I remember now [15:35:17] since now we have 6 apaches [15:35:19] it did the If-Modified-by [15:35:22] it's quite harder to check it [15:35:34] yes [15:35:35] instead of knowing that apache would have notified him [15:35:48] I mean, if all apaches servers are down't you won't open any page [15:35:58] that's weird [15:36:07] I would like the squid to send cached pages instead of error [15:36:31] yes, that's what we want [15:36:44] although the current configuration is doing something [15:36:50] we need to talk with someone who understand how squid works and tell us what's wrong [15:37:02] mediawiki isn't so loaded as if it had to reparse it [15:37:07] people from ##squid likely send you to ##apache where they send you back [15:37:23] or if we could view WMF production config [15:37:26] we could diff them [15:37:27] we can't [15:37:32] that suck [15:37:38] *if* [15:37:41] it does [15:38:00] I don't really know what is so secret on it [15:38:03] it will be some silly option [15:38:08] the banned ip adresses should be banned on labs too [15:38:23] because if they cause troubles on prod they can cause troubles on labs [15:38:45] I don't think they store their nuclear launch codes in the squid config file :) [15:38:51] maybe yes [15:39:45] btw is there a web interface of white house to launch nuclear artilery when you know the code? 
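Back on the caching question above ("I am wondering if it actually cache anything"), the squid's own response headers give the answer; a sketch using the beta hostname that appears later in this log, with the page itself as a placeholder and assuming an anonymous request made through the squid:

    # fetch the same URL twice through the squid and compare the cache headers
    url=http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
    for i in 1 2; do
        curl -s -D - -o /dev/null "$url" | grep -Ei '^(x-cache|age|cache-control):'
    done

"X-Cache: HIT" (or a non-zero Age) on the second response means squid answered without touching Apache; a MISS every time matches the "apache wrote to log that it created the page" observation above.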
[15:39:52] I am wondering how the hacker could use it [15:40:43] like nuclear code is 5653462646$ now what :) [15:40:51] heh [15:40:54] it's a comedy [15:43:09] maybe you could sell it on ebay [15:43:25] nuclear codes for 10$ [15:46:55] wasn't the group-can't-write problem already fixed? [15:47:02] heh [15:47:08] it was, but don;t ask me how :) [15:47:16] is there a problem now? [15:47:19] by replacing the LDAP library [15:47:26] not really [15:47:32] platonides@deployment-dbdump:/usr/local/apache/common/live$ touch test [15:47:33] touch: cannot touch `test': Permission denied [15:47:38] weird [15:47:41] ll -d . [15:47:42] drwxrwxr-x 16 petrb depops 4096 2012-02-23 08:48 ./ [15:47:55] and id show that I belong to depops [15:47:56] ok, so problem is that you aren't in depops [15:48:06] hm [15:48:14] the permission is 775 [15:48:20] so group can write [15:48:23] it seems I need to change the primary group to that [15:48:26] isn't test existing there? [15:48:32] using newgrp [15:48:38] I didn't [15:48:40] and it works to me [15:48:51] unix permissions sucks [15:49:03] I don't know why it can't use role based perm [15:49:12] that's something I like on ntfs [15:49:33] so that we could define permissions for more groups etc [15:49:45] linux systems support acls [15:49:52] I never understand how it work [15:50:03] I tried to play with it but it didn't work as I wanted [15:50:22] what? acls in linux ? [15:50:25] yes [15:50:43] I wasn't able to define multiple roles [15:51:00] what do you mean by roles? [15:51:05] I assume those would be groups [15:51:08] you create 5 groups [15:51:13] 1 group can read write [15:51:18] second group can only read [15:51:22] everyone can't do anything [15:51:29] third can read and execute... etc [15:51:33] yes, it's supported [15:51:50] that's what I wasn't able to make work [15:52:32] I'd need to check the docs for the exact syntax to add it, but should be a no-brainer [15:52:41] btw, see http://en.wikipedia.beta.wmflabs.org/w/squid-test.php [15:52:59] when you reload, it shouldn't increment the date [15:53:07] it does [15:53:28] Cache-Control: s-maxage=18000, must-revalidate, max-age=0 [15:53:30] because we didn't find the squid switch we want [15:53:50] mmh... [15:54:06] X-Cache: MISS from i-000000dc.pmtpa.wmflabs [15:54:22] that page can serve to detect when we find out [15:54:37] I think that squid tries to validate if it's fresh [15:54:44] yes, it does [15:54:47] hm [15:54:49] we want to skip that check [15:55:00] I think if your page was static it would be cached [15:55:03] we could probably get it by removing the must-validate [15:55:07] I copied the header from MediaWiki [15:55:09] $response->header( 'Cache-Control: s-maxage='.$this->mSquidMaxage.', must-revalidate, max-age=0' ); [15:55:21] ok [15:55:25] but this is from prod [15:55:55] note that the squids shall then rewrite the s-maxage to s-maxage=0 [15:56:05] for caches in front of the squids [15:56:16] but I think I forced squid to cache pagese [15:56:18] pages [15:56:23] no matter if server wants it or not [15:56:25] hehehe: refresh_pattern [15:56:30] New option 'ignore-must-revalidate'. [15:57:11] where's squid config? [15:57:22] ;'/etc/squid/s* [15:57:44] /etc/squid [15:57:50] is it on all servers? 
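Returning to the ACL exchange a little further up: the "exact syntax" left open there is setfacl/getfacl from the acl package, and it handles the multi-group layout described (one group read-write, another read-only, a third read-execute, nothing for others), provided the filesystem is mounted with the acl option. A sketch with placeholder group and file names:

    setfacl -m g:editors:rw-,g:readers:r--,g:runners:r-x,o::--- somefile
    getfacl somefile    # show the resulting entries

For a shared tree like /usr/local/apache/common/live you would add -R and a matching default ACL (setfacl -R -d -m ...) so files created later inherit the same entries.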
[15:57:55] there is only one [15:57:58] deployment-squid [15:58:10] so I need to edit from that machine [15:58:12] we have 1 squid and 6 apaches [15:58:16] yes [15:58:36] we will probably need to have more squid servers if we fix it [15:58:44] but that's easy to set up [15:58:50] I would make proxy server on front [15:58:58] that would redirect trafic to random squid's [15:59:59] Platonides: can you insert echo for server name? [16:00:11] so we can see where it was created :) [16:00:35] I'd like to know if all servers are ok [16:00:47] sure [16:02:30] !log deployment-dbdump Installed joe [16:02:31] deployment-dbdump is not a valid project. [16:02:48] it's prep [16:02:59] !log prep Installed joe on deployment-dbdump [16:03:00] prep is not a valid project. [16:03:06] !log deployment-prep platonides needs to check the project name [16:03:07] Logged the message, Master [16:03:20] I never remember :P [16:03:28] !log deployment-prep Installed joe on deployment-dbdump [16:03:29] heh [16:03:29] Logged the message, Master [16:03:37] @search deployment [16:03:37] No results found! :| [16:04:42] !deployment-prep is project to test mediawiki before putting it to prod at beta.wmflabs.org [16:04:42] Key was added! [16:04:46] @search deployment [16:04:46] No results found! :| [16:04:48] damn [16:04:54] !deployment-prep del [16:04:54] Successfully removed deployment-prep [16:05:03] !deployment-prep is deployment-prep is a project to test mediawiki before putting it to prod at beta.wmflabs.org [16:05:03] Key was added! [16:05:07] @search deployment [16:05:07] Results (found 1): deployment-prep, [16:07:37] @search .. [16:07:37] No results found! :| [16:07:43] @search .** [16:07:43] No results found! :| [16:07:46] @search .*.* [16:07:46] No results found! :| [16:07:50] ah [16:07:56] @regsearch . [16:07:56] Results (found 100): morebots, git, bang, nagios, bot, labs-home-wm, labs-nagios-wm, labs-morebots, gerrit-wm, wiki, labs, extension, wm-bot, putty, gerrit, change, revision, monitor, alert, password, unicorn, bz, os-change, instancelist, instance-json, leslie's-reset, damianz's-reset, amend, credentials, queue, sal, info, security, logging, ask, sudo, access, $realm, keys, $site, bug, pageant, blueprint-dns, bots, stucked, rt, pxe, ghsh, group, pathconflict, terminology, etherpad, epad, nova-resource, pastebin, newgrp, osm-bug, bastion, ryanland, afk, test, initial-login, manage-projects, rights, new-labsuser, cs, puppet, new-ldapuser, projects, quilt, labs-project, openstack-manager, wikitech, load, load-all, wl, domain, docs, instance, address, ssh, documentation, help, account, start, link, socks-proxy, requests, magic, accountreq, gitweb, labsconf, console, ping, hexmode, Ryan, resource, account-questions, hyperon, deployment-prep, [16:08:36] yay [16:09:15] !deployment-prep del [16:09:15] Successfully removed deployment-prep [16:09:21] !deployment-prep is deployment-prep is a project to test mediawiki at beta.wmflabs.org before putting it to prod [16:09:21] Key was added! [16:15:05] wow so squid use random server in every load of page [16:15:16] if we weren't using memcached we couldn't have sessions [16:15:28] mediawiki store sessions on fs [16:15:35] yay [16:15:36] cool [16:19:35] why do you think mediawiki has support for storing sessions in memcached? 
;) [16:20:05] we are using squid 2.7.9-7wm2 [16:20:12] ignore-must-revalidate was added in 3.1 [16:21:18] I know [16:21:30] I was successfull in configuring 3x squid once [16:21:46] it's much better I don't know why we use 2x [16:23:02] maybe due to wm patches? [16:23:21] ok what's a problem to patch the new version [16:23:28] if the patches were public I could do it [16:23:28] I think they weren't trivial to port [16:23:34] it's c++ or not? [16:24:20] it's C [16:24:26] they are public [16:24:32] I looked at it once [16:24:37] like the explanations of them? [16:24:57] I understand that source code is public but what is purpose of patches [16:25:17] I think the more important one was one added by Tim [16:25:18] it may need to be done using other way in newer [16:25:39] to tell what you are looking to, when adding a Vary [16:25:52] hm [16:26:00] so for instance, we don't care about most cookies [16:26:04] but vary on a few of them [16:26:12] right but how we solve this :) [16:26:15] a gadget adding a cookie doesn't break our cache [16:26:27] I would like to see prod config heh [16:27:36] see, the changes are at http://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/squid/debian/ [16:27:50] some patches are easy [16:27:56] such as the one adding the Wikimedia language [16:28:04] others come from debian [16:28:09] and there are wmf ones [16:28:44] ok [16:35:12] !log deployment-prep Installed dpkg-dev on deployment-dbdump [16:35:13] Logged the message, Master [16:35:28] * Platonides is doing apt-get source squid=2.7.9-7wm2 [16:45:16] the caching code is really really old [16:45:37] I wonder if they are doing revalidations with several peers at wmf [17:24:58] petan|wk: you about? [17:27:50] petan, I think I fixed the squid problem [17:31:27] I added the following line to /etc/squid/squid.conf [17:31:28] refresh_pattern . 0 20% 4320 ignore-reload [18:14:19] Where's labs/gerrit password reset button? [18:16:12] 03/01/2012 - 18:16:12 - Updating keys for vvv [18:17:46] hi vvv [18:17:52] hello sumanah [18:17:54] special:....something, just a sec [18:18:00] sumanah: already found [18:18:02] Thank you [18:18:06] vvv: https://labsconsole.wikimedia.org/wiki/Special:PasswordReset [18:18:07] ok [18:18:28] sumanah: it's just weird there's no link from the login form [18:19:29] vvv: please do file a bug about that; I would myself but am in the middle of a couple other things [18:55:12] petan|wk: wait what i am? [18:59:16] I don't suppose anyone is a etherpad whizz? [19:15:36] !project deployment-prep | ssmollett [19:15:37] ssmollett: https://labsconsole.wikimedia.org/wiki/Nova_Resource:deployment-prep [19:18:51] Damianz: what about etherpad? [19:22:18] Dm, I found the answer... someone decided hard coding the prefix was a good idea. [19:25:48] * Damianz returns to smacking his head against a wall [19:39:04] Seriosuly who doesn't allow you to change the prefix for something [19:39:09] * Damianz rants and goes and makes a subdomain [19:39:22] who? [19:40:01] etherpad has no way to change the prefix it's on and trying to updating all the server code just ended up with me having borked javascript includes. [19:40:16] is this etherpad-lite? [19:40:42] yeah [19:40:55] Apparently the nice looking one is now 'old' [19:51:27] etherpad-lite looks nicer than the original :) [20:00:29] I prefer the sidebar tbh, it's all a bit weird... also seems rather slow but that could just be nodejs playing funny buggers. 
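Back on the squid thread: after adding a line such as the refresh_pattern ... ignore-reload mentioned above, squid has to re-read its configuration before anything changes; a sketch, assuming the stock squid 2.7 packaging on the deployment-squid instance:

    sudo squid -k parse          # syntax-check /etc/squid/squid.conf first
    sudo squid -k reconfigure    # apply the change without restarting or losing the cache

The /w/squid-test.php page mentioned earlier is then the quickest check: on reload the date it prints should stop changing and X-Cache should start reporting HIT instead of MISS.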
[20:01:01] the data model is kind of silly [20:01:09] it's one table with two columns [20:01:24] Yeah I noticed [20:56:57] !initial-login [20:56:57] https://labsconsole.wikimedia.org/wiki/Access#Initial_log_in [21:24:38] PROBLEM Current Users is now: CRITICAL on storm1 storm1 output: Connection refused by host [21:25:18] PROBLEM Disk Space is now: CRITICAL on storm1 storm1 output: Connection refused by host [21:25:25] !account-questions | Thehelpfulone [21:25:25] Thehelpfulone: I need the following info from you: 1. Your preferred wiki user name. This will also be your git username, so if you'd prefer this to be your real name, then provide your real name. 2. Your preferred email address. 3. Your SVN account name, or your preferred shell account name, if you do not have SVN access. [21:26:08] PROBLEM Free ram is now: CRITICAL on storm1 storm1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:27:29] PROBLEM Total Processes is now: CRITICAL on storm1 storm1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:28:18] PROBLEM dpkg-check is now: CRITICAL on storm1 storm1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [21:28:58] PROBLEM Current Load is now: CRITICAL on storm1 storm1 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:03:58] PROBLEM Current Load is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:04:38] PROBLEM Current Users is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:05:18] PROBLEM Disk Space is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:06:08] PROBLEM Free ram is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:07:28] PROBLEM Total Processes is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:08:18] PROBLEM dpkg-check is now: CRITICAL on storm2 storm2 output: Connection refused by host [22:10:41] !initial-login | Thehelpfulone [22:10:41] Thehelpfulone: https://labsconsole.wikimedia.org/wiki/Access#Initial_log_in [22:10:46] you can ignore the part about password reset [22:10:52] since I sent you an account by email [22:18:54] Ryan_Lane: It appears my newly created instance had some problems. https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=reportcard&instanceid=i-00000178 [22:19:15] Ryan_Lane: I plan to just nuke it and make a new one, but I figure you might want to look. [22:19:21] which project is this in? [22:19:23] I'm not in it ;) [22:19:45] Aren't you an admin1? [22:19:54] it's in reportcard [22:20:09] (it shouldn't really be, but it's not like I can create projects.) [22:20:10] I need to modify OSM to always allow cloudadmins [22:20:26] (and there is no general "analytics" project) [22:20:41] Ryan_Lane: thanks [22:20:54] dschoon: yeah, this is a known issue [22:20:59] it occasionally happens [22:21:06] the only fix right now is to delete/recreate [22:21:10] it's annoying, yes [22:24:00] no big. [22:34:05] PROBLEM host: storm2 is DOWN address: storm2 CRITICAL - Host Unreachable (storm2) [22:34:14] yep. [22:34:17] that's true. [22:34:22] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&project=reportcard&instanceid=i-00000179 [22:34:28] that's the new one. [22:34:38] different problem, i guess? [22:35:29] I guess Ryan_Lane can look if he is interested. Maybe "storm2" is a bad name. I'll stick to odd numbers. [22:36:00] what size instance are you trying? 
[22:36:08] occasionally this will happen a few times [22:36:17] there's something weird with the repo, I think [22:36:18] m2.med [22:36:30] setting up a two-machine storm cluster [22:36:32] I don't know why it happens occasionally [22:36:45] i want to play with it before the talks tomorrow, as i am a fan atm. [22:37:12] the name shouldn't matter [22:37:26] I've seen this happen more with larger instances [22:37:29] i know. roman numerals. [22:37:36] i'll switch to roman numbering. [22:37:37] it could be that the instance swaps and has some issue with kvm [22:37:46] all of the hosts are swapping now [22:38:55] RECOVERY host: storm2 is UP address: storm2 PING OK - Packet loss = 0%, RTA = 0.64 ms [22:39:15] you lie, labs-nagios-wm [22:39:40] hm. it shouldn't lie about that... [22:39:58] though I guess it can come up and simply not work. heh [22:40:04] well, yes. [22:40:08] that is what i really mean. [22:40:12] * Ryan_Lane nods [22:40:34] * Ryan_Lane is trying to add a new node to the cluster, which should make this hopefully happen less [22:43:57] PROBLEM Current Load is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:44:37] PROBLEM Current Users is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:44:56] so i just logged into storm2 [22:45:17] PROBLEM Disk Space is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:45:55] dsc ~ ❥ ssh storm2.pmtpa.wmflabs hostname [22:45:55] i-0000017a [22:46:07] PROBLEM Free ram is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:47:27] PROBLEM Total Processes is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:48:05] Ryan_Lane: So apparently storm11 and storm2 have confused each other. [22:48:16] oh? [22:48:17] PROBLEM dpkg-check is now: CRITICAL on storm11 storm11 output: CHECK_NRPE: Error - Could not complete SSL handshake. [22:48:22] see above. [22:48:23] oh. I know [22:48:26] dns cache [22:48:33] gimme a sec and I'll clear it for you [22:48:36] storm11 = i-0000017a [22:48:45] storm2 = i-00000179 [22:48:58] i guess an "a" kinda looks like a "9"? [22:49:01] maybe unix is confused. [22:50:28] nah. you deleted and recreated [22:50:39] so it got cached in dns [22:51:27] try now [22:52:48] Ryan_Lane: nope. i still see them both as 17a [22:52:57] hm [22:53:00] lemme check ldap [22:53:10] I also flushed my system dns. [22:53:40] they are correct in DNS [22:53:52] why not use the instance name, rather than the ID? [22:54:16] hm? [22:54:22] hm. weird [22:54:24] I see what you mean [22:54:33] storm2 is showing as 17a [22:54:34] wtf [22:54:45] I told you. [22:54:54] ZOMBIE INSTANCE HAS EATEN ITS BRAINS [22:55:00] hahaha [22:55:01] ZOMBIE INSTANCE WILL TAKE OVER CLUSTER [22:55:02] NOOO [22:55:04] I tried to wipe the wrong record [22:55:18] fixed [22:55:21] this is just a comedy of errors [22:55:32] btw. how do I dig labs machines? [22:55:35] what is the DNS server? 
[22:55:59] clearly going through bastion somehow makes *.wmflabs resolve [22:56:32] virt0 is the dns server [22:56:36] virt0.wikimedia.org [22:58:31] ty [23:01:07] RECOVERY Free ram is now: OK on storm2 storm2 output: OK: 94% free memory [23:01:57] RECOVERY Current Load is now: OK on storm2 storm2 output: OK - load average: 0.14, 0.10, 0.09 [23:02:27] RECOVERY Total Processes is now: OK on storm2 storm2 output: PROCS OK: 93 processes [23:02:37] RECOVERY Current Users is now: OK on storm2 storm2 output: USERS OK - 0 users currently logged in [23:03:17] RECOVERY dpkg-check is now: OK on storm2 storm2 output: All packages OK [23:05:17] RECOVERY Disk Space is now: OK on storm2 storm2 output: DISK OK [23:08:38] So uh. [23:08:51] Ryan_Lane, shall I delete storm2? [23:09:17] did it fail to build? [23:10:17] it's working. [23:10:20] why would you kill it? :) [23:10:40] I'm logged into it and everything [23:12:44] dschoon: it's working [23:12:48] no need to delete it [23:13:07] I wiped the DNS cache properly and it resolves properly now [23:14:31] Ryan_Lane: cool, thanks [23:14:41] yw [23:14:48] I wish I could find that bug [23:15:55] Ryan_Lane: I'll try again. [23:17:03] 03/01/2012 - 23:17:03 - Creating a home directory for abartov at /export/home/etherpad/abartov [23:17:11] 03/01/2012 - 23:17:10 - Creating a home directory for abartov at /export/home/bastion/abartov [23:17:56] Ryan_Lane: Do you know how to silence the "If you are having access problems" message from (I assume) bastion? [23:18:03] 03/01/2012 - 23:18:03 - Updating keys for abartov [23:18:12] 03/01/2012 - 23:18:11 - Updating keys for abartov [23:18:38] I don't think you can [23:19:00] why's that? [23:19:25] wait, maybe you can [23:19:32] dschoon: That's an SSH banner, you can't shut it up [23:19:55] it may be possible to ignore it on the client side [23:20:07] Ryan_Lane: Ah yes, you are correct [23:20:15] dschoon: You using PuTTY on Windows? [23:20:26] No. [23:20:29] Urm [23:20:44] On PuTTY there's an option to display it or not [23:20:52] methecooldude: I care because if I execute commands remotely like: [23:21:06] dschoon: -oLogLevel=error [23:21:48] ssh storm@storm1.pmtpa.wmflabs nimbus $argv [23:21:54] Or: ssh -q user@server [23:22:03] yeah -q works too [23:22:08] oh, cool [23:22:18] you can set loglevel in your config [23:22:26] or -q (that's -oLogLevel=quiet) [23:22:33] Can `LogLevel=error` go in .ssh/config? [23:22:37] yep [23:22:39] nice. [23:25:59] sticking it in the Ryan_Lane methecooldude `Host bastion1.eqiad.wmflabs` def silences them all [23:26:16] * Ryan_Lane nods [23:26:31] * methecooldude nods also [23:41:51] so Ryan_Lane, how do I bring up a box to work on, and specifically, how do I get that Etherpad dump on that box? [23:42:02] we'll have to provide the dump [23:42:12] to create instances, see.... [23:42:14] !instances [23:42:14] https://labsconsole.wikimedia.org/wiki/Help:Instances [23:42:18] !security [23:42:18] https://labsconsole.wikimedia.org/wiki/Security_Groups [23:42:27] !security del [23:42:27] Successfully removed security [23:42:41] !security is https://labsconsole.wikimedia.org/wiki/Help:Security_Groups [23:42:41] Key was added! [23:55:54] Ryan_Lane: cool, thanks. 
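On the banner-silencing thread above: the same effect as -q or -oLogLevel=error can be made permanent in the client config, so it also applies to remote commands like the ssh ... nimbus call quoted there; a minimal ~/.ssh/config stanza, with the host patterns only as examples:

    # ~/.ssh/config
    Host bastion1.eqiad.wmflabs *.pmtpa.wmflabs
        LogLevel ERROR

ERROR (or QUIET) hides the banner and the usual connection chatter while still letting real failures through.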
[23:56:36] yw [23:56:46] there are occasional problems with the build process [23:57:04] likely due to the hosts currently being slightly overloaded [23:57:19] you can see if a build is successful or not by looking in the console log [23:57:37] if you see a ruby or puppet package fail to install, then the instance build failed [23:57:44] if puppet runs then it is successful [23:59:24] abartov: most things are documented automatically via recent changes, but actions you do manually on instances are not [23:59:30] you should log them via !log [23:59:51] !logging [23:59:51] To log a message, use the following format: !log
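To make the console-log check just described mechanical, the console output shown on labsconsole can be saved and grepped for that failure signature; a rough sketch, with the file name as a placeholder:

    # console.log = the instance's console output copied from labsconsole
    grep -iE 'ruby|puppet' console.log | grep -iE 'fail|error|unable'

No matches, plus a visible puppet run near the end of the output, is the "build succeeded" case described above.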