[00:08:53] matanya: You can log into bastion but not nagios-dev? Or neither? [00:09:07] niether [00:09:25] all I get is: If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [00:10:04] hm… give bastion another try? [00:12:01] no shell [00:16:42] matanya: You're on windows using putty? [00:16:54] no [00:17:00] linux shell on linux [00:17:32] kosher [00:17:35] Hm, ok. Should be easy then… I assume you set up your keypair already? [00:17:58] I did [00:18:21] on labsconsole as well as on gerrit? (oddly you have to upload your public key to both places at the moment.) [00:18:23] GNU shell on linux [00:18:28] (probably) [00:18:54] yes marktraceur [00:19:06] I don't remember, TBH andrewbogott [00:19:53] I see i did andrewbogott [00:20:11] here? https://labsconsole.wikimedia.org/wiki/Special:NovaKey [00:20:40] yes [00:22:55] andrewbogott: no luck so far [00:25:13] Is it kicking you off, or hanging? [00:25:22] hanging [00:25:41] matanya: what does 'ssh -v user@host' say? [00:25:51] now I was kicked [00:25:57] It's doing that for me too -- something is broken. [00:26:16] now I'm in [00:26:20] yep [00:26:32] it is slow as hell [00:26:36] Yeah, and now it's behaving OK. [00:26:45] This may be related to the failure of labsconsole an hour ago [00:27:34] thank you andrewbogott . how can I access my instance now? I'm kicked on publickey issue [00:27:53] you need to forward your key to bastion. Lemme find the doc page for that [00:28:22] https://labsconsole.wikimedia.org/wiki/Help:Access#Using_agent_forwarding [00:28:26] (If you aren't already doing that) [00:29:42] I think I'm I'll re-try [00:30:25] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 151 processes [00:33:18] andrewbogott: I can't remember the passphrase [00:33:46] for your ssh key? You can just make a new one and upload that. [00:34:03] ok [00:38:22] RECOVERY Free ram is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: OK: 22% free memory [00:39:02] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 22% free memory [00:39:16] andrewbogott: Permissions 0644 for '/home/matanya/.ssh/id_rsa' are too open. [00:39:17] It is required that your private key files are NOT accessible by others. [00:39:17] This private key will be ignored. [00:39:17] bad permissions: ignore key: /home/matanya/.ssh/id_rsa [00:39:17] Permission denied (publickey). [00:39:52] matanya: Run chmod 600 /home/matanya/.ssh/id_rsa and try again [00:40:00] i did [00:40:03] it is after this [00:40:23] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 149 processes [00:40:42] oh, missed the pub [00:41:20] matanya@bastion1:~$ ssh nagios-dev [00:41:20] If you are having access problems, please see: https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances [00:41:20] Connection closed by 10.4.0.201 [00:41:39] I'm doing something wrong [00:42:02] RoanKattouw: can you have a look please? [00:42:03] you're connecting to bastion with ssh -A ? [00:42:08] yes [00:42:33] no public key error [00:42:40] but no connection [00:42:46] If you just uploaded a new key, then it might take a few minutes for it to get copied... [00:42:59] ok, thanks [00:51:52] andrewbogott: 10 minutes? [00:52:31] hi, what's wrong with /data/project? (can't open) [00:52:37] is it just me? [00:53:11] matanya: Hard for me to guess what's happening… maybe you're ssh-adding your public rather than your private key, or uploaded your private key rather than your public? [00:53:31] no, I didn't [00:53:49] gribeco: It isn't just you but it might be just your instance. What system is that happening on? [00:54:05] the problem is from bots-salebot [00:54:23] bastion1 too [00:56:24] debug1: Authentications that can continue: publickey [00:56:24] debug1: Next authentication method: publickey [00:56:24] debug1: Offering public key: /home/matanya/.ssh/id_rsa-labs [00:56:24] debug1: Server accepts key: pkalg ssh-rsa blen 279 [00:56:25] Connection closed by 10.4.0.201 [00:57:49] gribeco, is that a brand new instance or has it been around a while? [00:58:10] andrewbogott: it's been up 44 days [00:58:38] gribeco: Ok, good to know... [00:58:50] bastion1 is showing the same problem: [00:58:52] gribeco@bastion1:~$ ls /data/project/ [00:58:53] ls: cannot open directory /data/project/: No such file or directory [01:00:16] matanya, try now? [01:02:11] hanging [01:03:32] stuck here: debug1: Authentications that can continue: publickey [01:03:32] debug1: Next authentication method: publickey [01:03:32] debug1: Offering public key: /home/matanya/.ssh/id_rsa-labs [01:05:57] PROBLEM Free ram is now: WARNING on message-remailer.pmtpa.wmflabs 10.4.0.251 output: Warning: 15% free memory [01:06:27] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [01:08:57] PROBLEM Total processes is now: WARNING on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS WARNING: 186 processes [01:09:11] gribeco, matanya, I guess gluster is still unstable, I'm seeing all kinds of volume hang issues. (And ssh keys are mounted via gluster so that affects logins as well.) [01:09:33] andrewbogott: ok, I'll use a local fs for now [01:09:50] sorry. I'm working on it but might not have it sorted out today. [01:10:04] thanks andrewbogott I'll try again some other time. [01:10:19] matanya: We've been very stable lately, you've just had the bad luck to show up during a storm :( [01:10:34] as usual :) [01:11:36] oh, and every day, someone seems to be banging on mysql like crazy, it's causing my bot to fail connecting to its database [01:11:45] everyday at this time [01:11:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [01:12:12] no I get publickey issues [01:12:42] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 151 processes [01:14:02] RECOVERY Total processes is now: OK on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS OK: 107 processes [01:25:19] is there a problem with the login to the instances ? [01:26:02] Wikinaut: Sort of… gluster is exceptionally slow and misbehaviing in other ways. So authentication takes forever. [01:26:18] yes : I don't get authentication [01:26:25] so better trying tomorrow ... [01:26:33] ty [01:26:42] I need to go for a bit but will look at this again later tonight. [01:26:58] I'm getting Permission denied (publickey). on bastion [01:34:22] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 151 processes [01:42:33] PROBLEM Total processes is now: WARNING on vumi-metrics.pmtpa.wmflabs 10.4.1.13 output: PROCS WARNING: 151 processes [01:42:43] PROBLEM Current Load is now: WARNING on bastion-restricted1.pmtpa.wmflabs 10.4.0.85 output: WARNING - load average: 9.05, 8.23, 6.17 [01:57:33] PROBLEM SSH is now: CRITICAL on bastion-restricted1.pmtpa.wmflabs 10.4.0.85 output: Server answer: [02:09:23] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 146 processes [02:19:22] PROBLEM Free ram is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: Critical: 5% free memory [02:32:02] PROBLEM Current Load is now: WARNING on nagios-dev.pmtpa.wmflabs 10.4.0.201 output: WARNING - load average: 5.00, 5.02, 5.01 [02:37:34] RECOVERY Total processes is now: OK on vumi-metrics.pmtpa.wmflabs 10.4.1.13 output: PROCS OK: 147 processes [02:56:52] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 151 processes [03:07:24] cant log into bastion O_o [03:11:04] PROBLEM Free ram is now: WARNING on bots-liwa.pmtpa.wmflabs 10.4.1.65 output: Warning: 19% free memory [03:16:05] RECOVERY Free ram is now: OK on bots-liwa.pmtpa.wmflabs 10.4.1.65 output: OK: 20% free memory [03:24:03] PROBLEM Free ram is now: WARNING on bots-liwa.pmtpa.wmflabs 10.4.1.65 output: Warning: 19% free memory [03:42:24] PROBLEM Total processes is now: WARNING on wikidata-testrepo.pmtpa.wmflabs 10.4.0.164 output: PROCS WARNING: 151 processes [03:44:54] PROBLEM Current Load is now: WARNING on etherpad-lite.pmtpa.wmflabs 10.4.0.87 output: WARNING - load average: 6.00, 5.71, 5.24 [03:47:22] RECOVERY Total processes is now: OK on wikidata-testrepo.pmtpa.wmflabs 10.4.0.164 output: PROCS OK: 150 processes [03:47:33] PROBLEM Current Load is now: WARNING on bots-apache01.pmtpa.wmflabs 10.4.0.141 output: WARNING - load average: 6.05, 5.83, 5.33 [03:47:33] PROBLEM Current Load is now: WARNING on wikidata-testrepo.pmtpa.wmflabs 10.4.0.164 output: WARNING - load average: 6.00, 5.81, 5.33 [03:55:22] PROBLEM Total processes is now: WARNING on wikidata-testrepo.pmtpa.wmflabs 10.4.0.164 output: PROCS WARNING: 154 processes [04:40:54] RECOVERY Free ram is now: OK on message-remailer.pmtpa.wmflabs 10.4.0.251 output: OK: 23% free memory [04:41:24] RECOVERY Free ram is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: OK: 22% free memory [04:41:34] PROBLEM Total processes is now: WARNING on vumi-metrics.pmtpa.wmflabs 10.4.1.13 output: PROCS WARNING: 151 processes [04:41:44] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 147 processes [04:41:54] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 22% free memory [04:49:22] PROBLEM Free ram is now: CRITICAL on rt-puppetdev6.pmtpa.wmflabs 10.4.0.24 output: Critical: 5% free memory [04:49:22] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [04:53:58] PROBLEM Free ram is now: WARNING on message-remailer.pmtpa.wmflabs 10.4.0.251 output: Warning: 15% free memory [04:54:28] PROBLEM Free ram is now: WARNING on rt-puppetdev6.pmtpa.wmflabs 10.4.0.24 output: Warning: 7% free memory [05:04:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [05:06:58] log in to pmtpa bastion is broken. see labs-l and i just reported it in #wikimedia-operations too [05:13:11] gluster is fucky [05:13:20] andrewbogott_afk was going to look at it after he went to do something [05:13:44] [01:26:21] Wikinaut: Sort of… gluster is exceptionally slow and misbehaviing in other ways. So authentication takes forever. [05:13:47] was a while ago [05:37:02] RECOVERY Current Load is now: OK on nagios-dev.pmtpa.wmflabs 10.4.0.201 output: OK - load average: 0.35, 2.95, 4.25 [05:37:32] RECOVERY SSH is now: OK on bastion-restricted1.pmtpa.wmflabs 10.4.0.85 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [05:37:42] RECOVERY Total processes is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS OK: 150 processes [05:44:58] RECOVERY Current Load is now: OK on etherpad-lite.pmtpa.wmflabs 10.4.0.87 output: OK - load average: 0.03, 0.97, 4.28 [05:47:32] RECOVERY Current Load is now: OK on bots-apache01.pmtpa.wmflabs 10.4.0.141 output: OK - load average: 0.01, 0.65, 3.73 [05:47:32] RECOVERY Current Load is now: OK on wikidata-testrepo.pmtpa.wmflabs 10.4.0.164 output: OK - load average: 0.01, 0.64, 3.70 [05:47:42] RECOVERY Current Load is now: OK on bastion-restricted1.pmtpa.wmflabs 10.4.0.85 output: OK - load average: 0.01, 0.77, 4.31 [06:16:41] andrewbogott: still can't log in. now to bastion itself. :( [06:16:57] Permission denied (publickey). [06:17:03] I'm working on it but it's going to be a while. [06:17:23] sure. I won't bother you. can you give an ETA? [06:17:39] PROBLEM Free ram is now: WARNING on aggregator1.pmtpa.wmflabs 10.4.0.79 output: Warning: 19% free memory [06:20:55] nope! [06:28:45] PROBLEM Total processes is now: WARNING on deployment-mc.pmtpa.wmflabs 10.4.0.81 output: PROCS WARNING: 153 processes [06:29:44] PROBLEM Total processes is now: WARNING on parsoid-roundtrip5-8core.pmtpa.wmflabs 10.4.0.125 output: PROCS WARNING: 156 processes [06:30:38] PROBLEM Total processes is now: WARNING on syslogcol-srv.pmtpa.wmflabs 10.4.0.238 output: PROCS WARNING: 155 processes [06:30:47] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 155 processes [06:31:37] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 154 processes [06:41:42] PROBLEM Total processes is now: CRITICAL on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS CRITICAL: 203 processes [06:44:02] RECOVERY Free ram is now: OK on message-remailer.pmtpa.wmflabs 10.4.0.251 output: OK: 28% free memory [06:44:32] PROBLEM Total processes is now: WARNING on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: PROCS WARNING: 165 processes [06:49:41] PROBLEM Free ram is now: CRITICAL on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:41] RECOVERY Total processes is now: OK on parsoid-roundtrip5-8core.pmtpa.wmflabs 10.4.0.125 output: PROCS OK: 150 processes [06:52:14] PROBLEM dpkg-check is now: CRITICAL on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: DPKG CRITICAL dpkg reports broken packages [06:53:44] RECOVERY Total processes is now: OK on deployment-mc.pmtpa.wmflabs 10.4.0.81 output: PROCS OK: 149 processes [06:54:14] PROBLEM Free ram is now: WARNING on newchanges-bot.pmtpa.wmflabs 10.4.0.221 output: Warning: 16% free memory [06:54:25] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [06:54:54] PROBLEM dpkg-check is now: CRITICAL on newchanges-bot.pmtpa.wmflabs 10.4.0.221 output: DPKG CRITICAL dpkg reports broken packages [06:57:15] RECOVERY dpkg-check is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: All packages OK [06:59:32] PROBLEM Total processes is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: PROCS CRITICAL: 217 processes [07:00:33] RECOVERY Total processes is now: OK on syslogcol-srv.pmtpa.wmflabs 10.4.0.238 output: PROCS OK: 150 processes [07:00:43] PROBLEM dpkg-check is now: CRITICAL on spellcheckself-bot.pmtpa.wmflabs 10.4.0.246 output: DPKG CRITICAL dpkg reports broken packages [07:01:43] PROBLEM Free ram is now: WARNING on spellcheckself-bot.pmtpa.wmflabs 10.4.0.246 output: Warning: 8% free memory [07:04:14] RECOVERY Free ram is now: OK on newchanges-bot.pmtpa.wmflabs 10.4.0.221 output: OK: 26% free memory [07:04:43] RECOVERY dpkg-check is now: OK on newchanges-bot.pmtpa.wmflabs 10.4.0.221 output: All packages OK [07:08:32] PROBLEM Total processes is now: WARNING on syslogcol-srv.pmtpa.wmflabs 10.4.0.238 output: PROCS WARNING: 151 processes [07:11:43] PROBLEM Total processes is now: WARNING on bastion1.pmtpa.wmflabs 10.4.0.54 output: PROCS WARNING: 152 processes [07:11:53] PROBLEM Total processes is now: WARNING on deployment-mc.pmtpa.wmflabs 10.4.0.81 output: PROCS WARNING: 152 processes [07:20:43] RECOVERY dpkg-check is now: OK on spellcheckself-bot.pmtpa.wmflabs 10.4.0.246 output: All packages OK [07:21:33] RECOVERY Free ram is now: OK on spellcheckself-bot.pmtpa.wmflabs 10.4.0.246 output: OK: 22% free memory [07:21:43] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 165 processes [07:42:47] PROBLEM Total processes is now: WARNING on parsoid-roundtrip5-8core.pmtpa.wmflabs 10.4.0.125 output: PROCS WARNING: 151 processes [08:20:08] PROBLEM dpkg-check is now: CRITICAL on deployment-cache-upload03.pmtpa.wmflabs 10.4.0.214 output: CHECK_NRPE: Socket timeout after 10 seconds. [08:20:13] RECOVERY Total processes is now: OK on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: PROCS OK: 141 processes [08:22:50] RECOVERY dpkg-check is now: OK on deployment-cache-upload03.pmtpa.wmflabs 10.4.0.214 output: All packages OK [08:37:41] RECOVERY Total processes is now: OK on parsoid-roundtrip5-8core.pmtpa.wmflabs 10.4.0.125 output: PROCS OK: 145 processes [08:39:31] RECOVERY Free ram is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: OK: 22% free memory [08:39:51] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 22% free memory [08:40:52] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 147 processes [08:47:51] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [08:51:42] RECOVERY Total processes is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS OK: 150 processes [08:51:43] RECOVERY Total processes is now: OK on bastion1.pmtpa.wmflabs 10.4.0.54 output: PROCS OK: 149 processes [08:59:43] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 151 processes [09:02:23] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [09:44:43] RECOVERY Total processes is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS OK: 150 processes [10:03:53] PROBLEM Total processes is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:04:03] PROBLEM SSH is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CRITICAL - Socket timeout after 10 seconds [10:04:34] PROBLEM Current Load is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:13] PROBLEM dpkg-check is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:07:02] PROBLEM Disk Space is now: CRITICAL on bots-3.pmtpa.wmflabs 10.4.0.59 output: CHECK_NRPE: Socket timeout after 10 seconds. [10:12:22] !log deployment-prep rebooting the varnish-t3 instance, nslcd can't resolve somepath [10:12:25] Logged the message, Master [10:21:55] !log deployment-prep nslcd probably points to a wrong LDAP or has a faulty DNS configuration. Can't login on it anymore :/ [10:21:57] Logged the message, Master [10:55:43] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 151 processes [12:37:28] RECOVERY Free ram is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: OK: 22% free memory [12:37:48] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 22% free memory [12:45:44] RECOVERY Total processes is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS OK: 150 processes [12:47:34] PROBLEM Total processes is now: WARNING on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: PROCS WARNING: 170 processes [12:57:33] PROBLEM Total processes is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: PROCS CRITICAL: 208 processes [13:00:23] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [13:05:53] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [13:24:17] !pinh [13:24:19] !ping [13:24:19] pong [13:32:00] i'm having trouble logging into my labs instance. is bastion ok? [13:33:00] <^demon> Working for me. [13:34:29] thanks -- this might be something with my connection [15:02:42] PROBLEM Total processes is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: PROCS WARNING: 158 processes [16:42:28] !ping [16:42:29] does anyone see what am I saying? [16:42:29] !log test test [16:42:29] test is not a valid project. [16:42:29] I suppose yes [16:42:30] there is some problem :/ [16:42:30] labs are too slow [16:50:53] petan, try ssh'ing to bots-1 now. [16:50:53] PROBLEM Free ram is now: WARNING on mediawiki-bugfix-kozuch.pmtpa.wmflabs 10.4.0.26 output: Warning: 18% free memory [16:50:54] yes I am there [16:50:55] but no idea how much stable is that [16:50:55] !ping [16:50:55] Ryan seems very convinced this is an ldap issue… maybe gluster is broken because of ldap and not the other way 'round... [16:50:55] PROBLEM Disk Space is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:55] PROBLEM Current Load is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:55] PROBLEM SSH is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: CRITICAL - Socket timeout after 10 seconds [16:50:55] PROBLEM dpkg-check is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:56] pong [16:50:56] PROBLEM Free ram is now: CRITICAL on incubator-apache.pmtpa.wmflabs 10.4.0.116 output: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:45] !ping [16:51:45] pong [16:52:15] well, it used to be faster but it seems to be recovered somehow, some data were lost though... but it was because of me panicking too [16:52:38] !ping [16:52:38] pong [16:52:53] andrewbogott how did u fix it? [16:53:23] PROBLEM Free ram is now: WARNING on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: Warning: 18% free memory [16:53:29] I stopped and restarted that particular volume. But Ryan changed some ldap stuff at the same time, so might be my efforts were meaningless [16:53:51] mhm [16:54:35] I will just keep an open root session in screen, just to make sure I will be able to get there should this happen in future [16:56:36] andrewbogott what is /public/keys/root for? [16:57:57] hm… that is probably where I should have a key so I can ssh in as root [16:58:47] oh, nope. [16:58:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [16:59:42] RECOVERY Total processes is now: OK on bastion1.pmtpa.wmflabs 10.4.0.54 output: PROCS OK: 144 processes [17:01:28] andrewbogott I tried to upload my key to /root/.ssh/authorized_keys but, for some reason ssh is not looking there for keys, but to /public/keys [17:01:49] and there is no key at all in there [17:01:59] petan: Yes -- /public/keys is the gluster share. That's how logins are managed across all projects. [17:02:09] So it won't help much to have a root key there. [17:02:17] meh, it would be cool to have alternative for time when gluster is down [17:08:33] RECOVERY Free ram is now: OK on mediawiki-bugfix-kozuch.pmtpa.wmflabs 10.4.0.26 output: OK: 26% free memory [17:32:44] hmm, cant resolve bots2 or 4 O_o [17:34:14] PROBLEM Disk Space is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: DISK WARNING - free space: / 573 MB (5% inode=70%): [17:34:59] petan, has something happened resolving instance hostnames? [17:36:33] addshore yes [17:36:41] @labs-resolve bastion [17:36:41] I don't know this instance - aren't you are looking for: I-000000ba (bastion1), I-0000019b (bastion-restricted1), I-00000390 (deployment-bastion), [17:36:46] @labs-resolve bastion1 [17:36:46] The bastion1 resolves to instance I-000000ba with a fancy name bastion1 and IP 10.4.0.54 [17:36:55] @labs-resolve bots-1 [17:36:55] The bots-1 resolves to instance I-000000a9 with a fancy name bots-1 and IP 10.4.0.48 [17:37:00] @labs-resolve bots-2 [17:37:00] The bots-2 resolves to instance I-0000009c with a fancy name bots-2 and IP 10.4.0.42 [17:37:03] hmm [17:37:13] addshore it's bots-2 not bots2 [17:38:07] I know, im just gonna stick in i-000000b4.pmtpa.wmflabs e.t.c for now :) [18:02:40] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Kolossos link https://www.mediawiki.org/w/index.php?diff=639383 edit summary: [+337] /* OSM */ [18:02:57] Change on 12mediawiki a page Wikimedia Labs/Toolserver features needed in Tool Labs was modified, changed by Kolossos link https://www.mediawiki.org/w/index.php?diff=639384 edit summary: [+0] /* OSM */ [18:06:58] PROBLEM Current Load is now: WARNING on nagios-main.pmtpa.wmflabs 10.4.0.120 output: WARNING - load average: 10.77, 9.75, 7.18 [18:22:35] hi, is reportcard.wmflabs.org/ live? [18:23:57] oh, you could have just asked that to begin with :) [18:25:31] jeremyb: oh? [18:27:22] sj__: doesn't work for me either but idk the latest with that service [18:28:09] sj__: i'll try to dig up the mail where it was announced (which says who was running it). may mail or jabber you because my computer should be closed for the next 60+ mins (probably) [18:28:14] bbl [18:29:36] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 150 processes [18:30:13] thanks. [18:37:43] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 152 processes [18:41:03] sj__: errrr, no i was totally wrong. i was thinking of orgcharts [18:41:41] seems to be a dns cache issue. [18:41:53] (I'm in SF) [18:41:58] hosts file :) [18:42:03] ok, bbl [18:42:11] right, thx [18:42:23] RECOVERY Current Load is now: OK on nagios-main.pmtpa.wmflabs 10.4.0.120 output: OK - load average: 1.83, 2.75, 4.67 [18:42:43] RECOVERY Total processes is now: OK on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS OK: 150 processes [18:42:45] hey guys - something odd seems to be happening to DNS because http://reportcard.wmflabs.org isn't accessible from there but http://208.80.153.208/ is [18:42:58] the kripke instance seems to be down, ssh-ing into it doesn't work because I can't reach bastion [19:08:33] RECOVERY Free ram is now: OK on stackfarm-sql2.pmtpa.wmflabs 10.4.1.23 output: OK: 21% free memory [19:09:21] hey Krinkle, do you have any insight on labs DNS being down? [19:09:30] I'm stalking anyone who can help at this point :) [19:09:55] milimetric: No, but I can walk around the office and see if someone else knows [19:10:10] milimetric: Exactly what is the issue? [19:10:14] http://cvn.wmflabs.org/ seems to work fine for me [19:10:23] robla tried - CT and Ryan aren't around [19:10:24] thanks very much though [19:10:44] well it looks like some people who have DNS entries cached are ok [19:10:45] http://deployment.wikimedia.beta.wmflabs.org/ and http://integration.wmflabs.org/testswarm/ also work [19:10:47] andrewbogott: ping ^ [19:10:50] so clearly DNS isn't broken I guess? [19:10:54] but when those caches expire, bastion isn't accessible [19:10:56] milimetric: Right.. [19:11:13] sumanah: I'm working on it [19:11:16] ok [19:11:20] sorry for the nag [19:11:24] woa!! [19:11:25] it's back up [19:11:48] Krinkle - you're a magician :) [19:13:02] hehe, you have no idea. [19:13:22] or you do. [19:25:33] RECOVERY Current Load is now: OK on parsoid-roundtrip7-8core.pmtpa.wmflabs 10.4.1.26 output: OK - load average: 4.65, 4.83, 5.00 [19:46:34] PROBLEM Total processes is now: WARNING on parsoid-spof.pmtpa.wmflabs 10.4.0.33 output: PROCS WARNING: 151 processes [20:00:42] ok… milimetric, matanya, marktraceur, petan: What's still broken? [20:00:59] nothing I know [20:01:05] !ping [20:01:05] pong [20:08:05] Ahm [20:08:09] hi andrewbogott: nothing that affects me [20:08:12] Creating directory '/home/catrope'. [20:08:14] Unable to create and initialize directory '/home/catrope'. [20:08:18] Is this one of the gluster problems? [20:08:22] I get this when logging into an instance [20:08:37] RoanKattouw, what instance, what project? [20:08:41] thank you andrewbogott, we all appreciate the help [20:08:56] andrewbogott: ve-change-marking [20:09:29] RoanKattouw, that's the project name too? [20:09:37] Ahm [20:09:39] VisualEditor, probably? [20:13:36] RoanKattouw, any change? [20:14:03] Nope, still borked [20:14:14] hm. Won't let me in either, but different behavior. [20:19:15] RoanKattouw: Other instances in that project look OK… is it possible that that instance is just broken? [20:19:27] I also can't get a console log for it, which seems suspicious [20:19:39] (can /you/ log into other VisualEditor instances?) [20:20:31] I can log into parsoid-spof [20:21:38] might be worth rebooting vw-change-marking if you think that won't break anything… I suspect it's having a problem unrelated to the general labs issues. [20:32:44] https://www.mediawiki.org/wiki/Talk:Wikimedia_Labs/Agreement_to_disclosure_of_personally_identifiable_information [20:34:55] <^demon> Um, I don't think so? [20:35:03] <^demon> There's nowhere in Gerrit that exposes your IP. [20:35:17] <^demon> Dunno about rest of labs, but not gerrit. [20:35:55] PROBLEM Current Load is now: WARNING on nagios-main.pmtpa.wmflabs 10.4.0.120 output: WARNING - load average: 8.66, 8.74, 6.08 [20:38:23] ^demon, in labs other users could run last(1) or who(1) to see your ip [20:38:37] <^demon> Yeah, there's that. [20:38:50] can anybody point me to the labs rules which prohibit mails there, pls? thx [20:40:13] ^demon, I suggest you reply explaining that using just gerrit doesn't expose your ip [20:40:26] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 22% free memory [20:41:53] <^demon> "If you're just using Gerrit, there isn't any place where your IP would be exposed to other users. ^demon (talk) 20:41, 1 February 2013 (UTC)" [20:45:31] !log integration Upgraded Gerrit on integration-jenkins2 to 336eb70b51fe2328d4dd21fef3c78ba11e32758d [20:45:33] Logged the message, Master [20:48:26] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [20:49:36] andrewbogott: if I am still around when you come back from lunch, do you happen to know whether labs support LVS (I guess no) [20:50:33] andrewbogott: 2nd question, why do we have labs IP reserved for LVS service in puppet? [21:03:59] hashar: I'm back, but I don't know anything. [21:04:17] andrewbogott: :-] [21:04:29] I have it that earlier today while polishing the mobile class [21:04:33] for beta [21:04:34] grr [21:04:38] does not make any sense let me restart [21:04:49] so when polishing up the role::cache::mobile class for beta [21:05:10] I have ended up with the varnish backend pointing to a random ip address instead of the beta apaches. [21:05:19] turns out the IP comes from lvs::configuration::some_variable [21:05:26] which is pointing to some 10.4.0.254 [21:05:38] maybe they got pre reserved for a future LVS setup in labs [21:05:42] I got to hack that somehow Iguess [21:05:53] RECOVERY Current Load is now: OK on nagios-main.pmtpa.wmflabs 10.4.0.120 output: OK - load average: 2.75, 3.43, 4.88 [21:08:13] hashar: Link me to your patch? [21:09:09] https://gerrit.wikimedia.org/r/#/c/44709/ [21:09:16] it is not doing that much though [21:09:24] the root cause is mark rewrite [21:09:53] which made all role::cache sub classes to lookup the IP from a huge hash somewhere in role::cache::configuration [21:09:59] and from another hash in lvs::configuration [21:10:36] wouldn't that huge hash be totally different on labs than in prod though? [21:10:53] s/wouldn't/shouldn't/ [21:11:07] ah and the varnish backends have : [21:11:08] directors => { [21:11:08] "backend" => $lvs::configuration::lvs_service_ips[$::realm]['apaches'][$::mw_primary], [21:11:09] "api" => $lvs::configuration::lvs_service_ips[$::realm]['api'][$::mw_primary], [21:11:42] yeah I will pass an array in the lvs::configuration::lvs_service_ips for labs [21:11:43] that would fix it [21:12:29] That does sort of make it look like we should have lvs in labs… that's a good mark question. [21:12:43] or Leslie / Ryan [21:14:23] yeah, I think Leslie will be back online in an hour or two. [21:16:31] ahh [21:16:47] different subject: I got vim to report errors in my puppet manifests whenever I save the file [21:16:49] hugeee time saver [21:16:58] I should blog about it / write a wiki article [21:17:13] yeah, sounds useful. [21:18:03] that is handled by the syntactic vim plugin https://github.com/scrooloose/syntastic [21:18:11] on save, it triggers puppet parser validate [21:18:14] analyze the output [21:18:27] if there is any error, it adds a left column in vim that show up a mark for any error [21:18:38] and display a window at the bottom showing the list of errors [21:18:39] niceee stuff [21:23:00] ahh [21:23:06] http://en.wikipedia.m.beta.wmflabs.org/ <-- Domain not configured [21:23:06]