[00:19:15] 05/22/2012 - 00:19:15 - Updating keys for bsitu at /export/home/bastion/bsitu [00:19:27] 05/22/2012 - 00:19:27 - Updating keys for bsitu at /export/home/editor-engagement/bsitu [02:02:43] PROBLEM Free ram is now: WARNING on bots-2 i-0000009c output: Warning: 14% free memory [02:39:31] RECOVERY Free ram is now: OK on deployment-squid i-000000dc output: OK: 20% free memory [02:43:19] 05/22/2012 - 02:43:19 - Updating keys for laner at /export/home/deployment-prep/laner [02:50:21] 05/22/2012 - 02:50:19 - Updating keys for laner at /export/home/deployment-prep/laner [02:52:03] PROBLEM Free ram is now: WARNING on deployment-squid i-000000dc output: Warning: 18% free memory [02:54:19] 05/22/2012 - 02:54:19 - Updating keys for laner at /export/home/deployment-prep/laner [02:54:23] RECOVERY Puppet freshness is now: OK on deployment-apache21 i-0000026d output: puppet ran at Tue May 22 02:54:06 UTC 2012 [02:55:20] 05/22/2012 - 02:55:20 - Updating keys for laner at /export/home/deployment-prep/laner [02:56:53] RECOVERY Puppet freshness is now: OK on labs-relay i-00000103 output: puppet ran at Tue May 22 02:56:31 UTC 2012 [03:05:00] New review: Hashar; "First line is some normal text" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/4144 [03:05:47] PROBLEM HTTP is now: CRITICAL on deployment-web3 i-00000219 output: CRITICAL - Socket timeout after 10 seconds [03:05:47] PROBLEM HTTP is now: CRITICAL on deployment-web i-00000217 output: CRITICAL - Socket timeout after 10 seconds [03:05:47] PROBLEM HTTP is now: CRITICAL on deployment-web5 i-00000213 output: CRITICAL - Socket timeout after 10 seconds [03:05:47] PROBLEM HTTP is now: CRITICAL on deployment-web4 i-00000214 output: CRITICAL - Socket timeout after 10 seconds [03:05:51] New review: Hashar; "First line really" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/4144 [03:10:45] PROBLEM HTTP is now: WARNING on deployment-web3 i-00000219 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.017 second response time [03:10:45] PROBLEM HTTP is now: WARNING on deployment-web i-00000217 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.016 second response time [03:10:45] PROBLEM HTTP is now: WARNING on deployment-web5 i-00000213 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.016 second response time [03:10:45] PROBLEM HTTP is now: WARNING on deployment-web4 i-00000214 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.022 second response time [03:37:13] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 15% free memory [03:37:19] !log testlabs Test [03:37:22] Logged the message, Master [03:40:09] Really need to get a wiki [03:40:13] PROBLEM Puppet freshness is now: CRITICAL on localpuppet1 i-0000020b output: Puppet has not run in last 20 hours [03:40:22] Wikipedia-Labs cloak. [03:40:51] What is happening with the wiki.. [03:47:43] !log deployment-prep hashar: (Bug 36870) deleting deployment-web{,3,4,5} [03:47:45] Logged the message, Master [03:52:05] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory [03:58:41] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory [04:00:31] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:01:17] !log Status PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:01:17] Status is not a valid project. 
[04:01:36] !log status PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:01:37] status is not a valid project. [04:01:56] !log freenode PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:01:57] freenode is not a valid project. [04:02:11] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [04:02:12] !log test PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:02:12] test is not a valid project. [04:02:39] !log testlabs PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory [04:02:40] Logged the message, Master [04:03:43] !log testlabs RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 96% free memory [04:03:44] Logged the message, Master [04:04:41] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 17% free memory [04:08:51] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 15% free memory [04:11:04] !log bot bots-apache1 has two defunct processes eating CPU: pdflushsh (pid 6382) and 10 (pid 6278) [04:11:05] bot is not a valid project. [04:11:12] !log bots bots-apache1 has two defunct processes eating CPU: pdflushsh (pid 6382) and 10 (pid 6278) [04:11:14] Logged the message, Master [04:13:55] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory [04:14:23] New review: Hashar; "Foo" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/4144 [04:18:53] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory [04:25:30] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory [04:28:50] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 5% free memory [04:29:40] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 21% free memory [04:35:30] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 94% free memory [04:38:46] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 95% free memory [05:11:49] !log deployment-prep creating a second job runner instance deployment-jobrunner02 . Will apply puppet classes later on. [05:11:51] Logged the message, Master [05:17:52] Is bastion down again? [05:18:04] Or, inaccessible - 'connection timed out' [05:24:24] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. [05:25:04] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. [05:26:14] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. [05:26:54] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. [05:27:34] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. [05:28:14] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner02 i-00000279 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
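The failed !log attempts above (04:01-04:02) show how the admin-log bot parses these commands: the first word after !log has to be an existing Labs project name, and the rest of the line is recorded for that project; anything else ("Status", "status", "freenode", "test") is refused. A minimal sketch of the pattern, with PROJECT and MESSAGE as placeholders:

    !log PROJECT MESSAGE
    # accepted: "!log testlabs Test"  ->  "Logged the message, Master"
    # refused:  "!log status ..."     ->  "status is not a valid project."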
[06:04:41] PROBLEM Free ram is now: WARNING on test3 i-00000093 output: Warning: 9% free memory [06:09:41] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: Critical: 1% free memory [06:14:11] RECOVERY Free ram is now: OK on bots-2 i-0000009c output: OK: 57% free memory [06:29:54] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory [06:50:28] PROBLEM Current Load is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [06:55:18] RECOVERY Current Load is now: OK on migration1 i-00000261 output: OK - load average: 0.28, 1.38, 1.04 [07:42:20] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours [07:49:37] ahhh [07:49:39] back [07:49:46] !log deployment-prep installing jobrunner2 [07:49:48] Logged the message, Master [08:19:20] 05/22/2012 - 08:19:19 - Updating keys for laner at /export/home/deployment-prep/laner [08:21:19] 05/22/2012 - 08:21:19 - Updating keys for laner at /export/home/deployment-prep/laner [08:22:06] RECOVERY Current Load is now: OK on deployment-jobrunner02 i-00000279 output: OK - load average: 0.43, 0.94, 1.10 [08:22:25] \O/ [08:23:16] RECOVERY Current Users is now: OK on deployment-jobrunner02 i-00000279 output: USERS OK - 1 users currently logged in [08:23:16] RECOVERY Disk Space is now: OK on deployment-jobrunner02 i-00000279 output: DISK OK [08:23:22] !log deployment-prep started job loop on deployment-job-runner02 [08:23:25] Logged the message, Master [08:23:36] RECOVERY Free ram is now: OK on deployment-jobrunner02 i-00000279 output: OK: 84% free memory [08:24:26] RECOVERY Total Processes is now: OK on deployment-jobrunner02 i-00000279 output: PROCS OK: 113 processes [08:25:31] two job runners yeahhhh [08:25:36] RECOVERY dpkg-check is now: OK on deployment-jobrunner02 i-00000279 output: All packages OK [08:25:38] not sure what they are doing though [08:41:50] !log deployment-prep restarted upd2log on -feed (again) [08:41:52] Logged the message, Master [08:54:53] !log deployment-prep purged all logs from /home/wikipedia/logs/archive/ just to be safe [08:54:55] Logged the message, Master [09:12:45] New patchset: Hashar; "/home/wikipedia/log need to be writable by udp2log!" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8442 [09:13:00] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8442 [09:16:26] New patchset: Hashar; "(bug 37014) udp2log needs write access to /h/w/log" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8442 [09:16:41] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/8442 [09:17:08] New review: Hashar; "Patchset 2 amend commit message to reference the bug number." [operations/puppet] (test); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8442 [09:20:24] New review: ArielGlenn; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8442 [09:20:27] Change merged: ArielGlenn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/8442 [09:25:44] !log deployment-prep Fixed udp2log not able to add new log files in /home/wikipedia/log , that dir need to be writable by udp2log user! 
See https://gerrit.wikimedia.org/r/8442 | https://bugzilla.wikimedia.org/37014 [09:25:46] Logged the message, Master [09:30:51] !log deployment-prep hashar: jobrunner logs are available in /home/wikipedia/logs/runJobs.log now [09:30:52] Logged the message, Master [09:39:57] !log deployment-prep hashar: rebooting jobrunner02 just to be sure it is properly loaded up [09:39:59] Logged the message, Master [09:44:37] hi [09:44:43] hello [09:44:43] :) [09:44:49] everything ok? [09:44:53] I hope labs are back [09:45:22] I will try to setup some replication of sql [09:45:37] because we really need backup if sql server crash [09:46:27] btw hashar there is a script to create wiki [09:46:31] you wanted to make one [09:46:44] but it needs to be tweaked to work with mwmultiversion [09:49:54] yeah addWiki [09:50:00] in extensions/WikimediaMaintenance IIRC [09:50:03] I have used that script [09:50:29] through multiversion ;-D [09:50:40] probably need some better tweaking though [09:51:33] petan|wk: if you are going to setup replication, please use puppet [09:51:39] hashar: when you have time could you give me description of what each machine is for [09:51:51] job runners, cache etc [09:51:57] sure [09:52:08] any idea where to document that? [09:52:16] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Deployment-prep [09:52:39] https://labsconsole.wikimedia.org/wiki/Deployment/Help [09:52:41] I guess [09:53:19] 05/22/2012 - 09:53:19 - Updating keys for laner at /export/home/deployment-prep/laner [09:58:12] ok [09:58:22] is it possible to use puppet for that? [10:00:48] hashar: does the version of script you used work? if so maybe update bin/addWiki [10:01:09] because what we have now doesn't [10:03:06] I have updated the help page : https://labsconsole.wikimedia.org/wiki/Deployment/Help [10:03:12] ok [10:03:17] probably need to use a table instead of a list [10:03:39] hashar: do we want to keep using that motd we have now on some instances or not / can we move it to puppet? [10:04:06] if you like it, move it to puppet ! [10:04:19] I like it because when you ssh there you know where you are :) and what the box is for, it could help people who joined the project and aren't sure how it works [10:04:38] and who to ask if they get in trouble of course [10:04:54] what I did is that whenever I log in labs, I get a yellow PS1 ;-) [10:05:05] huh [10:05:10] oh right [10:05:16] that's not a bad idea [10:05:21] https://github.com/hashar/alix/commit/920e8e7637597d170522002bcc8f81ad4143c600 [10:05:21] we might do that for root [10:05:25] no [10:05:30] don't alter the root prompt [10:05:33] please ;-D [10:05:34] so that when you sudo su you see that you are root [10:05:41] like red line [10:05:42] XD [10:05:49] and big label "don't break stuff" [10:05:52] you will know you are root cause the line is blank and ends with a # ;-D [10:05:55] that is enough [10:05:58] heh [10:06:15] I don't sudo su much [10:06:24] anyway, my bashrc is at https://github.com/hashar/alix/blob/master/bashrc [10:06:42] I once mistyped and removed my home on my computer [10:06:49] since then I don't do that much [10:07:05] I had a backup though [10:07:20] but it wasn't fun to see it disappear heh [10:07:25] do you know what deployment-webs1 is? [10:07:39] yes, it's former instance which was supposed to become https server [10:07:50] ohh [10:07:52] probably can be removed now [10:07:54] so I guess we can get ride of it ? 
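As an aside on the prompt discussion a few lines up (10:04-10:06): the linked bashrc is hashar's own, but the general idea can be sketched in a few lines of bash. This is only an illustration, not his actual configuration, and the domain test for labs hosts is an assumption.

    # colour the prompt yellow when the shell is running on a labs instance,
    # so you always know which box you are on
    case "$(hostname -f 2>/dev/null)" in
        *.wmflabs|*.wmflabs.org)
            PS1='\[\e[1;33m\]\u@\h:\w\$ \[\e[0m\]'   # bold yellow
            ;;
        *)
            PS1='\u@\h:\w\$ '
            ;;
    esac

The same trick with a red colour code (\e[1;31m) in root's shell startup file would give the "big red don't break stuff" prompt petan jokes about, though hashar's point stands that the bare trailing # already tells you that you are root.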
[10:07:55] sure [10:08:11] someone that build the HTTPS infrastructure in production will set it up on labs [10:08:15] I guess Roan / Ryan [10:08:17] k [10:08:19] Roan wanted to do that [10:08:24] but that is very low priority for now [10:08:27] I know [10:09:01] we definitely should setup some kind of backups [10:09:21] until config is in puppet etc [10:09:37] !log deployment-prep Remove deployment-webs instance which was meant to emulate the HTTPS access. Hacky and low priority for now, we will need to setup a nginx proxy one day to properly replicate the production infrastructure. [10:09:39] Logged the message, Master [10:09:58] apaches* and jobrunner* have been installed from puppet [10:10:13] deleting -webs NOW [10:10:58] ok [10:11:19] bbl [10:11:25] cya ;)) [10:12:21] 05/22/2012 - 10:12:21 - Updating keys for laner at /export/home/deployment-prep/laner [10:16:50] PROBLEM host: deployment-webs1 is DOWN address: i-0000012a check_ping: Invalid hostname/address - i-0000012a [10:23:23] New review: Dzahn; "as mentioned above: just for labs (which already has it manually)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6468 [10:26:36] New patchset: Hashar; "class to install the 'tree' utility" [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6468 [10:26:51] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6468 [10:27:39] New review: Dzahn; "(no comment)" [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6468 [10:27:42] Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6468 [10:30:24] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 16% free memory [10:40:24] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 20% free memory [10:51:43] "A newer build of the Ubuntu lucid server image is available. [10:52:11] what does that mean? Must I upgrade? Should I upgrade? How would I do that? [11:04:46] * Barebone pokes Thehelpfulone hard. [11:05:06] hashar: do you know if it's possible to do a rename of the wiki account on labs? [11:05:14] or does that need to be changed in gerrit (?) too [11:06:02] oh actually mutante you might know, being a crat on the labsconsole wiki [11:06:24] can you rename Barebone from "Tanvir Rahman" to "Wikitanvir" on wiki? [11:06:29] Thehelpfulone: quote "If you haven't yet logged into gerrit I can rename your account. Once [11:06:32] you log into gerrit you're mostly stuck." [11:07:06] he still wants gerrit to be Tanvir Rahman, just the wiki username to be wikitanvir [11:07:17] or are they the same because of LDAP? [11:07:18] Nikerabbit: got sudo? you can "apt-get dist-upgrade" your instance, then reboot if it installed a newer kernel [11:07:36] Thehelpfulone: I have no idea [11:07:50] Nikerabbit: the part about the image means there is a new image to install from if you create a new instance, but you dont have to reinstall the existing one [11:08:27] Thehelpfulone: the only thing I know is that both my login and real name are set to "hashar" :-( [11:08:45] thats what I did, don't want to complicate things ;) [11:09:00] Thehelpfulone: wiki user = git user, yes, LDAP, i dont think he can [11:09:18] * Barebone is okay with Tanvir Rahman then. [11:09:23] "Preferred wiki username. This will also be the user's git username, so legal name would be reasonable " [11:09:24] Hola guys btw! [11:09:35] Barebone: have you put your SSH key into labsconsole wiki? [11:09:42] Thehelpfulone, not yet. 
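For the key Barebone is being asked about just above: what the labsconsole wiki expects is an OpenSSH-format public key. A minimal sketch of generating one and finding the line to paste in; the key type, size, file name and comment are all placeholders.

    # generate a key pair and print the public half in OpenSSH format
    ssh-keygen -t rsa -b 2048 -f ~/.ssh/labs_rsa -C "yournick"
    cat ~/.ssh/labs_rsa.pub
    # the single "ssh-rsa AAAA... yournick" line that cat prints is what goes
    # into the labsconsole key form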
[11:09:49] I've given you access to both bastion and bots but nothing is being updated [11:09:53] Barebone: your shell user name can be different [11:10:06] ok, you need to put it in open SSH format into https://labsconsole.wikimedia.org/wiki/Special:NovaKey [11:10:12] mutante: how can we change our real name in LDAP ? [11:10:15] Mutante, it is my nickname/first name --> tanvir [11:10:22] I am happy with that. :-D [11:10:28] Barebone: looks good, id keep it that way, ok [11:10:29] mutante: Gerrit show me as Hashar [11:11:35] hashar: not sure, afraid that is the "mostly stuck" part from the quote above [11:11:51] yeah I guess [11:11:55] we had the question before.. but hmm,,, [11:12:03] and looked at different LDAP fields [11:12:16] I suspect that Ryan uses the same field for wiki username and git realname [11:12:28] which does not make sense to me ;-D [11:12:34] he I will have to poke him [11:12:39] yes, that was what we saw last time we checked [11:12:42] afair [11:12:48] laner is something for him [11:12:49] yes,please do [11:12:52] could be his shell username [11:12:56] it is [11:13:03] oh yeah I remember [11:13:19] so we have a shell username and a wiki username/git realname [11:13:31] we probably need to get the later split in two ;-D [11:13:46] to get fields like: shellname / username / realname [11:14:00] Barebone: well, welcome to labs. planning to join a specific project? [11:14:14] going to need some nice LDAP migration script and conf update in labsconsole / gerrit [11:14:18] hashar: ack [11:15:10] btw, the rules for the shell user name are: "(1 to 17 numbers and lower case ASCII letters, as well as the . (full stop), - (hyphen-minus) and _ (low line) characters) " heh :p [11:16:29] would not recommend shell user "-_._-" though :) [11:21:45] Mutante, for my bot mostly. [11:22:55] Barebone: ah, alright. would be nice if you add some docs about it later here https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots/Documentation [11:24:27] oh yeah, and i would like to merge that page with these some way: http://wikitech.wikimedia.org/view/Category:Bots [11:24:42] (and of course update them all) [11:25:09] any bot owner who wants to help, it's appreciated [11:28:56] Sure Mutante, will think about it. [11:29:16] peeps .. I seem to be unable to login to MySQL on phpmyadmin on bots-sql2 [11:29:31] shell mysql works fine (just got out of that), bots-sql3 works as well [11:29:52] eh, is Labs settled down from yesterday? [11:29:57] -> "#1129 Cannot log in to the MySQL server" [11:30:10] -> "Connection for controluser as defined in your configuration failed." [11:30:35] Maybe not completely settled, but phpmyadmin does not have troubles on bots-sql3 [11:30:52] hmm [11:31:04] I wonder if its a good time to start everything again... [11:31:23] start or restart ? [11:31:31] start [11:31:39] since I already killed everything, perhaps? [11:31:45] :-D [11:31:52] everything is dead? [11:32:07] all bots are stopped, uploading processes too [11:32:31] eh, I am referring to those on my projects [11:32:32] Beetstra: confirmed running mysqld process on bots-sql2 (has root password which i dont know so cant confirm login) [11:33:06] I just killed a server :( [11:33:24] mutante, I know that mysqld works, since I was working in a shell on bots-sql2 (mysql command) [11:33:25] i dont know if its a good time to restart "things" [11:33:28] and labsconsole refuses to report what happened to it now... 
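On Nikerabbit's earlier question about the "newer build of the Ubuntu lucid server image" notice (10:51-11:07): as mutante says, the new image only matters for newly created instances, and an existing one can simply be upgraded in place. A minimal sketch, assuming sudo on the instance:

    sudo apt-get update
    sudo apt-get dist-upgrade
    # reboot only if the upgrade installed a newer kernel
    sudo reboot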
[11:33:41] PROBLEM host: dumps-6 is DOWN address: i-00000266 CRITICAL - Host Unreachable (i-00000266) [11:33:42] But phpmyadmin apparently does not work (properly) [11:33:55] boom, just reported... [11:33:58] My bots are writing into the database as well .. [11:34:02] without problem [11:34:23] Beetstra: ok, well, i just use shell. report phpmyadmin issue as bugzilla bug? [11:34:29] Beetstra, what does your bots do? [11:34:30] ahablaba [11:34:32] 1pm [11:34:41] good feed [11:34:47] nood feed! [11:34:57] I am running LiWa3 in #wikipedia-en-spam and #cvn-sw-spam -> every link addition on wikipedia goes into a db .. [11:35:11] oh, so it checks for link spam? [11:36:05] * Hydriz 's interest was just killed [11:36:10] Yes [11:36:19] Well, it checks for ALL link additions [11:36:31] also the good stuff [11:36:38] zzz sudo reboot breaks everything, sigh [11:37:06] Beetstra: that info is also something that would be great on bots doc page, i once added something to your talk page, but that was somehow lost because at that time liquid threads was on and later disabled [11:37:08] is there some kind of firewall to/on the instances by default? [11:37:14] RECOVERY host: dumps-6 is UP address: i-00000266 PING OK - Packet loss = 0%, RTA = 0.61 ms [11:37:42] Nikerabbit: AFAIK yes [11:37:45] i think nobody had used user:talk on labsconsole much before. but lets do [11:37:55] It is documented, but not really good, mutante [11:38:25] Hydriz: I can access :ssh on my instance but not :8080 from bastion [11:38:39] hmm [11:39:01] not sure about that though :( [11:39:27] Nikerabbit: there is a security group "default" in your project, it probably does not have 8080 in it [11:39:32] \o/ just fixed an error [11:40:16] Nikerabbit: but for example there is group "web" in testlabs, which has 80,443 and 8080 (security groups) [11:40:53] hmm [11:41:04] why is :5666 in the defaults? [11:41:27] it's puppet [11:41:39] wikipedia says nagios [11:42:20] oh yea, of course. you are right [11:42:58] it is Nagios NRPE [11:43:23] to execute checks like disk space on the instances, which you cant just check from remote [11:43:28] remote plugin executor [11:45:01] wow it takes three clicks to remove a rule (with page load between each) [11:55:57] !log deployment-prep create two more job runner instances [11:55:59] Logged the message, Master [11:56:10] going to grab food while nova install them [11:56:22] hmm I can't seem to tunnel into bots-1 petan? [11:56:29] hi [11:56:36] hey [11:56:55] I'm using WinSCP to view files but I can't seem to get into bots-1 [11:57:08] right [11:57:10] can you ssh ther [11:57:14] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 6.30, 6.12, 5.38 [11:57:20] yes [11:57:32] you are using scp or sftp [11:57:32] ah there we go [11:57:46] !tunnel [11:57:50] ok so it's scp [11:57:51] !putty [11:57:51] official site: http://www.chiark.greenend.org.uk/~sgtatham/putty/ | how to tunnel - http://oldsite.precedence.co.uk/nc/putty.html [11:57:55] and it's telling me connection refused [11:58:13] you are using socks proxy? [11:58:17] or tunnel [11:58:25] SSH tunnel [11:58:28] through bastion [11:58:31] how did you initiate it [11:58:36] using winSCP [11:58:41] you need to bind to port 22 [11:58:46] yes I have done [11:58:46] did you? 
[11:58:49] ok [11:58:59] but bots-1:22 [11:59:10] what local port did you bind [11:59:11] I put the port as 22 [11:59:27] when you tunnel you need to specify local port remote address and port [11:59:43] I need to know all these 3 values you used [12:00:04] ok one sec, - it was working with bots-3 the other day [12:00:17] it should have been local port 1234 remote address bots-1 port 22 [12:00:29] then connect sftp to localhost:1234 [12:00:45] ok I can get into bots-3 fine petan|wk [12:00:50] but bots-1 doesn't work [12:01:03] that's clear then [12:01:14] you configured remote address of bots-3 [12:01:26] that's wrong [12:01:36] you can't use remote address of bots-3 and connect to bots-1 [12:01:37] I set this up with mutante, all I need to do in change the instance name [12:01:42] yes [12:01:49] open putty [12:02:00] I can get in via SSH fine [12:02:08] it's just winSCP is telling me "connection refused" [12:02:26] create new tunnel local port 4858 remote address bots-1:22 [12:02:29] ssh to bastio [12:02:41] then open winscp and connect to localhost:4858 [12:03:03] !socks-proxy [12:03:03] ssh @bastion.wmflabs.orgĀ -D ; # [12:03:09] the tunneling is done through winscp not putty [12:03:12] @search bastion [12:03:12] Results (found 4): sudo, bastion, ryanland, socks-proxy, [12:03:20] on winscp, under connection -> tunnel [12:03:23] Thehelpfulone: ok, I don't know winscp [12:03:26] host name: bastion.wmflabs.org [12:03:30] port number: 22 [12:03:34] user name: thehelpfulone [12:03:42] but for the actual "session" [12:03:44] Thehelpfulone: http://winscp.net/eng/docs/ui_login_tunnel [12:03:45] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:03:45] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:04:02] mutante: it works for bots-3, and mailman-01, just not bots-1 [12:04:07] Thehelpfulone:I can only tell you how to open tunnel using ssh [12:04:12] I don't know scp [12:04:25] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:04:25] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:05:05] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:05:05] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:05:14] Thehelpfulone: i guess you have saved sessions in winscp with the tunnel settings for these, but its not activated in te one for bots-1 ? 
if not it would be weird, tunnel should always go through bastion and not make a difference which instance from there [12:05:29] works on bots-2 too [12:05:38] well, it's the same ssh tunnel thing that you setup in putty [12:05:42] it says the it authenticates properly [12:05:45] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:05:46] just in winscp [12:05:56] but then it says network error connection refused [12:06:00] yes [12:06:15] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:06:16] hmm, so I can't get into bots-1 or bots-4 [12:06:55] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:06:55] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:07:35] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner03 i-0000027a output: Connection refused by host [12:08:15] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner04 i-0000027b output: Connection refused by host [12:08:15] Thehelpfulone: you dont have a homedir on bots-4 , thats why [12:09:02] Thehelpfulone: you do have a /hoe on bots-1, you do not on bots-4 [12:09:05] home [12:09:20] hmm [12:09:56] double check if they are all in the same project and if puppet ran [12:10:27] if yes, declare it a bug there are no home dirs on bots-4 [12:11:41] no home? [12:12:36] note: permissions on home dirs in bots-1 are all over the place (root vs. user, svn vs. wikidev) [12:12:59] looks like some are puppet created and some are not? [12:13:15] or just created before/after changes to the way they are created [12:14:07] (saying that because perms on .ssh in home may also be reason for not getting ssh login) [12:14:56] yours on bots-1 looks like it should work though, Thehelpfulone try one more login on bots-1 please [12:15:20] 05/22/2012 - 12:15:19 - Updating keys for lcarr at /export/home/deployment-prep/lcarr [12:16:21] 05/22/2012 - 12:16:20 - Updating keys for catrope at /export/home/deployment-prep/catrope [12:16:21] 05/22/2012 - 12:16:20 - Updating keys for laner at /export/home/deployment-prep/laner [12:16:56] ok sure [12:17:03] !log bots running puppet on bots-4 [12:17:04] Logged the message, Master [12:17:19] 05/22/2012 - 12:17:19 - Updating keys for catrope at /export/home/deployment-prep/catrope [12:17:20] 05/22/2012 - 12:17:19 - Updating keys for lcarr at /export/home/deployment-prep/lcarr [12:17:20] 05/22/2012 - 12:17:19 - Updating keys for laner at /export/home/deployment-prep/laner [12:17:23] session opened for user thehelpfulone [12:17:25] there you go [12:17:29] there we go [12:17:35] how did you fix that? [12:17:43] on bots-1 i did not fix anything [12:17:47] heh [12:17:48] i just watched the log [12:17:57] did you see the failures before? [12:18:27] no, "grep helpful" just shows 2 successes and thats all [12:18:34] looks like your tunnel setup [12:18:54] I didn't change anything though :S [12:19:09] ok, and running puppet on bots-4 does not create homedirs [12:19:20] i guess it must use other puppet classes then [12:19:30] or none [12:22:14] Thehelpfulone: the "connection refused" you got was most likely from bastion host then, not from the instance [12:22:38] ok [12:23:09] ok, gotta go again for the moment. 
packing bags [12:23:17] travel in a little while [12:24:06] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours [12:24:31] petan|wk: maybe you wanna check about bots-4? you, me and hydriz have /homes there, but nobody else .. not sure why yet [12:24:56] if you were involved in bots-4 at all that is [12:24:58] bbl [12:27:19] Thehelpfulone: / petan|wk : ah nevermind! duh, of course that is just the automount thing, it doesnt exist until somebody "cd"s into it. Thehelpfulone try again on bots-4 [12:27:32] lol [12:28:04] yep works :) [12:28:09] ls /home merely tells you who ever connected to this instance before: [12:28:11] :) [12:28:19] where is the home page for bots.wmflabs.org located? [12:28:28] ok cool, stupid me to forget it again.. out now:) [12:28:48] Thehelpfulone: https://labsconsole.wikimedia.org/wiki/Nova_Resource:Bots [12:31:18] Thehelpfulone: on bots-apache1 [12:31:37] yep found it [12:31:38] (you can see in Special:NovaAdress which instance owns which public IP) [12:31:42] out for real [12:31:51] so many instances ;) [12:34:31] oh finally project storage works for precise instances :) [12:52:10] !log deployment-prep deleting deployment-jobrunner{3,4} installation failed I got permission denied. Will recreate them using same hostname [12:52:12] Logged the message, Master [12:53:30] PROBLEM Total Processes is now: CRITICAL on incubator-bot2 i-00000252 output: PROCS CRITICAL: 1139 processes [12:57:38] PROBLEM host: deployment-jobrunner04 is DOWN address: i-0000027b check_ping: Invalid hostname/address - i-0000027b [12:58:19] PROBLEM host: deployment-jobrunner03 is DOWN address: i-0000027a check_ping: Invalid hostname/address - i-0000027a [12:58:28] RECOVERY Total Processes is now: OK on incubator-bot2 i-00000252 output: PROCS OK: 106 processes [13:00:04] !log incubator Fun just started on bot1 and bot2, starting interwiki bots. [13:00:06] Logged the message, Master [13:00:35] RECOVERY host: deployment-jobrunner04 is UP address: i-0000027d PING OK - Packet loss = 50%, RTA = 1072.19 ms [13:01:15] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner04 i-0000027d output: Connection refused by host [13:01:55] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner04 i-0000027d output: Connection refused by host [13:02:25] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 19% free memory [13:03:15] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner04 i-0000027d output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:25] RECOVERY host: deployment-jobrunner03 is UP address: i-0000027c PING OK - Packet loss = 0%, RTA = 1.04 ms [13:03:57] PROBLEM Current Load is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:57] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner04 i-0000027d output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:27] PROBLEM Current Users is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:27] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner04 i-0000027d output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:08:17] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner04 i-0000027d output: CHECK_NRPE: Error - Could not complete SSL handshake. 
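For reference, the WinSCP tunnel petan|wk walks Thehelpfulone through above (12:02-12:06) is the same thing a plain OpenSSH client does with a local port forward. A hedged equivalent, with "yourshellname" as a placeholder and 4858 simply the arbitrary local port suggested in the channel:

    # forward local port 4858 through bastion to the SSH port on bots-1
    ssh -L 4858:bots-1:22 yourshellname@bastion.wmflabs.org
    # then point an SFTP/SCP client at the forwarded port in a second session
    sftp -P 4858 yourshellname@localhost

Whichever client is used, the forward always terminates on bastion, which is why mutante reads the "connection refused" at 11:58 as coming from the bastion end rather than from bots-1, and why the later bots-4 confusion turned out to be nothing more than the automounter not creating /home/<user> until someone first changes into it.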
[13:08:17] PROBLEM Disk Space is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:09:07] PROBLEM Free ram is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:10:05] PROBLEM Total Processes is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:10:37] PROBLEM dpkg-check is now: CRITICAL on deployment-jobrunner03 i-0000027c output: CHECK_NRPE: Error - Could not complete SSL handshake. [13:14:51] PROBLEM Current Load is now: WARNING on deployment-nfs-memc i-000000d7 output: WARNING - load average: 10.70, 8.98, 6.43 [13:24:39] !log deployment-prep deleting refreshLinks2 jobs from enwiki database [13:24:40] Logged the message, Master [13:27:25] PROBLEM Current Load is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:24] PROBLEM Current Users is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:44] PROBLEM Free ram is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:45] PROBLEM Disk Space is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:28:45] PROBLEM Free ram is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:04] eh, god damn it?! [13:30:03] PROBLEM Current Load is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:03] PROBLEM Current Users is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:03] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:30:03] PROBLEM Total Processes is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:29] PROBLEM Total Processes is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:32:34] PROBLEM Free ram is now: CRITICAL on worker1 i-00000208 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:34:15] PROBLEM Current Load is now: CRITICAL on nagios 127.0.0.1 output: CRITICAL - load average: 12.04, 13.97, 8.34 [13:34:15] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 7% free memory [13:35:39] RECOVERY Free ram is now: OK on rds i-00000207 output: OK: 92% free memory [13:35:40] RECOVERY Current Users is now: OK on rds i-00000207 output: USERS OK - 0 users currently logged in [13:35:40] RECOVERY Disk Space is now: OK on rds i-00000207 output: DISK OK [13:35:40] RECOVERY Total Processes is now: OK on rds i-00000207 output: PROCS OK: 80 processes [13:36:55] PROBLEM dpkg-check is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:51] RECOVERY Current Load is now: OK on rds i-00000207 output: OK - load average: 6.04, 6.07, 4.23 [13:38:51] RECOVERY Current Users is now: OK on worker1 i-00000208 output: USERS OK - 0 users currently logged in [13:39:48] PROBLEM Total Processes is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:18] PROBLEM Current Users is now: CRITICAL on migration1 i-00000261 output: CHECK_NRPE: Socket timeout after 10 seconds. 
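A note on the stream of CHECK_NRPE alerts above: as mutante explained earlier (11:42-11:43), port 5666 on every instance is the Nagios remote plugin executor, and the monitoring host runs its checks through it. A hedged sketch of what such a check looks like by hand; the plugin path and check name are common Ubuntu defaults, not confirmed from this log:

    /usr/lib/nagios/plugins/check_nrpe -H deployment-jobrunner03 -c check_disk

Read against that, "Connection refused by host" means nothing is listening on 5666 yet (the instance is still installing), "Could not complete SSL handshake" usually means NRPE is up but not yet configured to accept the monitoring host, and "Socket timeout after 10 seconds" usually means the instance is too overloaded to answer in time, which fits the GlusterFS discussion that follows.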
[13:42:08] RECOVERY dpkg-check is now: OK on migration1 i-00000261 output: All packages OK [13:42:08] PROBLEM Puppet freshness is now: CRITICAL on localpuppet1 i-0000020b output: Puppet has not run in last 20 hours [13:43:36] again? [13:44:04] RECOVERY Total Processes is now: OK on worker1 i-00000208 output: PROCS OK: 84 processes [13:44:09] RECOVERY Free ram is now: OK on worker1 i-00000208 output: OK: 92% free memory [13:44:10] PROBLEM Free ram is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:14] doesn't look like it though... [13:45:25] RECOVERY Total Processes is now: OK on migration1 i-00000261 output: PROCS OK: 84 processes [13:45:29] its confined to only a few instances [13:45:30] RECOVERY Current Users is now: OK on migration1 i-00000261 output: USERS OK - 0 users currently logged in [13:45:30] PROBLEM Free ram is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:31] PROBLEM Current Load is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:31] PROBLEM Disk Space is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:39] ... [13:45:40] PROBLEM dpkg-check is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:45] PROBLEM Current Users is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:45] PROBLEM Total Processes is now: CRITICAL on incubator-bot1 i-00000251 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:46:53] oh god, gluster is rather screwish [13:47:18] ohrly? [13:47:22] :P [13:47:30] PROBLEM dpkg-check is now: CRITICAL on ganglia-test2 i-00000250 output: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:04] PROBLEM Current Users is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:04] PROBLEM Disk Space is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:04] PROBLEM Total Processes is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:09] PROBLEM SSH is now: CRITICAL on ganglia-test2 i-00000250 output: CRITICAL - Socket timeout after 10 seconds [13:49:25] !log deployment-prep Deleting jobrunner03 and 04, not going to need them afterall [13:49:27] Logged the message, Master [13:50:47] yeah lab in trouble [13:50:48] ;-) [13:51:59] paravoid: now I am pretty sure that is the `dump` project [13:52:08] it just overload the glusterfs [13:53:00] RECOVERY SSH is now: OK on ganglia-test2 i-00000250 output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:53:27] fyi: https://gerrit.wikimedia.org/r/ loads as an infinite loop (in .js) for user agent 'Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/15.0 Firefox/15.0a1' [13:55:18] paravoid: http://ganglia.wikimedia.org/latest/?r=hour&cs=05%2F22%2F2012+13%3A00+&ce=05%2F22%2F2012+14%3A00+&m=load_report&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [13:55:29] that is the virtualization cluster from 13:00 to 14:00 UTC [13:55:40] dead? 
[13:56:39] RECOVERY Current Load is now: OK on worker1 i-00000208 output: OK - load average: 0.29, 1.79, 3.50 [13:56:39] RECOVERY Disk Space is now: OK on worker1 i-00000208 output: DISK OK [13:56:54] RECOVERY dpkg-check is now: OK on ganglia-test2 i-00000250 output: All packages OK [13:57:11] paravoid: and here the dumps cluster doing some file transfer : http://ganglia.wmflabs.org/latest/graph.php?c=dumps&m=network_report&r=custom&s=by%20name&hc=4&mc=2&cs=05%2F22%2F2012%2011%3A00%20&ce=05%2F22%2F2012%2014%3A00%20&st=1337694997&g=network_report&z=medium&c=dumps [13:57:25] right... [13:57:28] I am pretty sure that is what is killing the cluster [13:57:34] it was the same monday morning [13:57:39] PROBLEM Current Load is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:41] stopping now... [13:57:57] so whatever that project is doing it seems to be heavy I/O [13:58:15] uploading the Wikimedia dumps [13:58:44] I guess that project could use a different disk than GlusterFS [13:58:52] I wished [13:58:57] PROBLEM dpkg-check is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. [13:59:26] PROBLEM Current Load is now: WARNING on reportcard2 i-000001ea output: WARNING - load average: 5.13, 5.82, 5.02 [14:01:06] oh god, can Wikimedia just sponsor me a server? [14:02:41] PROBLEM Total Processes is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [14:03:31] PROBLEM Current Load is now: CRITICAL on bots-cb i-0000009e output: CRITICAL - load average: 106.50, 50.99, 25.58 [14:03:31] PROBLEM Current Load is now: WARNING on incubator-bot1 i-00000251 output: WARNING - load average: 3.45, 5.16, 5.07 [14:03:31] RECOVERY Current Users is now: OK on incubator-bot1 i-00000251 output: USERS OK - 0 users currently logged in [14:03:31] RECOVERY Disk Space is now: OK on incubator-bot1 i-00000251 output: DISK OK [14:03:31] RECOVERY Total Processes is now: OK on incubator-bot1 i-00000251 output: PROCS OK: 126 processes [14:03:36] RECOVERY Free ram is now: OK on incubator-bot1 i-00000251 output: OK: 89% free memory [14:03:48] PROBLEM Current Load is now: WARNING on bots-apache1 i-000000b0 output: WARNING - load average: 10.78, 10.49, 7.71 [14:04:07] lol the cluster is taking turns [14:04:25] RECOVERY dpkg-check is now: OK on incubator-bot1 i-00000251 output: All packages OK [14:04:33] RECOVERY Current Users is now: OK on aggregator-test2 i-0000024e output: USERS OK - 0 users currently logged in [14:04:33] RECOVERY Disk Space is now: OK on aggregator-test2 i-0000024e output: DISK OK [14:04:33] RECOVERY Total Processes is now: OK on aggregator-test2 i-0000024e output: PROCS OK: 203 processes [14:04:38] PROBLEM Current Load is now: CRITICAL on mobile-testing i-00000271 output: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:38] PROBLEM Current Users is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:38] PROBLEM Disk Space is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:38] PROBLEM Free ram is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:38] PROBLEM Total Processes is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:05:43] I have updated https://bugzilla.wikimedia.org/show_bug.cgi?id=36993 [14:06:04] Hydriz: I have a few ideas about what we could do [14:06:15] it is not in my hands though [14:06:31] PROBLEM Current Load is now: WARNING on incubator-bot2 i-00000252 output: WARNING - load average: 4.32, 6.35, 5.49 [14:07:09] * Hydriz facepalsm [14:07:10] PROBLEM Current Load is now: WARNING on hugglewiki i-000000aa output: WARNING - load average: 1.97, 6.16, 5.02 [14:07:20] PROBLEM Current Load is now: CRITICAL on upload-wizard i-0000021c output: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:21] PROBLEM Disk Space is now: CRITICAL on rds i-00000207 output: CHECK_NRPE: Socket timeout after 10 seconds. [14:07:26] hashar: do tell [14:07:34] I don't feel it's in my hands either [14:07:43] I have added my though on bug 36993 https://bugzilla.wikimedia.org/show_bug.cgi?id=36993 [14:07:50] PROBLEM SSH is now: CRITICAL on aggregator-test2 i-0000024e output: CRITICAL - Socket timeout after 10 seconds [14:07:58] Just some context I can give [14:08:06] I am downloading them from the host at your.org [14:08:10] PROBLEM Current Load is now: WARNING on bots-cb i-0000009e output: WARNING - load average: 0.95, 18.90, 18.61 [14:08:10] RECOVERY Current Load is now: OK on incubator-bot1 i-00000251 output: OK - load average: 0.54, 2.30, 3.85 [14:08:18] then uploading them to the Internet Archive, I believe thats in the documentation [14:08:20] Hydriz: yeah from the eqiad datacenter [14:08:25] PROBLEM Current Load is now: WARNING on aggregator-test2 i-0000024e output: WARNING - load average: 12.10, 9.54, 9.52 [14:08:32] don't you do some processing meanwhile ? [14:08:40] no, not at all [14:08:54] so its something like an airport (transfer flight kind of thing) [14:09:18] PROBLEM Current Load is now: WARNING on mobile-testing i-00000271 output: WARNING - load average: 3.98, 15.16, 12.56 [14:09:18] RECOVERY Current Users is now: OK on upload-wizard i-0000021c output: USERS OK - 0 users currently logged in [14:09:18] RECOVERY Disk Space is now: OK on upload-wizard i-0000021c output: DISK OK [14:09:18] RECOVERY Free ram is now: OK on upload-wizard i-0000021c output: OK: 91% free memory [14:09:18] RECOVERY Total Processes is now: OK on upload-wizard i-0000021c output: PROCS OK: 91 processes [14:09:30] lets recap [14:09:31] how do you access the files from dataset1001 ? [14:09:38] PROBLEM Current Load is now: CRITICAL on reportcard2 i-000001ea output: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:38] rsync download [14:09:47] to which directory on the labs instance ? [14:10:01] dumps-project's /data/project/dumps [14:10:04] RECOVERY Total Processes is now: OK on mobile-testing i-00000271 output: PROCS OK: 201 processes [14:10:09] which hits glusterFS [14:10:21] bastion isn't accepting my RSA public key -- am I a project member? [14:10:32] then what do you do to upload them to the Internet Archive ? [14:10:44] curl [14:10:51] via their S3 interface [14:10:56] hmm [14:10:58] RECOVERY Current Load is now: OK on incubator-bot2 i-00000252 output: OK - load average: 0.35, 2.73, 4.19 [14:11:00] http://archive.org/help/abouts3.txt [14:11:15] is there any point in writing the dump files in /data/project/dumps ? [14:11:29] what? [14:11:37] can't you just curl the files straight from dataset1001 directly to internet archive ? 
[14:11:41] oh no that is an rsync [14:11:42] ahh [14:11:49] thats what I am hoping to do [14:11:53] direct from dataset1001 [14:12:08] RECOVERY Current Load is now: OK on hugglewiki i-000000aa output: OK - load average: 0.18, 2.34, 3.66 [14:12:08] RECOVERY Current Load is now: OK on upload-wizard i-0000021c output: OK - load average: 0.53, 4.35, 4.66 [14:12:08] maybe dataset1001 could NFS export the dump [14:12:15] and this is why I wanted Wikimedia to sponsor me a server and allow me to mount to dataset1001 [14:12:38] RECOVERY SSH is now: OK on aggregator-test2 i-0000024e output: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:12:40] I don't think its feasible to mount dataset1001 from Labs [14:12:52] I don't think we want to do that ;-D [14:12:57] yep [14:13:35] so the only way is to download it and upload it on Labs [14:14:00] let me update the bug report ;-D [14:14:09] RECOVERY Current Load is now: OK on reportcard2 i-000001ea output: OK - load average: 0.11, 1.98, 3.69 [14:14:09] RECOVERY dpkg-check is now: OK on aggregator-test2 i-0000024e output: All packages OK [14:15:06] yes, as you can see from the update [14:15:11] its only the last 5 [14:15:21] which is totally pointless :( [14:15:58] Hydriz: why do you still have 8 instances? [14:16:20] what did you say you wanted to have? [14:16:30] we agreed 4 [14:16:37] ok [14:16:38] I still think that's too many to upload dumps [14:16:43] Hydriz: apparently the dumps might be shared in glusterFS already! Ariel/apergos commented about it https://bugzilla.wikimedia.org/show_bug.cgi?id=36993#c10 [14:16:49] I don't see why more than one is needed [14:16:53] yes, I have access to that share [14:17:03] Hydriz: please just upload directly from there [14:17:25] one by one first, I can answer why [14:17:26] the extra IO of downloading to gluster is killing labs [14:17:46] >1 because its is very slow to upload to the Archie [14:17:48] *Archive [14:18:04] uploading directly from publicdata-project is already somewhat done [14:18:10] because its only the last 5 dumps [14:18:25] hopefully we'll fix the IO situation at some point, but we need to stop killing labs until we've remedied it [14:18:31] Ryan_Lane: oh hi there [14:18:39] which, I have not worked on, to properly extract what dates the dump was generated [14:18:39] paravoid: howdy [14:18:42] didn't realize you were here [14:18:50] just got here [14:19:25] Hydriz: and if you continuously upload the last 5, then archive.org will have all of them [14:19:34] from the point you start uploading [14:19:39] I don't see the problem [14:19:41] I know [14:19:58] and that is what is going to be done very soon [14:20:06] since I was almost finished with the complete run [14:21:05] well, the IO is something we can't handle right now [14:21:27] I have stopped already BTW [14:21:30] ok [14:21:45] loading from the public datasets is fine [14:21:57] the share, that is [14:22:23] okok, since I am actually almost finished with the first complete run, its very possible to cut the instances [14:22:39] which I am not doing now to avoid escalating any kind of issues [14:23:18] RECOVERY Current Load is now: OK on bots-apache1 i-000000b0 output: OK - load average: 3.00, 3.31, 4.85 [14:28:19] RECOVERY Current Load is now: OK on aggregator-test2 i-0000024e output: OK - load average: 0.24, 0.80, 3.91 [14:28:45] okay, I'm still new around here [14:29:00] but uploads to archive.org sound more production-y than labs to me [14:29:03] aren't they? 
[14:29:07] no [14:29:13] PROBLEM Current Load is now: WARNING on nagios 127.0.0.1 output: WARNING - load average: 0.77, 0.99, 3.92 [14:29:18] RECOVERY Current Load is now: OK on mobile-testing i-00000271 output: OK - load average: 0.47, 1.03, 4.08 [14:29:18] because we already have 3 offsite mirrors of dumps [14:29:36] and archive.org only mirrors the media [14:29:39] not the dumps [14:29:48] RECOVERY Current Load is now: OK on deployment-nfs-memc i-000000d7 output: OK - load average: 0.16, 1.02, 3.65 [14:33:08] RECOVERY Current Load is now: OK on bots-cb i-0000009e output: OK - load average: 0.28, 0.48, 4.01 [14:37:16] 05/22/2012 - 14:37:16 - Updating keys for happy-melon at /export/home/bastion/happy-melon [14:38:43] Hydriz: so looks like the solution is to use curl directly from the already existing copy ;-D [14:38:56] Hydriz: that will make uploaded the dumps faster I guess [14:39:13] hmm [14:39:15] RECOVERY Current Load is now: OK on nagios 127.0.0.1 output: OK - load average: 0.79, 0.98, 2.50 [14:39:29] I try to avoid using the main copy (if you referred to dataset1001) [14:39:31] Ryan_Lane: paravoid : should we close https://bugzilla.wikimedia.org/show_bug.cgi?id=36993 about the glusterFS being overloaded / labs dieing often? [14:39:48] Ryan_Lane: paravoid: since we found the cause (the dump copy) [14:39:59] sure [14:40:07] Deleted the 4 additional instances [14:40:31] well, I'd say we found the trigger and worked around the problem [14:40:32] Hydriz: Ariel said that dumps are copied everyday around 4am UTC [14:40:44] yeah. the real problem is gluster [14:40:53] once instance should not kill labs, whatever it does [14:41:16] I will see if I can work around it without using too much bandwidth [14:41:33] so do we reuse that bug to track the glusterFS trouble ? [14:41:51] it should likely be a different bug [14:42:17] PROBLEM Current Load is now: WARNING on deployment-sql i-000000d0 output: WARNING - load average: 6.37, 5.97, 5.30 [14:42:19] k marking 36993 as fixed so [14:42:34] Ryan_Lane: What causes gluster to overload just due to I/O? [14:42:42] gluster does [14:42:43] :) [14:43:19] using it for workloads it wasn't designed (yet) to support most probably [14:43:31] likely [14:43:59] dumps were on instance storage, right? [14:44:13] on project storage mostly [14:44:29] *the* dumps [14:45:00] oh? and then why was it killing other instances' I/O? [14:45:50] Hydriz: does any part of your process involve writing to anywhere other than /data/project? [14:45:59] no [14:46:13] I wonder if it holds it in memory [14:46:17] which causes it to swap [14:46:32] or if anything writes to a tmp directory [14:46:46] but why does it need to write to anywhere else other than /data/project? [14:47:45] it shouldn't need to [14:48:09] are you using curl? rsync? wget? [14:48:24] rsync for download, curl for uploading [14:48:41] wget was an older version of this project when I was doing it on the Toolserver [14:49:07] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 3.44, 3.90, 4.86 [14:49:08] hmm, if I write to /mnt, does it bring lesser chaos? [14:49:12] no [14:49:12] more [14:50:04] then, running just 1 or 2 processes of this uploading brings much lesser chaos? [14:50:10] yes [14:50:14] because it'll be less IO [14:50:20] hmm... [14:51:18] my plan was to finish up with this complete run first before I just have one host that scans the dumps.wikimedia.org/backup-index.html file automatically and uploads new dumps, which would then be rather slow by then. 
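To make the dump-mirroring workflow Hydriz describes above (14:10-14:17) concrete, here is a rough sketch of the two steps, not his actual script: mirror host, dump name, bucket and keys are all placeholders, and the upload call follows the S3-compatible pattern documented at http://archive.org/help/abouts3.txt.

    # 1) pull a dump from a public mirror into project storage (this is the step
    #    that was hammering GlusterFS)
    rsync -av rsync://MIRRORHOST/MODULE/aawiki/20120512/aawiki-20120512-pages-articles.xml.bz2 \
          /data/project/dumps/
    # 2) push it to the Internet Archive through their S3 interface
    curl --location \
         --header "authorization: LOW ACCESSKEY:SECRETKEY" \
         --upload-file /data/project/dumps/aawiki-20120512-pages-articles.xml.bz2 \
         http://s3.us.archive.org/BUCKETNAME/aawiki-20120512-pages-articles.xml.bz2

As Ryan_Lane and apergos point out, reading straight from the public datasets share already available in labs and uploading from there cuts out step 1, and with it the extra gluster I/O that was taking the cluster down.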
[14:51:32] looks like I didn't get to hold out till then [14:51:48] I have ~200 wikis left, out of say, 800+? [14:54:30] Ryan_Lane needs moar io [15:02:06] hmm, rsync downloading still shows more of output than input... [15:02:07] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 10.13, 7.91, 6.28 [15:11:50] ^demon|away: . [15:15:30] !log testswarm created tw-next instance using oineric something Ubuntu version. [15:15:32] Logged the message, Master [15:15:34] I am off [15:19:00] !log bots petrb: patching wm-bot [15:19:01] Logged the message, Master [15:22:37] yay [15:22:38] @whoami [15:22:39] You are admin identified by name .*@wikimedia/Petrb [15:23:24] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 8% free memory [15:23:44] PROBLEM Current Load is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:24:25] PROBLEM Current Users is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:24:58] @whoami [15:24:59] You are trusted identified by name .*@wikimedia/.* [15:25:06] PROBLEM Disk Space is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:25:29] !log bastion added Tanvir to the object a few hours ago [15:25:30] Logged the message, Master [15:25:42] lol? [15:25:44] PROBLEM Free ram is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:25:53] !log bots gave tanvir bots access, will be running interwiki bots [15:25:54] Logged the message, Master [15:26:00] oh [15:26:04] I think we're supposed to log when we give people access TBloemink [15:26:11] Object :D [15:26:12] access to what? [15:26:15] petan just reminded me ;) [15:26:27] TBloemink: to a project [15:26:27] !Thehelpfulone is !log bastion added Tanvir to the object a few hours ago [15:26:28] Key was added [15:26:42] to the object? 9_9 [15:26:46] hehe [15:26:47] you wrote it [15:26:54] PROBLEM Total Processes is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:27:21] I would quip you on bugzilla but people would not get it [15:27:22] don't know what I was thinking :P [15:27:34] PROBLEM dpkg-check is now: CRITICAL on tw-next i-0000027e output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:29:18] !log bastion also added Sven_Manguard a few days ago [15:29:20] Logged the message, Master [15:29:41] !log bots added Sven_Manguard a few days ago to the *project* :P [15:29:42] Logged the message, Master [15:38:34] PROBLEM Free ram is now: CRITICAL on aggregator-test2 i-0000024e output: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:43:32] PROBLEM Free ram is now: WARNING on aggregator-test2 i-0000024e output: Warning: 7% free memory [15:49:37] !log bots wmib: restarting wm-bot [15:49:39] Logged the message, Master [16:11:39] PROBLEM HTTP is now: CRITICAL on deployment-apache23 i-00000270 output: CRITICAL - Socket timeout after 10 seconds [16:16:30] PROBLEM HTTP is now: WARNING on deployment-apache23 i-00000270 output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.006 second response time [16:34:20] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 20% free memory [16:48:18] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 13% free memory [17:03:01] ^demon|away: when u here, can you create a repo for me [17:03:17] RECOVERY Free ram is now: OK on bots-3 i-000000e5 output: OK: 20% free memory [17:05:44] <^demon|away> Can you ask on the on-wiki request page please so there's a paper trail? [17:05:56] <^demon|away> I don't like doing this kind of thing from a random request on IRC. [17:06:43] ^demon|away: i already asked [17:06:55] I hoped it could be made today [17:07:00] just empty new repo [17:07:12] I suppose it's like matter of 10 seconds, or not [17:07:29] <^demon|away> Empty repo + new group + permissions config. [17:07:31] <^demon|away> > 10 seconds [17:07:41] ah [17:07:50] that should be automated somehow [17:07:58] creating a repo in github is quick [17:07:59] <^demon|away> You're telling me. [17:08:19] <^demon|away> Yeah well isn't github just so amazing? [17:08:23] <^demon|away> Too bad we're not all like github. [17:08:28] heh [17:08:38] * ^demon|away hates github [17:08:51] why [17:09:33] <^demon|away> I forever blame them for calling personal clones "forks." [17:09:46] aha [17:09:56] <^demon|away> Fork has an awful connotation, and I think it was a sin to use the terminology. [17:10:43] I don't really care, I just need a place to store my source at :D [17:10:57] PROBLEM HTTP is now: CRITICAL on deployment-apache22 i-0000026f output: CRITICAL - Socket timeout after 10 seconds [17:10:59] <^demon|away> Hard drives have been around for ages ;-) [17:11:07] not reliable enough [17:11:24] + hard to access from other spots [17:11:47] <^demon|away> Reliable enough for everyone until someone decided clouds are useful for things other than raining on you :p [17:12:16] :D [17:15:47] PROBLEM HTTP is now: WARNING on deployment-apache22 i-0000026f output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 0.019 second response time [17:34:42] hey Ryan_Lane - got a moment to talk about the setup in Labs for Roan's talk at the Berlin hackathon? [17:35:05] ah. right. we need to figure out a name for the project [17:35:15] <^demon> Names are hard :) [17:35:20] yep [17:35:24] the hardest part [17:35:37] <^demon> Especially names like git repos that are damn near impossible to change. [17:35:49] so are projects in labs [17:37:05] how about a tutorial project? [17:37:14] and we'll just reuse it for all tutorials [17:37:40] .... [17:37:46] sumanah: ? [17:38:04] Hi Ryan_Lane - thanks for pinging. 
[17:38:21] Ryan_Lane: yes, "tutorial" as the name of the Labs project sounds fine to me
[17:39:07] this implies that we'll be free to add any random Labs user to the access list for that Labs project and they won't be able to hurt anything really important :) which sounds great
[17:39:19] yes
[17:40:01] as long as we have no dump sync during the tutorials :-)
[17:40:02] * paravoid ducks
[17:40:29] :D
[17:40:32] paravoid: I do seriously want to ensure that all of this works, for the tutorial participants and for other infrastructure users
[17:40:41] well, then pray
[17:40:42] is this a substantial concern? if so, I ought to address it
[17:40:45] 05/22/2012 - 17:40:45 - Creating a project directory for tutorial
[17:40:46] 05/22/2012 - 17:40:45 - Creating a home directory for sumanah at /export/home/tutorial/sumanah
[17:40:46] 05/22/2012 - 17:40:45 - Creating a home directory for catrope at /export/home/tutorial/catrope
[17:40:54] or hope our hardware comes in on time
[17:41:00] Ryan_Lane: There is also meant to be a Plan B that Roan can switch to in case of Labs failure.
[17:41:26] He's working on both Plan A (use the Labs infrastructure so people can interactively & immediately try out the things he's teaching) and a Plan B.
[17:41:28] I don't foresee a failure
[17:41:46] 05/22/2012 - 17:41:46 - Updating keys for catrope at /export/home/tutorial/catrope
[17:41:46] 05/22/2012 - 17:41:46 - Updating keys for sumanah at /export/home/tutorial/sumanah
[17:42:54] Ryan_Lane: Roan is in transit at the moment and I know that he & you have talked far more about the logistics of what ought to happen than I have (with him or you). Danielle, deebki on IRC, has been collaborating with Roan on the plans for this tutorial, in case you need info about the requirements while Roan isn't around
[17:43:10] all I need to do is make the project
[17:43:11] PROBLEM Puppet freshness is now: CRITICAL on nova-ldap1 i-000000df output: Puppet has not run in last 20 hours
[17:43:12] * Ryan_Lane shrugs
[17:43:33] Really? Hmm, Roan made it sound like he was waiting on you for a lot more than that
[17:43:40] I'd hope not
[17:43:55] we were hoping to do this on real hardware with the user databases
[17:44:00] alas, still no hardware
[17:44:23] Can you give Roan & me the necessary access to add multiple Labs users to work within the "tutorial" Labs project?
[17:44:29] already did
[17:44:54] OK. So, in order to do that, I go to https://labsconsole.wikimedia.org/wiki/Special:NovaProject ?
[17:44:59] yep
[17:45:05] they only need member access
[17:45:12] * sumanah logs out, in again
[17:45:19] roan will need to adjust the sudo policy
[17:45:22] ^^ most annoying bug
[17:45:29] the logging in and out I mean
[17:45:33] you only need to give people access as members
[17:45:47] I'll gladly take patches to LdapAuthentication to fix the problem ;)
[17:46:51] Ryan_Lane: ok, so: "roan will need to adjust the sudo policy" - that is, he needs to log into the tutorial Labs project and adjust the sudo policy across that whole Labs project, right?
[17:47:06] well, to give himself sudo
[17:47:19] no one else needs it
[17:47:40] also, Ryan_Lane, when I look at https://labsconsole.wikimedia.org/wiki/Nova_Resource:Tutorial I see no mention of any instances for the "tutorial" Labs project. Roan or I can create an instance, right?
[17:47:48] yes
[17:47:51] it's an empty project
[17:49:11] * sumanah looks at the instance creation page, sees lots of options, decides to let Roan do it because he knows what options he'll want
[17:50:38] ok. Thanks Ryan_Lane - I'll email him about this
[17:54:39] <^demon> Ugh, http://code.google.com/p/gerrit/issues/detail?id=1052
[17:54:46] <^demon> I want to be able to change what HEAD points to.
[18:10:09] PROBLEM HTTP is now: CRITICAL on deployment-apache20 i-0000026c output: CRITICAL - Socket timeout after 10 seconds
[18:14:59] PROBLEM HTTP is now: WARNING on deployment-apache20 i-0000026c output: HTTP WARNING: HTTP/1.1 403 Forbidden - 366 bytes in 5.810 second response time
[18:49:49] PROBLEM Free ram is now: WARNING on ganglia-test2 i-00000250 output: Warning: 19% free memory
[20:33:37] RECOVERY Free ram is now: OK on aggregator-test i-0000024d output: OK: 89% free memory
[20:34:09] RECOVERY Free ram is now: OK on aggregator-test2 i-0000024e output: OK: 91% free memory
[20:34:49] RECOVERY Free ram is now: OK on ganglia-test2 i-00000250 output: OK: 84% free memory
[22:25:06] PROBLEM Puppet freshness is now: CRITICAL on mailman-01 i-00000235 output: Puppet has not run in last 20 hours
[22:32:33] !log testlabs test
[22:32:34] Logged the message, Master
[23:42:31] PROBLEM Puppet freshness is now: CRITICAL on localpuppet1 i-0000020b output: Puppet has not run in last 20 hours
[23:53:34] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5 output: Warning: 13% free memory
[23:59:43] !labs Access
[23:59:43] https://labsconsole.wikimedia.org/wiki/Access
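[Editor's note] The Gerrit issue ^demon links above is about changing what HEAD points to on a hosted repository, i.e. which branch new clones check out by default. At the plain-git level that is a one-line change on the server's bare repository; the sketch below only illustrates that underlying operation, with a hypothetical repository path and branch name, and does not reflect how Gerrit itself later exposed the feature.

import subprocess

# "Change what HEAD points to": on a bare repository, HEAD is a symbolic ref
# naming the branch that new clones check out by default.
# Path and branch name below are hypothetical.
BARE_REPO = "/srv/git/test/example-tool.git"
NEW_DEFAULT_BRANCH = "refs/heads/production"

subprocess.run(
    ["git", "symbolic-ref", "HEAD", NEW_DEFAULT_BRANCH],
    cwd=BARE_REPO,
    check=True,
)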