[00:00:04] maplebed: still around?
[00:00:10] * Reedy notes he's marked /away
[00:00:13] yup.
[00:00:39] Any suggestions what to do with a mysql instance with corrupt core tables that won't start? Reinstall mysql?
[00:00:49] do you care about the data?
[00:01:21] In theory, yes, I care about the data.
[00:01:39] ah, too bad. it's easy if you don't. more interesting since you do.
[00:01:45] lol
[00:01:50] point me at the instance; I might be able to poke it back to life.
[00:02:07] myisamchk -e /data/project/db/mysql
[00:02:15] myisamchk -r /data/project/db/mysql
[00:02:24] maplebed: It's deployment-sql
[00:02:37] andrewbogott: would you put me in the project?
[00:02:46] reedy: I already tried myisamchk *.MYI
[00:02:53] maplebed: yes, just a second
[00:04:11] I suspect if we have to loose users/grants, it's not a big deal
[00:04:16] 03/28/2012 - 00:04:16 - Creating a home directory for ben at /export/home/deployment-prep/ben
[00:05:18] 03/28/2012 - 00:05:18 - Updating keys for ben
[00:06:45] I presume that the db only has one user, mw
[00:07:19] indeed, plus any system accounts, root, and whatever the debian maintainer account is
[00:07:40] maplebed: Can you get in now?
[00:08:38] worked.
[00:09:46] 03:22 mutante: mysqld on deployment-sql is stopped - did not start it though after i heard petan is working on corrupted db's
[00:09:47] orly
[00:10:12] oh, that's... relevant.
[00:10:35] 15:41 labs-logs-bottie: petrb: getting sql server down I found a bunch of corrupted db's, rollback is necessary
[00:10:40] 16:16 labs-logs-bottie: petrb: it seems that corruption of db is worse than I expected, need to restore backup old few months
[00:11:15] when is that from? yesterday?
[00:11:43] 20th/21st
[00:11:47] so a week ago
[00:11:58] well, damn.
[00:12:01] maplebed, hear that?
[00:12:07] yup.
[00:12:12] makes me want to step away slowly.
[00:12:29] either that or nuke the thing and start over.
[00:12:32] slowly?
[00:12:32] run
[00:12:45] dude, I don' want it chasing me into my swift project space...
[00:12:58] it must never even know I was there.
[00:13:10] delete your bash history on your way out ;)
[00:13:24] I wonder if it's just the system database files... or the whole thing
[00:14:13] I am still pretty sure that the site is now more broken than when I started, though.
[00:14:23] It was only some files that would never load before, now nothing loads.
[00:14:40] Heh
[00:14:57] Of course, maybe breaking it totally was a necessary step in the healing process.
[00:16:01] btw, reedy, stupid question: where is that log?
[00:16:09] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log
[00:17:53] thx
[00:18:18] oh hey, I remember ryan saying something about a db getting borked because two mysqld processes were mounting the same dataset; this isn't that, is it?
[00:18:48] no idea.
[00:19:19] OK, I think I will log out and devote the rest of my evening to waiting for petan to log in.
[00:19:26] fucking lol
[00:19:51] maplebed: fuck replication, let's just have 2 servers use the same data files
[00:19:52] * Reedy grins
[00:19:55] I did get beta to work slightly better, before breaking it entirely.
[00:21:40] well goddamn, the site is still a little bit up. What the heck?
[00:21:45] maplebed, did you do that?
[00:22:02] yeah.
[00:22:10] I think I did.
[00:22:39] nice work!
[00:22:51] I've got to stop mysql again though
[00:22:57] uhoh
[00:23:06] it didn't create its socket.
[00:23:44] What magic did you work to get host.frm repaired?
[00:24:01] I didn't.
[00:24:13] ?
[00:24:29] ops h4x
[00:27:24] huh.
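The check-and-repair pass Reedy suggests above can be scripted. A minimal sketch of that sweep follows: it is not what was actually run on deployment-sql, the `myisamchk -e` / `-r` flags and the /data/project/db path come straight from the log, and everything else, including the assumption that mysqld is already stopped and the files are backed up, is illustrative.

```python
#!/usr/bin/env python
"""Sketch: sweep a MyISAM datadir with myisamchk, repairing only tables
that fail the extended check. Assumes mysqld is stopped and the datadir
has already been copied somewhere safe; paths are the ones from the log."""
import glob
import os
import subprocess

DATADIR = "/data/project/db"  # datadir mentioned in the log; adjust as needed

for myi in glob.glob(os.path.join(DATADIR, "*", "*.MYI")):
    table = os.path.splitext(myi)[0]
    # -e is the extended check suggested above
    check = subprocess.run(["myisamchk", "-e", table],
                           capture_output=True, text=True)
    if check.returncode != 0:
        print("corrupt: %s" % table)
        # -r attempts recovery; only run it once a copy of the files exists
        subprocess.run(["myisamchk", "-r", table], check=False)
```

Note that myisamchk only covers MyISAM tables (including the `mysql` system schema files like host.frm's siblings); InnoDB tables need a different recovery path.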
what's the deal with these:
[00:27:25] Mar 28 00:13:56 i-000000d0 kernel: [ 3219.632307] type=1505 audit(1332893636.103:120): operation="profile_replace" pid=12541 name="/usr/sbin/mysqld"
[00:27:29] in /var/log/messages
[00:27:48] anyway, I gotta bail... I don't think I acutally did anything to fix it.
[00:27:53] nor do I think it's acutally fixed.
[00:28:30] Personally.. I think I'd be tempted to backup the /mysql folder and reinstall mysql ontop and see how it goes from there...
[00:29:32] if you want to do something like that, back up /data/project/db and /var/log/mysql/ and try to create a new DB.
[00:30:11] maplebed: It is slightly better than how I left it. thanks.
[00:33:01] Yeah, I can't even get into bastion, so no chance of me trying to break stuf
[00:33:02] f
[09:04:45] !log deployment-prep petrb: fixing sql
[09:47:43] !log deployment-prep root: mysql partially fixed
[09:56:21] mutante: hi! i can't log into my instance from bastion. agent forwarding is enabled, and ssh-add -l shows the correct identity.
[09:56:28] any idea what the problem could be?
[09:56:36] the instance is wikidata-contenthandler-demo
[09:56:48] i'm getting: Permission denied (publickey)
[09:56:50] hi
[09:56:56] when did you create it?
[09:57:00] yesterday
[09:57:05] the key needs to be in labs
[09:57:13] @search ssh
[09:57:13] Results (found 3): pageant, ssh, socks-proxy,
[09:57:15] !ssh
[09:57:15] https://labsconsole.wikimedia.org/wiki/Help:SSH
[09:57:27] petan|wk: what do xyou mean by "be in labs"?
[09:57:37] there is a page
[09:57:37] i have uploaded it to the labsconsole
[09:57:40] ok
[09:57:43] and it works for logging into bastion
[09:57:52] and the agent forwarding works too
[09:57:54] there is another way to handle it
[09:58:05] you can create a private key on bastion
[09:58:12] then upload the public to labs
[09:58:29] so that you have separate key for your instance then to bastion
[09:58:33] that would work without forwardning
[09:58:39] at some point it is more secure and simpler
[09:58:40] but why would forwardning be the issue?
[09:58:46] I have no idea
[09:58:50] many people have troubles with that
[09:58:59] usually problem of their OS
[09:59:15] ubuntu 11.10
[09:59:22] i use agent forwarding a lot with the toolserver
[09:59:25] it may be some firewall or that
[09:59:28] hm...
[09:59:30] and ssh-agent -l on bastion shows the correct identity
[09:59:34] so, not a problem on my end
[09:59:36] no clue
[09:59:50] making the private key on bastion is surely better
[09:59:53] err, ssh-add -l shows it. anyway
[10:00:01] is it? surely?
[10:00:04] why?
[10:00:08] it's more secure
[10:00:17] also it doesn't need agent to be running
[10:00:24] which could be problem on some systems
[10:00:26] like windowns
[10:01:14] really? having a private key on a shared host, and entering the passphrase on that shared host, seems way less secure to me
[10:01:25] how is it shared
[10:01:31] bastion?
[10:01:33] only wmf ops have access to bastion as root
[10:01:37] you have access, i have acces...
[10:01:39] no one else does
[10:01:44] ok but I don't have root
[10:01:47] neither you
[10:01:57] so you can't read my private key
[10:02:00] and even if you could
[10:02:02] it's just for labs
[10:02:12] well, if nothing gets corrupted, everything is secure
[10:02:39] a shared system with a public ip is more exposed than my personal box.
[10:03:02] it basically boils down to the question whether my own laptop is more secure, or a box shared with many users
[10:03:14] I am not afraid of getting my key which is useable only for labs project stolen
[10:03:15] both have their specific exposure profiles
[10:03:41] while the key I use on my pc is definitely problem if stolen
[10:04:20] if someone hacked to bastion, they could read the forwarded key as well
[10:04:29] because it goes through that
[10:04:39] not the private key, no.
[10:04:47] they could hack into the session
[10:04:51] but not steal the key
[10:04:53] it does, bastion forward it to target host
[10:04:59] in order to auth
[10:05:10] oh right
[10:05:12] not it doesn't
[10:05:15] right
[10:05:21] that would make ssh pretty useless
[10:05:31] it forwards the challanvge and the response
[10:05:33] not the private key
[10:05:38] true
[10:05:46] which is why i think it's pretty secure :)
[10:06:27] anyway. i think a local private key is not more secure *per se*.
[10:06:40] it *may* be more secure, depending on circumstances. or less secure.
[10:07:10] basically - is it more likely that i get myself a key logger on my box, or that someone gets root on bastion?
[10:07:17] if someone steal your private key from bastion, it's fault of ops, not you
[10:07:22] both are hopefully quite unlikely. both are possible
[10:07:31] i don't care who's fault it is :)
[10:07:44] anyway, i'll try the local private key, just to see if it makes a difference.
[10:07:52] I would rather leave the responsibility to someone else
[10:08:04] difference is that it works :)
[10:08:10] that also leaves the control to someone else, i.e. you have to trust them :)
[10:08:27] they already have access to target host, so why not
[10:08:37] the private key is useable for labs vm's
[10:10:15] I mean, even if I didn't trust them (everybody knows I don't hehe) I can't do anything about it
[10:11:02] i mean you have to trust that none of them will screw up, ever.
[10:11:27] btw mutante are you here
[10:11:33] I really want to fix that puppet thing
[10:12:03] if we don't fix it today, I will likely disable it in nagios and try to get my patch merged
[10:12:31] 03/28/2012 - 10:12:30 - Updating keys for daniel
[10:12:39] andrewbogott: hi
[10:12:55] sql is up somewhat, regarding the performance troubles, no idea
[10:13:02] petan|wk: nope, no bananas
[10:13:05] doesn't work either
[10:13:12] 03/28/2012 - 10:13:11 - Updating keys for daniel
[10:13:15] Daniel_WMDE: try now
[10:13:16] seems like labsconsole doesn't put my public keys on the instance at all
[10:13:28] it takes up to 20 minutes for puppet to update key
[10:13:33] ah... why does that log entry show twice?
[10:13:36] once for every instance?
[10:13:42] no
[10:13:44] project
[10:13:48] ok
[10:13:51] still: no bananas
[10:13:58] how many projects you in
[10:14:01] daniel@bastion1:~$ ssh -i .ssh/wikidata_rsa wikidata-contenthandler-demo
[10:14:02] Permission denied (publickey).
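An aside on the forwarding point above: the bastion only relays the authentication challenge and response, never the private key, so tunnelling through it keeps the key on the laptop. A hypothetical sketch of that pattern using paramiko is below; the bastion hostname is an assumption, the username and instance name are the ones from the log, and nobody in the channel actually ran this.

```python
import paramiko

BASTION = "bastion.wmflabs.org"            # assumption: public bastion hostname
TARGET = "wikidata-contenthandler-demo"    # instance name from the log
USER = "daniel"                            # username from the log

# Connect to bastion using whatever identities the local ssh-agent offers.
bastion = paramiko.SSHClient()
bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
bastion.connect(BASTION, username=USER, allow_agent=True)

# Open a TCP channel from bastion to the instance's SSH port and authenticate
# over it; the agent answers the challenge locally, so the private key itself
# never touches the shared host.
channel = bastion.get_transport().open_channel(
    "direct-tcpip", (TARGET, 22), ("127.0.0.1", 0))
instance = paramiko.SSHClient()
instance.set_missing_host_key_policy(paramiko.AutoAddPolicy())
instance.connect(TARGET, username=USER, sock=channel, allow_agent=True)

stdin, stdout, stderr = instance.exec_command("hostname")
print(stdout.read().decode())
```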
[10:14:04] d
[10:14:05] two
[10:14:08] were you ever able to ssh there
[10:14:11] bastion and wikidata-dev
[10:14:16] only bastion
[10:14:19] the instance never worked
[10:14:28] aha
[10:14:29] the instance of wikidata-dev, i mean
[10:14:36] i guess i'll just delete it and try again
[10:14:36] in that case delete it and create again
[10:14:41] i guess it's corrupt in some way
[10:14:43] yea
[10:14:44] yes
[10:14:47] that happen a lot
[10:14:52] ugh
[10:14:55] ow nice
[10:14:56] puppet server is overloaded
[10:15:23] when it get broken during creation of server, it never recover
[10:16:11] @search out
[10:16:11] Results (found 2): console, git,
[10:16:14] !console
[10:16:14] in case you want to see what is happening on terminal of your vm, check console output
[10:16:26] you will see the error there
[10:16:39] it's improtant to check if it's ok before you try to login
[10:17:52] ssh: connect to host contenthandler-demo port 22: No route to host
[10:18:01] so... how would i do that, then?
[10:18:07] check if it's ok, without logging in?
[10:18:14] oh, console
[10:18:15] sorry
[10:18:49] uh...
[10:18:51] mountall: Event failed
[10:18:53] narf
[10:19:36] uh...
[10:19:44] i tried rebootin it. now the console output is empty?
[10:19:53] i guess it hasn'tr rebooted yet, then...
[10:20:51] petan|wk: i'm still getting "permission denied", using the forwarded key as well as using the local key.
[10:21:12] hm...
[10:21:24] petan|wk: which image type do you recommend? i tried oneiric
[10:22:18] sigh
[10:22:21] giving up for now
[13:20:39] petan|wk: still there?
[14:13:28] andrewbogott: yes
[14:15:00] petan|wk: Did you do a restore, or just verify that the server was up?
[14:15:26] I removed the binary logs
[14:15:38] there was data causing recovery fail
[14:16:28] so, I didn't do any recovery, the databases are still corrupt, but most of wikis we need are not affected
[14:16:35] commons and simple wiki are working
[14:16:41] Is there any hope for getting them uncorrupt?
[14:16:55] yes, there is a full backup, old few months
[14:16:58] The performance problem I'm seeing is mostly related to a single query that hangs
[14:17:07] I need to get a list of corrupt db's and recover them all
[14:17:18] the data are not corrupt, only some table files
[14:17:22] frm
[14:17:28] that doesn't contain data
[14:17:51] problem could be that the current tables were upgraded by mediawiki itself
[14:18:02] and recovering frm files using backup could break stuff
[14:18:20] so I want to make full backup of current db, which require around 90gb of space
[14:18:36] I would like to talk to Ryan before I do that
[14:18:42] because I don't know if we can have it
[14:19:21] You can definitely have it, at least temporarily.
[14:19:30] ok
[14:19:44] I will start right now
[14:19:44] Would corrupt .frm files cause the symptom I'm seeing?
[14:19:50] I don't know
[14:19:58] maybe some dba people would know that
[14:20:21] however it causes a lot of troubles some queries are failing
[14:20:28] especially alters
[14:20:36] ok
[14:20:56] Is there plenty of working space on the volume you're using?
[14:21:03] I don't even know if performance issues are caused by sql
[14:21:18] maybe it's a problem somewhere else, we need to check why it loads so slow
[14:21:24] servers aren't loaded even a bit
[14:21:26] !nagios
[14:21:26] http://nagios.wmflabs.org/nagios3
[14:21:32] that display load of all servers
[14:21:36] mostly 0
[14:21:39] Can you tell me where you're going to put the backup?
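The precaution petan describes, copying the whole live datadir before restoring any .frm files from the months-old backup, amounts to roughly the step sketched below. It is a sketch only: the destination is the one petan names just after this, the service name is an assumption, and the real copy needs the roughly 90 GB of free space estimated above.

```python
#!/usr/bin/env python
"""Sketch of 'copy everything before touching the .frm files'. The exact
commands used on deployment-sql are not in the log, so the paths and the
service name here are assumptions for illustration."""
import shutil
import subprocess

DATADIR = "/data/project/db"        # live datadir, per the log
BACKUP = "/data/project/olddb_1"    # destination petan logs further down
SERVICE = "mysql"                   # assumed init script name

# Stop mysqld first: a file-level copy of a running server is not a
# usable recovery point.
subprocess.run(["service", SERVICE, "stop"], check=True)
try:
    # copytree preserves the per-database directory layout; ownership and
    # permissions may still need fixing on the copy afterwards.
    shutil.copytree(DATADIR, BACKUP)
finally:
    subprocess.run(["service", SERVICE, "start"], check=True)
```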
[14:21:52] ./data/project/old_sql
[14:22:27] yes there is plenty of space I guess
[14:22:43] * andrewbogott nods
[14:23:54] I'll be here and there for the next hour or so, but ping me if you need anything.
[14:24:38] !log deployment-prep petrb: replicating current db to /data/project/olddb_1 and shutting down sql server
[14:33:22] !logs is http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs&query=$1
[14:33:23] You are not autorized to perform this, sorry
[14:33:26] meh
[14:33:40] my own bot refuses to listen
[14:47:26] !logs is http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs&query=$1
[14:47:26] Key was added!
[14:47:26] Hmm, who can create labs projects again?
[14:47:29] :S
[14:47:30] :D*
[14:47:30] I need two new projects
[14:47:46] RoanKattouw: mutante ssmollett or maybe andrewbogott :O
[14:48:06] mutante: I can haz labz projekt plox? ;)
[14:48:07] but I never saw ssmollett actually talk :))
[14:48:44] RoanKattouw: I can do it... do you have a labs account already?
[14:48:51] Oh, cool
[14:48:53] Yes, I do
[14:48:55] Catrope
[14:49:01] what do you want the projects named?
[14:49:04] mutante: Never mind, andrewbogott has got it
[14:49:17] articlefeedbackv5 and resourceloader2
[14:49:18] mutante: I need you anyway
[14:49:23] 'k
[14:51:12] did someone quieted nagios
[14:51:13] RoanKattouw: OK, you are a tiny god of two new tiny worlds.
[14:51:17] I don't see it talk
[14:51:17] Yay, thanks
[14:51:29] 03/28/2012 - 14:51:29 - Creating a project directory for articlefeedbackv5
[14:51:29] 03/28/2012 - 14:51:29 - Creating a home directory for catrope at /export/home/articlefeedbackv5/catrope
[14:51:29] 03/28/2012 - 14:51:29 - Creating a project directory for resourceloader2
[14:51:29] 03/28/2012 - 14:51:29 - Creating a home directory for catrope at /export/home/resourceloader2/catrope
[14:51:58] ah nice
[14:52:05] RoanKattouw: You're going to migrate them today?
[14:52:21] (or re-create rather)
[14:52:31] 03/28/2012 - 14:52:31 - Updating keys for catrope
[14:52:32] 03/28/2012 - 14:52:31 - Updating keys for catrope
[14:53:13] Krinkle: I am going to migrate rl2 now
[14:53:16] aftv5, maybe later
[14:54:05] yeah, them=rl2 wiki plural
[14:55:04] Oh, yeah
[14:58:52] this is a blah :o
[14:58:59] ok
[14:59:10] Damianz: :o
[14:59:14] your python suck
[14:59:16] :D
[14:59:30] you told me to insert time.sleep(1)
[14:59:41] but that made the nagios bot stop working
[14:59:50] totaly
[15:00:33] Damian made me sad
[15:04:00] >.>
[15:04:21] Damianz: I want it to work
[15:04:24] You said "I assume sleep(1)" I said "time.sleep(1)" which is how you sleep in python. never mentioned the implimentation :P
[15:04:36] ok, what does it do
[15:04:42] sleep for 1 second? 1 hour?
[15:04:51] I really want it to sleep, but not for ages
[15:05:26] Suspend execution for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal's catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the schedu
[15:05:54] ling of other activity in the system.
[15:06:09] ok, why it doesn't work then
[15:06:40] http://pastebin.mozilla.org/1540678
[15:06:43] this is a part of code
[15:06:51] when I uncomment it, it doesn't work a bit
[15:07:02] o.0
[15:07:13] Why are you sleeping in an inotify callback?
[15:07:36] Just push all the shit into a list then loop over it in another thread or modify your irc code to not flood.
[15:07:42] Is it possible you just failed to import time?
[15:08:08] andrewbogott: talking to?
[15:08:10] me?
[15:08:16] yeah
[15:08:23] regarding deployment?
[15:08:24] I'd assume by the huge exception he'd get back he'd of fixed thjat.
[15:08:43] regarding time.sleep(1)
[15:08:46] Damianz: that's what I try to do
[15:08:52] Damianz: true
[15:09:01] modifiing the bot to not flood is inserting sleep for 1 second after it send a message
[15:09:18] that's what I did
[15:09:19] I hope
[15:09:39] I never worked in python I just hope it's not so hard as brain buck
[15:09:40] fuck
[15:10:04] Do you have "import time" somewhere in the file before that line?
[15:10:07] Prefrably at the top.
[15:10:09] surely not
[15:10:29] Freebsd takes sooo long to install :(
[15:11:12] heh
[15:11:30] I don't like packages of free bsd
[15:11:50] debian is looking like it run latest version of sw, when you compare it to bsd
[15:11:52] The linux zfs port isn't majorly supported IIRC.
[15:12:06] lot of stuff is not
[15:13:49] I'd rather not have to compile my filesystem for updates when dealing with tbs of data though ;P
[15:14:32] this is a flood test
[15:14:32] this is a flood test
[15:14:37] this is a flood test
[15:14:39] this is a flood test
[15:14:39] this is a flood test
[15:14:40] this is a flood test
[15:14:41] this is a flood test
[15:14:48] does it work?
[15:14:55] I can't switch my screen fast enough :D
[15:15:03] I mean was there a delay between message
[15:15:53] this is a flood test
[15:15:54] this is a flood test
[15:15:58] ok
[15:16:00] it doesn't
[15:16:14] Damianz: I inserted import line
[15:16:24] it sends messages now, but doesn't sleep
[15:16:36] er
[15:17:23] why is python so complicated, why do people use it :O
[15:17:39] it would be so much simpler if that bot was written in sane language...
[15:18:03] petan: I think you can ask freenode staff to mark your bot as allowed to flood somehow
[15:18:11] I think that is used for CIA bots
[15:18:19] Python is an awesome language
[15:18:28] hashar: I would rather not flood :P
[15:18:36] look
[15:18:39] !ping
[15:18:39] !ping
[15:18:39] pong
[15:18:39] !ping
[15:18:40] !ping
[15:18:40] pong
[15:18:42] pong
[15:18:42] !ping
[15:18:43] pong
[15:18:44] pong
[15:18:44] !ping
[15:18:45] pong
[15:18:45] hahaha
[15:18:47] this doesn't flood
[15:18:48] !ping
[15:18:48] pong
[15:18:50] !ping
[15:18:51] pong
[15:18:53] \o/
[15:18:54] it's written in sane language
[15:18:58] :D
[15:19:02] 1 sec delay
[15:19:09] It probably will flood if I send it a shed load of stuff
[15:19:16] nah
[15:19:29] this bot could have 100000 messages in queue, but it wouldn't get killed
[15:19:32] from network
[15:19:50] I was told by freenode staff that message per second is ok
[15:20:48] there is a chance you would get killed though :O
[15:21:49] Damianz: just write a python code which sleep for a second after sending a message
[15:22:00] some better than I did
[15:22:19] isn't it microsecond
[15:22:32] that would explain if it's so fast
[15:31:34] this is a flood test
[15:31:37] :D
[15:31:41] lmao
[15:31:47] I think I know what's problem
[15:31:53] that thing is threaded!
[15:32:03] every single message is a thread
[15:32:33] no wonder python eats more ram for simple echo "Hello world" than rest of OS
[15:32:44] meh
[15:36:17] .
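What Damianz suggests above, push messages onto a queue from the inotify callback and drain them from a single worker thread that does the sleeping, looks roughly like the sketch below. The real bot's code is only behind the pastebin link, so all names here are made up for illustration; the point is that the one-second pacing lives in the sender thread, not in the callback (and not once per message thread).

```python
import queue
import threading
import time

def send_to_irc(target, text):
    # stand-in for the bot's real socket write (hypothetical name)
    print("PRIVMSG %s :%s" % (target, text))

outgoing = queue.Queue()

def sender():
    """Single worker: drain the queue and pace output at one message per
    second so the IRC server doesn't disconnect the bot for flooding."""
    while True:
        target, text = outgoing.get()
        send_to_irc(target, text)
        time.sleep(1)            # the pacing lives here, not in the callback
        outgoing.task_done()

def on_log_line(channel, line):
    """inotify callback: only enqueue, never sleep or write to the socket
    here, otherwise every event handler blocks for a second."""
    outgoing.put((channel, line))

threading.Thread(target=sender, daemon=True).start()

if __name__ == "__main__":
    for i in range(5):
        on_log_line("#wikimedia-labs", "this is a flood test %d" % i)
    outgoing.join()  # wait until everything queued has been sent
```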
[15:53:07] PROBLEM Puppet freshness is now: CRITICAL on shop-analytics-main1 shop-analytics-main1 output: Puppet has not run in the last 10 hours
[15:53:07] PROBLEM Puppet freshness is now: CRITICAL on swift-fe1 swift-fe1 output: Puppet has not run in the last 10 hours
[15:56:36] mutante: pong me when u back
[16:03:06] Well if you're using the threaded callback then yes it's threaded.
[16:18:19] petan: he probably won't be online for several (6 to 8) hours.
[16:45:32] 03/28/2012 - 16:45:32 - Creating a home directory for krinkle at /export/home/resourceloader2/krinkle
[16:46:32] 03/28/2012 - 16:46:32 - Updating keys for krinkle
[16:49:04] hi all
[16:50:42] hi koolhead17
[16:50:52] IWorld, hello there
[16:55:34] petan: How's the db restore going?
[17:00:26] andrewbogott: do you know when the read-only access to replicated databases will be available?
[17:00:38] No idea, sorry
[17:03:53] PROBLEM dpkg-check is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:04:33] PROBLEM Current Load is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:05:18] PROBLEM Current Users is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:06:03] PROBLEM Disk Space is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:06:43] PROBLEM Free ram is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:08:13] PROBLEM Total Processes is now: CRITICAL on resourceloader2-apache resourceloader2-apache output: CHECK_NRPE: Error - Could not complete SSL handshake.
[17:08:38] RoanKattouw: poke puppet
[17:08:43] Krinkle: OK
[17:10:13] RECOVERY Current Users is now: OK on resourceloader2-apache resourceloader2-apache output: USERS OK - 1 users currently logged in
[17:10:57] RoanKattouw: even though puppet was empty in the labs console (no commands) there are some base things that it otherwise complains about
[17:11:03] RECOVERY Disk Space is now: OK on resourceloader2-apache resourceloader2-apache output: DISK OK
[17:11:08] SSL doesn't even have to be installed, the error is a bit odd
[17:11:14] weird
[17:11:17] I ran puppet and it works now
[17:11:43] RECOVERY Free ram is now: OK on resourceloader2-apache resourceloader2-apache output: OK: 90% free memory
[17:11:51] yep
[17:12:02] it doesn't install SSL, but it does "something" that makes it OK
[17:12:16] It sets up gmond apparently
[17:13:13] RECOVERY Total Processes is now: OK on resourceloader2-apache resourceloader2-apache output: PROCS OK: 91 processes
[17:13:53] RECOVERY dpkg-check is now: OK on resourceloader2-apache resourceloader2-apache output: All packages OK
[17:14:33] RECOVERY Current Load is now: OK on resourceloader2-apache resourceloader2-apache output: OK - load average: 0.04, 0.37, 0.46
[17:15:49] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Eloquence link https://www.mediawiki.org/w/index.php?diff=516864 edit summary: linkfix
[17:27:03] RECOVERY Disk Space is now: OK on wikistream-1 wikistream-1 output: DISK OK
[17:35:03] PROBLEM Disk Space is now: WARNING on wikistream-1 wikistream-1 output: DISK WARNING - free space: / 72 MB (5% inode=47%):
[21:28:12] PROBLEM Puppet freshness is now: CRITICAL on aggregator1 aggregator1 output: Puppet has not run in the last 10 hours
[21:28:12] PROBLEM Puppet freshness is now: CRITICAL on analytics analytics output: Puppet has not run in the last 10 hours
[21:28:12] PROBLEM Puppet freshness is now: CRITICAL on asher1 asher1 output: Puppet has not run in the last 10 hours
[21:28:12] PROBLEM Puppet freshness is now: CRITICAL on backport backport output: Puppet has not run in the last 10 hours
[21:28:12] PROBLEM Puppet freshness is now: CRITICAL on bastion-restricted1 bastion-restricted1 output: Puppet has not run in the last 10 hours
[21:28:20] lol
[21:28:28] That check is annoying.
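The CHECK_NRPE "Could not complete SSL handshake" alerts above appeared on the freshly created resourceloader2-apache instance and cleared after Roan's first puppet run, most likely because that run configures the monitoring agent on the host. A quick way to see which instances still answer that way is to call the stock check_nrpe plugin directly from the monitoring host; the plugin path and host list in this hypothetical helper are assumptions.

```python
#!/usr/bin/env python
"""Probe NRPE on a few labs instances with the check_nrpe plugin and report
which ones still fail the SSL handshake (i.e. are not set up yet). The plugin
path and the host names below are assumptions for illustration."""
import subprocess

CHECK_NRPE = "/usr/lib/nagios/plugins/check_nrpe"
HOSTS = ["resourceloader2-apache", "wikistream-1"]

for host in HOSTS:
    result = subprocess.run([CHECK_NRPE, "-H", host],
                            capture_output=True, text=True)
    if result.returncode == 0:
        # with no -c argument, a healthy NRPE daemon reports its version
        print("%s: ok (%s)" % (host, result.stdout.strip()))
    else:
        print("%s: %s" % (host, (result.stdout or result.stderr).strip()))
```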