[07:03:27] Damianz mhm
[07:03:34] let me check it
[07:07:07] Damianz how you copy it?
[09:40:46] Damianz: Hello! Coming back to the read-only issue; your "clearing" did not help - everything is still there - including IO errors ... what now?
[10:05:53] Hello all! Any labs admin here?
[10:08:30] what do you need?
[10:12:31] We have an IO error / read-only issue on "/data/project/DrTrigonBot" - Damianz and Ryan_Lane suggested that I delete it and re-checkout, but I CANNOT DELETE IT because of the IO error; "rm -rf" has errors too... can you help?
[10:27:10] paravoid: ^^^
[10:28:15] not really...
[10:29:47] ...a pity... but thanks anyway! ;))
[10:43:32] Damianz I am testing io speed on bsql I have interesting results
[10:43:43] write speed to /db is 203mb/s
[10:43:52] problem is with flushing of cache
[10:45:23] Damianz: Are you here? Ready for a second round? ;)
[10:45:35] DrTrigon of what
[10:45:40] he's sleeping
[10:45:59] he typically wakes up like 4 pm
[10:46:59] petan: ...what I thought... still the IO error / read-only issue... but I think I have to shut the bot down and wait for next weekend...
[10:48:27] ============================ BOS =============================
[10:48:31] root@bots-bsql01:/db/test# time dd if=/dev/zero of=/db/test/xx count=200000 && time sync
[10:48:31] 200000+0 records in
[10:48:31] 200000+0 records out
[10:48:31] 102400000 bytes (102 MB) copied, 0.455191 s, 225 MB/s
[10:48:31] real 0m0.457s
[10:48:31] user 0m0.040s
[10:48:33] sys 0m0.416s
[10:48:35] real 1m53.422s
[10:48:37] user 0m0.000s
[10:48:40] sys 0m0.008s
[10:48:42] root@bots-bsql01:/db/test# rm xx
[10:48:44] root@bots-bsql01:/db/test#
[10:48:46] root@bots-bsql01:/db/test# time dd if=/dev/zero of=/db/test/xx count=200000 && time sync
[10:48:49] 200000+0 records in
[10:48:51] 200000+0 records out
[10:48:53] 102400000 bytes (102 MB) copied, 0.466653 s, 219 MB/s
[10:48:55] real 0m0.468s
[10:48:57] user 0m0.056s
[10:49:00] sys 0m0.412s
[10:49:02] real 0m35.886s
[10:49:04] user 0m0.000s
[10:49:06] sys 0m0.004s
[10:49:08] root@bots-bsql01:/db/test#
[10:49:10] root@bots-bsql01:/db/test# time dd if=/dev/zero of=/root/xx count=200000 && time sync
[10:49:13] 200000+0 records in
[10:49:15] 200000+0 records out
[10:49:17] 102400000 bytes (102 MB) copied, 0.430658 s, 238 MB/s
[10:49:19] real 0m0.432s
[10:49:21] user 0m0.028s
[10:49:24] sys 0m0.400s
[10:49:26] real 0m46.059s
[10:49:28] user 0m0.008s
[10:49:30] sys 0m0.000s
[10:49:34] ============================ EOS =============================
[10:49:42] Damianz ^^
[10:49:52] this problem is not because of me
[10:50:00] it runs same slow on ext4
[10:50:05] it's Ryan's prob
[10:50:13] @notify Damianz
[10:50:13] This user is now online in #wikimedia-tech so I will let you know when they show some activity (talk etc)
[10:51:14] petan: ...or can you help with the
[10:51:28] IO error / read-only issue ...??
[10:51:33] maybe
[10:52:46] So Ryan told me to just delete "/data/project/DrTrigonBot" and re-checkout because of that issue - but because of that issue I CANNOT delete it by "rm -rf" ... so what to do?
can you delete and re-create this dir?
[10:53:13] which instance
[10:54:24] bots4
[10:55:34] so you want me to delete it?
[10:56:29] DrTrigon
[10:56:33] I do not know what would be THE PROPER SOLUTION... Apparently Damianz tried something yesterday:
[10:56:36] (00:39:01) Damianz: and by cleared I mean I just moved it and made a new one... since that just changes the meta data on the parent dir and doesn't need to go anywhere near the files in it... yay filesystems
[10:56:50] ok proper would be to fix the brain split
[10:56:58] which needs to be done by Ryan
[10:57:41] but that caused no change at all - everything is still the same (files still there, along with IO errors) ... so I do not know what is the right solution, but I can re-create the dir if needed and would like to proceed somehow - you can say how!
[10:58:06] ok we can just delete it now
[10:58:06] Ryan suggested he would like to avoid doing the brain split fix...
[10:58:19] no wonder he's lazy :P
[10:58:50] aaa - very well ... ;)
[10:59:47] so what do you prefer
[11:00:13] ...so if deleting solved the issue... and you can do it - please go ahead! Is there some mechanism to avoid such thing for the future? E.g. backups??
[11:00:21] !log bots root: bots removing DrTrigonBot per request on irc
[11:00:23] Logged the message, Master
[11:01:34] ...so I can just "mkdir" a new one?
[11:01:39] sec
[11:02:07] (sorry)
[11:03:46] !log bots root: rebooting bots-4 to fix gluster reconnect
[11:03:48] Logged the message, Master
[11:04:05] DrTrigon there is some process running of you
[11:04:19] script_wui
[11:04:23] can I reboot?
[11:04:28] killed
[11:04:56] !log bots root: aborted reboot
[11:04:58] Logged the message, Master
[11:05:00] !log bots root: folder removed
[11:05:02] Logged the message, Master
[11:05:10] DrTrigon somehow it mysteriously fixed gluster
[11:05:13] so I don't need reboot
[11:05:17] your folder is gone
[11:05:27] Damianz did a reboot yesterday
[11:05:43] ok
[11:05:46] ...so I have to "mkdir" a new one? Is that correct?
[11:05:55] yes you don't really have to
[11:05:59] that's up to you :P
[11:06:02] but you can create it now
[11:06:56] ;))) I see...
[11:06:56] for some reason DrTrigonBot-old exist now
[11:07:04] no idea what it is
[11:07:12] but it's broken just as the previous one
[11:07:42] who knows maybe it's not
[11:07:47] but gluster log is full of these
[11:08:20] yes is it
[11:08:21] it is
[11:08:26] maybe it's the previous folder?
[11:09:04] might be the one Damianz mentioned ... if you ask me; you can remove it too
[11:09:20] I don't really care - if you want me to do that
[11:09:28] but it might require gluster restart once more
[11:10:09] I do not know either - I will not take care of it, I will work in the new one only...
[11:10:19] ok
[11:10:31] "mkdir: cannot create directory `DrTrigonBot': Permission denied"
[11:10:51] use sudo
[11:11:00] ...of course! thanks! ;))
[11:13:13] so I did a "sudo chown -hR drtrigon DrTrigonBot" too...
[11:13:46] petan: What about the future? Why did this happen? Will there be a mechanism preventing this from now on?
[11:14:20] DrTrigon that's a very good question
[11:14:32] it happened because glusterfs suck
[11:14:32] we have a bug about that
[11:14:52] we can either get some patch to fix this or replace gluster with other solution
[11:15:05] that's all I know so far
[11:15:15] it's Ryan's problem ATM
[11:15:45] I can't really help much here, I think gluster is a great thing but it needs a lot of work for it to work perfectly
[11:15:48] petan: ... and he's lazy...
[11:15:52] heh
[11:15:55] not so lazy as me
[11:15:58] ;))))
[11:18:32] seriously; It gives me a bad feeling somehow ... It would be nice if there could at least be some workaround (e.g. backup - I don't know...) meanwhile ...
[11:24:22] DrTrigon workaround is not use gluster - you don't need to
[11:24:37] DrTrigon you are supposed to use stable instance like bots-bnr1
[11:24:43] there is stable storage /mnt/share
[11:24:51] which is local only - but works
[11:25:02] or you can use /mnt/secure which is global and works
[11:25:08] but has worse io
[11:25:39] there is always some workaround :P
[11:25:52] @labs-resolve bnr
[11:25:52] I don't know this instance - aren't you are looking for: I-0000056f (bots-bnr1), I-00000629 (bots-bnr2),
[11:26:04] DrTrigon these instances work best
[11:26:27] petan: http://dpaste.de/0pZHk/raw/
[11:26:40] legoktm ?
[11:26:45] how can i delete those rows?
[11:27:02] oh
[11:27:10] right, did it delete some?
[11:27:20] no
[11:27:33] ...so not use "/data/project"? And go for "/mnt/secure"? For me it does not matter - I am just doing what is told to me. I was told to use bots4 and I thought the safe storage there is "/data/projects". So I am fine to change but I have to know this...
[11:27:34] if yes, you can try again OR delete from persondata where value="" limit 1000;
[11:27:40] that will delete first 1000 rows
[11:27:44] right
[11:27:45] you can just run this in a loop
[11:27:47] :P
[11:28:02] or - try to restart the transaction, you can even run it on localhost
[11:28:10] legoktm just ssh to bots-bsql01, type mysql
[11:28:13] run it there
[11:28:17] oh
[11:28:18] ok
[11:28:19] so that you won't get connection problem
[11:29:02] legoktm@bots-bnr1:~$ mysql legoktm
[11:29:03] ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
[11:29:07] oh wait
[11:29:08] ugh
[11:29:10] wrong instance
[11:29:18] >.<
[11:29:21] lol
[11:29:26] :)
[11:30:28] petan: ^^^
[11:30:37] DrTrigon yes?
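The "delete ... limit 1000" in a loop advice above can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in sqlite3 with a throwaway in-memory table instead of the MySQL server on bots-bsql01, and because stock SQLite does not accept DELETE ... LIMIT, the batch is picked with a rowid subquery. The helper names delete_in_batches and demo are made up for this sketch; only the table name persondata and the value="" condition come from the log.

```python
import sqlite3


def delete_in_batches(conn, batch_size=1000):
    """Delete rows with an empty value a batch at a time; return rows deleted."""
    total = 0
    while True:
        # SQLite stand-in (assumption) for MySQL's
        #   DELETE FROM persondata WHERE value="" LIMIT 1000;
        cur = conn.execute(
            "DELETE FROM persondata WHERE rowid IN "
            "(SELECT rowid FROM persondata WHERE value = '' LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total


def demo():
    """Build a small in-memory table and run the batched delete on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persondata (id INTEGER PRIMARY KEY, value TEXT)")
    rows = [("",)] * 2500 + [("x",)] * 10
    conn.executemany("INSERT INTO persondata (value) VALUES (?)", rows)
    conn.commit()
    deleted = delete_in_batches(conn, batch_size=1000)
    remaining = conn.execute("SELECT COUNT(*) FROM persondata").fetchone()[0]
    return deleted, remaining
```

Committing after every batch keeps each transaction short, which is presumably the point of the suggestion: a dropped connection then costs one batch rather than the whole delete.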
[11:30:50] (12:27:33) DrTrigon: ...so not use "/data/project"? And go for "/mnt/secure"? For me it does not matter - I am just doing what is told to me. I was told to use bots4 and I thought the safe storage there is "/data/projects". So I am fine to change but I have to know this...
[11:31:00] DrTrigon well, you shouldn't really use bots-4 - that's a testing instance
[11:31:08] DrTrigon use bots-bnr2
[11:31:31] but still "/data/project" ?
[11:31:40] I recommend you to use local storage /mnt/share given that you WILL not move your bot to another instance, otherwise use global storage
[11:31:50] /data/project will suck everywhere
[11:31:57] :|
[11:32:26] btw, are you using sql?
[11:32:57] so I have to migrate everything to bots-nr2? then use /mnt/share there?
[11:33:11] no sql for me
[11:33:16] bots-bnr2
[11:33:23] how to install needed software? will that be done from your side? (on bots-nr2)
[11:33:26] that's my recommendation I know it suck but we are working on improvement
[11:33:33] bots-Bnr2
[11:33:35] not bots-nr2
[11:33:45] that b is important :P
[11:33:57] I just do not want to move everything in a few months again...
[11:34:07] installation of software should be done using puppet, but you can also just ask me
[11:34:10] ok bots-bnr2
[11:34:22] DrTrigon on bots-bnr2 you shouldn't, that's going to be long term installation
[11:34:33] is there the same software as on bots4 currently? web access too?
[11:34:56] DrTrigon yes there is web access, there is all default software except for what users installed themselves on bots-4
[11:35:01] because on bots-4 everyone has root
[11:35:17] yes that is one good but dangerous thing I know...
[11:35:33] that's a reason why all these instances are being rebooted all time :P
[11:35:51] bots-bnr1 for example has a nice uptime :P
[11:36:01] bnr2 was created yesterday
[11:36:15] so I will stick with bots4 for now (no time at the moment) but next what I will do is to migrate to bots-bnr2 and /mnt/share - does that sound reasonable?
[11:36:33] whats the difference between 1 and 2?
[11:37:02] they are same
[11:37:10] difference is that bnr1 is currently hosting ton of bots
[11:37:16] so it's more loaded while bnr2 is empty
[11:38:05] that will change for sure until I have time to migrate... ;)) may be theres even a bot-bnr42 till then... ;)))
[11:38:56] lol
[11:39:18] how complicated is it to migrate it?
[11:39:24] you can just scp everything?
[11:39:31] and restart it on new box
[11:40:22] I am looking at the moment...
[11:40:46] how to scp between bots4 and bnr2?
[11:41:26] mhm... you can create temporary ssh key for that
[11:41:33] or you can do things through bastion
[11:41:49] or you can just copy everything to /data/project and copy it back from there to local storage
[11:42:01] ok
[11:49:10] petan: so I think I got it up now again - so I'll try migrating to bots-bnr2 ... (stay tuned ;)
[11:55:54] petan: Can you check if all lua libraries I installed to bots4 are available on bots-bnr2? I need them in order to be able to compile a python module...
[11:56:08] k
[11:57:18] petan: Does this migration make sense at all since the public_html still is on /data/project ...?
[11:57:30] !log account-creation-assistance Operations log, stardate 45650.54: -application is now properly reporting to Icinga, but -database isn't. Giving -database the (re)boot to see if that helps.
[11:57:31] Logged the message, Master
[11:59:37] DrTrigon I don't know... but you can always make some kind of autobackup of these html things
[11:59:51] how?
[12:00:22] cron a script which will cp -r public_html every day to /mnt/share
[12:00:30] I do same
[12:00:50] !log bots petrb: install liblua5 to bnr2
[12:00:52] Logged the message, Master
[12:00:56] yeah... but then I can do this for everything on /data/project and forget about the migration at all...
[12:01:17] bots-4 is quite bad anyway...
[12:01:22] it's a testing instance
[12:01:57] but this is up to you...
I don't really know how to fix this, apache doesn't see to secure - these folders are not readable even by root
[12:02:19] so unless we recover old nfs server we had for this purpose we need to use gluster
[12:02:31] but what about a backup??
[12:02:39] (on your side)
[12:02:52] what kind of backup?
[12:03:08] my backup is just backing up stuff in public_html, but my bots run on local storage
[12:03:16] so in worst case just public html stuff breaks
[12:03:17] backup /data/project in order to be able to restore it if it crashes...
[12:03:34] that fs is too huge, I have no resources to make a full backup of that
[12:03:35] that would solve all problems here
[12:03:50] !log account-creation-assistance Operations log, supplemental: This was a triumph; I'm making a note here, huge success, and that turning it off and on again still works!
[12:03:51] Logged the message, Master
[12:04:03] Ryan recommends everyone to use /data/project not me :/ I can't make backups of that, you would need to ask Ryan for that
[12:04:06] backup just on request - e.g. my folder? ;))
[12:04:14] mhm
[12:04:19] can you create a ticket for that?
[12:04:26] we might create some backup system...
[12:04:55] I mean in the end we need a full backup anyway... It has to be reliable, it is supposed to replace TS, isn't it???
[12:05:04] yes I know
[12:05:16] so....? ;)
[12:05:21] I will consult that with ryan, who knows what the options are
[12:05:33] but you are right, there should be a backup
[12:06:13] in this moment it's more like - everyone is responsible for their own files and their backups. that should change
[12:06:56] ...as usual.. :)
[12:07:25] consulting with ryan sounds good! :)) meanwhile I will stick with bots4 until I have more time (sorry!) and then contact you again in order to get an update on the situation - is that ok with you?
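The daily public_html copy suggested in this exchange (a cron job doing cp -r to /mnt/share) could look roughly like the sketch below. It is not the script anyone in the channel actually runs: the function names, the dated directory layout and the date used in demo are assumptions for illustration, and demo only touches throwaway temp directories rather than /data/project or /mnt/share.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def backup_public_html(src, backup_root, day=None):
    """Copy src into backup_root/public_html-YYYY-MM-DD and return that path."""
    day = day or date.today()
    dest = Path(backup_root) / f"public_html-{day.isoformat()}"
    # dirs_exist_ok lets a second run on the same day refresh the copy
    # instead of failing (needs Python 3.8 or newer).
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


def demo():
    """Exercise the copy on temp directories instead of real storage."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "public_html"
        src.mkdir()
        (src / "index.html").write_text("hello")
        # The date is arbitrary; it only makes the directory name predictable.
        dest = backup_public_html(src, Path(tmp) / "share", day=date(2013, 1, 1))
        return dest.name, (dest / "index.html").read_text()
```

In real use one would call backup_public_html with the actual public_html directory and a folder under /mnt/share, and schedule the script from the user's crontab once a day. Keeping one dated directory per day is a choice of this sketch; the cp -r described in the log may simply overwrite a single copy.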
[12:08:04] sure
[12:08:15] but I can't guarantee you any stability on bots-4
[12:08:42] mmmmm
[12:10:45] I meant I am aware that labs and my stuff there is more beta stage - it is not a very important bot and as I have seen there were some issues but not a lot thus I think it should be fine for a while again - I hope... ;)) If there is an issue again I will re-decide... THANKS A LOT FOR YOUR TIME! greetings!
[12:14:39] Have a nice day!
[15:02:49] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Design was modified, changed by Giftpflanze link https://www.mediawiki.org/w/index.php?diff=657568 edit summary: [+0] /* Implementation */ typo?
[16:55:18] @labs-resolve bnr2
[16:55:19] I don't know this instance - aren't you are looking for: I-00000629 (bots-bnr2),
[17:04:33] paravoid ping
[17:04:45] can't ssh to bots-bnr1
[17:04:56] no route to host
[17:06:41] is labs normally so slow for downloading stuff there?
[17:06:45] 1,445,968 17.7K/s eta 13m 43s
[17:06:49] for 11 mb file
[17:10:59] aude labs network seems fucked to me
[17:11:06] some instances don't even see to others
[17:11:21] !ic
[17:11:21] http://208.80.153.210/icinga http://icinga.wmflabs.org/
[17:12:33] :(
[17:13:05] * aude shall wait for ryan tomorrow or whenever
[17:14:09] seriously Ryan should give some permissions to volunteer sysadmins so that there is someone who can fix stuff over weekends...
[17:14:14] yeah
[17:14:43] labs are for community devs and community devs work mostly over weekends - when labs are broken D:
[17:14:44] meh
[17:14:51] exactly
[17:15:08] i only have time on the weekends to do labs stuff and half the time have problems :/
[17:15:33] * aude waits 5 more min for my file
[17:19:54] * Damianz pats aude
[17:27:31] Damianz I finished setting up open grid cluster :D :d
[17:27:42] Damianz when u ssh to bots-gs you can use qsub to submit jobs
[17:27:53] but we have only 1 working node atm because bnr1 doesn't work
[17:27:58] some connection issue - see nagios
[17:28:27] if u want to add yourself as admin, use sudo su, then sudo -u gseadmin
[17:28:41] sudo -u gseadmin doesn't work from normal user
[17:29:57] wow qrsh is fun
[17:30:04] it bypass bastion somehow
[17:30:39] open grid uses host keys iirc
[17:33:25] mhm
[17:33:32] Damianz you can test it :D
[17:33:40] Damianz but idk if you have some bot for that
[17:35:02] possibly
[17:35:20] still waiting for cbng to import :P
[17:35:48] LOL
[17:35:53] Damianz I was doing some IO tests
[17:35:57] see scrollback
[17:36:02] same speed on ext4
[17:36:05] so this isn't btrfs
[17:36:07] that is slow
[17:36:11] it's labs IO
[17:36:17] writing to disk is fucking fast
[17:36:27] flushing to physical storage is slow
[17:37:59] meh
[17:38:17] this was much quicker importing to mysql on ext4
[17:47:54] maybe this problem is happening now - you wouldn't be so fast in this moment?
[17:48:04] maybe it's problem related only to that one server?
[17:49:45] If it's the specific server then it's more the host it's on... don't really have time to dig into it...
trying to get documentation done for tomorrow
[18:17:29] addrawr ping
[18:17:48] addrawr I am going out now, please note that bnr1 seems to be down or whatever
[18:17:57] if you want to restart your bots, use qsub @ bots-gs
[18:52:56] Krenair: *ping*
[21:03:26] help neededopenid-wiki.instance-proxy.wmflabs.org/wiki 502 Bad Gateway even after rebooting the instance
[21:03:30] help needed openid-wiki.instance-proxy.wmflabs.org/wiki 502 Bad Gateway even after rebooting the instance
[21:06:34] wfm
[21:56:18] petan: I noticed :/
[21:56:22] just got back from my weekend
[22:54:28] petan / petan|wk, what instances does qsub @ bots-gs go to?
[22:56:21] !log bots addshore: restarted bnr1
[22:57:53] Logged the message, Master
[23:27:58] !log bots addshore: rebooted bsql01 to solve lock issue and also to act as the previously needed reboot
[23:28:01] Logged the message, Master
[23:36:31] addsleep: WHY DID YOU RESTART bots-bnr1????
[23:37:07] argh
[23:38:51] is bots-bnr1 operational again?
[23:39:02] probably
[23:39:13] all my screens got killed though
[23:39:17] legoktm: it was inaccessible
[23:39:23] how?
[23:39:34] bnr1 hasn't been working for a while...
[23:39:51] ???
[23:39:55] it was working fine for me
[23:40:03] http://ganglia.wmflabs.org/latest/?r=day&cs=&ce=&c=bots&h=bots-bnr1&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[23:40:37] when was the last time you checked...?
[23:41:18] 10 hours ago?
[23:41:25] o.o
[23:41:30] lol
[23:41:37] xD
[23:41:50] :(
[23:41:51] ...
[23:42:01] * Damianz kicks the shit out of addsleep
[23:42:18] WHY U REBOOT BOTS-BSQL01
[23:44:20] * Damianz grumbles and starts his 2 day import all over again since it had only got to 89% done... yet another week of downtime
[23:46:24] ... so is this 2 days or 7 days?
[23:46:50] downtime for what?
[23:47:16] cb
[23:47:38] and 2 days to import means it will finish tuesday, I don't have time to finish off moving it until saturday... so 7 days.
[23:49:40] hmm sorry Damianz I didn't realise you had started yet :/
[23:50:27] how big is the db? For some reason any long activities on bsql01 seem to take a zillion times longer than they should..
[23:51:36] 180mb and yes, it imports really fast on sql2... but very, very slowly on bsql01 pv says @ 591kB/s.. eta 2 days, 18hours (which from the past day seems about right)
[23:51:53] 180mb......
[23:52:06] iowait is like 25><50%
[23:52:15] what, now?
[23:52:19] Yeah... 180mb is tiny.... it's only like a million rows.
[23:52:28] http://ganglia.wmflabs.org/latest/?r=day&cs=&ce=&c=bots&h=bots-bsql01&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS < generally
[23:52:55] i was going to say I imported a 22,000,000 row db the other day and it didnt even take 2 days...
[23:53:51] IIRC this is like 970k ish of rows in the main table.... interestingly it's jumped to 8% done eta 2 hours now, so much faster than yesterday.
[23:53:54] so its going faster now? :D
[23:54:01] yee
[23:54:07] something was wrong with bsql01
[23:54:18] no queries I made were going through
[23:54:27] not even to flush the query cache >>.<
[23:54:38] Still this is slow... compared to a normal mysql server.
[23:54:56] I know, the only thing I can see it might be is IO
[23:55:14] we tweaked the config the other day but couldnt restart as db was being used lots :/
[23:55:22] as it broke more it seemed more urgent :/
[23:56:15] http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20pmtpa&h=virt9.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 < that's the host it's on
[23:56:16] and yee, my DB was many many GB, so 180 should go lots quicker ;p
[23:56:21] iowait is low, load is kinda high
[23:57:01] hmm
[23:57:13] Interesting we do more traffic in than out
[23:57:32] hehe :P
[23:57:49] Ill try and have a deeper look into things tomorrow, could you give me a ping after your db import finishes and if you start any more?
:)
[23:57:49] More interesting - it's mostly bots, rofl
[23:58:29] yeah - looks like it might finish tonight at this rate
[23:58:33] :)
[23:58:38] glad i could be of service ;p
[23:58:45] care to un kick the shit out of me now? ;p
[23:58:55] only if you !log ;P
[23:59:09] log what? :P
[23:59:19] reboots
[23:59:25] I did O_o
[23:59:52] Oh so you did, I ignore that bot :D
[23:59:57] xD
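The dd and sync timings pasted earlier in this log (under half a second to write 102 MB, then between about 36 seconds and almost 2 minutes for sync) separate two costs: handing data to the page cache, and flushing it to physical storage. A rough Python equivalent is sketched below, with caveats: os.fsync flushes only the one file, whereas the sync command in the paste flushes everything dirty on the system, so the numbers are not directly comparable. The function names are made up for this sketch; the defaults mirror dd's count=200000 with its default 512-byte blocks.

```python
import os
import tempfile
import time


def time_write_and_fsync(path, count=200000, block_size=512):
    """Return (write_seconds, fsync_seconds) for count zero-filled blocks."""
    block = bytes(block_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()  # hand Python's buffer to the OS page cache
        write_s = time.perf_counter() - start
        start = time.perf_counter()
        os.fsync(f.fileno())  # block until the OS reports the file as flushed
        fsync_s = time.perf_counter() - start
    return write_s, fsync_s


def demo(count=2000):
    """Run a small version in a temp directory and report size and sanity flags."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "xx")
        write_s, fsync_s = time_write_and_fsync(path, count=count)
        return os.path.getsize(path), write_s >= 0, fsync_s >= 0
```

On storage behaving like the one in the paste, the second number would be expected to dwarf the first, matching the "writing is fast, flushing is slow" reading given in the channel.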