[00:19:47] FastLizard4: yeah, I couldn't figure out the issue [00:20:03] though I didn't spend an amazing amount of time on it :) [00:20:20] Ryan_Lane: Okay, thanks anyway. :) [00:21:13] well, gmond is talking from the system to the server [00:21:18] there's an established connection [00:28:22] I think I may have found the issue [00:29:04] that said, it's still not showing up. heh [00:29:42] oh. right [00:29:53] so. the security group was slightly off [00:30:09] it opened 50000-51000 to 10.4.0.0/24 [00:30:14] our subnet is larger than that now [00:30:27] so, I changed it to 10.4.0.0/21 [00:30:49] it takes a while for the security groups to be applied on the busier compute nodes [00:31:02] once that happens the instance will show up in the list [00:32:12] Hey, Ryan, I gots a favor to ask. [00:32:18] Coren: ? [00:32:39] FastLizard4: ok. it's in the list now [00:33:32] Think you could whip up a few instances so I can start playing with OGS deployment scenarios to help draft my original proposal? Low-resources, low-impact. I just want boxen I can screw up without affecting anyone. [00:34:02] Ryan_Lane: Oh, awesome :D [00:34:03] just make some in webtools [00:34:09] Ryan_Lane: Thanks so much! :D [00:34:22] Ryan_Lane: Do I have the magic for that? [00:34:22] Coren: what's your labsconsole user name? [00:34:29] Ryan_Lane: marc [00:34:45] Wait, I lied. 'Coren' [00:34:56] marc is my username [00:35:34] Oh, and I had an unrelated question: Why is our invisible pink unicorn cyan? :-P [00:38:04] heh [00:48:07] Ryan_Lane: Is it normal for the grid to be reporting a negative value for RAM on Ganglia? :P [00:48:38] Coren: whoops sorry. one sec and I'll add you [00:48:50] FastLizard4: heh [00:49:00] it's probably because I reset gmond processes [00:49:07] and the clients may need to reset themselves [00:49:11] Ahh :P [00:49:41] Coren: ok. you're added [00:50:00] Coren: you can manage instances, security groups, ips, etc in webtools now [00:56:19] rather an epic graph [01:02:18] Ryan_Lane: Thanks (delay: food! dinner rang while I waited) :-) [01:02:54] * Coren reads tah docs. [01:03:52] Damianz: what's an epic graph? [01:04:02] FastLizard4's [01:04:06] heh [01:04:20] too lazy to open tweetdeck [01:04:53] Ryan_Lane: Wait, we are putting webtools and bots on /distinct/ projects? [01:05:02] they are currently distinct [01:05:14] it likely makes sense for them to be combined in the future [01:05:37] Ryan_Lane: It would indeed. I was kinda planning on it. :-) [01:11:15] andrewbogott: there's a problem with manage-volumes [01:17:43] Oh, hah! I expect that by setting one of those nodes with the admin security profile I effectively locked myself out because I'm using the bastion for mortals. :-) [01:19:22] And I'm not cool enough to use bastion-restricted yet. :-) [01:23:57] Coren: we'll need to change that when you're given higher privileges [01:24:12] !log webtools Created instances webhost-grid-{admin,login,exec1,exec2} for trial deployment of OGS [01:24:15] Logged the message, Master [01:28:32] Ryan_Lane: I don't see a method by which I can alter the security group I put -admin in after the fact? [01:30:28] Coren: ah [01:30:35] yeah, this is a limitation of openstack [01:30:52] delete/recreate? [01:30:56] unfortunately [01:31:08] Meh. I've seen worse. :-) [01:31:33] it's a really annoying thing [01:31:41] I'm hoping it's something addressed in grizzly [01:32:29] Ryan_Lane: Is it intended and if so any idea when labsconsole and gerrit share ssh key registries? [01:33:06] Krinkle: http://www.mediawiki.org/wiki/Wikimedia_Labs/Account_creation_improvement_project#SSH_key_management [01:34:10] Ryan_Lane: I just updated it in gerrit, and works instantly. Doesn't seem to work in labs yet though (added it on Special:NovaKey) [01:34:35] it isn't updating for you? [01:34:42] the script likely isn't running [01:34:43] one sec [01:35:24] "the script" [01:35:27] :D [01:35:28] Creating directory '/home/marc'. [01:35:28] Unable to create and initialize directory '/home/marc'. [01:35:44] Yeah I've seen that before [01:35:48] I had to reboot the instance in that case [01:36:01] nope [01:36:02] * Coren tries that. [01:36:06] ... or not [01:36:09] Krinkle: fixed [01:36:15] Coren: which instance? [01:36:22] Ryan_Lane: okay, will try again. [01:36:22] webtools-grid-login [01:36:32] Ryan_Lane: The key update only needs to happen on bastion, right? From there it is keyless to other vms, or not? [01:36:43] Ryan_Lane: great, I'm in [01:36:45] Krinkle: nope. it needs to occur on labstore2 [01:37:08] there's a global nfs share to all instances that's read-only [01:37:13] all keys are in it [01:37:29] the mount point on labstore2 isn't an automount [01:37:38] when the gluster outage occured, it didn't re-mount itselt [01:37:45] I needed to unmount/mount [01:38:04] we need monitoring for that script [01:38:06] Ryan_Lane: Actually, all three of 'em [01:38:12] OHHHH [01:38:14] I know why [01:38:57] Ryan_Lane: Okay, so it does use keys for each ssh to each vm, but also in one place for all, great. [01:38:58] gluster droppings? :-) [01:39:04] I disabled the glustermanager script because I was scared it would cause an outage while I was bringing the volumes back up [01:39:13] I just re-enabled it [01:40:30] it should work in a minute [01:41:10] Ryan_Lane: And so it does. Danke. [01:41:14] yw [01:41:52] I'm still kind of scared it'll cause another outage [01:42:13] that's on my list to fix tomorrow [01:42:59] * Coren knocks on wood. [01:43:40] I changed it to run every 5 minutes, rather than every minute [01:44:02] So it should hammer more... gently? :-) [01:44:07] yes. [01:44:12] it's slightly broken right now [01:44:26] it continuously changes the share permissions of every project [01:44:39] which pisses the glusterd proces off [01:44:41] *process [01:46:30] Don't anthropomorphise programs; they don't like that. :-) [01:46:51] :D [01:51:44] Hm. Do we have guidelines on what should/should not be !logged? [01:52:01] generally any changes to the system [01:52:42] !log webtools Added self to admins sudoers policy. [01:52:43] Logged the message, Master [03:05:53] Aaah, I keep on forgetting to ask this question when there's people around [03:06:12] Do projectadmins have any permissions to run Nagios commands on instances in their project? [08:43:22] @seen Ryan_Lane [08:43:22] petan: Last time I saw Ryan_Lane they were quitting the network with reason: Quit: Leaving. N/A at 2/15/2013 8:29:21 AM (00:14:01.0915570 ago) [08:43:30] meh [08:43:39] Damianz you have an idea how to remount gluster [08:43:43] on instance [08:52:00] @notify andrewbogott_afk [08:52:01] This user is now online in #wikimedia-dev so I will let you know when they show some activity (talk etc) [08:52:09] paravoid ping? [08:52:35] @notify paravoid [08:52:35] This user is now online in #wikimedia-tech so I will let you know when they show some activity (talk etc) [09:51:34] petan: maybe the bot could query the idle time of the user [09:51:51] though that is just to register a notification isn't it ? [09:52:06] hashar huh? [09:52:13] forget me :-] random musing [09:52:30] I don't see what the idle time is good for :P [09:52:53] idle time is based on operation that may be automatically done by your client [09:53:08] so it's idle time of client, not of user :P [09:55:59] petan: a whois on myself gave me: hashar 191 seconds idle, 1360917972 signon time [09:56:13] I guess that idle time is only when talking to people [09:57:02] ok but response to ctcp for example would reset the counter [09:57:21] it should be reset by all actions except for /raw PING [09:57:36] it's just unreliable :P [09:58:33] I want to see the person really talking in a real channel to be sure they are back [10:08:15] test it out :-] [10:08:22] not sure whether a ctcp will reset the idle [10:09:52] I guess my idea is to catch users not being active in channels where wm-bot is [10:09:57] such as speaking in private messages [13:04:34] petan: could you make it so wm-bot send the "I will notify you, when I see mitevam around here" in private notice ? /msg [13:04:51] petan: that will avoid a bit of spam in the channels :] Only the user doing the @notify care about the bot answer [13:04:54] hashar it already does that :D [13:05:05] it was originally created as a PM service [13:05:05] didn't do it in #mediawiki a minute ago :-] [13:05:17] I extended that to channels just to make others aware of that [13:05:34] you have to do @notify blah in private message [13:08:22] if you do that in a channel bot will respond to a channel [13:08:30] if you do that in a PM bot will respond in a pm [13:08:35] hashar: hey, i have a small review question if you could just take a look at one line ... [13:08:43] mutante hi [13:08:49] i wonder why gerrit marks something as changed, but i don't see any diff [13:08:55] do you know how to remount gluster on instance? [13:09:06] I have an instance where /data/project is borked [13:09:11] petan: hi, no, not really, sry [13:09:12] I can't reboot it, I need to fix it [13:09:16] mhm [13:09:17] ok [13:09:56] mutante: maybe a commit message change [13:10:00] borked in wich way, petan? [13:10:04] mutante: or the new patchset has not the same parent (aka its a rebase) [13:10:22] mutante: each patchset list its parent commit. You should get a different parent for each of the patchset. [13:10:28] hashar: line 505 in this one.. it is colored.. but it seems unchanged ? https://gerrit.wikimedia.org/r/#/c/49194/4/redirects.conf [13:10:32] mutante: ideally Gerrit would notify you about the change being a rebase [13:10:47] it is just about one line in that patch set [13:10:59] mutante: the wikimedia.org [13:11:05] mutante: the last dot has been escaped [13:11:06] root@i-000000a9:/mnt/share/wmib# cat /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt [13:11:07] cat: /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt: Input/output error [13:11:12] mutante ^^ [13:11:16] \.wikimedia.org ==> \.wikimedia\.org [13:11:24] hashar: thanks, exactly why i needed a second pari of eyes.. blind :) [13:11:29] pair [13:11:55] mutante: git diff --word-diff [13:11:58] mutante: that might help :-] [13:12:21] thanks [13:13:33] [Fri Feb 15 13:13:27 2013] [error] SSL Library Error: 185073780 error:0B080074:x509 certificate routines:X509_check_private_key:key values mismatch [13:13:34] boooo [13:14:10] petan: has it happened to you in the past like that and rebooting fixed it , you say? [13:14:19] no [13:14:31] I don't know if it happened in the past but i know I must not reboot the box [13:14:40] because data which needs to be written are now in RAM [13:14:40] mount -o remount ? i don't know about the gluster issues though and it's automounted [13:14:52] I tried remount :/ [13:14:54] no luck [13:14:58] it's still same borked [13:15:09] I think I should restart glusterfs service but don't know how [13:15:11] arg. i don't know then [13:15:31] was there a gluster update again? [13:15:41] see mail - Ryan said this night gluster failed [13:15:48] this is surely related to it [13:15:55] ah, well, then at least it's known already [13:15:58] yep [13:16:27] i just know that Ryan said it was a tedious job to fix after failed update last time [13:16:36] like he had to do many little small things [13:16:39] mhm [13:17:11] petan: there is no gluster service :-D [13:17:20] ther eis [13:17:33] petan: that is a fuse mount invoking a gluster client [13:17:38] root 2148 1 0 Jan07 ? 00:07:29 /usr/sbin/glusterfs --volfile-id=/bots-home --volfile-server=projectstorage.pmtpa [13:17:39] on /home type fuse.glusterfs [13:17:50] this process [13:17:55] I think it needs to be restarted [13:18:04] root 2148 1 0 Jan07 ? 00:07:29 /usr/sbin/glusterfs --volfile-id=/bots-home --volfile-server=projectstorage.pmtpa.wmnet /home [13:18:05] so that is invoked by fuse [13:18:06] root 3857 1 0 08:20 ? 00:00:12 /usr/sbin/glusterfs --volfile-id=/bots-home --volfile-server=projectstorage.pmtpa.wmnet /home [13:18:06] root 10551 1 0 09:53 ? 00:00:09 /usr/sbin/glusterfs --volfile-id=/bots-home --volfile-server=projectstorage.pmtpa.wmnet /home [13:18:07] root 18284 1 0 12:03 ? 00:00:16 /usr/sbin/glusterfs --volfile-id=/bots-project --volfile-server=projectstorage.pmtpa.wmnet /data/project [13:18:23] you will want to remount the partition like mutante said [13:18:58] that might kill the process and start a new one [14:26:53] hashar [14:26:57] root@i-000000a9:/mnt/share/wmib# mount -o remount /data/project/ [14:26:58] root@i-000000a9:/mnt/share/wmib# [14:26:59] root@i-000000a9:/mnt/share/wmib# cat /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt [14:27:00] cat: /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt: Input/output error [14:28:00] petan: I would umount [14:28:05] make sure it is no more mounted [14:28:15] and make sure Gluster is no more running [14:28:17] then mount -a [14:28:19] when I do that, it proceed but automount mount it back [14:28:25] and it's still same [14:30:10] the same way ? [14:30:15] the input/output ? [14:30:19] maybe the volume has some errors [14:30:23] you can also reboot the instance :-] [14:30:34] the quick / sure / dirty way to remount [14:32:07] I can't [14:32:13] the data I need to write are in RAM [14:32:29] wikimediafoundation.info .. can you confirm it works? [14:32:49] should redirect and not show Unconfigured Domain [14:32:54] doesi't work to me, mutante [14:32:58] oh it does [14:33:05] Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment. [14:33:17] http://wikimediafoundation.org/wiki/Home [14:33:38] thanks [14:33:54] pheew.. that was harder than it sounds [14:34:10] because the apache restart script has issues since the eqiad switch [14:34:22] there are more things that have issues [14:34:25] and sync-apache seemed to skip servers , but they are decomed.. just still running [14:34:27] such as glusterfs D: [14:34:55] sure.. quoting Quim at FOSDEM [14:34:59] "2k open bugs" :p [14:35:00] or something [14:47:41] Change on 12mediawiki a page Wikimedia Labs was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=646935 edit summary: [+77] /* Tool Labs */ +link to TODO [15:20:58] !log deployment-prep rebooting apaches box to find out whether apache2 service is bring up {{gerrit|47398}} [15:21:01] Logged the message, Master [15:37:01] [bz] (8RESOLVED - created by: 2Antoine "hashar" Musso, priority: 4Normal - 6normal) [Bug 38996] [OPS] apache2 need manual start - https://bugzilla.wikimedia.org/show_bug.cgi?id=38996 [15:37:32] !log deployment-prep puppet properly start the apache2 service. Fixed {{bug|38996}} [15:37:34] Logged the message, Master [15:49:51] !log wikidata-dev redirected cron spam to myself on all instances [15:49:54] Logged the message, Master [15:52:12] @seen Ryan_Lane [15:52:12] petan: Last time I saw Ryan_Lane they were quitting the network with reason: Quit: Leaving. N/A at 2/15/2013 8:29:21 AM (07:22:51.5568280 ago) [15:52:49] @notify Ryan_Lane [15:52:49] I will notify you, when I see Ryan_Lane around here [16:14:22] Change on 12mediawiki a page Wikimedia Labs/TODO was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=646989 edit summary: [+2] ce [16:18:45] Change on 12mediawiki a page Wikimedia Labs/TODO was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=646993 edit summary: [+23] Not sure about OGS yet [16:54:20] Ryan_Lane I need to fix gluster on bots-1 without reboot is that possible? [16:54:28] it doesn't work much now [17:02:26] @notify Ryan_Lane [17:02:27] I will notify you, when I see Ryan_Lane around here [17:11:45] * hashar waves [17:12:31] hashar: beta cluster down again? http://en.wikipedia.beta.wmflabs.org/ [17:13:17] pfff [17:13:18] seriously [17:13:22] I am tired [17:13:24] of this [17:13:40] whatever it is glusterd did it [17:14:14] I guess Ryan is not coming until monday which is fun [17:14:31] !log deployment-prep running "mw-update-l10n --verbose" on -bastion as mwdeploy user [17:14:34] Logged the message, Master [17:14:41] chrismcmahon: some perm errors again :/ [17:14:49] PHP Warning: rename(/home/wikipedia/common/php-master/cache/l10n/l10n_cache-an.cdb.tmp.949349430,/home/wikipedia/common/php-master/cache/l10n/l10n_cache-an.cdb): Input/output error in /data/project/apache/common-local/php-master/includes/Cdb.php on line 174 [17:14:52] basically cluster is dead [17:14:55] gluster is dead [17:14:58] nothing I can do apparently [17:15:08] hashar that's what I get [17:15:11] chrismcmahon: sorry I am out for the weekend. Might check later tonight [17:15:12] on bots [17:15:13] no promise though [17:15:17] petan: yeah cluster issue :( [17:15:26] but it suck it happened on friday [17:15:35] shittons of shit [17:15:39] andrewbogott_afk: ping [17:16:08] so bascically [17:16:11] reboot -bastion maybe [17:16:18] MAYBE [17:16:21] hashar: run !:) [17:16:32] and then refresh the l10n cache using : running "mw-update-l10n --verbose" on -bastion as mwdeploy user [17:16:38] that would regenerate the cache [17:16:38] hashar I can reboot it so that people will blame me :D [17:16:42] go ahead [17:16:44] I am off for now [17:16:46] late already [17:16:48] * hashar wave [17:17:31] !log deployment-prep petrb: rebooting bastion to fix some issues with mw [17:17:33] Logged the message, Master [17:23:17] Ryan_Lane I need to fix gluster on bots-1 without reboot is that possible? [17:23:37] what's broken with it? [17:23:47] it doesn't read some files [17:24:01] cat: /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt: Input/output error [17:24:10] it must not be rebooted [17:24:21] one sec [17:25:23] <^demon> Ryan_Lane: Jeff merged the gerrit diffs earlier :) [17:25:28] cool [17:27:08] petan: that file is likely split-brained [17:27:18] petan: the filesystem itself is up and running [17:27:18] what does it mean [17:27:27] that file is accessible on other boxes [17:27:30] just not this one [17:27:30] it is? [17:27:33] yes [17:27:35] wow. [17:27:40] oh [17:27:44] that actually makes sense [17:27:45] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-mobile/20130214.txt [17:27:48] this is the file [17:28:09] just the bot lost ability to write into that log [17:28:22] which stopped whole logging system [17:28:31] the split brain causes that [17:28:42] ok, how to fix splitbrain [17:28:55] I have to fix it, unfortunately [17:29:00] aha [17:29:05] how? [17:29:24] figuring out which nodes have the correct and which have the incorrect files [17:29:31] oh [17:30:13] ok it's not really urgent as long as there exist a solution for that... but for now all logs are stored in RAM which is not a most reliable storage [17:30:22] * Ryan_Lane nods [17:32:43] weird [17:32:50] their doesn't seem to be a split brain on this file [17:38:51] but it still doesn't work [17:39:01] don't tell me there is no logfile for gluster that can explain what's up [17:39:30] $gluster-whatsupwithtehfile /data/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt [17:41:44] petan: I'm running a self-heal on the node [17:41:50] there's a log file [17:41:54] it says there's a split brain [17:41:59] ok [17:42:00] I don't see a split-brain condition [17:42:05] which is why I'm running the self-heal [17:42:18] the logs are in /var/log/glusterfs [17:42:22] yay [17:43:21] "Please delete the file from all but the preferred subvolume." [17:43:27] background meta-data data self-heal failed on /public_html/petrb/logs/#wikimedia-mobile/20130214.txt [17:43:34] except that it doesn't exist on the other two nodes [17:43:41] hmh [17:43:43] oh [17:43:43] wait [17:43:45] I see [17:43:56] it *is* in fact split-brained and I see it [17:44:08] how [17:44:30] on the servers [17:44:47] labstore2.pmtpa.wmnet: -rw-r--r-- 2 wmib wikidev 23469 Feb 14 22:28 /a/bots/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt [17:44:47] labstore1.pmtpa.wmnet: -rw-r--r-- 2 wmib wikidev 22642 Feb 14 19:05 /a/bots/project/public_html/petrb/logs/#wikimedia-mobile/20130214.txt [17:45:05] larger and newer :P [17:48:18] which system can access this file? [17:48:27] at least web server can [17:48:31] apache-01? [17:48:34] yes [17:48:37] it can't :( [17:48:39] or it appears to [17:48:47] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-mobile/20130214.txt [17:48:59] that page doesn't load for me [17:49:07] it's cached for you [17:49:07] oh in that case it's browser cache [17:49:19] he connection to the server was reset while the page wa [17:49:21] true [17:49:36] okay so we definitely lost it :( [17:49:40] no [17:49:41] we didn't [17:49:42] poor wikimedia-mobile [17:49:43] :D [17:49:45] ok [17:51:25] * Ryan_Lane sighs [17:53:13] fucking glusterfs makes this harder and harder [17:53:25] it keeps shadow copies of files and shit like that [17:53:57] oh yes, files and shit like that, like... folders :D [17:59:43] ok [17:59:45] fixed [18:00:04] ooh yay :D [18:00:06] want to see how? [18:00:07] GFID=$(getfattr -n trusted.gfid --absolute-names -e hex 20130214.txt | grep 0x | cut -d'x' -f2) [18:00:12] rm -f /a/bots/project/.glusterfs/${GFID:0:2}/${GFID:2:2}/${GFID:0:8}-${GFID:8:4}-${GFID:12:4}-${GFID:16:4}-${GFID:20:12} [18:00:16] rm 20130214.txt [18:00:20] I'm not kidding [18:00:40] it's obvious that the glusterfs folks hate all of their users [18:00:42] bugt I still get errors [18:00:43] :/ [18:00:54] on another file? [18:00:57] or that same one? [18:00:59] now it also keeps putting some shit into logs :D [18:01:02] yes on same [18:01:09] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-mobile/20130214.txt?fsd [18:01:17] see that file [18:01:20] yeah [18:01:23] restart the bot maybe? [18:01:24] bottom of it [18:01:27] the file itself isn't broken [18:01:30] no, it's not being inserted by bot [18:01:41] the file itself is still inaccessible on that box [18:01:47] try to do cat on it [18:02:58] actually it is being inserted by a bot - bot attempt to write - gluster append data to file and return file IO error to bot instead of "it's written dude" [18:06:10] and if I kill the bot now we will loose the logs for 2 - 3 days [18:07:20] :( [18:07:27] can you save the logs elseqhere [18:07:30]