[01:11:25] *poke* andrewbogott or Coren or Ryan_Lane
[01:11:33] ?
[01:11:56] Could you fix projectstorage.pmtpa.wmnet:/bots-home, everything is read-only apparently (and I rebooted the box, like, twice)
[01:13:25] * Damianz grumbles about having to change his password because really what is someone going to do with my wikipedia login
[01:15:32] hm, I just fixed that
[01:15:49] it works for me on bots-labs
[01:15:55] which instance is having an issue?
[01:16:06] Doesn't work on bots-apache01 or bots-cb
[01:16:06] Damianz: ?
[01:16:31] works for me...
[01:16:52] touch: setting times of `web-settings.php': Read-only file system < :(
[01:17:00] * Damianz tries bots-labs
[01:17:03] which directory
[01:17:25] public_html/damian
[01:17:51] cd: /data/project/public_html/damian: Too many levels of symbolic links
[01:17:53] that's a new one
[01:18:29] bots-labs just gives me that for everything under /data/project heh
[01:19:16] bah
[01:19:27] it's not home that's broken
[01:19:34] you're linking into data/project
[01:20:10] oh yeah - sorry, copied wrong line to start with
[01:20:16] * Damianz notes it is like 02:20
[01:20:17] fixed
[01:20:44] Yay, thanks
[01:23:39] yw
[01:23:46] Damianz: move your damn bots :)
[01:24:18] If only I had time, motivation, energy
[01:24:44] :D
[01:24:54] They'd probably run better on tools, but I need to put in some hacks to make them support getting started anywhere + actually make the code compilable easily
[01:25:04] bots was awesome, then people broke it, now I want to stab people
[01:25:19] :)
[01:27:04] I wouldn't be so annoyed someone deleted the database server if they hadn't also broken the backups for it >.<. Don't even want to know how old this database copy is *sigh*
[01:28:52] 1.5 million edits... sorta recent, 6 months old will do for now
[14:52:39] in SGE, what's state=Eqw ?
[14:53:01] three of my jobs are stuck at that state
[14:53:06] for several days
[14:53:47] Coren: petan ^
[14:56:01] liangent: It's errored out trying to run. If you do a qstat -j you'll get a detailed error near the bottom.
[14:56:06] What's your job number?
[14:56:37] Coren:
[14:56:38] 1240083 0.25014 php_dispat local-liange Eqw 10/11/2013 06:05:00 1
[14:56:38] 1240107 0.25009 php_transl local-liange Eqw 10/11/2013 06:22:12 1
[14:56:38] 1240309 0.25000 php_cleanu local-liange Eqw 10/11/2013 07:13:01 1
[14:57:09] Coren:
[14:57:10] error reason 1: 10/11/2013 06:54:12 [50497:29234]: can't stat() "/data/project/liangent-php/php_dispatchRC_zhwiki.ou
[14:57:10] scheduling info: Job is in error state
[14:57:44] (that's probably ".out")
[14:57:56] Coren:
[14:58:00] Coren: ok
[14:58:13] It looks as though your jobs tried to start while the filesystem was down. Chances are just clearing their error state will have them proceed normally.
[14:58:38] qmod -cj 1240083 1240107 1240309
[14:58:52] Coren: well I just qdel'd all of them
[14:58:56] Ah.
[14:59:10] so they can start later automatically from cron
[14:59:16] That also work.
[14:59:20] works*
[14:59:32] Coren: better to have them die out completely next time
[14:59:48] so the next submission won't be blocked
[15:00:02] * Coren ponders. Are you placing them on the continuous queue?
[15:00:37] Coren: the first is, and the rest aren't
[15:01:22] Hm. It's normal for the first to have wedged then; continuous jobs are never deleted on purpose (so you get a chance to correct an error); but the latter two should have gone away.
[15:01:46] * Coren will make sure to clear out errors if there's ever another disruptive filesystem maintenance.
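
For reference, the inspect-and-clear sequence Coren walks liangent through above looks roughly like this on the command line (job numbers taken from the log; exact output formatting varies between gridengine setups):

    # Show why a job is stuck in Eqw; the error reason appears near the bottom
    qstat -j 1240083

    # Clear the error state so the scheduler tries to run the job(s) again
    qmod -cj 1240083 1240107 1240309

    # Or, as liangent chose to do, delete them and let cron resubmit later
    qdel 1240083 1240107 1240309
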
[15:03:37] Coren: well but it makes sense to restart continuous jobs
[15:04:01] hmm "clearing out errors" restarts it, right?
[15:04:08] For a continuous job, yes.
[15:04:18] It may, for a task, if it hadn't started at all.
[15:04:29] (But partially completed tasks are never restarted)
[15:04:37] ok
[15:34:02] Ryan_Lane: can you check these 2 bugs I made regarding wm-bot
[15:34:23] Coren
[15:34:34] eh sry, let me try again :P
[15:34:50] Coren: are you in charge of creation of projects?
[15:34:51] * Coren waits for the new try. :-)
[15:35:10] petan: No more "in charge" than anyone else is. What's up?
[15:35:13] that : shouldn't be right next to enter on this keyboard
[15:35:44] I created a ticket to create a project called "wm-bot" so that I can create 1 instance for wm-bot there and finally delete the bots-labs instance on the bots project
[15:36:14] which is the last instance of mine that blocks bots project deletion, there are others from other people that block it though :P
[15:36:27] so it's still going to take some time for the bots project to die...
[15:37:08] I wonder, why not tool labs for wm-bot? It has public IPs with proper ident, and the task is eminently gridengineable. :-)
[15:37:20] let me dig up these bugs...
[15:37:23] tool-labs has ops support; random projects do not. :-)
[15:38:50] here is why: https://bugzilla.wikimedia.org/show_bug.cgi?id=51935 check deps
[15:39:32] some of the dependency bugs were flagged as wontfix, which makes it kind of impossible unless I reduce wm-bot's functionality, which I really don't want to
[15:41:57] I'm not seeing any such thing in those bugs; there is exactly one closed wontfix and that one you *still* didn't explain why it even made sense.
[15:42:59] it's https://bugzilla.wikimedia.org/show_bug.cgi?id=51936 I think
[15:43:11] 51936
[15:43:19] wm-bot currently is able to relay some tcp / udp messages to irc
[15:43:23] Right.
[15:43:54] https://meta.wikimedia.org/wiki/Wm-bot#.40relay-on
[15:44:06] that is a kind of useful feature, I use it in many of my shell scripts
[15:44:29] What kind of authentication does that service have?
[15:44:33] I can just call a simple command and *poof* the message is on the irc channel
[15:44:38] none or token
[15:45:17] (the token is a password-like string you need to know in order to relay messages)
[15:46:57] ... that feature is the only reason why you feel you need a publicly accessible port?
[15:46:57] also I had some plans to scale wm-bot to many more features in future, not sure if the tools project wouldn't limit that a bit... but at this moment I think this one feature is probably the only blocking thing
[15:47:18] also is it possible to run like 5 bouncers to freenode for 1 bot on the tools project?
[15:47:55] and another thing that worries me is the limit on virtual memory, currently wm-bot eats a few gigs of virtual memory, but just a few MB of resident
[15:47:55] Why would you even *want* to?
[15:47:58] because it does now
[15:48:00] check #wm-bot
[15:49:03] it is in so many channels that 1 instance isn't enough
[15:49:03] there is a restriction on freenode's side on channels per user
[15:49:03] wm-bot is a multi-instance botnet
[15:49:07] First off, that limit can be lifted at need. Secondly, why are you using bouncers at all?
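
The relay feature petan describes boils down to "write one line to a socket and it shows up in the channel". Purely as an illustration of what such a shell-script call could look like, with the host, port, and payload format invented here (they are not wm-bot's actual protocol, which is documented on the meta page linked above):

    # Hypothetical sketch only: the host, port and payload layout are assumptions.
    echo "TOKEN #wm-bot backup finished" | nc -w 1 wm-bot.example.invalid 64834
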
[15:49:59] 1) I asked multiple times, "I don't know" or silence was always the reply from freenode staff 2) multiple reasons, the primary reason is that I can restart the bot core (patches etc) without having to dc the bot
[15:50:31] so that I can update the bot internals and it still is online, doesn't miss channel logs etc
[15:50:33] Hm. (2) seems kinda reasonable; does wm-bot actually understand backlogs?
[15:50:38] yes
[15:50:44] it has its own bouncer, I wrote both
[15:51:04] Oh heh. Reinvent wheel FTW? :-)
[15:51:16] it was simpler than adapting any other bouncer
[15:51:34] that bouncer is like 200 lines of code, the configuration file for znc would be a larger thing to write
[15:51:57] anyway, the channel limit is not the only limit
[15:52:07] wm-bot can't send more than 1 message per second otherwise it would get killed
[15:52:11] I still see nothing in here that can't be solved from the grid; if only with a scheme like cyberbots.
[15:52:16] that is a problem when you are serving 160 channels
[15:53:24] well, this can work on the grid
[15:53:24] I am just wondering if it's allowed (opening 5 connections to freenode from 5 simultaneous jobs)
[15:53:25] Sure. Although I'd then ask you to connect to dickson only, to avoid needless net traffic.
[15:53:25] that's no problem...
[15:53:50] you should really have the IP of dickson as chat.freenode.net in hosts
[15:54:07] that is freenode's round robin :P
[15:54:52] I know; but that's a bad idea. If you want to connect to a specific server, connect to that server, don't count on someone doing trickery with host resolution. :-)
[15:55:02] chat.freenode.net is not a specific server, it's round robin so it wouldn't matter
[15:56:40] It does matter. Overriding DNS resolution is a Bad Thing. Just connect to dickson.freenode.net since that is what you need. :-)
[15:56:40] it /could/ theoretically give you dickson's IP anyway, this way it would enforce doing it every time
[15:59:00] * Coren is asking freenode staff for the right process so that you need just the one bouncer in the first place.
[16:00:46] but you understand that I am using 5 bouncers to get over 2 limits
[16:00:56] messages per second, channels per second
[16:01:27] * channels per user
[16:01:27] I don't even think they can lift the restriction on the first
[16:01:53] if there was only 1 bot in 160 channels it would be terribly slow because of the message-per-second limit
[16:02:53] Ryan_Lane: I really need to fix gluster there
[16:06:52] !ping
[16:06:53] that instance is on fire now
[16:06:53] !pong
[16:23:12] petan: What is your bot's cloak?
[16:25:32] Coren: it's in this channel, just type /whois wm-bot
[16:25:43] wikimedia/bot/wm-bot
[16:33:25] petan: They're giving your bot an I-line as we speak.
[16:34:24] :P
[16:34:57] aha great, what does it do?
[16:35:32] these lines have different meanings on each network...
[16:36:01] petan: Gives specific limits to an account; in your case, it greatly relaxes the channel and rate limits. :-)
[16:36:09] cool
[16:36:22] does it mean I can turn off the message-per-second limit?
[16:37:11] how come they speak to you and ignored me :(
[16:39:27] It's mostly a question of knowing who to ask, and who has the authority to okay the change. Also, being able to speak directly to the infrastructure team helps. :-)
[16:46:34] haha petan I hope wmf will soon start selling "I love gluster" t-shirts :-) I would get one
[16:52:27] hashar: around?
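
Coren's point about DNS, restated: if the bot must land on one particular server, name that server in the bot's own configuration instead of repointing the round-robin name in /etc/hosts. A minimal sketch (the config variable names are made up; only the hostnames are real):

    # Discouraged: pinning the round-robin name locally
    #   <dickson-ip>   chat.freenode.net      # entry in /etc/hosts
    # Preferred: configure the client to connect to the specific server
    IRC_SERVER=dickson.freenode.net
    IRC_PORT=6667
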
[16:52:41] yurik_: sorry, leaving right now :(
[16:52:48] commuting back home
[16:52:50] np
[16:52:52] another time
[16:52:54] thx
[17:01:47] Coren: for other people running irc bots from labs, can we use dickson too? idk if it's supposed to be used by other people
[17:02:09] legoktm: It's the best possible thing to do, though you are not obligated to.
[17:02:21] legoktm: It's been put in the production RR now, so it's a "normal" server.
[17:03:08] ok, it's a simple one-line change for me so i'll do that soon
[17:23:43] how can I find out what, if any, environment vars are set on a grid node?
[17:24:33] petan: which instance is having issues?
[17:25:35] MartijnH: just run a script which prints out all the variables
[17:25:37] MartijnH: The easiest way to know for sure is to simply have a shell script that reads 'set' and send that to the grid.
[17:26:04] ah ok :D
[17:39:11] Ryan_Lane: bots-labs
[17:40:35] fixed :o
[17:46:59] petan: yeah, the gluster process needed to be kicked
[17:47:03] on the client
[17:48:39] if I send a file called test to the grid with just set in it, I thought the output should be in test.out. That comes back empty though (so does test.err)
[17:51:38] !ping
[17:51:38] !pong
[18:23:08] Ryan_Lane: have you looked at ansible at all?
[18:24:19] !ping
[18:24:19] !pong
[18:24:49] YuviPanda: he's in a meeting
[18:25:09] mutante: ah, ok
[18:26:11] YuviPanda: yes
[18:26:16] YuviPanda: it's SSH based
[18:26:25] it's basically a nicer dsh
[18:26:27] pass
[18:26:34] oh, so slow
[18:26:43] Ryan_Lane: mostly because of https://github.com/Orain/ansible-playbook
[18:26:55] Ryan_Lane: which is what Orain uses to do their mediawiki farm deploys
[18:26:56] haven't looked at it yet
[18:27:00] yeah
[18:27:11] they use it. I think salt is superior in most ways
[18:28:06] Ryan_Lane: hmm, right
[18:28:15] Ryan_Lane: i've not really looked into either yet, but will do in the near future.
[18:28:15] hopefully
[18:29:11] ansible is similar in some ways
[18:34:07] Ryan_Lane: So, the issue with the new NFS server is that the olde 3.2 kernel doesn't speak the force-only-UID-use bit; making root not superuser over the net.
[18:34:15] yep
[18:34:19] so, we need to upgrade
[18:34:23] it's good to do so anyway
[18:34:31] Ryan_Lane: This can be solved by (a) upgrading the kernel or (b) switching to the new service group username thing.
[18:34:32] so that we know the problem was with the controller
[18:34:45] and upgrading the kernel just requires a reboot, right?
[18:34:46] Ryan_Lane: We don't. It could be a driver regression too. :-(
[18:35:01] they aren't using the same drivers?
[18:35:14] Well, the newer kernel has a more recent driver.
[18:35:22] right ;)
[18:35:33] So if we upgrade the kernel, we necessarily use the more recent driver which may or may not have a regression.
[18:35:44] so, if we upgrade the kernel and the problem comes back then it'll be more likely it's not the controller
[18:36:05] Indeed, although it would bode badly for every other bit of hardware we have with that controller. :-(
[18:36:22] yeah, but at least then we know there's a problem and it can be reported to Canonical
[18:36:49] The alternative is our change of group names thing. If we had the groups and users visible to the NFS server, we could do the "proper" thing and enable auth-by-principal with NFS
[18:37:30] hm.
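
Going back to MartijnH's question about grid-node environment variables: the approach Coren and YuviPanda suggest might look like the sketch below (jsub is the Tool Labs submit wrapper; the file name is arbitrary, and the default output locations may differ per setup):

    #!/bin/bash
    # printenv-grid.sh -- dump whatever environment a grid node hands a job
    env
    set

    # Submit it and read the result once the job has run:
    #   chmod +x printenv-grid.sh
    #   jsub -N printenv ./printenv-grid.sh
    #   cat ~/printenv.out    # stderr, if any, lands in ~/printenv.err
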
we might be able to do that
[18:37:51] by setting the user/group ou to the projects base + the global groups
[18:37:56] and having them be unique
[18:38:05] There is a third alternative that has some benefits:
[18:38:12] and I wanted to run that one by you.
[18:38:24] ?
[18:38:53] We could fork and adapt idmapd so that every project has its own domain, and map through LDAP. That would have the side effect of giving us true multi-tenancy.
[18:39:01] i.e. root on foo != root on bar
[18:39:04] not a chance in hell
[18:39:23] Yeah, I didn't much like it but I wanted to put the option on the table.
[18:39:26] :)
[18:39:33] I mean, it's a good idea
[18:39:46] but that's going to be really painful going forward
[18:39:55] it's pretty specific domain knowledge
[18:40:11] Yeah, the feature would be nice but the maintenance cost is nontrivial.
[18:40:14] yep
[18:40:21] our current plan works
[18:41:51] Alright; I'll bump the kernel up to 3.8 now and keep an eye on things to see if the driver gives us any issues. If it doesn't, then we're golden to move the projects over and make NFS the default.
[18:42:12] our current approach has tenancy, just enforced through LDAP
[18:42:34] and assuming we make the service groups globally unique it can be enforced on the server
[18:42:38] You mean enforced through the selection of exports.
[18:42:43] yes
[18:43:17] BTW, it seems there's a new OpenStack service to manage exports
[18:43:54] so, we can eventually use that rather than our custom code
[18:45:08] That'd be nice, because right now it's a fairly nasty pile of hacks.
[18:45:27] It *works*, but it's about as elegant as Frankenstein's monster.
[18:45:40] indeed
[18:45:55] I think andrewbogott made a nova plugin for this a while ago
[18:46:02] we needed to wait till folsom to use it
[18:46:07] I'm not sure why we didn't switch :D
[18:46:12] oh. it was gluster
[18:46:19] and we were migrating to nfs
[18:48:57] At least that is well tested; except for the fact that those servers take SOO long to reboot, NFS barely notices.
[18:50:07] And if we get controller stalls then we know for sure what the issue is and rolling back is no harder.
[18:52:42] Coren: indeed
[18:52:45] sounds like a good plan
[18:53:59] NFS coming back online with 3.8
[18:54:36] mmmm
[18:55:37] Coren: is it back up?
[18:55:46] YuviPanda: Yep.
[18:55:58] okay, if greg-g quits again i'll investigate
[18:56:07] grrr, grrrit-wm
[18:56:11] greg-g: you're quitting!? :D
[18:56:22] Ryan_Lane: shit! my cover! is blown!
[18:56:25] heh
[18:56:40] okay, seems to be up
[18:56:45] Though sometimes NFS exponential backoff means it may take a minute or two before you get access back fully.
[18:57:22] Wait, why is grrrit-wm quitting just because the filesystem stalls for a couple of minutes?
[18:57:48] Coren: logging
[18:58:01] Coren: i checked the log size, it's about 7M now after ~2 months of operating.
[18:58:08] ~3 months actually
[18:58:08] but still
[18:58:09] Coren: logging!
[18:58:15] grr
[18:58:19] something bigger is up
[18:58:23] That's not me!
[18:58:30] yeah
[18:58:30] looking
[18:59:09] Coren: I can't seem to ssh into -login?
[18:59:19] * YuviPanda tries ssh then mosh
[18:59:34] YuviPanda: WMF
[18:59:38] WFM, even
[18:59:42] Coren: nvm
[18:59:42] hehe
[18:59:53] intense lag but
[19:00:33] goddamn lag!
[19:00:43] i'm on a shitty connection
[19:02:36] Coren: ugh, weird
[19:02:42] Coren: i manually restarted it, worked
[19:02:49] Coren: it was just trying to, uh, i dunno what
[19:03:05] !ping
[19:03:05] !pong
[19:03:13] wm-bot, at least, didn't get confused.
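
The tenancy model Ryan_Lane and Coren settle on earlier ("enforced through the selection of exports") amounts to one NFS export per project, mountable only by that project's own instances. A purely illustrative /etc/exports sketch, not the actual labstore configuration (paths, client ranges, and options are all assumptions):

    /srv/project/tools            10.4.0.0/24(rw,no_subtree_check)
    /srv/project/deployment-prep  10.4.1.0/24(rw,no_subtree_check)
    # each project's instances can only mount (and therefore only see) their own tree
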
[19:03:42] Coren: sure, but it is a large scale abomination running in a different project with a different fs, no? :D
[19:03:55] No, I'm pretty sure it's on NFS too.
[19:04:05] ... or maybe not.
[19:05:11] Ah, how silly. Changing an NFS runtime parameter had NFS go into a 90s recovery grace period.
[19:05:28] oh?
[19:06:11] Oh, no significant issue; it's already over. File access blocked, I presume, while it flushed every outstanding request before changing the setting.
[19:06:15] It'd make sense.
[19:06:20] embarrassing admission of failure: I can't get my JVM up. Has anyone gotten lucky with that yet?
[19:06:48] MartijnH: What's your issue?
[19:07:03] Coren: wm-bot uses gluster
[19:07:05] on bots
[19:07:35] JVM fails to start with "could not reserve enough space for object heap". I tried starting with 500m with -Xms200, but no cigar
[19:08:19] sounds like it needs more heap!
[19:08:25] jsub -mem 4G?
[19:09:08] heh
[19:09:13] it works with 2G
[19:09:23] but that's a little ridiculous, you know, for hello world ;)
[19:09:46] MartijnH: hey, java!
[19:09:52] MartijnH: we could set up nailgun at some point... :D
[19:11:39] heh
[19:12:41] well, let's see how far hello world works reliably with 2G :/
[19:12:53] (not with 1G)
[19:13:56] let's see if I can get some magic on the road with ridiculous settings :)
[19:14:16] heh
[19:14:28] MartijnH: let me know if you need new packages or somesuch.
[19:14:28] MartijnH: i'll be on for the next hour or so
[19:14:36] * YuviPanda is imposing a strict 2AM sleep thingy
[19:15:07] I shouldn't, but thanks, will do
[19:15:40] oh, and if you don't mind me sticking my nose where it doesn't belong, Melatonin works wonders for me, and is rather harmless ;)
[19:16:00] MartijnH: heh, yeah.
[19:16:07] MartijnH: i'll give it a shot if this doesn't work
[19:20:44] Oh, poopness. Turning on the feature server-side doesn't let clients notice it got turned on until they remount the filesystem -- which can't be done. :-(
[19:20:47] Guuuuuuh, I broke it. Can I have a file manually taken please? I don't know my tool's password...
[19:21:23] Matthew_: Sure.
[19:21:38] Coren: OK, tool is articlerequest-dev and the file is index.php
[19:21:52] And I'm going to have to stop development until take is fixed, sadly :/
[19:22:18] Matthew_: It's working now, but -login will need a reboot to notice.
[19:22:27] Ryan_Lane: is there an API request to create new service groups?
[19:22:39] Coren: OK. Thanks :) Is a reboot scheduled?
[19:22:47] (Or shall I ssh in somewhere else?)
[19:23:03] Coren: did you just kill nfs?
[19:23:24] Betacommand: Nope; it's working fine.
[19:23:31] Well, it was 20 seconds ago.
[19:23:45] Betacommand: WFM
[19:23:53] all my bots died
[19:24:03] o_O
[19:24:15] *all* of them?
[19:24:28] well, all of mine died too :P
[19:24:30] but that was like, one
[19:25:08] Coren: all but 1
[19:25:40] 1 BOT TO RULE THEM ALL!
[19:25:43] Betacommand: I just fixed an issue with file permissions on the new NFS server, but I'm surprised it'd affect anything running that wasn't actively trying to change ownership of files -- and even if it did it would have made them /work/, not fail. :-)
[19:26:08] and that's the one that is never affected by nfs issues
[19:26:51] Coren: Thanks again, I hope things get fixed soon :)
[19:27:00] Hm. So definitely NFS related then. How strange; most bots went on without noticing as far as I can tell.
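
The workaround discussed for MartijnH's heap error is to give the grid a virtual-memory allowance big enough for the JVM's address-space appetite while capping the Java heap explicitly. A sketch of what such a submission might look like (the jar name is a placeholder; 2G is the figure that worked above):

    # Ask the grid for 2G of (mostly virtual) memory and keep the real heap small
    jsub -N hello -mem 2G java -Xms64m -Xmx256m -jar hello.jar
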
[19:28:15] * Coren wonders what fate has against you
[19:28:22] * Coren wonders what fate has against you; you always seem to be the hardest hit at the slightest glitch. :-)
[19:28:44] Oooooh! I forgot to stop the bug-betacommand service on reboot. :-P
[19:29:04] (Seriously, though, can you tell why they died from their logs? I'd like to make sure that doesn't happen again)
[19:30:18] Coren: I don't have logs
[19:30:43] Betacommand: You even disabled the grid output files?
[19:31:13] Coren: I deleted them shortly before they crashed
[19:32:31] Coren: One of my bots died too, job was 985288. Also two tasks on a different job mysteriously went into the "error" state without logging what the error was. Then when I signaled that job (job 1240824) to re-exec itself it went missing without logging anything either.
[19:32:50] * Coren is annoyed.
[19:32:58] * Coren grumbles, grumbles.
[19:32:59] Coren: i bet the same thing happened to grrrit-wm too
[19:33:06] The documentation lies!
[19:33:08] nothing in logs
[19:33:27] Sorry about the burp for your tools, people; according to the NFS documentation that should not have happened.
[19:33:44] Coren: docs always lie
[19:33:57] At least the issue is fixed now.
[19:34:05] bugs are normally hidden :)
[19:34:08] well, until next sunday
[19:34:09] :P
[19:34:11] (or not, hopefully!)
[19:34:21] Betacommand: Well sure, if your bug is documented then it's a feature. :-)
[19:34:28] Coren: that's what you said last time
[19:34:57] Betacommand: Nonono. Last time I said it *should* be fixed. Now I can observe that it is. :-)
[19:35:54] Coren: So if they crash again do I win the lottery or something?
[19:36:47] Betacommand: No, but I sure as hell would like to know why; having a (seemingly) random selection of bots fail like this is a serious annoyance.
[19:43:07] Coren: anytime servers are messed with my stuff crashes
[19:47:16] this is just insanity. the script containing "java -version" fails on not being able to reserve enough heap space with 3G (but succeeds (yay?) with 4G).
[19:47:36] MartijnH: ugh, that... sounds wrong
[19:47:41] are there any known issues with JVM memory allocation?
[19:47:49] or maybe not :D
[19:47:49] java
[19:48:21] I've run actor systems on a raspberry pi ffs
[19:49:30] Coren: ^
[19:50:35] MartijnH: There are a number of issues, mostly in the /way/ JREs allocate memory, that make them... painful in a shared environment.
[19:50:50] Coren: .... The system is going down for maintenance in 89 minutes!
[19:51:16] Betacommand: Just tools-login; feel free to use tools-dev in the meantime; I'll not touch that one today.
[19:52:32] Coren, do you know of any tweaking I could do? I could imagine that growing the heap space can be painful, but in this situation even the initial space allocation seems to be failing
[19:52:50] MartijnH: Mostly, it preallocates a huge heap, and preemptively mmaps the entire JRE in the process space. It's a pain. I'll probably have to set up a special environment for java apps.
[19:53:40] MartijnH: There's nothing wrong that I can tell, Java on unices *does* have a nearly 4G map on startup, and there's little to do about it.
[19:53:56] MartijnH: One of the many things to love about Java.
[19:54:46] MartijnH: In the meantime, requesting 5G is okay if you've got just the one app to run. If you need many, I'll move up the java environment setup so that it becomes a bit saner.
[19:55:14] (Notes that vmem != physical memory in use)
[19:55:15] Coren: i had to allocate 2 gigs for each of my java IRC bots….
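
Coren's aside that vmem != physical memory in use is easy to confirm; comparing a JVM's virtual size with its resident set only takes one ps call:

    # VSZ (virtual size) vs RSS (resident set), both in KiB, for running java processes
    ps -C java -o pid,vsz,rss,args
    # VSZ is routinely measured in gigabytes even when RSS is only a few hundred MB
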
[19:55:32] * Coren groans.
[19:55:39] Hate java. Hate hate hate.
[19:55:55] * Coren adds a java grid queue to the list.
[19:56:08] * MartijnH sighs
[19:56:19] oh yeah, and i got a random IRC bot crash too
[19:56:42] rschen7754|away: Yeah, we already established that the NFS change which shouldn't have affected anything did.
[19:56:46] ok
[19:57:49] I was so hoping to do fun stuff with dynamically growing and shrinking an application spinning up small grid jobs here and there
[19:57:59] So I'm going to create a queue with dedicated nodes for java that calculate usage a bit more in line with the funky way Java does things; once every java app has been moved there I'll remove the JRE from the "normal" nodes. Thankfully, most of what Java maps is, in fact, shareable.
[19:58:01] dancing in glorious unison
[19:58:02] Speaking of Java's extremely large vmem usage, I'm reminded of [[wikitech:Nova Resource Talk:Tools/Help#300-350M? Really?]]
[19:59:42] anomie: Indeed.
[19:59:57] Nodejs is bad enough already, Java just takes it out of the ballpark entirely. :-)
[20:00:36] And, just to make things fun, its default preallocated heap is, like, 512M or 1G.
[20:00:51] * anomie wonders why some people are so in love with nodejs
[20:02:16] are there other jvms that are a little more well behaved maybe?
[20:03:29] MartijnH: I think that pretty much all of today's JVMs are based on the same fork, so it's not likely.
[20:03:48] we're running openjdk
[20:03:48] iirc
[20:03:56] yes, openjdk 7
[20:04:04] YuviPanda: You are, but it's about 95% the same as Sun's.
[20:04:10] maybe jamvm is less of a pain, but I'm just guessing there
[20:04:11] You are correct*,
[20:04:20] Coren: MartijnH nailgun might help
[20:04:41] Coren: http://www.martiansoftware.com/nailgun/
[20:04:57] although
[20:04:57] > Before you download it, be aware that *it's not secure*. Not even close.
[20:04:57] heh
[20:04:57] e
[20:05:20] YuviPanda: I'm thinking not; it runs everything in the same VM so no tenancy.
[20:05:21] yeah
[20:05:31] everybody shares the same vm with the same rights with nailgun
[20:06:01] Coren: yeah
[20:06:09] MartijnH: although you can run one nailgun per tool
[20:06:36] jamvm seems more promising, but I don't know how compatible it may be.
[20:06:57] Coren, I've tried it before, and it worked like a charm for me
[20:08:08] it is fully compliant with the spec afaik, and I think that there is very unspecified behaviour if any in the bytecode specification
[20:08:16] very little that is
[20:08:27] Well, it's actually installed by default on our Ubuntu system.
[20:10:46] So 'java -jamvm' should work out of the box.
[20:11:33] (To really tweak the footprint you'll also need to use -Xms and -Xmx )
[20:13:09] I'm going to toy with that a little tomorrow, thanks for helping me try and get this show on the road
[20:47:42] jamvm seems much much better. Still horrible, but much much better
[20:52:00] how do i find out the fingerprint for bastion.wmflabs.org in a more secure way than trust on first use?
[20:57:09] jzerebecki: I don't know that it is published anywhere; I put tool-login's on wikitech but I agree it would be better if the general bastion's were there also.
[20:57:23] Coren: /me grabs some popcorn and waits for his bots to die
[20:57:53] ... why would they die? If they have a dependency on tools-login, they're really broken. :-)
[20:58:16] Coren: you are messing with the servers
[20:58:18] Coren: hey :-) The NFS migration mostly worked for beta cluster. Was just missing some mediawiki extensions git repositories, I simply recloned them
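
Putting Coren's two suggestions from earlier together ('java -jamvm' plus explicit -Xms/-Xmx), launching a Java tool on the grid might look roughly like this (the jar name and sizes are placeholders, not a recommendation):

    # Use the lighter jamvm backend, pin the heap, and request a matching
    # (much smaller) memory allowance from the grid
    jsub -N mybot -mem 1G java -jamvm -Xms64m -Xmx300m -jar mybot.jar
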
[20:58:42] hashar: Yeay git!
[20:58:55] Coren: the only dependencies are in their home dir
[21:00:00] Betacommand: That's not going to be affected; only the actual bastion instance proper is being rebooted.
[21:00:31] Coren: you hope
[21:00:38] Coren: and a tiny issue is that I can't change ownership of files on labnfs.pmtpa.wmnet:/deployment-prep/project though I can apparently change group
[21:00:57] last time you said it wouldn't affect the bots either
[21:01:10] # touch /data/project/foo && chown hashar /data/project/foo
[21:01:11] chown: changing ownership of `foo': Invalid argument
[21:01:13] hashar: That's been fixed; this is why I'm rebooting tools-login now; a simple reboot whenever you are ready will do the same for you.
[21:01:21] ahhh perfect
[21:01:25] Betacommand: I'm not touching the infrastructure. :-)
[21:01:27] that is an NFS configuration ?
[21:01:47] hashar: Yeah, but the client is stupid and only checks at first mount. :-(
[21:01:57] that works for me :-]
[21:02:05] ... and, tools-login restarted without issue.
[21:02:10] Coren: I've seen my bots go down too often to put faith in that comment
[21:02:11] I think we had a similar issue when we migrated beta from Gluster to NFS
[21:02:48] hashar: Remote filesystems are annoying this way; getting ownership right is always a finicky ordeal when you have multi-tenancy.
[21:02:51] !log deployment-prep rebooting deployment-bastion for NFS config fix.
[21:03:10] Logged the message, Master
[21:03:24] !log tools labs-login rebooted to fix the ownership/take issue with success.
[21:03:27] Logged the message, Master
[21:04:21] !log deployment-prep -bastion rebooted, restarted udp2log : /etc/init.d/udp2log stop; /etc/init.d/udp2log-mw start
[21:04:26] Logged the message, Master
[21:04:42] Coren: fixed. Merci beaucoup!
[21:04:50] can someone verify that https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bastion.wmflabs.org is correct?
[21:05:46] jzerebecki: It's the same fingerprint for me, so if it's not correct we've been in trouble for a long time. :-)
[21:05:53] thx
[21:06:46] Coren, I got jamvm working with "just" 750M
[21:07:04] MartijnH: That's *MUCH* more sane.
[21:07:21] well, somewhere between more sane and less insane
[21:07:25] but still
[21:07:48] that buys me a 300M heap size :/
[21:08:21] MartijnH: That's a fair deal better than 1.9G to buy 256M with the default jvm. :-)
[21:08:22] though I haven't actually tried allocating up to the max heap size yet
[21:08:43] MartijnH: Heap size should be nearly 1:1 with VM usage.
[21:09:19] it happily starts with max heap size 2300M when I start the job with 2750M
[21:09:32] but then again, that's *max* heap size, and I hello worlded
[21:10:48] maybe it's a good idea to put -jamvm in the JAVA_OPTS, and make a note on wikitech so that when things break that shouldn't, people know it might be related to the different JVM
[21:15:47] oh, tangentially related, where should I build my stuff? should I just use login-dev for that kind of stuff?
[21:29:28] MartijnH: tools-dev is better for things like builds so that interactive performance isn't impacted for most users. They are otherwise identical.
[21:36:34] hashar: ping
[21:37:17] https://deployment.wikimedia.beta.wmflabs.org/ \O/ No wiki found
[21:38:51] Coren^^
[21:39:36] bah
[21:39:39] I know hashar just did some things; perhaps he's still in the middle of it?
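
For anyone repeating jzerebecki's check: one way to verify a bastion's host key against the published list instead of trusting the first connection, using standard OpenSSH tools (the keyscan itself is unauthenticated, so it only helps when compared with the wikitech Help:SSH Fingerprints page):

    # Fetch the host keys offered by the bastion and print their fingerprints,
    # then compare them by eye with the published page
    ssh-keyscan bastion.wmflabs.org > /tmp/bastion.keys
    ssh-keygen -lf /tmp/bastion.keys
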
[21:40:05] na, that url tends to disappear from time to time
[21:40:06] http://en.wikipedia.beta.wmflabs.org/ works
[21:40:32] hashar: would we have a css-lint?
[21:41:12] hashar: yes. beta *g*
[21:41:16] Steinsplitter: can you file a bug for it please? Will look at it tomorrow
[21:41:34] Steinsplitter: or ping Reedy :]
[21:41:34] yes. it is __not__ urgent. only fyi.
[21:41:44] that is most probably related to a change in operations/mediawiki-config
[21:41:52] mutante: no clue
[21:42:01] mutante: I think we have a test to compile .less files
[21:42:25] mutante: and php Code Sniffer supports some kind of css linting
[21:42:48] alright, no worries, if i want it i'll try myself some day :
[21:43:01] hashar: have a good night, it's late :)
[21:43:34] mutante: still in audio for the next 40 minutes
[21:44:59] ah, hangout? gotcha
[21:45:48] hi there
[21:47:03] Coren: tool creation seems to be broken atm. ‘auth’ and ‘account’ are listed in the service group list, but cannot be accessed via ssh or web
[21:48:46] Hm, when were they created, do you know?
[21:49:20] the first one about 15 min ago
[21:50:23] Ah, the home creation process had wedged while the permission bug was active, I had to remove a lockfile. It's back on and both should be properly created now.
[21:50:40] thanks :)
[21:51:50] Coren: is it possible to delete service groups? I assumed the creation of ’auth’ failed because it is reserved, so I created ‘account’ → account could be removed
[22:02:50] Coren: are replica dbs still offline after the reboot? can't get into enwiki_p or metawiki_p. it was working fine a couple hours ago: "ERROR 2003 (HY000): Can't connect to MySQL server on 'metawiki.labsdb' (110)"
[22:06:00] J-Mo: They were never down, actually, and it works for me.
[22:06:10] (They don't have anything to do with NFS)
[22:06:28] ireas: It needs an op to do it; please open a bugzilla so I don't forget?
[22:06:47] Coren: okay, thanks
[22:09:22] Coren: well, happy news is that it's working again for me too.
[22:10:04] ... not happy news if we don't know why it didn't for a while.
[22:10:25] I see nothing on the ganglia to show they might have been unavailable. Oddness.
[23:31:13] Hi.
[23:31:21] What's wrong with S3?
[23:31:31] http://tools.wmflabs.org/sulinfo/sulinfo.php?username=MZMcBride
[23:31:33] I get an error there.
[23:31:35] It reads:
[23:31:48] >
[23:31:57] SQL Errors
[23:31:57] warning Warning : The SQL server s3 is down or having issues. Consequently, the following wikis won't be displayed : sdwiki, scwiki, scnwiki, scowiki, scnwiktionary, sdwiktionary, sewiki, sewikimedia, sgwiki, sgwiktionary, rmywiki, shwiki, shwiktionary, sawiktionary, sawiki, sawikibooks, rowikisource, rowikiquote, rowikinews, rowikibooks, roa_tarawiki, roa_rupwiktionary, roa_rupwiki, rnwiki, rmwiki, rowiktionary, rswikimedia, sahwiki, rwwiktionar
[23:32:04] >
[23:36:25] Elsie: at least it appears to be "labs-only"
[23:36:42] dbtree in prod looks normal
[23:37:12] curiously, on top it says S3, but further down S7
[23:37:15] Due to an issue on Wikimedia Labs s7 databases server, these informations cannot been displayed.
[23:43:22] mutante: I figured prod was fine. ;-)
[23:58:51] Hello.
[23:58:57] Can I get added to bots-labs?
[23:59:00] And/or wm-bot?
[23:59:02] petan: ^^^
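
To reproduce the kind of replica check J-Mo was doing from a tool account, something like the following should do (assuming the usual per-tool replica.my.cnf credentials file; the host and database names are the ones from the log):

    # A failing connection gives the same ERROR 2003 (HY000) seen above
    mysql --defaults-file="$HOME/replica.my.cnf" -h metawiki.labsdb metawiki_p -e 'SELECT 1;'
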