[00:04:59] Change on mediawiki a page Wikimedia Labs/Tool Labs/Needed Toolserver features was modified, changed by Fastily link https://www.mediawiki.org/w/index.php?diff=660792 edit summary: [+317] /* Bots project */ + [06:47:44] !log account-creation-assistance Altered security group 'default' config, changed rule 'from 80 to 80 proto tcp ip 10.4.0.54/32' to 'from 80 to 80 proto tcp ip *10.4.0.0/21*' to fix issues with accessing web services on accounts-application [06:47:48] Logged the message, Master [09:02:13] Coren ping [09:02:48] is there a way to manually drop a queue for certain host for specific time? [09:06:52] addwork ping [09:57:38] !log wikidata-dev wikidata-testclient: Pulled merged changesets in puppet from gerrit [09:57:42] Logged the message, Master [10:14:49] !ganglia [10:14:58] @search ganglia [10:14:58] Results (Found 2): load, load-all, [10:15:02] !load-all [10:15:02] http://ganglia.wikimedia.org/2.2.0/?c=Virtualization%20cluster%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [10:15:05] eh [10:15:10] where is labs ganglia [10:18:48] lol addwork [10:18:57] 750 jobs within 10 minutes? [10:19:02] come on :D [10:19:33] fickin queues [10:19:43] I need to find out how to disable them [10:21:19] and more are coming... [10:21:34] great [10:24:09] petan: i think its his cron? [10:27:20] yes [10:27:25] Beetstra hi [10:27:42] Beetstra your bot is eating lot of ram on bnr1 :o I don't really care but right now there is about 12gb of memory used [10:27:53] as it require swap to be used, it slow down the box [10:28:06] so, just FYI your bot is probably very slow now [10:30:41] hmm .. [10:30:46] let me look [10:31:24] (I actually thought it was going pretty fast) [10:33:10] hmm ... [10:33:30] coibot is eating memory .. ? [10:35:30] I don't know... [10:35:42] there is one process eating about 3g and dozen of small processes [10:36:00] I saw that coibot had about 7.7G ... [10:36:09] I mean, the main bot, 'coibot.pl' [10:36:29] hey, -bnr3 ..!
[10:36:36] I have restarted the bot [10:37:12] aha [10:38:04] no wonder it put it to bnr3 given the load on bnr1 [10:38:10] see qtop [10:39:15] I'll slow down the linkwatcher a bit [10:39:28] @notify Merlissimo [10:39:28] This user is now online in #wikimedia-tech so I will let you know when they show some activity (talk etc) [10:39:47] Beetstra, you shouldn't need to - there is no problem with load, rather with memory [10:40:57] well .. that memory-load is due me running a large number of LinkParsers on the linkwatcher, which each eat a piece of memory [10:41:11] And the more linkparsers, the faster the bot [10:41:11] aha [10:41:11] but maybe this is overdone :-) [10:52:17] linkwatcher is a bit slower now (munching the backlog at 1 file every 5 minutes in stead of munching 1 file every 3 minutes) [10:53:20] ok as long as that is ok to you :P [10:54:12] !log bots petrb: removed exec from gs [10:54:18] Logged the message, Master [10:54:25] It will just take longer [10:54:40] !log bots petrb: found log files for master server in /var/spool/gridengine/qmaster yaylog removed exec from gs [10:54:41] Due to my good friends Legobot and Addbot I have a backlog of several thousands of files :-p [10:54:42] Logged the message, Master [10:54:58] actually its just addbot now [10:55:10] legobot stopped editing ~10 hours ago due to a bug i havent fixed yet [10:55:14] well [10:55:16] heh [10:55:23] interwiki removal edits [10:55:30] it still does a bunch more stuff :P [10:56:13] But as long as it is munching backlog files .. [10:56:26] ah [10:56:27] If it starts writing backlog files again, then it means that it can't keep up with real time [10:56:48] Don't worry, legoktm .. I'll still blame you ;-) [10:57:30] heh [10:57:44] ill start have to categorizing my scripts as "beetstra-compatible" and not :P [10:58:39] Well, legoktm .. in a way you should blame the spammers. 
If they would not be there, we would not need to monitor who is adding links [10:59:12] SGE is really dumb thing [10:59:34] for some reason it put your bot to worst possible candidate instance :P [10:59:55] your bot was eating shittons of ram so that new instance was put to instance with lowest free memory [11:00:01] heh [11:00:16] I expect you will get to same no memory troubles soon [11:00:36] I don't know, maybe it was due to the load on the box [11:00:44] yes but bnr2 would be best [11:00:45] I have never seen memory-hole-problems with coibot [11:01:23] I'll try to keep an eye on it [11:03:34] Beetstra ok right now usage is 7.3g and ram is 7.8g [11:03:35] :P [11:03:44] that box will run out of memory within few minutes [11:03:50] see qtop [11:04:02] not blaming ur bot, this is rather mistake of SGE [11:04:13] that box had low free ram before you started your bot [11:04:21] it should have never been submitted there [11:04:43] !log wikidata-dev wikidata-dev-9 Set all logrotate config (except the puppet-managed ones) to daily and rotate 2. Submitted bug 46259 to limit size of dispatchChanges.log (on repo). [11:04:47] Logged the message, Master [11:05:51] meh .. what does coibot have today [11:06:24] Nooooh .. that is not coibot fault .. coibot is only using 174 (plus all the modules) [11:06:40] addbot is using 210m per module, and glusterfs is using 721 [11:06:51] meh [11:07:06] Beetstra I told you it's not a problem of COIBOT [11:07:23] that box was already overloaded with other stuff [11:07:34] 7.4g :o [11:07:38] WOW .. addbot has 991 processes running [11:07:42] indeed [11:07:58] and I was only running 105 linkparsers before .. [11:07:59] wow [11:08:59] it insists to put coibot on bnr3 [11:10:16] petan .. to run coibot -> '/home/beetstra/coibot.sh' .. in case you manage to change tell it to run somewhere else and I am not around to restart - feel free [11:10:32] petan: Did you set up the virtual_free config and it is still failing? [11:11:00] scfc_de where? 
I was googling a lot and couldn't find how to set up max memory usage [11:11:18] I added h_vmem to load_thresholds [11:11:25] petan: Didn't I post a link yesterday/this night? [11:11:29] !logs [11:11:29] logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs [11:11:30] and it just was producing some errors [11:11:41] uh yesterday night I wasn't here [11:12:26] I need a config that will do: if memory is over 6gb - don't submit jobs to this box [11:13:05] Then you have an impostor :-). http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20130317.txt: "[17:54:21] petan|wk: http://jeetworks.org/node/93" [11:14:51] scfc_de: error - unknown attribute virtual_free [11:14:55] doesn't work :/ [11:15:20] oh nvm [11:20:13] ok scfc_de now the configuration worked, but nothing really happened, I don't know what is it supposed to change though [11:20:17] there is no dropped queue [11:20:25] but maybe it changes behaviour of submitting? [11:20:40] I changed mem_free instead of virtual_free [11:20:48] because I don't want swap to be used [11:22:46] yay I found it [11:22:54] what we need is mem_used in thresholds [11:23:34] Beetstra I fixed teh memory problem :o [11:23:46] if you restarted your bot now, it would for sure wouldn' [11:23:50] t go to bnr3 [11:23:57] but it 's fine now [11:24:00] keep it there [11:24:11] running addwork's task's will end and memory will free [12:40:04] petan: Pong. [12:40:11] Cheers, petan [13:00:56] Coren I figured out how to do that, but I will need to write it somewhere because it was really hard to find it :D [13:01:04] I needed to set up maximal memory per node [13:01:24] it's load_thresholds mem_used=SIZE [13:01:31] No, that' [13:01:53] that will stop submitting jobs to node when certain memory is reached [13:01:57] that'll just set /preference/ for what node to use, it won't prevent a process from eating it all. 
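[Editor's note] The per-host threshold petan arrives at above ("load_thresholds mem_used=SIZE") would be set on the queue configuration roughly as below. This is a hedged sketch: the queue name `main.q` and the 6g figure are illustrative, not taken from the channel.

```shell
# Stop scheduling new jobs onto a host once its used memory crosses a
# threshold, as petan describes. Queue name and size are assumptions.
qconf -mattr queue load_thresholds mem_used=6g main.q

# Hosts over the threshold then show their queue in alarm state:
qstat -f -explain a
```

As Coren points out just below, this only changes scheduling *preference*; it does not stop an already-running job from eating all the memory.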
[13:02:03] of course [13:02:07] that's what I needed [13:02:13] I wanted to change the preference [13:02:15] You need to set h_vmem as consumable. [13:02:24] yes I know but that's not what I want to change [13:02:30] that's already set to 4gb [13:02:36] o_O [13:02:44] Beetstra's bot is eating more than 4gb [13:02:53] if you have that less than 4gb his bot won't work on tool labs [13:02:56] which one? [13:03:02] that one you restarted today [13:03:06] he was about 4gb [13:03:08] coibot? [13:03:09] * it [13:03:11] yes [13:03:12] probably [13:03:14] one of them [13:03:23] that one eating 46% of ram [13:03:30] You can't do that; you'll never have a stable grid. You simply need to set h_vmem on the nodes, and make it a consumable resource. [13:03:35] (I didn't restart anything) [13:03:41] Coren that's what I did in past [13:03:56] Let coibot request the memory it actually needs. [13:03:56] Coren but what if you have multiple small jobs which together eat a lot of ram huh? [13:04:08] petan: Then they need to set their h_vmem accordingly. [13:04:22] Coren imagine you have 1000 jobs on one exec node, each eating 100mb of ram [13:04:24] what would you do [13:04:34] from 1000 users [13:04:44] 1000 different, separate tasks with separate limits [13:05:02] you need to set up some global limit / per instance [13:05:03] If they have their h_vmem set properly, then nothing breaks. If you don't have enough resources, they'll remain queued. [13:05:13] Coren what is "proper" [13:05:18] what is your value for h_vmem? [13:05:34] how much [13:05:40] ... what? I don't have a value for h_vmem; the *jobs* do. [13:05:47] of course I mean in config [13:05:52] If they need 1G, they set it to 1G [13:06:05] Coren but these jobs don't know that? [13:06:15] howcome job can know how much memory it will need [13:06:20] that's something what changes on demand [13:06:22] petan: ... what? [13:06:22] coibot is only using 300 mb over 9 processes [13:06:48] petan: You have to set a limit to your job. 
If you really don't know how much, then you need to set the limit high and check how much you really used. [13:07:02] petan: and reduce accordingly. [13:07:09] that may not be possible for all bots [13:07:39] look at Beetstra's bot, today one of his processes was eating about 4gb of ram and it never did before [13:07:47] linkwatcher something like 2G over ~100 processes [13:07:49] it's just totally random [13:08:06] based on size of task the bot is processing [13:08:15] which is different every time it run.. [13:08:41] What I think happened with COIBot today, is that because the load was high, it could not cope with something and started storing things in memory (filling an array or something). [13:08:50] Coren of course, if you had jobs which do know how much ram will they need then it would be pretty easy to manage them [13:08:54] even without swap [13:09:09] petan: If the bot has been written in a way the ram it uses is unbounded, then it has a bug. You have two choices: limit the jobs and keep the grid stable, or don't limit the job and allow any one of them to bring a whole node down. [13:09:24] petan: That's why you put _hard_ limits [13:09:26] Coren but that's like 90% of all bots which we are hosting now [13:09:40] and I believe that it's same with other bots we do not host yet [13:10:22] Coren there is third option too - don't limit the job and configure the node so that it can't be easily brought down :P [13:10:41] petan: That option does not exist. [13:10:47] if you set a per node limits as well, the SGE will just stop submitting jobs to that "ill" node [13:10:54] of course it exist that option we are using now... [13:11:08] petan: No, that's the option that you are attempting to use, and failing. 
[13:11:19] well I have not failed yet in that [13:11:22] it works [13:11:49] if I configured SGE as you suggest result would be SGE killing all bots we have in loop [13:11:50] petan: If *any* job is brought down because of the behaviour of another, you have failed. [13:12:03] Coren but that's what never happened to me yet [13:12:13] no job yet crashed [13:12:15] petan: No, it would result in bots going over their limit being unable to allocate more memory. [13:12:24] petan: On Toolserver, Bots *must* specify how much memory they need, and they do, and it works. [13:12:30] being unable to allocate more memory = crash [13:12:31] scfc_de: That. [13:12:53] petan: If a bot crashes because it is unable to allocate more memory, then that bot is buggy and needs to be fixed. [13:13:01] scfc_de when I was running bot on toolserver I didn't, but it wasn't using SGE anyway [13:13:27] petan: Sorry, you just didn't *notice* the hard memory limit. [13:13:29] Coren you should then announce that somehow so that all people have time to fix their bots then.... [13:13:34] Coren: if I configure a job with a limit, and my bot reaches that limit, will the job be killed? [13:13:42] Darkdadaah yes [13:13:53] petan: Now you have to because rogue bots were bringing the house down :-). [13:14:02] Darkdadaah: Further mallocs will fail. If you have error checking in your code, then you'll be able to handle it the way you want it. [13:14:04] while Beetstra would limit the amount of ram used by the bot, you can do little if you allow an user to run hundred of job in parallel [13:14:06] scfc_de well, I moved the bot to labs ages ago [13:14:10] !log wikidata-dev wikidata-dev-9: All Wikidata log files are now in /var/log/wikidata-*.log files. I'm trying to have them managed by logrotated. [13:14:13] Logged the message, Master [13:14:28] Ah ok, thanks. [13:14:35] phe: Why? Every job has its own memory limit; SGE knows how to add. 
:- [13:14:38] phe: :-) [13:14:55] Coren anyway I don't see what is wrong on setting per node limits as well, it will hurt nothing and prevent unexpected troubles [13:14:59] petan: https://wiki.toolserver.org/view/Rules 9.5 [13:15:31] Coren your current explanation of how stuff should be configured seems to me like it would work in "ideal world" only, but not everything is ideal in fact [13:15:54] like there is no reason why SGE should be allowed to submit jobs to node that is OOM [13:16:00] phe, only if SGE settings is sane :) [13:16:09] -phe +Coren [13:16:22] petan: You are lacking a critical requirement: "a bot that is running must not be brought down by another bot's behaviour". It's simple. It's clear. [13:16:23] I know it should never happen for that box to be OOM if the world would be idal [13:16:25] * ideal [13:16:37] Coren of course - that's what I follow [13:16:45] it also never happened to me that it would happen [13:16:45] petan: There is also no reason why a job should be allowed to bring a node down. Ever. [13:16:59] petan: Really? You never had to kill a job? [13:17:06] Coren not on this grid [13:17:40] * Beetstra points at legoktm ;-) [13:17:42] that box was already overloaded with other stuff [13:18:15] ill start have to categorizing my scripts as "beetstra-compatible" and not :P [13:18:23] Those are things that should never happen. [13:19:17] * legoktm was joking :P [13:20:13] phe: Indeed. [13:20:32] phe: And I've been trying to explain to petan what needs to be done so that assertion is true. :-) [13:20:42] !log bots petrb: everywhere iptables -A OUTPUT -p tcp --dport 25 -j REJECT [13:20:44] Logged the message, Master [13:21:26] what about dedicating an instance for addshore (and legotkm ? 
I've not followed the story) so they can only kill themselves :) [13:21:33] * Beetstra starts studying BSD::resource [13:22:02] phe: i think this is only a temporary problem, our jobs should be done by the end of the month [13:22:29] Right now, on tools-, if your process runs and is bounded in its memory use it will keep running until the VM is brought down. If it grows unbounded, it will not affect other jobs. If other jobs go berkerk, it will not affect yours. [13:23:08] In an environment where any community dev can run things they write, nothing else will do. [13:24:07] but you'll have trouble with sane SGE settings, bot will need to overestimate the amount of resource they need and instance will tend to be underloaded [13:25:54] phe: Possibly. They are VMs, though, so "underloaded" is not a concern. Better underloaded than unstable. [13:25:54] phe: And bots who requests too many resources get penalized when they try to run; so the maintainers are encouraged to reduce it as much as they safely can. [13:26:14] phe: Something like an IRC bot with modest memory requirements will run easily, at high priority, and be rock stable. [13:26:23] !log bots root: created /home/addshore/.forward [13:26:25] Logged the message, Master [13:26:34] right, and you can add more instance to fix that, the way sge works, you'll have probably less potential trouble with a greater number of small instance than a small number of big instance [13:26:46] phe: The memory hogs have a choice of request more and be penalized accordingly or split the task up in smaller, predictable chunks. [13:26:59] phe: Exactly. [13:27:25] Coren but legoktm was joking about his bot producing job for Beetstra's bot on wiki [13:27:38] (beetstra's bot need to check every edit of legoktm's bot) [13:27:45] he wasn't talking about system load [13:27:59] petan: I know, by point was about Beestra's growing to accomodate the extra load. [13:28:02] my* [13:28:10] petan: It's an externality. 
[13:28:21] petan: You cannot let externalities bring unrelated jobs down. [13:28:25] but if beetstra accomodate extra load - that node will just stop being used for any other jobs than beetstra's [13:28:34] so everything will be ok [13:28:46] that is actually what happens now [13:28:54] petan: Really? You have a mechanism in place where other jobs that were *already* running on that node will be migrated as resources diminish? [13:29:37] well, that's what I deemed impossible in past when I was talking with you regarding OOM, but I didn't know you want to know how much memory jobs will need before they are actually started [13:29:55] that of course will work (your proposed solution) but it will waste lot of system resources too [13:30:13] because some bots will ask for more than they will actually use [13:30:24] which will result in lot of unused memory [13:30:26] petan: Yes. That's exactly what you want. [13:31:17] * Beetstra jokes: "I'll make the bot ask for 4G per process .. even if I use only 25M ;-) [13:31:58] Beetstra: And that'll work. You jobs are going to be rock solid, but you're not going to like the result since your jobs will run at a big penalty, and won't have many slots to run in. [13:32:04] Coren that's a question... in case there will be HUGE number of small bots, there will be huge amount of memory that will not be used [13:32:16] I know, Coren [13:32:29] which isn't ideal either [13:33:00] petan: I think if memory is that important, maybe we should instead fix Beetstra's bot not use 4 GB :-). [13:33:07] petan: It's a minor problem at best. The bots keep running, which is what counts. If users go overboard, you teach them the joys of pacct. :-) [13:34:00] petan, it's a choice between underloaded instance, and "OOM can occur at any time" [13:34:06] petan: Requirement 1: jobs stay up. If a job is hindered by the infrastructure, then the infrastructure is wrong. 
[13:34:38] Coren well, but in this very case - beetstra's bot would crash on your configuration, while on mine it survived :P [13:34:40] petan: Corollary: if a job is hindered by an other job, the infrastructure is wrong. [13:34:51] Coren because it would eat more memory it asked for [13:35:04] which of course you can consider his problem.... [13:35:14] * than it asked for [13:35:30] petan, SGE in the toolserver already ask to provide an upper bound for memory usage [13:35:36] grmpf, BSD::resource is not installed [13:35:39] it doesn't seem like a problem [13:35:40] Platonides yes I was told [13:35:46] Platonides but I wasn't using SGE at all [13:35:48] on toolserver [13:35:59] petan: If his bot is written in such a way that it is unable to limit its memory use, it will crash *anyways* even if it were alone on a node as it consumes the resources. The difference is only "what else will it bring down" [13:36:07] well, I admit it's a bit annoying [13:36:17] Platonides, sge virtual_free is ignored on the TS for all linux box :) [13:36:24] petan: Which is why there was a rule put in place that requires SGE use. [13:36:53] phe, is it? [13:36:57] Beetstra: What's your bot written in? [13:37:02] Coren fair, but I am still afraid lot of resources will be wasted [13:37:08] Platonides, it's ignored on linux box 'cause java process require to much memory and never start if virtual_free is used [13:37:09] we'll see [13:37:10] coren: perl [13:37:20] petan: Then you are just worrying about the wrong things. [13:37:33] btw Coren I didn't need to provide information about memory when I was starting job on your cluster [13:37:51] Beetstra: How are you growing unbounded, titanic hashes? [13:37:51] phe, don't use java :P [13:37:52] Coren: things that are paid from donations :P [13:38:07] petan: That's because I have a sane default at 256M that works for most things. [13:38:20] petan: A TB of RAM costs, what? :-) [13:38:24] Coren: ? 
[13:38:29] Beetstra: Is the source available somewhere? [13:38:29] petan: Bttz. Virtual things that run on things that are paid from donations. :-) [13:38:38] in my directories .. [13:38:55] petan: Could you add me (scfc) to the Bots project? [13:39:17] Beetstra: No, I mean, if your perl program can grow to very many GB of size, either it has big-ass hashes, or you never undef stuff when you're done with it. :-) [13:39:44] or I am push-ing stuff in an array that gets never emptied [13:40:05] Beetstra: That too, but then that's very predictable as a behaviour. [13:40:10] Like 1400 edits per minute through legoktm-bots ... [13:40:14] Beetstra: You can plan accordingly. :-) [13:40:29] well, it was .. it was running at 600 edits per minute without problem .. and then there were lego and addshore ;-) [13:40:35] Beetstra: What's that array? A queue of "stuff to do"? [13:40:41] yep [13:40:57] and .. the bot is smart enough that when the queue gets too big, it saves the queue as a backlog and empties the queue [13:40:59] !log bots inserted scfc as member [13:41:03] Logged the message, Master [13:41:03] well, linkwatcher, that is [13:41:03] Redis? [13:41:07] petan: Thanks. [13:41:11] yw [13:41:14] COIBot never had that problem to worry about [13:41:36] Beetstra: That seems like a good way of doing things that should keep memory usage to fairly stable levels. 
[13:41:44] Beetstra, then I don't know why it is having problems [13:41:45] It does [13:41:54] Beetstra: as a tangent, if you do have some time and the data is still available, i would be interested to know how fast my bot was editing globally :P [13:42:13] Moreover, the sub-processes in the bot respawn themselves after so much work [13:42:32] Platonides because only one of his bots were doing that :P [13:42:39] Platonides: COIBot had problems with the massively increased editing speed - from 600 to >2000 edits per minute [13:42:42] Platonides the second one wasn't saving to disk [13:42:43] Beetstra: That's probably even overly paranoid, but ironclad. [13:43:26] Coren: the main bot even kills sub-processes when it detects one is hanging .. [13:43:48] Beetstra: Yeay! Robust programming. :-) [13:43:55] Beetstra, you may want to have an ignore list, populated with other bots [13:44:13] Platonides .. that is what I now have [13:44:37] Coren there may be some cases when this behavior of SGE would be counter productive, such as in case of wm-bot but that is not going to use SGE soon anyway so I don't care [13:44:41] Coren: but no memory limits .. so not completely robust [13:44:53] when hard drive breaks, wm-bot start using operating memory as alternative storage [13:45:01] and empties it once hard drive is remounted and working [13:45:11] in that case if SGE killed it, many people would become evil [13:45:17] especially sumanah [13:45:18] :P [13:46:37] and virtual hard drives breaking is happening quite often on gluster - and TBH, can happen to any kind of storage [13:46:47] petan: You're making a number of unwarranted presumptions, the first of which is that a box where the hard disk failed would suddently grow infinite amounts of ram. 
[13:47:09] petan: It would run out of ram eventually *anyways* [13:47:10] Coren of course not - but when usage slowly grow up - I can smell troubles [13:47:21] check what is going on - remount disk and {{fixed}} [13:47:40] in case of SGE-terminator my task would probably die before I noticed [13:47:52] petan: ... why? [13:47:54] Coren I am talking about growing like 5MB per day [13:48:04] that wouldn't kill the box OOM withing few minutes [13:48:05] petan: 5M/day == minuscule. [13:48:21] of course if I didn't fix it, the box would die OOM in few years [13:48:29] petan: Just start the job with a healthy padding. Splurge! Give it an extra 100M! [13:48:39] that's possible? [13:48:49] petan: ... of course! [13:49:05] is that even possible for SGE to "grant extra memory in case process is getting out of it + send email to operator to fix it" [13:49:13] it'll better to fix glusterfs than to work-around at application level :) [13:49:29] phe: yes but it can happen with any storage in future [13:49:48] petan: Well, not that way. What you want to do is give it an extra 100M to begin with, and set a soft cap 100M below. Then you can handle the soft cap any way you want. [13:49:49] bots which are supposed to literally never crash - needs to count with breakages of any kind [13:50:20] petan: But wouldn't it be better for the *bot* to give a warning the minute it looses its disk? [13:50:30] petan: After all, /it/ knows. 
[13:50:47] Coren yes but what I mean is if this process could become automatic somehow, so that in case some process would be getting OOM, it would get extra "grace memory" and email would be dispatched to operator [13:50:56] so it can be fixed without need to kill anything [13:50:59] petan, as a fallback if it loses the disk where it is saving its data, make it email that 2GB file to the admin with instruction on where should it be placed ;) [13:51:19] Platonides yes that's 3rd alternative actually :P [13:51:28] petan: Like I said, it works the other way around. You can never go above your hard limit, but you're perfectly allowed to set a soft limit some margin below that and act accordingly. [13:51:52] Platonides 4th is connect to irc and post the RAW data to irc channel so that someone can grab them and fix it [13:52:09] I.e.; give 1G as your hard limit, but set a soft limit at 750M and handle it whichever way you choose. [13:52:11] 5th is sending e-mail to barrack obama [13:52:19] you would have problems with irc flood limits [13:52:26] Platonides 1 message per second of course [13:53:06] Coren is it possible to hook some script to soft limit [13:53:21] like execute: warning.sh when soft limit [13:53:46] petan: Not globally; what happens when you hit your soft limit is that the job gets sent a SIGUSR1 [13:53:54] ah [13:53:57] petan: The job can then handle it the way it wants. [13:54:01] well but wrapper could handle that [13:54:14] some jobs can't even read signals [13:54:21] (these written in cross-platform languages) [13:54:26] not all platforms have signals [13:54:32] petan: Sure, you can handle the signal and for the "actual" job. [13:54:35] but wrapper script can do that [13:54:35] python and signal are a pain for example [13:54:56] s/for/fork/ [13:55:08] petan: The point is, there is a mechanism to warn. 
[13:55:11] phe, but you can do it [13:56:06] Platonides, if you want reliabilty you need to handle EINTR in many point of the application, most people will not get the point [13:56:34] * Beetstra needs to figure out how to use setrlimit .. tomorrow [13:57:44] oh .. maybe I did :-) [13:58:51] Beetstra: I can't login to bots-liwa, and on -gs, /home/beetstra yields "Permission denied". Is the source available somewhere /else/? :-) [13:59:19] then .. no, don't think so [14:00:02] scfc_de .. you want to dig through >10000 lines of codes for linkwatcher ;-) [14:01:13] Beetstra: Well, no :-). But it should be possible to see if for example instead of the back log you could just submit an SGE job and let the grid handle it. [14:01:46] The bot handles its own backlog when it has nothing better to do .. [14:02:00] like now: Loaded a backlog edit file from backlog/10539.edits. [14:02:36] Beetstra: Yes, you made that clear. [14:02:46] Beetstra, if you use subprocess, here a way to setrlimit : https://svn.toolserver.org/svnroot/phe/trunk/ocr/ocr.py [14:03:01] ohoh .. python [14:03:25] Beetstra, man ulimit for doing it from bash [14:03:27] your bot is not python ? [14:03:37] nope, perl [14:03:52] oops ;( [14:03:54] but in perl it is: [14:04:01] use BSD::Resource; [14:04:02] setrlimit('RLIMIT_RSS', 30000000,50000000) or die $!; [14:04:04] I think [14:04:34] * petan would put all people who don't use c++ in jail [14:04:52] starting with myself :> [14:04:59] petan, a BSD one? :) [14:05:16] But the point of using a grid is that you don't have to handle backlogs, rlimits & Co. yourself, but just specify what you need and let "the cloud" work it out. [14:05:18] Platonides lol [14:05:33] .. my wife does not use c++, my parents do not use c++, my sister does not use c++ .. I do not use c++ .. we will all be nicely together in jail [14:05:36] Platonides guantanamo one [14:05:41] you were asking for that pun line [14:05:47] .. is going to be fun .. 
putting my wife and my mother together in jail .. [14:07:03] mdale doesn't use c++! [14:08:10] * PissedPanda doesn't use C++ [14:08:28] friggin' KSA firewall .. [14:08:28] * petan puts Panda to Pissed Prison [14:08:42] OK, will work further on this tomorrow [14:08:53] * Beetstra sees you all tomorrow [14:09:07] I am still wondering how addshore is scheduling his bots [14:09:30] btw Coren looking on current usage, I think you will need to make like 1000 instances to run just addshore's bot :P [14:10:36] petan: Another bot to fix :-). [14:10:54] scfc_de I think it's by design [14:11:01] his bot is just resource expensive [14:11:07] does creepy amount of edits [14:16:26] petan: What does the bot do? [14:16:35] I have no idea [14:16:56] I just know he was telling us it's going to eat tons of resources once he start it :> [14:16:59] it remove interwiki links on many many wiki at a very high rate [14:17:09] aha [14:17:12] now we know :) [14:17:39] I was suspecting it being just running lot of while(true){} statements in parallel [14:18:17] phe: As part of moving the interwiki links to Wikidata? Then that should be finished in ... some time? [14:18:28] petan: That's actually a job that's massively parallelizable. I'll try to help him to make it so. [14:18:34] scfc_de yes like in 20 years :D [14:18:47] Coren but he already did it [14:18:56] Coren he's running thousand of small processes [14:19:15] petan: Oh, cool. That's /trival/ to run on a grid then. [14:19:25] scfc_de, yeps, but I dunno if it was a good idea to populate wikidata and to remove iw links at the same time [14:19:26] yes it is, but you will need lot of instances anyway [14:19:51] petan: That's the whole /point/ of having a grid, petan. Adding exec nodes is easy. 
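[Editor's note] The bash counterpart of the BSD::Resource/setrlimit approach discussed above ("man ulimit for doing it from bash") is a subshell wrapper; the 1 GiB figure is illustrative only.

```shell
# Cap a job's address space from bash instead of calling setrlimit()
# inside the bot. ulimit -v takes KiB, so 1048576 KiB = 1 GiB.
(
    ulimit -v 1048576        # limit this subshell and everything it execs
    ulimit -v                # confirm the limit children will inherit
    # exec perl linkwatcher.pl   # the wrapped bot would start here
)
```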
:-) [14:19:53] Coren 1 process reserve approx 300MB of ram * 1000 = 300 000 MB ram needs to be reserved [14:20:08] yes but it's wasting of resources [14:20:22] it already works just fine on this small grid we have but it's /insecure/ ofc [14:20:22] petan: Well, if they run all at the same time. [14:20:28] they do [14:20:53] it could have use python thread, no fork at all nor separate process [14:21:07] phe it's not written in python :D [14:21:15] pff ;) [14:21:17] Wait, you're making no sense. If they use 300M of ram, then you'd have to have 300G of ram to accomodate them. You don't have 300G of ram. [14:21:18] not everything actually is written in that :P [14:21:47] Coren no I didn't say he use 300MB but he should probably reserve 300 MB [14:21:51] phe: Threads are rarely the good solution. For something that's actually loosely parallelizable, you want jobs [14:21:53] because some of his processes are hitting that [14:21:55] some are not [14:22:22] his processes use from 10mb - 300mb of ram randomly, mostly less [14:22:51] Why do they need that much amount of memory anyway? Shouldn't it be nearly static? [14:23:24] petan: Then he's splitting his jobs on odd criteria and it should be easy to spread the load more evenly. [14:24:07] petan: (If we want to be smart about it an maximise usage, surely he can predict from the input data whether it's a "small" or "large" job and allocate accordingly) [14:24:38] @notify Addshore [14:24:38] I will notify you, when I see Addshore around here [14:24:41] petan: If he does so, then his work will finish all the faster for it. [14:28:56] Ah, okay, he reads all page titles from the database instead of iterating over it. That would explain the differences. [14:29:01] petan: IMO, the primary problem is that you are thinking of the labs as "free VPS hosting". It's not. It's a resource for running community-developed tools the projects rely upon in a stable environment. 
This /does/ mean that those tools need to be written with stability in mind, and within some strictures. [14:30:37] petan: The end result, however, is a Good Thing. My job is (inter alia) to be there to give the maintainers the help they need for things like keeping their tools' resource usage stable, and helping them code for things like soft limits, checkpointing, etc. [14:36:29] "scfc@bots-gs:/data/project/beetstra/linkwatcher$": Yeah, found it! "less linkwatcher.pl": "linkwatcher.pl: Permission denied". *Argl* :-). [14:40:35] Ryan_Lane: master, what happened to my wiki s? Not Found [14:40:37] The requested URL /wiki/Main_Page was not found on this server. [14:40:51] http://openid-wiki.instance-proxy.wmflabs.org/wiki/Main_Page [14:41:44] Nikerabbit: ? [14:41:55] Krenair: ? [14:43:49] hashar: ? [14:44:42] Hmmm. Addbot seems to be started every two minutes for enwp. If I understand the workflow correctly, it shouldn't harm to reduce that to once per hour or something like that. [14:47:44] scfc_de probably not, but one of goals as I understood them was: not change the tasks, but the cluster so it fits to need of bot [14:48:00] so addshore shouldn't need to change anything [14:48:33] petan: That's a worthwhile objective, but only so far as it doesn't compromise the other requirements. [14:48:43] which? [14:48:55] please don't ping random people [14:49:04] number of instances? :o [14:49:12] petan: The goal is not "no change" but "as little change as possible". If it turns out that as little as possible /is/ no change, then yeay! :-) [14:49:12] is there a requirement for that? [14:50:57] petan: Look at http://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Design [14:51:03] ok [14:51:22] Wikinaut: Is your instance running? [14:51:33] yes. 
i think php is broken [14:51:35] apache2: Syntax error on line 210 of /etc/apache2/apache2.conf: Syntax error on line 1 of /etc/apache2/mods-enabled/php5.load: Cannot load /usr/lib/apache2/modules/libphp5.so into server: /usr/lib/apache2/modules/libphp5.so: cannot open shared object file: No such file or directory [14:51:43] php5 [14:52:05] Wikinaut: Is modapache-php5 installed? [14:52:19] moment [14:52:39] Wikinaut: Sorry, I mean libapache2-mod-php5 [14:52:46] (I know) [14:52:51] one moment [14:53:01] Coren yes I agree with most of that. But I am curious how many changes will need to be done once people start using it [14:53:26] was missing ! [14:53:26] + I am curious about utilization of these exec nodes [14:53:28] why? [14:53:29] petan: That's a one-off problem. New tools are clearly not going to be a concern, since they'll know offhand. [14:53:32] puppet ?? [14:53:40] Wikinaut: Do you use pupper? [14:53:47] *puppet [14:53:54] * Reloading web server config apache2 N: Ignoring file '20auto-upgrades.ucf-dist' in directory '/etc/apt/apt.conf.d/' as it has an invalid filename extension [14:54:02] petan: As to existing tools, I expect it will vary a lot from tool to tool. [14:54:05] ^see message after apt-get [14:54:12] petan: UTRS, for instance, I expect can run without change. [14:54:14] <^demon> Wikinaut: That wouldn't break apache. [14:54:26] Coren well, it will be quite harder for developers to make their bots in a fashion "I need to know how many resources I will eat as much precisely as possible" it will basically kill all dynamic memory allocation / queues etc [14:54:39] ok. works again http://openid-wiki.instance-proxy.wmflabs.org/wiki/Main_Page [14:54:53] ty (thank you) [14:54:57] Wikinaut: No problem [14:55:10] petan: I think it has been mentioned today at least three times that on Toolserver this is a requirement, it is done, and it works.
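Wikinaut's Apache error above has a common diagnosis: the module file named in apache2.conf is gone, usually because the package shipping it was removed. A minimal check, assuming the stock Ubuntu/Debian paths straight from the error message:

```shell
# The path comes straight from the "Cannot load ... libphp5.so" error.
mod=/usr/lib/apache2/modules/libphp5.so
if [ -e "$mod" ]; then
    echo "php5 module present: $mod"
else
    echo "php5 module missing; try: apt-get install libapache2-mod-php5"
fi
```

After reinstalling the package, `a2enmod php5` followed by an Apache reload matches the "works again" outcome in the log.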
[14:55:17] Coren: because you will either need to use a large scale (ask for more than you need) and then you will waste resources, or you will need to calculate precisely [14:55:20] petan: No it doesn't. It means you need to /bound/ your memory; that's something you need to do anyways -- relying on the OS being able to give you infinite amounts of ram on deman is horribly bad for stability anyways. [14:55:24] Is the problem to be signalled to someone ? Ryan perhaps ? [14:55:39] Nikerabbit: I did NOT ping randomly! [14:55:53] I pinged only persons I know that they are of help [14:55:56] usually ;-) [14:56:13] Wikinaut: This is no general labs problem so Ryan is not the right person [14:56:30] ok [14:56:34] There seems to be a instance problem [14:57:05] not sure about that. the instance is working since January or so [14:57:09] but let's see [14:57:19] Wikinaut: maybe a update? [14:57:42] not really. [14:58:06] there was no pending message since about one week [14:58:11] Coren, yes I agree, but it seems to me like creation of virtual environment in another virtual environment :P if you need to host reasonable amount of bots per node, then you need to be strict it allocation of memory, so these bots will need to count with significantly lower amount of memory than developers are usually expecting [14:58:23] There is a auto update in labs [14:58:37] Coren which IMHO is good [14:58:44] Coren I like well optimized jobs [14:58:51] petan: Yes. This is the price to pay for reliability. :-) [14:58:54] Coren but question is if other devs will share my happiness [14:59:27] petan: They'll be happy when their tool is rock-solid. :-) [15:00:44] Coren the default amount of memory is which property in queue config? [15:00:55] Coren, petan: Why not use some instances for tools that should be stable and some for playing? 
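Coren's "bound your memory" advice above can be demonstrated with a plain `ulimit`; this is a minimal sketch ("bot.pl" is a made-up name, and 307200 KB is roughly 300 MB):

```shell
# Run the bot in a subshell with a hard cap on virtual memory, so a leak
# kills that one process instead of dragging the node into swap.
(
    ulimit -v 307200      # hard cap on virtual memory, in kilobytes
    ulimit -v             # child processes inherit this limit
    # exec perl bot.pl    # the bot would run under the cap
)
```

On the grid side, the same bound is what the queue's `h_vmem` attribute enforces (visible via `qconf -sq <queue>`), which — as far as standard SGE goes — is the queue-config property petan is asking about.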
[15:00:56] also how do you set the maximum value for that which tool operator can use [15:01:12] Jan_Luca that's what we want to do [15:01:33] Coren suggested we could convert current bots project to staging area which I am fine with [15:01:34] petan: Sorry, I did not read the whole dialog [15:01:50] Jan_Luca that's dialog from another day, no need to be sorry :P [15:02:32] Coren did you make some sysadmin documentation yet, or it's still just in your head :P\ [15:03:47] petan: I'm writing it now. [15:03:51] ok [15:04:16] Wikinaut: I'm glad you think that way but I'm just a normal (l)user of labs ;) [15:04:31] petan: (Not even figuratively, actually, I have the edit window open at this very moment) :-) [15:04:38] Nikerabbit: "try harder" ;-) [15:04:54] Coren: I hope you system is stable :-P [15:05:08] *your [15:07:17] Jan_Luca: If it isn't, you get to yell at me. They actually pay me to make it stable. :-) [15:07:59] Coren: I mean the system with the edit window :-) [15:08:13] heh [15:08:17] auto-safe FTW [15:08:22] * save [15:08:40] petan: That's boring [15:08:55] Jan_Luca: Oh! [15:09:11] 11:09:00 up 75 days, 14:26, 3 users, load average: 0.15, 0.13, 0.14 [15:09:29] Yeah, I think it'll live to day 76. :-) [15:09:52] Coren: Y2K13 problem? [15:09:59] :-) [15:10:30] Coren: You are a energy waster :-P [15:16:38] Coren did you see qtop? :P [15:16:42] on bots [15:16:44] <3 it [15:16:53] I have. It's cute. -) [15:17:01] but it suck on small terminal [15:17:13] I just implemented N of tasks per server [15:17:15] * node [15:17:59] Coren you know Damianz, he reboots his pc just when he move house [15:18:17] petan: That seems reasonable. [15:18:21] :-) [15:18:29] not really, I would use some UPS [15:18:54] actually since I switched desktop to laptop it's easier [15:19:02] petan: That depends how far you have to move, really. 
;-) [15:19:08] heh :P [15:20:46] @notify addwork [15:20:46] This user is now online in #huggle so I will let you know when they show some activity (talk etc) [15:21:42] petan: One question: What is your job? [15:21:51] Jan_Luca sysadmin :P [15:21:56] but not for wmf... [15:22:03] I work for german telefonica [15:22:16] actually I am DBA / sysadmin [15:22:28] petan: Are you german? [15:22:30] nope [15:23:38] petan: Oh, I asked because I'm germn [15:23:42] *german [15:23:43] I know :) [15:25:26] btw Coren that linux uptime is not telling truth, you can hibernate and it won't reset :P [15:25:50] so maybe you aren't waster :P [15:26:03] petan: With hibernate I would have a uptime like Coren, too [15:26:10] you see [15:26:26] I just figured my laptop has huge uptime as well [15:26:28] Without around 10 hours... [15:26:51] petan: Nope. But I'm not wasting anyways; all that dissipated head reduces the energy I need to heat my home. :-) [15:27:10] yes that's my thinking too [15:27:29] problem is during summer when you need to use extra cooling [15:27:52] Coren: do you have a heater? I only use my PC :-P [15:28:02] andrewbogott: You're awake already? [15:28:39] Silke_WMDE: I'm in CDT now [15:28:51] Coren Jan_Luca In Finland I met someone who heated his house with bitcoin mining. o_O [15:28:52] Um… unsure whether the 'daylight' part of that brings be closer to your TZ or not. Maybe. [15:29:22] Silke_WMDE: Heh. Sounds reasonable. :-) [15:29:33] Silke_WMDE: How big was the house? [15:29:41] andrewbogott: I think we have a short period before Germany will switch to summer time, then we'll be apart one hour more again. [15:29:52] Yeah, that sounds right [15:30:10] Jan_Luca: The upper floor of an old wooden school building on the countryside [15:30:45] andrewbogott: I'll invite you for a review. It's really tiny! [15:30:57] :) [15:30:58] Silke_WMDE: With one PC only? 
[15:31:11] Jan_Luca: No, some of them [15:31:46] * andrewbogott hopes that none of his "Utilities included" tenants are still bitcoin mining [15:46:05] Coren: Is you documentation on Wikitech or on MW.org? [15:46:18] mw.org [15:51:58] Coren: Is there some documentation we / I could help? [15:53:03] Well, I'm in the middle of writing an outline now; I expect that the best possible thing at this time might be to collect a list of questions that are likely to become FAQs so that I can make sure I cover them. [15:53:26] The problem with the lead designer writing docs is that much of what I did is "obvious to me" so I won't think of it. :-) [15:53:27] ok [16:03:36] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661008 edit summary: [-1290] Start doc [16:05:25] Coren: Here some questions that come into my mind for your FAQ. [16:05:38] [[User:Jan/Tools]] [16:06:26] Oh, no linky: https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Help [16:06:47] sorry wrong copy-paste: https://wikitech.wikimedia.org/wiki/User:Jan/Tools [16:08:17] Krenair, have you used nova-precise2 recently? [16:08:33] Coren: When there more questions I add them to my list [16:09:24] andrewbogott, not really. why? [16:09:48] Krenair: Can you log in via ssh? I can't :( [16:10:52] I can ping it... No ssh though. Permission denied (publickey) [16:11:06] andrewbogott: Don't worry, be happy :-) [16:11:35] Krenair: Same for me. I'm going to reboot it... [16:12:42] Jan_Luca: That's very helpful, thanks! [16:13:02] Coren: no problem [16:15:36] Krenair: Well, now I can ssh but the web interface is broken. 
I guess that's… better [16:15:55] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661009 edit summary: [+350] /* Tool account */ more [16:17:06] !log openstack rebooted nova-precise2 because it was refusing ssh connections [16:17:10] Logged the message, dummy [16:17:49] andrewbogott, yeah I can't connect to MySQL [16:18:49] krenair@nova-precise2:~$ sudo service mysql status [16:18:49] mysql stop/waiting [16:18:49] krenair@nova-precise2:~$ sudo service mysql start [16:18:49] start: Job failed to start [16:19:58] hm, and puppet is disabled... [16:19:59] andrewbogott: How urgent is the changeset you sent me for review? The thing is: It's working on a WIkidata client but not on a review - but I'm about to leave for the rest of week. Can this wait? I'm afraid to leave my colleagues with a broken test system on Wednesday. [16:20:12] s/review/repo/ [16:21:16] Silke_WMDE: The first couple of patches are trivial and unlikely to break anything. The last couple are big and I don't really expect them to merge this week. [16:21:48] I'm looking at https://gerrit.wikimedia.org/r/#/c/51797/ [16:28:29] Silke_WMDE: Those variables that I cut out… did they do something that I missed? I figured they were just there because of copy/paste [16:28:46] andrewbogott: Silke is offline [16:28:53] oops [16:42:01] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661013 edit summary: [+525] /* Grid engine */ outline [16:43:13] Referred here from #mediawiki. Etherpad Lite seems to be down. Is this known? [16:47:34] andrewbogott: weren't you working on passwordless sudo? [16:48:03] paravoid: A while ago, yes. What's up? [16:48:45] what happened with that? :) [16:49:33] I only did it for labs… it was working (but configurable per project) last I checked. [16:49:35] What are you seeing? 
[16:50:25] ah, configurable per project [16:50:27] missed that bit [16:51:10] should be on by default for new projects [16:57:19] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Guillom link https://www.mediawiki.org/w/index.php?diff=661022 edit summary: [+74] switch to standard (for now) open tasks format [16:57:43] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661023 edit summary: [+2169] /* Web services */ logging [16:59:48] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661027 edit summary: [-25] /* Logs */ cr [17:07:05] Coren: A question: How does tools-apache use PHP? mod_php5, php-fpm, mod_suphp, fcgi, suexec? [17:07:30] Jan_Luca: mod_suphp, including for general CGIs [17:07:58] Jan_Luca: (using its shbang self-invocation) [17:08:46] Coren: I thought mod_suphp is not the solution with the best security... [17:10:30] Jan_Luca: From my code review, it has excellent insulation; and it's used in production environment with good success. It's a bit more flexible than the others, at the cost of allowing you to shoot your /own/ foot if you are careless. [17:11:45] Coren: But the last version of mod_suphp is from 2009-03-14 [17:13:13] Jan_Luca: That was one day *after* a Friday, 13th :-). [17:13:46] Jan_Luca: True, but there has been no known security vulnerabilities found in it since. :-) [17:14:11] Jan_Luca: It's still in the LTS supported arsenal as of Raring, so it's a known good quantity. [17:14:34] Coren: OK, I only wanted to point to this before we have a security problem [17:14:48] Jan_Luca: I do keep an eye on advisories in case it pops up. [17:15:04] I like PHP with suexec more [17:16:36] Jan_Luca: Both cases are defensible; but since they are functionally pretty much the same (especially from the user's pov) it's not a major issue. 
[17:17:30] Coren: I only like not tools that are too old, I love fresh and young programs ;-) [17:18:12] Jan_Luca: I like "old and proven". My own philosophy is "if it's been used in production for X years and it hasn't been exploited, we're good" :-) [17:18:44] Coren: Another problem: Could you enable chown for non-root? [17:19:37] Because every time to use "cp" to change the owner of my scripts from my user to the tool user is boring :-( [17:20:14] Jan_Luca: There are serious security concerns with giving chown to non-root, especially if we deploy quotas. You do have a good argument for a limited "give-to-tool" chown tool, though. [17:20:23] * Coren ponders. [17:20:41] maybe some wrapper with setuid? [17:20:55] I can probably do that. If you (a) own the file, (b) are part of a tool's group and (c) are giving the file to the tool. [17:21:16] Coren: That would be nice [17:21:22] * Coren considers. [17:21:29] Couldn't the setuid not be worked around by "sudo cp && rm"? Feels better than setuid root. [17:21:31] Yeah, that's probably the best. [17:21:46] But I agree that it's inconvenient. [17:22:07] Hm. Lemme try something. [17:22:37] scfc_de: I can use "become " and then cp /my/file file and then rm /my/file [17:23:02] but I ask for a faster solution [17:23:20] valeriej: helps if you ping a specific person... e.g. marktraceur! [17:23:33] (and if you repeat yourself) [17:23:34] 18 16:43:13 < valeriej> Referred here from #mediawiki. Etherpad Lite seems to be down. Is this known? [17:24:09] Jan_Luca: I would not make a "give file", but a "take file" [17:24:45] Coren: no problem [17:24:51] Jan_Luca: That's what I was proposing: "giveaway " as an alias for that. [17:25:04] I didn't know who to ping, but now I do! Thanks, jeremyb_. [17:25:13] Coren: But how do you determine whether the "receiving" user is allowed to take the file? [17:25:46] Jan_Luca: (Could be even named "chown".) [17:25:59] scfc_de: By having write permission to the directory containing it. 
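The "give-to-tool" helper being designed here can be sketched with the copy-and-remove approach scfc_de suggests, avoiding setuid entirely. Everything below is hypothetical — no such command exists on the grid — and it assumes you can sudo to the tool account and that the file's directory is group-writable:

```shell
# Hypothetical "giveaway" helper: re-create the file as the tool account,
# then remove the original. The tool account needs read access to the file
# and write access to its directory; no setuid binary involved.
giveaway() {
    local file="$1" tool="$2"
    sudo -u "$tool" cp -p -- "$file" "$file.tmp" &&
    rm -f -- "$file" &&
    sudo -u "$tool" mv -- "$file.tmp" "$file"
}
```

Usage would be along the lines of `giveaway myscript.pl mytool`, leaving myscript.pl owned by the tool account.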
[17:26:07] scfc_de: I.e.: the same as a copy-rm would [17:26:30] valeriej: well normally you'd look at https://wikitech.wikimedia.org/wiki/Nova_Resource:Etherpad , https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Etherpad&action=history , etc. [17:26:39] valeriej: those don't see so useful atm [17:26:56] Coren: Hmmm. Yeah, it should make no difference. [17:26:59] in particular the docs are non-existant. and they should say who to ping [17:27:13] Coren: when using copy-rm you have to have write permissions to the file, too [17:27:46] (I tried to use rm as tool-user but I get no permissions) [17:27:56] jeremyb_: Ah, I see. I checked here: https://www.mediawiki.org/wiki/Etherpad_Lite [17:28:06] Thanks, again. [17:28:17] valeriej: which was created by marktraceur :) [17:28:26] Coren: But that would need setuid, wouldn't it? [17:28:47] scfc_de: Nope. [17:29:05] jeremyb_: Yeah, I just thought to check that. :) [17:30:32] Coren: Hmmm. Okay, you're right again :-). "sudo cp" would also require read permission on the "given" file. I'll better shut up :-). [17:41:59] Coren: Your help (https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Help) is very nice until now [17:42:42] Jan_Luca: Well, it's very skeletal still. :-) [18:11:53] is http://etherpad.wmflabs.org/ down? [18:18:19] zeljkof: I asked that earlier "18 16:43:13 < valeriej> Referred here from #mediawiki. Etherpad Lite seems to be down. Is this known?" I was pointed to marktraceur. [18:21:00] valeriej: thanks, did you get a reply? [18:21:07] zeljkof: yes. tring to debug it [18:21:10] *trying [18:21:16] Ryan_Lane: great :) [18:21:20] it seemed to have dropped off the network [18:21:25] Just now, heh. Thanks Ryan_Lane. 
[18:21:41] it's one of the few down [18:22:09] so, I'd imagine something happened on it [18:33:09] I'm just going to reboot etherpad-lite [18:33:50] it hasn't tried to get an ip address for 13 days [18:36:12] it stopped kernel logging on the 5th [18:57:39] Ryan_Lane is etherpad project restricted? [18:57:43] or accessible [18:57:49] restricted in which way? [18:57:59] like people can't be members of it etc [18:58:24] only tools has this restriction currently [18:58:36] from projectadmin pov [18:58:44] ah ok can I be member then :D [18:58:53] whoever manages etherpad should add you [18:58:58] I wanted to see how it's configured cuz I can't install ether on my box [18:59:07] * Ryan_Lane nods [18:59:12] @labs-project-users etherpad [18:59:12] Following users are in this project (showing all 10 members): Ryan Lane, Wikinaut, Johnduhart, Abartov, Dzahn, MarkTraceur, Novaadmin, SuchetaG, Yuvipanda, Demon, [18:59:27] YuviPanda can u add me [18:59:34] marktraceur: ^^ [18:59:45] I think he's the usual maintainer [18:59:50] aha [18:59:53] +1 to marktraceur [19:00:09] and he's onlien, I think. He was waving and particling in another channel a little bit ago [19:00:30] I need to add a request queue for projects [19:49:15] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661074 edit summary: [-1] /* Logs */ ce [20:04:26] andrewbogott, hi, how'd you fix nova-precise2 in the end? 
[20:05:01] Krenair: I asked Ryan and he immediately noticed that / was full, just like the last two or three times that nova-precise2 stopped working :) [20:05:09] And I think he fixed the cron that had been filling up the disk [20:08:11] Sorry, was at lunch [20:17:34] * Damianz pats andrewbogott on the nose [20:19:16] bleh [20:20:07] -.- [20:20:18] why is bots.wmflabs.org sshable externally [20:20:20] no wonder it gets owned a lot [20:22:58] !log bots bots-bsql01 / full, mysql dead with too many connections. Clearing root, restarting mysql [20:23:23] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661087 edit summary: [+1920] /* Grid engine */ jsub [20:23:24] Logged the message, Master [20:28:02] seriously, who in their right mind puts mysql binlogs in /var/log [20:28:06] I mean fuck me [20:28:18] * Damianz slaps debian [20:29:18] /puppet/files/logrotate ? [20:29:39] /sbin/nukefromorbit [20:31:10] Coren, regarding Ryan_Lane's email about service users and groups: Will there always be an exact 1:1 between users and groups, or is it useful to add additional users to a given local group? [20:31:26] [bz] (ASSIGNED - created by: Antoine "hashar" Musso, priority: Immediate - normal) [Bug 45084] autoupdate the databases! - https://bugzilla.wikimedia.org/show_bug.cgi?id=45084 [20:33:15] andrewbogott: Hm. I would expect that service user will be 1:1 with service groups in my use case, though /global/ users will be added to the group/ [20:33:41] andrewbogott: But I wouldn't bar it unless it's really needed; other use cases might find use for, say, role groups. [20:33:50] Ok, wo we'll need a gui to add real users to the service group [20:33:54] *so [20:34:14] andrewbogott: Yes -- and the intent is that member of a service groups can do so, not just project admins. [20:34:23] ok [20:34:48] So y'think the GUI for this should look roughly the same as the existing gui for roles? 
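The nova-precise2 and bots-bsql01 incidents above follow the same first-response pattern, sketched here with stock tools (/var/log is exactly where the binlogs turned out to be):

```shell
# Confirm the root filesystem is the full one, then find what is eating it.
df -h /                                               # "/ full" shows up here
du -xsh /var/log/* 2>/dev/null | sort -rh | head -5   # the space hog, by size
```

The `-x` keeps `du` on one filesystem, so large /data or /public mounts don't drown out the root partition's own usage.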
[20:35:10] andrewbogott: I'd expect it'd be the most consistent option. [20:35:19] it should look pretty with unicorns and pink flowers [20:35:22] it looks like shit currently [20:35:36] andrewbogott: pink unicorns and pretty flowers, you mean. :-) [20:35:55] It already has pink unicorns, as best I can tell [20:36:08] someone stole the pink unicorn :( [20:36:09] (given that pink unicorns are invisible, as I understand it) [20:36:10] legal made it blue [20:37:40] 9.9G root partition, 7.6G is mariadb binlogs.... fml [20:37:50] * Damianz wishes a slow and slightly painful death on petan [20:38:28] Legal made it blue? [20:39:19] I don't think legal was involved… just, someone pointed out that the particular graphic we were using was the same from (slightly) religiously-charged wikipedia page... [20:39:19] So, changed. [20:39:41] And anyway that particular banner is long gone, as far as I can tell. [20:42:26] andrewbogott: I *liked* the reference to the IPU. [20:44:58] Coren: Re https://www.mediawiki.org/w/index.php?diff=661023, do newer versions of Apache support this error logging scheme you're looking for, or is this "hope"? :-) [20:47:02] scfc_de: 2.4 does [20:48:18] what's needed in 2.4 that doesn't exist in earlier versions? [20:48:29] Ryan_Lane: ErrorLogFormat [20:48:35] ah [20:48:49] can't you just create a custom log format for this? [20:48:57] then using ErrorLog ? [20:49:01] Ryan_Lane: Not for error logs, that's the point. [20:49:05] lame [20:49:06] :) [20:49:17] Ryan_Lane: Yeah, but they fixed it in 2.4. :-) [20:49:26] we'd actually like 2.4 for other reasons [20:49:45] so, if you backport it, you'll be helping in a number of ways :) [20:50:07] Should be fairly easy to backport; apache has a sane and well known build process. [20:50:24] I was just a little leery of having the /only/ 2.4 install WMF-wide. 
[20:50:24] :-) [20:50:39] well, we're considering using it for https [20:50:42] *snort* [20:51:00] we're using nginx now [20:51:10] there's a lot of strong motivation to use apache 2.4 [20:51:33] for one, it can share ssl cache between all nodes [20:51:45] so, we woudln't need to use lvs sh load balancing mode [20:51:58] also, it would be a more efficient cache [20:52:45] mod_spdy is written by google, right? [20:53:01] I guess that's for apache 2.2 [20:53:57] Hm. allright then, since it's clearly worthwhile I'll build a 2.4 deb. [20:53:58] bleh. it isn't supported for 2.4. :D [20:54:10] Ah, no mod_spdy then [20:54:28] the biggest reason is a shared cache [20:54:29] But that means making debs of all the modules we use too. [20:55:12] I doubt that's many [20:55:53] modules? compile it staticly :D [20:57:58] Damianz: troll ;) [20:58:17] you know it [20:59:39] Huh. Someone already has a 2.4 in a ppa. [21:00:28] the apache maintainers built 2.4 [21:00:31] it's in debian experimental [21:01:22] Debian has always been a bit... overly conservative. 2.2 is walking with a cane by now. [21:02:03] We all know if you want the latest to use fedora ;) [21:02:04] apache 2.4 broke the ABI and API [21:02:38] Damianz: I don't like the hatoids. They tend to change things for the sake of changing things. [21:02:47] we have decided to postpone the transition to apache2 2.4. The main blocker is that mod_perl needs a major new upstream release which very likely won't be ready in time for Wheezy and we don't want to release Wheezy without mod_perl. [21:02:52] The transition will probably happen shortly after the release of Wheezy. We are sorry for any inconvenience this may have caused. [21:02:56] https://lists.debian.org/debian-apache/2012/05/msg00036.html [21:03:24] * Damianz tips his fedora red-hat hat at Coren [21:03:40] That's actually a fairly good reason. 
Also one that doesn't apply to us (I think) [21:04:19] Oh, it looks like more fun with different mpm models in apache 2.4 and getting modules to be correctly thread safe... [21:05:49] I know, I'm just saying, don't assume people just like being slow :) [21:06:12] debian does like being slow at heart though [21:10:55] It looks like they subtilly broke backwards compatibility in the module API, so some custom modules won't build on 2.4 without adaptation [21:28:20] hi Damianz [21:28:47] * Damianz pokes petan [21:28:47] what's up [21:28:55] oh that binlog needs own partition of course [21:28:59] I have that in my todo [21:29:00] your server is full [21:29:03] really? [21:29:04] it broke [21:29:08] mine? [21:29:12] bork? [21:29:14] wut [21:29:20] bsql01 is totally yours atm :P [21:29:24] lol [21:29:33] I made it 246M by killing the logs, binlogs need shifting soon [21:29:43] ok so it's down or not [21:30:04] it's up right now [21:30:12] ur telling me it broke [21:30:30] it did earlier [21:30:40] the root parition filled up and mysql got grumpy [21:30:57] ok so did you login to mysql and flushed it? [21:31:02] did you log it [21:31:03] !sal [21:31:03] https://labsconsole.wikimedia.org/wiki/Server_Admin_Log see it and you will know all you need [21:31:05] labstore1.pmtpa.wmnet:/keys 71T 9.5T 62T 14% /public/keys < I'll eat my hat if keys really use 9.5T of disk space [21:31:14] hehe [21:31:19] we have a lot of them [21:31:32] Uptime: 1 hour 6 min 55 sec [21:31:33] wtf [21:31:35] it was really down [21:31:40] No I killed it, cleaned up the logs, restarted mysql and it was happy again. [21:31:41] * petan blame whoever did it [21:31:44] omg [21:31:45] why [21:31:49] you could just mysql to it [21:31:53] and execute flush [21:31:57] nope [21:32:02] mhm ok [21:32:08] too many connections which wouldn't clear because it couldn't write to it's log. 
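For reference, the 2.4-only feature driving the backport discussion above is the `ErrorLogFormat` directive. A hypothetical snippet built from documented 2.4 format specifiers — the format string and the log path are examples, not anything deployed:

```
# httpd 2.4+ only; 2.2 offers no per-virtual-host error-log formatting
ErrorLogFormat "[%{u}t] [%-m:%l] [pid %P] [client %a] %M"
ErrorLog "/data/project/mytool/logs/error.log"
```

Here `%{u}t` is a microsecond timestamp, `%-m:%l` the module and log level, and `%M` the actual message — which is what makes per-tool error logs parseable the way the help page wants.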
[21:32:27] bah [21:32:32] I only noticed because my bots bitched a lot [21:33:17] !log bots purged binary logs [21:33:22] Logged the message, Master [21:33:22] fixed [21:33:33] anyway MariaDB [(none)]> purge binary logs before '2038-01-19 03:14:07'; [21:33:34] Query OK, 0 rows affected (0.73 sec) [21:33:43] well... by varying levels of fixed [21:33:51] * Damianz waits another week for it to fill up again [21:33:51] I will make a partition for that [21:33:52] in future [21:33:53] also 2038? [21:34:02] that is MAX date for mysql [21:34:06] so it mean [21:34:17] all logs [21:34:17] * Damianz thinks like today would have done :P [21:34:26] probably but I have stored this date [21:34:36] so it work in future :) [21:34:42] I believe our server will work for a long time [21:35:21] did anyone else bitched it's full? [21:35:23] Until someone does a db import. [21:35:29] btw this problem can be fixed [21:35:40] either by disabling logs [21:35:46] or making a large partition for them [21:35:54] as a DBA I would recommend keeping them :> [21:35:55] mhm [21:36:05] I more want to know wtf debian puts them in /var/log [21:36:09] hehe [21:36:20] because debian suppose syadmins are idiots [21:36:31] As an engineer I think you should use a real database lik postgre or oracle :P [21:36:36] LOL [21:36:42] oracle would use archlogs [21:36:44] that is same [21:36:50] + redologs [21:37:04] :P [21:37:08] well archlogs aren't quite the same... but yes we'd need a ~day of them. [21:37:20] archlogs are same as binary logs in mysql [21:37:25] Or w/e our time to flush to the store is for replay... not that we'd probably ever recovar it in labs. 
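petan's plan above — keep the binlogs, but stop them from filling the root partition — amounts to a one-off purge plus a couple of lines of my.cnf. The path and retention values below are examples only, not the bnr/bsql configuration:

```
# one-off cleanup, run in the mysql client (the log's version used a
# far-future date to mean "all of them"):
#   PURGE BINARY LOGS BEFORE NOW();
# then keep them bounded in the [mysqld] section of my.cnf:
[mysqld]
log_bin          = /srv/mysql-binlog/mysql-bin   # dedicated partition, not /var/log
expire_logs_days = 7      # auto-purge instead of manual flushes
max_binlog_size  = 100M
```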
[21:37:52] well, I believe having slightly bigger storage would solve the problem [21:37:57] or nagios some alert [21:37:59] :P [21:38:17] There is a nagios alert for that - someone moved it to a channel no one is in [21:38:24] or we could just disable them and hope that no one will ever need to do recover [21:38:32] hehe [21:38:37] who knows who was it [21:38:51] We have daily backups - if they want more than that good luck [21:38:58] ok [21:39:01] Or did have... before stuff moved... anyway [21:39:05] BTW [21:39:10] daily backups had a security holes [21:39:14] I fixed them [21:39:25] you were backing up databases and gave a+r to others [21:39:35] so everyone could read whole dump no matter of db [21:39:52] That's not really a security hole, labs is pretty open [21:40:04] yes I would like to have them open :) [21:40:12] some people in here have different opinions [21:40:29] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661195 edit summary: [+540] /* Simple utilities */ more [21:40:43] If you use bots you're agreeing to others seeing your stuff so meh [21:41:31] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661197 edit summary: [-1] /* Tool account */ ce [21:41:49] addwork pioung [21:44:21] How are you supposed to stop others from seeing your login details, Damianz? [21:44:47] You don't? [21:45:04] All passwords should be salted per the labs docs and users are allowing other labs users to view that data [21:45:15] So it's not really a problem [21:45:56] In the same sense everyone can see your bots passwords, or at the least their cookie jar in bots. [21:46:37] But if you want to run a bot that connects to a bot account on-wiki, you need to give the MediaWiki API a password. 
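The backup permission hole petan mentions — dumps created world-readable — is typically closed at creation time with a restrictive umask at the top of the backup script. A minimal demonstration, with dump.sql standing in for a real dump file:

```shell
umask 027        # new files become rw-r-----: group may read, world may not
touch dump.sql
ls -l dump.sql
```

For dumps that already exist with a+r, a `chmod -R o-rwx` on the backup directory removes the world-readable bit after the fact.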
[21:47:11] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661198 edit summary: [+0] /* Grid engine */ ce [21:47:20] Yeah? [21:47:29] Krenair: oAuth support will eventually be in mediawiki [21:47:31] and live on the sites [21:47:37] it's slated for this quarter [21:47:39] And I can go to any bot, su to your users, look at your config file for the db password and view it? [21:47:53] You're putting security in the wrong place here [21:48:05] Ryan_Lane, can you see why opendj won't run on nova-precise2? [21:48:11] sure [21:49:05] it looks like its running to me [21:49:11] oAuth for mediawiki will be highly awesome though - for many things [21:49:22] hm [21:49:31] it also looks like it's full of errors in its access log [21:49:48] I wonder if its database is corrupted [21:50:35] Ryan_Lane: When I restart it it throws an error that I don't understand... [21:50:45] yeah [21:50:48] also 'service --status-all' shows it as not running but maybe that's meaningless [21:51:05] that's specific to upstart [21:59:07] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - enhancement) [Bug 36648] replicate HTTPS architecture - https://bugzilla.wikimedia.org/show_bug.cgi?id=36648 [22:02:19] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661201 edit summary: [+1892] /* Grid engine */ mo' details [22:29:00] Are my docs understandable at all, btw? [22:30:33] Coren will check them tommorow :P [22:30:38] night in EU [22:30:58] Coren btw !tooldocs is [22:31:03] or !toolsdocs [22:31:24] me lazy [22:31:34] !tooldocs is http://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Help [22:31:35] Key was added [22:31:47] !tooldocs [22:31:47] http://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Help [22:32:41] ty [22:32:45] night [22:32:52] Ryan_Lane: So, thoughts? Works for you? 
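On the unanswered jsub question above ("-mem: virtual_free or h_vmem?"): a wrapper like jsub would typically translate its `-mem` flag into an SGE `-l` resource request. Which complex it maps to is exactly what is being asked; this hypothetical sketch requests both, a common belt-and-braces choice ("myjob.sh" is a made-up name):

```shell
# Hypothetical translation of a jsub-style "-mem 500m" into qsub resources.
mem="500m"
cmd="qsub -l h_vmem=${mem},virtual_free=${mem} -b y myjob.sh"
echo "$cmd"
```

Conventionally, `virtual_free` is a scheduling hint (pick a node with that much memory free) while `h_vmem` is a hard per-job limit — the Toolserver semantics TS people would expect.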
[22:39:20] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661218 edit summary: [+6] /* Job names */ ce [22:43:09] andrewbogott: oh. sorry [22:43:12] was distracted [22:44:36] Change on 12mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by MPelletier (WMF) link https://www.mediawiki.org/w/index.php?diff=661221 edit summary: [+23] /* Simple utilities */ ce [22:49:56] what means the -mem option to jsub virtual_free or h_vmem ? and can't you use the same name as SGE, so TS people will understand it ? [23:01:57] andrewbogott: fixed it [23:02:08] andrewbogott: it was already running it, and the init script wasn't killing it [23:02:17] I killed it manually and did a restart [23:02:50] andrewbogott: did you want to work with coren on the project user stuff? [23:03:05] or was there something else you were trying to finish up? [23:12:07] Ryan_Lane: ' it was already running it' <- can you de-pronoun that for me? [23:12:25] And, yeah, I was looking at making a gui for service groups. [23:12:28] the process wasn't dead :) [23:12:36] it was running [23:12:42] * andrewbogott tries to log in [23:13:43] Woo I can log in! [23:13:52] I'm about to step away but will work on this later and/or in the morning [23:19:16] heh [23:19:17] cool [23:19:37] maybe I'll get started on the filesystem stuff, then [23:20:13] of course I'll work on it as well, for things you guys need