[04:02:32] How can I allocate ipv6 address in noda instance? [04:03:28] can't yet [04:03:53] opps. ok. [04:04:17] it's in our sights, though [04:47:54] !log deployment-prep rebooting deployment-lucene [04:47:57] Logged the message, Master [04:49:14] !log deployment-prep rebooting deployment-integration [04:49:16] Logged the message, Master [05:34:41] * Beetstra looks at the responsiveness of linkwatcher .. and grumbles [05:35:09] :/ [05:35:47] lego, what username does your bot use, and what is the username of addshore's bot? [05:35:59] Legobot and Addbot? [05:36:33] thanks, see http://meta.wikimedia.org/w/index.php?title=User:LiWa3/Settings&diff=5317714&oldid=5316074 [05:37:00] ahhh [05:37:45] * Beetstra is going to utterly ignore Addbot and Legobot :-D [05:38:05] (they were whitelisted anyway .. but their edits do get parsed ..) [05:40:06] oops, typo in bot [05:40:08] :-p [05:40:19] now it is not doing anything [05:40:19] :-( [06:31:17] legoktm, are Legobot and Addbot the two biggest editors? [06:31:23] not even close [06:31:32] highest speed at the moment .. I mean [06:31:35] EmausBot had 21million global edits [06:31:36] oh [06:31:38] probably [06:32:15] I see Sk!dbot active on wikidata .. you familiar with that one? [06:32:26] yeah [06:32:30] its another import bot [06:32:46] all of the bots on wikidata edit very fast [06:32:46] high speed? [06:32:52] since they dont have to fetch page content [06:33:24] umm [06:33:25] probably [06:34:08] I am going to ignore that one for now as well .. when things get slower again, I will unignore them [06:34:38] I count 69 epm at 6:32 [06:35:30] I had a wikiwide increase of edits from ~600 per minute to >2000 a minute .. but that is for the 95 wikis I parse .. [06:35:43] heh [06:35:43] Since, say, 5 days [06:35:53] the push to 7 million :P [06:35:56] linkwatcher could not handle that .. [06:36:23] you don't watch all wikis? [06:37:17] all mediawiki wikis, within reason [06:37:32] wikipedia, wikiquote, wikiversity, wikitravel, wiki .. [06:37:54] arent there 700+ wmf wikis? [06:38:15] heh .. oh, I missed a digit .. 795 [06:38:49] :P [06:40:26] oh, it is 794 [06:40:26] -> LW: 13 minutes 55 seconds active; RC: last 1 sec. ago; Reading ~ 794 wikis; Queues: P1=0; P2=2855; P3=1905 (7827 / -1); A1=0; A2=0 (2377 / -1); M=0; Total: 25043 edits (1799 PM); 669 IP edits (2.8%; 48 PM); Watched: 23523 (93.9%; 1690 PM); Links: 234 edits (0.9%; 16 PM); 547 total (39 PM; 0.02 per edit; 2.33 per EL add edit); 0 WL (0%; 0 PM); 0 BL (0%; 0 PM); 0 RL (0%; 0 PM); 0 AL (0%; 0 PM) [06:41:23] how does it detect new links being added? [06:41:29] does it fetch page text? [06:41:52] No .. worse [06:42:22] it pulls the previous and current revid through the parser in the API, ripps out all external links, compares the two lists [06:42:53] page text would hide people adding YouTube-templates ... [06:42:59] oh hmmmm [06:43:08] I know .. painful .. [06:43:39] But it is running in full realtime ... unless you have some morons with high-speed bots editing cross wiki on interwikis ... [06:44:13] now it has a backlog of ~1.2 million edits to parse ... [06:44:57] can't you just drop all of our edits from the queue? [06:45:05] And XLinkBot is depending on the real-time-ness of the linkwatcher (albeit only for en.wikipedia, that is why linkwatcher has 3 queues) [06:45:40] That is the hack I now applied - the DiffReader-module (that reads IRC) now bails out when the bots in the settings are editing [06:45:55] So they don't reach the core, and hence not the parser [06:46:02] OK, have to go, see you later! [06:46:20] bye! [08:08:32] hi [08:10:24] addshore there is some problem with your script [08:17:34] I fixed it [08:23:43] Coren|Sleep check out /bin/usr/qstatus on bots, maybe you could use it as well [08:23:56] just type qstatus to see [08:43:29] * legoktm looks [08:44:23] legoktm works? [08:44:49] isnt it just like a combination of qhost + # of total jobs? [08:44:55] yup [08:45:03] + qstat -j [08:45:07] every second [08:45:08] refresh [08:45:15] ok [08:45:33] I will rename it to qtop [08:45:34] :P [09:52:56] legoktm, addshore - after 3.5 hours linkwatcher reports ~500 edits per minute, which is more like what I had before [09:53:02] :) [09:53:20] And the bot is happily munching its MASSIVE backlog at the moment [09:53:35] (which means that it has time to spare now) [10:21:46] addshore ping [10:22:15] !ping [10:22:15] pong [10:56:22] [bz] (UNCONFIRMED - created by: Damian Z, priority: Unprioritized - minor) [Bug 38792] Thumbnails are broken - https://bugzilla.wikimedia.org/show_bug.cgi?id=38792 [12:57:40] Coren: I'm trying to import my php pages for my tool but there is a problem. No error log is produced, how can I debug this? [12:58:02] Hm. What are you using to import? [12:58:26] I just copied my php files from public_html on toolserver. [12:58:58] Oh, you mean there is a problem running them, not copying them. :-) [12:59:20] Yes :) And I already imported my database, by the way. [13:00:10] * Coren looks. [13:00:56] Hm. There is no php_error.log in your tool home which means it doesn't even get that far. [13:02:33] Aha. Possible problem no 1: your index.php is not owned by the tool. :-) [13:02:56] So it wouldn't know where to send the php_error.log [13:03:02] Er which one? [13:03:12] Ah yes true. [13:03:36] But that's not the page with issues :P [13:04:20] I'm not sure why you're not getting the messages, but I see two of them: [13:04:31] [Wed Mar 13 12:54:25 2013] [error] [client 10.4.1.89] PHP Notice: Undefined index: liste in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_formulaire.php on line 77, referer: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/ [13:04:39] [Wed Mar 13 12:54:25 2013] [error] [client 10.4.1.89] PHP Fatal error: Cannot redeclare dummy() (previously declared in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_chaines.php:10) in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_chaines.php on line 11, referer: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/ [13:04:52] Where is this from? [13:04:55] * Coren looks into why you're not getting the errors. [13:05:22] That's the page with problems. [13:05:59] Yeah, I'm not sure why you're not getting the errors in your tool home, but that's what I'm seeing in the global log. [13:06:27] Wait, I think I may have redefined where the error log goes... [13:06:33] * Darkdadaah checks [13:07:32] Darkdadaah: Ah, yes, that'd explain it then. :-) [13:08:03] Hm no, I commented the line out. [13:08:11] Grrr [13:09:27] Oy, there's lots of ini_set for error_log in there. [13:10:38] Oh, found it then. [13:11:07] Or not. [13:11:52] Or yes. [13:12:22] Why did I put so many ini_set for errors :( ? [13:12:33] Coren: It works now. [13:12:46] Or at least, I have my error logs. [13:12:48] Because the ts doesn't have a default value that you can reach. [13:13:56] I had a hidden ini_set('error_log','old/path/on/toolserver') which sent the error log into limbo. [13:15:04] Thanks for your help, I can now continue debugging :P [13:15:50] addshore ping [13:22:03] Coren: ok it's alive now: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/chercher_anagrammes.php?mot=mariage&langue=fr&type=&flex=oui&gent=oui&nom_propre=oui&liste=table#liste [13:28:41] Darkdadaah: Yeay! U can haz suksess! [13:31:00] Darkdadaah: Mais pourquoi sous wiktio/anagrimes_web ? [13:31:38] Coren: parce que j'ai simplement copié-collé la structure du répertoire que j'avais sous toolserver. [13:31:58] Je mettrais ça au propre. [13:32:13] mettrai* [13:32:28] Darkdadaah: Aha! Fait gaffe si les autres répertoires ont des trucs qui faut pas mettre sur le web alors. [13:32:44] Tout était dans public_html, donc c'est ok. [13:33:10] Heh. Ou au moins, on le présume. :-P [13:33:19] * Darkdadaah checks [13:37:05] Nan c'est bon. [13:37:24] ...ah tiens on parle en français maitenant. [13:38:06] maintenant* [13:51:13] Darkdadaah: No sé lo que estás hablando, hablando español todo el tiempo. [13:52:15] * Coren tries to confuse _everyone_! [13:54:48] Coren: Ich verstehe dich nicht. [13:57:01] Darkdadaah: Mein Deutschsprache ist schlecht. [14:00:06] 私も。 [14:01:17] <^demon> I speak English. Et très mauvais français. [14:02:36] As long as we understand each other :) [14:03:07] <^demon> Aking sa batas magsalita tagalog, hindi ko. (Google Translate) [14:04:18] <^demon> Aking balae? [14:04:21] <^demon> Maybe. [14:04:25] !log bots restarted webserver: relax AllowOverride options [14:04:27] <^demon> sa batas or balae. [14:04:28] Logged the message, Master [14:04:43] !log tools restarted webserver: relax AllowOverride options [14:04:44] Logged the message, Master [14:04:54] !log bots (Last log entry is a lie: wrong project) [14:04:56] Logged the message, Master [14:06:21] [bz] (RESOLVED - created by: Tim Landscheidt, priority: Unprioritized - enhancement) [Bug 46003] Relax restrictions on .htaccess - https://bugzilla.wikimedia.org/show_bug.cgi?id=46003 [14:14:47] Coren: You could allways come over to the dark side ;) [14:21:37] Damianz: /which/ dark side? :-) [14:21:44] bots! [14:22:07] Damianz: Heh. One set of problems at a time, preferably. :-P [14:32:44] !log wikidata-dev wikidata-dev-9 Disabled WikibaseSolr so that test properties can be imported again [14:32:47] Logged the message, Master [16:26:24] @notify wm-bot [16:26:25] This user is now online in #huggle so I will let you know when they show some activity (talk etc) [16:26:27] :o [16:26:28] hehe [16:26:31] @notify addshore [16:26:31] This user is now online in #huggle so I will let you know when they show some activity (talk etc) [16:27:02] wm-bot: you didn't [16:27:02] Hi petan, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-) [16:46:04] * Silke_WMDE hates git when it's telling me about shallow stuff [16:52:41] andrewbogott I'm getting a brand new error message when puppet tries to get me a mediawiki. "shallow file was changed during fetch" Have you seen that beore? [16:52:51] before [16:53:17] I haven't… is it happening on a fresh clone or an existing one? [16:53:29] a fresh one [16:53:45] a older instance, but I threw the wiki and the db away [16:54:06] then it takes ages [16:54:47] oh wait... maybe one of the two instances is succeeding now... [16:55:07] Possible that it's a real error, if someone pushed right when you cloned [16:55:25] ah, I see [16:55:55] it happened twice or three times in a row [16:56:56] but probably you're right, the second one seems ok now, too. [16:57:12] I'm sorry for crying without a reason [16:59:28] Silke_WMDE: No problem; there was some ML discussion about that shallow clone behavior this week so I might tinkering with it later on. [16:59:40] ah cool [17:12:03] quick ? - just want to make sure there are no outstanding public ip requests [17:53:45] addshore ping [19:22:33] is addbot run in labs? [19:23:17] petan, Damianz: ? [19:59:03] Ryan_Lane: Not sure? [19:59:29] I think it is [20:00:53] Damianz: nevermind [20:00:58] some folks wanted it disabled [20:01:02] I don't see a reason for it [20:01:40] If they want it disabled the standard process is to get its accounted blocked isn't it? Unless there's very good logical reason I hold up the 'sorry, don't participate in censorship board' [20:03:42] it was related to the site outage [20:03:44] well [20:03:51] they wanted it disabled so that it would stop editing [20:03:58] I told them it doesn't make sense [20:04:30] * Damianz nods [20:46:05] Ryan_Lane well it eats hell a lot of resources [20:46:10] no wonder if it brings the sites down [20:46:12] :D [20:46:24] he is spawning it in like 200 processes at once [20:46:33] addshore ping [20:46:49] Ryan_Lane was related to site outage? [20:46:56] 200 is nothing [20:47:01] you want to tell me that addbot brough the production down? [20:47:42] Coren did you see qtop? [20:47:55] cool if you could insert it to your cluster as well [20:49:07] petan: I saw. Sounds like a neat thing, though I tend to use qmon myself. :-) Did you make a deb for it or is it just a self contained /usr/local/bin script? [20:58:31] Coren will put it into deb and puppet [20:58:49] I don't want to use qmon [20:58:52] <3 shell [20:59:44] I've been using graphical interfaces for long enough, when I discovered power of shell I can't image how I could live without that [20:59:46] I like qmon because I can just push it off another monitor and keep an eye on it. [21:00:10] right, but you can do the same with qtop if we make it more detailed [21:00:27] everything what qmon shows can be displayed in terminal except for graphs [21:00:52] petan: Sure, but qmon also allows queue control. Different niches. :-) [21:01:16] qconf and qdel and these tools too :OP [21:01:19] (It also allows configuration, but why anyone would use that instead of qconf is beyond my comprehension) [21:01:53] I just found out that doing stuff in shell is usually faster than nice gui's but maybe in some cases not... who knows [21:02:06] but having both won't hurt [21:02:15] unless qmon would eat tons of resources :P [21:02:30] I am still thinking of cluster configuration in regards of various types of bots [21:02:39] It's actually pretty lightweight. [21:03:04] I am thinking that there could be eventually multiple queues for even separate clusters... rather than mixing all kinds of jobs to all boxes [21:03:22] It's be downright breezy if it didn't have that stupid splash screen on startup that takes forever to transmit for no good reason. [21:03:23] because there are different kinds of bots eating different resources [21:04:00] for example thanks to addbot we discovered how easy it is to overload the cluster or make the queue full and unusable [21:04:10] petan: That's the reason why you really want to have as few queues as possible, actually. It allows for more efficient resource usage (i.e.; a memory heavy bot can share nicely with a cpu intensive but light footprint bot) [21:04:43] so, fyi, these old foo.labs.wm URLs are all redirected to just wikitech now: https://gerrit.wikimedia.org/r/#/c/53478/3/redirects.conf [21:04:55] I think if there were different queues for heavy cronned jobs and lightweight bots which runs all time it would work better [21:05:24] because some bots (irc bots for example) don't take it well when system performance changes rapidly [21:05:32] while these heavy wiki bots don't care [21:05:54] some irc bots may even disconnect from irc network because of temporary lag caused by huge number of heavy tasks [21:06:42] if we had like small cluster (1 - 2 very small boxes) for irc bots and separate queue for heavy jobs, I believe it would be far more stable [21:06:44] petan: That probably means you overcommit a bit, or that you should consider priority allocation instead. Hell, I'd consider tickets for bots that need interactive performance. [21:07:46] sure you can mix them and they will work, but problem is HOW they will work [21:08:01] as you say, they are interactive - you don't care if wiki bot respond to you in 1 second or 4 [21:08:04] petan: Honestly, I've never seen a grid setup where multiple queues didn't end up being a problem except in the rare case of hardware segregation. [21:08:16] but you will care if irc bot respond to you quickly or not [21:08:37] petan: Right, so give interactive tasks higher priority, and a couple extra tickets. [21:08:46] tickets? [21:08:53] well, we can try [21:09:05] like to mix some non important irc bot with heavy jobs [21:09:08] see how it will work [21:09:30] if it will work badly (I suppose it will) we can consider a different solution and if it worked fine, I don't care [21:09:38] petan: Right. [21:10:04] we could probably move morebots which is debianized already [21:10:13] petan: But it's probably a good idea to give more priority to interactive tasks, and to reduce priority of heavy duty tasks regardless of queue setup. [21:10:29] can this be done on level of queue [21:10:31] or task? [21:11:22] also, don't overestimate linux kernel... however cool grid is, it will still depend on the kernel... for example I saw that you have no swap on your boxes [21:11:26] how you handle OOM? [21:11:45] I have 20gb of swap becaues anything is better than kernel randomly killing tasks [21:12:26] of course - goal is to never need to use that swap [21:12:34] but if something terribly went wrong it's good [21:13:01] from experience - even if you are watching ram - you can always get out of it - for example when gluster daemon fucked and ate 60% of ram [21:13:23] even if you restricted users from being able to use all ram - some system daemon can break [21:13:26] petan: Gluster is teh evil. :-) [21:13:32] so having swap as backup is useful [21:13:51] Coren i had a box where gluster ate over 5gb of ram [21:14:00] petan: In my experience, once a system start trashing it might as well be dead for jobs; better to kill the jobs and restart them automatically on another node. [21:14:26] petan: But, ideally, you want qqotas as well. [21:14:33] Coren that is very bad for irc bots - for example wm-bot is like... never supposed to be restarted, ever :P [21:15:14] it's being used in about 60 wm channels and it's logging some - when it's down people are really pissed because of hole's in logs [21:15:21] especially when there is some conference [21:15:51] which was my point of having 2 clusters - heavy wiki bots don't care if you kill them and move them to another node, but irc bots do care [21:15:59] petan: That's overly optimistic even in the best scenarios; you can reduce not eliminate downtime. What we /can/ do is manage it, do proper checkpointing, and make sure it goes back up as fast as possible. [21:16:11] and it's very much more likely to have box where wiki bots are running killed on OOM rather than box where irc bots are running [21:16:55] petan: I don't think it is. With resource quotas and ulimits, the only way the OOM will wake up is if the system itself goes bonkers, in which case it makes no difference. [21:17:12] Coren of course, but wm-bot running on separate box was able to have uptime over 120 days - I am really wondering if that will be possible on your cluster with no swap and similar backup solution if it ran together with addshore's ultimately heavy tasks [21:17:46] I am pretty septic about it :P [21:18:03] but we could move some morebots there to see [21:18:21] if I saw it's really that stable I have no problems with that - BUT resource limits are evil [21:18:30] i hope they are going to be per process and not per user [21:18:34] petan: Well, I've had considerably more complicated usage patterns dealt with at Andritz with jobs that "you kill this job we lose $1M" variety. :-) [21:18:54] Coren: Only $1M? pussy [21:18:55] ;P [21:18:58] :) [21:19:13] Coren would you run these jobs on this your cluster? :P [21:19:24] petan: They are going to be per job. It's resource allocation not user limits; if your job needs more let it ask for more... and possibly have to wait until its available. :-) [21:19:42] btw I disagree with you that system which has borked daemons is supposed to die [21:19:55] you can always fix the system on the fly without having to reboot it [21:20:03] petan: I'm not saying "supposed to die" just "not able to prevent" [21:20:24] mhm... you can definitely improve the chances of being able to save it [21:20:38] petan: No, I wouldn't run those CFD jobs on tools- yet. (a) Not done turing and setting resources, and (b) doesn't run on physical hardware. :-) [21:20:56] tuning* [21:21:05] for example before I enabled swap on boxes people were bitching like every second day about boxes dying on OOM, and now - since then none of them ever crashed on OOM [21:21:31] petan: Thing is, you didn't fix the problem of overallocation, you just hid it and moved the barrier further. [21:21:56] Some bots just eat ram when wikipedia is busy though :( [21:22:18] not really, these boxes don't even use that swap for most of time - just when some borked bot starts to eat tons of memory - the system doesn't break, and I can gracefully kill it and resume the operation [21:22:21] petan: Now the OOM wouldn't trigger until the box has been completely down due to trashing its poor little heart out. :-) [21:22:53] petan: But that's my point. If /you/ had to kill the job, then it means that the grid didn't do its job. :-) [21:23:08] Coren but your grid would do the same [21:23:12] how you prevent OOM [21:23:23] you said: let the system kill it / or die and move them to another node [21:23:25] that's the same [21:23:51] difference is instead of killing 1 process you would kill hundreds [21:23:51] petan: Oooh. No. I didn't make myself clear. When I said "the system" I meant "gridengine" not "the OS" [21:24:13] ok you can't move the process without having to terminate it [21:24:24] well you "can" [21:24:30] but... not simple [21:24:43] petan: No, but if it was nice enough to support checkpointing (easy for most bots) then it will be nice and clean. [21:25:04] ok don't forget to document it for bot devs [21:25:09] so they know how to write them [21:25:16] petan: I certainly plan to encourage maintainers to implement minimal checkpointing for long running tasks. [21:25:38] petan: Yeah, it's on my "documentation todo" for the next couple of weeks. [21:26:26] petan: It /is/ simple, in essense. "Catch SIGUSR1. If you get it, save enough state to be able to restart cleanly and exit without error; your job will then be restarted." [21:26:50] yes, but that is possible for wiki bots only [21:26:59] irc bots need to keep connection open [21:27:28] btw why RESTART option doesn't work [21:27:34] petan: Well yes, those are more complicated -- but it's unlikely that /they/ will be the one running out of resources. [21:27:36] we have it in long queue and when my job exit with 1 [21:27:40] it doesn't get restarted [21:28:08] petan: Odd. It works for me. [21:28:23] Coren no but when some process eat ALL memory on one of your nodes which have no swap - it's possible that system randomly kill that irc bot [21:28:33] it may be silly, but it happens [21:29:48] petan: You're missing the point; having swap doesn't help if you misconfigured something -- all it means is that instead of running out of core, you need to run out of core+swap during which time the system becomes nearly unusable anyways. :-) [21:30:31] Coren but that core+swap is usualy very temporary - it just prevent you from loosing whole OS [21:30:45] which is IMHO very bad [21:32:05] + if you knew how system is using swap, you would know that it first swap out "idle" memory - which is memory of processes that isn't often accessed, so first megabytes of swapped memory isn't really performance killer [21:32:33] windows for example are using swap even when they have lot of free memory [21:32:52] because unused allocated memory can't be used by any process - but it can be flagged as swapped [21:33:09] (basically swapping tons of 0) [21:33:53] petan: Actually, no, an allocated page that hasn't been dirtied will not be paged out, it'll just be deallocated unless it was mlocked [21:34:16] petan: Same with a COW page (think: text) [21:34:54] you can't just deallocate memory that was already allocated to some process without having to kill it, or bringing to unstable state [21:34:54] petan: Well, Linux. Dunno about Windoze memory management. [21:35:15] well not really expert on this [21:35:27] who knows, but from what i was reading - it shouldn't ever happen [21:35:34] petan: That's the whole /principle/ of demand paging. :-) You keep a record, but don't actually bring in the page from store / physically allocate until it is actually used. :-) [21:35:46] when OS tell the process that memory was allocated - it must be able to provide this memory to it [21:36:22] petan: No, actually, it doesn't have to. By /default/ the kernel doesn't allow you to overallocate, but that can be turned on. [21:36:32] petan: but "allocated" doesn't mean "currently in ram" [21:36:33] so from what I have read, this actually happens only when you have swap enabled, so that system is always sure there is some kind of reserve to use in case it ran out of ram [21:36:46] Coren of course [21:37:10] petan: Same goes with unused allocated memory. It doesn't actually need to be anywhere until there is an actual use of it. [21:37:17] Coren but when system say "memory was allocated" to process it MUST ensure that the amount of memory allocated is available for that process anytime is would like to use it [21:37:43] so it doesn't need to use it physically, but it must be sure that this memory is available somewhere [21:37:47] for example in swap [21:38:04] petan: No, actually, there is no need for it to be available anywhere; this is why you can actually overcommit if you want to. [21:38:29] petan: Of course, it's a bad idea in most cases to allow overcommit. :-) [21:38:51] ok, so when you have no swap, process ask for 10 mb of ram and you only have 2mb of ram free - you tell it"OK you have the memory" later on process will want to store 3 mb of data - what will happen? [21:39:17] (this is impossible in windows) [21:39:18] petan: If you have no swap, and the process asks for more than is available, then the sbreak() will just fail. [21:39:29] well, but process WON'T ask [21:39:31] it already asked [21:39:39] and system told it that it got the memory [21:39:43] it just wasn't used yetr [21:40:05] that's the point with swap you can swap out even UNUSED allocated memory [21:40:11] which is nothing [21:40:11] petan: ... then there was no overcommitment. The memory /is/ available and the OS was just using it for other things in the meantime which can be discarded (buffers, etc) [21:40:18] (no performance affected) [21:40:48] oh right, but when you are running out of memory - these buffers and so are discarded anyway [21:40:51] petan: No, you're completely missing the point. If you have 8G ram, then the OS will never accept to allocate more than that. [21:41:16] so you will get to the point when you will run out of memory but there will be still for example 100mb of unused but allocated memory [21:41:16] petan: (Unless you turn on overcommit) [21:41:34] and you won't be able to get rid of it, because that memory even if empty is already reserved by other processes [21:41:53] petan: I think that your definition of "unused" is something I don't get. [21:42:12] mhm I will find a link i've been discussing this on stack overflow [21:42:59] petan: Probably just trying to tell me what you mean by "unused" should suffice, because I don't think I get it. :-) [21:44:38] unused as a memory that was requested by some process, then allocated but process never used it [21:44:49] so it's a free memory which can't be used [21:45:23] Aah. Okay, well, it /is/ in use then. That can never lead to OOM. [21:45:50] If it's allocated, by definition it's used [21:48:19] * addshore waves [21:49:04] hehe, you tweaksed qstatus ? :D [21:49:08] petan: it wasn't related to the outage [21:49:22] the outage was due to multiple factors, mostly related to the pope [21:50:24] petan: FYI, the OOM killer can /never/ be invoked if you turn overcommit off entirely. [21:51:00] hm I can't find it but I found this :D [21:51:02] http://unix.stackexchange.com/questions/2658/why-use-swap-when-there-is-more-than-enough-ram [21:51:09] swap is actually useful [21:51:13] yes. [21:51:33] turning swap off is generally a bad idea [21:51:43] petan: That is true /only/ on physical machines. [21:51:49] Coren if I had to host my bot on a box with no swap I couldn't sleep [21:51:52] :P [21:52:05] petan: On a VM, you are better off allocating more "ram" and let the host handle it. [21:52:06] lot's of swap on vms causes nasty issues, though [21:52:12] yep [21:52:34] Coren that machine has no idea if it's virtual or physical and moving unneeded allocated RAM which is not being accessed at all to swap can free some ram [21:53:03] think about my sleeping too [21:53:09] I will have nightmares [21:53:40] petan: The fact that the machine doesn't know whether it's virtual or physical is the /reason/ why it's better to allocate more ram and not swap on a VM. The kernel couldn't make a reasonable decision. [21:54:00] petan: Wheras the vm /host/ does. [21:54:27] Coren but you will have less physical RAM without swap [21:54:32] petan: I am willing to bet you 10:1 that my setup with no overcommit allowed and no swap is more stable. :-) [21:54:37] because all these inactive processes will be in real ram [21:54:50] petan: No, they will be in "ram" [21:54:57] Coren we will see when addbot will start there [21:55:01] petan: Which the vm /host/ will page out. [21:55:11] either your grid will have to kill it [21:55:12] or it die [21:55:12] both will suck [21:55:31] Ryan_Lane wait a moment LOL [21:55:34] Ryan_Lane did you say that [21:55:41] Ryan_Lane outage was because of a pope? [21:55:42] :D [21:55:44] lol [21:55:45] yes [21:55:48] petan: Well, if it uses more ram than is available, sure it will. That would have happened anyways, if not at the hand of the gridengine, it would have been killed by the OOM killer or a sysadmin. :-) [21:55:51] well, related to it [21:55:57] haha [21:55:58] :D [21:56:06] so it was a pope [21:56:09] damn [21:56:21] petan: It's /always/ the fault of the pope. :-) [21:56:25] :P [21:56:37] icinga looks nice :O [21:56:37] that's a new answer for why it broke i will use [21:56:46] addshore thanks to the pope [21:57:00] addshore I wanted to talk with you [21:57:05] addshore but I forgot why [21:57:14] :D sorry, have been at work all day :P [21:57:19] same [21:57:23] but I have internet in work :D [21:57:32]