[04:02:32] How can I allocate an IPv6 address in a nova instance?
[04:03:28] can't yet
[04:03:53] oops. ok.
[04:04:17] it's in our sights, though
[04:47:54] !log deployment-prep rebooting deployment-lucene
[04:47:57] Logged the message, Master
[04:49:14] !log deployment-prep rebooting deployment-integration
[04:49:16] Logged the message, Master
[05:34:41] * Beetstra looks at the responsiveness of linkwatcher .. and grumbles
[05:35:09] :/
[05:35:47] lego, what username does your bot use, and what is the username of addshore's bot?
[05:35:59] Legobot and Addbot?
[05:36:33] thanks, see http://meta.wikimedia.org/w/index.php?title=User:LiWa3/Settings&diff=5317714&oldid=5316074
[05:37:00] ahhh
[05:37:45] * Beetstra is going to utterly ignore Addbot and Legobot :-D
[05:38:05] (they were whitelisted anyway .. but their edits do get parsed ..)
[05:40:06] oops, typo in bot
[05:40:08] :-p
[05:40:19] now it is not doing anything
[05:40:19] :-(
[06:31:17] legoktm, are Legobot and Addbot the two biggest editors?
[06:31:23] not even close
[06:31:32] highest speed at the moment .. I mean
[06:31:35] EmausBot had 21 million global edits
[06:31:36] oh
[06:31:38] probably
[06:32:15] I see Sk!dbot active on wikidata .. are you familiar with that one?
[06:32:26] yeah
[06:32:30] it's another import bot
[06:32:46] all of the bots on wikidata edit very fast
[06:32:46] high speed?
[06:32:52] since they don't have to fetch page content
[06:33:24] umm
[06:33:25] probably
[06:34:08] I am going to ignore that one for now as well .. when things get slower again, I will unignore them
[06:34:38] I count 69 epm at 6:32
[06:35:30] I had a wikiwide increase of edits from ~600 per minute to >2000 a minute .. but that is for the 95 wikis I parse ..
[06:35:43] heh
[06:35:43] for the past, say, 5 days
[06:35:53] the push to 7 million :P
[06:35:56] linkwatcher could not handle that ..
[06:36:23] you don't watch all wikis?
[06:37:17] all mediawiki wikis, within reason
[06:37:32] wikipedia, wikiquote, wikiversity, wikitravel, wiki ..
[06:37:54] aren't there 700+ wmf wikis?
[06:38:15] heh .. oh, I missed a digit .. 795
[06:38:49] :P
[06:40:26] oh, it is 794
[06:40:26] -> LW: 13 minutes 55 seconds active; RC: last 1 sec. ago; Reading ~ 794 wikis; Queues: P1=0; P2=2855; P3=1905 (7827 / -1); A1=0; A2=0 (2377 / -1); M=0; Total: 25043 edits (1799 PM); 669 IP edits (2.8%; 48 PM); Watched: 23523 (93.9%; 1690 PM); Links: 234 edits (0.9%; 16 PM); 547 total (39 PM; 0.02 per edit; 2.33 per EL add edit); 0 WL (0%; 0 PM); 0 BL (0%; 0 PM); 0 RL (0%; 0 PM); 0 AL (0%; 0 PM)
[06:41:23] how does it detect new links being added?
[06:41:29] does it fetch page text?
[06:41:52] No .. worse
[06:42:22] it pulls the previous and current revid through the parser in the API, rips out all external links, and compares the two lists
[06:42:53] page text would hide people adding YouTube templates ...
[06:42:59] oh hmmmm
[06:43:08] I know .. painful ..
[06:43:39] But it is running in full realtime ... unless you have some morons with high-speed bots editing cross-wiki on interwikis ...
[06:44:13] now it has a backlog of ~1.2 million edits to parse ...
[06:44:57] can't you just drop all of our edits from the queue?
[06:45:05] And XLinkBot depends on the real-time-ness of the linkwatcher (albeit only for en.wikipedia, that is why linkwatcher has 3 queues)
[06:45:40] That is the hack I now applied - the DiffReader module (that reads IRC) now bails out when the bots in the settings are editing
[06:45:55] So they don't reach the core, and hence not the parser
[06:46:02] OK, have to go, see you later!
[06:46:20] bye!
[08:08:32] hi
[08:10:24] addshore there is some problem with your script
[08:17:34] I fixed it
[08:23:43] Coren|Sleep check out /bin/usr/qstatus on bots, maybe you could use it as well
[08:23:56] just type qstatus to see
[08:43:29] * legoktm looks
[08:44:23] legoktm works?
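Beetstra's description above — pull the previous and current revid through the API parser, rip out the external links, and compare the two lists — can be sketched roughly like this. The endpoint and the `action=parse` / `prop=externallinks` parameters are real MediaWiki API features, but this is a minimal illustration of the approach, not linkwatcher's actual code:

```python
# Sketch of linkwatcher-style link-diffing: render each revision through
# the MediaWiki parser API, collect its external links, and compare.
# Error handling and rate limiting are omitted on purpose.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def external_links(revid):
    """Fetch the set of external links in one revision via action=parse."""
    query = urllib.parse.urlencode({
        "action": "parse",
        "oldid": revid,
        "prop": "externallinks",
        "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{query}") as resp:
        data = json.load(resp)
    return set(data["parse"]["externallinks"])

def added_links(old_links, new_links):
    """Links present in the new revision but not in the old one."""
    return set(new_links) - set(old_links)
```

Going through the parser (rather than raw wikitext) is what catches links added indirectly, e.g. via a YouTube template, as Beetstra notes.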
[08:44:49] isn't it just like a combination of qhost + # of total jobs?
[08:44:55] yup
[08:45:03] + qstat -j
[08:45:07] every second
[08:45:08] refresh
[08:45:15] ok
[08:45:33] I will rename it to qtop
[08:45:34] :P
[09:52:56] legoktm, addshore - after 3.5 hours linkwatcher reports ~500 edits per minute, which is more like what I had before
[09:53:02] :)
[09:53:20] And the bot is happily munching its MASSIVE backlog at the moment
[09:53:35] (which means that it has time to spare now)
[10:21:46] addshore ping
[10:22:15] !ping
[10:22:15] pong
[10:56:22] [bz] (UNCONFIRMED - created by: Damian Z, priority: Unprioritized - minor) [Bug 38792] Thumbnails are broken - https://bugzilla.wikimedia.org/show_bug.cgi?id=38792
[12:57:40] Coren: I'm trying to import my php pages for my tool but there is a problem. No error log is produced, how can I debug this?
[12:58:02] Hm. What are you using to import?
[12:58:26] I just copied my php files from public_html on toolserver.
[12:58:58] Oh, you mean there is a problem running them, not copying them. :-)
[12:59:20] Yes :) And I already imported my database, by the way.
[13:00:10] * Coren looks.
[13:00:56] Hm. There is no php_error.log in your tool home which means it doesn't even get that far.
[13:02:33] Aha. Possible problem no 1: your index.php is not owned by the tool. :-)
[13:02:56] So it wouldn't know where to send the php_error.log
[13:03:02] Er which one?
[13:03:12] Ah yes true.
[13:03:36] But that's not the page with issues :P [13:04:20] I'm not sure why you're not getting the messages, but I see two of them: [13:04:31] [Wed Mar 13 12:54:25 2013] [error] [client 10.4.1.89] PHP Notice: Undefined index: liste in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_formulaire.php on line 77, referer: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/ [13:04:39] [Wed Mar 13 12:54:25 2013] [error] [client 10.4.1.89] PHP Fatal error: Cannot redeclare dummy() (previously declared in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_chaines.php:10) in /data/project/anagrimes/public_html/wiktio/anagrimes_web/lib_chaines.php on line 11, referer: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/ [13:04:52] Where is this from? [13:04:55] * Coren looks into why you're not getting the errors. [13:05:22] That's the page with problems. [13:05:59] Yeah, I'm not sure why you're not getting the errors in your tool home, but that's what I'm seeing in the global log. [13:06:27] Wait, I think I may have redefined where the error log goes... [13:06:33] * Darkdadaah checks [13:07:32] Darkdadaah: Ah, yes, that'd explain it then. :-) [13:08:03] Hm no, I commented the line out. [13:08:11] Grrr [13:09:27] Oy, there's lots of ini_set for error_log in there. [13:10:38] Oh, found it then. [13:11:07] Or not. [13:11:52] Or yes. [13:12:22] Why did I put so many ini_set for errors :( ? [13:12:33] Coren: It works now. [13:12:46] Or at least, I have my error logs. [13:12:48] Because the ts doesn't have a default value that you can reach. [13:13:56] I had a hidden ini_set('error_log','old/path/on/toolserver') which sent the error log into limbo. 
[13:15:04] Thanks for your help, I can now continue debugging :P
[13:15:50] addshore ping
[13:22:03] Coren: ok it's alive now: http://tools.wmflabs.org/anagrimes/wiktio/anagrimes_web/chercher_anagrammes.php?mot=mariage&langue=fr&type=&flex=oui&gent=oui&nom_propre=oui&liste=table#liste
[13:28:41] Darkdadaah: Yeay! U can haz suksess!
[13:31:00] Darkdadaah: [in French] But why under wiktio/anagrimes_web?
[13:31:38] Coren: [in French] because I simply copy-pasted the directory structure I had on toolserver.
[13:31:58] [in French] I'll tidy that up.
[13:32:28] Darkdadaah: [in French] Aha! Watch out if the other directories have stuff that shouldn't go on the web, then.
[13:32:44] [in French] Everything was in public_html, so it's ok.
[13:33:10] [in French] Heh. Or at least, we presume so. :-P
[13:33:19] * Darkdadaah checks
[13:37:05] [in French] Nah, it's fine.
[13:37:24] [in French] ...oh hey, we're speaking French now.
[13:51:13] Darkdadaah: [in Spanish] I don't know what you're saying, speaking Spanish all the time.
[13:52:15] * Coren tries to confuse _everyone_!
[13:54:48] Coren: [in German] I don't understand you.
[13:57:01] Darkdadaah: [in German] My German is bad.
[14:00:06] [in Japanese] Me too.
[14:01:17] <^demon> I speak English. [in French] And very bad French.
[14:02:36] As long as we understand each other :)
[14:03:07] <^demon> Aking sa batas magsalita tagalog, hindi ko. (Google Translate)
[14:04:18] <^demon> Aking balae?
[14:04:21] <^demon> Maybe.
[14:04:25] !log bots restarted webserver: relax AllowOverride options
[14:04:27] <^demon> sa batas or balae.
[14:04:28] Logged the message, Master
[14:04:43] !log tools restarted webserver: relax AllowOverride options
[14:04:44] Logged the message, Master
[14:04:54] !log bots (Last log entry is a lie: wrong project)
[14:04:56] Logged the message, Master
[14:06:21] [bz] (RESOLVED - created by: Tim Landscheidt, priority: Unprioritized - enhancement) [Bug 46003] Relax restrictions on .htaccess - https://bugzilla.wikimedia.org/show_bug.cgi?id=46003
[14:14:47] Coren: You could always come over to the dark side ;)
[14:21:37] Damianz: /which/ dark side? :-)
[14:21:44] bots!
[14:22:07] Damianz: Heh. One set of problems at a time, preferably. :-P
[14:32:44] !log wikidata-dev wikidata-dev-9 Disabled WikibaseSolr so that test properties can be imported again
[14:32:47] Logged the message, Master
[16:26:24] @notify wm-bot
[16:26:25] This user is now online in #huggle so I will let you know when they show some activity (talk etc)
[16:26:27] :o
[16:26:28] hehe
[16:26:31] @notify addshore
[16:26:31] This user is now online in #huggle so I will let you know when they show some activity (talk etc)
[16:27:02] wm-bot: you didn't
[16:27:02] Hi petan, there is some error, I am a stupid bot and I am not intelligent enough to hold a conversation with you :-)
[16:46:04] * Silke_WMDE hates git when it's telling me about shallow stuff
[16:52:41] andrewbogott I'm getting a brand new error message when puppet tries to get me a mediawiki. "shallow file was changed during fetch" Have you seen that before?
[16:53:17] I haven't… is it happening on a fresh clone or an existing one?
[16:53:29] a fresh one
[16:53:45] an older instance, but I threw the wiki and the db away
[16:54:06] then it takes ages
[16:54:47] oh wait... maybe one of the two instances is succeeding now...
[16:55:07] Possible that it's a real error, if someone pushed right when you cloned
[16:55:25] ah, I see
[16:55:55] it happened twice or three times in a row
[16:56:56] but probably you're right, the second one seems ok now, too.
[16:57:12] I'm sorry for crying without a reason
[16:59:28] Silke_WMDE: No problem; there was some ML discussion about that shallow clone behavior this week so I might tinker with it later on.
[16:59:40] ah cool
[17:12:03] quick ? - just want to make sure there are no outstanding public ip requests
[17:53:45] addshore ping
[19:22:33] is addbot run in labs?
[19:23:17] petan, Damianz: ?
[19:59:03] Ryan_Lane: Not sure?
[19:59:29] I think it is
[20:00:53] Damianz: nevermind
[20:00:58] some folks wanted it disabled
[20:01:02] I don't see a reason for it
[20:01:40] If they want it disabled the standard process is to get its account blocked isn't it? Unless there's a very good logical reason I hold up the 'sorry, don't participate in censorship' board
[20:03:42] it was related to the site outage
[20:03:44] well
[20:03:51] they wanted it disabled so that it would stop editing
[20:03:58] I told them it doesn't make sense
[20:04:30] * Damianz nods
[20:46:05] Ryan_Lane well it eats a hell of a lot of resources
[20:46:10] no wonder if it brings the sites down
[20:46:12] :D
[20:46:24] he is spawning it in like 200 processes at once
[20:46:33] addshore ping
[20:46:49] Ryan_Lane was related to site outage?
[20:46:56] 200 is nothing
[20:47:01] you want to tell me that addbot brought the production down?
[20:47:42] Coren did you see qtop?
[20:47:55] cool if you could insert it into your cluster as well
[20:49:07] petan: I saw. Sounds like a neat thing, though I tend to use qmon myself. :-) Did you make a deb for it or is it just a self-contained /usr/local/bin script?
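For reference, the qstatus/qtop behavior described earlier (qhost output plus a total job count, refreshed every second) could look roughly like this. This is only a guess at the shape of petan's script, assuming GridEngine's `qhost` and `qstat` commands are on PATH:

```python
# Rough qtop-style refresh loop: clear the screen, show qhost output,
# and print a job count derived from qstat's default tabular output
# (two header lines, then one line per job). Parsing details are an
# assumption, not taken from the real qstatus script.
import subprocess
import time

def count_jobs(qstat_output):
    """Count job lines in qstat's default output, skipping the two header lines."""
    lines = [l for l in qstat_output.splitlines() if l.strip()]
    return max(0, len(lines) - 2)

def qtop(interval=1.0):
    while True:
        hosts = subprocess.run(["qhost"], capture_output=True, text=True).stdout
        jobs = subprocess.run(["qstat"], capture_output=True, text=True).stdout
        print("\033[2J\033[H", end="")   # ANSI: clear screen, cursor home
        print(hosts)
        print(f"total jobs: {count_jobs(jobs)}")
        time.sleep(interval)
```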
[20:58:31] Coren will put it into a deb and puppet
[20:58:49] I don't want to use qmon
[20:58:52] <3 shell
[20:59:44] I've been using graphical interfaces for long enough; when I discovered the power of shell I can't imagine how I could live without it
[20:59:46] I like qmon because I can just push it off to another monitor and keep an eye on it.
[21:00:10] right, but you can do the same with qtop if we make it more detailed
[21:00:27] everything that qmon shows can be displayed in a terminal except for graphs
[21:00:52] petan: Sure, but qmon also allows queue control. Different niches. :-)
[21:01:16] qconf and qdel and these tools too :OP
[21:01:19] (It also allows configuration, but why anyone would use that instead of qconf is beyond my comprehension)
[21:01:53] I just found out that doing stuff in shell is usually faster than nice gui's but maybe in some cases not... who knows
[21:02:06] but having both won't hurt
[21:02:15] unless qmon would eat tons of resources :P
[21:02:30] I am still thinking of cluster configuration with regard to various types of bots
[21:02:39] It's actually pretty lightweight.
[21:03:04] I am thinking that there could eventually be multiple queues or even separate clusters... rather than mixing all kinds of jobs on all boxes
[21:03:22] It'd be downright breezy if it didn't have that stupid splash screen on startup that takes forever to transmit for no good reason.
[21:03:23] because there are different kinds of bots eating different resources
[21:04:00] for example thanks to addbot we discovered how easy it is to overload the cluster or make the queue full and unusable
[21:04:10] petan: That's the reason why you really want to have as few queues as possible, actually.
It allows for more efficient resource usage (i.e., a memory-heavy bot can share nicely with a cpu-intensive but light-footprint bot)
[21:04:43] so, fyi, these old foo.labs.wm URLs are all redirected to just wikitech now: https://gerrit.wikimedia.org/r/#/c/53478/3/redirects.conf
[21:04:55] I think if there were different queues for heavy cronned jobs and lightweight bots which run all the time it would work better
[21:05:24] because some bots (irc bots for example) don't take it well when system performance changes rapidly
[21:05:32] while these heavy wiki bots don't care
[21:05:54] some irc bots may even disconnect from the irc network because of temporary lag caused by a huge number of heavy tasks
[21:06:42] if we had like a small cluster (1 - 2 very small boxes) for irc bots and a separate queue for heavy jobs, I believe it would be far more stable
[21:06:44] petan: That probably means you overcommit a bit, or that you should consider priority allocation instead. Hell, I'd consider tickets for bots that need interactive performance.
[21:07:46] sure you can mix them and they will work, but the problem is HOW they will work
[21:08:01] as you say, they are interactive - you don't care if a wiki bot responds to you in 1 second or 4
[21:08:04] petan: Honestly, I've never seen a grid setup where multiple queues didn't end up being a problem except in the rare case of hardware segregation.
[21:08:16] but you will care if an irc bot responds to you quickly or not
[21:08:37] petan: Right, so give interactive tasks higher priority, and a couple extra tickets.
[21:08:46] tickets?
[21:08:53] well, we can try
[21:09:05] like to mix some non-important irc bot with heavy jobs
[21:09:08] see how it will work
[21:09:30] if it works badly (I suppose it will) we can consider a different solution, and if it works fine, I don't care
[21:09:38] petan: Right.
[21:10:04] we could probably move morebots which is debianized already
[21:10:13] petan: But it's probably a good idea to give more priority to interactive tasks, and to reduce priority of heavy-duty tasks regardless of queue setup.
[21:10:29] can this be done on the level of a queue
[21:10:31] or a task?
[21:11:22] also, don't overestimate the linux kernel... however cool the grid is, it will still depend on the kernel... for example I saw that you have no swap on your boxes
[21:11:26] how do you handle OOM?
[21:11:45] I have 20gb of swap because anything is better than the kernel randomly killing tasks
[21:12:26] of course - the goal is to never need to use that swap
[21:12:34] but if something went terribly wrong it's good
[21:13:01] from experience - even if you are watching ram - you can always run out of it - for example when the gluster daemon fucked up and ate 60% of ram
[21:13:23] even if you restricted users from being able to use all ram - some system daemon can break
[21:13:26] petan: Gluster is teh evil. :-)
[21:13:32] so having swap as backup is useful
[21:13:51] Coren i had a box where gluster ate over 5gb of ram
[21:14:00] petan: In my experience, once a system starts thrashing it might as well be dead for jobs; better to kill the jobs and restart them automatically on another node.
[21:14:26] petan: But, ideally, you want quotas as well.
[21:14:33] Coren that is very bad for irc bots - for example wm-bot is like... never supposed to be restarted, ever :P
[21:15:14] it's being used in about 60 wm channels and it's logging some - when it's down people are really pissed because of holes in logs
[21:15:21] especially when there is some conference
[21:15:51] which was my point of having 2 clusters - heavy wiki bots don't care if you kill them and move them to another node, but irc bots do care
[21:15:59] petan: That's overly optimistic even in the best scenarios; you can reduce, not eliminate, downtime.
What we /can/ do is manage it, do proper checkpointing, and make sure it goes back up as fast as possible.
[21:16:11] and it's much more likely to have the box where wiki bots are running killed on OOM rather than the box where irc bots are running
[21:16:55] petan: I don't think it is. With resource quotas and ulimits, the only way the OOM will wake up is if the system itself goes bonkers, in which case it makes no difference.
[21:17:12] Coren of course, but wm-bot running on a separate box was able to have uptime over 120 days - I am really wondering if that will be possible on your cluster with no swap and similar backup solution if it ran together with addshore's ultimately heavy tasks
[21:17:46] I am pretty skeptical about it :P
[21:18:03] but we could move some morebots there to see
[21:18:21] if I saw it's really that stable I have no problems with that - BUT resource limits are evil
[21:18:30] i hope they are going to be per process and not per user
[21:18:34] petan: Well, I've had considerably more complicated usage patterns dealt with at Andritz, with jobs of the "you kill this job we lose $1M" variety. :-)
[21:18:54] Coren: Only $1M? pussy
[21:18:55] ;P
[21:18:58] :)
[21:19:13] Coren would you run these jobs on this cluster of yours? :P
[21:19:24] petan: They are going to be per job. It's resource allocation, not user limits; if your job needs more let it ask for more... and possibly have to wait until it's available. :-)
[21:19:42] btw I disagree with you that a system which has borked daemons is supposed to die
[21:19:55] you can always fix the system on the fly without having to reboot it
[21:20:03] petan: I'm not saying "supposed to die", just "not able to prevent"
[21:20:24] mhm... you can definitely improve the chances of being able to save it
[21:20:38] petan: No, I wouldn't run those CFD jobs on tools- yet. (a) Not done turing and setting resources, and (b) doesn't run on physical hardware.
:-)
[21:20:56] tuning*
[21:21:05] for example before I enabled swap on boxes people were bitching like every second day about boxes dying on OOM, and now - since then none of them has ever crashed on OOM
[21:21:31] petan: Thing is, you didn't fix the problem of overallocation, you just hid it and moved the barrier further.
[21:21:56] Some bots just eat ram when wikipedia is busy though :(
[21:22:18] not really, these boxes don't even use that swap most of the time - just when some borked bot starts to eat tons of memory - the system doesn't break, and I can gracefully kill it and resume the operation
[21:22:21] petan: Now the OOM wouldn't trigger until the box has been completely down due to thrashing its poor little heart out. :-)
[21:22:53] petan: But that's my point. If /you/ had to kill the job, then it means that the grid didn't do its job. :-)
[21:23:08] Coren but your grid would do the same
[21:23:12] how do you prevent OOM
[21:23:23] you said: let the system kill it / or die and move them to another node
[21:23:25] that's the same
[21:23:51] the difference is instead of killing 1 process you would kill hundreds
[21:23:51] petan: Oooh. No. I didn't make myself clear. When I said "the system" I meant "gridengine", not "the OS"
[21:24:13] ok you can't move the process without having to terminate it
[21:24:24] well you "can"
[21:24:30] but... not simple
[21:24:43] petan: No, but if it was nice enough to support checkpointing (easy for most bots) then it will be nice and clean.
[21:25:04] ok don't forget to document it for bot devs
[21:25:09] so they know how to write them
[21:25:16] petan: I certainly plan to encourage maintainers to implement minimal checkpointing for long-running tasks.
[21:25:38] petan: Yeah, it's on my "documentation todo" for the next couple of weeks.
[21:26:26] petan: It /is/ simple, in essence. "Catch SIGUSR1. If you get it, save enough state to be able to restart cleanly and exit without error; your job will then be restarted."
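Coren's checkpointing recipe above is easy to sketch. The JSON cursor file and the function names here are invented for illustration; a real bot would save whatever state it needs:

```python
# Sketch of the SIGUSR1 checkpointing recipe quoted above: on SIGUSR1,
# save enough state to restart cleanly and exit 0 so the grid restarts
# the job on another node. POSIX-only (SIGUSR1); state format is made up.
import json
import signal
import sys

STATE_FILE = "checkpoint.json"
_stop_requested = False

def _on_sigusr1(signum, frame):
    # Just set a flag; do the actual saving at a safe point in the loop.
    global _stop_requested
    _stop_requested = True

signal.signal(signal.SIGUSR1, _on_sigusr1)

def save_state(state, path=STATE_FILE):
    with open(path, "w") as f:
        json.dump(state, f)

def load_state(path=STATE_FILE):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"cursor": 0}          # fresh start

def process(item):
    pass                              # placeholder for the real per-item work

def main_loop(items):
    state = load_state()
    for i in range(state["cursor"], len(items)):
        if _stop_requested:           # grid asked us to move
            save_state({"cursor": i})
            sys.exit(0)               # clean exit -> job gets restarted
        process(items[i])
```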
[21:26:50] yes, but that is possible for wiki bots only
[21:26:59] irc bots need to keep the connection open
[21:27:28] btw why doesn't the RESTART option work
[21:27:34] petan: Well yes, those are more complicated -- but it's unlikely that /they/ will be the ones running out of resources.
[21:27:36] we have it in the long queue and when my job exits with 1
[21:27:40] it doesn't get restarted
[21:28:08] petan: Odd. It works for me.
[21:28:23] Coren no but when some process eats ALL memory on one of your nodes which have no swap - it's possible that the system randomly kills that irc bot
[21:28:33] it may be silly, but it happens
[21:29:48] petan: You're missing the point; having swap doesn't help if you misconfigured something -- all it means is that instead of running out of core, you need to run out of core+swap, during which time the system becomes nearly unusable anyways. :-)
[21:30:31] Coren but that core+swap is usually very temporary - it just prevents you from losing the whole OS
[21:30:45] which is IMHO very bad
[21:32:05] + if you knew how the system uses swap, you would know that it first swaps out "idle" memory - which is memory of processes that isn't often accessed, so the first megabytes of swapped memory aren't really a performance killer
[21:32:33] windows for example uses swap even when it has lots of free memory
[21:32:52] because unused allocated memory can't be used by any process - but it can be flagged as swapped
[21:33:09] (basically swapping tons of 0)
[21:33:53] petan: Actually, no, an allocated page that hasn't been dirtied will not be paged out, it'll just be deallocated unless it was mlocked
[21:34:16] petan: Same with a COW page (think: text)
[21:34:54] you can't just deallocate memory that was already allocated to some process without having to kill it, or bringing it to an unstable state
[21:34:54] petan: Well, Linux. Dunno about Windoze memory management.
[21:35:15] well I'm not really an expert on this
[21:35:27] who knows, but from what i was reading - it shouldn't ever happen
[21:35:34] petan: That's the whole /principle/ of demand paging. :-) You keep a record, but don't actually bring in the page from store / physically allocate until it is actually used. :-)
[21:35:46] when the OS tells the process that memory was allocated - it must be able to provide this memory to it
[21:36:22] petan: No, actually, it doesn't have to. By /default/ the kernel doesn't allow you to overallocate, but that can be turned on.
[21:36:32] petan: but "allocated" doesn't mean "currently in ram"
[21:36:33] so from what I have read, this actually happens only when you have swap enabled, so that the system is always sure there is some kind of reserve to use in case it runs out of ram
[21:36:46] Coren of course
[21:37:10] petan: Same goes with unused allocated memory. It doesn't actually need to be anywhere until there is an actual use of it.
[21:37:17] Coren but when the system says "memory was allocated" to a process it MUST ensure that the amount of memory allocated is available for that process anytime it would like to use it
[21:37:43] so it doesn't need to use it physically, but it must be sure that this memory is available somewhere
[21:37:47] for example in swap
[21:38:04] petan: No, actually, there is no need for it to be available anywhere; this is why you can actually overcommit if you want to.
[21:38:29] petan: Of course, it's a bad idea in most cases to allow overcommit. :-)
[21:38:51] ok, so when you have no swap, a process asks for 10 mb of ram and you only have 2mb of ram free - you tell it "OK, you have the memory"; later on the process will want to store 3 mb of data - what will happen?
[21:39:17] (this is impossible in windows)
[21:39:18] petan: If you have no swap, and the process asks for more than is available, then the sbrk() will just fail.
[21:39:29] well, but the process WON'T ask
[21:39:31] it already asked
[21:39:39] and the system told it that it got the memory
[21:39:43] it just wasn't used yet
[21:40:05] that's the point: with swap you can swap out even UNUSED allocated memory
[21:40:11] which is nothing
[21:40:11] petan: ... then there was no overcommitment. The memory /is/ available and the OS was just using it for other things in the meantime which can be discarded (buffers, etc)
[21:40:18] (no performance affected)
[21:40:48] oh right, but when you are running out of memory - these buffers and so on are discarded anyway
[21:40:51] petan: No, you're completely missing the point. If you have 8G ram, then the OS will never accept to allocate more than that.
[21:41:16] so you will get to the point when you will run out of memory but there will still be, for example, 100mb of unused but allocated memory
[21:41:16] petan: (Unless you turn on overcommit)
[21:41:34] and you won't be able to get rid of it, because that memory even if empty is already reserved by other processes
[21:41:53] petan: I think that your definition of "unused" is something I don't get.
[21:42:12] mhm I will find a link, i've been discussing this on stack overflow
[21:42:59] petan: Probably just trying to tell me what you mean by "unused" should suffice, because I don't think I get it. :-)
[21:44:38] unused as in memory that was requested by some process, then allocated, but the process never used it
[21:44:49] so it's free memory which can't be used
[21:45:23] Aah. Okay, well, it /is/ in use then. That can never lead to OOM.
[21:45:50] If it's allocated, by definition it's used
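A side note on the overcommit behavior Coren describes: with `vm.overcommit_memory=2` the kernel refuses allocations (ENOMEM) once committed memory reaches CommitLimit, which is roughly swap plus ram × overcommit_ratio / 100. Both counters are exposed in `/proc/meminfo` (`CommitLimit` and `Committed_AS` are real field names); this Linux-only sketch just reads them:

```python
# Read the kernel's commit accounting from /proc/meminfo to see how much
# allocation headroom remains under strict overcommit (overcommit_memory=2).
# Field names are real; the headroom calculation is a simplification.
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Field:   12345 kB' lines into a dict of kB values."""
    out = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                out[key.strip()] = int(parts[0])
    return out

def commit_headroom_kb():
    """kB of allocation still permitted before sbrk/mmap starts failing."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    return info["CommitLimit"] - info["Committed_AS"]
```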
[21:51:00] hm I can't find it but I found this :D
[21:51:02] http://unix.stackexchange.com/questions/2658/why-use-swap-when-there-is-more-than-enough-ram
[21:51:09] swap is actually useful
[21:51:13] yes.
[21:51:33] turning swap off is generally a bad idea
[21:51:43] petan: That is true /only/ on physical machines.
[21:51:49] Coren if I had to host my bot on a box with no swap I couldn't sleep
[21:51:52] :P
[21:52:05] petan: On a VM, you are better off allocating more "ram" and letting the host handle it.
[21:52:06] lots of swap on vms causes nasty issues, though
[21:52:12] yep
[21:52:34] Coren that machine has no idea if it's virtual or physical and moving unneeded allocated RAM which is not being accessed at all to swap can free some ram
[21:53:03] think about my sleeping too
[21:53:09] I will have nightmares
[21:53:40] petan: The fact that the machine doesn't know whether it's virtual or physical is the /reason/ why it's better to allocate more ram and not swap on a VM. The kernel couldn't make a reasonable decision.
[21:54:00] petan: Whereas the vm /host/ does.
[21:54:27] Coren but you will have less physical RAM without swap
[21:54:32] petan: I am willing to bet you 10:1 that my setup with no overcommit allowed and no swap is more stable. :-)
[21:54:37] because all these inactive processes will be in real ram
[21:54:50] petan: No, they will be in "ram"
[21:54:57] Coren we will see when addbot starts there
[21:55:01] petan: Which the vm /host/ will page out.
[21:55:11] either your grid will have to kill it
[21:55:12] or it dies
[21:55:12] both will suck
[21:55:31] Ryan_Lane wait a moment LOL
[21:55:34] Ryan_Lane did you say that
[21:55:41] Ryan_Lane outage was because of a pope?
[21:55:42] :D
[21:55:44] lol
[21:55:45] yes
[21:55:48] petan: Well, if it uses more ram than is available, sure it will. That would have happened anyways; if not at the hand of the gridengine, it would have been killed by the OOM killer or a sysadmin.
:-)
[21:55:51] well, related to it
[21:55:57] haha
[21:55:58] :D
[21:56:06] so it was a pope
[21:56:09] damn
[21:56:21] petan: It's /always/ the fault of the pope. :-)
[21:56:25] :P
[21:56:37] icinga looks nice :O
[21:56:37] that's a new answer for why it broke that i will use
[21:56:46] addshore thanks to the pope
[21:57:00] addshore I wanted to talk with you
[21:57:05] addshore but I forgot why
[21:57:14] :D sorry, have been at work all day :P
[21:57:19] same
[21:57:23] but I have internet at work :D
[21:57:32] I didn't today :/
[21:57:42] actually I applied for a job at WMDE haha :D
[21:58:11] when silke announced that TS admin job I couldn't resist :>
[21:58:16] you applied for the TS admin?
[21:58:17] hahaha
[21:58:28] but as part time only
[21:58:48] well I know it will only be temporary given that TS is going to be shut down
[21:59:32] addshore you had no internetz at work?
[21:59:46] you are living in the UK? what kind of country doesn't have internet in the office :D
[22:00:19] we are actually going to have internet even in the tube
[22:00:21] soon
[22:00:54] btw Ryan_Lane what kind of laugh was that :P
[22:01:19] well, the slated lifetime of toolserver is about as long as the contract
[22:01:29] mhm
[22:06:31] petan: well i did have internet but i was busy and away from a computer :P
[22:06:37] btw Coren you can always turn swappiness down to zero and keep a little swap just for the worst case
[22:06:47] that will address your performance reasons for not using it
[22:07:33] addshore what is wikidata-admin? wm-bot is there lol? :D
[22:07:37] that bot is everywhere haha
[22:08:22] its for wikidata admins ;p
[22:08:30] btw Coren you still didn't answer my question
[22:08:40] how are you handling OOM now?
how do you prevent it from happening
[22:11:39] addshore did you see qtop
[22:11:44] I enhanced ur script
[22:11:48] :o
[22:11:59] *checks qtop*
[22:12:09] hehe nice :P
[22:12:55] addshore I wanted to put the jobs which eat the most at the top
[22:12:58] like top does
[22:13:05] but qstat doesn't provide that info
[22:16:35] Coren http://serverfault.com/questions/218750/why-dont-ec2-ubuntu-images-have-swap
[22:16:42] this is also interesting, it's about ec-2
[22:16:54] "If some process blows up and you don't have swap space, your server could come to a crawling halt for a good while before the OOM killer kicks in, whereas with swap, it merely gets slow. "
[22:17:34] I really experienced an incredible stability improvement since I enabled it and absolutely no performance drop
[22:18:08] petan: Sorry, I was out eating. Which question?
[22:18:36] petan: That person had overcommit turned on.
[22:19:11] petan: No OOM because no overcommit. If a process tries to sbrk() more than is available, they get ENOMEM
[22:21:10] Ryan_Lane: Hi
[22:21:25] Ryan_Lane: you commented on UnwashedMeme's commit https://gerrit.wikimedia.org/r/#/c/49239/
[22:21:45] Perhaps you can elaborate on what you mean by "dictionary"?
[22:22:37] marc@tools-login:~$ sysctl vm.overcommit_memory vm.overcommit_ratio
[22:22:37] vm.overcommit_memory = 2
[22:22:37] vm.overcommit_ratio = 90
[22:23:06] Coren ok: imagine you have 200kb of free memory
[22:23:19] a system daemon will ask for 500kb, it will get ENOMEM?
[22:23:29] that sounds very very dangerous
[22:23:45] Wikinaut: Instead of $wgOpenIDConsumerForce = new OpenIDProvider('wpsite', 'My WP Site', 'My WP Site Username: ', 'http://example-wp-site-url.com/author/{username}/' ); use an associative array of key/value pairs.
[22:23:54] I think, or that's how I read it [22:23:57] these bots that are eating a lot of memory are not broken - they are eating the memory because they need it [22:24:01] they eat it very slowly [22:24:17] so it's possible that some other process, maybe system, will be faster when asking for the last piece of ram [22:24:19] petan: Then they need to have that memory /actually/ available [22:24:36] Coren how you make it so? [22:24:53] petan: job asks for 6G, is allowed no more than 6G [22:24:59] nooo [22:25:05] petan: and is put on a box with at least 6G available [22:25:07] job will keep asking for 200kb [22:25:11] multiple times [22:25:22] so many times that after 1 day of running it will eat few gb's [22:25:24] petan: You're missing the point. *Job* asks for 6G [22:25:30] petan: Not process. [22:25:43] ok, but... so what? [22:25:46] how does it matter [22:26:03] ok, job is allowed for 6gb only [22:26:13] you have 2 jobs, each of them eat 4gb [22:26:20] you are back in same problem [22:26:26] ... how? [22:26:36] because they together eat 8 gb [22:26:38] They'd be scheduled where there actually was 4G available. [22:26:48] i.e.: in this case, on different boxen. [22:27:12] eh [22:27:20] but you don't know how much of ram they will eat [22:27:27] before you launch them on physical nodes [22:27:31] The main issue I have for using grids is most my stuff needs to just run constantly... not be scheduled :( [22:27:41] Damianz +1 [22:27:45] *I* don't, that's why they will be launched with the amount they need. [22:27:46] same for me [22:27:52] Damianz: Hence you'd use the 'continuous' queue. [22:28:06] brb [22:28:40] Coren I can't explain that situation - but I believe it will happen on your grid, and then something will fail, I just hope it won't be whole OS [22:28:42] yeah but that doesn't really give me anything better than using a process supervisor... I can have god restart stuff if memory/cpu bloats.
The possible benefit is if a box dies it will restart somewhere else vaguely soon..ish [22:28:43] but it's very possible [22:40:01] back [22:40:27] Damianz: Initial tests say "vaguely soonish == less than 120s" [22:40:56] Damianz: Less than that if it was brought down on purpose as opposed to by box crashing. [22:41:33] Damianz: And, if you want to be extra nice, you can catch SIGUSR1 to have near-instant restart. [22:42:07] Well.. when you're using multiprocessing near-instant restart just isn't going to happen as you can't control your children [22:42:28] [bz] (NEW - created by: Arthur Richards, priority: Normal - normal) [Bug 40605] Supporting MobileFrontend on beta labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=40605 [22:42:55] Damianz: Actually, if you have groups of related jobs, you can tell the queue manager so it knows how to move them around as a unit properly. I'd be glad to show you how. [22:43:37] Not really separate jobs - the main process takes the wikipedia feed and spawns a subprocess to handle processing it because that sorta-works and sorta sometimes crashes the box [22:43:48] * Damianz notes to re-write it to a thread pool with a queue at some point [22:44:13] I could technically submit a new job for every subprocess instead but you'd get like thousands of jobs an hour [22:44:33] Damianz: That should actually be fairly easy to checkpoint; you have a known workflow with an obvious halting point. [22:44:58] Damianz: Remember, I'm actually supposed to help you make the most of the infrastructure too. I'll be happy to help you with it. [22:45:20] Incidentally, how long-lasting and heavy are the subprocesses? [22:45:33] not very long [22:45:58] Then it's probably not worth making jobs for each, even though that gives load balancing for free.
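Coren's "catch SIGUSR1 for near-instant restart" suggestion can be sketched in Python: the handler checkpoints the feed position to disk and sets a flag so the main loop can drain in-flight work and exit, after which the scheduler restarts the job (possibly on another box) and it resumes from the checkpoint. The class and file names here are hypothetical, not Damianz's actual bot:

```python
import json
import os
import signal


class FeedWorker:
    """Hypothetical feed consumer that checkpoints on SIGUSR1."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.position = 0           # how far into the feed we are
        self.stop_requested = False
        signal.signal(signal.SIGUSR1, self._on_sigusr1)

    def _on_sigusr1(self, signum, frame):
        # Save enough state to resume from, then ask the main loop to
        # finish (and reap) any children before exiting cleanly.
        with open(self.state_path, "w") as f:
            json.dump({"position": self.position}, f)
        self.stop_requested = True

    def resume(self):
        """Reload the last checkpoint, if one exists."""
        try:
            with open(self.state_path) as f:
                self.position = json.load(f)["position"]
        except FileNotFoundError:
            pass


worker = FeedWorker("/tmp/feedworker.state")
worker.position = 1234
os.kill(os.getpid(), signal.SIGUSR1)  # simulate the scheduler's signal
```

After the restart, the fresh process calls resume() and picks up where the old one left off.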
[22:46:02] basically the time to pull the data from en wiki api, toolserver api, core api, send the output to relay server and update mysql then die [22:46:33] So, atm, you fork for each of them and they exit, right? [22:46:47] double fork, yes [22:46:52] Why double fork? [22:47:05] I'd love to re-write it to use a queue and re-try failed edits... then I could just spin up workers to handle the queue as needed... because that's sexy but meh, no time [22:48:41] Because, if you don't double fork, all you need to do for clean checkpointing is, once you get a SIGUSR1, you save state, wait for all your children to be reaped, and exit. [22:49:23] SGE then just restarts you (possibly elsewhere) with you knowing where you were in the feed. [22:50:16] * Damianz thinks he might just re-write the whole thing to be worker orientated with a real time interface and then put it in the corner to never be used, because everything just works right now [22:50:32] :-) [22:50:34] SGE would be more useful for my review stuff if I ever can migrate that off gae [22:51:06] Since I could easily submit like a 500k jobs and have them run eventually [22:51:14] Well, don't hesitate to call on me if you need help. They actually pay me for this. :-) [22:51:36] * Damianz notes to donate so he can own Coren's ass [22:51:37] :P [22:56:08] legoktm: Ping [22:59:15] pong [22:59:33] pang? 
[22:59:38] peng [23:06:31] Krenair: I'm pretty sure this won't work: https://gerrit.wikimedia.org/r/#/c/53464/1/special/SpecialNovaInstance.php,unified [23:06:54] hm [23:07:08] it got rid of the exception for me [23:07:22] oooohhhhh [23:07:29] nevermind [23:07:42] I made it get the project from the instance instead of trusting the user's input :) [23:07:44] the execute function is what sets the project and region [23:08:39] I was looking at it thinking "you have to provide region and project to get instance info back" :) [23:09:13] yeah, this change looks good [23:09:24] Sorry to interrupt, but is tools-login supposed to drop my connection without an error when I SSH to it? [23:10:31] Coren: Hi [23:10:43] fwilson: I'd imagine not :) [23:10:59] it lets me in [23:11:01] legoktm: I see your job is done. did it complete normally? [23:11:17] I just get "Connection closed by 10.4.0.220", no "authentication failed: publickey" or similar [23:11:19] No my script crashed [23:11:24] But it was my code's fault [23:11:30] I just haven't gotten around to fixing it yet [23:11:37] fatal: Access denied for user fwilson by PAM account configuration [preauth] [23:11:42] I'd imagine you aren't in the project [23:11:49] Oh lol, i should have thought of that [23:12:01] Do webtools-ish things work there? [23:13:20] legoktm: Just checking that it wasn't my fault. :-) [23:13:23] Ah, there's a webserver so I'd assume so, would someone mind adding me to the tools project? [23:13:37] fwilson: Sure thing. [23:13:41] Coren: looks like you have a willing guinea pig ;) [23:13:45] Thanks! [23:14:02] fwilson: What's your username on wikitech? [23:14:08] Fox Wilson [23:14:33] fwilson: You now exist. :-) [23:14:52] fwilson: I'll also need a name for your tool; that's going to be the tool's username and part of the url. [23:14:53] it would be nice for that interface to do ajax user lookup while typing [23:15:01] well, any of the interfaces that need it [23:15:01] Ryan_Lane: It would indeed. 
:-) [23:15:15] I wonder how hard that is to add [23:15:26] because that annoys the shit out of me [23:15:37] Coren: voxelbot [23:15:56] btw, instance pages, better? https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000003a [23:16:27] Ryan_Lane: Is the pretties. [23:17:14] I really need to move project pages into the main namespace [23:17:51] fwilson: You'll need to log off then back in to get the new groups. [23:17:53] * Ryan_Lane adds a bug [23:18:52] Coren: Excellent, thank you. [23:19:16] I don't have an info dump ready, but the short of it is: you want to sudo -iu local-voxelbot [23:19:39] from that user, you have a mysql db available, and the public_html and cgi-bin do what you'd expect [23:21:04] Alright, thanks. [23:23:14] Change on 12mediawiki a page Wikimedia Labs/Interface usability improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=659522 edit summary: [+157] [23:24:49] http://www.stumbleupon.com/su/3gqjf4/:T.NALV30:XeZd$w2d/fullsmile.info/the-benefit-of-being-an-programmer/ [23:24:52] I guess python is through mod_wsgi? [23:27:28] Change on 12mediawiki a page Wikimedia Labs/Interface usability improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=659523 edit summary: [+163] /* Content organization and improvement */ [23:28:08] fwilson: mon_suphp, actually, but same result. [23:28:27] mod* [23:30:49] fwilson: You want the script owned by the bot account, not you. :-) [23:30:59] Coren: oh, that might help :) [23:32:15] s/bot account/tool account/ [23:32:19] Gotta get used to that. :-) [23:32:40] I can't change the ownership of the script... [23:32:55] And i'm not in sudoers [23:32:56] fwilson: Did you sudo to the bot account? [23:33:01] I can't [23:33:08] sudo -iu local-voxelbot [23:33:11] That fails? [23:33:28] I'm too used to bots project :) [23:33:55] Heh.
The advantage of a user-per-tool is that more than one maintainer can then sudo to the tool [23:34:16] You won't be able to take ownership, but you can make a copy the tool will own and rm the other one [23:34:38] Change on 12mediawiki a page Wikimedia Labs/Interface usability improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=659525 edit summary: [+58] /* Content organization and improvement */ [23:35:00] fwilson: Almost there. /usr/bin/python rather than /usr/local/bin/python [23:35:01] Change on 12mediawiki a page Wikimedia Labs/Interface usability improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=659526 edit summary: [-16] /* Content organization and improvement */ [23:35:23] Nope, still 500ing [23:35:33] [Wed Mar 13 23:35:17 2013] [error] [client 10.4.1.89] malformed header from script. Bad header=Hello, world: test2.py [23:35:44] But... but... oh ok [23:35:58] YAY! [23:36:00] You need a blank line after the Content-Type. :-) [23:36:04] :) [23:36:19] You haz a suksess? [23:36:21] yes [23:36:30] Change on 12mediawiki a page Wikimedia Labs/Account creation improvement project was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=659527 edit summary: [-109] /* Current account creation process */ [23:36:39] yes i did, so now everything should work [23:36:45] If you need the mysql db, the credentials are in the bot's ~/.my.cnf [23:36:53] (i.e.: just 'mysql' works) [23:37:16] Alright, good, I tend to use sqlite [23:37:23] That works also. [23:37:25] Oh good it's installed :) [23:37:37] You're not the only one using it. :-) [23:38:42] Good, everything that I'm planning/have done should work then [23:39:25] If you hit a dependency that's not there, just poke me on IRC or open a bugzilla if I don't seem around. I have a quick turnaround. [23:39:26] Alright, I will do that.
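The "malformed header from script" error above is the classic CGI mistake Coren points out: a blank line must separate the header block from the body, otherwise Apache reads the first body line as a header. A minimal sketch of the fix (the helper name is illustrative):

```python
import sys


def cgi_response(body, content_type="text/plain"):
    """Build a CGI response: header lines, one blank line, then the body.

    Without the empty line after Content-Type, Apache treats the first
    body line as a header and logs "malformed header from script".
    """
    return "Content-Type: {0}\n\n{1}".format(content_type, body)


sys.stdout.write(cgi_response("Hello, world\n"))
```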
[23:40:18] Time to migrate from mod_python :) [23:42:01] fwilson: Last hint, scripts need to be +x to be, well, executable. :-) [23:42:28] I'm much too used to mod_python (it's evil) :) [23:43:08] Typo: #! at the start, not just # [23:43:29] * Coren is watching you. Muahaha. [23:43:43] Sadly, the error_log can't be split the way the access log can. :-( [23:44:00] How did I miss that @_@ [23:44:54] Hm. I really need to find a way to hack around the error_log problem for non-php stuff [23:45:02] (php has its own error_log that is configurable) [23:45:19] * Coren ponders. [23:45:33] fwilson: Atm: 'ascii' codec can't encode character u'\\xe9' in position 5910: ordinal not in range(128) [23:45:38] is your current problem. [23:45:43] but it worked on my server... [23:45:51] *sigh* [23:45:58] What line is it? [23:46:11] File "/data/project/voxelbot/cgi-bin/recentchanges.py", line 52 [23:47:03] Can you give me the line again, I just removed some code [23:47:14] File "/data/project/voxelbot/cgi-bin/recentchanges.py", line 39 [23:47:29] 'ascii' codec can't encode character u'\\u2013' in position 4975: ordinal not in range(128) [23:47:48] Looks like it thinks that your output character encoding is ascii [23:48:00] Yay, it works now. Silly unicode [23:48:22] just kidding... it's erroring again [23:48:41] 'ascii' codec can't encode character u'\\xe1' in position 2314: ordinal not in range(128) [23:48:50] same thing then [23:48:52] Yeah, you definitely have a locale problem. [23:49:10] Does your code set the locate or does it just presumes one from the environment? [23:49:13] locale* [23:49:28] Just presumes one atm [23:49:36] I'm thinking you want the python-analogue of setlocale() [23:49:56] IIRC, scripts are executed in the "C" locale. [23:50:11] believe it or not, the python equivalent is setlocale() [23:50:12] :) [23:50:28] You probably want "en_US.UTF-8" [23:50:49] Or you know, [23:50:53] Just use Python3. [23:51:12] Is python 3 on tools? [23:51:28] Not... yet. 
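Coren's diagnosis fits: CGI scripts start in the "C" locale, where Python's default output encoding is ASCII, so characters like é (U+00E9) or – (U+2013) raise the "'ascii' codec can't encode" error. Besides the setlocale() route, one workaround is to encode output as UTF-8 explicitly so the environment's locale no longer matters; a sketch against an in-memory stream (in a real CGI script the wrapped stream would be stdout's binary layer):

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so all text written through it is encoded
    as UTF-8, regardless of what locale the CGI environment set."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


# A BytesIO stands in for the CGI output stream here.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"caf\u00e9 \u2013 voil\u00e0\n")  # the characters that broke
out.flush()
```

The setlocale() alternative Coren mentions only works if the target locale (e.g. en_US.UTF-8) is actually generated on the host.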
:-) [23:52:00] * fwilson watch --interval 5 which python3's [23:52:22] Coren: Mind installing it? :P [23:52:40] Checking now for conflicts. [23:53:03] No breakage detected. Installing. :-) [23:53:24] Yay! [23:54:21] Okay, it's showing up [23:54:34] marc@tools-login:~$ which python3 [23:54:34] /usr/bin/python3 [23:55:11] Are dumps in the same place as bots? [23:55:22] legoktm: Yep [23:55:30] Awesome. [23:55:53] Oh and where does public_html end up at? [23:55:53] In /public/datasets/public/ [23:55:57] legoktm: tools.wmflabs.org/toolname [23:56:06] legoktm: tools.wmflabs.org/toolname/ [23:56:07] Oh [23:56:12] Not ~toolname ? [23:56:22] Nope, no squiggly [23:56:25] Tildes are uglies. [23:57:35] legoktm: Directory index forbidden by Options directive [23:58:09] Oh, a 500 again. [23:58:10] Err...? [23:58:22] fwilson: [23:58:26] [Wed Mar 13 23:57:50 2013] [error] [client 10.4.1.89] File "/data/project/voxelbot/cgi-bin/recentchanges.py", line 20 [23:58:26] [Wed Mar 13 23:57:50 2013] [error] [client 10.4.1.89] if i[u'type'] == "edit": [23:58:26] [Wed Mar 13 23:57:50 2013] [error] [client 10.4.1.89] ^ [23:58:26] [Wed Mar 13 23:57:50 2013] [error] [client 10.4.1.89] SyntaxError: invalid syntax [23:58:43] * Coren *really* needs to find a way to also split the error log. [23:58:46] Oh, python3 handles everything as unicode [23:58:52] Heh, sorry :) [23:59:21] No need to be sorry. It's an Apache 2.2 limitation.
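The closing SyntaxError is a Python 3 wrinkle of the era: the u'' string prefix was removed in Python 3.0 and only restored in 3.3 (PEP 414), so on the Python 3 just installed the fix was simply to drop it, since every str literal is already unicode. A sketch with a made-up recent-changes entry:

```python
# In Python 3, str literals are unicode, so no u'' prefix is needed
# (and on Python 3.0-3.2, u'...' was a SyntaxError).
change = {"type": "edit", "title": "Montr\u00e9al"}  # hypothetical RC entry

if change["type"] == "edit":
    summary = "edited {0}".format(change["title"])
```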