[00:29:57] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c6 (bgwhite) (In reply to Marc A. Pelletier from comment #5) Any status update on documentation? I can't use 'jlocal' as I keep getting: "/usr/bin/jlocal": No such file or directory
[00:33:12] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c7 (Betacommand) Where are you using it? on your crontab or your submitted jobs?
[00:59:12] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c8 (bgwhite) On queue machines and tools-login. On crontab and submitted jobs.
[01:01:57] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c9 (Betacommand) thats your problem, jlocal should only exist on the -submit host. you basically have a wrapper script you invoke that submits other jobs. if you use jlocal in your crontab to start t...
[01:06:57] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c10 (bgwhite) That still doesn't solve my original question. jlocal is not found anywhere. I can't use it on any host as it is not found. I can't use it on the submit host as it is not found. Wh...
[01:08:41] Wikimedia Labs / tools: jsub not installed on the queue machines - https://bugzilla.wikimedia.org/64988#c11 (Betacommand) jlocal is on the submit host, I used it daily and just got a email from a cron using it ~2 minutes ago
[01:09:06] is bgwhite on IRC?
[01:22:20] Betacommand: Seldomly.
[01:22:48] All web tools are down?
[01:22:56] YuviPanda: Awake?
[01:23:02] scfc_de: yup
[01:23:22] sshing
[01:24:17] scfc_de: seems up?
[01:24:25] scfc_de: although I was seeing '2014/05/20 01:23:52 [alert] 18840#0: 768 worker_connections are not enough'
[01:24:28] a lot on it
[01:25:29] I could have sworn ... ?! Now works for me as well.
[01:25:41] scfc_de: I am sure there were difficulties.
[01:25:44] scfc_de: I did restart nginx
[01:31:56] scfc_de: looks like our nginx needs tuning
[01:33:26] scfc_de: http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/ is a nice read, I should spend some time on it tomorrow
[01:36:40] YuviPanda: That sounds like fun, so I'll leave it for you :-).
[01:36:46] scfc_de: :D
[03:44:51] Hi there. Is there an admin around who could approve this? https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Nischayn22 I'm working with Nischay on a reference tool which will query OCLC's WorldCat API as part of The Wikipedia Library.
[08:17:58] (Abandoned) Hashar: Add usage() to take(1) [labs/toollabs] - https://gerrit.wikimedia.org/r/70058 (owner: Platonides)
[08:18:01] (Abandoned) Hashar: Make take non-recursive by default. [labs/toollabs] - https://gerrit.wikimedia.org/r/70059 (owner: Platonides)
[08:18:15] (Abandoned) Hashar: Implementing verbose messages. [labs/toollabs] - https://gerrit.wikimedia.org/r/70107 (owner: Platonides)
[08:18:55] (Abandoned) Hashar: WORK IN PROGRESS: Add robots.txt [labs/toollabs] - https://gerrit.wikimedia.org/r/77916 (owner: Tim Landscheidt)
[09:07:55] Wikimedia Labs / tools: Where to put OSM hillshading tiles - https://bugzilla.wikimedia.org/65519 (nosy) NEW p:Unprio s:normal a:Marc A. Pelletier I synced the OSM hillshading tiles off Toolserver and its sitting currently in my home. I'd love to copy it into a production place but I have n...
[09:10:09] Wikimedia Labs / tools: Where to put OSM hillshading tiles - https://bugzilla.wikimedia.org/65519#c1 (nosy) Forgot to say its about 270GB of space needed.
[09:23:23] Wikimedia Labs / tools: Copy contents of https://svn.toolserver.org/ to Wikimedia git - https://bugzilla.wikimedia.org/58801#c14 (nosy) I guess I will put all the stuff I backup to Labs first in my home.
If this gets unhandy just let me know.
[09:24:35] Is there an issue with the web services? When reloading a page three times, I get 504 on http and a "Fehler: Datenübertragung unterbrochen" on HTTPS
[09:24:58] https://tools.wmflabs.org/magnustools/ for example
[09:25:48] through a proxy, I get 502 Bad Gateway nginx/1.0.15
[09:30:12] yes there is ... failures have been intermittent and are getting worse
[09:31:29] I am unable to work with my wiki.
[09:33:38] Wikimedia Labs / tools: Where to put OSM hillshading tiles - https://bugzilla.wikimedia.org/65519#c2 (Alexandros Kosiaris) Already talked with nosy on IRC. The best place would be /data/project/tiles on the maps project VMs
[09:46:23] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap en was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1010132 edit summary: /* Schedule */ some updates on what is done
[09:49:36] Change on mediawiki a page Wikimedia Labs/Tool Labs/Roadmap de was modified, changed by Silke WMDE link https://www.mediawiki.org/w/index.php?diff=1010135 edit summary: /* Zeitplan */ Aktualisierungen:was ist erledigt?
[10:26:14] hello
[10:26:18] checking out the proxy now
[10:26:47] proxy should be slightly better now.
[10:26:51] it's just buckling under load
[10:26:57] rillke: GerardM- ^
[10:27:36] thanks, yuvi
[10:27:40] can you verify?
[10:27:54] it needs some tuning, looks like. let me tweak some settings
[10:28:30] yes, wiki loads now
[10:28:38] great!
[10:29:22] ok
[10:36:37] YuviPanda: hey my proxy issue?
[10:37:04] liangent: I couldn't figure out a way to check if upstream hasn't sent any data :(
[10:37:16] and I also haven't had much time to look, between firefighting tools proxy and also android app :(
[10:37:17] sorry!
[10:37:57] YuviPanda: by checking content-length?
[10:54:56] liangent: hmm, that doesn't sound bad.
let me see if I can put up a custom error handler
[10:55:00] but first fixing the proxy itself
[10:55:58] thanks YuviPanda
[12:37:50] so I have installed goaccess on the proxy
[12:37:52] interesting logs
[12:38:23] imagemapedit seems to be among the most requested things
[12:38:39] YuviPanda: did traffic just coincidentally surge? Or does the new nginx version have worse performance somehow?
[12:38:58] andrewbogott: I am unsure. Perhaps worse default config?
[12:39:14] andrewbogott: I've a patch with some tuned config params
[12:39:27] Are you still hopeful that you'll be able to speed things up?
[12:39:49] ok
[12:40:14] andrewbogott: https://gerrit.wikimedia.org/r/134328
[12:40:57] andrewbogott: if this doesn't work let's revert to the older version and let it be. I'll also work with Coren on setting up a test tools proxy that's an exact copy of the live one
[12:43:38] Wikimedia Labs / tools: Move wiki.toolserver.org to WMF - https://bugzilla.wikimedia.org/60220#c26 (Silke Meyer (WMDE)) Reedy, I would like to communicate a "freeze" date for the wiki before it can't be edited any more. When will you be bale to migrate it and so when could that deadline be? End of May?...
[12:45:23] Wikimedia Labs / tools: Where to put OSM hillshading tiles - https://bugzilla.wikimedia.org/65519 (nosy) NEW>RES/FIX
[12:47:04] YuviPanda: probably have to force an nginx restart after that change applies
[12:47:14] andrewbogott: shouldn't the Notify take care of that?
[12:47:27] I'm running puppet
[12:47:30] Oh, you're right.
[12:47:31] Yes it should
[12:47:33] on the box
[12:48:05] I updated the labs proxy already, seems fine.
[12:48:15] Although there weren't perf issues there anyway
[12:48:30] yeah
[12:56:29] andrewbogott: seems stable now.
[12:56:37] andrewbogott: thoughts on adding another proxy box and DNS load balancing between them?
[12:56:43] should be trivial to do from the proxy side
[12:56:51] YuviPanda: will you respond to the email thread with your thoughts?
[12:57:05] andrewbogott: I will in about 5 mins.
[12:57:06] YuviPanda: I certainly don't object to setting up another VM if you think it will help.
[12:57:42] andrewbogott: I think the current patch itself will help, since there was no load spike in the VM, and also the error logs explicitly pointed out that nginx was running out of open FDs
[12:58:28] That seems pretty definite! Seems weird though, it's not like our current use case is that extreme.
[12:58:59] andrewbogott: true, but we do open extra FDs in redis connections, and proxying implies at least 2 fds per connection
[12:59:11] so that's 3 FDs per connection.
[12:59:36] hm, ok
[12:59:47] andrewbogott: magnus reports it is down again, I can't repro it
[13:00:10] I am tailing logs as well and got nothing
[13:01:00] YuviPanda: ok, I emailed him asking him to join us on IRC
[13:01:06] andrewbogott: ok!
[13:01:44] andrewbogott: is there any way I can get notified when tools webproxy is down?
[13:02:25] hey hedonil. is tools still down for you? is up for me and error logs are clean
[13:02:44] YuviPanda: Doesn't icinga already check it? If not, it should.
[13:02:49] YuviPanda: yep, it's down for newly restarted webservices
[13:02:58] hedonil: ow. example?
[13:03:05] http://tools.wmflabs.org/newwebtest/blame/
[13:03:17] hedonil: it shows me x's tools?!
[13:03:45] YuviPanda: :-) right at moment...
[13:03:56] hey Coren. I don't have a way for icinga to notify me in particular, and I don't know of labs' icinga status
[13:03:59] hedonil: so it is up? :)
[13:04:01] YuviPanda: there are some intermittend "hangs"
[13:04:12] hedonil: yeah, that was probably the nginx restart to apply the patch
[13:04:28] hedonil: That may be the webservice /itself/ running out of resources. The defaults are rather conservative.
[13:05:04] Coren: hmm, rather no
[13:05:15] hedonil: Checking the tool's error log should reveal this; lighttpd complains loudly when that happens.
[13:05:40] That URI works for me.
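The file-descriptor arithmetic from the 12:58:59 and 12:59:11 messages (two FDs for the proxied connection plus one for the redis lookup) can be turned into a rough capacity estimate. This is only a sketch: the function name is hypothetical, and the assumption that every client costs exactly three descriptors comes from the chat, not from measuring the proxy.

```python
def max_proxied_clients(fd_limit, fds_per_client=3, reserved_fds=0):
    """Estimate how many simultaneous clients fit under a descriptor limit.

    fds_per_client=3 is the figure quoted in the chat (client socket,
    upstream socket, redis connection); it is an assumption, not a measurement.
    reserved_fds covers log files, listening sockets and similar overhead.
    """
    usable = fd_limit - reserved_fds
    if usable <= 0:
        return 0
    return usable // fds_per_client


# Applying the same division to the 768 figure from the earlier nginx alert,
# under the assumption that each client consumes three of those slots:
print(max_proxied_clients(768))
```

Under that assumption the budget of 768 covers 256 simultaneous clients, which would explain why a moderate traffic level was enough to exhaust it.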
[13:05:54] * hedonil checks this nearly every second, as I'm testing things
[13:06:27] YuviPanda: http://tools.wmflabs.org/catscan2
[13:06:30] YuviPanda: hangs
[13:06:53] ... ah, I see issues as well. The connections regularily just 'sit there', and that's with a long-running webservice.
[13:07:35] and this happed occasionaly with other tools, too
[13:08:31] hmm, error logs still show nothing
[13:08:40] I don't know if this is just the backend server time outing
[13:09:39] it now redirects to http://tools.wmflabs.org/magnustools/
[13:09:42] is that accurate?
[13:10:04] and http://tools.wmflabs.org/catscan2/catscan2.php works alright
[13:10:37] access logs say things are fine, error logs don't indicate anything wrong
[13:10:59] * YuviPanda considers downgrading to nginx 1.5 and just leaving it there.
[13:11:49] YuviPanda: even if logs say no, users say yes :P
[13:12:02] I'm not denying problems, hedonil. Just frustrated I can't diagnose them
[13:12:19] YuviPanda: just kidding a bit
[13:12:24] :)
[13:12:29] hedonil: they seem to be up for me now.
[13:12:41] * andrewbogott is reluctant to move back from an official release back to a home-made one :(
[13:12:53] yeah, me too andrewbogott
[13:13:42] YuviPanda: if you're in the middle of testing an app, and somthing hangs, you dig deep into shit (thinking it's your crapy code...)
[13:14:09] hedonil: yeah, finding out that it is infrastructure and not your code can be frustrating
[13:14:38] I'm tailing catscan's logs too, no errors there
[13:15:15] YuviPanda: /right now/ it seems to be ok.
[13:15:44] YuviPanda: but I'll cry loud if it happens again :-)
[13:17:43] YuviPanda: o/: hangs. http://tools.wmflabs.org/newwebtest/ec/
[13:17:49] at the very moment
[13:18:48] hedonil: what browser are you using, btw? hangs for me in FF but not in chrome
[13:18:59] Coren: wasnt the new web supposed to be better?
All Ive seen are headaches
[13:19:03] loads just fine for me in ff :(
[13:19:06] YuviPanda: chrome & firefox
[13:19:46] Betacommand: It is, an it was, but it looks like the upgrade to the latest version has issues.
[13:19:54] YuviPanda: but you're right, mostof the issues are in chrome
[13:20:25] andrewbogott: interesting. I am getting a bunch of 499 response codes in the access logs
[13:20:34] 499?
[13:20:57] client closed connection
[13:20:59] 'Used in Nginx logs to indicate when the connection has been closed by client while the server is still processing its request, making server unable to send a status code back.'
[13:21:07] In this case is 'client' the browser or the backend service?
[13:21:07] and there's a bunch of them
[13:21:26] browser, IIRC
[13:22:43] apparently caused if the server's timeout is longer than the browser's, and this would potentially mean that our backend connections are being saturated?
[13:22:44] YuviPanda: that would happen if the user sees a 'hang' and gives up
[13:22:57] So it could be a secondary symptom
[13:23:03] true, but there are a *lot* of these
[13:23:14] * Coren ponders.
[13:23:16] tail -f /var/log/nginx/access.log | grep 499 gives me a LOT of them
[13:23:25] ok, so probably not user behavior
[13:23:31] yeah
[13:23:39] YuviPanda: Have you looked in the redis logs to see if /it/ fails to return the right value in time?
[13:23:41] Your patch just shortened the server timeout though, didn't it?
[13:23:59] andrewbogott: that's the keepalive timeout
[13:24:03] not the server timeout
[13:24:07] oh, ok
[13:24:23] Coren: redis logs are empty, and am doing a monitor and it seems fine
[13:25:23] andrewbogott: I could downgrade and see if the 499s persist?
[13:25:32] if they do the problem is elsewhere, if not then nginx bug.
[13:25:40] sure.
[13:25:45] Is this still 1.7?
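A plain `grep 499` as used at 13:23:16 also matches lines where "499" appears in a URL or a byte count, so the real number of 499 responses may be lower than it looks. The sketch below counts status codes by parsing the status field instead. It assumes the access log uses nginx's default "combined" format; the proxy's actual log format is not shown in the chat, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the start of a "combined" format line up to the status code:
#   addr - user [time] "request" STATUS bytes ...
LINE_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')


def count_statuses(lines):
    """Return a Counter mapping status code strings to occurrence counts."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines in another format are skipped, not guessed at
            counts[match.group(1)] += 1
    return counts


# Usage against a log file (path as quoted in the chat):
#   with open("/var/log/nginx/access.log") as log:
#       print(count_statuses(log).most_common())
```

Because there was no baseline for 499s, running the same count over the archived logs would give a per-day figure to compare against.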
[13:25:50] andrewbogott: yup
[13:25:57] I think you should try 1.6 first
[13:26:00] since it's 'stable'
[13:26:26] 1.7 is just 1.6 with one series of patches, but let me see if I can find a build
[13:26:30] Unless, oh, was 1.6 unavailable?
[13:26:39] andrewbogott: can't find debs inside http://nginx.org/packages/ubuntu/
[13:27:18] andrewbogott: nevermind, found 'em
[13:27:29] andrewbogott: or not. no -extras 1.6
[13:27:37] so not useful
[13:28:20] andrewbogott: found
[13:28:21] https://launchpad.net/~nginx/+archive/stable/+packages
[13:30:16] andrewbogott: moved to 1.6
[13:33:19] no drop in 499s. downgrading to 1.5 to check.
[13:33:27] we also have no baselien for 499s from before, so this is frustrating.
[13:34:43] aha!
[13:34:58] I hadn't restarted nginx after the downgrade to 1.6, just did that.
[13:35:23] 499s seem to have slowed down, but still around
[13:35:43] hedonil: are things any better now?
[13:36:48] * hedonil checks
[13:37:36] 499s not tool specific either, see some for /
[13:39:13] andrewbogott: YuviPanda: "hang" rate dropped to ~ 1:50. A good ratio for today ,-)
[13:39:28] awww / lol / :'(
[13:39:28] :)
[13:39:52] let me test something
[13:41:35] just disabled spdy to test
[13:41:36] Hi folks, I'm trying to add Nischayn22 to the local-wikipedia-library-reference tool group but it's 'failing to add'. not sure what's going on, could someone take a look? https://wikitech.wikimedia.org/w/index.php?title=Special:NovaServiceGroup&action=managemembers&projectname=tools&servicegroupname=local-wikipedia-library-reference&returnto=Special%3ANovaServiceGroup
[13:42:49] Ocaasi: I'm looking.
[13:42:55] andrewbogott: thanks!
[13:43:05] YuviPanda: Is the automatic compression related to 1.5/1.6/1.7?
[13:43:16] gzip? don't think so
[13:43:23] are you facing issues with that?
[13:44:12] No, I'm just wondering what may be different in the different versions, and if nginx would have to compress all streams in one version, but in another not, that would cause a lot of load I assume.
[13:44:12] andrewbogott: hmm, 499s aren't a new phenomenon. they exist in the archived logs as well
[13:45:02] So, red herring :(
[13:46:11] possibly.
[13:46:43] hedonil: how's it holding up on your end? any more hung connections?
[13:47:09] !ping
[13:47:09] !pong
[13:47:09] YuviPanda: looks good, atm. knock on wood
[13:48:49] Ocaasi: I see the problem too, there's clearly a bug. I'm about to get called away, though, do you mind creating a bugzilla bug for this and assigning it to me?
[13:49:06] will do, thanks andrew
[13:49:47] could it have to do with this tool being named wikipedia-library-reference and there already being a wikipedia-library named tool?
[13:50:14] Ocaasi: I believe that...
[13:50:31] yes, that's what I was going to say :) I just set up a test with two groups where one's name was a subset of the other...
[13:50:33] same behavior.
[13:50:49] ok, will go on that hunch and create a new tool with unique name
[13:51:12] andrewbogott: thanks again!
[13:51:51] thanks, sorry for the inconvenience
[13:52:00] andrewbogott: Coren the 499 might be a red herring, and things *might* be alright. I've no way to tell. I might just wait for a bit to see if hedonil complains again
[13:52:07] I am also going to setup curl in a loop
[13:52:24] no problem. could an admin please remove local-wikipedia-library-reference from the tools service group list? the name is causing bugs.
[13:52:54] YuviPanda: I haven't seen the issue again yet, fwiw
[13:53:09] intermittent issues are the worst.
[13:53:20] Ocaasi: done
[13:53:29] andrewbogott: gracias, sir
[13:55:06] andrewbogott: feel free to leave this for someone else, but it's also failing for local-oclc-reference
[13:56:45] well… that one I cannot explain :(
[13:57:53] :(
[14:03:55] Wikimedia Labs / tools: Failed to set group members for local-oclc-reference - https://bugzilla.wikimedia.org/65534 (Ocaasi) UNC p:Unprio s:normal a:Marc A. Pelletier From here: https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup Try to add a member to a group. Even though there...
[14:05:31] Coren: andrewbogott hedonil things seem stable now
[14:05:39] Wikimedia Labs / tools: Failed to set group members for local-oclc-reference - https://bugzilla.wikimedia.org/65534 (Ocaasi)
[14:05:39] Wikimedia Labs / tools: Failed to set group members for local-oclc-reference - https://bugzilla.wikimedia.org/65534 (Ocaasi)
[14:05:42] YuviPanda: is that with 1.6?
[14:05:55] YuviPanda: ack
[14:06:29] andrewbogott: https://bugzilla.wikimedia.org/show_bug.cgi?id=65534 (i'm not sure how to change the assignment to you)
[14:06:39] Wikimedia Labs / tools: Failed to set group members for local-oclc-reference - https://bugzilla.wikimedia.org/65534 (Ocaasi)
[14:06:44] but i added you to the cc list
[14:06:54] Wikimedia Labs / tools: Failed to set group members for local-oclc-reference - https://bugzilla.wikimedia.org/65534 (Andrew Bogott) a:Marc A. Pelletier>Andrew Bogott
[14:07:41] hedonil: andrewbogott Coren https://dpaste.de/3Wx4 is a script I'm running that consistently hits tools and checks for non 200s
[14:08:14] That seems heavy-handed.
[14:27:48] hello
[14:28:08] is it possible to change shell account username?
[14:30:48] Coren: hehe :)
[14:31:03] Coren: been trying to make it a habit to use python instead of bash wherever
[14:32:44] mgrabovsky: It is not especially possible. If you're truly desperate you can create a new account with the name you want.
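The script linked at 14:07:41 is not reproduced in the log. As an illustration of the idea it describes (hit a set of tool URLs and report anything other than 200), here is a minimal sketch using only the Python standard library. All names are hypothetical and this is not the script from the paste.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return None  # covers URLError, timeouts and connection resets


def check_once(urls, fetch=fetch_status):
    """Return (url, status) pairs for every URL that did not answer 200."""
    return [(url, status) for url in urls
            if (status := fetch(url)) != 200]


def watch(urls, interval=30, fetch=fetch_status):
    """Poll forever, printing a timestamped line for each failure."""
    while True:
        for url, status in check_once(urls, fetch):
            print(f"{time.strftime('%H:%M:%S')} {url} -> {status}")
        time.sleep(interval)
```

A hang shows up here as `None` once the timeout expires, which is the case the access and error logs were not revealing. Note that `urlopen` follows redirects, so a tool that redirects to a working page still counts as 200.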
[14:33:23] andrewbogott: there's a twist, though, I think it may have to do with SVN
[14:33:56] andrewbogott: I used to have commit access with my nick, and when I tried to use that when registering on Wikitech, I wasn't able to use it
[14:34:13] so I used a different one that worked
[14:34:38] andrewbogott: I got the 'There was either an authentication database error or you are not allowed to update your external account.' error
[14:35:24] it might be your old svn account was already migrated to labs. What was your svn name?
[14:35:41] andrewbogott: same as my nick here, mgrabovsky
[14:36:22] there is a labs account with that name; it already has a key registered as well.
[14:37:10] oh, is it possible to link my account on Wikitech to that one?
[14:37:49] I didn't know a kind of migration had taken place, I was gone for too long it seems
[14:38:39] hm, I'm not sure. I need to go in a minute but I will think about it -- can you send me an email with all the various vitals? your wikitech name, your old shell name, your new shell name, etc?
[14:38:59] andrewbogott: what's your email or a way to contact you?
[14:39:07] abogott@wikimedia.org
[14:39:26] thanks, I'll get in touch
[15:10:07] andrewbogott: Coren hedonil seems stable now? 499s still happening, but my script that hits tools seems happy so far. I'm inclined to just leave it as it is
[15:10:31] It's not breaking as badly, at the very list.
[15:10:40] oh?
[15:10:59] Coren: where?
[15:12:29] YuviPanda: works like clockwork \o
[15:12:45] Coren: oh, you meant it's *not* breaking as badly, I missed the not
[15:13:00] Yeah, that's what I meant. :-P
[15:16:04] Coren: :)
[15:16:11] Coren: I'll write up a report to labs-l shortly
[15:33:34] the homepage of my webpage is index.html. can i somehow change it to index.py. ie. anyone who visits http://tools.wmflabs.org/mytool gets to see index.py instead of index.html.
[15:34:20] rohit-dua: which webserver?
apache or lighttpd , behind proxy or not
[15:34:34] oh, you answered that already kind of
[15:34:38] saying it's tools
[15:34:45] yes no proxy
[15:35:05] index-file.names = ( "index.html" )
[15:35:08] something like that
[15:35:11] and change the index.html
[15:35:43] http://redmine.lighttpd.net/projects/1/wiki/Index-file-names_Details
[15:36:13] mutante: but where do i change index-file.names
[15:36:21] inside tool labs
[15:38:30] rohit-dua: i _think_ on tools-webgrid-01, but only from looking at where scfc_de changed related things
[15:40:50] andrewbogott: Coren I've never really written 'outage reports' of any sort, so the email I just sent might be a bit weird. do let me know if there's any more / less info I could add
[15:42:08] rohit-dua: Web server settings can be changed in a file named .lighttpd.conf in your tool's home.
[15:42:38] YuviPanda: looks good to me, thank you.
[15:42:42] Coren: i do not get any .lighttpd when i ls -l in my home directory..
[15:42:53] ls-a
[15:43:14] andrewbogott: :) I still need to setup monitoring, though
[15:43:22] rohit-dua: There is none by default,
[15:43:41] so i create one and add the default settings?
[15:43:41] andrewbogott: is there a way I can get something to email or some other way of notifying me when this goes down?
[15:44:15] rohit-dua: You can simply add it with any directives you want. https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Configuring_the_web_server
[15:44:31] YuviPanda: Probably, but I don't immediately know what the best approach is.
[15:44:44] !log deployment-prep Deployed scap 7b6fc47 via trebuchet
[15:44:44] Coren: thank you :-)
[15:44:54] rohit-dua: In particular, you probably only want one line in it:
[15:44:57] index-file.names += ("index.py")
[15:45:06] !ping
[15:45:06] !pong
[15:45:42] Coren: so what would happen if both index.html and .py are present?
[15:46:24] rohit-dua: ... I'm not sure. It certainly will pick one but I wouldn't rely on which.
[15:46:30] @seen morebots
[15:46:30] bd808: Last time I saw morebots they were talking in the channel, they are still in the channel #wikimedia-operations at 5/20/2014 3:15:40 PM (30m50s ago)
[15:46:50] (I.e.: I'd make sure that you have only one or the other)
[15:47:11] morebots, everything ok?
[15:47:19] bd808: I'll restart it
[15:47:36] andrewbogott: ty
[15:49:02] labs-morebots, feeling better
[15:49:03] I am a logbot running on tools-exec-08.
[15:49:03] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[15:49:03] To log a message, type !log .
[15:49:19] !log testlabs this test message is testing the message test
[15:49:21] Logged the message, dummy
[15:53:55] !log deployment-prep Deployed scap 7b6fc47 via trebuchet
[15:53:57] Logged the message, Master
[16:08:39] YuviPanda: got some new hangs (few ~ 1/hour), made a pic of status
[16:08:43] YuviPanda: http://tools.wmflabs.org/tools-info/misc/proxy-hang.png
[16:09:05] Any idea why my webservice could be ceasing to work every once in a while?
[16:09:08] An atomatic restart would be nice
[16:09:13] auto
[16:09:36] otherwise it is borderline useless
[16:10:01] I cannot even set up a heartbeat script, because all cron jobs are forced to be run on the worker nodes
[16:10:15] (from where I apparently cannot restart the webservice)
[16:10:25] Am I missing something obvious here?
[16:11:09] YuviPanda: interesting thing is, that on reload /one/ resource load randomly fails
[16:12:06] YuviPanda: something with http keep alive request settings here ?
[16:14:52] dschwen: you need MORE magic with your script :P
[16:15:26] dschwen: I provide you a webwatcher script (a bestseller atm )
[16:15:35] dschwen: https://tools.wmflabs.org/paste/view/fde97e1a
[16:17:06] Coren: I'd suggest to create a continuos queue for webservice, would be the easiest solution, I think
[16:29:38] dschwen: here is version for vanilla (non-tweaked) webservices ;) https://tools.wmflabs.org/paste/view/5d976bb0
[16:33:27] dschwen: What's the tool's name?
[16:46:02] dschwen: Thing is, it's really abnormal for lighttpd to die and it doesn't seem wise to me to restart it unconditionally. How do your lighttpds die? (There should be at least some hint in the error.log)
[16:48:17] Coren: it doesn't seem to be abnormal, though, as there are often enough people complaining about tools that are down.
[16:48:37] Coren: one of the reasons seems to be out of memory-issues
[16:50:00] valhallasw: The hard limit on lighttpds is at 4G(!); if one of them manages to hit that limit I really *want* it stopped. :-) But my point is, it's really not supposed to be possible for lighttpd to exit without an explicit kill or something really bad happening (like lighttpd *crashing*) so any instance of a webservice being down really needs to be looked into.
[16:51:04] Coren: betacommand has reported OOM issues with cgi-bin python scripts
[16:51:34] Coren: well, there are some reasons. One is std config. I'll provide a analysis on lighty in a few days.
[16:51:53] and sure, stopping it when it hits 4G is reasonable enough, but restarting the damn server makes much more sense than leaving it for dead
[16:51:59] * hedonil became a lighty blackbelt :P
[16:52:18] hedonil: But then, if lighty ends because of a broken config you don't want to restart it with that same config.
[16:53:01] valhallasw: The grid will never restart a process *it* killed, only one that dies on its own.
[16:53:15] Coren: I don't care what 'the grid' does, I care what 'tool labs' does
[16:53:24] and I care about a web service that does not need me to keep it online
[16:53:54] ergo, if the grid kills the web service for some reason, there should be something in-place to restart it for me
[16:53:56] because, really, that's all I'm going to do when it's down anyway
[16:54:04] Coren: one thing is OOM, it's ok for most tools to die when the limit is reached. but restarted would be fine then.
[16:54:28] As suggested, a contiuous queue would be fine for tools-webgrid
[16:54:46] wouldn't help for OOM, though.
[16:55:02] valhallasw: Then you should care about its memory usage running out of bounds and why it gets to that point it the first place. Seriously. 4G is way, *way* past any sort of reasonableness and any web server that hits that barrier often enough that restarting it regularily is an issue is problematic to begin with.
[16:55:31] Coren: Yes, but not an issue that *I'm* paid to resolve.
[16:55:52] valhallasw: Wait what?
[16:56:01] if lighttpd is crappy enough to require 4G for simple cgi-bin scripts, that's *your* problem, not mine.
[16:56:29] valhallasw: It isn't. If your "simple" cgi-bin script eats up 4G of ram, then it's a problem with your script, not lighttpd.
[16:57:11] Talk to betacommand. I glanced at his scripts. They look reasonable enough.
[16:57:24] For some reason, Apache never had issues.
[16:57:49] valhallasw: Wrong; it regularily had issue in that the entire *server* ran OOM and had to be restarted.
[16:58:44] valhallasw: In fact, the fact that some tools became destructive because they ate all resources is one of the main reasons why I switch to per-tool servers -- this way misbehaving scripts kill themselves and don't bring everything else down with them.
[16:59:17] thanks, hedonil
[16:59:30] Coren, I find the following at the end of my error.log
[16:59:41] 2014-05-20 14:35:44: (server.c.1512) server stopped by UID = 0 PID = 30656
[16:59:41] 2014-05-20 14:35:44: (server.c.1502) unlink failed for: /var/run/lighttpd/zoomviewer.pid 2 No such file or directory
[16:59:59] I don't know what to make of it
[17:00:12] dschwen: Hm. That's just /after/ it wanted to end; what's a couple lines up from that?
[17:00:29] oh wait
[17:00:32] is this fatal?!
[17:00:33] 2014-05-20 14:35:44: (server.c.1512) server stopped by UID = 0 PID = 30656
[17:00:33] 2014-05-20 14:35:44: (server.c.1502) unlink failed for: /var/run/lighttpd/zoomviewer.pid 2 No such file or directory
[17:00:36] sorry
[17:00:37] one sec
[17:00:45] 2014-05-20 14:35:44: (mod_fastcgi.c.2701) FastCGI-stderr: PHP Notice: Undefined index: stage in /data/project/zoomviewer/public_html/index.php on line 6
[17:00:54] Coren: and now they silently die instead of getting restarted. That's still not good from a tool user perspective. The tool user *also* still doesn't know anything, except lighttpd randomly dies without any information.
[17:01:02] hm, where I'm from this is a warning
[17:01:12] Coren: and the tool user can only get that information with magic incantations of jstat and jacct
[17:01:27] dschwen: Hm, that shouldn't be fatal indeed. Interesting.
[17:01:30] at least, if you consider an exit code of 137 (or was it something else?) informative.
[17:01:55] sorry, that second 'tool user' should have been 'tool owner'
[17:02:02] dschwen: We can rule out OOM at least -- that gets a sigkill and wouldn't have had the server politely wrap up as shown by the last two lines.
[17:03:08] dschwen: yw. (modify starting script line 22 for your restart)
[17:03:40] dschwen: your webservice's problem was: maxvmem 3.925G
[17:04:03] Aha. So it hit the soft limit first.
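The exit code 137 mentioned at 17:01:30 is easier to read once decoded. Under the convention used by POSIX shells, a status above 128 means "terminated by signal (status - 128)", so 137 corresponds to signal 9, SIGKILL, which matches the remark that an OOM kill "gets a sigkill". A small sketch, assuming the accounting output follows that convention; the function name is hypothetical.

```python
import signal


def describe_exit_status(status):
    """Translate a shell-style exit status into a readable description."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    if status == 0:
        return "exited normally"
    return f"exited with error code {status}"


print(describe_exit_status(137))
```

A polite shutdown like the "server stopped by UID = 0" lines in the error.log would not produce 137, which is consistent with the conclusion that the soft limit was hit before the hard one.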
[17:04:12] dschwen: it's not in error.log, but jobinfo: $ qacct -j lighttpd-zoomviewer
[17:04:13] I was just looking at the qacct
[17:05:37] Corem: hi again. is it possible i link my index homepage file from a path like index-file.names += ("/BUB/app/index.py")
[17:05:42] hedonil, what is ${HOME}/webstart.sh
[17:05:55] just "webservice $1" ?
[17:06:08] Coren: hi again. is it possible i link my index homepage file from a path like index-file.names += ("/BUB/app/index.py")
[17:06:09] thanks
[17:06:19] hm, why is this eating so much mem
[17:06:23] dschwen: yep. something like that
[17:06:31] rohit-dua: No; indices are "file in the directory which will be shown by default". You probably want a rewrite if you want your /toolname/ to show something elsewhere by default.
[17:07:11] Coren: is it possible to rewrite that?
[17:08:14] rohit-dua: Look at the "Url rewrite" section of the help page I pointed you at earlier. This is probably what you want.
[17:08:46] dschwen: Is it possible for user-provided data to your web interface to cause it to work on huge datasets?
[17:09:42] hm
[17:09:44] Coren: thank you. i got it :)
[17:09:45] oh...
[17:09:58] I'm launching vips image processing tasks
[17:10:11] as child processes I guess
[17:10:24] Wikimedia Labs / Infrastructure: filearchive table not available on labs - https://bugzilla.wikimedia.org/61813#c10 (Luis Villa (WMF Legal)) Can't they already do that by simply uploading the file instead of the SHA?
[17:10:25] I should probably submit those through grid enginr
[17:10:29] If we start webservices with "-m a", that should inform users about any jobs killed by the grid, but I don't think that would catch any "soft" kills.
[17:10:30] buuuuut
[17:10:32] Hm. Those would be added to the script's own usage.
[17:10:51] that memory eating bug was fixed in vips months ago
[17:10:58] let's see...
[17:11:06] which version of vips do we have...
[17:11:19] dschwen: Add a "ulimit" to be sure that it doesn't run away?
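The "ulimit" suggestion at 17:11:19 can also be applied from the parent process rather than a shell wrapper. The sketch below caps a child's address space before it starts, so a runaway child fails on its own instead of pushing the whole webservice job over its limit. It is written in Python as an illustration only (the tool in question is PHP), it works on Unix-like systems only, and the function name is hypothetical.

```python
import resource
import subprocess


def run_with_memory_limit(cmd, max_bytes):
    """Run cmd with its virtual memory (RLIMIT_AS) capped at max_bytes."""

    def set_limit():
        # Runs in the child just before exec; only the soft limit is lowered
        # and it is never raised above an existing hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=set_limit,
                          capture_output=True, text=True)


# The caller has to cope with the child failing: a non-zero returncode means
# it exited with an error, a negative one means it was terminated by a signal.
```

The command list passed in is whatever the tool normally launches; no particular image-processing command line is assumed here.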
[17:11:22] scfc_de: It wouldn't indeed. Hm. Nontrivial to catch.
[17:11:50] 7.32.3
[17:11:55] scfc_de: Ah, indeed, if you ulimit your child process then the sum can't break you (but then that can cause the child to be killed so you have to cope with that)
[17:12:20] scfc_de: That's unarguably better behaviour though.
[17:13:01]