[00:01:16] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556#c2 (merl) NEW>ASSI It is a tracking bug for features missing at labs, mediawiki, wikidata or any other component covered by this bugzilla that any of my tools relies on. For my tools itself I am tracking bugs... [00:04:54] Coren: you around? [00:22:37] Betacommand: Wazza? [00:23:16] Coren: See PM for BEANs [01:00:27] Dispenser: there isn't a query killer [01:00:50] Then I'll write another one [01:01:10] Dispenser: such a tool isn't allowed [01:01:37] Dispenser: some data is exposed on labs that general users shouldn't have access to [01:02:17] Dispenser: it also opens up a whole host of problems, and security issues [01:03:36] So I chmod 400 * -R [01:04:28] an unprivileged user that can't even write to their home directory [07:02:31] Tool Labs tools / [other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://bugzilla.wikimedia.org/66868#c1 (Thorsten) All tools are redirected to http://dispenser.homenet.org now. Why that? [07:02:46] :D [09:35:31] Tool Labs tools / [other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://bugzilla.wikimedia.org/66868#c2 (Andre Klapper) I *guess* it's related to https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=616201284#Can_we_raise_the_loss_of_these_tools_w... [10:48:33] Coren: http://tools.wmflabs.org/dispenser/cgi-bin/viewer.py/Reflinks redirects to http://dispenser.homenet.org/~dispenser/cgi-bin/viewer.py/Reflinks. My gut feeling says there needs to be an interstitial because the user probably won't check the URL after clicking and expects to be in WMF realm. Thoughts? [11:08:48] Mailed legal@wikimedia.org about that. [11:14:22] even though it's 'external', it doesn't save from that server; dispenser has disabled 'auto-save', so now the user is redirected to their wiki instead of homenet, so I doubt any data gets saved.. [11:20:20] warpath: But it (very probably) will log which IP has requested which article. Users of WMF web services have the expectation that this information is limited to ops who have signed an NDA. [11:21:25] indeed the only drawback....but I trust dispenser so am not afraid to use the tools.. [11:22:55] warpath: Yes, and that is *your* choice. But a URL http://tools.wmflabs.org/... doesn't raise any expectation that you have to think about anyone's trustworthiness besides WMF. [11:23:49] dispenser needs to add a note on top of his tools I presume..not all tools are hosted on wmflabs.. [11:24:49] Who's disturbing my slumber? [11:25:09] warpath: Then the user has already submitted his data to a third party. [11:25:27] that guy ^ [11:27:20] I've been telling everyone to keep pointing to Toolserver. If it's an issue we can have nosy/silke change that link [11:38:59] And it's not proxying as before, to avoid tainting Labs' reputation [12:11:09] Coren: around? [12:12:32] Wikimedia Labs / deployment-prep (beta): Setup monitoring for Beta cluster - https://bugzilla.wikimedia.org/51497#c6 (Tim Landscheidt) I chatted yesterday with Yuvi a bit about monitoring and its challenges, and he reminded me that the main problem with applying the prod setup to Labs is that roots can...
[12:13:28] scfc_de: I'd say worst thing is a remote code execution exploit [12:13:32] this whole dispenser thingy is getting frightfully annoying [12:13:54] scfc_de: since checks are done by code that could potentially be running on the icinga host [12:13:59] It's an OPEN SOURCE environment with OPEN SOURCE tools [12:14:21] scfc_de: also, have you looked at the icinga code in puppet? very, very prod specific. [12:14:34] hedonil: yeah, true. I think Coren was working on making that easier to determine on labs. [12:14:44] hedonil: also, unrelated, any progress on the nginx log stats stuff? [12:14:47] so: fork these tools as *real* Open Source and put an end to this story [12:15:05] hedonil: legoktm tried that, kinda, and Dispenser asked him to take them offline. the tools need to be rewritten [12:15:10] Or create new ones without any possibility of legal action. [12:15:10] YuviPanda: Yeah, you did a *lot* of stuff ;) [12:15:32] hedonil: not sure if sarcastic :D I just gave you a sample of the logs :) [12:15:34] YuviPanda: Yeah, but is root on Icinga so exciting? [12:15:51] scfc_de: potentially, since it will be running on real hardware. [12:16:34] YuviPanda: In your proposal. My comment on the bug is for the "current" setup with Icinga and Ganglia as Labs instances. [12:16:47] scfc_de: ah, right. [12:16:54] scfc_de: but, have you looked at prod's icinga code? :) [12:17:05] scfc_de: one of the few times I can sympathize with petan for not having to use that. [12:17:29] hedonil: btw, once the graphite situation stabilizes I'll be collecting metrics on 5xx return codes for the tools as well [12:17:54] YuviPanda: No, not sarcastic at all .. really great job (maybe some improvements in change management :P process though..) [12:17:59] I peeked in the past and it didn't look /that/ bad? What do you have in mind specifically? [12:18:11] scfc_de: icinga.pp has a lot of prod specific stuff [12:18:13] YuviPanda: I'm going to write son bash lines during this day [12:18:21] hedonil: can you, uh, make them python? :D [12:18:40] YuviPanda: of course, will do so [12:18:44] hedonil: ty! :) [12:19:15] YuviPanda: your wish, my command, master ;) [12:19:17] hedonil: feel free to use whatever output format you deem fit, keeping in mind space constraints [12:19:31] hedonil: what do you mean by 'improvements in change management process'? [12:19:51] hedonil: also, PM? [12:20:10] YuviPanda: Sure, it's not written with other use cases in mind. Still: Let me cling to my dreams :-). [12:20:16] YuviPanda: the 101 of change mangement [12:20:37] scfc_de: :D no reason to not use prod's icinga, but step 1 of that is modulizing it :) people were looking at shinken too, to replace icinga in prod [12:20:45] YuviPanda: first test on one node, then roll out to all nodes :P [12:21:08] hedonil: hehe :D I do do that most of the time, except the proxy business was partly caused by traffic and that's a bit hard to test on spare nodes :) [12:21:10] YuviPanda: Yep, betting on the wrong horse would be ... bad. [12:21:46] scfc_de: yeah, agreed. I'd still like to make at least some of the metrics be checked with check_graphite, but that's for later :) [12:21:50] * hedonil thinks about thre massive diamond-log things on several nodes [12:22:26] hedonil: the problem there was /var/ being only 2G, which I think is wrong, but yeah, that was fixed a bit later than I should've. 
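Since the nginx log-stats script discussed above is meant to end up as Python rather than bash, here is a minimal sketch of the sort of thing it could do: counting 5xx responses per tool from an access log. The log path, the default "combined" log format and the /<toolname>/... URL layout are assumptions for illustration, not details from the channel.

    #!/usr/bin/env python
    # Minimal sketch: count 5xx responses per tool from an nginx access log.
    # Assumptions (not from the channel): the log lives at /var/log/nginx/access.log,
    # uses the default "combined" format, and tool URLs look like /<toolname>/...
    import re
    from collections import Counter

    LOGFILE = "/var/log/nginx/access.log"   # hypothetical path
    # e.g. ... "GET /sometool/foo?x=1 HTTP/1.1" 502 ...
    REQUEST = re.compile(r'"[A-Z]+ /([^/ ?"]+)[^"]*" (\d{3}) ')

    errors = Counter()
    with open(LOGFILE) as f:
        for line in f:
            m = REQUEST.search(line)
            if not m:
                continue
            tool, status = m.group(1), int(m.group(2))
            if 500 <= status < 600:
                errors[tool] += 1

    for tool, count in errors.most_common(20):
        print("%6d  %s" % (count, tool))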
[12:23:21] YuviPanda: as it turned out later, hehe - but I think you get the point [12:23:27] hedonil: i do, I do :) [12:23:32] YuviPanda: hehe [12:23:42] hedonil: have you seen http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h [12:25:09] !log tools cleaned out old diamond archive logs from tools-exec-08 [12:25:12] Logged the message, Master [12:25:31] hedonil: at least with ^ we can be a little more proactive about cleaning up disk [12:25:40] YuviPanda: Very well done [12:26:09] hedonil: scfc_de eventual plan is to kill ganglia, both in labs and prod [12:26:39] only after we've decent dashboards, ofc [12:26:50] YuviPanda: Did you apply biglogs by some particular scheme? I see volumes on exec nodes 1, 2, 3, 4, 5 and 10? [12:27:01] scfc_de: no, I was applying as I cleaned them up [12:27:09] so it's a bit, uh, erratic [12:27:16] k [12:27:23] scfc_de: feel free to apply them to the rest :) [12:27:35] scfc_de: I'll do it as I hit 'em [12:28:04] Yep. Should probably be included from toollabs::*exec* & Co. [12:28:14] !log tools cleaned out old diamond archive logs on tools-webgrid-04 [12:28:16] Logged the message, Master [12:28:33] !log tools cleaned out old diamond archive logs on tools-master [12:28:35] Logged the message, Master [12:32:53] gifti: Re tools-exec-gift, I assume your requirements for packages & Co. are identical to a "regular" exec node so the only difference is in how the SGE scheduler treats that host? [12:34:43] !log tools tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that [12:34:46] Logged the message, Master [12:40:13] !log tools tools-exec-cyberbot: Root partition has run out of inodes [12:40:16] Logged the message, Master [12:40:30] YuviPanda: See! That's why they should be monitored as well :-). [12:40:37] scfc_de: heh, was just going to say that :) [12:41:45] @seen Cyberbot378 [12:41:45] scfc_de: I have never seen Cyberbot378 [12:41:51] Ah, numbers ... [12:41:55] @seen Cyberbot376 [12:41:55] scfc_de: I have never seen Cyberbot376 [12:42:01] @seen Cyberbot678 [12:42:01] scfc_de: I have never seen Cyberbot678 [12:42:08] @seen Cyberpower378 [12:42:08] scfc_de: I have never seen Cyberpower378 [12:42:15] @seen Cyberpower678 [12:42:15] scfc_de: Last time I saw Cyberpower678 they were quitting the network with reason: Ping timeout: 255 seconds N/A at 7/5/2014 4:13:21 PM (3d20h28m53s ago) [12:42:19] Or words. [12:42:57] hehe [12:42:57] hedonil, YuviPanda: The community would appreciated more new tool rather than rewriting existing ones. If you need ideas just ask me. [12:45:38] Dispenser: I think the tools being forkable / open source is important as well. [12:50:53] !log tools tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step [12:50:56] Logged the message, Master [12:51:18] scfc_de: I didn't know of -delete :) [12:52:32] scfc_de: -gift has no slot or h_vmem limits; it's set aside specifically so that resource management is left to the application. [12:52:46] Coren: But packages & Co. are the same? [12:53:27] (Two-part question: a) Whether anything needs to puppetized differently. b) Why SGE fucked up.) [12:53:35] So for b) the answer would be user error? [12:53:50] scfc_de: You're correct, this requires an intersitial. [12:54:11] Coren: do we have a process for dealing with such issues? [12:54:17] also things like using GA, etc. 
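The find(1) one-liner logged above translates fairly directly to Python if the cleanup ever needs to grow into a script. This is only a sketch of that same logic; just the pattern and the 30-day age come from the logged command.

    #!/usr/bin/env python
    # Sketch of the same cleanup as the logged find(1) one-liner: delete regular
    # files directly under /tmp matching *cyberbotpeachy.cookies* and older than
    # 30 days.
    import fnmatch
    import os
    import time

    TMPDIR = "/tmp"
    PATTERN = "*cyberbotpeachy.cookies*"
    MAX_AGE = 30 * 24 * 3600          # seconds, matching -mtime +30

    now = time.time()
    for name in os.listdir(TMPDIR):              # -maxdepth 1
        path = os.path.join(TMPDIR, name)
        if not os.path.isfile(path):             # -type f
            continue
        if not fnmatch.fnmatch(name, PATTERN):   # -name
            continue
        if now - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)                      # -delete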
[12:54:21] PP violating tools, I mean [12:54:23] scfc_de: It's an exec node, so yeah. [12:55:27] YuviPanda: There is no process other than "the sysadmins will trout you" at this time; repeat offenders get blocked but that only happened twice to date and was mostly due to communication problems caused by language issues. [12:55:52] Coren: hmm, right. having something well defined (even if it is 'trout for X , then do Y') would be nice [12:55:53] YuviPanda: Dispenser: If I look at one of the published sources: https://web.archive.org/web/20130310035422/http://toolserver.org/~dispenser/sources/webreflinks.py [12:56:05] it says "Distributed under the terms of the GPL" [12:56:22] so there was no reason for legoktm to take down his fork [12:56:31] YuviPanda: Meh. Unneeded bureaucracy. [12:56:40] Coren: heh, alright. [12:56:58] Coren: btw, I was doing some back of the envelope calculations for graphite metrics, wanted to run 'em by you before responding to godog's email [12:56:59] It takes more than just that one file to make it work [12:57:04] hedonil: There's also the issue that much of that tool is derived from the GPL so its copyright status (and the truth value of some of the licence text) is dubious. [12:57:36] Coren: all of labs was 70k metrics on average, while it is 25k right now (just toollabs/betalabs/graphite). I was thinking of 150k metrics as a safe upper limit, but will be happy with 100k too. [12:57:53] Coren: this is for capacity planning of the disks. Your thoughts? I've never done capacity planning estimation stuff before [12:59:42] Dispenser: really, best thing would be to end this imho defiant behavior and port it to labs in total [12:59:45] YuviPanda: It's a fair estimation; but you also have to take into account RAID overhead (or the lack thereof) which will factor in a lot. [13:00:18] Dispenser: or just leave labs and make your own thing, you're free to do this [13:00:31] Haven't I already? [13:00:32] YuviPanda: No raid = less resources needed and generally faster I/O for small writes at the cost of the data being vulnerable; there's an important question to ask there. [13:01:44] Coren: right. looking at the machines in the server spares page, I'm personally ok with no RAID (we can always do backups, and should do backups anyway) + SSD. 600GB of non-RAID SSD + 500G of RAID'd NLSAS sounds good enough for a few years. [13:02:18] Coren: my back of the envelope calculations were ~400GB of storage for 100k metrics, and if we limit metrics to an opt-in per-project basis even that'll take a long time coming [13:03:04] YuviPanda: I don't even think it's worthwhile to go SSD if we have a number of stripped drives; IMO it's important to have more smaller drives that a couple bigger ones. [13:04:23] Coren: wouldn't the SSD perf still be greater? also we don't have anything in https://wikitech.wikimedia.org/wiki/Server_Spares with a number of small drives. Only thing with >2drives is 4x3TB [13:05:54] YuviPanda: That depends on a number of things; SSD are blindingly fast on reads but there are caveats about write patterns that may make our use pattern (bazillions of small consecutive writes) very suboptimal. [13:06:19] YuviPanda: But yeah "what is available" also has to factor into it. [13:06:46] Coren: yeah, if we had 4x500GB or 4x1TB that would've been a nobrainer [13:07:13] 4x3TB just seems... overkill. 
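For reference, the back-of-the-envelope disk numbers above follow from how Whisper (graphite's storage backend) sizes its files: roughly 12 bytes per archived datapoint plus a small header, so disk use is approximately metrics × retained points × 12 bytes. The retention schema in the sketch below is an assumed example, not the actual Labs configuration, but it lands in the same ballpark as the ~400 GB for 100k metrics estimate.

    #!/usr/bin/env python
    # Back-of-the-envelope Whisper sizing: ~12 bytes per archived datapoint
    # (plus a small per-file header). The retention schema is an assumed example,
    # not the real Labs config.
    BYTES_PER_POINT = 12

    retentions = [               # (seconds per point, points retained)
        (60,   90 * 24 * 60),    # 1-minute data for 90 days
        (300,  365 * 24 * 12),   # 5-minute data for 1 year
        (3600, 5 * 365 * 24),    # 1-hour data for 5 years
    ]

    points_per_metric = sum(points for _, points in retentions)
    bytes_per_metric = points_per_metric * BYTES_PER_POINT   # ~3.3 MB here

    for metrics in (25000, 100000, 150000):
        total = metrics * bytes_per_metric
        print("%7d metrics -> ~%d GB" % (metrics, total / 10**9))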
[13:07:53] hashar: there is /data/project/logs/captcha.log [13:07:56] Coren: and yes, SSD write patterns to factor in, but there are plenty of people on the internet (and bd808|BUFFER as well) talking about how a lot of graphite I/O problems went away with the switch to SSDs [13:08:11] hashar: so, https://gerrit.wikimedia.org/r/#/c/144933/ [13:08:35] hashar: going out for a run. feel free to comment there :) [13:10:59] YuviPanda: I'm sure it has, but I fear the SSD will not last long under that use. *shrug* Honestly, I'm not worried overmuch about it either way; if I had my way I'd want a box with lots of small disks as ideal, but we'll probably fare well with pretty much anything available for the foreseeable future. [13:12:16] Coren: yeah, I tend to agree as well. small points :) [13:19:57] coren: so we do have an issue with exec-gift? [13:22:47] gifti: There is no /technical/ issue with it that I know of, though you may wish to change a bit how you schedule your jobs so that your jobs which are more expensive take more slots. You have 1000 to allocate, which will certainly overload your node if you only use one each. [13:23:09] !log graphite removed whitelist.conf to see if the puppet change preventing diamond on non-selected projects from sending metrics has been done fully. Nothing on tcpdump [13:23:10] Logged the message, Master [13:24:41] ... wait, that statement came out incomprehensible. [13:24:43] :-) [13:25:06] Coren: are there no other means to automatically prevent overloading? [13:26:24] gifti: Well, the scheduler will not *start* jobs if there is already too much load, but that doesn't prevent previously-running jobs from growing. The only real way around it would be to turn on a consumable resource (like memory) that jobs can "spend" like on the general grid. [13:28:52] gifti: You may want something like https://bugzilla.wikimedia.org/show_bug.cgi?id=52976 as well; I'm about to turn that on in general, that might suffice to help you manage load if you can estimate how much resources the jobs will take when started. [13:32:51] gifti: you can limit the number of parallel executed tasks by adding -tc. Coren could also change the load the SGE scheduler expects for the job by changing job_load_adjustments and load_adjustment_decay_time [13:33:26] !log tools tools-exec-cyberbot: Freed 402398 inodes ... [13:33:28] Logged the message, Master [13:33:29] Merlissimo: -rc only works for true parallel jobs, not for independent ones. [13:33:37] -tc * [13:34:59] oh no array jobs? then only the bugzilla feature request would help, yes [13:37:02] Coren: on toolserver I let the SGE scheduler mirror np_load_short (1 min) and not np_load_avg (5 min) so that a server is blocked faster when load increases [13:37:40] np_load_short is also part of the alarm state on ts [13:40:08] And now /var is full on -cyberbot. *argl* [13:51:33] Wikimedia Labs / tools: Provide user_slot resource in grid - https://bugzilla.wikimedia.org/52976#c5 (Marc A. Pelletier) ASSI>RESO/FIX I went with 60 because that's a very divisible integer. The complex is named 'user_slot' with shortcut 'u'. [13:56:37] Coren: giftbot queue config: load_thresholds np_load_avg=3.75 you should really change this. normally 1.75 is a good value, and to solve this problem you should change the giftbot queue config imo to "load_thresholds np_load_avg=2.00,np_load_short=1.50" . that should solve the problem [13:57:41] Merlissimo: Yeah, those sound like reasonable values.
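As a concrete illustration of the -tc suggestion above (limiting how many tasks of an array job run at once, so a dedicated node is not overloaded), here is a hedged sketch of submitting such a job from Python. The job name, script path and all the numbers are hypothetical; -t, -tc, -l h_vmem and -b y are standard gridengine qsub options.

    #!/usr/bin/env python
    # Sketch: submit a 500-task array job but let at most 20 tasks run at once
    # (-tc), per the suggestion above. Name, script path and numbers are made up.
    import subprocess

    subprocess.check_call([
        "qsub",
        "-N", "gift-array",       # job name (hypothetical)
        "-t", "1-500",            # array job, tasks 1..500
        "-tc", "20",              # at most 20 tasks running concurrently
        "-l", "h_vmem=512M",      # per-task memory request
        "-b", "y",
        "/data/project/exampletool/process_chunk.sh",   # hypothetical task script
    ])
    # Each task reads $SGE_TASK_ID (1..500) to know which chunk to process.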
{{done}} [13:57:59] Merlissimo, gifti: I've added the user_slot consumable by the way, so you can also use that. [13:59:37] thx Coren [14:12:30] !log tools tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later [14:12:33] Logged the message, Master [14:18:12] ping: Petan wmbot broken [14:52:52] Is Beta Labs user is autoconfirmed by default? [16:07:04] We get JVM crashes with "There is insufficient memory for the Java Runtime Environment to continue." in our web app. Any idea what to do? [16:09:08] dnaber: Tools? Labs project? [16:10:03] It's running at http://tools.wmflabs.org/languagetool/ (if it's not just crashed) [16:14:03] dnaber: AFAICS, the amount of memory available for Tomcat is fixed to 4 GByte. Can be worked around manually, though, if the crashes continue. [16:15:31] scfc_de: can I copy the webservice script and just add a larger number there? but it doesn't make sense anyway, Tomcat and our app work nicely on a different (non cloud) server with less than 1GB [16:16:21] dnaber: In essence, for Tomcat "webservice start" calls "qsub -e $home/error.log -o $home/error.log -i /dev/null -q "webgrid-$server" -l h_vmem=4g -b y -N "$server-$tool" /usr/local/bin/tool-$server" (cf. /usr/bin/webservice:56). So you could just replace 4g with 6g or 8g (and replace $home and $server accordingly). [16:16:45] dnaber: Note that memory for grid purposes is virtual memory, that is not RSS. [16:21:17] 3Wikimedia Labs / 3tools: Setup an icinga instance to monitor tools on tool-labs - 10https://bugzilla.wikimedia.org/51434#c4 (10scott.leea) Is this something I can work on? [16:23:59] scfc_de: thanks, will try with 6gb [16:36:40] DispenserAFK: Is dab solver redirecting without PP warning? [16:40:46] bd808: have you tested https://gerrit.wikimedia.org/r/#/c/144469/2 on vagrant or someplace to verify that it compiles and such? [16:41:58] andrewbogott: Unfortuantely no [16:42:20] bd808: ok, lemme see... [16:43:58] bd808: ok, that change is now deployed here: http://wikitech-test.wmflabs.org/wiki/Main_Page [16:44:22] the page still loads… I'm not sure what else to test, though [16:44:41] andrewbogott: can you trigger a notification there? [16:45:12] bd808: I don't know enough about echo to know how to do that. Instance creation is broken atm; that would be the obvious way [16:45:48] create, reboot, delete, add user to a project are the notifications [16:46:42] ah, so if I add myself to a project I'll get one? [16:47:58] hm, seems not [16:57:49] bd808: do you mind creating an account on that test wiki so that you can look? I'm not sure what I should be seeing. [16:57:57] Let me know what your account name is and I'll grant you some rights. [16:58:09] Sure. Hang on ... [17:00:06] andrewbogott: Boom... undefined variable `base` in OpenStackNovaUser.php line 587 [17:00:38] Hm, doing what? [17:00:48] Creating a new user [17:01:16] ah, sure. i bet it worked anyway? I think I have a patch in progress about that [17:01:39] andrewbogott: http://paste.debian.net/108938/ [17:01:48] heh, yeah, in prod there's a patch titled "Hotfix, investigate why this ever worked before…" that fixes that [17:02:34] anyway, I think you can move on in spite of it? I just created an account and it worked despite the errors [17:03:24] andrewbogott: that is my favorite hotfix of the day now :) [17:03:32] 3Wikimedia Labs / 3tools: Setup an icinga instance to monitor tools on tool-labs - 10https://bugzilla.wikimedia.org/51434#c5 (10Marc A. 
Pelletier) Not just yet; we're currently at the stage where we are setting equipment aside for the task and doing our first round of specifications. I expect we'll spend so... [17:04:57] bd808: should be fixed now [17:05:10] andrewbogott: Cool. In an meeting now [17:11:34] 3Wikimedia Labs / 3tools: Sorting by CPU/VMEM columns doesn't sort by their value on http://tools.wmflabs.org/?status - 10https://bugzilla.wikimedia.org/67737 (10Liangent) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier They're currently sorted as text. [17:31:45] https to beta labs remains broken [17:33:28] spagewmf: It's been broken since we moved to eqiad hasn't it? [17:33:52] Not meaning that it's not something that I'd like to see fixed [17:35:00] bd808: I think it worked, I get redirected to https if logged in and lots of https beta labs URLs are in my rowser history [17:36:17] hmmm.... maybe I'm misremembering then. I do know it was broken when we first moved but maybe I never tried again after. [17:41:12] andrewbogott: Sorry for the delay. I created account BryanDavisTest on http://wikitech-test.wmflabs.org/ [17:42:29] I did get an error in the success page "devwiki could not send your confirmation mail. Please check your email address for invalid characters. Mailer returned: Unknown error in PHP's mail() function." Hopefully unrelated to the Echo changes since that is a different code path. [17:46:38] bd808: ok, I gave you shell rights and added you to two projects. [17:46:42] Do you see the notifications you'd expect for that? [17:48:05] andrewbogott: I got a notification for "Your user rights were changed by Testandrew." but only that one. [17:48:20] I logged out and back in and don't see projects either [17:48:41] wait... now they are showing up [17:49:32] So I see projects that I am a member of, but only see the rights change notification at this point. [17:49:43] OK. I guess we don't know if that worked before. [17:50:20] So, we've verified that your patch doesn't totally break echo :) Is that all we need? [17:51:24] The notification I got was actually not from OSM. It's from ListGroupRights [17:51:45] So I may still have totally broken notifications from OSM in that patch :( [17:52:03] alas [17:52:11] want me to give you a login on that host so you can see what's happening? [17:52:39] andrewbogott: Sure. I can stick some logging in or something to test it [17:53:12] wikitech-test-frontend [17:53:26] and it's using labs-vagrant, so the MW code is in /vagrant/mediawiki/ [17:53:47] bd808: I'll add the pinning code later today. thanks for the tag [17:54:57] andrewbogott: Cool. I'll poke at it in a bit (knee deep in an email) [17:55:04] 'k [18:42:08] bd808: are you working on wikitech-test now? Asking because it just broke and I don't know if it was spontaneous or not [18:42:38] I am. What broke? [18:42:45] everything :) [18:44:57] everything, but it's ok, I'll just work on something else for now. [18:45:55] andrewbogott: It's related to my patch. There are obviously several things wrong with it. [18:46:14] I fixed the hook registration and that causes the other breakage [18:46:47] bd808: ok. I merged one of your two patches (the email one) already -- I won't deploy on production until you're confident though. [18:47:21] Cool. I'll test though it and figure out what else I did wrong. [18:47:40] Can I run this locally in Vagrant so I'm not blocking you? [18:48:11] Maybe -- I haven't tried :) It's checked in, at least. [18:48:58] 'wikitech' [18:51:37] Hopefully it won't take long. I found two dumb problems. 
Trying to test again now to see if that's all that I did wrong. [18:54:27] * YuviPanda pats scfc_de [18:55:11] andrewbogott: Somehow I've made the wiki totally unresponsive :( Or something else broke. [18:55:53] bd808: I restarted apache, seems ok now [18:56:07] cool beans [18:58:46] andrewbogott: Can you add the user BryanDavisTest2 to the shell group so I can test? [18:58:48] YuviPanda: Que? [18:59:16] bd808: done [18:59:55] andrewbogott: w00t! "BryanDavisTest added you to project Nova Resource:Testproject" [19:00:10] I'll amend the patch to fix my dumb mistakes [19:00:14] cool [19:00:16] yay testing :) [19:01:41] Totally! I'm glad you had a place setup to do it. [19:03:08] Oh, labs-l. IRC and news are asynchronous for me. [19:03:21] scfc_de: :) [19:05:17] Well, petan has a point to some extent. But just complaining doesn't get the work done. [19:05:36] scfc_de: true, I responded again with something more substantial [19:05:48] it sure was funny :) [19:10:10] gifti: heh [19:10:17] scfc_de: no btrfs yet, though, I guess :) [19:13:48] YuviPanda: I use btrfs on my personal box and it works quite nice there. I also heard some webhosters using it big scale. Primary motivation for me were snapshots: On each boot I take one so that I have a consistent image to back up. [19:14:11] scfc_de: right. is it on 14.04 LTS? [19:14:18] in a stable form? [19:14:31] I'm just wary of having something as fundamental as the FS be slightly experimental [19:16:43] Oh, dunno. I use Fedora. I wouldn't recommend it on Labs for instance volumes at the moment because I don't see that many benefits (none?) compared to the risk. It might be useful for the time-travel backups on project volumes, though. [19:17:13] right [19:18:04] bd808|LUNCH: the labs-vagrant stuff will probably take some time, working on some diamond stuff now. sorry [19:23:25] !log tools tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again [19:23:27] Logged the message, Master [19:24:21] scfc_de: can you add that as a ferm rule in puppet? [19:25:07] YuviPanda: I think Redis provides a way to allow only local connections, and that's the way I wanted to go. (ferm = trouble.) [19:26:11] andrewbogott: also, any reason to not get scfc_de root on the proxy instance? he has an NDA signed, and *.wmflabs.org is on tools-webproxy, which he has access to anyway [19:26:51] YuviPanda: That's fine, if it's useful. [19:27:00] alrighty then [19:31:14] !log project-proxy added scfc_de as projectadmin [19:31:16] Logged the message, Master [19:52:08] YuviPanda: No worries. We'll get it fixed at some point :) [20:13:19] Sorry everyone, I was stuck without a computer for a couple hours while my update decided to mess with me. Did I miss anything important? [20:18:34] Coren: nope [20:19:44] scfc_de: have you tested the redis patch on something? [20:21:18] YuviPanda: tools-proxy-test [20:21:24] scfc_de: cool [20:21:50] It is applied there at the moment, iptables are empty, so it shouldn't allow external access (better check again :-)). [20:22:08] scfc_de: :D [20:22:25] And it doesn't in comparison to tools-redis. 
*pooh* [20:30:34] !log mediahandler-tests deleting mediahandler-tests-trusty in 48 hours [20:30:36] Logged the message, Master [20:32:06] !log mediahandler-tests mediahandler-tests-mol, mediahandler-tests-static used for community evaluation, don't touch [20:32:08] Logged the message, Master [20:36:31] !log created experimental instances integration-zuul-merger and integration-zuul-server . Moved them to use local puppetmaster [20:36:32] created is not a valid project. [20:40:39] !log integration created experimental instances integration-zuul-merger and integration-zuul-server . Moved them to use local puppetmaster [20:40:41] Logged the message, Master [21:13:20] question about mediawiki-vagrant, previously i could `sudo -u www-data php foo/bar/baz.php` to run a maintenance script as the same user as the web application server, but in a newly created instance /etc/passwd has www-data as /usr/sbin/nologin [21:13:36] i have manually changed that to /sbin/bash for now so my scripts still work, but is there a preffered way? [21:23:27] !ping [21:23:27] !pong [21:25:48] ebernhardson: I *think* `sudo -u` should still work with a bogus shell in /etc/password, but I'll double check as soon as my new vm finished building. We definitely want to be able to run maint and jobs as www-data. [21:27:47] hi! the tool im currently working on disappeared somehow (404 - Not found). any clues? [21:29:15] a930913: People should be linking to toolserver.org, not wmflabs.org [21:29:36] my web app at tools.wmflabs.org/languagetool/ has an out of memory and doesn't react but I cannot restart of stop it, neither with "./webservice -tomcat restart" nor with jdel or qstop. how can I kill it? [21:30:00] bd808: it turns out sudo -u does work, for some reason the script had `sudo su www-data -c "php ....."` [21:30:15] damn you typos! [21:30:27] sanyi4: What's the URL? [21:30:55] scfc_de: http://tools.wmflabs.org/lonelylinks/ [21:32:23] dnaber: Odd. The job is in status "dr", that is running + deleting. [21:33:03] scfc_de: well, i have just called "./webservice -tomcat restart" and it's trying to restart since 5 minutes or so [21:33:36] scfc_de: actually trying to shut down first, i guess [21:33:52] sanyi4: You have no webservice running. You need to execute "webservice start" to do that. [21:34:27] sanyi4: is it a Java web app maybe? [21:35:34] dnaber: If you're using the standard webservice script, restart = stop + start (and your local script looks unchanged in that regard). [21:35:42] no, plain php. ill try webservice start. [21:37:51] scfc_de: tomcat can be strange, I sometimes need to kill (kill -9) it on my other server, too. can that be done in the cluster, too? [21:38:22] scfc_de: now "Internal error" [21:43:12] dnaber: You should be able to ssh from tools-login to tools-webgrid-tomcat and there you can directly interact with the processes. [21:44:55] sanyi4: The webserver errors are written to error.log in the tool's home directory. Yours has warnings and one fatal error: "PHP Fatal error: Call to a member function set_charset() on a non-object in /data/project/lonelylinks/public_html/lonelylinks.php on line 35". So you need to fix this to have it work. [21:48:40] scfc_de: thanks, I killed the process and i have restarted it again now. [21:49:57] scfc_de: ok, thanks. 
i apologize for my unaquaintance, i may have some more silly questions in the future :) [21:50:18] sanyi4: :) the only silly questions are the ones unasked :) [21:52:14] bd808: hmm, so our git::clone module doesn't support tags, but supports branches. Can I just make a branch instead? [21:52:20] no patches on top of that, I promise :) [21:54:23] YuviPanda: Sure, or fix git::clone :) [21:54:42] bd808: right, but then I'll have to deal with 'no branch and tag at the same time please' thing :) [21:55:37] Is there a difference between checking out a branch and a tag? [21:57:58] bd808: I think -b in clone is for branch? [21:58:49] YuviPanda: Ah yes. I was thinking of checkout not clone [21:58:56] bd808: right, no git checkout [21:58:59] You can't clone a tag [21:59:06] bd808: I could do an exec, but I think _joe_ will beat me up :) [21:59:21] I could wrap that around with a git::checkout, though [21:59:22] hmm [21:59:28] would be useful for pinning, no doubt [21:59:35] let me do that [22:10:20] I have to run a task which needs huge amount of resource specially cpu, can someone help me. the grid isn't enough for this task [22:11:37] bd808: hmm, I wonder what checkout should do if the working tree is dirty? [22:11:47] bd808: just not checkout? error? [22:11:53] everyone is watching soccer, the Netherlands wins, why wasting time :D [22:11:59] Amir1: hey! can you tell me why the grid isn't enough? [22:12:00] YuviPanda: cry? [22:12:10] bd808: I thought that was a given with puppet :) [22:12:39] YuviPanda: hi, because It needs cpu and when I send it to grid it doesn't work at all, stops [22:12:52] It doesn't return memory error by the way [22:13:08] Amir1: are you sure that's because the grid isn't giving it enough CPU? You can log into the grid nodes (ssh tools-exec-) to see how much CPU it is using [22:13:17] (I usually determine a rather high amount of memory for them) [22:13:33] let me check and tell [22:13:37] Amir1: ok! [22:15:09] YuviPanda: I think checkout should probably die if the tree is dirty. You could stash and pop but that's scarry. [22:15:34] bd808: yeah, merge conflicts and all. question is if puppet run itself should fail, or if I should just do a notice [22:15:46] I'm thinking of making the puppet run fail [22:15:50] YuviPanda: this is the all I can get http://paste.ubuntu.com/7772357/ [22:15:53] but that'll mess with people who are developing on the labsvagrant [22:16:17] YuviPanda: If you don't yell no one will notice. Even then they may not :( [22:16:35] Amir1: right, so from tools-login, if you do 'ssh tools-exec-06', and then run 'top' while your tool is running. that'll help you determine if CPU is actually the problem [22:16:46] bd808: true, true. [22:18:46] YuviPanda: oh okay [22:18:50] tools.dexbot@tools-exec-06:~$ top | grep 24548 [22:18:53] 24548 tools.de 20 0 501m 433m 3628 R 72 5.4 14:34.84 python2.7 [22:19:06] 72 is the CPU percentage [22:19:11] ah, hmm, right. [22:19:21] Amir1: so it *is* getting the CPU, no? [22:19:34] it's taking up 72% of the CPU, so that'll be the same wether it runs in the grid or elsewhere [22:21:29] bd808: actually, I'm going to bail on it now. Just realized I'd have to do things like maintain the user ownership of files, etc. [22:21:34] needs more thought, not a 4AM one [22:22:01] 28661 tools.ph 20 0 98912 66m 4160 R 88 0.8 0:14.65 tesseract [22:22:01] the last line is not what I am running [22:22:19] YuviPanda: No worries. I see you gave up on local time sleep hours again. 
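On the git::clone/tag pinning exchange above: pinning a working copy to a tag needs a checkout step after cloning, and the view above is that the checkout should refuse to touch a dirty working tree rather than stash anything. Below is a sketch of just that behaviour, not the actual puppet git::clone/checkout code; the repository path and tag name are placeholders.

    #!/usr/bin/env python
    # Sketch of "check out a pinned tag, but die if the working tree is dirty"
    # as discussed above. Not the actual puppet code; REPO and TAG are placeholders.
    import subprocess
    import sys

    REPO = "/srv/example-checkout"   # hypothetical working copy
    TAG = "v1.2.3"                   # hypothetical pinned tag

    def git(*args):
        out = subprocess.check_output(["git"] + list(args), cwd=REPO)
        return out.decode("utf-8", "replace").strip()

    if git("status", "--porcelain"):
        sys.exit("working tree at %s is dirty; refusing to check out %s" % (REPO, TAG))

    if git("describe", "--tags", "--always") != TAG:
        git("fetch", "--tags", "origin")
        git("checkout", TAG)         # leaves a detached HEAD at the pinned tag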
[22:22:21] Amir1: there are 4 cores :) [22:22:33] bd808: yeah, but I do go to sleep before it is light outside. [22:23:21] YuviPanda: so what can I do [22:23:33] Amir1: I don't think CPU is the problem [22:23:43] if CPU is the problem, it'll be at 100% constantly [22:23:51] so the problem probably lies elsewhere [22:24:18] Amir1: so maybe whatever you are running *is* taking a looong time? [22:24:28] when I run it in command it acts okay (but taking very long time) [22:25:00] YuviPanda: the task reports the progress, it's taking a long time (~a day) [22:25:04] Amir1: right, have you given it the equal amount of time (+ a bit more, perhaps?) on the grid? [22:25:17] Amir1: ah, how long does it take when you run it from the commandline? and where did you run it? [22:25:18] but it returns no progress at all [22:25:40] in that case, I suspect some strange logic error in the program itself? [22:25:46] if it is not returning any status at all... [22:25:57] the code works in four phases: the first phase will be done in two or three hours) [22:26:31] but when I run it in grid, after 14 hours it doesn't get to start even first phase [22:26:52] note: It didn't start it [22:27:04] Amir1: that definitely sounds like a bug in your code [22:27:31] Amir1: wait, what do you mean by 'It did not start it'? what did not start what? [22:27:47] I think I didn't explain it correctly [22:28:20] the bot needs to work in 4 phase, let put other phases aside and say it works just in first phase [22:28:32] the first phase takes two or three hours [22:28:45] and during the action it constantly print a number [22:28:54] like 324500 (every minute) [22:29:04] I do it to be sure it's working and making progess [22:29:24] but when I put this in grid (run it via jsub) [22:29:47] the wdumps.err or wdumps.out shows nothing at all [22:29:55] even after 14 hours [22:30:13] right. so your tool is 1. taking up 70% CPU, and 2. not printing output o stdout that it should be printing every minute [22:30:24] yes [22:30:25] as I said, that *definitely*, to me, sounds like a bug in your code. [22:31:15] YuviPanda: My issue is I run the exact same code (that code) directly in my tool it works [22:31:30] Amir1: by 'directly in my tool' you mean from tools-login [22:31:43] tools-dev [22:32:27] Instead of command jusb -once -N something -mem 1g python X [22:32:32] I command python X [22:32:35] right. [22:32:46] the exact same code [22:32:56] Amir1: If you run a Python script, why are there Perl errors in wdumps10.*? [22:33:29] scfc_de: this error is returned when I command it in Linux I don't why I just ignore it [22:33:57] when I use putty this error doesn't being returned [22:34:27] can you tell me the name of the tool? [22:35:07] "wdumps10" [22:35:12] my tool is "dexbot" [22:35:20] the first one is the task [22:37:31] I have to say it's reading a huge dump [22:37:38] about 10 GB [22:38:00] Amir1: are you reading it all into memory at once? :) [22:38:33] Amir1: is your tool still running? [22:38:38] Amir1: if not, can you start it for me? [22:38:39] YuviPanda: no. pywikibot has a script to generate entries of dumps [22:38:50] ah, right. 
I think that uses a generator [22:39:10] YuviPanda: in grid yes [22:39:38] the number is 2212711 [22:41:49] YuviPanda: I can this script in dump of every project, It works even on Italian Wikipedia but not English Wikipedia [22:42:07] *I can run [22:42:22] Everything is okay when the dump is not for English WP [22:42:24] Coren, andrewbogott, request blessing to merge changes on openstack system users [22:43:01] mutante: link? [22:43:10] https://gerrit.wikimedia.org/r/#/c/138002/6/manifests/openstack.pp [22:43:30] (i'll make another one to unify the quoting :) [22:44:02] t [22:44:34] it's the same "replace generic::systemuser with normal 'user'/'group' puppet types that we did on other services [22:45:01] Amir1: so, I just ran strace on it. It is running fine, but is doing a lot of file system calls to NFS, which return slowly [22:45:35] Amir1: when you ran the script from -dev, was it on enwiki? [22:45:49] YuviPanda: yes [22:46:57] scfc_de: is it possible that NFS access on exec nodes can be slower than from -dev? [22:47:20] mutante: I'm pretty sure that whole openstack::project-storage section can be pruned, but I'll make a note to do that later... [22:48:41] andrewbogott: ok, thanks, we just want to get rid of like the last 3 occurences of this [22:48:56] then we can make the change that gets rid of the actual generic::systemuser class [22:49:02] all others were noops [22:49:20] thanks :) [22:50:07] Amir1: did you kill your job? [22:50:15] err, I mean, process? [22:51:34] So the number of HDD calls is the problem, the HDD is not fast. Is there a way to fasten it up? e.g. moving the files to the sever or someplace else [22:51:34] I think during the task this code makes about 50M calls [22:51:51] Amir1: right, so if you are reading the dumps, they aren't HDD calls [22:51:54] Amir1: they're actually NFS calls [22:52:09] Amir1: which are over the network, to another machine, which actually has the massive storage space [22:52:51] Amir1: so unfortunately there isn't really a way of making that faster from a hardware perspective. [22:54:23] Amir1: only solution is to reduce the number of calls, mostly by reading larger amounts of data. I'm unsure how pywikipediabot does it [22:55:59] YuviPanda: thank you. [22:56:10] Amir1: yw! glad to be able to help :) [22:56:32] Amir1: processing enwiki dumps, especially over NFS, is always going to be a bit painful, requiring a good amount of care to reduce reads. [22:56:49] let me see what I can do. I'm pretty much sure that PWB doesn't support big calls, but I'll implement it [22:57:09] :D [22:57:21] Amir1: if it is reading one page at a time, making it be able to read X pages at a time might help [22:57:31] Amir1: or incresing the buffer size it requests to read [22:58:25] yes [22:58:49] Amir1: also, if you're storing intermediary files in /data/project (which is all on NFS), you will again take a performance hit, since that is on NFS again. [22:58:55] so that'll also need to be considered [22:58:59] YuviPanda: I don't think there is a significant difference in NFS performance between instances. But I don't think either that the number of read calls are the limiting factor as the data needs to be uncompressed as well. [22:59:52] But working on the dumps isn't a unique challenge -- there are others who have done so and are doing it at the moment, so I would just pick their brains for advice :-). [23:00:09] right [23:00:13] do you know what this does in labs? [23:00:15] class nfs::home::wikipedia { [23:00:20] it says "historical".. 
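On the dump-reading advice above (fewer, larger reads over NFS): a minimal sketch of chunked reading of a bzip2-compressed dump, so that each NFS round trip fetches megabytes rather than one page's worth of data. The dump path and the 16 MB chunk size are assumptions for illustration, and counting "<page>" is only a stand-in for real parsing.

    #!/usr/bin/env python
    # Sketch: read a bz2-compressed dump over NFS in large chunks so each read is
    # one big NFS request instead of many small ones. Path and chunk size are
    # assumptions; counting "<page>" is a stand-in for real parsing (and can miss
    # a tag split across a chunk boundary).
    import bz2

    DUMP = "/public/dumps/public/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
    CHUNK = 16 * 1024 * 1024         # 16 MB per read

    decomp = bz2.BZ2Decompressor()
    pages = 0
    with open(DUMP, "rb") as f:
        while True:
            data = f.read(CHUNK)     # one large sequential read per iteration
            if not data:
                break
            pages += decomp.decompress(data).count(b"<page>")
    print("pages seen: %d" % pages)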
yea [23:00:25] because that was like on fenari [23:00:33] but it also has a labs section [23:00:44] scfc_de: but in this case, either way, the process seems to be doing a *lot* of IO. strace was nothing but IO [23:02:08] scfc_de: btw, am wondering if we should add more exec nodes :) [23:02:14] Well, you probably don't see the uncompression in strace :-). I have no idea what run time one would expect to parse the complete enwiki dump or how it could be optimized. [23:02:51] right [23:02:55] scfc_de: I can do it in about 4-6 hours depending on what Im looking for [23:03:29] scfc_de: most nodes are at about or over 100% load [23:03:36] YuviPanda: Looking at the Tools status page (http://tools.wmflabs.org/?status), we seem to be indeed CPU-bound now with much free memory. [23:03:53] scfc_de: right. shall I add one or do you want to do it? [23:04:19] scfc_de: actually, I could perhaps start the process, go to sleep, wake up to find that puppet run has finished, and then do the additional things :) [23:04:49] YuviPanda: Make it so, because I'm too tired to do something complicated as well :-). [23:04:53] scfc_de: :D [23:04:54] scfc_de: ok [23:05:10] scfc_de: I'm going to add two nodes. We don't actually have puppet code for an lvm /tmp though, so can't do that :( [23:06:39] That would be ... (who's the Greek with the stone again?) Unfortunately, there are lots of directories processes can clog by accident. /tmp is just the "easiest" target. [23:06:50] !log tools created tools-exec-11 [23:06:52] Logged the message, Master [23:07:13] scfc_de: Sisyphus? [23:07:16] !log tools created tools-exec-12 [23:07:18] Logged the message, Master [23:07:23] That's the guy! [23:07:36] scfc_de: true, but outside of /tmp and /var/log... [23:07:49] (Though I wouldn't guarantee for the order of i and y.) [23:07:55] scfc_de: bah, created -12 with trusty by accident [23:08:00] http://en.wiktionary.org/wiki/Sisyphusarbeit [23:08:40] Indeed like the alphabet: i < y. Reminds me of the patron of my school: Sibylla. [23:08:41] !log tools created tools-exec-12 as trusty by accident, will keep on standby for testing [23:08:43] Logged the message, Master [23:09:08] !log tools created tools-exec-13 with precise [23:09:10] Logged the message, Master [23:10:21] For example Redis & Co. write to /var/lib/something. On Fedora, /tmp is a ramdisk, so no problems here :-). [23:12:01] I'm customizing the upstream library to reduce the work [23:12:07] (and calls) [23:12:45] Amir1: we're also adding more nodes to ease CPU load everywhere as well [23:13:04] YuviPanda: Thank you :) [23:13:21] Amir1: yw. Thanks for making us look there :) [23:13:37] :) [23:14:22] !log tools applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13 [23:14:24] Logged the message, Master [23:16:04] scfc_de: I think the eventual trusty migration will be a good time to fix the /tmp issue [23:16:07] (and maybe other ones too) [23:16:52] YuviPanda: Sounds like a plan. [23:18:24] scfc_de: cool. before that we should run a trusty queue for people to test their tools as well, I suppose [23:19:15] scfc_de: runnign puppet on the trusty node, let's see if things succeed :) [23:19:34] 3Wikimedia Labs / 3Infrastructure: source group field is confusing - 10https://bugzilla.wikimedia.org/67759 (10Chase) 3NEW p:3Unprio s:3normal a:3None This is a bug abogott asked me to log. He was helping me make some security group rules and it was bouncing me for specifying the source group (though... 
[23:20:34] maybe someone knows the answer: i use a result of an sql query in an other query, and if there is an apostrophe in the result value, its messing up the query (i hope i was understandable) -- what to do? [23:21:55] sanyi4: heya! you shouldn't be using string concatenation (the . operator) for building SQL queries [23:22:18] sanyi4: if you're using PHP, please use http://us2.php.net/manual/en/book.pdo.php or something similar, with its 'prepared statements' [23:22:42] sanyi4: string concating sql queries also has big security issues (https://en.wikipedia.org/wiki/SQL_injection) [23:24:49] gifti: are you using your node *at all*? I haven't seen any jobs running on it... [23:25:08] i do [23:25:16] gifti: hmm, nothing on http://tools.wmflabs.org/?status [23:29:50] hm, something changed with qacct? [23:32:33] gifti: no, there isn't actually anything running there. [23:32:36] top shows nothing [23:33:22] not atm, that's right [23:33:30] I see. [23:33:37] do you still need a node all to yourself? [23:34:03] the TCL issues were successfully resolved, IIRC [23:34:33] that has nothing to do with tcl [23:34:50] right, but do you still need your node? [23:35:36] i have a biweekly array job that is/was not possible with the usual grid config [23:35:51] hmm, I see [23:35:57] I'll let Coren handle it then. [23:36:01] thanks for clarifying! [23:36:06] np [23:43:59] alright, I'm off now [23:44:00] * YuviPanda waves [23:46:01] Good night! [23:56:20] anyone have an idea what's wrong with https://tools.wmflabs.org/commonshelper/ ? (page doesn't load) [23:56:54] good night for today! many thanks, guys! [23:57:20] ....ah, i see someone already notified magnus at https://en.wikipedia.org/wiki/User_talk:Magnus_Manske#CommonsHelper
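On the apostrophe/SQL question near the end: the fix is parameterized (prepared) statements rather than string concatenation, via PDO in PHP as suggested above. Since most other examples in this log are Python, here is the same idea sketched with pymysql; the connection details and the query are placeholders, and any DB-API driver behaves the same way.

    #!/usr/bin/env python
    # Sketch of the parameterized-query advice above, in Python (the channel
    # suggests PDO for PHP). Connection details and the query are placeholders.
    import os
    import pymysql

    conn = pymysql.connect(
        host="enwiki.labsdb",        # placeholder replica host
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        cur = conn.cursor()
        title = "L'Aquila"           # a value containing an apostrophe
        # The driver quotes the %s parameters itself, so the apostrophe cannot
        # break the statement (or be used for SQL injection).
        cur.execute(
            "SELECT page_id FROM page"
            " WHERE page_namespace = %s AND page_title = %s",
            (0, title.replace(" ", "_")),
        )
        print(cur.fetchall())
    finally:
        conn.close()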