[00:01:16] Tool Labs tools / [other]: merl tools (tracking) - https://bugzilla.wikimedia.org/67556#c2 (merl) NEW>ASSI It is a tracking bug for features missing at labs, mediawiki, wikidata or any other component covered by this bugzilla that any of my tools relies on. For my tools itself I am tracking bugs... [00:04:54] Coren: you around? [00:22:37] Betacommand: Wazza? [00:23:16] Coren: See PM for BEANs [01:00:27] Dispenser: there isn't a query killer [01:00:50] Then I'll write another one [01:01:10] Dispenser: such a tool isn't allowed [01:01:37] Dispenser: some data is exposed on labs that general users shouldn't have access to [01:02:17] Dispenser: it also opens up a whole host of problems, and security issues [01:03:36] So I chmod 400 * -R [01:04:28] an unprivileged user that can't even write to their home directory [07:02:31] Tool Labs tools / [other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://bugzilla.wikimedia.org/66868#c1 (Thorsten) All tools are redirected to http://dispenser.homenet.org now. Why that? [07:02:46] :D [09:35:31] Tool Labs tools / [other]: Migrate http://toolserver.org/~dispenser/* to Tool Labs - https://bugzilla.wikimedia.org/66868#c2 (Andre Klapper) I *guess* it's related to https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28technical%29&oldid=616201284#Can_we_raise_the_loss_of_these_tools_w... [10:48:33] Coren: http://tools.wmflabs.org/dispenser/cgi-bin/viewer.py/Reflinks redirects to http://dispenser.homenet.org/~dispenser/cgi-bin/viewer.py/Reflinks. My gut feeling says there needs to be an interstitial because the user probably won't check the URL after clicking and expects to be in WMF realm. Thoughts? [11:08:48] Mailed legal@wikimedia.org about that. [11:14:22] even though it's 'external', it doesn't save from that server; dispenser has disabled 'auto-save', so now the user is redirected to their wiki instead of homenet, so I doubt any data gets saved.. [11:20:20] warpath: But it (very probably) will log which IP has requested which article. Users of WMF web services have the expectation that this information is limited to ops who have signed an NDA. [11:21:25] indeed the only drawback....but I trust dispenser so am not afraid to use the tools.. [11:22:55] warpath: Yes, and that is *your* choice. But a URL http://tools.wmflabs.org/... doesn't raise any expectation that you have to think about anyone's trustworthiness besides WMF. [11:23:49] dispenser needs to add a note on top of his tools I presume..not all tools are hosted on wmflabs.. [11:24:49] Who's disturbing my slumber? [11:25:09] warpath: Then the user has already submitted his data to a third party. [11:25:27] that guy ^ [11:27:20] I've been telling everyone to keep pointing to Toolserver. If it's an issue we can have nosy/silke change that link [11:38:59] And it's not proxying as before, to avoid tainting Labs' reputation [12:11:09] Coren: around? [12:12:32] Wikimedia Labs / deployment-prep (beta): Setup monitoring for Beta cluster - https://bugzilla.wikimedia.org/51497#c6 (Tim Landscheidt) I chatted yesterday with Yuvi a bit about monitoring and its challenges, and he reminded me that the main problem with applying the prod setup to Labs is that roots can...
[12:13:28] scfc_de: I'd say worst thing is a remote code execution exploit [12:13:32] this whole dispenser thingy is getting frightfully annoying [12:13:54] scfc_de: since checks are done by code that could potentially be running on the icinga host [12:13:59] It's an OPEN SOURCE environment with OPEN SOURCE tools [12:14:21] scfc_de: also, have you looked at the icinga code in puppet? very, very prod specific. [12:14:34] hedonil: yeah, true. I think Coren was working on making that easier to determine on labs. [12:14:44] hedonil: also, unrelated, any progress on the nginx log stats stuff? [12:14:47] so: fork these tools as *real* Open Source and put an end to this story [12:15:05] hedonil: legoktm tried that, kinda, and Dispenser asked him to take them offline. the tools need to be rewritten [12:15:10] Or create new ones without any possibility of legal action. [12:15:10] YuviPanda: Yeah, you did a *lot* of stuff ;) [12:15:32] hedonil: not sure if sarcastic :D I just gave you a sample of the logs :) [12:15:34] YuviPanda: Yeah, but is root on Icinga so exciting? [12:15:51] scfc_de: potentially, since it will be running on real hardware. [12:16:34] YuviPanda: In your proposal. My comment on the bug is for the "current" setup with Icinga and Ganglia as Labs instances. [12:16:47] scfc_de: ah, right. [12:16:54] scfc_de: but, have you looked at prod's icinga code? :) [12:17:05] scfc_de: one of the few times I can sympathize with petan for not having to use that. [12:17:29] hedonil: btw, once the graphite situation stabilizes I'll be collecting metrics on 5xx return codes for the tools as well [12:17:54] YuviPanda: No, not sarcastic at all .. really great job (maybe some improvements in change management :P process though..) [12:17:59] I peeked in the past and it didn't look /that/ bad? What do you have in mind specifically? [12:18:11] scfc_de: icinga.pp has a lot of prod specific stuff [12:18:13] YuviPanda: I'm going to write son bash lines during this day [12:18:21] hedonil: can you, uh, make them python? :D [12:18:40] YuviPanda: of course, will do so [12:18:44] hedonil: ty! :) [12:19:15] YuviPanda: your wish, my command, master ;) [12:19:17] hedonil: feel free to use whatever output format you deem fit, keeping in mind space constraints [12:19:31] hedonil: what do you mean by 'improvements in change management process'? [12:19:51] hedonil: also, PM? [12:20:10] YuviPanda: Sure, it's not written with other use cases in mind. Still: Let me cling to my dreams :-). [12:20:16] YuviPanda: the 101 of change mangement [12:20:37] scfc_de: :D no reason to not use prod's icinga, but step 1 of that is modulizing it :) people were looking at shinken too, to replace icinga in prod [12:20:45] YuviPanda: first test on one node, then roll out to all nodes :P [12:21:08] hedonil: hehe :D I do do that most of the time, except the proxy business was partly caused by traffic and that's a bit hard to test on spare nodes :) [12:21:10] YuviPanda: Yep, betting on the wrong horse would be ... bad. [12:21:46] scfc_de: yeah, agreed. I'd still like to make at least some of the metrics be checked with check_graphite, but that's for later :) [12:21:50] * hedonil thinks about thre massive diamond-log things on several nodes [12:22:26] hedonil: the problem there was /var/ being only 2G, which I think is wrong, but yeah, that was fixed a bit later than I should've. 
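Since the nginx log-stats script discussed above is meant to end up as Python rather than bash, here is a minimal sketch of the sort of thing it could do: counting 5xx responses per tool from an access log. The log path, the default "combined" log format and the /<toolname>/... URL layout are assumptions for illustration, not details from the channel.

    #!/usr/bin/env python
    # Minimal sketch: count 5xx responses per tool from an nginx access log.
    # Assumptions (not from the channel): the log lives at /var/log/nginx/access.log,
    # uses the default "combined" format, and tool URLs look like /<toolname>/...
    import re
    from collections import Counter

    LOGFILE = "/var/log/nginx/access.log"   # hypothetical path
    # e.g. ... "GET /sometool/foo?x=1 HTTP/1.1" 502 ...
    REQUEST = re.compile(r'"[A-Z]+ /([^/ ?"]+)[^"]*" (\d{3}) ')

    errors = Counter()
    with open(LOGFILE) as f:
        for line in f:
            m = REQUEST.search(line)
            if not m:
                continue
            tool, status = m.group(1), int(m.group(2))
            if 500 <= status < 600:
                errors[tool] += 1

    for tool, count in errors.most_common(20):
        print("%6d  %s" % (count, tool))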
[12:23:21] YuviPanda: as it turned out later, hehe - but I think you get the point [12:23:27] hedonil: i do, I do :) [12:23:32] YuviPanda: hehe [12:23:42] hedonil: have you seen http://tools.wmflabs.org/giraffe/index.html#dashboard=ToolLabs+Basics&timeFrame=1h [12:25:09] !log tools cleaned out old diamond archive logs from tools-exec-08 [12:25:12] Logged the message, Master [12:25:31] hedonil: at least with ^ we can be a little more proactive about cleaning up disk [12:25:40] YuviPanda: Very well done [12:26:09] hedonil: scfc_de eventual plan is to kill ganglia, both in labs and prod [12:26:39] only after we've decent dashboards, ofc [12:26:50] YuviPanda: Did you apply biglogs by some particular scheme? I see volumes on exec nodes 1, 2, 3, 4, 5 and 10? [12:27:01] scfc_de: no, I was applying as I cleaned them up [12:27:09] so it's a bit, uh, erratic [12:27:16] k [12:27:23] scfc_de: feel free to apply them to the rest :) [12:27:35] scfc_de: I'll do it as I hit 'em [12:28:04] Yep. Should probably be included from toollabs::*exec* & Co. [12:28:14] !log tools cleaned out old diamond archive logs on tools-webgrid-04 [12:28:16] Logged the message, Master [12:28:33] !log tools cleaned out old diamond archive logs on tools-master [12:28:35] Logged the message, Master [12:32:53] gifti: Re tools-exec-gift, I assume your requirements for packages & Co. are identical to a "regular" exec node so the only difference is in how the SGE scheduler treats that host? [12:34:43] !log tools tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that [12:34:46] Logged the message, Master [12:40:13] !log tools tools-exec-cyberbot: Root partition has run out of inodes [12:40:16] Logged the message, Master [12:40:30] YuviPanda: See! That's why they should be monitored as well :-). [12:40:37] scfc_de: heh, was just going to say that :) [12:41:45] @seen Cyberbot378 [12:41:45] scfc_de: I have never seen Cyberbot378 [12:41:51] Ah, numbers ... [12:41:55] @seen Cyberbot376 [12:41:55] scfc_de: I have never seen Cyberbot376 [12:42:01] @seen Cyberbot678 [12:42:01] scfc_de: I have never seen Cyberbot678 [12:42:08] @seen Cyberpower378 [12:42:08] scfc_de: I have never seen Cyberpower378 [12:42:15] @seen Cyberpower678 [12:42:15] scfc_de: Last time I saw Cyberpower678 they were quitting the network with reason: Ping timeout: 255 seconds N/A at 7/5/2014 4:13:21 PM (3d20h28m53s ago) [12:42:19] Or words. [12:42:57] hehe [12:42:57] hedonil, YuviPanda: The community would appreciated more new tool rather than rewriting existing ones. If you need ideas just ask me. [12:45:38] Dispenser: I think the tools being forkable / open source is important as well. [12:50:53] !log tools tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step [12:50:56] Logged the message, Master [12:51:18] scfc_de: I didn't know of -delete :) [12:52:32] scfc_de: -gift has no slot or h_vmem limits; it's set aside specifically so that resource management is left to the application. [12:52:46] Coren: But packages & Co. are the same? [12:53:27] (Two-part question: a) Whether anything needs to puppetized differently. b) Why SGE fucked up.) [12:53:35] So for b) the answer would be user error? [12:53:50] scfc_de: You're correct, this requires an intersitial. [12:54:11] Coren: do we have a process for dealing with such issues? [12:54:17] also things like using GA, etc. 
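The find(1) one-liner logged above translates fairly directly to Python if the cleanup ever needs to grow into a script. This is only a sketch of that same logic; just the pattern and the 30-day age come from the logged command.

    #!/usr/bin/env python
    # Sketch of the same cleanup as the logged find(1) one-liner: delete regular
    # files directly under /tmp matching *cyberbotpeachy.cookies* and older than
    # 30 days.
    import fnmatch
    import os
    import time

    TMPDIR = "/tmp"
    PATTERN = "*cyberbotpeachy.cookies*"
    MAX_AGE = 30 * 24 * 3600          # seconds, matching -mtime +30

    now = time.time()
    for name in os.listdir(TMPDIR):              # -maxdepth 1
        path = os.path.join(TMPDIR, name)
        if not os.path.isfile(path):             # -type f
            continue
        if not fnmatch.fnmatch(name, PATTERN):   # -name
            continue
        if now - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)                      # -delete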
[12:54:21] PP violating tools, I mean [12:54:23] scfc_de: It's an exec node, so yeah. [12:55:27] YuviPanda: There is no process other than "the sysadmins will trout you" at this time; repeat offenders get blocked but that only happened twice to date and was mostly due to communication problems caused by language issues. [12:55:52] Coren: hmm, right. having something well defined (even if it is 'trout for X , then do Y') would be nice [12:55:53] YuviPanda: Dispenser: If I look at one of the published sources: https://web.archive.org/web/20130310035422/http://toolserver.org/~dispenser/sources/webreflinks.py [12:56:05] it says "Distributed under the terms of the GPL" [12:56:22] so there was no reason for legoktm to take down his fork [12:56:31] YuviPanda: Meh. Unneeded bureaucracy. [12:56:40] Coren: heh, alright. [12:56:58] Coren: btw, I was doing some back of the envelope calculations for graphite metrics, wanted to run 'em by you before responding to godog's email [12:56:59] It takes more than just that one file to make it work [12:57:04] hedonil: There's also the issue that much of that tool is derived from the GPL so its copyright status (and the truth value of some of the licence text) is dubious. [12:57:36] Coren: all of labs was 70k metrics on average, while it is 25k right now (just toollabs/betalabs/graphite). I was thinking of 150k metrics as a safe upper limit, but will be happy with 100k too. [12:57:53] Coren: this is for capacity planning of the disks. Your thoughts? I've never done capacity planning estimation stuff before [12:59:42] Dispenser: really, best thing would be to end this imho defiant behavior and port it to labs in total [12:59:45] YuviPanda: It's a fair estimation; but you also have to take into account RAID overhead (or the lack thereof) which will factor in a lot. [13:00:18] Dispenser: or just leave labs and make your own thing, you're free to do this [13:00:31] Haven't I already? [13:00:32] YuviPanda: No raid = less resources needed and generally faster I/O for small writes at the cost of the data being vulnerable; there's an important question to ask there. [13:01:44] Coren: right. looking at the machines in the server spares page, I'm personally ok with no RAID (we can always do backups, and should do backups anyway) + SSD. 600GB of non-RAID SSD + 500G of RAID'd NLSAS sounds good enough for a few years. [13:02:18] Coren: my back of the envelope calculations were ~400GB of storage for 100k metrics, and if we limit metrics to an opt-in per-project basis even that'll take a long time coming [13:03:04] YuviPanda: I don't even think it's worthwhile to go SSD if we have a number of stripped drives; IMO it's important to have more smaller drives that a couple bigger ones. [13:04:23] Coren: wouldn't the SSD perf still be greater? also we don't have anything in https://wikitech.wikimedia.org/wiki/Server_Spares with a number of small drives. Only thing with >2drives is 4x3TB [13:05:54] YuviPanda: That depends on a number of things; SSD are blindingly fast on reads but there are caveats about write patterns that may make our use pattern (bazillions of small consecutive writes) very suboptimal. [13:06:19] YuviPanda: But yeah "what is available" also has to factor into it. [13:06:46] Coren: yeah, if we had 4x500GB or 4x1TB that would've been a nobrainer [13:07:13] 4x3TB just seems... overkill. 
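For reference, the back-of-the-envelope disk numbers above follow from how Whisper (graphite's storage backend) sizes its files: roughly 12 bytes per archived datapoint plus a small header, so disk use is approximately metrics × retained points × 12 bytes. The retention schema in the sketch below is an assumed example, not the actual Labs configuration, but it lands in the same ballpark as the ~400 GB for 100k metrics estimate.

    #!/usr/bin/env python
    # Back-of-the-envelope Whisper sizing: ~12 bytes per archived datapoint
    # (plus a small per-file header). The retention schema is an assumed example,
    # not the real Labs config.
    BYTES_PER_POINT = 12

    retentions = [               # (seconds per point, points retained)
        (60,   90 * 24 * 60),    # 1-minute data for 90 days
        (300,  365 * 24 * 12),   # 5-minute data for 1 year
        (3600, 5 * 365 * 24),    # 1-hour data for 5 years
    ]

    points_per_metric = sum(points for _, points in retentions)
    bytes_per_metric = points_per_metric * BYTES_PER_POINT   # ~3.3 MB here

    for metrics in (25000, 100000, 150000):
        total = metrics * bytes_per_metric
        print("%7d metrics -> ~%d GB" % (metrics, total / 10**9))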
[13:07:53] hashar: there is /data/project/logs/captcha.log [13:07:56] Coren: and yes, SSD write patterns to factor in, but there are plenty of people on the internet (and bd808|BUFFER as well) talking about how a lot of graphite I/O problems went away with the switch to SSDs [13:08:11] hashar: so, https://gerrit.wikimedia.org/r/#/c/144933/ [13:08:35] hashar: going out for a run. feel free to comment there :) [13:10:59] YuviPanda: I'm sure it has, but I fear the SSD will not last long under that use. *shrug* Honestly, I'm not worried overmuch about it either way; if I had my way I'd want a box with lots of small disks as ideal, but we'll probably fare well with pretty much anything available for the foreseeable future. [13:12:16] Coren: yeah, I tend to agree as well. small points :) [13:19:57] coren: so we do have an issue with exec-gift? [13:22:47] gifti: There is no /technical/ issue with it that I know of, though you may wish to change a bit how you schedule your jobs so that your jobs which are more expensive take more slots. You have 1000 to allocate, which will certainly overload your node if you only use one each. [13:23:09] !log graphite removed whitelist.conf to see if the puppet change preventing diamond on non-selected projects from sending metrics has been done fully. Nothing on tcpdump [13:23:10] Logged the message, Master [13:24:41] ... wait, that statement came out incomprehensible. [13:24:43] :-) [13:25:06] Coren: are there no other means to automatically prevent overloading? [13:26:24] gifti: Well, the scheduler will not *start* jobs if there is already too much load, but that doesn't prevent previously-running jobs from growing. The only real way around it would be to turn on a consumable resource (like memory) that jobs can "spend" like on the general grid. [13:28:52] gifti: You may want something like https://bugzilla.wikimedia.org/show_bug.cgi?id=52976 as well; I'm about to turn that on in general, that might suffice to help you manage load if you can estimate how much resources the jobs will take when started. [13:32:51] gifti: you can limit the number of parallel executed tasks by adding -tc. Coren could also change the load the SGE scheduler expects for the job by changing job_load_adjustments and load_adjustment_decay_time [13:33:26] !log tools tools-exec-cyberbot: Freed 402398 inodes ... [13:33:28] Logged the message, Master [13:33:29] Merlissimo: -rc only works for true parallel jobs, not for independent ones. [13:33:37] -tc * [13:34:59] oh no array jobs? then only the bugzilla feature request would help, yes [13:37:02] Coren: on toolserver I let the SGE scheduler mirror np_load_short (1 min) and not np_load_avg (5 min) so that a server is blocked faster when load increases [13:37:40] np_load_short is also part of the alarm state on ts [13:40:08] And now /var is full on -cyberbot. *argl* [13:51:33] Wikimedia Labs / tools: Provide user_slot resource in grid - https://bugzilla.wikimedia.org/52976#c5 (Marc A. Pelletier) ASSI>RESO/FIX I went with 60 because that's a very divisible integer. The complex is named 'user_slot' with shortcut 'u'. [13:56:37] Coren: giftbot queue config: load_thresholds np_load_avg=3.75 you should really change this. normally 1.75 is a good value, and to solve this problem you should change the giftbot queue config imo to "load_thresholds np_load_avg=2.00,np_load_short=1.50" . that should solve the problem [13:57:41] Merlissimo: Yeah, those sound like reasonable values.
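As a concrete illustration of the -tc suggestion above (limiting how many tasks of an array job run at once, so a dedicated node is not overloaded), here is a hedged sketch of submitting such a job from Python. The job name, script path and all the numbers are hypothetical; -t, -tc, -l h_vmem and -b y are standard gridengine qsub options.

    #!/usr/bin/env python
    # Sketch: submit a 500-task array job but let at most 20 tasks run at once
    # (-tc), per the suggestion above. Name, script path and numbers are made up.
    import subprocess

    subprocess.check_call([
        "qsub",
        "-N", "gift-array",       # job name (hypothetical)
        "-t", "1-500",            # array job, tasks 1..500
        "-tc", "20",              # at most 20 tasks running concurrently
        "-l", "h_vmem=512M",      # per-task memory request
        "-b", "y",
        "/data/project/exampletool/process_chunk.sh",   # hypothetical task script
    ])
    # Each task reads $SGE_TASK_ID (1..500) to know which chunk to process.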
{{done}} [13:57:59] Merlissimo, gifti: I've added the user_slot consumable by the way, so you can also use that. [13:59:37] thx Coren [14:12:30] !log tools tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later [14:12:33] Logged the message, Master [14:18:12] ping: Petan wmbot broken [14:52:52] Is Beta Labs user is autoconfirmed by default? [16:07:04] We get JVM crashes with "There is insufficient memory for the Java Runtime Environment to continue." in our web app. Any idea what to do? [16:09:08] dnaber: Tools? Labs project? [16:10:03] It's running at http://tools.wmflabs.org/languagetool/ (if it's not just crashed) [16:14:03] dnaber: AFAICS, the amount of memory available for Tomcat is fixed to 4 GByte. Can be worked around manually, though, if the crashes continue. [16:15:31] scfc_de: can I copy the webservice script and just add a larger number there? but it doesn't make sense anyway, Tomcat and our app work nicely on a different (non cloud) server with less than 1GB [16:16:21] dnaber: In essence, for Tomcat "webservice start" calls "qsub -e $home/error.log -o $home/error.log -i /dev/null -q "webgrid-$server" -l h_vmem=4g -b y -N "$server-$tool" /usr/local/bin/tool-$server" (cf. /usr/bin/webservice:56). So you could just replace 4g with 6g or 8g (and replace $home and $server accordingly). [16:16:45] dnaber: Note that memory for grid purposes is virtual memory, that is not RSS. [16:21:17] 3Wikimedia Labs / 3tools: Setup an icinga instance to monitor tools on tool-labs - 10https://bugzilla.wikimedia.org/51434#c4 (10scott.leea) Is this something I can work on? [16:23:59] scfc_de: thanks, will try with 6gb [16:36:40] DispenserAFK: Is dab solver redirecting without PP warning? [16:40:46] bd808: have you tested https://gerrit.wikimedia.org/r/#/c/144469/2 on vagrant or someplace to verify that it compiles and such? [16:41:58] andrewbogott: Unfortuantely no [16:42:20] bd808: ok, lemme see... [16:43:58] bd808: ok, that change is now deployed here: http://wikitech-test.wmflabs.org/wiki/Main_Page [16:44:22] the page still loads… I'm not sure what else to test, though [16:44:41] andrewbogott: can you trigger a notification there? [16:45:12] bd808: I don't know enough about echo to know how to do that. Instance creation is broken atm; that would be the obvious way [16:45:48] create, reboot, delete, add user to a project are the notifications [16:46:42] ah, so if I add myself to a project I'll get one? [16:47:58] hm, seems not [16:57:49] bd808: do you mind creating an account on that test wiki so that you can look? I'm not sure what I should be seeing. [16:57:57] Let me know what your account name is and I'll grant you some rights. [16:58:09] Sure. Hang on ... [17:00:06] andrewbogott: Boom... undefined variable `base` in OpenStackNovaUser.php line 587 [17:00:38] Hm, doing what? [17:00:48] Creating a new user [17:01:16] ah, sure. i bet it worked anyway? I think I have a patch in progress about that [17:01:39] andrewbogott: http://paste.debian.net/108938/ [17:01:48] heh, yeah, in prod there's a patch titled "Hotfix, investigate why this ever worked before…" that fixes that [17:02:34] anyway, I think you can move on in spite of it? I just created an account and it worked despite the errors [17:03:24] andrewbogott: that is my favorite hotfix of the day now :) [17:03:32] 3Wikimedia Labs / 3tools: Setup an icinga instance to monitor tools on tool-labs - 10https://bugzilla.wikimedia.org/51434#c5 (10Marc A. 
Pelletier) Not just yet; we're currently at the stage where we are setting equipment aside for the task and doing our first round of specifications. I expect we'll spend so... [17:04:57] bd808: should be fixed now [17:05:10] andrewbogott: Cool. In an meeting now [17:11:34] 3Wikimedia Labs / 3tools: Sorting by CPU/VMEM columns doesn't sort by their value on http://tools.wmflabs.org/?status - 10https://bugzilla.wikimedia.org/67737 (10Liangent) 3NEW p:3Unprio s:3normal a:3Marc A. Pelletier They're currently sorted as text. [17:31:45] https to beta labs remains broken [17:33:28] spagewmf: It's been broken since we moved to eqiad hasn't it? [17:33:52] Not meaning that it's not something that I'd like to see fixed [17:35:00] bd808: I think it worked, I get redirected to https if logged in and lots of https beta labs URLs are in my rowser history [17:36:17] hmmm.... maybe I'm misremembering then. I do know it was broken when we first moved but maybe I never tried again after. [17:41:12] andrewbogott: Sorry for the delay. I created account BryanDavisTest on http://wikitech-test.wmflabs.org/ [17:42:29] I did get an error in the success page "devwiki could not send your confirmation mail. Please check your email address for invalid characters. Mailer returned: Unknown error in PHP's mail() function." Hopefully unrelated to the Echo changes since that is a different code path. [17:46:38] bd808: ok, I gave you shell rights and added you to two projects. [17:46:42] Do you see the notifications you'd expect for that? [17:48:05] andrewbogott: I got a notification for "Your user rights were changed by Testandrew." but only that one. [17:48:20] I logged out and back in and don't see projects either [17:48:41] wait... now they are showing up [17:49:32] So I see projects that I am a member of, but only see the rights change notification at this point. [17:49:43] OK. I guess we don't know if that worked before. [17:50:20] So, we've verified that your patch doesn't totally break echo :) Is that all we need? [17:51:24] The notification I got was actually not from OSM. It's from ListGroupRights [17:51:45] So I may still have totally broken notifications from OSM in that patch :( [17:52:03] alas [17:52:11] want me to give you a login on that host so you can see what's happening? [17:52:39] andrewbogott: Sure. I can stick some logging in or something to test it [17:53:12] wikitech-test-frontend [17:53:26] and it's using labs-vagrant, so the MW code is in /vagrant/mediawiki/ [17:53:47] bd808: I'll add the pinning code later today. thanks for the tag [17:54:57] andrewbogott: Cool. I'll poke at it in a bit (knee deep in an email) [17:55:04] 'k [18:42:08] bd808: are you working on wikitech-test now? Asking because it just broke and I don't know if it was spontaneous or not [18:42:38] I am. What broke? [18:42:45] everything :) [18:44:57] everything, but it's ok, I'll just work on something else for now. [18:45:55] andrewbogott: It's related to my patch. There are obviously several things wrong with it. [18:46:14] I fixed the hook registration and that causes the other breakage [18:46:47] bd808: ok. I merged one of your two patches (the email one) already -- I won't deploy on production until you're confident though. [18:47:21] Cool. I'll test though it and figure out what else I did wrong. [18:47:40] Can I run this locally in Vagrant so I'm not blocking you? [18:48:11] Maybe -- I haven't tried :) It's checked in, at least. [18:48:58] 'wikitech' [18:51:37] Hopefully it won't take long. I found two dumb problems. 
Trying to test again now to see if that's all that I did wrong. [18:54:27] * YuviPanda pats scfc_de [18:55:11] andrewbogott: Somehow I've made the wiki totally unresponsive :( Or something else broke. [18:55:53] bd808: I restarted apache, seems ok now [18:56:07] cool beans [18:58:46] andrewbogott: Can you add the user BryanDavisTest2 to the shell group so I can test? [18:58:48] YuviPanda: Que? [18:59:16] bd808: done [18:59:55] andrewbogott: w00t! "BryanDavisTest added you to project Nova Resource:Testproject" [19:00:10] I'll amend the patch to fix my dumb mistakes [19:00:14] cool [19:00:16] yay testing :) [19:01:41] Totally! I'm glad you had a place setup to do it. [19:03:08] Oh, labs-l. IRC and news are asynchronous for me. [19:03:21] scfc_de: :) [19:05:17] Well, petan has a point to some extent. But just complaining doesn't get the work done. [19:05:36] scfc_de: true, I responded again with something more substantial [19:05:48] it sure was funny :) [19:10:10] gifti: heh [19:10:17] scfc_de: no btrfs yet, though, I guess :) [19:13:48] YuviPanda: I use btrfs on my personal box and it works quite nice there. I also heard some webhosters using it big scale. Primary motivation for me were snapshots: On each boot I take one so that I have a consistent image to back up. [19:14:11] scfc_de: right. is it on 14.04 LTS? [19:14:18] in a stable form? [19:14:31] I'm just wary of having something as fundamental as the FS be slightly experimental [19:16:43] Oh, dunno. I use Fedora. I wouldn't recommend it on Labs for instance volumes at the moment because I don't see that many benefits (none?) compared to the risk. It might be useful for the time-travel backups on project volumes, though. [19:17:13] right [19:18:04] bd808|LUNCH: the labs-vagrant stuff will probably take some time, working on some diamond stuff now. sorry [19:23:25] !log tools tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again [19:23:27] Logged the message, Master [19:24:21] scfc_de: can you add that as a ferm rule in puppet? [19:25:07] YuviPanda: I think Redis provides a way to allow only local connections, and that's the way I wanted to go. (ferm = trouble.) [19:26:11] andrewbogott: also, any reason to not get scfc_de root on the proxy instance? he has an NDA signed, and *.wmflabs.org is on tools-webproxy, which he has access to anyway [19:26:51] YuviPanda: That's fine, if it's useful. [19:27:00] alrighty then [19:31:14] !log project-proxy added scfc_de as projectadmin [19:31:16] Logged the message, Master [19:52:08] YuviPanda: No worries. We'll get it fixed at some point :) [20:13:19] Sorry everyone, I was stuck without a computer for a couple hours while my update decided to mess with me. Did I miss anything important? [20:18:34] Coren: nope [20:19:44] scfc_de: have you tested the redis patch on something? [20:21:18] YuviPanda: tools-proxy-test [20:21:24] scfc_de: cool [20:21:50] It is applied there at the moment, iptables are empty, so it shouldn't allow external access (better check again :-)). [20:22:08] scfc_de: :D [20:22:25] And it doesn't in comparison to tools-redis. 
*pooh* [20:30:34] !log mediahandler-tests deleting mediahandler-tests-trusty in 48 hours [20:30:36] Logged the message, Master [20:32:06] !log mediahandler-tests mediahandler-tests-mol, mediahandler-tests-static used for community evaluation, don't touch [20:32:08] Logged the message, Master [20:36:31] !log created experimental instances integration-zuul-merger and integration-zuul-server . Moved them to use local puppetmaster [20:36:32] created is not a valid project. [20:40:39] !log integration created experimental instances integration-zuul-merger and integration-zuul-server . Moved them to use local puppetmaster [20:40:41] Logged the message, Master [21:13:20] question about mediawiki-vagrant, previously i could `sudo -u www-data php foo/bar/baz.php` to run a maintenance script as the same user as the web application server, but in a newly created instance /etc/passwd has www-data as /usr/sbin/nologin [21:13:36] i have manually changed that to /sbin/bash for now so my scripts still work, but is there a preffered way? [21:23:27] !ping [21:23:27] !pong [21:25:48] ebernhardson: I *think* `sudo -u` should still work with a bogus shell in /etc/password, but I'll double check as soon as my new vm finished building. We definitely want to be able to run maint and jobs as www-data. [21:27:47] hi! the tool im currently working on disappeared somehow (404 - Not found). any clues? [21:29:15] a930913: People should be linking to toolserver.org, not wmflabs.org [21:29:36] my web app at tools.wmflabs.org/languagetool/ has an out of memory and doesn't react but I cannot restart of stop it, neither with "./webservice -tomcat restart" nor with jdel or qstop. how can I kill it? [21:30:00] bd808: it turns out sudo -u does work, for some reason the script had `sudo su www-data -c "php ....."` [21:30:15] damn you typos! [21:30:27] sanyi4: What's the URL? [21:30:55] scfc_de: http://tools.wmflabs.org/lonelylinks/ [21:32:23] dnaber: Odd. The job is in status "dr", that is running + deleting. [21:33:03] scfc_de: well, i have just called "./webservice -tomcat restart" and it's trying to restart since 5 minutes or so [21:33:36] scfc_de: actually trying to shut down first, i guess [21:33:52] sanyi4: You have no webservice running. You need to execute "webservice start" to do that. [21:34:27] sanyi4: is it a Java web app maybe? [21:35:34] dnaber: If you're using the standard webservice script, restart = stop + start (and your local script looks unchanged in that regard). [21:35:42] no, plain php. ill try webservice start. [21:37:51] scfc_de: tomcat can be strange, I sometimes need to kill (kill -9) it on my other server, too. can that be done in the cluster, too? [21:38:22] scfc_de: now "Internal error" [21:43:12] dnaber: You should be able to ssh from tools-login to tools-webgrid-tomcat and there you can directly interact with the processes. [21:44:55] sanyi4: The webserver errors are written to error.log in the tool's home directory. Yours has warnings and one fatal error: "PHP Fatal error: Call to a member function set_charset() on a non-object in /data/project/lonelylinks/public_html/lonelylinks.php on line 35". So you need to fix this to have it work. [21:48:40] scfc_de: thanks, I killed the process and i have restarted it again now. [21:49:57] scfc_de: ok, thanks. 
i apologize for my unaquaintance, i may have some more silly questions in the future :) [21:50:18] sanyi4: :) the only silly questions are the ones unasked :) [21:52:14] bd808: hmm, so our git::clone module doesn't support tags, but supports branches. Can I just make a branch instead? [21:52:20] no patches on top of that, I promise :) [21:54:23] YuviPanda: Sure, or fix git::clone :) [21:54:42] bd808: right, but then I'll have to deal with 'no branch and tag at the same time please' thing :) [21:55:37] Is there a difference between checking out a branch and a tag? [21:57:58] bd808: I think -b in clone is for branch? [21:58:49] YuviPanda: Ah yes. I was thinking of checkout not clone [21:58:56] bd808: right, no git checkout [21:58:59] You can't clone a tag [21:59:06] bd808: I could do an exec, but I think _joe_ will beat me up :) [21:59:21] I could wrap that around with a git::checkout, though [21:59:22] hmm [21:59:28] would be useful for pinning, no doubt [21:59:35] let me do that [22:10:20] I have to run a task which needs huge amount of resource specially cpu, can someone help me. the grid isn't enough for this task [22:11:37] bd808: hmm, I wonder what checkout should do if the working tree is dirty? [22:11:47] bd808: just not checkout? error? [22:11:53] everyone is watching soccer, the Netherlands wins, why wasting time :D [22:11:59] Amir1: hey! can you tell me why the grid isn't enough? [22:12:00] YuviPanda: cry? [22:12:10] bd808: I thought that was a given with puppet :) [22:12:39] YuviPanda: hi, because It needs cpu and when I send it to grid it doesn't work at all, stops [22:12:52] It doesn't return memory error by the way [22:13:08] Amir1: are you sure that's because the grid isn't giving it enough CPU? You can log into the grid nodes (ssh tools-exec-) to see how much CPU it is using [22:13:17] (I usually determine a rather high amount of memory for them) [22:13:33] let me check and tell [22:13:37] Amir1: ok! [22:15:09] YuviPanda: I think checkout should probably die if the tree is dirty. You could stash and pop but that's scarry. [22:15:34] bd808: yeah, merge conflicts and all. question is if puppet run itself should fail, or if I should just do a notice [22:15:46] I'm thinking of making the puppet run fail [22:15:50] YuviPanda: this is the all I can get http://paste.ubuntu.com/7772357/ [22:15:53] but that'll mess with people who are developing on the labsvagrant [22:16:17] YuviPanda: If you don't yell no one will notice. Even then they may not :( [22:16:35] Amir1: right, so from tools-login, if you do 'ssh tools-exec-06', and then run 'top' while your tool is running. that'll help you determine if CPU is actually the problem [22:16:46] bd808: true, true. [22:18:46] YuviPanda: oh okay [22:18:50] tools.dexbot@tools-exec-06:~$ top | grep 24548 [22:18:53] 24548 tools.de 20 0 501m 433m 3628 R 72 5.4 14:34.84 python2.7 [22:19:06] 72 is the CPU percentage [22:19:11] ah, hmm, right. [22:19:21] Amir1: so it *is* getting the CPU, no? [22:19:34] it's taking up 72% of the CPU, so that'll be the same wether it runs in the grid or elsewhere [22:21:29] bd808: actually, I'm going to bail on it now. Just realized I'd have to do things like maintain the user ownership of files, etc. [22:21:34] needs more thought, not a 4AM one [22:22:01] 28661 tools.ph 20 0 98912 66m 4160 R 88 0.8 0:14.65 tesseract [22:22:01] the last line is not what I am running [22:22:19] YuviPanda: No worries. I see you gave up on local time sleep hours again. 
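On the git::clone/tag pinning exchange above: pinning a working copy to a tag needs a checkout step after cloning, and the view above is that the checkout should refuse to touch a dirty working tree rather than stash anything. Below is a sketch of just that behaviour, not the actual puppet git::clone/checkout code; the repository path and tag name are placeholders.

    #!/usr/bin/env python
    # Sketch of "check out a pinned tag, but die if the working tree is dirty"
    # as discussed above. Not the actual puppet code; REPO and TAG are placeholders.
    import subprocess
    import sys

    REPO = "/srv/example-checkout"   # hypothetical working copy
    TAG = "v1.2.3"                   # hypothetical pinned tag

    def git(*args):
        out = subprocess.check_output(["git"] + list(args), cwd=REPO)
        return out.decode("utf-8", "replace").strip()

    if git("status", "--porcelain"):
        sys.exit("working tree at %s is dirty; refusing to check out %s" % (REPO, TAG))

    if git("describe", "--tags", "--always") != TAG:
        git("fetch", "--tags", "origin")
        git("checkout", TAG)         # leaves a detached HEAD at the pinned tag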
[22:22:21] Amir1: there are 4 cores :) [22:22:33] bd808: yeah, but I do go to sleep before it is light outside. [22:23:21] YuviPanda: so what can I do [22:23:33] Amir1: I don't think CPU is the problem [22:23:43] if CPU is the problem, it'll be at 100% constantly [22:23:51] so the problem probably lies elsewhere [22:24:18] Amir1: so maybe whatever you are running *is* taking a looong time? [22:24:28] when I run it in command it acts okay (but taking very long time) [22:25:00] YuviPanda: the task reports the progress, it's taking a long time (~a day) [22:25:04] Amir1: right, have you given it the equal amount of time (+ a bit more, perhaps?) on the grid? [22:25:17] Amir1: ah, how long does it take when you run it from the commandline? and where did you run it? [22:25:18] but it returns no progress at all [22:25:40] in that case, I suspect some strange logic error in the program itself? [22:25:46] if it is not returning any status at all... [22:25:57] the code works in four phases: the first phase will be done in two or three hours) [22:26:31] but when I run it in grid, after 14 hours it doesn't get to start even first phase [22:26:52] note: It didn't start it [22:27:04] Amir1: that definitely sounds like a bug in your code [22:27:31] Amir1: wait, what do you mean by 'It did not start it'? what did not start what? [22:27:47] I think I didn't explain it correctly [22:28:20] the bot needs to work in 4 phase, let put other phases aside and say it works just in first phase [22:28:32] the first phase takes two or three hours [22:28:45] and during the action it constantly print a number [22:28:54] like 324500 (every minute) [22:29:04] I do it to be sure it's working and making progess [22:29:24] but when I put this in grid (run it via jsub) [22:29:47] the wdumps.err or wdumps.out shows nothing at all [22:29:55] even after 14 hours [22:30:13] right. so your tool is 1. taking up 70% CPU, and 2. not printing output o stdout that it should be printing every minute [22:30:24] yes [22:30:25] as I said, that *definitely*, to me, sounds like a bug in your code. [22:31:15] YuviPanda: My issue is I run the exact same code (that code) directly in my tool it works [22:31:30] Amir1: by 'directly in my tool' you mean from tools-login [22:31:43] tools-dev [22:32:27] Instead of command jusb -once -N something -mem 1g python X [22:32:32] I command python X [22:32:35] right. [22:32:46] the exact same code [22:32:56] Amir1: If you run a Python script, why are there Perl errors in wdumps10.*? [22:33:29] scfc_de: this error is returned when I command it in Linux I don't why I just ignore it [22:33:57] when I use putty this error doesn't being returned [22:34:27] can you tell me the name of the tool? [22:35:07] "wdumps10" [22:35:12] my tool is "dexbot" [22:35:20] the first one is the task [22:37:31] I have to say it's reading a huge dump [22:37:38] about 10 GB [22:38:00] Amir1: are you reading it all into memory at once? :) [22:38:33] Amir1: is your tool still running? [22:38:38] Amir1: if not, can you start it for me? [22:38:39] YuviPanda: no. pywikibot has a script to generate entries of dumps [22:38:50] ah, right. 
I think that uses a generator [22:39:10] YuviPanda: in grid yes [22:39:38] the number is 2212711 [22:41:49] YuviPanda: I can this script in dump of every project, It works even on Italian Wikipedia but not English Wikipedia [22:42:07] *I can run [22:42:22] Everything is okay when the dump is not for English WP [22:42:24] Coren, andrewbogott, request blessing to merge changes on openstack system users [22:43:01] mutante: link? [22:43:10] https://gerrit.wikimedia.org/r/#/c/138002/6/manifests/openstack.pp [22:43:30] (i'll make another one to unify the quoting :) [22:44:02] t [22:44:34] it's the same "replace generic::systemuser with normal 'user'/'group' puppet types that we did on other services [22:45:01] Amir1: so, I just ran strace on it. It is running fine, but is doing a lot of file system calls to NFS, which return slowly [22:45:35] Amir1: when you ran the script from -dev, was it on enwiki? [22:45:49] YuviPanda: yes [22:46:57] scfc_de: is it possible that NFS access on exec nodes can be slower than from -dev? [22:47:20] mutante: I'm pretty sure that whole openstack::project-storage section can be pruned, but I'll make a note to do that later... [22:48:41] andrewbogott: ok, thanks, we just want to get rid of like the last 3 occurences of this [22:48:56] then we can make the change that gets rid of the actual generic::systemuser class [22:49:02] all others were noops [22:49:20] thanks :) [22:50:07] Amir1: did you kill your job? [22:50:15] err, I mean, process? [22:51:34] So the number of HDD calls is the problem, the HDD is not fast. Is there a way to fasten it up? e.g. moving the files to the sever or someplace else [22:51:34] I think during the task this code makes about 50M calls [22:51:51] Amir1: right, so if you are reading the dumps, they aren't HDD calls [22:51:54] Amir1: they're actually NFS calls [22:52:09] Amir1: which are over the network, to another machine, which actually has the massive storage space [22:52:51] Amir1: so unfortunately there isn't really a way of making that faster from a hardware perspective. [22:54:23] Amir1: only solution is to reduce the number of calls, mostly by reading larger amounts of data. I'm unsure how pywikipediabot does it [22:55:59] YuviPanda: thank you. [22:56:10] Amir1: yw! glad to be able to help :) [22:56:32] Amir1: processing enwiki dumps, especially over NFS, is always going to be a bit painful, requiring a good amount of care to reduce reads. [22:56:49] let me see what I can do. I'm pretty much sure that PWB doesn't support big calls, but I'll implement it [22:57:09] :D [22:57:21] Amir1: if it is reading one page at a time, making it be able to read X pages at a time might help [22:57:31] Amir1: or incresing the buffer size it requests to read [22:58:25] yes [22:58:49] Amir1: also, if you're storing intermediary files in /data/project (which is all on NFS), you will again take a performance hit, since that is on NFS again. [22:58:55] so that'll also need to be considered [22:58:59] YuviPanda: I don't think there is a significant difference in NFS performance between instances. But I don't think either that the number of read calls are the limiting factor as the data needs to be uncompressed as well. [22:59:52] But working on the dumps isn't a unique challenge -- there are others who have done so and are doing it at the moment, so I would just pick their brains for advice :-). [23:00:09] right [23:00:13] do you know what this does in labs? [23:00:15] class nfs::home::wikipedia { [23:00:20] it says "historical".. 
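On the dump-reading advice above (fewer, larger reads over NFS): a minimal sketch of chunked reading of a bzip2-compressed dump, so that each NFS round trip fetches megabytes rather than one page's worth of data. The dump path and the 16 MB chunk size are assumptions for illustration, and counting "<page>" is only a stand-in for real parsing.

    #!/usr/bin/env python
    # Sketch: read a bz2-compressed dump over NFS in large chunks so each read is
    # one big NFS request instead of many small ones. Path and chunk size are
    # assumptions; counting "<page>" is a stand-in for real parsing (and can miss
    # a tag split across a chunk boundary).
    import bz2

    DUMP = "/public/dumps/public/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
    CHUNK = 16 * 1024 * 1024         # 16 MB per read

    decomp = bz2.BZ2Decompressor()
    pages = 0
    with open(DUMP, "rb") as f:
        while True:
            data = f.read(CHUNK)     # one large sequential read per iteration
            if not data:
                break
            pages += decomp.decompress(data).count(b"<page>")
    print("pages seen: %d" % pages)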
yea [23:00:25] because that was like on fenari [23:00:33] but it also has a labs section [23:00:44] scfc_de: but in this case, either way, the process seems to be doing a *lot* of IO. strace was nothing but IO [23:02:08] scfc_de: btw, am wondering if we should add more exec nodes :) [23:02:14] Well, you probably don't see the uncompression in strace :-). I have no idea what run time one would expect to parse the complete enwiki dump or how it could be optimized. [23:02:51] right [23:02:55] scfc_de: I can do it in about 4-6 hours depending on what Im looking for [23:03:29] scfc_de: most nodes are at about or over 100% load [23:03:36] YuviPanda: Looking at the Tools status page (http://tools.wmflabs.org/?status), we seem to be indeed CPU-bound now with much free memory. [23:03:53] scfc_de: right. shall I add one or do you want to do it? [23:04:19] scfc_de: actually, I could perhaps start the process, go to sleep, wake up to find that puppet run has finished, and then do the additional things :) [23:04:49] YuviPanda: Make it so, because I'm too tired to do something complicated as well :-). [23:04:53] scfc_de: :D [23:04:54] scfc_de: ok [23:05:10] scfc_de: I'm going to add two nodes. We don't actually have puppet code for an lvm /tmp though, so can't do that :( [23:06:39] That would be ... (who's the Greek with the stone again?) Unfortunately, there are lots of directories processes can clog by accident. /tmp is just the "easiest" target. [23:06:50] !log tools created tools-exec-11 [23:06:52] Logged the message, Master [23:07:13] scfc_de: Sisyphus? [23:07:16] !log tools created tools-exec-12 [23:07:18] Logged the message, Master [23:07:23] That's the guy! [23:07:36] scfc_de: true, but outside of /tmp and /var/log... [23:07:49] (Though I wouldn't guarantee for the order of i and y.) [23:07:55] scfc_de: bah, created -12 with trusty by accident [23:08:00] http://en.wiktionary.org/wiki/Sisyphusarbeit [23:08:40] Indeed like the alphabet: i < y. Reminds me of the patron of my school: Sibylla. [23:08:41] !log tools created tools-exec-12 as trusty by accident, will keep on standby for testing [23:08:43] Logged the message, Master [23:09:08] !log tools created tools-exec-13 with precise [23:09:10] Logged the message, Master [23:10:21] For example Redis & Co. write to /var/lib/something. On Fedora, /tmp is a ramdisk, so no problems here :-). [23:12:01] I'm customizing the upstream library to reduce the work [23:12:07] (and calls) [23:12:45] Amir1: we're also adding more nodes to ease CPU load everywhere as well [23:13:04] YuviPanda: Thank you :) [23:13:21] Amir1: yw. Thanks for making us look there :) [23:13:37] :) [23:14:22] !log tools applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13 [23:14:24] Logged the message, Master [23:16:04] scfc_de: I think the eventual trusty migration will be a good time to fix the /tmp issue [23:16:07] (and maybe other ones too) [23:16:52] YuviPanda: Sounds like a plan. [23:18:24] scfc_de: cool. before that we should run a trusty queue for people to test their tools as well, I suppose [23:19:15] scfc_de: runnign puppet on the trusty node, let's see if things succeed :) [23:19:34] 3Wikimedia Labs / 3Infrastructure: source group field is confusing - 10https://bugzilla.wikimedia.org/67759 (10Chase) 3NEW p:3Unprio s:3normal a:3None This is a bug abogott asked me to log. He was helping me make some security group rules and it was bouncing me for specifying the source group (though... 
[23:20:34] maybe someone knows the answer: i use a result of an sql query in an other query, and if there is an apostrophe in the result value, its messing up the query (i hope i was understandable) -- what to do? [23:21:55] sanyi4: heya! you shouldn't be using string concatenation (the . operator) for building SQL queries [23:22:18] sanyi4: if you're using PHP, please use http://us2.php.net/manual/en/book.pdo.php or something similar, with its 'prepared statements' [23:22:42] sanyi4: string concating sql queries also has big security issues (https://en.wikipedia.org/wiki/SQL_injection) [23:24:49] gifti: are you using your node *at all*? I haven't seen any jobs running on it... [23:25:08] i do [23:25:16] gifti: hmm, nothing on http://tools.wmflabs.org/?status [23:29:50] hm, something changed with qacct? [23:32:33] gifti: no, there isn't actually anything running there. [23:32:36] top shows nothing [23:33:22] not atm, that's right [23:33:30] I see. [23:33:37] do you still need a node all to yourself? [23:34:03] the TCL issues were successfully resolved, IIRC [23:34:33] that has nothing to do with tcl [23:34:50] right, but do you still need your node? [23:35:36] i have a biweekly array job that is/was not possible with the usual grid config [23:35:51] hmm, I see [23:35:57] I'll let Coren handle it then. [23:36:01] thanks for clarifying! [23:36:06] np [23:43:59] alright, I'm off now [23:44:00] * YuviPanda waves [23:46:01] Good night! [23:56:20] anyone have an idea what's wrong with https://tools.wmflabs.org/commonshelper/ ? (page doesn't load) [23:56:54] good night for today! many thanks, guys! [23:57:20] ....ah, i see someone already notified magnus at https://en.wikipedia.org/wiki/User_talk:Magnus_Manske#CommonsHelper
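On the apostrophe/SQL question near the end: the fix is parameterized (prepared) statements rather than string concatenation, via PDO in PHP as suggested above. Since most other examples in this log are Python, here is the same idea sketched with pymysql; the connection details and the query are placeholders, and any DB-API driver behaves the same way.

    #!/usr/bin/env python
    # Sketch of the parameterized-query advice above, in Python (the channel
    # suggests PDO for PHP). Connection details and the query are placeholders.
    import os
    import pymysql

    conn = pymysql.connect(
        host="enwiki.labsdb",        # placeholder replica host
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )
    try:
        cur = conn.cursor()
        title = "L'Aquila"           # a value containing an apostrophe
        # The driver quotes the %s parameters itself, so the apostrophe cannot
        # break the statement (or be used for SQL injection).
        cur.execute(
            "SELECT page_id FROM page"
            " WHERE page_namespace = %s AND page_title = %s",
            (0, title.replace(" ", "_")),
        )
        print(cur.fetchall())
    finally:
        conn.close()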