[00:12:56] (03PS11) 10BryanDavis: Add initial database schema [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545) [00:18:52] (03CR) 10BryanDavis: [V: 031] "Applies cleanly on mediawiki-vagrant managed database on striker-deploy03.striker.eqiad.wmflabs." [labs/striker/deploy] - 10https://gerrit.wikimedia.org/r/305046 (https://phabricator.wikimedia.org/T142545) (owner: 10BryanDavis) [00:39:57] !log deployment-prep deployment-fluorine is now deployment-fluorine02 running jessie with the old precise packages shoehorned in [00:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [00:44:13] hello labs folks, over the past week or two I've had some jobs start that mysteriously don't finish... and when the cron tries to run the next one I get "there is a job named 'my job' already active' [00:44:31] I have to manually qdel for them to get back in order and run regularly [00:45:37] anyone familiar with this issue? yuvipanda? [00:46:54] I am wondering maybe if job A doesn't complete by the time job A is submitted again via cron, it becomes locked [00:49:18] PROBLEM - SSH on tools-mail is CRITICAL: Server answer [00:49:25] I think that is a reasonable suspicion because I have 3 jobs in particular that keep locking up, and two run every 10 minutes, the other every 5, and all my other bot tasks run over an hour apart [00:50:01] creating phab task! [00:51:55] musikanimal: cron doesn't resubmit jobs I think [00:52:17] Depends what the jsub options are [00:52:46] yeah and actually I recall my weekly job locked up too [00:53:11] `jsub -l release=trusty -mem 350m -once ~/perm_clerk.sh` [00:53:27] all of mine look similar to that, only a few request extra memory [00:55:00] I'm not sure if I even need to have it run on trusty anymore [00:55:36] They haven't switched over the default yet. [00:55:53] If you remove the option, it'll run on Precise. [00:55:58] Until soon. 
[00:56:02] ok [00:56:22] well, the issue of the jobs locking up is pretty consistent, but not predictable [00:56:42] my 5-minute job just locked up again, right after I qdel'd it from the last time it locked up [01:00:58] 10PAWS: Creating directories starting with a '.' makes the Jupyter Web Interface very confused - https://phabricator.wikimedia.org/T143374#2566445 (10yuvipanda) [01:09:53] 06Labs, 10Tool-Labs: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2566461 (10MusikAnimal) [01:10:42] ^ there it is [01:11:44] I suspect this is happening to others [03:12:38] musikanimal: do you have a job that is currently stuck? [03:13:47] not currently [03:30:04] musikanimal: can you submit one, and I can see what happens? [05:12:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [05:44:25] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [05:50:37] madhuvishy: there's a cron that submits various jobs and various intervals [05:50:53] or, a cron job for each job, that is [05:52:50] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:53:05] they are not getting stuck right now [05:53:29] 06Labs, 10Tool-Labs: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2566727 (10MusikAnimal) [05:54:56] PROBLEM - Puppet staleness on tools-exec-1211 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [05:58:59] PROBLEM - Puppet staleness on tools-exec-1213 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [05:59:11] PROBLEM - Puppet staleness on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [06:09:03] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string 'Magnus' not found on 
'http://tools.wmflabs.org:80/' - 355 bytes in 0.004 second response time [06:14:04] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.038 second response time [06:50:59] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [07:30:57] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [11:24:53] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2566461 (10valhallasw) Does this consistently occur on the same hosts? A list of jobs/hosts/times would help a lot. I'll try to work from 9897506 to see if there's anything obviously wrong on tools-... [11:39:30] Question: Is there a possibility to execute jsub from within the grid? [11:41:59] Jogo-obb: unfortunately not [11:42:12] Ok [11:47:14] 06Labs, 10Tool-Labs: webservice generic: unrecognized arguments: --extra-args - https://phabricator.wikimedia.org/T143403#2567172 (10valhallasw) [12:16:11] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567227 (10fgiunchedi) [13:27:42] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567365 (10chasemp) Ah, yeah sorry about this. @yuvipanda enabled this a few days ago as we have tracked down the primarily symptom of {T141673} to io going stale (freezing) which... [13:35:27] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567399 (10fgiunchedi) indeed it might be hard to track via the uuids, alternatively we could purge instance directories not updated for some period of time, e.g. 
4/5 weeks [14:06:20] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567498 (10chasemp) I would like to do better than having to look up UUID's every time honestly but it looks like KVM does not support the domhostname argument for virsh > error:... [14:08:56] 06Labs, 10Tool-Labs, 13Patch-For-Review: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2567507 (10chasemp) Consider adding: * /data/project/.system/gridengine/spool/qmaster/jobseqnum * job count by host [14:26:03] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567532 (10fgiunchedi) yeah I agree looking up by uuid isn't great, I'm fine with 4w staleness. It looks like about ~6GB per day on average so 30d is 200G which is fine. I won't be... [14:52:01] !log tools reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 [14:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [15:19:48] andrebogott: Is it possible for us to create a new extension with large memory. We would like to use it to test our DRMF search engine. Currently it seems as though the medium memory instances are not large enough. We are currently using the drmf2016 instance for search, but it is only medium memory, we would like to upgrade it to large memory, but @pysikerwelt would prefer not to delete drmf2016 instance becuase he is using it [15:20:05] (not new extension) I meant new instance. 
[15:20:09] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2567714 (10MusikAnimal) This one is currently stuck: `9948315 0.30056 copypatrol tools.musikb r 08/19/2016 06:45:10 task@tools-exec-1401.eqiad.wmf 1` It locked up around 6:50, August 19 UTC [15:20:39] Howie, andrew is away at the moment [15:20:47] ah [15:21:06] Krenair: thanks [15:21:37] you misspelt his name anyway [15:21:43] if you're a project admin it should be easy for you to just make a new instance [15:21:58] valhallasw`cloud tom29739: job 9948315 is currently locked up, should this help with investigating [15:22:03] I am a project admin, but I think we ran out of resources. [15:22:24] @Krenair: oh yes, I did misspell his name. :? [15:22:43] ah, you want a quota raised? [15:23:06] Krenair: yes we want to create 1 large memory instance [15:23:10] https://phabricator.wikimedia.org/T140904 [15:23:38] musikanimal, how did the job get triggered? [15:23:48] cron job [15:24:02] */5 * * * * jsub -l release=trusty -once ~/copypatrol_wikiprojects.sh >/dev/null 2>&1 [15:24:16] all of my jobs get ran this way [15:24:34] @Krenair: thanks! [15:25:03] musikanimal: what do you mean by locked up [15:25:07] can you be more specific on symptoms [15:25:11] https://phabricator.wikimedia.org/T143375 [15:25:37] the job never actually started, and just hangs [15:26:01] 9948315 0.30057 copypatrol tools.musikb r 08/19/2016 06:45:10 task@tools-exec-1401.eqiad.wmf 1 [15:26:16] musikanimal: I see it running but no io from copypatrol_wikiprojects.rb [15:26:24] right [15:26:49] musikanimal, you could ssh to that exec host and try debugging it [15:27:51] sure, I can try! [15:28:21] [pid 15733] restart_syscall(<... resuming interrupted call ... 
[15:28:30] musikanimal: from what I can tell this is expected idle output from a ruby proc [15:28:40] so it's being run but it outputs nothing and the VM it's on seems ok [15:28:54] I don't think it's locked up in any holistic sense on the grid, but maybe just isn't working right [15:29:35] musikanimal, have you tried jsubing it manually? [15:29:43] Or running it on the bastion [15:30:02] this is a very new thing, since beginning of August, so I'm pretty sure it's not code-related [15:30:09] the bot has run smoothly for over a year [15:31:07] Weird. [15:31:27] I have not tried jsubing manually, but most of the time it works fine [15:31:45] musikanimal: thanks, let me take a look [15:31:54] whether or not it will "lock up" I'm not able to predict [15:31:57] thanks! [15:31:57] oh, chasemp is already checking out what's happening [15:32:07] * tom29739 goes and searches phab for stuff that happened on tools around the beginning of August [15:32:10] valhallasw`cloud: please have a look if you would I'm not sure what's wrong [15:32:17] ok! [15:33:19] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2566461 (10chasemp) strace -f -p shows `[pid 15733] restart_syscall(<... resuming interrupted call ...` and the job seems to have been run correctly and the VM it is on is running other adhoc jobs a... [15:34:29] musikanimal: does this use the system default ruby? [15:34:52] I am guessing not [15:34:55] no [15:35:03] hrm [15:35:04] it uses Ruby 2.2.1 [15:35:09] set by rbenv [15:35:23] hrm [15:36:05] ok, symbols seem to be there [15:36:55] chasemp: I just checked the bot's activity and I think it was August 8 or 9 when this first happened [15:37:12] err tom29739 , since you were searching for phabs [15:37:26] OK. [15:37:37] !
[15:37:37] hello :) [15:37:49] l-wx------ 1 tools.musikbot tools.musikbot 64 Aug 19 15:37 1 -> /data/project/musikbot/copypatrol_wikiprojects.out (deleted) [15:37:50] l-wx------ 1 tools.musikbot tools.musikbot 64 Aug 19 15:37 2 -> /data/project/musikbot/copypatrol_wikiprojects.err (deleted) [15:38:10] the file descriptors are pointing to files that don't exist --> that suggests an nfs issue [15:38:17] ahh [15:38:21] chasemp: did we change any nfs settings? [15:38:53] There were some NFS setting changes some time ago [15:39:15] not anything around that time that I can think of / find [15:39:17] Don't think it was as recent as the 8/9th of August though [15:39:34] musikanimal: do you have a rough estimate of when the job got stuck? [15:39:40] valhallasw`cloud: but also if resubmitted nightly via cron I would think it wouldn't persist? [15:39:49] this time it happened around 6:50, August 19 UTC [15:40:24] chasemp: if via qmod -rj, probably not, but I think musikanimal just jsubs it via cron [15:40:25] musikanimal: does it run successfully on the bastion? [15:40:43] it could be an issue w/ that host if it's long lived there yeah [15:40:52] You could try touch-ing those files [15:41:31] valhallasw`cloud: how did you find that out of curiosity? [15:41:50] chasemp: sudo ls -ld /proc//fd [15:41:54] eh, just ls -l I think [15:41:57] right gotcha [15:42:06] I'm really stupid when it comes to this stuff... so bastion you mean just try jsub directly on tools-bastion? as opposed to via cron on the trusty release? [15:42:29] musikanimal, no, directly on the bastion [15:42:32] so there's nothing obvious in syslog/kern.log/etc [15:42:44] nothing that points to nfs breaking. Bah.
[15:42:49] nothing happened around that time nfs cleanup wise either [15:43:11] musikanimal, so the command that would be run with jsub you run directly on the bastion [15:43:22] about an hour earlier nslcd did have issues [15:43:48] and there's a pretty happy puppet run around 07:00 UTC [15:44:01] so whatever it was, it can't have been a long issue [15:44:33] tom29739: right, so not the same thing as tools-bastion? [15:44:58] musikanimal, I meant directly on the tools bastion [15:45:12] ok [15:45:46] musikanimal: I think it's your log rotation [15:46:21] tail -c 100000 $logfile > temp.$$; mv temp.$$ $logfile [15:46:22] it's rotating out from underneath sge? [15:46:46] Right, and sge doesn't close/reopen the output file [15:48:08] I think they lock at different times than when the rotation happens [15:48:09] musikanimal: you can either use truncate (which will empty the file without SGE getting confused), or logrotate after the job finished [15:48:12] *lock up [15:48:23] yes, they lock up at rotated + some time until the buffers fill up [15:48:55] truncate gets rid of the end of the output, I want the beginning of it to be chopped off [15:49:10] anyway the rotation has been there forever, nothing new [15:50:22] musikanimal: disable it for a while and see if the issue persists? [15:50:47] it could very well be that it happens now because tasks take longer, or sge scheduling takes a bit longer, etc [15:50:54] the current example, copypatrol, locked up at 6:50, the log rotation happened at 6:00 and there were numerous successful runs after that [15:51:03] and the next rotation didn't happen until 8:00 [15:51:39] valhallasw`cloud: yeah that was my theory.... I've noticed Tool Labs jobs take longer to start [15:51:44] musikanimal: in any case, please turn it off, and see if that changes anything. 
If it doesn't, at least we will have logging working once we can look at it [15:51:51] sure [15:52:09] musikanimal: because now stderr and stdout are locked up, so there's no way to do anything with gdb [15:52:35] note also copypatrol is ran every 5 minutes, so I did have a theory that what causes this is when a new job is submitted when the old one hasn't finished [15:52:48] you're not using -once/ [15:53:13] jsub -l release=trusty -once ~/copypatrol_wikiprojects.sh [15:53:18] ^ is what gets ran [15:53:35] yeah, so that shouldn't cause any issues, because there will always only be a single job [15:53:54] yeah, and if I remember correctly my weekly job also locked up [15:54:41] what is that patch that yuvi uploaded? [15:54:52] https://gerrit.wikimedia.org/r/#/c/305616/ [15:55:03] it references the bug [15:55:04] wrong task # I think [15:55:21] ah, I was afraid of that [15:55:35] (03PS2) 10EdouardHue: Importing code [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/303933 [15:55:59] (03PS3) 10EdouardHue: Importing code [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/303933 (https://phabricator.wikimedia.org/T142570) [15:56:01] valhallasw`cloud: so should I qdel the job that's currently locked? or were you still diagnosing stuff [15:56:10] still trying some stuff [15:56:22] ok thanks [16:00:48] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2567885 (10valhallasw) Using some horrible gdb magic: First, fix up stdout and stderr to a different file (via https://gist.github.com/zaius/782263 ) ``` valhallasw@tools-exec-1401:~$ sudo su tools... [16:01:00] musikanimal: I finally got a ruby backtrace! :-) [16:01:06] oh nice! [16:01:13] https://phabricator.wikimedia.org/T143375#2567885 [16:01:14] * valhallasw`cloud prods wikibugs [16:02:09] musikanimal: I can also try to see if there's any network traffic... 
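The log-rotation point raised at 15:45-15:48 can be illustrated with a minimal sketch. It assumes the job's output file was opened in append mode; how SGE actually opens it is not confirmed in this log. `rotate_by_rename` mimics the quoted `tail -c ... > temp.$$; mv temp.$$ $logfile` pattern, which gives the path a new inode while the writer keeps the old, now unlinked one (the `(deleted)` entries seen under `/proc`). `rotate_in_place` is a hypothetical alternative that keeps the tail of the log but rewrites the same inode. All function names here are illustrative, and the in-place variant is not atomic: output written between the read and the truncate can be lost.

```python
import os
import shutil
import tempfile


def read_tail(path, keep_bytes):
    """Return the last keep_bytes bytes of the file at path."""
    with open(path, "rb") as src:
        src.seek(0, os.SEEK_END)
        size = src.tell()
        src.seek(max(0, size - keep_bytes))
        return src.read()


def rotate_by_rename(path, keep_bytes):
    """Mimic `tail -c N log > tmp; mv tmp log`: the path ends up on a NEW inode."""
    tail = read_tail(path, keep_bytes)
    tmp = path + ".tmp"
    with open(tmp, "wb") as dst:
        dst.write(tail)
    os.replace(tmp, path)


def rotate_in_place(path, keep_bytes):
    """Keep the last N bytes but overwrite the SAME inode (not atomic)."""
    tail = read_tail(path, keep_bytes)
    with open(path, "r+b") as f:
        f.write(tail)   # "r+b" starts at offset 0
        f.truncate()    # cut the file right after the rewritten tail


def same_file(fd, path):
    """True if the open descriptor and the path refer to the same inode."""
    a, b = os.fstat(fd), os.stat(path)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    workdir = tempfile.mkdtemp()
    try:
        log = os.path.join(workdir, "job.out")
        # Stand-in for the long-running job's stdout (append mode is an assumption).
        writer = os.open(log, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        os.write(writer, b"x" * 1000)

        rotate_in_place(log, 100)
        in_place_same = same_file(writer, log)
        os.write(writer, b"after")           # still lands in the visible log
        size_after_in_place = os.path.getsize(log)

        rotate_by_rename(log, 100)
        rename_same = same_file(writer, log)
        os.write(writer, b"lost")            # goes to the unlinked old inode
        result = {
            "in_place_same_file": in_place_same,
            "size_after_in_place": size_after_in_place,
            "rename_same_file": rename_same,
            "path_size_after_rename": os.path.getsize(log),
            "orphan_size": os.fstat(writer).st_size,
        }
        os.close(writer)
        return result
    finally:
        shutil.rmtree(workdir)
```

Calling `demo()` shows that after the rename-style rotation the writer's descriptor and the path no longer refer to the same file, so anything the job writes afterwards never appears in the log a human would look at.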
[16:02:11] so it broke when trying to log in I guess [16:03:15] even if it did break, the script should just die and the job would be finished, I think [16:03:58] valhallasw`cloud: nice [16:04:24] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tool Labs jobs locking up - https://phabricator.wikimedia.org/T143375#2567908 (10valhallasw) There's a single connection open: ``` valhallasw@tools-exec-1401:~$ sudo netstat -np | grep 15733 tcp 32 0 10.68.17.202:39575 208.80.154.224:443 CLOSE_WAI... [16:05:25] musikanimal: at the very least mediawiki-gateway-1.0.7 doesn't seem to give up after a given number of attempts [16:05:44] hmno, it should give up, if retry_count > @options[:retry_count] [16:05:52] yeah it should [16:07:30] one thing that I did change code-wise was make warnings log to STDOUT https://github.com/MusikAnimal/MusikBot/blob/master/musikbot.rb#L86 [16:07:40] that happened on August 4 [16:07:48] I don't think it's related, but thought I'd mention it [16:08:27] the thing is -- I'm not sure why it would be stuck in a sleep if the stdout/stderr is locked up [16:12:42] is it possible that call to the api generated stderr output and so stuck [16:14:57] I'd expect it to be stuck in some write(1, blah), not in a sleep() [16:15:26] I'm puzzling on that too, I'm floating dumb ideas at this point [16:20:33] maybe the maxlag response format changed? the retry_delay gets set based on that https://github.com/MusikAnimal/mediawiki-gateway/blob/master/lib/media_wiki/gateway.rb#L147 [16:22:59] do we have documentation for installing pypi packages in a virtual env ? [16:25:08] Betacommand: usually it is just pip install package once inside the venv [16:25:55] madhuvishy: Ive not worked with venv before [16:26:35] Betacommand: ah, have you created one? 
[16:26:43] madhuvishy: No [16:27:13] IE I was looking for the docs on the best way to do it on labs [16:28:05] because I want it to work on the webservice side too, not just local logins [16:28:23] Betacommand: okay, is this for a tool? [16:28:31] madhuvishy: yes [16:29:45] Betacommand: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_2_.28uwsgi.29 [16:29:54] have some info if it's on Grid Engine [16:30:54] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web/Kubernetes#python_.28uwsgi_.2B_python3.4.29 for k8s [16:31:49] musikanimal: the only thing I can find with some more gdb prodding is that the sleep() seems to get a crazy long time, but I may just be misinterpreting the data type... [16:32:19] in either case you set up your directory structure like the docs say, and then create a virtualenv by doing, virtualenv , in the right path [16:32:23] I think it logs the time period it's going to sleep, but with stderr/stdout out of service.... [16:33:23] madhuvishy: I dont want to spool up a different web service though [16:33:44] Betacommand: do you have one currently? [16:33:59] madhuvishy: yes, I have the default [16:35:07] Betacommand: does it have a venv folder in ~/www/python? [16:35:51] madhuvishy: like I said, Ive done nothing requiring venv yet [16:36:23] * Betacommand is starting to think that just requesting a pip install would be easiest [16:36:37] Betacommand: what's the name of your tool? [16:36:48] madhuvishy: betacommand.dev [16:36:52] looking [16:37:17] sorry its tools.betacommand-dev [16:40:48] Betacommand: and right now you are just serving stuff in public_html? [16:41:12] where is the default webservice? [16:42:31] madhuvishy: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web [16:42:58] It depends what type of webservice. [16:45:13] Betacommand: yeah, do you have the webservice written in python somewhere? 
[16:45:23] once you do, can help you put it on a tool [16:45:35] madhuvishy: No, like I said I am using the default service [16:45:39] valhallasw`cloud: interesting... well thank you very much for looking into this! I'm going to start with going back to STDERR, as that change happened not long before the issue surfaced [16:45:48] I don't know it'd be related, but it's worth a try [16:46:17] madhuvishy: its been working fine since labs started. [16:46:25] maxlag errors would explain the irregularity, so maybe it's getting hung there, as you say [16:46:35] madhuvishy: http://tools.wmflabs.org/betacommand-dev/ [16:47:48] madhuvishy: here is an example of a working script: https://tools.wmflabs.org/betacommand-dev/cgi-bin/rationale_check.py?title=File%3AVH_Luke.png [16:51:24] 06Labs, 10Tool-Labs: pip package install request (tld) - https://phabricator.wikimedia.org/T143432#2568003 (10Betacommand) [16:52:08] Betacommand: The supported way to run python webservices is https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Python_2_.28uwsgi.29 [16:52:31] madhuvishy: I DONT WANT TO RUN A DIFFERENT WEB SERVICE [16:53:04] madhuvishy: using the default service works for me [16:53:37] spooling up a uwsgi for what Im doing violates KISS [16:56:34] Betacommand: be nice. [16:56:53] so you want to run a cgi-bin and use a virtualenv for it? [16:57:16] betacommand hi. running all your tools under one tool name is an antipattern that causes a lot of problems including a Single Point of Failure (you!) and not being able to run multiple types of webservices easily. so yes, the supported way to run python webserves is to spin up a uwsgi instance. [16:57:24] bd808: I was thinking that might have been easier, but I just requested a pip install instead [16:57:33] feel free to continue using CGIs, but then you are on your own. 
[16:57:34] I think that may be possible via a wrapper script that loads the venv [16:58:06] Betacommand also consider this your first official warning for screaming at people. Please don't do that again. [16:58:18] yuvipanda: I understand that. however my code is for cgi-bin and I dont have the extra time to convert [16:58:49] unfortunately we can't support cgi-bin based virtualenvs officially. if you ask nicely maybe someone might be help, but there's no official support for it. [16:58:53] good luck, and be nice. [16:59:16] yuvipanda: thats all that needed to be said. [17:01:05] yuvipanda: running everything under one account may not be labs "pattern" but for my sanities sake its how I started on the TS and have continued. [17:01:46] Betacommand, uwsgi *is* the default python service [17:02:18] betacommand indeed, and hence it doesn't have great support in labs. TS also instituted multi maintainer accounts due to problems with running everything under one account [17:03:05] tom29739: im using the lighttpd service [17:03:40] That's normally used with PHP, but I think it supports python. [17:04:11] tom29739: correct, which is why I have been using it since the start of tool-labs [17:04:20] I don't see why you want to go to all that time and trouble just to not use uswgi. [17:04:57] tom29739: because I would bet that it breaks my code [17:06:09] Betacommand, this is the guidance for a cgi-bin folder on lighttpd: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Apache-like_cgi-bin_directory [17:06:18] tom29739: keep in mind when I started coding that wasn't an option, so I used thee default and haven't changed, because it works for me. [17:07:12] tom29739: I know that. Ive been running my tools that way for years [17:07:34] Oh. [17:07:46] There's an easy way to use that with a virtualenv. [17:08:29] Instead of referencing the system python in the cgi.assign in lighttpd, use the virtualenv python instead. 
[17:14:37] Betacommand: believe me, uwsgi is much much faster since you only have to load the venv once [17:14:59] venv is sloooooowwww on NFS trusty servers [17:15:43] 06Labs, 10Tool-Labs: pip package install request (tld) - https://phabricator.wikimedia.org/T143432#2568120 (10valhallasw) 05Open>03declined This package is not packaged by debian, and we do not install python packages globally due to security concerns. Please use a virtualenv or `pip install --user` instead. [17:15:53] Betacommand: re: cgi, you can set #!/path/to/venv/bin/python as your interpreter [17:16:10] and that should just work, except for the slowness of nfs [17:16:19] https://phabricator.wikimedia.org/T136712 [17:16:24] Betacommand: ^ [17:16:26] if you use virtualenv --system-site-packages, most of that should not be an issue [17:17:01] (but in general python is going to be slow via cgi) [17:21:12] valhallasw`cloud: I just tried the --user flag and got a message about pip not being found [17:24:11] Betacommand: hrm. In that case, a venv is the way to go [17:24:39] virtualenv venv --system-site-packages; venv/bin/pip install tld; venv/bin/python [17:25:13] or python setup.py --user, but you might encounter a lot of dependency issues [17:31:11] Lighttpd is slow anyway because it relies on NFS and reloads the files each page load [18:32:56] 06Labs, 10Tool-Labs: DNS resolution sometimes fails on tools-bastion-03 - https://phabricator.wikimedia.org/T143194#2568366 (10valhallasw) Which grid job number / host did this happen with?
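The shebang suggestion from 17:15:53 can be sketched as a tiny CGI script. This is only an illustration: the interpreter path in the first line is a hypothetical placeholder for wherever the virtualenv actually lives, and `in_virtualenv` and `build_response` are made-up helper names. The script reports which interpreter served the request, which is a quick way to check that the venv python (and therefore its installed packages) is the one being used.

```python
#!/data/project/mytool/venv/bin/python
# The shebang path above is hypothetical; point it at the venv's own python.
import sys


def in_virtualenv():
    """Best-effort check that works for both virtualenv and the venv module."""
    if hasattr(sys, "real_prefix"):  # set by older virtualenv releases
        return True
    # Python 3 venvs (and newer virtualenv) make base_prefix differ from prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def build_response():
    """Return a complete CGI response: headers, a blank line, then the body."""
    body = "interpreter: %s\nvirtualenv: %s\n" % (sys.executable, in_virtualenv())
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(build_response())
```

As noted in the discussion, each CGI request starts a fresh interpreter, so this approach pays the venv start-up cost on NFS for every page load.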
[18:45:13] 10MediaWiki-extensions-OpenStackManager: Update HTMLForm definitions to use `'dropdown' => true` rather than `'cssclass' => 'mw-chosen'` - https://phabricator.wikimedia.org/T143445#2568405 (10matmarex) [19:41:25] 06Labs, 10Math: Request increased quota for labs project - https://phabricator.wikimedia.org/T143446#2568538 (10Plato2000) [19:42:19] 06Labs, 10Math: Request increased quota for Math labs project - https://phabricator.wikimedia.org/T143446#2568552 (10Plato2000) [20:22:04] yuvipanda, can you install a package on the toollabs-python2-base docker image for me? [20:22:24] The enchant library is complaining about not having it. [20:22:37] The package is libenchant1c2a [20:27:20] hey tom29739 [20:27:28] I have a task for this somewhere [20:27:51] tom29739 https://phabricator.wikimedia.org/T140110 [20:33:25] 06Labs, 10Labs-Kubernetes, 10Tool-Labs: Install libenchant1c2a in the toollabs-python2-base docker image - https://phabricator.wikimedia.org/T143449#2568629 (10tom29739) [20:33:33] yuvipanda, ^ [20:34:24] man do we need that PaaS [20:34:46] Yep.. [20:34:55] I wonder if we could make a micro-paas for this kind of thing... [20:35:00] so this is an interesting dilemma [20:35:14] or is that a slippery slope we will never recover from? [20:35:26] bd808, just don't. [20:35:27] python's manylinux wheel support is solid as of a couple months ago [20:35:31] and packages are adopting to it [20:35:38] so that'll put an end to this in some way [20:35:41] numpy / scipy for example [20:35:59] no longer need a fortran compiler / any compiler [20:36:04] so for pyenchant... [20:36:10] it will probably get a wheel sometime [20:36:13] so then the question becomes [20:36:17] what to do in th emeantime? [20:36:20] there are two options [20:36:37] one is we install these libraries in base like this [20:36:46] the other is... 
we setup a devpi instance where we host our wheels [20:36:49] It does say: pre-built binary wheel from PyPI [20:37:15] tom29739 yeah, but the prebuilt one on PyPI is only for windows [20:37:20] they don't have a manylinux wheel yet [20:38:32] yuvipanda, why can't we use apt in the docker containers? [20:38:47] They get erased when you've finished with them. [20:38:54] tom29739 because that would require you to run as root [20:38:57] and we prohibit that [20:39:07] because imagine if you could run a process as root inside the docker container [20:39:13] then you can mount /data/project [20:39:16] and read/write to it as root! [20:39:23] and /home too [20:39:37] you put something in my .bashrc and boom now you have a root exploit [20:39:39] that's why. [20:41:32] tom29739 bd808 thoughts on the devpi idea? [20:41:46] I think that's a better idea. [20:41:49] pros: our containers keep lean, it also speeds up pip installs a bit. cons: we have to maintain it. [20:43:43] yuvipanda, also it means that the extra burden of the packages only happens if the tool needs it [20:44:28] So if I'm the only person to use package x, then only my docker container has that extra load, rather than everyone's [20:44:48] for devpi? yeah. but the wheels also live on NFS than the container itself [20:44:53] so I am general very pro that [20:51:12] yuvipanda, how long will it take to set up a devpi server? [20:51:55] tom29739 not today, https://gerrit.wikimedia.org/r/#/c/282102/ already exists [20:52:08] tom29739 how about I help you build the wheel for it just now and you can use it in your tool? [20:52:13] and I will set up this later? [20:52:18] building a wheel isn't that hard [20:52:34] OK. [21:02:28] tom29739 python2 right? [21:02:35] Yep. 
[21:02:50] kk let me try [21:06:02] tom29739 can you try the wheel from tools-docker-builder-01.eqiad.wmflabs/pyenchant-1.6.7-py2.py3.cp27.cp32.cp33.cp34.cp35.pp27.pp33-none-any.whl [21:08:22] tom29739 you might need a newer version of pip as well (pip install -U pip in the venv) [21:10:56] !log ores deployed ores-wmflabs-deploy:f0fc59b [21:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [21:12:23] yuvipanda, my pip doesn't like it [21:12:31] what does it say [21:12:37] It's pip 8.1.2 [21:12:47] Processing /tools-docker-builder-01.eqiad.wmflabs/pyenchant-1.6.7-py2.py3.cp27.cp32.cp33.cp34.cp35.pp27.pp33-none-any.whl [21:12:57] IOError: [Errno 2] No such file or directory: '/tools-docker-builder-01.eqiad.wmflabs/pyenchant-1.6.7-py2.py3.cp27.cp32.cp33.cp34.cp35.pp27.pp33-none-any.whl' [21:13:08] 06Labs, 10Labs-Infrastructure: Plan deprecation of all precise instances in Labs - https://phabricator.wikimedia.org/T143349#2568696 (10chasemp) With tools compare between jobs run on precise vs. trusty https://graphite-labs.wikimedia.org/render/?width=1077&height=509&_salt=1471638847.104&target=cactiStyle(su... [21:13:37] yuvipanda, do I need to download the wheel to the local system? [21:13:42] tom29739 prefix with http:// [21:13:45] ? [21:13:47] 06Labs, 10Tool-Labs, 13Patch-For-Review: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2568697 (10chasemp) >>! In T140999#2567507, @chasemp wrote: > Consider adding: > > * /data/project/.system/gridengine/spool/qmaster/jobseqnum > * job count by host done [21:14:22] 06Labs, 10Graphite, 06Operations: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2568698 (10yuvipanda) I'm thinking of just running this in a cron: ``` find . -type f \! -mtime 672 -delete ``` 672 is 28 days, 4 weeks. That sound ok to everyone? 
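A note on the cleanup command proposed at 21:14:22: in `find`, `-mtime` counts 24-hour periods rather than hours, so 672 would mean 672 days, and the leading `\!` negates the test. The stated intent (files not updated for 28 days) is closer to `-mtime +28`. The sketch below expresses that intent in Python; it only lists candidates and deletes nothing. `stale_files` and `demo` are hypothetical names, and the cutoff is a plain timestamp comparison, so it will not match `find`'s day-rounding exactly.

```python
import os
import shutil
import tempfile
import time

SECONDS_PER_DAY = 24 * 60 * 60


def stale_files(root, max_age_days, now=None):
    """Return sorted paths under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)


def demo():
    """Create one fresh and one 30-day-old file; return the stale basenames."""
    root = tempfile.mkdtemp()
    try:
        old = os.path.join(root, "old.wsp")
        new = os.path.join(root, "new.wsp")
        for path in (old, new):
            with open(path, "w") as f:
                f.write("data")
        thirty_days_ago = time.time() - 30 * SECONDS_PER_DAY
        os.utime(old, (thirty_days_ago, thirty_days_ago))
        return [os.path.basename(p) for p in stale_files(root, 28)]
    finally:
        shutil.rmtree(root)
```

Listing first and deleting in a separate step makes it easier to review what a four-week threshold would actually remove.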
[21:14:37] yuvipanda, it works [21:14:48] tom29739 \o/ try the functionality to make sure? [21:18:14] Doesn't seem to want to work [21:18:15] [22:17:44] Missing pyenchant module. [21:20:09] tom29739 but pip works? [21:20:13] worked? [21:20:23] can you also try downloading the wheel and installing it and see what happens? [21:20:36] Pip works [21:20:43] I mean, the pip install worked [21:20:58] That's what I meant [21:21:08] ah, right [21:21:11] that's weird [21:22:18] hey yuvi, this is Anthony working on recitation-bot [21:23:00] hello dfko [21:24:15] dfko your tool is python3 right? [21:24:29] yes [21:24:53] I have a venv set up in the labs project account hume [21:24:55] home [21:25:21] dfko ok, let me try to get it set up following https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web/Kubernetes#python_.28uwsgi_.2B_python3.4.29 [21:25:23] and see how it goes [21:25:57] there is a uwsgi.ini in the home as well [21:27:57] ok, am trying to start uwsgi now [21:30:00] I see 502 Bad Gateway [21:33:26] dfko look at ~/uwsgi.log now [21:33:30] your problem isn't multiprocessing, at least not yet [21:33:36] there's a templatenotfound exception? [21:33:51] (I changed your app.py code slightly, mostly to stop having app.run run when called by uwsgi) [21:34:51] might have to do with my changing directories [21:35:22] if you run under kubernetes does the same directory structure appear i.e. /data/project/recitation-bot ? [21:35:52] yes [21:36:40] yuvipanda, ImportError: No module named pyenchant [21:37:12] Oh. [21:37:17] It's pip's fault [21:37:29] Installing collected packages: pyenchant [21:37:30] Found existing installation: pyenchant 1.6.7 [21:37:30] Uninstalling pyenchant-1.6.7: [21:37:30] Successfully uninstalled pyenchant-1.6.7 [21:37:30] Successfully installed pyenchant-1.6.7 [21:38:45] also this may not work if I cannot run a shutdown method under uwsgi [21:39:39] dfko nope, you can! 
[21:39:46] dfko you should use the flask specific things for it at the moment [21:41:03] dfko in general I'd like to phase out use of the 'generic' webservice type on gridengine, and I want to help you make this work on uwsgi. It'll also help you in the long run - running the flask built-in server *is* going to cause multiple issues over time... [21:41:30] what does flask have for exiting? [21:41:48] someone points me to the python exit handling https://docs.python.org/3.5/library/atexit.html [21:41:52] dfko > @app.teardown_appcontext [21:41:58] from the link I pasted. [21:48:54] you mean this? http://flask.pocoo.org/docs/0.10/appcontext/ [21:51:14] dfko that's the same link I sent you right? [21:51:18] oh, for a slightly earlier version, but sure! [21:51:20] yeah that [21:51:44] I didn't see a link so I searched [21:52:48] oh, that's strange :| [21:53:55] i can still see it in my client [21:54:01] dfko oh well. but yeah, you can use that for cleaning up [21:54:07] tom29739 ok, so you're set for now right? [21:55:21] yuvipanda, pip is acting strange [21:55:45] It's uninstalling the wheel. [21:55:59] When it's supposed to be installing it [21:57:33] Hmm, apparently wherever bigbrother is run from doesn't have php installed.... any easy way to convince jstart it might just work and to stop complaining? [22:00:43] yuvipanda, it appears to have the same problem as before [22:01:15] ImportError: The 'enchant' C library was not found. Please install it via your OS package manager, or use a pre-built binary wheel from PyPI. [22:01:23] That's using your wheel. [22:02:29] hmm [22:03:01] tom29739 ok, I'll temporarily install the library for you, but we'll revisit in a week or so to switch to devpi? [22:03:38] Sure. [22:14:48] dfko any luck?
[22:21:07] Labs, Tool-Labs, Mail: Move tools-mail to trusty - https://phabricator.wikimedia.org/T96299#2568903 (valhallasw) Apparently I even wrote some tests for the relay at some point; https://github.com/valhallasw/mailrelay-tests [22:21:31] Damianz: what's the error you're getting? [22:22:06] afaik the bigbrother host is supposed to have the same packages installed as bastions and exec hosts [22:22:53] valhallasw`cloud: Program not found on php, /usr/bin/php, /etc/alternatives/php, /usr/bin/php5 etc (seems to be the place on the exec hosts)..... 'fixed' it by making it call a bash script that calls exec to php [22:24:37] Labs, Tool-Labs: bigbrother hosts missing exec packages - https://phabricator.wikimedia.org/T143458#2568904 (valhallasw) [22:33:17] Labs, Tool-Labs: bigbrother hosts missing exec packages - https://phabricator.wikimedia.org/T143458#2568941 (valhallasw) exec_environ indeed seems not included on the bigbrother host. We can either include that, or we can figure out how to make SGE not complain about missing files (which I think is confi... [22:41:32] yuvipanda: did you install that package? [22:46:53] tom29739 yup [22:46:54] try [22:49:25] PAWS: Non-notebook files don't redirect to paws-public when URL is changed - https://phabricator.wikimedia.org/T143459#2568964 (Staeiou) [22:56:17] yuvipanda: do I need to recreate my pod to use the new image? [22:58:24] tom29739 yup [23:06:16] yuvipanda: it works. [23:06:20] Yay. [23:08:19] tom29739 cool! I'll make a note on the temporariness :) [23:38:01] @yuvipanda gotta go will be wrapping this up this evening will let you know how it goes. thanks for the help