[00:00:28] (03CR) 10Legoktm: [C: 032] "Err, what? Details plz" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/235357 (owner: 10Greg Grossmeier) [00:00:44] (03Merged) 10jenkins-bot: Naming is hard [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/235357 (owner: 10Greg Grossmeier) [00:02:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [00:03:12] !log tools.wikibugs Updated channels.yaml to: 5f5fbb9243566dd512f1b1bf65ed60e5e08d6e92 Naming is hard [00:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [01:04:23] 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1595565 (10EBernhardson) [01:14:52] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1595576 (10scfc) For the `labs/toollabs` repository, (after merging a patch) we create (at the moment identical) packages for Precise and Trusty, but do not deploy to... [01:23:48] petan: Seems botbot is down - bad gateway; https://botbot.wmflabs.org/ [01:23:55] https://bots.wmflabs.org/~wm-bot/searchlog/ is also 404 [01:52:28] tools-exec-1218 appears to be having problems, stuck job 11474, can't even delete it. https://tools.wmflabs.org/nagf/?project=tools#h_tools-exec-1218 shows way more wait I/O than other instances [02:38:27] 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1595626 (10chasemp) Thinking we could combine a bump in allowed clients with logging the request before denying at https://phabricator.wikimedia.org/diffusion/ODDY/browse/master/src/nc_p... [03:30:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [03:31:44] LGTM... [03:35:47] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 879430 bytes in 2.261 second response time [04:13:00] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [04:39:15] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) 3NEW [04:40:51] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) [04:40:53] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) [04:51:22] is nfs slow? [04:53:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 22.22% of data above the critical threshold [0.0] [04:54:15] https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Labs+NFS+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report looks suspicious [05:33:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [08:30:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [08:33:10] ^ intermittent [08:34:27] mark: is data/project/.system/gridengine/spool what's causing the SGE NFS load? [08:34:55] or is it all the *.out/*.err files written out? [08:35:46] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 879430 bytes in 2.327 second response time [08:40:25] Hi. 
If I execute a script on one of the exec servers (via qsub) and in the script I'm writing to a file, is it somehow buffered on the exec machine and then written to my tool acc (it's in public_html) on bastion? Or is it writing directly to the file in my tool's folder? [08:40:55] it writes directly to the file, but because it's NFS there can be some delay if you read from another host [08:41:35] also, consider using jsub (which takes care of several standard settings you have to pass manually with qsub) [08:46:20] alkamid: yes there is always a delay if that file is on nfs, it also matter how you write to the file, your program can be buffering, and also kernel is probably buffering it [08:47:42] valhallasw`cloud, petan, I'm writing with python's open(), I believe it reads the default buffer from the system (4096?) [08:48:25] what do you actually want to do? [08:48:51] the delay is there and you can't do much about it, I don't think there is any simple way to override it [08:49:58] if you want some faster communication between instances, eg. be able to push some information on exec machine and be able to immediately retrieve it on another machine you should probably use redis or something like that [08:51:59] petan, no, I've just run the script and after ~20 min there was no output in the file, but when I run it locally ~2 min is enough to fill in the buffer and write [08:52:12] so I just wanted to understand if it's normal or something's wrong [08:52:23] no 20 mins is not normal [08:52:35] alkamid: try logging in to the exec host and reading file contents there [08:52:56] qstat tells you the host, then just ssh [08:53:54] valhallasw`cloud: that sure is a large part of it yes [08:54:16] we should look into taking that off NFS, SGE supports that I saw [08:54:23] yes, sounds good [08:54:40] maybe that will also stop SGE from losing jobs *looks angrily at SGE* [08:54:56] possible yeah [08:55:00] i think it does an insane amount of locking now [08:55:05] still might not be enough ;) [08:56:44] 6Labs, 10Tool-Labs: Reduce SGE NFS usage - https://phabricator.wikimedia.org/T111158#1596167 (10valhallasw) 3NEW [08:56:49] ^ :-) [08:58:57] mark: eta on rebuild? [09:02:03] 10 days? :P [09:05:03] at 5MS/s? [09:07:18] * YuviPanda feeds paravoid stroopwafels with nutella [09:07:24] haha [09:07:49] although this is some strange Lidl's ripoff of nutella that isn't quite as good [09:08:42] anyway, off to the WMDE Office! [09:13:46] valhallasw`cloud, you suggest using jsub. Currently my jobs are scheduled with "qsub -N 'porzucone.py' -b y -q task -l h_rt=10:00:00 -l h_vmem=500M -l release=trusty -e $HOME/output/ -o $HOME/output/ $HOME/scripts/venv/bin/python $HOME/scripts/porzucone.py >/dev/null" [09:14:09] i suppose I should then drop "-b y -q task -l h_rt=10:00:00" ? [09:14:16] and the rest stays the same? [09:14:43] you don't really have to use jsub, there is no point if you want to specify all parameters yourself [09:14:58] jsub is just trying to make it more simple for people who don't want to mess up with qsub [09:15:02] it's just a wrapper around it [09:16:00] alkamid: jsub -N porzucone.py -mem 500M -l release=trusty -q $HOME/scripts/venv/bin/python $HOME/scripts/porzucone.py [09:16:06] although that will change where your output ends up [09:16:31] (namely in ~/porzucone.py.out and ~/porzucone.py.err instead of in a gazillion .o1233456 .e12332542 files) [09:17:28] valhallasw`cloud, was "-q" in your line intentional? 
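(Editor's note: the buffering question above, whether output written on an exec host shows up immediately in the tool's directory, comes down to Python's own userspace buffering plus NFS client caching. Below is a minimal sketch of forcing writes out early; it assumes plain CPython file objects and is an illustration only, not code from alkamid's script.)

```python
import os

def append_progress(path, line):
    """Append one line and push it toward the NFS server right away."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel / NFS client to write it out now
```

Even with the fsync, another host may briefly serve a cached view of the file, which matches the delay described in the conversation above.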
[09:17:40] sorry, that should be -quiet [09:17:58] that prevents output on success (but not on errors) [09:23:20] valhallasw`cloud, thanks. Coming back to my original problem, I use python's "with open(file, 'w') as f:" and then in the loop I'm writing to this file (see https://github.com/alkamid/wiktionary/blob/feature/py3k/porzucone.py#L39). Might the problem be that within the "with" scope, the file is still open, and therefore is not synced? [09:23:51] alkamid: it will definitely buffer data until the file is closed [09:24:25] alkamid: the typical solution for these kinds of issues is to write to porzucone.html.1, then move porzucone.html.1 over porzucone.html [09:24:36] (shutil.move) [09:24:37] valhallasw`cloud: couldn't you simplify your jsub line by putting python as #! in the script file and leaving -N out? [09:25:14] Eh, yeah. That's true as well. Needs +x on the file as well, though. [09:25:46] true [09:29:39] valhallasw`cloud, why is it different than local execution? On my machine I can just run the script and it will write to file every 4096 bytes (or whatever the buffer size is) and then I can read it [09:29:59] I mean I can read the output file [09:31:02] alkamid: I don't know what the nfs buffer size is, but it's likely larger than 4KB [09:32:31] ok, so I'll get the same behaviour, only with a lot more lag because NFS buffer is larger. Fair enough [09:32:38] thanks for your help [09:33:57] I think so, yes. [09:34:40] As noted, the best way to handle this is to write to a temp file first and then move (this is guaranteed to be atomic, so users get either the old or the new file) [09:36:42] ok, I'll do this [09:37:13] (it would really be cool if there was a with: handler for that, but I don't know if one exists) [09:54:10] 6Labs, 10Tool-Labs, 7user-notice: No file system on toollabs, unable to login, web service broken - https://phabricator.wikimedia.org/T110827#1596297 (10valhallasw) 5Open>3Resolved a:3valhallasw [09:54:36] (03PS1) 10Hashar: Nodepool database user/pass [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) [09:55:03] 6Labs, 10Tool-Labs, 7user-notice: No file system on toollabs, unable to login, web service broken - https://phabricator.wikimedia.org/T110827#1587427 (10valhallasw) The initial issue was resolved sunday afternoon (CEST), but I forgot to close the task at that point. [09:58:17] (03PS2) 10Hashar: Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) [09:59:39] (03CR) 10Jcrespo: [C: 031] Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) (owner: 10Hashar) [10:17:34] (03CR) 10Hashar: [C: 032 V: 032] Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) (owner: 10Hashar) [11:11:21] (03PS1) 10Jean-Frédéric: Update ru configuration [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235432 (https://phabricator.wikimedia.org/T110665) [11:12:21] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Update ru configuration [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235432 (https://phabricator.wikimedia.org/T110665) (owner: 10Jean-Frédéric) [11:14:40] 10Tool-Labs-tools-Other, 6Commons, 6Community-Tech, 6Multimedia: [AOI] Ceate a new DerivativeFX after the Toolserver shutdown - https://phabricator.wikimedia.org/T110409#1596423 (10Steinsplitter) >>! 
In T110409#1586685, @Snaevar wrote: > I have seen DerivativeFX at http://tools.wmflabs.org/derivative/deri1... [11:20:55] 10Tool-Labs-tools-Other, 6Commons, 6Community-Tech, 6Multimedia: [AOI] Create a new DerivativeFX after the Toolserver shutdown - https://phabricator.wikimedia.org/T110409#1596439 (10Aklapper) [11:43:01] i've frozen the rebuild a bit [11:43:07] it makes like no difference to load on that array [11:44:57] (03PS1) 10Jean-Frédéric: Template names in monuments-config are case-sensentive [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235438 (https://phabricator.wikimedia.org/T110665) [11:45:14] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Template names in monuments-config are case-sensentive [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235438 (https://phabricator.wikimedia.org/T110665) (owner: 10Jean-Frédéric) [12:04:57] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1596558 (10akosiaris) In general I find it way preferable to refer to releases by their name (e.g. `jessie`, `potato`, `trusty`) than by their designation (e.g. `stabl... [12:19:57] (03PS1) 10Sitic: Fix notifications [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/235449 [12:20:27] (03CR) 10Sitic: [C: 032 V: 032] Fix notifications [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/235449 (owner: 10Sitic) [12:26:30] hi mark [12:26:43] valhallasw`cloud: around? need some help looking at a pwb based bot on tools which might be hammering NFS [12:26:55] semi [12:27:14] what's it doing? [12:27:19] mark: from the looks of it ralgisbot seems to be an agglomerate, running various different pywikibot commands all the time [12:27:25] and logging them to command.log [12:27:46] valhallasw`cloud: just lots of accesses, mark suspects it might be part of the reason that rebuilding is slow [12:27:48] YuviPanda: you can also ask jayvdb or xzise [12:27:58] if it's pwb itself, thatis [12:28:02] it's not clear whether that's a worst offender or anything [12:28:04] valhallasw`cloud: they don't have root, so can't see. [12:28:05] right [12:28:13] it just shows up on fatrace [12:29:09] (03CR) 10Yuvipanda: [C: 04-1] Add list-user-databases command (032 comments) [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [12:29:17] there's a whole set of jobs running under qstat -u tools.ralgisbot [12:29:56] yeah [12:30:03] and I'm reluctant to 'stop' them [12:30:19] ...it's using -putthrottle:0 for all bots, it seems [12:30:36] what does that do? [12:30:54] it disables write throttling [12:31:09] aah. I guess that's not reccomended? [12:31:15] which might be OK for some bots, but could explain large resource usage [12:31:22] and I guess that might also be responsible for it re-reading the .py files all the time? [12:31:26] uuuh [12:31:36] no, that should not happen [12:32:00] valhallasw`cloud: look at /data/project/ralgisbot/PWB/core/logs/command.log [12:32:12] valhallasw`cloud: I presume that's different python scripts being run on a tight-ish loop [12:32:30] looks like it [12:33:04] so basically they should throttle themselves some more? 
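(Editor's note: the atomic-write pattern suggested earlier in the log, write to a temporary name and then move it over the target, can be wrapped in the "with:" handler valhallasw`cloud wished for. This is a rough sketch assuming Python 3; it is not an existing Tool Labs helper.)

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def atomic_write(path, mode="w"):
    """Write to a temp file in the same directory, rename over <path> on success."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    f = os.fdopen(fd, mode)
    try:
        yield f
        f.close()
        os.replace(tmp, path)   # same effect as shutil.move within one filesystem
    except BaseException:
        f.close()
        os.unlink(tmp)
        raise

# usage:
# with atomic_write("public_html/porzucone.html") as f:
#     f.write(html)
```

Because the rename happens within one filesystem, readers only ever see the previous or the new complete file, never a half-written one.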
[12:33:08] no, that's seperate [12:33:27] I mean, both in the writes and also in their loop [12:34:14] what I think is happening [12:34:23] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 8 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1596647 (10Aklapper) p:5Normal>3Lowest [12:34:42] is that the jobs are failing (check the recently written .out files) [12:34:46] but being restarted immediately [12:35:04] aaah [12:35:52] valhallasw`cloud: qstat has start dates for them from a longish time ago, I don't know if that updates after a restart? [12:36:08] qmod -rj keeps the old start date [12:37:43] ah I see [12:37:55] valhallasw`cloud: which out files are you looking at? [12:37:59] there's a ton of them :| [12:38:12] ls -lh | grep 12:3 [12:38:29] bes-isbn.out for example [12:39:12] 6Labs, 10Tool-Labs: Reduce SGE NFS usage - https://phabricator.wikimedia.org/T111158#1596663 (10mark) Indeed, let's start by moving off the spool dir. [12:39:34] that's restarting once every 5 seconds [12:39:43] because the error doesn't actually change the exit code or something :|| [12:39:53] see e.g. /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273 [12:39:59] ok, let me stop those [12:40:52] :) [12:40:58] YuviPanda: might be a leftover from the NFS breakup actually [12:41:05] IOError because it can't write to a file? [12:41:20] valhallasw`cloud but if it is continuously restarting... [12:41:34] YuviPanda: sec. [12:41:37] ok [12:43:59] odd. The file descriptors of the `/bin/bash /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273` processes just point to the current NFS mount. (and 6301, on tools-exec-1206) [12:44:27] you mean files on the mount? [12:44:27] but if I run /data/project/ralgisbot/mrBes05.sh directly it doesn't crash? let me try again [12:44:31] or the mount itself [12:44:32] hmm [12:44:34] the files [12:44:37] with the webgrid [12:44:47] the access.log was incorrect (fd pointed to deleted file) [12:45:50] I can't reproduce any of it [12:45:51] argh :( [12:46:44] YuviPanda: or try just rescheduling the offending jobs.. [12:46:58] sigh :( [12:47:01] orrrr [12:47:02] hmm [12:47:14] I'm wondering if we can figure out what throws the IOError.... [12:47:28] strace! [12:47:44] is an option [12:47:45] let me do that [12:47:51] ok! [12:49:19] [pid 20637] write(2, "IOError", 7) = -1 ESTALE (Stale NFS file handle) [12:49:19] [pid 20637] write(2, ": ", 2) = -1 ESTALE (Stale NFS file handle) [12:49:20] [pid 20637] write(2, "[Errno 116] Stale NFS file handl"..., 33) = -1 ESTALE (Stale NFS file handle) [12:49:27] followed by [12:49:27] aha [12:49:31] so an -rj should fix things? [12:49:31] [pid 20638] _exit(0) = ? [12:49:35] which is obviously stupid [12:49:52] ok [12:49:55] so what happens is this [12:50:01] "/data/project/ralgisbot/PWB/core/pywikibot/userinterfaces/terminal_interface_base.py" wants to write to stdout (or stderr) [12:50:06] but stdout/stderr is a stale NFS handle [12:50:34] and then pwb doesn't handle this correctly and exit(0)s instead of exit(1) for some reason. Not sure, could also be a Python bug. [12:51:03] hmm, shouldn't an exception exit always be non-zero? [12:51:06] oh yeah, and writing to stderr also doesn't work because of the stale handle :P [12:52:35] right [12:52:39] I wonder how widespread this is :| [12:53:11] could actually be a bug in pwb.py [12:53:19] which wraps in all kinds of fun ways [12:54:17] but it really shouldnt [12:54:18] gah. 
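(Editor's note: the strace excerpt that follows shows even the traceback's write to stderr failing with ESTALE. Below is a hedged sketch of what a defensive top-level handler could look like; it illustrates the failure mode and is not pwb.py's actual code.)

```python
import errno
import sys

def run_bot():
    # stand-in for the real work; here it just fails the way the job did
    raise IOError(errno.ESTALE, "Stale NFS file handle")

if __name__ == "__main__":
    try:
        run_bot()
    except Exception as exc:
        try:
            sys.stderr.write("fatal: %s\n" % exc)
        except (IOError, OSError):
            pass            # stderr itself may be the stale NFS handle
        sys.exit(1)         # make sure the grid still sees a non-zero exit
```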
[12:54:58] valhallasw`cloud: want me to -rj all these jobs? [12:55:04] yeah, sure [12:55:07] I have the strace log saved [12:56:24] ok [12:56:30] YuviPanda: if I just have a test.py which does 'raise IOError', pwb.py returns 1 correctly [12:57:56] !log tools rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles [12:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:01:21] valhallasw`cloud: success! in that I got a in a log now :P [13:01:33] and the job dies? :P [13:01:35] because it should [13:01:54] valhallasw`cloud: I suppose so. but it's in the continous queue, so it gets restarted [13:02:08] not if exit code > 0, I thought? [13:02:39] or is it the other way around, it doesn't get restarted if exit code == 0? [13:03:04] I think it doesn't get restarted it exit code == 0 [13:03:13] which... makes me even more confused? [13:03:38] maybe the idea is it keeps restarting until it finishes succesfully :P [13:04:14] > The '-continuous' option ensures that the job will be restarted automatically until it exits normally with an exit value of zero, indicating completion. [13:04:17] valhallasw`cloud: ^ [13:04:22] right [13:04:42] then I don't get why I saw an _exit(0), because that would mean the job should have stopped [13:05:02] yes [13:05:09] I don't either :| [13:05:13] oh! [13:05:19] because there's actually a later exit_group(1) [13:09:23] so it's working correctly [13:09:34] but restarting every 5 seconds is a Bad Idea(TM) [13:09:41] yes [13:10:40] it's still constantly re-reading all of pwb into memory though [13:11:02] well, yes [13:11:05] because it's still failing [13:11:08] and rstarting all the time [13:11:13] right [13:11:16] just due to a different error [13:11:29] but now at least the .err should show something is wrong [13:12:07] !log suspended all jobs in ralgisbot temporarily [13:12:07] suspended is not a valid project. [13:12:16] !log tools suspended all jobs in ralgisbot temporarily [13:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:12:26] you can just check the relevant .err files? [13:12:30] and just suspend those? [13:12:38] no, so even after suspending them all [13:12:44] there's still constant reads... [13:12:47] so wtf [13:13:10] mark: ^ I suspended all the jobs, but even then there's constant reads of all the python files... [13:13:11] YuviPanda: I'm not sure what suspend means in SGE terminology [13:13:16] hmm [13:13:53] also, fun fun fun: there's a daily 'if my jobs aren't running, restart them' crontab [13:14:05] aaah [13:14:15] well, then, I feel less guilty about killing it all [13:14:25] but we'll just have the same problem [13:14:27] at 3 am [13:14:32] ...at 3 am. [13:14:37] is this what killed SGE the other nigth? [13:14:45] no, probably not, the jobs are older [13:15:58] !log deleted all jobs of ralgisbot [13:15:58] deleted is not a valid project. [13:16:03] !log tools deleted all jobs of ralgisbot [13:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:16:42] ok, so that shut them up [13:16:50] valhallasw`cloud: so looks like gridengine's suspend doesn't do much [13:17:05] mark: any improvements in the re-assembly speed? 
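(Editor's note: since a '-continuous' job that exits non-zero is resubmitted almost immediately, a crashing bot can flap every few seconds as described above. One way to damp that is to retry inside the job with a growing delay; the wrapper below is hypothetical, not an existing Tool Labs tool.)

```python
import subprocess
import sys
import time

def run_with_backoff(cmd, max_delay=300):
    """Re-run cmd until it exits 0, sleeping longer after each failure."""
    delay = 5
    while True:
        if subprocess.call(cmd) == 0:
            return 0               # clean exit: SGE will not restart the job
        time.sleep(delay)          # pace the retries instead of letting SGE flap
        delay = min(delay * 2, max_delay)

if __name__ == "__main__":
    # e.g. run_with_backoff(["python", "scripts/porzucone.py"])
    sys.exit(run_with_backoff(sys.argv[1:]))
```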
[13:17:48] I think suspend might be a thing for parallel jobs and stuff like that [13:17:50] interview [13:17:53] mark: ok [13:21:28] valhallasw`cloud: I'm going to put a notice on the maintainers' talk page [13:21:33] (y) [13:31:55] valhallasw`cloud: done [13:35:22] valhallasw`cloud: thanks for helping dig into it :) [13:35:26] yw [13:38:48] YuviPanda: don't really see any change [13:41:11] mark: hmm, so the biggest other thing I see is wikihistory doing funny looking things with certs [13:41:25] right, noticed that as well [13:42:40] mark: want me to investigate those as well? [13:42:50] if you can :) [13:42:54] would be good if we could find the cause of this [13:43:10] 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1596786 (10akosiaris) Great! Let's hope it's that indeed. [13:43:17] I can also poke the iftt people and ask them to see if they can switch to Redis for cache [13:43:18] instead of NFS [13:43:31] (I had told them earlier they can switch if they want to, but didn't force it) [13:43:44] right [13:44:28] mark: outside of that, not sure what we can do. zoomviewer requires a large amount of cache storage so we can't put them on local storage anywhere [13:44:34] and outside of those I don't see any big single-users [13:44:41] yep [13:44:43] (and gridengine itself, ofcourse) [13:44:55] if we could just stop that ;) [13:45:04] hehe [13:49:53] i also still think tools-exec-1403 is suspicoius [13:50:02] it seems to do only locking operations [13:50:22] very different from the other hosts [13:50:26] can we temp disable that host? [13:50:47] there's a single job running there -- job id 1072678 from tools.hat-collector [13:52:01] which is an IRC bot [13:52:08] I'll reschedult it [13:53:16] great [13:53:18] !log tools tools-exec-1403 does lots of locking opreations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job. [13:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:53:42] does that stop the locking operations? [13:53:57] let's see [13:54:09] not yet no [13:54:14] could be SGE itself? [13:54:17] or are they now happening from @tools-exec-1408 [13:54:18] maybe [13:54:28] just looking at 1403 [13:54:41] can you stop SGE at 1403? [13:54:56] or reboot the instance for all I care :) [13:55:00] I'll restart execd [13:55:27] lots of stale nfs client ids [13:55:31] !log tools restarted gridengine_exec on tools-exec-1403 [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:55:48] no improvement [13:55:59] ok, let's reboot the host then [13:56:03] [1298282.198826] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:03] [1303376.591049] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:04] [1304531.280859] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:07] that may have something to do with it :) [13:56:07] yeah [13:56:19] eh, let me check one thing first [13:57:08] pacct is huge again [13:57:26] 10.64.37.10-man5 <-- hey, I remember seeing that :-p [13:57:55] stopped now [13:58:37] !log tools rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load [13:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:59:11] no reduction in i/o, but certainly a lot less nfs packets ;) [13:59:30] elee_: around? 
[14:00:54] 6Labs, 10Tool-Labs, 3Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1596838 (10valhallasw) This happened again on tools-exec-1403; probably a hangover from the NFS breakdown last sunday (T110827). [14:03:22] oh, yeah, I was going to upstream that [14:07:47] valhallasw`cloud: where are jobscripts stored again? [14:07:58] we need to move scratch to another raid array [14:08:03] YuviPanda: /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273 [14:08:09] currently both tools and scratch are on the same one, and both are the busiest filesystems [14:08:10] but only on that specific host, apparently [14:08:29] valhallasw`cloud: waaa, isnt' /var/spool/gridengine an NFS mount? [14:08:30] 6Labs, 10Analytics, 10Labs-Infrastructure, 3Labs-Sprint-108, 5Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1596854 (10Ottomata) @ellery fyi, you can do this now: https://wikitech.wikimedia.org/wiki/Analytics/FAQ#How_d... [14:09:01] YuviPanda: no.. /var/log/gridengine is [14:09:04] er, /var/lib [14:09:07] aaah [14:09:10] i missed the spool [14:09:11] ok [14:09:25] so actually only the master messages are on NFS I think [14:09:48] 271 0.79988 smallwiki_ tools.wikihi Rr 08/17/2015 15:04:37 continuous@tools-exec-1216.eqi 1 [14:10:05] cat: /var/spool/gridengine/execd/tools-exec-1216/job_scripts/271: No such file or directory [14:10:10] (on exec-1216) [14:10:24] .... [14:10:37] all I want to do is to find the process... [14:10:40] of a particular job [14:10:59] ? [14:11:06] ps aux | grep 271? [14:11:09] valhallasw`cloud: so I can't find the jobscript [14:11:18] that's the gridengine jobid [14:11:19] no? [14:11:21] eee [14:12:36] eeeh. [14:13:15] there are no tools.wikihistory jobs on exec-1216 (uid = 51512) [14:13:49] yeah [14:13:51] so... [14:13:53] qstat is lying? [14:14:29] mark: is the resync moving faster without that exec host? [14:14:37] nope [14:14:43] it's really weird, nothing affects it [14:14:59] YuviPanda: pretty sure SGE is lying :| [14:15:09] so the wikihistory thing seems to be http://www.mono-project.com/docs/faq/security/ in that it is auto-accepting SSL certs and storing them on NFS [14:15:13] instead of using the system certificate store [14:15:20] from what I can tell [14:15:27] I can't strace the process because I can't find where the process is [14:16:31] valhallasw`cloud: the number is also fairly low... [14:16:33] YuviPanda: only 108048 108049 465830 465831 465832 465833 seem to be running anywhere :| [14:16:35] maybe it's been lying for a long time [14:16:40] or just recently [14:16:43] job number reset... [14:16:57] so [14:17:23] 271 (17 aug) 974 (31 aug) 80114 80115 and 1342218 are missing?! [14:17:49] 465833 is constantly restarting [14:17:56] like, I can't strace it because the pid keeps changing [14:17:57] lol. [14:18:03] YuviPanda: strace the parent process with -f [14:18:17] the /bin/bash gridengine/whatever/465833 process [14:19:56] valhallasw`cloud: aaahaaaaa [14:20:02] valhallasw`cloud: so it's actually a php job that shells out to mono... [14:21:18] oh, 974 is webgrid so probably fine (I only checked -exec-12* hosts) [14:22:04] maybe there isn't always a job_script? 
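(Editor's note: to answer "all I want to do is to find the process of a particular job" without trusting qstat, one can look for the shell whose command line ends in the job_scripts path quoted above. The sketch below assumes the job is running on the host you are logged into; it is not an existing utility, and the found PID can then be traced with strace -f.)

```python
import os
import sys

def find_job_pids(jobid):
    """Return PIDs whose command line references .../job_scripts/<jobid>."""
    suffix = ("/job_scripts/" + str(jobid)).encode()
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                argv = [a for a in f.read().split(b"\0") if a]
        except (IOError, OSError):
            continue                      # process exited or is unreadable
        if any(arg.endswith(suffix) for arg in argv):
            pids.append(int(entry))
    return pids

if __name__ == "__main__":
    print(find_job_pids(sys.argv[1]))     # then: strace -f -p <pid>
```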
[14:44:57] valhallasw`cloud: hmm, that'd be strange [14:45:03] valhallasw`cloud: but 271 doesn't seem to exist anywhere [14:45:07] :/ [14:45:25] valhallasw`cloud: but a job with same name does exist both in gridengine and in reality [14:57:37] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1597094 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done! [15:00:02] hey yalls, my labs instances ahve disappeared again [15:00:06] i have logged out and logged back in [15:02:33] ottomata: have you howled at the moon while planting turmeric on your hair thrice? [15:02:43] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597129 (10Bamyers99) 3NEW [15:03:26] ottomata: in other words, have you tried removing yourself from the project and re-adding yourself? [15:03:38] mark: ^ perhaps another instance with NFS issues/fallout? [15:03:52] Yes YuviPanda, I also did the thing where I diluted essence-of-labs 100 fold and then used that to water my head tumeric [15:05:11] YuviPanda: that worked :) [15:06:52] ottomata: homeopathy for labs! [15:06:57] JohnFLewis: looking [15:09:36] (03CR) 10Tim Landscheidt: Add list-user-databases command (032 comments) [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [15:09:40] I'm not good at jsub stuff, just wondering if jsub-sent crontab doesn't send me an email? (When I used crontab without jsub it used to send me a mail) [15:09:53] (03PS2) 10Tim Landscheidt: Add list-user-databases command [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) [15:10:45] (03CR) 10Tim Landscheidt: [C: 04-2] "(Until T110939 is resolved.)" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [15:11:08] revi: jsub will output the job id if you don't pass -quiet [15:11:14] revi: so it should just send an e-mail [15:11:38] I don't have -quiet but spam/inbox is quiet :p [15:14:30] which tool is this? [15:15:15] tools.revibot [15:15:59] 0 0 * * * /usr/bin/jsub -N cron-tools.revibot-1 -once -quiet sh /data/project/revibot/red-kowiki.sh [15:16:03] ^ looks like -quiet to me [15:17:22] oh meh [15:17:28] I didn't look it closely [15:17:32] duh [15:17:37] * revi needs a new glasses [15:18:43] * revi slaps himself and go to bed [15:20:55] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597229 (10yuvipanda) Ok, so gridengine says that job is running there, but is totally false - that job is running nowhere... [15:20:56] valhallasw`cloud: ^ another lost job [15:21:06] :(( [15:22:00] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597236 (10yuvipanda) The host itself might be ok - it has other jobs running. qdel -f the jobid and re-submit it? [15:22:08] now I'm worried how much jobs have been lost [15:23:33] hm, that seems to happen to my jobs from time to time [15:34:17] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597329 (10scfc) 5Open>3Resolved a:3scfc On the host, the job was stuck on device I/O: ``` scfc@tools-exec-1218:~$ ps auxfwww USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND […] sgeadmin 421... 
[16:16:34] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1597759 (10Andrew) @scfc, there is yet another patch in place. If the problem is still present can you ping me on IRC so that I can d... [16:23:10] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring: labstore monitoring: NRPE: Command 'check_cleanup-snapshots-labstore-state' not defined - https://phabricator.wikimedia.org/T111211#1597821 (10Dzahn) [16:24:16] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 25.00% of data above the critical threshold [0.0] [16:36:35] 10Wikibugs: Assigning a task to someone with a name beginning with a number causes IRC colour issues - https://phabricator.wikimedia.org/T111214#1597912 (10Krenair) 3NEW [16:39:20] 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1597939 (10matmarex) [16:41:18] 6Labs, 3Labs-sprint-112: Update openstack docs for new command-line format - https://phabricator.wikimedia.org/T110912#1597952 (10Andrew) 5Open>3Resolved I updated the sections related to new image creation. I don't see a lot of others. [17:06:39] 6Labs, 6operations, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1598167 (10Andrew) I no nothing about implementation, but I'd certainly prefer that search be self-contained on silver. [17:23:18] YuviPanda: Did we decide on Monday to go ahead with catchpoint stuff in https://phabricator.wikimedia.org/T107058? Or are we going to add paging to shinken instead? [17:34:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [17:34:45] andrewbogott: Hi! Re https://phabricator.wikimedia.org/T110629, after deleting the wikitech cookies and logging in again, https://wikitech.wikimedia.org/wiki/Special:NovaInstance still lists no instances for me. [17:36:14] scfc_de: ok, stay tuned... [17:41:08] scfc_de: I’m logging you out, please stay logged out until I give the ok [17:42:51] andrewbogott: k [17:44:16] scfc_de: it seems SGE lost quite some other jobs as well :/ we're not quite sure how large the issue is, or how to measure that effectively... [17:46:16] e.g. tools.wikihistory has a few jobs (jid 271, 80114, 80115, 1342218) that are missing their SGE supervisor process. I haven't checked whether the actual processes are still running somehow [17:46:20] valhallasw`cloud: "Lost" = Jobs simply ended? Or master lost track of them, but they kept on running locally? [17:46:26] no, it's in qstat [17:46:28] but not running [17:46:52] it's probably another issue caused by the NFS outage [17:50:21] YuviPanda and I were unable to figure out what's going on... [17:54:39] scfc_de: Hi! I'm trying to fix https://en.wikipedia.org/wiki/Wikipedia:Database_reports but there's some error with the setup on the labs instance. It's looking in the wrong directory, and I can't figure out why. [17:56:11] valhallasw`cloud: Reminds me very much of https://phabricator.wikimedia.org/T95094. [17:56:46] scfc_de: ok, that took forever. Try once more? [17:56:50] Niharika: I haven't touched the dbreps tool in months, so I might not be the best contact. What's the problem? [17:57:21] mmm. This one has usage 1: cpu=00:00:00, mem=0.01683 GBs, io=0.00011, vmem=436.898M, maxvmem=436.898M, so it's not all zeros. 
But that might just be because it was resubmitted at some point [17:57:37] scfc_de: Scripts are failing with "permission denied to access to access tool server table" when I'm not even using that table in the script. [17:57:47] andrewbogott: I logged in again, and now the instances are showing. So it seems to work. [17:57:56] all right! [17:58:14] So… all I did (just now) was purge everything, log everyone out, and restart memcache [17:58:40] scfc_de: Anything to do with setup.py perhaps? [17:58:46] Which is the same thing I did on Wednesday. I can only imagine that there was some race that messed with people who logged in in the middle of the purge... [17:59:11] Well, and/or an interaction with the bug that https://gerrit.wikimedia.org/r/#/c/235469/ fixed [17:59:23] scfc_de: anyway, thank you for your patience with all this. [17:59:48] Niharika: I'm sorry, I couldn't tell. Could you file a bug, then others (and maybe I :-)) could take a more detailed look? [18:00:14] scfc_de: Sure. I could file one. Thanks. [18:00:32] andrewbogott: No problem. Is the purge now permanent or will it hit again in a few weeks? [18:00:39] hi scfc_de :) Nice to see you on IRC again :D [18:00:59] YuviPanda: I'll be off soon again :-). [18:01:09] scfc_de: yeah, figured. nice to see you on anyway. [18:01:25] scfc_de: reminded me to merge a bunch of your patches. sorry about falling behind on that [18:01:31] scfc_de: I’m not sure. Things should be stable now, but I might have to mess with login tokens again on Wednesday when/if I upgrade keystone to Kilo. That needs more reasearch. [18:01:34] andrewbogott: (Because I think the error is related to usage pattern and I use wikitech more/in different ways than other users.) [18:06:34] andrewbogott: re: catchpoint vs icinga, those are orthogonal [18:06:50] andrewbogott: mark wanted icinga alerts for NFS to be paging all of ops, which we don't have atm [18:06:59] andrewbogott: unrelated to rest of the catchpoint stuff [18:08:09] YuviPanda: ok… I remember offering to do something paging-related during the meeting but can’t remember what :( Was it the nfs-paging thing? [18:08:37] andrewbogott: I am just finishing up the paging thing, you offered to do the catchpoint checks for the other stuff (LDAP, puppetmaster, etc) [18:08:52] andrewbogott: which was the ticket you linked me to [18:08:54] oh, ok, great! [18:09:26] andrewbogott: there's checker.py in the puppet repo that's python code that does the checking, and then you just add a HTTP check to catchpoint [18:09:29] And… all the times that we keep saying ‘checkpoint’ it’s just a typo meaning ‘catchpoint’ right? [18:09:34] Or is checkpoint an actual thing? [18:09:54] andrewbogott: haha, yes, yes, typo [18:09:58] ok, cool [18:10:04] well checkpoint tbf is a real thing :) [18:10:06] just not for here ;0 [18:10:13] YuviPanda: for http://www.projectcalico.org/calico-networking-for-kubernetes/ [18:10:19] 6Labs, 3Labs-Sprint-108, 3Labs-Sprint-109: Have catchpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1598519 (10Andrew) [18:10:45] are you thinking then taht toollabs move off of openstack entirely and onto some hardware designated for kubern8 [18:10:58] which networking wise is tbd on setup? [18:11:01] chasemp: no that can't happen before we have labs on real hardware [18:11:26] chasemp: but yeah, that's a long term possibility, provided we maintain the same set of access. Esp. with networking, k8s seems much better off run on bare metal, maybe. 
[18:11:40] so what's the thought now for kubernates? [18:11:43] chasemp: but I ugess we can't do calico on top of nova network since it requires BGP [18:12:00] chasemp: flannel for overlay network, in the toollabs project [18:15:01] 6Labs, 10Tool-Labs: Labs_lvm::Volume[separate-tmp] is noisy on execution hosts - https://phabricator.wikimedia.org/T109933#1598529 (10Andrew) [18:15:01] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1598527 (10Andrew) 5Open>3Resolved The above patch plus a complete purge of keystone tokens, wikitech tokens, and a restart of mem... [18:16:57] 6Labs, 10Tool-Labs, 5Patch-For-Review: sql script does not accept wildcards as parameter - https://phabricator.wikimedia.org/T75595#1598540 (10scfc) 5Open>3Resolved Removed. [18:20:21] link to flannel? [18:22:07] chasemp: github.com/coreos/flannel [18:28:46] hi i'm ashok. [18:29:27] I need to change my instance shell account name "ashokvigneshk" instead of "ahokvigneshk". [18:30:07] Is it possible to change if not? What can I do to change? [18:34:19] Akki: We don’t generally support the renaming of users, as it has all kinds of weird side-effects. [18:34:28] If you want to create a new account, I can delete the old one. [18:35:05] What about my wikipedia username? [18:35:26] your wikipedia username is 'Ashok Vignesh K’? [18:35:37] Yes [18:35:51] Have you used labs quite a bit in the past, or are you a new user? [18:36:02] If you haven’t logged in anywhere then renaming isn’t so bad. [18:36:31] I'm new user. [18:36:41] ok. Well, let me try to rename, we’ll see what happens :) [18:37:25] o_o [18:37:27] linux shell rename [18:37:29] * FastLizard4 ducks and covers [18:37:59] If you delete my username presently after that i create my username as same. Is it possible? [18:52:35] * Akki slaps Akki around a bit with a large fishbot [18:52:59] Akki: ok, I’ve changed your shell name… we’ll see what breaks. [19:01:20] Akki: can you tell me exactly what you need to do? [19:05:09] ssh access is usually handled with this: https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29 [19:05:15] Unless you’re running windows, then it’s... [19:05:37] https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_a_graphical_file_manager [19:25:39] YuviPanda: looking at https://portal.catchpoint.com/ui/Content/Tests/TestModuleList.aspx, it looks to me like we already have a wikitech check. Is https://phabricator.wikimedia.org/T107457 something else? [19:25:43] 6Labs, 10Salt: clean up old ec2id-based salt keys on labs - https://phabricator.wikimedia.org/T103089#1598752 (10ArielGlenn) [19:26:00] 6Labs, 10Salt, 6operations: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1598755 (10ArielGlenn) [19:27:00] andrewbogott: oh, not sure who set that up [19:30:09] andrewbogott: where's the wikitech check? [19:30:33] YuviPanda: click ‘internal tools' [19:30:42] andrewbogott: aha! cool [19:30:46] andrewbogott: yes, I just didn't know [19:31:34] Are we using catchpoint for pages at all, or just for stats? 
[19:32:47] 6Labs, 3Labs-Sprint-108, 3Labs-Sprint-109: Have catchpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1598801 (10Andrew) [19:32:47] 6Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1598798 (10Andrew) 5Open>3Resolved a:3Andrew Looks like this is done already. [19:33:38] andrewbogott: supposedly for pages as well but that didn't work last time [19:33:49] ok, I’ll skip that for now [19:34:03] andrewbogott: yeah, let's skip pages and get numbers [20:08:01] 6Labs: Create a catchpoint check for labs puppetmaster - https://phabricator.wikimedia.org/T107456#1599136 (10Andrew) [20:09:00] YuviPanda: can you give me a quick explanation for how something like ^^ would work? Using catchpoint for an internal service seems unobvious [20:09:14] andrewbogott: ah, look at checker.py in operations/puppet [20:09:35] andrewbogott: tools-checker.wmflabs.org/nfs/home for example, writes to a file on NFS, reads it back, and reports OK if it can do that [20:09:43] andrewbogott: and that is done by checker.py [20:10:14] How do things get communicated to catchpoint? [20:10:20] or is that what checker.py does? [20:10:54] andrewbogott: no checker.py exposes the web HTTP endpoint [20:10:59] andrewbogott: and catchpoint hits that every few minutes [20:11:15] andrewbogott: if you look at the NFS check in catchpoint it just hits that URL I pointed out [20:16:47] ah, I see, ok. [20:17:00] “OK” :) [20:17:17] andrewbogott: yeah : [20:17:19] :) [20:17:25] andrewbogott: so it's kindof a hack [20:17:48] It’s at least a simple one! [20:18:23] andrewbogott: yup [20:18:39] andrewbogott: I had a much more complicated solution in mind in the beggining, but talking to ori convinced me this was a much better / simpler way [20:40:31] YuviPanda: can you clarify checker.py vs. toolschecker.py? [20:41:01] Hi Krenair: another quick request here... would it be possible to add cwdent as a CentralNotice admin on the beta cluster too please? He's still pretty new on the FR team, and has been helping out a bit w/ CentralNotice recently :) [20:41:14] andrewbogott: ugh, yes, toolschecker.py [20:41:14] sorry [20:41:17] yep [20:41:20] AndyRussG, doing [20:41:27] Krenair: fantastic! many thanks [20:41:33] YuviPanda: ok, that makes more sense :) [20:41:41] Krenair: his labs user name is cdentinger [20:41:51] thanks Krenair! [20:42:59] http://meta.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log/rights&page=Cdentinger [20:46:33] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1599258 (10scfc) This is still happening for me. [22:32:27] (03PS1) 10Jean-Frédéric: Add configuration for object monuments in France [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235630 (https://phabricator.wikimedia.org/T111263) [22:35:23] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1599860 (10Andrew) Can you tell me more about how to reproduce? 
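(Editor's note: the checker described above boils down to a tiny web service that exercises NFS and reports the result over HTTP so Catchpoint can poll it. Here is a minimal sketch assuming Flask and a hypothetical canary path; this is not the actual checker.py from operations/puppet.)

```python
import time

from flask import Flask

app = Flask(__name__)
NFS_TEST_PATH = "/data/project/.checker/nfs-canary"   # hypothetical path

@app.route("/nfs/home")
def check_nfs_home():
    token = str(time.time())
    try:
        with open(NFS_TEST_PATH, "w") as f:
            f.write(token)
        with open(NFS_TEST_PATH) as f:
            ok = (f.read() == token)
    except (IOError, OSError):
        ok = False
    if ok:
        return "OK", 200
    return "NOT OK", 503      # a non-200 response makes the external check fail

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)
```

Catchpoint (or any HTTP monitor) is then pointed at the URL and alerts when the response is not a 200, which is the "kindof a hack" being described.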
[22:45:16] (03PS1) 10Jean-Frédéric: Collapse Unknown field report on one line [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235633 [22:45:35] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Collapse Unknown field report on one line [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235633 (owner: 10Jean-Frédéric) [23:04:05] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1600106 (10scfc) It failed for me in Toolsbeta by just creating a new (Precise) instance, but when I just tried to reproduce in the Tools project, it worked there. As @hashar's issue in T102121 dealt with an instance in... [23:14:27] 6Labs, 10Tool-Labs: Labs_lvm::Volume[separate-tmp] is noisy on execution hosts - https://phabricator.wikimedia.org/T109933#1600162 (10scfc) [23:14:27] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1600160 (10scfc) 5Resolved>3Open Unfortunately, that didn't last long :-) (or I tested only the "toolsbeta" part of it a few hours... [23:25:41] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1600224 (10scfc) I just created a new Trusty instance in Toolsbeta and that worked fine (but @hashar's issue in T102121 was on a Trusty instance). [23:35:45] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1600259 (10scfc) And now toolsbeta is gone as well :-). [23:57:14] Is there a table in the in production or tools database which stores the article category? For example good articles, featured articles? [23:57:59] I don't think those are concepts in MediaWiki. [23:58:07] Just user ideas. [23:58:35] Although... Maybe Wikidata implemented something like that? aude, do you know? [23:58:50] i think it does [23:58:53] you can kindof fake it with the categorylinks table [23:58:57] for enwiki [23:59:04] ashwinpp: you should talk to halfak or harej [23:59:13] ashwinpp: they've been doing a lot of work in the area and might know [23:59:20] Well yeah, if you can figure out the right category on each wiki [23:59:38] I see [23:59:42] Assuming they categorise them via MW
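(Editor's note: the "fake it with the categorylinks table" suggestion at the end of the log can be done directly against the enwiki replica. The sketch below assumes pymysql and the per-tool replica credentials file; the host and database names are assumptions and may differ.)

```python
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",                                  # assumed replica host
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        """SELECT page_title
             FROM page
             JOIN categorylinks ON cl_from = page_id
            WHERE cl_to = %s AND page_namespace = 0
            LIMIT 10""",
        ("Featured_articles",),
    )
    for (title,) in cur.fetchall():
        print(title)          # titles come back as bytes from the replica
```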