[00:00:28] (03CR) 10Legoktm: [C: 032] "Err, what? Details plz" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/235357 (owner: 10Greg Grossmeier) [00:00:44] (03Merged) 10jenkins-bot: Naming is hard [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/235357 (owner: 10Greg Grossmeier) [00:02:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [00:03:12] !log tools.wikibugs Updated channels.yaml to: 5f5fbb9243566dd512f1b1bf65ed60e5e08d6e92 Naming is hard [00:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [01:04:23] 6Labs, 10CirrusSearch, 6Discovery, 10wikitech.wikimedia.org, and 2 others: Wikitech CirrusSearch jobs throwing exceptions on silver - https://phabricator.wikimedia.org/T110635#1595565 (10EBernhardson) [01:14:52] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1595576 (10scfc) For the `labs/toollabs` repository, (after merging a patch) we create (at the moment identical) packages for Precise and Trusty, but do not deploy to... [01:23:48] petan: Seems botbot is down - bad gateway; https://botbot.wmflabs.org/ [01:23:55] https://bots.wmflabs.org/~wm-bot/searchlog/ is also 404 [01:52:28] tools-exec-1218 appears to be having problems, stuck job 11474, can't even delete it. https://tools.wmflabs.org/nagf/?project=tools#h_tools-exec-1218 shows way more wait I/O than other instances [02:38:27] 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1595626 (10chasemp) Thinking we could combine a bump in allowed clients with logging the request before denying at https://phabricator.wikimedia.org/diffusion/ODDY/browse/master/src/nc_p... [03:30:55] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [03:31:44] LGTM... [03:35:47] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 879430 bytes in 2.261 second response time [04:13:00] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [04:39:15] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) 3NEW [04:40:51] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) [04:40:53] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1595759 (10Harej) [04:51:22] is nfs slow? [04:53:13] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 22.22% of data above the critical threshold [0.0] [04:54:15] https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Labs+NFS+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report looks suspicious [05:33:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [08:30:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds [08:33:10] ^ intermittent [08:34:27] mark: is data/project/.system/gridengine/spool what's causing the SGE NFS load? [08:34:55] or is it all the *.out/*.err files written out? [08:35:46] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 879430 bytes in 2.327 second response time [08:40:25] Hi. 
If I execute a script on one of the exec servers (via qsub) and in the script I'm writing to a file, is it somehow buffered on the exec machine and then written to my tool acc (it's in public_html) on bastion? Or is it writing directly to the file in my tool's folder? [08:40:55] it writes directly to the file, but because it's NFS there can be some delay if you read from another host [08:41:35] also, consider using jsub (which takes care of several standard settings you have to pass manually with qsub) [08:46:20] alkamid: yes there is always a delay if that file is on nfs, it also matter how you write to the file, your program can be buffering, and also kernel is probably buffering it [08:47:42] valhallasw`cloud, petan, I'm writing with python's open(), I believe it reads the default buffer from the system (4096?) [08:48:25] what do you actually want to do? [08:48:51] the delay is there and you can't do much about it, I don't think there is any simple way to override it [08:49:58] if you want some faster communication between instances, eg. be able to push some information on exec machine and be able to immediately retrieve it on another machine you should probably use redis or something like that [08:51:59] petan, no, I've just run the script and after ~20 min there was no output in the file, but when I run it locally ~2 min is enough to fill in the buffer and write [08:52:12] so I just wanted to understand if it's normal or something's wrong [08:52:23] no 20 mins is not normal [08:52:35] alkamid: try logging in to the exec host and reading file contents there [08:52:56] qstat tells you the host, then just ssh [08:53:54] valhallasw`cloud: that sure is a large part of it yes [08:54:16] we should look into taking that off NFS, SGE supports that I saw [08:54:23] yes, sounds good [08:54:40] maybe that will also stop SGE from losing jobs *looks angrily at SGE* [08:54:56] possible yeah [08:55:00] i think it does an insane amount of locking now [08:55:05] still might not be enough ;) [08:56:44] 6Labs, 10Tool-Labs: Reduce SGE NFS usage - https://phabricator.wikimedia.org/T111158#1596167 (10valhallasw) 3NEW [08:56:49] ^ :-) [08:58:57] mark: eta on rebuild? [09:02:03] 10 days? :P [09:05:03] at 5MS/s? [09:07:18] * YuviPanda feeds paravoid stroopwafels with nutella [09:07:24] haha [09:07:49] although this is some strange Lidl's ripoff of nutella that isn't quite as good [09:08:42] anyway, off to the WMDE Office! [09:13:46] valhallasw`cloud, you suggest using jsub. Currently my jobs are scheduled with "qsub -N 'porzucone.py' -b y -q task -l h_rt=10:00:00 -l h_vmem=500M -l release=trusty -e $HOME/output/ -o $HOME/output/ $HOME/scripts/venv/bin/python $HOME/scripts/porzucone.py >/dev/null" [09:14:09] i suppose I should then drop "-b y -q task -l h_rt=10:00:00" ? [09:14:16] and the rest stays the same? [09:14:43] you don't really have to use jsub, there is no point if you want to specify all parameters yourself [09:14:58] jsub is just trying to make it more simple for people who don't want to mess up with qsub [09:15:02] it's just a wrapper around it [09:16:00] alkamid: jsub -N porzucone.py -mem 500M -l release=trusty -q $HOME/scripts/venv/bin/python $HOME/scripts/porzucone.py [09:16:06] although that will change where your output ends up [09:16:31] (namely in ~/porzucone.py.out and ~/porzucone.py.err instead of in a gazillion .o1233456 .e12332542 files) [09:17:28] valhallasw`cloud, was "-q" in your line intentional? 
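(Editor's note: the buffering question above, whether output written on an exec host shows up immediately in the tool's directory, comes down to Python's own userspace buffering plus NFS client caching. Below is a minimal sketch of forcing writes out early; it assumes plain CPython file objects and is an illustration only, not code from alkamid's script.)

```python
import os

def append_progress(path, line):
    """Append one line and push it toward the NFS server right away."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel / NFS client to write it out now
```

Even with the fsync, another host may briefly serve a cached view of the file, which matches the delay described in the conversation above.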
[09:17:40] sorry, that should be -quiet [09:17:58] that prevents output on success (but not on errors) [09:23:20] valhallasw`cloud, thanks. Coming back to my original problem, I use python's "with open(file, 'w') as f:" and then in the loop I'm writing to this file (see https://github.com/alkamid/wiktionary/blob/feature/py3k/porzucone.py#L39). Might the problem be that within the "with" scope, the file is still open, and therefore is not synced? [09:23:51] alkamid: it will definitely buffer data until the file is closed [09:24:25] alkamid: the typical solution for these kinds of issues is to write to porzucone.html.1, then move porzucone.html.1 over porzucone.html [09:24:36] (shutil.move) [09:24:37] valhallasw`cloud: couldn't you simplify your jsub line by putting python as #! in the script file and leaving -N out? [09:25:14] Eh, yeah. That's true as well. Needs +x on the file as well, though. [09:25:46] true [09:29:39] valhallasw`cloud, why is it different than local execution? On my machine I can just run the script and it will write to file every 4096 bytes (or whatever the buffer size is) and then I can read it [09:29:59] I mean I can read the output file [09:31:02] alkamid: I don't know what the nfs buffer size is, but it's likely larger than 4KB [09:32:31] ok, so I'll get the same behaviour, only with a lot more lag because NFS buffer is larger. Fair enough [09:32:38] thanks for your help [09:33:57] I think so, yes. [09:34:40] As noted, the best way to handle this is to write to a temp file first and then move (this is guaranteed to be atomic, so users get either the old or the new file) [09:36:42] ok, I'll do this [09:37:13] (it would really be cool if there was a with: handler for that, but I don't know if one exists) [09:54:10] 6Labs, 10Tool-Labs, 7user-notice: No file system on toollabs, unable to login, web service broken - https://phabricator.wikimedia.org/T110827#1596297 (10valhallasw) 5Open>3Resolved a:3valhallasw [09:54:36] (03PS1) 10Hashar: Nodepool database user/pass [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) [09:55:03] 6Labs, 10Tool-Labs, 7user-notice: No file system on toollabs, unable to login, web service broken - https://phabricator.wikimedia.org/T110827#1587427 (10valhallasw) The initial issue was resolved sunday afternoon (CEST), but I forgot to close the task at that point. [09:58:17] (03PS2) 10Hashar: Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) [09:59:39] (03CR) 10Jcrespo: [C: 031] Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) (owner: 10Hashar) [10:17:34] (03CR) 10Hashar: [C: 032 V: 032] Nodepool database pass placeholder [labs/private] - 10https://gerrit.wikimedia.org/r/235424 (https://phabricator.wikimedia.org/T110693) (owner: 10Hashar) [11:11:21] (03PS1) 10Jean-Frédéric: Update ru configuration [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235432 (https://phabricator.wikimedia.org/T110665) [11:12:21] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Update ru configuration [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235432 (https://phabricator.wikimedia.org/T110665) (owner: 10Jean-Frédéric) [11:14:40] 10Tool-Labs-tools-Other, 6Commons, 6Community-Tech, 6Multimedia: [AOI] Ceate a new DerivativeFX after the Toolserver shutdown - https://phabricator.wikimedia.org/T110409#1596423 (10Steinsplitter) >>! 
In T110409#1586685, @Snaevar wrote: > I have seen DerivativeFX at http://tools.wmflabs.org/derivative/deri1... [11:20:55] 10Tool-Labs-tools-Other, 6Commons, 6Community-Tech, 6Multimedia: [AOI] Create a new DerivativeFX after the Toolserver shutdown - https://phabricator.wikimedia.org/T110409#1596439 (10Aklapper) [11:43:01] i've frozen the rebuild a bit [11:43:07] it makes like no difference to load on that array [11:44:57] (03PS1) 10Jean-Frédéric: Template names in monuments-config are case-sensentive [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235438 (https://phabricator.wikimedia.org/T110665) [11:45:14] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Template names in monuments-config are case-sensentive [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235438 (https://phabricator.wikimedia.org/T110665) (owner: 10Jean-Frédéric) [12:04:57] 6Labs, 10Tool-Labs, 10Continuous-Integration-Config: Change sid pbuilder image name to 'unstable' - https://phabricator.wikimedia.org/T111097#1596558 (10akosiaris) In general I find it way preferable to refer to releases by their name (e.g. `jessie`, `potato`, `trusty`) than by their designation (e.g. `stabl... [12:19:57] (03PS1) 10Sitic: Fix notifications [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/235449 [12:20:27] (03CR) 10Sitic: [C: 032 V: 032] Fix notifications [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/235449 (owner: 10Sitic) [12:26:30] hi mark [12:26:43] valhallasw`cloud: around? need some help looking at a pwb based bot on tools which might be hammering NFS [12:26:55] semi [12:27:14] what's it doing? [12:27:19] mark: from the looks of it ralgisbot seems to be an agglomerate, running various different pywikibot commands all the time [12:27:25] and logging them to command.log [12:27:46] valhallasw`cloud: just lots of accesses, mark suspects it might be part of the reason that rebuilding is slow [12:27:48] YuviPanda: you can also ask jayvdb or xzise [12:27:58] if it's pwb itself, thatis [12:28:02] it's not clear whether that's a worst offender or anything [12:28:04] valhallasw`cloud: they don't have root, so can't see. [12:28:05] right [12:28:13] it just shows up on fatrace [12:29:09] (03CR) 10Yuvipanda: [C: 04-1] Add list-user-databases command (032 comments) [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [12:29:17] there's a whole set of jobs running under qstat -u tools.ralgisbot [12:29:56] yeah [12:30:03] and I'm reluctant to 'stop' them [12:30:19] ...it's using -putthrottle:0 for all bots, it seems [12:30:36] what does that do? [12:30:54] it disables write throttling [12:31:09] aah. I guess that's not reccomended? [12:31:15] which might be OK for some bots, but could explain large resource usage [12:31:22] and I guess that might also be responsible for it re-reading the .py files all the time? [12:31:26] uuuh [12:31:36] no, that should not happen [12:32:00] valhallasw`cloud: look at /data/project/ralgisbot/PWB/core/logs/command.log [12:32:12] valhallasw`cloud: I presume that's different python scripts being run on a tight-ish loop [12:32:30] looks like it [12:33:04] so basically they should throttle themselves some more? 
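(Editor's note: the atomic-write pattern suggested earlier in the log, write to a temporary name and then move it over the target, can be wrapped in the "with:" handler valhallasw`cloud wished for. This is a rough sketch assuming Python 3; it is not an existing Tool Labs helper.)

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def atomic_write(path, mode="w"):
    """Write to a temp file in the same directory, rename over <path> on success."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    f = os.fdopen(fd, mode)
    try:
        yield f
        f.close()
        os.replace(tmp, path)   # same effect as shutil.move within one filesystem
    except BaseException:
        f.close()
        os.unlink(tmp)
        raise

# usage:
# with atomic_write("public_html/porzucone.html") as f:
#     f.write(html)
```

Because the rename happens within one filesystem, readers only ever see the previous or the new complete file, never a half-written one.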
[12:33:08] no, that's seperate [12:33:27] I mean, both in the writes and also in their loop [12:34:14] what I think is happening [12:34:23] 6Labs, 10LabsDB-Auditor, 10MediaWiki-extensions-OpenStackManager, 10Tool-Labs, and 8 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1596647 (10Aklapper) p:5Normal>3Lowest [12:34:42] is that the jobs are failing (check the recently written .out files) [12:34:46] but being restarted immediately [12:35:04] aaah [12:35:52] valhallasw`cloud: qstat has start dates for them from a longish time ago, I don't know if that updates after a restart? [12:36:08] qmod -rj keeps the old start date [12:37:43] ah I see [12:37:55] valhallasw`cloud: which out files are you looking at? [12:37:59] there's a ton of them :| [12:38:12] ls -lh | grep 12:3 [12:38:29] bes-isbn.out for example [12:39:12] 6Labs, 10Tool-Labs: Reduce SGE NFS usage - https://phabricator.wikimedia.org/T111158#1596663 (10mark) Indeed, let's start by moving off the spool dir. [12:39:34] that's restarting once every 5 seconds [12:39:43] because the error doesn't actually change the exit code or something :|| [12:39:53] see e.g. /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273 [12:39:59] ok, let me stop those [12:40:52] :) [12:40:58] YuviPanda: might be a leftover from the NFS breakup actually [12:41:05] IOError because it can't write to a file? [12:41:20] valhallasw`cloud but if it is continuously restarting... [12:41:34] YuviPanda: sec. [12:41:37] ok [12:43:59] odd. The file descriptors of the `/bin/bash /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273` processes just point to the current NFS mount. (and 6301, on tools-exec-1206) [12:44:27] you mean files on the mount? [12:44:27] but if I run /data/project/ralgisbot/mrBes05.sh directly it doesn't crash? let me try again [12:44:31] or the mount itself [12:44:32] hmm [12:44:34] the files [12:44:37] with the webgrid [12:44:47] the access.log was incorrect (fd pointed to deleted file) [12:45:50] I can't reproduce any of it [12:45:51] argh :( [12:46:44] YuviPanda: or try just rescheduling the offending jobs.. [12:46:58] sigh :( [12:47:01] orrrr [12:47:02] hmm [12:47:14] I'm wondering if we can figure out what throws the IOError.... [12:47:28] strace! [12:47:44] is an option [12:47:45] let me do that [12:47:51] ok! [12:49:19] [pid 20637] write(2, "IOError", 7) = -1 ESTALE (Stale NFS file handle) [12:49:19] [pid 20637] write(2, ": ", 2) = -1 ESTALE (Stale NFS file handle) [12:49:20] [pid 20637] write(2, "[Errno 116] Stale NFS file handl"..., 33) = -1 ESTALE (Stale NFS file handle) [12:49:27] followed by [12:49:27] aha [12:49:31] so an -rj should fix things? [12:49:31] [pid 20638] _exit(0) = ? [12:49:35] which is obviously stupid [12:49:52] ok [12:49:55] so what happens is this [12:50:01] "/data/project/ralgisbot/PWB/core/pywikibot/userinterfaces/terminal_interface_base.py" wants to write to stdout (or stderr) [12:50:06] but stdout/stderr is a stale NFS handle [12:50:34] and then pwb doesn't handle this correctly and exit(0)s instead of exit(1) for some reason. Not sure, could also be a Python bug. [12:51:03] hmm, shouldn't an exception exit always be non-zero? [12:51:06] oh yeah, and writing to stderr also doesn't work because of the stale handle :P [12:52:35] right [12:52:39] I wonder how widespread this is :| [12:53:11] could actually be a bug in pwb.py [12:53:19] which wraps in all kinds of fun ways [12:54:17] but it really shouldnt [12:54:18] gah. 
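(Editor's note: the strace excerpt that follows shows even the traceback's write to stderr failing with ESTALE. Below is a hedged sketch of what a defensive top-level handler could look like; it illustrates the failure mode and is not pwb.py's actual code.)

```python
import errno
import sys

def run_bot():
    # stand-in for the real work; here it just fails the way the job did
    raise IOError(errno.ESTALE, "Stale NFS file handle")

if __name__ == "__main__":
    try:
        run_bot()
    except Exception as exc:
        try:
            sys.stderr.write("fatal: %s\n" % exc)
        except (IOError, OSError):
            pass            # stderr itself may be the stale NFS handle
        sys.exit(1)         # make sure the grid still sees a non-zero exit
```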
[12:54:58] valhallasw`cloud: want me to -rj all these jobs? [12:55:04] yeah, sure [12:55:07] I have the strace log saved [12:56:24] ok [12:56:30] YuviPanda: if I just have a test.py which does 'raise IOError', pwb.py returns 1 correctly [12:57:56] !log tools rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles [12:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:01:21] valhallasw`cloud: success! in that I got a in a log now :P [13:01:33] and the job dies? :P [13:01:35] because it should [13:01:54] valhallasw`cloud: I suppose so. but it's in the continous queue, so it gets restarted [13:02:08] not if exit code > 0, I thought? [13:02:39] or is it the other way around, it doesn't get restarted if exit code == 0? [13:03:04] I think it doesn't get restarted it exit code == 0 [13:03:13] which... makes me even more confused? [13:03:38] maybe the idea is it keeps restarting until it finishes succesfully :P [13:04:14] > The '-continuous' option ensures that the job will be restarted automatically until it exits normally with an exit value of zero, indicating completion. [13:04:17] valhallasw`cloud: ^ [13:04:22] right [13:04:42] then I don't get why I saw an _exit(0), because that would mean the job should have stopped [13:05:02] yes [13:05:09] I don't either :| [13:05:13] oh! [13:05:19] because there's actually a later exit_group(1) [13:09:23] so it's working correctly [13:09:34] but restarting every 5 seconds is a Bad Idea(TM) [13:09:41] yes [13:10:40] it's still constantly re-reading all of pwb into memory though [13:11:02] well, yes [13:11:05] because it's still failing [13:11:08] and rstarting all the time [13:11:13] right [13:11:16] just due to a different error [13:11:29] but now at least the .err should show something is wrong [13:12:07] !log suspended all jobs in ralgisbot temporarily [13:12:07] suspended is not a valid project. [13:12:16] !log tools suspended all jobs in ralgisbot temporarily [13:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:12:26] you can just check the relevant .err files? [13:12:30] and just suspend those? [13:12:38] no, so even after suspending them all [13:12:44] there's still constant reads... [13:12:47] so wtf [13:13:10] mark: ^ I suspended all the jobs, but even then there's constant reads of all the python files... [13:13:11] YuviPanda: I'm not sure what suspend means in SGE terminology [13:13:16] hmm [13:13:53] also, fun fun fun: there's a daily 'if my jobs aren't running, restart them' crontab [13:14:05] aaah [13:14:15] well, then, I feel less guilty about killing it all [13:14:25] but we'll just have the same problem [13:14:27] at 3 am [13:14:32] ...at 3 am. [13:14:37] is this what killed SGE the other nigth? [13:14:45] no, probably not, the jobs are older [13:15:58] !log deleted all jobs of ralgisbot [13:15:58] deleted is not a valid project. [13:16:03] !log tools deleted all jobs of ralgisbot [13:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:16:42] ok, so that shut them up [13:16:50] valhallasw`cloud: so looks like gridengine's suspend doesn't do much [13:17:05] mark: any improvements in the re-assembly speed? 
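(Editor's note: since a '-continuous' job that exits non-zero is resubmitted almost immediately, a crashing bot can flap every few seconds as described above. One way to damp that is to retry inside the job with a growing delay; the wrapper below is hypothetical, not an existing Tool Labs tool.)

```python
import subprocess
import sys
import time

def run_with_backoff(cmd, max_delay=300):
    """Re-run cmd until it exits 0, sleeping longer after each failure."""
    delay = 5
    while True:
        if subprocess.call(cmd) == 0:
            return 0               # clean exit: SGE will not restart the job
        time.sleep(delay)          # pace the retries instead of letting SGE flap
        delay = min(delay * 2, max_delay)

if __name__ == "__main__":
    # e.g. run_with_backoff(["python", "scripts/porzucone.py"])
    sys.exit(run_with_backoff(sys.argv[1:]))
```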
[13:17:48] I think suspend might be a thing for parallel jobs and stuff like that [13:17:50] interview [13:17:53] mark: ok [13:21:28] valhallasw`cloud: I'm going to put a notice on the maintainers' talk page [13:21:33] (y) [13:31:55] valhallasw`cloud: done [13:35:22] valhallasw`cloud: thanks for helping dig into it :) [13:35:26] yw [13:38:48] YuviPanda: don't really see any change [13:41:11] mark: hmm, so the biggest other thing I see is wikihistory doing funny looking things with certs [13:41:25] right, noticed that as well [13:42:40] mark: want me to investigate those as well? [13:42:50] if you can :) [13:42:54] would be good if we could find the cause of this [13:43:10] 6Labs, 6operations, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1596786 (10akosiaris) Great! Let's hope it's that indeed. [13:43:17] I can also poke the iftt people and ask them to see if they can switch to Redis for cache [13:43:18] instead of NFS [13:43:31] (I had told them earlier they can switch if they want to, but didn't force it) [13:43:44] right [13:44:28] mark: outside of that, not sure what we can do. zoomviewer requires a large amount of cache storage so we can't put them on local storage anywhere [13:44:34] and outside of those I don't see any big single-users [13:44:41] yep [13:44:43] (and gridengine itself, ofcourse) [13:44:55] if we could just stop that ;) [13:45:04] hehe [13:49:53] i also still think tools-exec-1403 is suspicoius [13:50:02] it seems to do only locking operations [13:50:22] very different from the other hosts [13:50:26] can we temp disable that host? [13:50:47] there's a single job running there -- job id 1072678 from tools.hat-collector [13:52:01] which is an IRC bot [13:52:08] I'll reschedult it [13:53:16] great [13:53:18] !log tools tools-exec-1403 does lots of locking opreations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job. [13:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:53:42] does that stop the locking operations? [13:53:57] let's see [13:54:09] not yet no [13:54:14] could be SGE itself? [13:54:17] or are they now happening from @tools-exec-1408 [13:54:18] maybe [13:54:28] just looking at 1403 [13:54:41] can you stop SGE at 1403? [13:54:56] or reboot the instance for all I care :) [13:55:00] I'll restart execd [13:55:27] lots of stale nfs client ids [13:55:31] !log tools restarted gridengine_exec on tools-exec-1403 [13:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:55:48] no improvement [13:55:59] ok, let's reboot the host then [13:56:03] [1298282.198826] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:03] [1303376.591049] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:04] [1304531.280859] NFS: nfs4_reclaim_open_state: Lock reclaim failed! [13:56:07] that may have something to do with it :) [13:56:07] yeah [13:56:19] eh, let me check one thing first [13:57:08] pacct is huge again [13:57:26] 10.64.37.10-man5 <-- hey, I remember seeing that :-p [13:57:55] stopped now [13:58:37] !log tools rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load [13:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [13:59:11] no reduction in i/o, but certainly a lot less nfs packets ;) [13:59:30] elee_: around? 
[14:00:54] 6Labs, 10Tool-Labs, 3Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1596838 (10valhallasw) This happened again on tools-exec-1403; probably a hangover from the NFS breakdown last sunday (T110827). [14:03:22] oh, yeah, I was going to upstream that [14:07:47] valhallasw`cloud: where are jobscripts stored again? [14:07:58] we need to move scratch to another raid array [14:08:03] YuviPanda: /var/spool/gridengine/execd/tools-exec-1206/job_scripts/331273 [14:08:09] currently both tools and scratch are on the same one, and both are the busiest filesystems [14:08:10] but only on that specific host, apparently [14:08:29] valhallasw`cloud: waaa, isnt' /var/spool/gridengine an NFS mount? [14:08:30] 6Labs, 10Analytics, 10Labs-Infrastructure, 3Labs-Sprint-108, 5Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1596854 (10Ottomata) @ellery fyi, you can do this now: https://wikitech.wikimedia.org/wiki/Analytics/FAQ#How_d... [14:09:01] YuviPanda: no.. /var/log/gridengine is [14:09:04] er, /var/lib [14:09:07] aaah [14:09:10] i missed the spool [14:09:11] ok [14:09:25] so actually only the master messages are on NFS I think [14:09:48] 271 0.79988 smallwiki_ tools.wikihi Rr 08/17/2015 15:04:37 continuous@tools-exec-1216.eqi 1 [14:10:05] cat: /var/spool/gridengine/execd/tools-exec-1216/job_scripts/271: No such file or directory [14:10:10] (on exec-1216) [14:10:24] .... [14:10:37] all I want to do is to find the process... [14:10:40] of a particular job [14:10:59] ? [14:11:06] ps aux | grep 271? [14:11:09] valhallasw`cloud: so I can't find the jobscript [14:11:18] that's the gridengine jobid [14:11:19] no? [14:11:21] eee [14:12:36] eeeh. [14:13:15] there are no tools.wikihistory jobs on exec-1216 (uid = 51512) [14:13:49] yeah [14:13:51] so... [14:13:53] qstat is lying? [14:14:29] mark: is the resync moving faster without that exec host? [14:14:37] nope [14:14:43] it's really weird, nothing affects it [14:14:59] YuviPanda: pretty sure SGE is lying :| [14:15:09] so the wikihistory thing seems to be http://www.mono-project.com/docs/faq/security/ in that it is auto-accepting SSL certs and storing them on NFS [14:15:13] instead of using the system certificate store [14:15:20] from what I can tell [14:15:27] I can't strace the process because I can't find where the process is [14:16:31] valhallasw`cloud: the number is also fairly low... [14:16:33] YuviPanda: only 108048 108049 465830 465831 465832 465833 seem to be running anywhere :| [14:16:35] maybe it's been lying for a long time [14:16:40] or just recently [14:16:43] job number reset... [14:16:57] so [14:17:23] 271 (17 aug) 974 (31 aug) 80114 80115 and 1342218 are missing?! [14:17:49] 465833 is constantly restarting [14:17:56] like, I can't strace it because the pid keeps changing [14:17:57] lol. [14:18:03] YuviPanda: strace the parent process with -f [14:18:17] the /bin/bash gridengine/whatever/465833 process [14:19:56] valhallasw`cloud: aaahaaaaa [14:20:02] valhallasw`cloud: so it's actually a php job that shells out to mono... [14:21:18] oh, 974 is webgrid so probably fine (I only checked -exec-12* hosts) [14:22:04] maybe there isn't always a job_script? 
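(Editor's note: to answer "all I want to do is to find the process of a particular job" without trusting qstat, one can look for the shell whose command line ends in the job_scripts path quoted above. The sketch below assumes the job is running on the host you are logged into; it is not an existing utility, and the found PID can then be traced with strace -f.)

```python
import os
import sys

def find_job_pids(jobid):
    """Return PIDs whose command line references .../job_scripts/<jobid>."""
    suffix = ("/job_scripts/" + str(jobid)).encode()
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                argv = [a for a in f.read().split(b"\0") if a]
        except (IOError, OSError):
            continue                      # process exited or is unreadable
        if any(arg.endswith(suffix) for arg in argv):
            pids.append(int(entry))
    return pids

if __name__ == "__main__":
    print(find_job_pids(sys.argv[1]))     # then: strace -f -p <pid>
```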
[14:44:57] valhallasw`cloud: hmm, that'd be strange [14:45:03] valhallasw`cloud: but 271 doesn't seem to exist anywhere [14:45:07] :/ [14:45:25] valhallasw`cloud: but a job with same name does exist both in gridengine and in reality [14:57:37] 6Labs: Request for Labs project LibraryBase - https://phabricator.wikimedia.org/T111141#1597094 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done! [15:00:02] hey yalls, my labs instances ahve disappeared again [15:00:06] i have logged out and logged back in [15:02:33] ottomata: have you howled at the moon while planting turmeric on your hair thrice? [15:02:43] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597129 (10Bamyers99) 3NEW [15:03:26] ottomata: in other words, have you tried removing yourself from the project and re-adding yourself? [15:03:38] mark: ^ perhaps another instance with NFS issues/fallout? [15:03:52] Yes YuviPanda, I also did the thing where I diluted essence-of-labs 100 fold and then used that to water my head tumeric [15:05:11] YuviPanda: that worked :) [15:06:52] ottomata: homeopathy for labs! [15:06:57] JohnFLewis: looking [15:09:36] (03CR) 10Tim Landscheidt: Add list-user-databases command (032 comments) [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [15:09:40] I'm not good at jsub stuff, just wondering if jsub-sent crontab doesn't send me an email? (When I used crontab without jsub it used to send me a mail) [15:09:53] (03PS2) 10Tim Landscheidt: Add list-user-databases command [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) [15:10:45] (03CR) 10Tim Landscheidt: [C: 04-2] "(Until T110939 is resolved.)" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/234934 (https://phabricator.wikimedia.org/T91231) (owner: 10Tim Landscheidt) [15:11:08] revi: jsub will output the job id if you don't pass -quiet [15:11:14] revi: so it should just send an e-mail [15:11:38] I don't have -quiet but spam/inbox is quiet :p [15:14:30] which tool is this? [15:15:15] tools.revibot [15:15:59] 0 0 * * * /usr/bin/jsub -N cron-tools.revibot-1 -once -quiet sh /data/project/revibot/red-kowiki.sh [15:16:03] ^ looks like -quiet to me [15:17:22] oh meh [15:17:28] I didn't look it closely [15:17:32] duh [15:17:37] * revi needs a new glasses [15:18:43] * revi slaps himself and go to bed [15:20:55] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597229 (10yuvipanda) Ok, so gridengine says that job is running there, but is totally false - that job is running nowhere... [15:20:56] valhallasw`cloud: ^ another lost job [15:21:06] :(( [15:22:00] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597236 (10yuvipanda) The host itself might be ok - it has other jobs running. qdel -f the jobid and re-submit it? [15:22:08] now I'm worried how much jobs have been lost [15:23:33] hm, that seems to happen to my jobs from time to time [15:34:17] 6Labs, 10Tool-Labs: tools-exec-1218 problem - https://phabricator.wikimedia.org/T111193#1597329 (10scfc) 5Open>3Resolved a:3scfc On the host, the job was stuck on device I/O: ``` scfc@tools-exec-1218:~$ ps auxfwww USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND […] sgeadmin 421... 
[16:16:34] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1597759 (10Andrew) @scfc, there is yet another patch in place. If the problem is still present can you ping me on IRC so that I can d... [16:23:10] 6Labs, 10Labs-Infrastructure, 6operations, 7Monitoring: labstore monitoring: NRPE: Command 'check_cleanup-snapshots-labstore-state' not defined - https://phabricator.wikimedia.org/T111211#1597821 (10Dzahn) [16:24:16] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 25.00% of data above the critical threshold [0.0] [16:36:35] 10Wikibugs: Assigning a task to someone with a name beginning with a number causes IRC colour issues - https://phabricator.wikimedia.org/T111214#1597912 (10Krenair) 3NEW [16:39:20] 6Labs, 3Labs-sprint-112: Restore some files from /home/gwicke - https://phabricator.wikimedia.org/T110698#1597939 (10matmarex) [16:41:18] 6Labs, 3Labs-sprint-112: Update openstack docs for new command-line format - https://phabricator.wikimedia.org/T110912#1597952 (10Andrew) 5Open>3Resolved I updated the sections related to new image creation. I don't see a lot of others. [17:06:39] 6Labs, 6operations, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1598167 (10Andrew) I no nothing about implementation, but I'd certainly prefer that search be self-contained on silver. [17:23:18] YuviPanda: Did we decide on Monday to go ahead with catchpoint stuff in https://phabricator.wikimedia.org/T107058? Or are we going to add paging to shinken instead? [17:34:15] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [17:34:45] andrewbogott: Hi! Re https://phabricator.wikimedia.org/T110629, after deleting the wikitech cookies and logging in again, https://wikitech.wikimedia.org/wiki/Special:NovaInstance still lists no instances for me. [17:36:14] scfc_de: ok, stay tuned... [17:41:08] scfc_de: I’m logging you out, please stay logged out until I give the ok [17:42:51] andrewbogott: k [17:44:16] scfc_de: it seems SGE lost quite some other jobs as well :/ we're not quite sure how large the issue is, or how to measure that effectively... [17:46:16] e.g. tools.wikihistory has a few jobs (jid 271, 80114, 80115, 1342218) that are missing their SGE supervisor process. I haven't checked whether the actual processes are still running somehow [17:46:20] valhallasw`cloud: "Lost" = Jobs simply ended? Or master lost track of them, but they kept on running locally? [17:46:26] no, it's in qstat [17:46:28] but not running [17:46:52] it's probably another issue caused by the NFS outage [17:50:21] YuviPanda and I were unable to figure out what's going on... [17:54:39] scfc_de: Hi! I'm trying to fix https://en.wikipedia.org/wiki/Wikipedia:Database_reports but there's some error with the setup on the labs instance. It's looking in the wrong directory, and I can't figure out why. [17:56:11] valhallasw`cloud: Reminds me very much of https://phabricator.wikimedia.org/T95094. [17:56:46] scfc_de: ok, that took forever. Try once more? [17:56:50] Niharika: I haven't touched the dbreps tool in months, so I might not be the best contact. What's the problem? [17:57:21] mmm. This one has usage 1: cpu=00:00:00, mem=0.01683 GBs, io=0.00011, vmem=436.898M, maxvmem=436.898M, so it's not all zeros. 
But that might just be because it was resubmitted at some point [17:57:37] scfc_de: Scripts are failing with "permission denied to access to access tool server table" when I'm not even using that table in the script. [17:57:47] andrewbogott: I logged in again, and now the instances are showing. So it seems to work. [17:57:56] all right! [17:58:14] So… all I did (just now) was purge everything, log everyone out, and restart memcache [17:58:40] scfc_de: Anything to do with setup.py perhaps? [17:58:46] Which is the same thing I did on Wednesday. I can only imagine that there was some race that messed with people who logged in in the middle of the purge... [17:59:11] Well, and/or an interaction with the bug that https://gerrit.wikimedia.org/r/#/c/235469/ fixed [17:59:23] scfc_de: anyway, thank you for your patience with all this. [17:59:48] Niharika: I'm sorry, I couldn't tell. Could you file a bug, then others (and maybe I :-)) could take a more detailed look? [18:00:14] scfc_de: Sure. I could file one. Thanks. [18:00:32] andrewbogott: No problem. Is the purge now permanent or will it hit again in a few weeks? [18:00:39] hi scfc_de :) Nice to see you on IRC again :D [18:00:59] YuviPanda: I'll be off soon again :-). [18:01:09] scfc_de: yeah, figured. nice to see you on anyway. [18:01:25] scfc_de: reminded me to merge a bunch of your patches. sorry about falling behind on that [18:01:31] scfc_de: I’m not sure. Things should be stable now, but I might have to mess with login tokens again on Wednesday when/if I upgrade keystone to Kilo. That needs more reasearch. [18:01:34] andrewbogott: (Because I think the error is related to usage pattern and I use wikitech more/in different ways than other users.) [18:06:34] andrewbogott: re: catchpoint vs icinga, those are orthogonal [18:06:50] andrewbogott: mark wanted icinga alerts for NFS to be paging all of ops, which we don't have atm [18:06:59] andrewbogott: unrelated to rest of the catchpoint stuff [18:08:09] YuviPanda: ok… I remember offering to do something paging-related during the meeting but can’t remember what :( Was it the nfs-paging thing? [18:08:37] andrewbogott: I am just finishing up the paging thing, you offered to do the catchpoint checks for the other stuff (LDAP, puppetmaster, etc) [18:08:52] andrewbogott: which was the ticket you linked me to [18:08:54] oh, ok, great! [18:09:26] andrewbogott: there's checker.py in the puppet repo that's python code that does the checking, and then you just add a HTTP check to catchpoint [18:09:29] And… all the times that we keep saying ‘checkpoint’ it’s just a typo meaning ‘catchpoint’ right? [18:09:34] Or is checkpoint an actual thing? [18:09:54] andrewbogott: haha, yes, yes, typo [18:09:58] ok, cool [18:10:04] well checkpoint tbf is a real thing :) [18:10:06] just not for here ;0 [18:10:13] YuviPanda: for http://www.projectcalico.org/calico-networking-for-kubernetes/ [18:10:19] 6Labs, 3Labs-Sprint-108, 3Labs-Sprint-109: Have catchpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1598519 (10Andrew) [18:10:45] are you thinking then taht toollabs move off of openstack entirely and onto some hardware designated for kubern8 [18:10:58] which networking wise is tbd on setup? [18:11:01] chasemp: no that can't happen before we have labs on real hardware [18:11:26] chasemp: but yeah, that's a long term possibility, provided we maintain the same set of access. Esp. with networking, k8s seems much better off run on bare metal, maybe. 
[18:11:40] so what's the thought now for kubernates? [18:11:43] chasemp: but I ugess we can't do calico on top of nova network since it requires BGP [18:12:00] chasemp: flannel for overlay network, in the toollabs project [18:15:01] 6Labs, 10Tool-Labs: Labs_lvm::Volume[separate-tmp] is noisy on execution hosts - https://phabricator.wikimedia.org/T109933#1598529 (10Andrew) [18:15:01] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1598527 (10Andrew) 5Open>3Resolved The above patch plus a complete purge of keystone tokens, wikitech tokens, and a restart of mem... [18:16:57] 6Labs, 10Tool-Labs, 5Patch-For-Review: sql script does not accept wildcards as parameter - https://phabricator.wikimedia.org/T75595#1598540 (10scfc) 5Open>3Resolved Removed. [18:20:21] link to flannel? [18:22:07] chasemp: github.com/coreos/flannel [18:28:46] hi i'm ashok. [18:29:27] I need to change my instance shell account name "ashokvigneshk" instead of "ahokvigneshk". [18:30:07] Is it possible to change if not? What can I do to change? [18:34:19] Akki: We don’t generally support the renaming of users, as it has all kinds of weird side-effects. [18:34:28] If you want to create a new account, I can delete the old one. [18:35:05] What about my wikipedia username? [18:35:26] your wikipedia username is 'Ashok Vignesh K’? [18:35:37] Yes [18:35:51] Have you used labs quite a bit in the past, or are you a new user? [18:36:02] If you haven’t logged in anywhere then renaming isn’t so bad. [18:36:31] I'm new user. [18:36:41] ok. Well, let me try to rename, we’ll see what happens :) [18:37:25] o_o [18:37:27] linux shell rename [18:37:29] * FastLizard4 ducks and covers [18:37:59] If you delete my username presently after that i create my username as same. Is it possible? [18:52:35] * Akki slaps Akki around a bit with a large fishbot [18:52:59] Akki: ok, I’ve changed your shell name… we’ll see what breaks. [19:01:20] Akki: can you tell me exactly what you need to do? [19:05:09] ssh access is usually handled with this: https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29 [19:05:15] Unless you’re running windows, then it’s... [19:05:37] https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_a_graphical_file_manager [19:25:39] YuviPanda: looking at https://portal.catchpoint.com/ui/Content/Tests/TestModuleList.aspx, it looks to me like we already have a wikitech check. Is https://phabricator.wikimedia.org/T107457 something else? [19:25:43] 6Labs, 10Salt: clean up old ec2id-based salt keys on labs - https://phabricator.wikimedia.org/T103089#1598752 (10ArielGlenn) [19:26:00] 6Labs, 10Salt, 6operations: salt does not run reliably for toollabs - https://phabricator.wikimedia.org/T99213#1598755 (10ArielGlenn) [19:27:00] andrewbogott: oh, not sure who set that up [19:30:09] andrewbogott: where's the wikitech check? [19:30:33] YuviPanda: click ‘internal tools' [19:30:42] andrewbogott: aha! cool [19:30:46] andrewbogott: yes, I just didn't know [19:31:34] Are we using catchpoint for pages at all, or just for stats? 
[19:32:47] 6Labs, 3Labs-Sprint-108, 3Labs-Sprint-109: Have catchpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1598801 (10Andrew) [19:32:47] 6Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1598798 (10Andrew) 5Open>3Resolved a:3Andrew Looks like this is done already. [19:33:38] andrewbogott: supposedly for pages as well but that didn't work last time [19:33:49] ok, I’ll skip that for now [19:34:03] andrewbogott: yeah, let's skip pages and get numbers [20:08:01] 6Labs: Create a catchpoint check for labs puppetmaster - https://phabricator.wikimedia.org/T107456#1599136 (10Andrew) [20:09:00] YuviPanda: can you give me a quick explanation for how something like ^^ would work? Using catchpoint for an internal service seems unobvious [20:09:14] andrewbogott: ah, look at checker.py in operations/puppet [20:09:35] andrewbogott: tools-checker.wmflabs.org/nfs/home for example, writes to a file on NFS, reads it back, and reports OK if it can do that [20:09:43] andrewbogott: and that is done by checker.py [20:10:14] How do things get communicated to catchpoint? [20:10:20] or is that what checker.py does? [20:10:54] andrewbogott: no checker.py exposes the web HTTP endpoint [20:10:59] andrewbogott: and catchpoint hits that every few minutes [20:11:15] andrewbogott: if you look at the NFS check in catchpoint it just hits that URL I pointed out [20:16:47] ah, I see, ok. [20:17:00] “OK” :) [20:17:17] andrewbogott: yeah : [20:17:19] :) [20:17:25] andrewbogott: so it's kindof a hack [20:17:48] It’s at least a simple one! [20:18:23] andrewbogott: yup [20:18:39] andrewbogott: I had a much more complicated solution in mind in the beggining, but talking to ori convinced me this was a much better / simpler way [20:40:31] YuviPanda: can you clarify checker.py vs. toolschecker.py? [20:41:01] Hi Krenair: another quick request here... would it be possible to add cwdent as a CentralNotice admin on the beta cluster too please? He's still pretty new on the FR team, and has been helping out a bit w/ CentralNotice recently :) [20:41:14] andrewbogott: ugh, yes, toolschecker.py [20:41:14] sorry [20:41:17] yep [20:41:20] AndyRussG, doing [20:41:27] Krenair: fantastic! many thanks [20:41:33] YuviPanda: ok, that makes more sense :) [20:41:41] Krenair: his labs user name is cdentinger [20:41:51] thanks Krenair! [20:42:59] http://meta.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log/rights&page=Cdentinger [20:46:33] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1599258 (10scfc) This is still happening for me. [22:32:27] (03PS1) 10Jean-Frédéric: Add configuration for object monuments in France [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235630 (https://phabricator.wikimedia.org/T111263) [22:35:23] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1599860 (10Andrew) Can you tell me more about how to reproduce? 
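(Editor's note: the checker described above boils down to a tiny web service that exercises NFS and reports the result over HTTP so Catchpoint can poll it. Here is a minimal sketch assuming Flask and a hypothetical canary path; this is not the actual checker.py from operations/puppet.)

```python
import time

from flask import Flask

app = Flask(__name__)
NFS_TEST_PATH = "/data/project/.checker/nfs-canary"   # hypothetical path

@app.route("/nfs/home")
def check_nfs_home():
    token = str(time.time())
    try:
        with open(NFS_TEST_PATH, "w") as f:
            f.write(token)
        with open(NFS_TEST_PATH) as f:
            ok = (f.read() == token)
    except (IOError, OSError):
        ok = False
    if ok:
        return "OK", 200
    return "NOT OK", 503      # a non-200 response makes the external check fail

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8888)
```

Catchpoint (or any HTTP monitor) is then pointed at the URL and alerts when the response is not a 200, which is the "kindof a hack" being described.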
[22:45:16] (03PS1) 10Jean-Frédéric: Collapse Unknown field report on one line [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235633 [22:45:35] (03CR) 10Jean-Frédéric: [C: 032 V: 032] Collapse Unknown field report on one line [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/235633 (owner: 10Jean-Frédéric) [23:04:05] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1600106 (10scfc) It failed for me in Toolsbeta by just creating a new (Precise) instance, but when I just tried to reproduce in the Tools project, it worked there. As @hashar's issue in T102121 dealt with an instance in... [23:14:27] 6Labs, 10Tool-Labs: Labs_lvm::Volume[separate-tmp] is noisy on execution hosts - https://phabricator.wikimedia.org/T109933#1600162 (10scfc) [23:14:27] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1600160 (10scfc) 5Resolved>3Open Unfortunately, that didn't last long :-) (or I tested only the "toolsbeta" part of it a few hours... [23:25:41] 6Labs: Puppet errors on newly created instances - https://phabricator.wikimedia.org/T100108#1600224 (10scfc) I just created a new Trusty instance in Toolsbeta and that worked fine (but @hashar's issue in T102121 was on a Trusty instance). [23:35:45] 6Labs, 10wikitech.wikimedia.org, 3Labs-sprint-112, 5Patch-For-Review, and 3 others: Can't list instances on Special:NovaInstance - https://phabricator.wikimedia.org/T110629#1600259 (10scfc) And now toolsbeta is gone as well :-). [23:57:14] Is there a table in the in production or tools database which stores the article category? For example good articles, featured articles? [23:57:59] I don't think those are concepts in MediaWiki. [23:58:07] Just user ideas. [23:58:35] Although... Maybe Wikidata implemented something like that? aude, do you know? [23:58:50] i think it does [23:58:53] you can kindof fake it with the categorylinks table [23:58:57] for enwiki [23:59:04] ashwinpp: you should talk to halfak or harej [23:59:13] ashwinpp: they've been doing a lot of work in the area and might know [23:59:20] Well yeah, if you can figure out the right category on each wiki [23:59:38] I see [23:59:42] Assuming they categorise them via MW
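(Editor's note: the "fake it with the categorylinks table" suggestion at the end of the log can be done directly against the enwiki replica. The sketch below assumes pymysql and the per-tool replica credentials file; the host and database names are assumptions and may differ.)

```python
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",                                  # assumed replica host
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

with conn.cursor() as cur:
    cur.execute(
        """SELECT page_title
             FROM page
             JOIN categorylinks ON cl_from = page_id
            WHERE cl_to = %s AND page_namespace = 0
            LIMIT 10""",
        ("Featured_articles",),
    )
    for (title,) in cur.fetchall():
        print(title)          # titles come back as bytes from the replica
```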