[00:12:50] 6Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118, 10labs-sprint-119: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1968161 (10Andrew) a:3Andrew [00:15:09] 6Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118, 10labs-sprint-119: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1968169 (10Andrew) [00:34:29] 6Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 6operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1968235 (10bd808) [00:34:31] 6Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure: Setup real ssl certs for Beta Cluster using a restricted project - https://phabricator.wikimedia.org/T75919#1968234 (10bd808) [00:55:24] (03CR) 10Tim Landscheidt: [C: 032] "Did not test, but looks good enough to me." [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 (owner: 10BryanDavis) [03:27:45] YuviPanda: what did you do today... [03:28:45] a continuous job started with -once has two running processes right now [03:29:04] job ID 2176874 state=Rr and job ID 2695870 state=r [03:31:59] 6Labs, 10Tool-Labs: Tool Labs: jsub starts multiple instances of tasks declared as "once" - https://phabricator.wikimedia.org/T62862#1968447 (10liangent) I got hit by this again after some NFS issue / maintenance: ``` job-ID prior name user state submit/start at queue... [03:56:48] liangent: happened to me twice I think [07:45:55] hi, who knows how to change the nodejs version on betacluster's sca01/02 [07:47:56] yurik: Marko and Alex have started to upgrade sc* to nodejs 4.2 (scb is already), better check with them [07:50:00] moritzm, scb ? there are no such instances in labs, are you talking about production? [08:06:04] yurik: yeah, they've been doing that in production, so best check with them wrt the updates in beta (so that it doesn't differ in unexpected ways) [08:07:07] moritzm, thx, but it already is different - since there is no node4 in the beta cluster [11:02:26] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969175 (10Phe) 3NEW [11:07:50] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969222 (10valhallasw) ``` tools-webgrid-lighttpd-1206 login: [418800.440736] INFO: task lighttpd:28011 blocked for more than 120 seconds. [418800.443108] "echo 0 > /proc/sys/... [11:10:58] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969229 (10Vituzzu) I was about to notify a guc outage, I suppose that's the reason for it. [11:11:14] here's the problem [11:12:55] 6Labs, 10Tool-Labs: user `marcmiquel` gzipping 41G file on NFS - https://phabricator.wikimedia.org/T124877#1969242 (10valhallasw) 3NEW [11:40:48] Hey guys, can I request deletion of a labs tool of mine and its associated files and configuration? [11:43:33] request deletion? [11:43:57] you probably have the technical ability to do it yourself, but why? [11:45:54] Krenair: Yes. Oh, how? [11:47:05] tool labs tool? [11:47:27] Yes [11:47:44] For an unknown reason, the log of that specific tool shows it is somehow unable to be restarted after breakages.
My other tools don't have that issue [11:48:22] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Can_I_delete_a_tool.3F <= admin permissions are required [11:48:43] what's the error? [11:49:09] zhuyifei1999_: my logs don't show any internal errors from my code [11:49:52] ? [11:50:43] 2016-01-21 20:32:55: (server.c.1558) server stopped by UID = 0 PID = 14798 [11:50:43] 2016-01-24 12:04:59: (log.c.166) server started [11:51:32] so a web tool. lighttpd? [11:51:41] Yes, just a simple one [11:52:04] I am not sure why it is not auto restarting [11:52:27] My other tools are able to [11:52:43] and even after you restart it and it crashes again? [11:53:27] if you only need auto restarts you can use the webservicemonitor or something (I don't remember the name) [11:53:31] Well, it is really not a crash of my script [11:53:53] I mean lighttpd crashing [11:54:01] 2016-01-21 19:55:13: (server.c.1558) server stopped by UID = 0 PID = 1227 [11:54:01] 2016-01-21 20:07:28: (log.c.166) server started [11:54:35] The other tool, without any configuration, is able to. I just wonder if recreating the tool may fix the issue [11:55:45] <_joe_> ebraminio: did you open a phab ticket about this issue? [11:57:50] _joe_: honestly don't know what to write there [11:58:00] no [11:58:30] <_joe_> ebraminio: you can go to phabricator.wikimedia.org, you can login with your labs login I guess, and open a ticket [11:58:57] <_joe_> add the Tool-Labs tag to it [11:59:02] <_joe_> and describe your issue [11:59:25] <_joe_> that might get the attention of one of the toollabs admins more easily [12:07:33] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969385 (10Ebraminio) 3NEW [12:07:50] _joe_, zhuyifei1999_: https://phabricator.wikimedia.org/T124884 [12:08:28] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969394 (10Ebraminio) [12:12:04] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969413 (10zhuyifei1999) [12:35:51] Hello! [12:36:24] I'm a new native French speaker [12:37:57] Discovering labs and trying to bring a project about foreign vocabulary to fr.wikiversity [12:38:07] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969461 (10Ebraminio) [12:44:31] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969466 (10valhallasw) Running `webservice stop` `webservice start` regenerated the `service.manifest` file in `/data/project/linkstranslator`. I'm not sure why that file was missing --...
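For context on the fix valhallasw describes in that last comment: on Tool Labs, the service watchdog only restarts web services that are declared in a tool's `service.manifest`, and a full stop/start cycle rewrites that file. A minimal sketch of the recovery (the tool name here is hypothetical):

```
$ become mytool                               # hypothetical tool name
$ webservice stop
$ webservice start                            # rewrites service.manifest in the tool home
$ cat /data/project/mytool/service.manifest   # should now describe the running web service
```

As Ebraminio notes later in the log, `webservice restart` apparently did not recreate the missing file, so a separate stop followed by start is the safer sequence.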
[12:45:24] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969474 (10valhallasw) 5Open>3Resolved a:3valhallasw [12:48:07] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969482 (10valhallasw) [12:48:09] 6Labs: Instances locking up randomly - https://phabricator.wikimedia.org/T121998#1969483 (10valhallasw) [12:48:14] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1969481 (10valhallasw) [13:23:42] hi everybody [13:25:48] I'm reading around the Labs tools list [13:26:30] Trying to check if a project similar to the one I'm working on already exists [13:27:20] and discovering a few fr users [13:27:47] A quick hello to them! [15:10:15] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969784 (10intracer) 3NEW a:3intracer [15:14:31] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969809 (10intracer) [15:24:39] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969835 (10intracer) [15:25:07] (03PS1) 10Subramanya Sastry: ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [16:02:47] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969975 (10Ebraminio) Yes. Actually I was using a node.js solution before and then converted this to a simple PHP code. So is it going to be okay from now on? That would be very nice :)... [16:10:58] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969998 (10valhallasw) 5Open>3Resolved a:3valhallasw As long as there is a `service.manifest`, everything should be OK :-) [16:37:56] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1970063 (10Ebraminio) Thank you :) Actually I was using `webservice restart` and that probably can explain why I didn't have the file, probably webservice script worth to be fixed in a
[16:49:36] (03PS2) 10Subramanya Sastry: ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:02:34] (03CR) 10Jcrespo: [C: 04-1] ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:06:47] chasemp: YuviPanda I'm preparing a plan @ https://etherpad.wikimedia.org/p/toollabs20160127 [17:07:15] (03PS3) 10Subramanya Sastry: ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:10:24] (03CR) 10Jcrespo: [C: 04-1] ruthenium services: Add testreduce::mysql password for db access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:11:37] (03PS4) 10Subramanya Sastry: ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:12:42] (03CR) 10Jcrespo: [C: 032] ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:13:03] will submit this, which will allow testing the other [17:13:53] valhallasw`cloud: what's the thought behind loading the db from another directory? [17:14:12] chasemp: I'm not sure what happens if you reload to the same file in the same directory [17:14:26] basically, I want to be sure it uses an empty file to load the data [17:15:23] (03CR) 10Jcrespo: [V: 032] ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:15:44] valhallasw`cloud: my thinking was to do a db_dump to a file and remove the existing queue db [17:15:46] and rebuild it [17:15:47] and see [17:15:49] and if not that [17:16:01] then just remove the db file and start fresh (and a new one gets built) [17:16:09] (03PS1) 10Merlijn van Deen: Report labs/private to #wikimedia-operations [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/266769 [17:16:43] chasemp: sure, that also works [17:17:04] I did that the last round in testing to see how things would cope [17:17:22] and it seemed sane and even w/ a dump and rebuild it md5'd out differently etc [17:18:23] I don't know what that all ends up as but I do believe if it's only the queue that is corrupt [17:18:31] it's all contained to sge_job [17:18:43] and if it's more than the queue then uh oh :) [17:19:21] is it also semi-possible the corruption errors are occurring on nfs flakiness and it's kind of a misnomer? [17:19:48] that's hard to say, but as we are using hard mounts, I don't think that should happen [17:20:02] but it's nfs, so.. not sure [17:20:30] I suppose we could also move it off nfs, but I'm not 100% sure which steps we then have to take (e.g. is a symlink good enough or do we need to do bind mounts) [17:21:34] yeah same [17:21:42] well let's see where the above leaves us [17:21:47] and go from there I guess [17:26:35] ok. I'm away for a bit, but I'll be back later.
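For reference, the delete-then-rebuild being planned here has roughly the following shape; it matches the one-liner quoted from the 20160115 log later in this conversation. Paths are the ones that appear further down, and this is a sketch rather than a tested runbook:

```
service gridengine-master stop
cd /var/lib/gridengine/spool/spooldb
db_dump sge_job > /root/sge_job.dump      # plain-text dump of the job database
mv sge_job /root/sge_job.bak              # keep the original BDB file aside
db_load -f /root/sge_job.dump sge_job     # rebuild a fresh BDB file from the dump
chown sgeadmin:sgeadmin sge_job           # db_load creates the file as root
service gridengine-master start
```

The point of loading into a fresh file is that `db_load` then writes clean pages from scratch instead of reusing whatever on-disk state may have been corrupted.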
[17:29:25] k [17:35:34] dhlamb: if nbanks makes today's call we'll just have to sort out logistics for ansible merging and merging what you'll have [17:35:39] oops [17:43:46] 6Labs: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#1970352 (10Andrew) This just happened again. Both times it happened right about 15:00 UTC -- maybe that's a clue? today the first alert fired at 15:09; last time the first was at 14:55. [17:48:23] chasemp, YuviPanda, valhallasw`cloud, I got an alert storm this morning about resolution failures. Was there any user-facing impact that you noticed? (From my point of view it looked similar to the DNS outage we had two days ago. Same time, too!) [17:48:40] andrewbogott: eeeh, not sure. [17:49:10] I didn't get anything? [17:49:11] andrewbogott: same time is super interesting [17:49:11] diamond alerts [17:49:14] yeah, I wonder what happens at 14:55 that destabilizes things [17:49:50] I saw the (tools) proxies 502 for a while, but then they came back. Dunno if related. [17:50:27] yeah, most likely the same thing — if the proxy can't resolve hostnames... [17:50:45] chasemp: I updated the plan with your suggested delete-then-rebuild protocol [17:51:07] chasemp: do you remember what we should do with the __db.* and log files? [17:51:34] they were only related to the sge db iirc [17:51:39] ...or so I thought [17:52:13] I think the log file is so large it has to be from the jobs database [17:53:04] what day was it we did this before? if I have the logs I remember we sussed it out based on trying to move things between dirs [17:53:14] and sge wouldn't start I thought as the .log file was needed by the sge db [17:53:15] but [17:53:19] not entirely positive [17:54:12] hrm. [17:54:57] YuviPanda: about? [17:58:31] 20160115.txt:[21:47:23] echo "stopping master" && service gridengine-master stop && echo "------" && ps -ef | grep grid && echo "remove foo" && rm -f foo && ls && db_dump sge_job > foo && cp -p sge_job /root/ && ls /root/ && md5sum sge_job && rm -f sge_job && db_load -f foo sge_job && md5sum sge_job && chown sgeadmin:sgeadmin sge_job && [17:58:31] service gridengine-master start; echo $? [17:59:00] madness :) [17:59:06] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20160115.txt [17:59:16] oh nice [18:00:55] hey chasemp / valhallasw`cloud [18:01:30] chasemp: do you remember where the backup of the sge config is? [18:01:33] iiuc 1800 UTC is now [18:01:38] yep [18:02:03] root/emergency_sge_dump [18:02:03] yeah [18:02:04] from last time [18:02:15] ok so, quick recap on things [18:02:20] YuviPanda: https://etherpad.wikimedia.org/p/toollabs20160127 [18:02:23] corruption messages 'boo' [18:02:34] I have an unrelated test master here [18:02:34] tool-master-05.tool-renewal.eqiad.wmflabs [18:02:37] all okay up to now? [18:02:53] yup, just reading it [18:03:12] valhallasw`cloud: <3 for setting that up :) [18:05:13] nfs seems hung there [18:05:16] and I haven't even done anything :) [18:05:17] ok [18:05:19] just slow [18:05:32] gah [18:05:32] tools.tools-exec-1208.network.eth0.tx_byte - 15620 kB/s [18:05:37] nfs is under heavy load [18:05:50] let me look at exec node [18:05:50] can we root that out quickly, do you think? some job is amuck [18:05:53] templatetiger again [18:06:09] we should salt their crons [18:06:15] err [18:06:17] comment [18:06:18] templatetiger is not cronning anything [18:06:24] oh [18:06:27] so manually submitted? [18:06:31] I think so.
[18:06:34] * YuviPanda waits for 'become' to finish [18:06:47] I qdel'ed the job [18:06:56] 24986 be/4 tools.gl 7.79 K/s 0.00 B/s 0.00 % 4.20 % ruby /data/project/glamify/.rvm/rubies/ruby-2.0.0-p598/bin/rake queue:process [18:06:59] that's what iotop shows me [18:07:01] ah [18:07:06] because you already got to templatetiger [18:07:09] yeah [18:07:20] <3 ok [18:07:38] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970464 (10valhallasw) `qdel`ed another job just now. It's putting severe stress on NFS -- please do not run another job without first consulting. [18:08:00] valhallasw`cloud: I *think* it might be spawned from the webservice [18:08:07] valhallasw`cloud: I can actually find out because we've eventlogging data now [18:08:30] if that's the case we can shut down the webservice [18:08:36] yeah, there's no become templatetiger on tools-bastion-01 today [18:08:59] ok, anyway. That should help with NFS [18:09:39] right [18:09:44] seems solid [18:09:57] I'll send another email to labs-l [18:10:51] chasemp: are you going to do the actual dump? [18:11:07] sure, mind if I drive here? I'm going to try to use the sge_dump to back up first as well [18:11:21] | 4798a5217ea95e25946ae999f62e8b29 | db130c807a2865a603aef6fa46c79c0aaadad816 | 20160127173603 | "Python-urllib/2.7" | NULL | metawiki | jsub | /usr/bin/jstart /data/project/templatetiger/public_html/sort.sh | tools-bastion-01.tools.eqiad.wmflabs | -bash | tools.templatetiger | [18:11:30] valhallasw`cloud: hmm, so that's manually started from bastion [18:11:35] and there's no history entry [18:12:01] what on earth. The master is using 50% memory and large amounts of cpu [18:12:16] maybe it's leaking memory due to the bdb issues -- that would explain the random crashes [18:12:44] it looks ok atm I think [18:12:52] ok here I go [18:12:58] stopped the master and trying to back it up with [18:13:13] root/emergency_bin/save_sge.sh [18:13:15] chasemp: !log :D [18:13:17] it's still running [18:13:27] that was the 'stop me now' warning :) [18:13:30] ok [18:14:05] !log tools grid master stopped [18:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:14:23] I can confirm the process is now gone ;-) [18:14:48] ok I had to start it...apparently the save job [18:15:05] actually uses qconf to get all settings out of the db [18:15:12] so I'm saving now [18:16:12] look in /root/sge_maint_01272016 [18:16:29] Configuration successfully saved to /root/sge_maint_01272016 directory. [18:17:04] btw I copped these save / dump scripts from that Son of Grid Engine repo valhallasw`cloud and I had to modify them a bit [18:17:08] I'll try to puppet that up today [18:17:43] !log tools SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
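The save script mentioned above came from the Son of Grid Engine repository and, as chasemp notes, works by asking the live master for its settings via `qconf` rather than copying BDB files. A stripped-down sketch of the idea (the real script covers many more object types):

```
BACKUP=/root/sge_maint_$(date +%m%d%Y)
mkdir -p "$BACKUP"
qconf -sconf > "$BACKUP/global_config"     # global cluster configuration
qconf -sel   > "$BACKUP/exec_hosts"        # list of execution hosts
for q in $(qconf -sql); do                 # dump every queue definition
    qconf -sq "$q" > "$BACKUP/queue_$q"
done
```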
[18:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:17:53] tx [18:18:53] 4.6M /root/sge_maint_pre_jobs_dump_01272016 [18:19:31] !log tools dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:20:09] !log tools master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job [18:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:20:44] chasemp: ownership is incorrect (root:root) [18:21:10] and I'm not sure what we should do to the log files, but let's keep them and see what gridengine_master says? [18:22:05] agreed [18:22:40] !log tools messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' [18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:23:39] !log tools master sge restarted post dump and restart for jobs db [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:23:45] let's just see what it does? [18:23:45] seems to work [18:23:56] !log tools no errors in log file, qstat works [18:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:24:37] !log tools 'sleep' test job also seems to work without issues [18:24:38] valhallasw`cloud: fyi 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' was me manually putting in a marker [18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:24:44] chasemp: ah, ok [18:24:45] so I would know at what point I did it [18:25:03] *nod* [18:25:54] the rate of error messages was ~one per 5 mins, so we might have to wait for a bit before we cheer [18:26:12] agreed let's sit on it for a few [18:26:18] everybody cool w/ things so far? [18:26:39] !log tools messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate [18:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:26:53] * YuviPanda is watttchinng [18:27:22] I'm not actually doing anything, which is great. let me know if you guys want me to [18:28:14] YuviPanda: yeah, understood, mainly if we have to go another step [18:28:18] and wipe the queue [18:28:22] * YuviPanda nos [18:28:25] * YuviPanda nods [18:29:20] !log tools job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 . [18:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:29:41] tx valhallasw`cloud that job seems to be heavy often [18:30:27] madhuvishy is working on moving it to its own project [18:30:37] yeah, it's still doing stuff there now, but I'm first going to eat dinner [18:31:22] :D [18:31:31] 10Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1970569 (10AdHuikeshoven) As an admin at https://discourse.wmflabs.org/ I just upgraded the Discourse installation. Discourse has two kind... [18:31:39] YuviPanda: yeah i'll try to work on it tonight [18:31:50] madhuvishy: :D thanks!
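The execd complaint above is the same symptom liangent reported at the top of the log: two grid jobs running for what should be a single task. Spotting and clearing the stale copy looks roughly like this (job IDs taken from the log):

```
qstat -u tools.ifttt    # two entries with the same job name means a stale duplicate
qstat -j 2551539        # inspect the suspect job's details
qdel 2551539            # remove the stale copy; 2700629 keeps running
```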
[18:32:20] can we de-nfs it in that case? [18:33:01] chasemp: yes [18:33:17] oh sweet woohoo [18:34:03] hello labs users! [18:34:31] I'm a new user of wmflabs [18:35:08] hey Youni welcome [18:35:13] hello, Youni, welcome [18:35:43] trying to create a tool to collect vocabulary lists in fr.Wikiversity foreign languages [18:36:22] focused in Portuguese language learning [18:37:41] Youni: that's a pretty neat but specific quest :) I'm not sure if I could help; you may be better off sending mail to [18:37:42] https://lists.wikimedia.org/mailman/listinfo/labs-l [18:37:47] to try to grab a wider audience [18:37:56] i just created the vocabulary-index hosted tool [18:38:29] and will look at copying the scripts into [18:38:34] YuviPanda: can you run through a test tool you have and such w/ general submission etc [18:38:45] * YuviPanda does [18:38:47] want to see if activity will trigger using the queue db which would trigger the logs [18:39:35] am doing a bunch of 'em now [18:39:49] tx [18:39:55] stepping away for 2m [18:39:59] k [18:42:36] my test script is doing ok [18:43:41] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970620 (10Kolossos) Sorry, I read this too late. I will sort now to "/tmp/sort/" and hope that's ok. InnoDB seems not to have a physical sort without a primary or unique k... [18:44:07] crap just got the message again [18:44:28] yup [18:44:29] so [18:44:35] no dice [18:45:02] :'( [18:45:05] just saw that [18:45:30] want to wait for valhallasw`cloud to start a fresh queue? [18:46:27] yeah [18:54:39] chasemp: bah :( [18:54:44] we can also do one more intermediate step [18:54:50] which is dump and reload to non-nfs [18:55:12] hmmm [18:55:33] yeah, that'd be nice to try I think [18:56:27] I don't think that helps if the db is already corrupt and dump/load just carries it over [18:56:39] for the current pickle I mean [18:56:47] it's also possible it's not the queue db [18:56:57] that's a good point [18:57:02] it doesn't say what db is corrupt [18:57:05] ! [18:57:05] hello :) [18:57:06] 01/27/2016 18:43:03|worker|tools-grid-master|E|error writing object with key "USER:tools.jimmy" into berkeley database: (22) Invalid argument [18:57:18] well that is interesting [18:57:21] that does sound like the configuration database rather than the job one [18:57:24] should we dump and reload the config one too? [18:57:35] that is a bit more complex maybe [18:57:38] I'm not entirely sure [18:57:51] we can just do the bdb dump? instead of export/import [18:58:51] labstore high load again [18:58:52] ok give me a sec w/ my test setup [18:59:11] looks ok to me now [18:59:29] this whackamole stuff is for the birds tho [18:59:40] huh I see the alert [18:59:48] yeah seems fine now [18:59:50] * YuviPanda nods [19:00:06] no [19:00:11] valhallasw`cloud: templatetiger is running again [19:00:38] killed it again [19:00:52] same thing? [19:00:54] huh [19:01:30] yeah [19:03:50] ok, I don't get what's happening on tools-webgrid-generic-1405.
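Most of the whack-a-mole in this stretch is done by watching per-process and per-connection I/O; the tools named in the log are `iotop` on the exec nodes and `iftop` on the NFS server. Roughly the invocations involved (the interface name is an assumption):

```
iotop -o -b -n 3       # on an exec node: only processes doing I/O, batch mode, 3 samples
iftop -i eth0 -nNP     # on the NFS server: per-connection traffic, no DNS/port lookups
```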
It's doing 20 Mbit/s to labstore, but iotop doesn't show anything; the ifttt processes are unkillable [19:04:08] same thing happened yesterday [19:04:13] I had to reboot I think [19:04:30] the processes are stuck in uninterruptible sleep ('D') [19:04:52] no [19:04:52] it's nfs but not sure why [19:05:01] I restarted the ifttt process [19:05:06] with a webservice restart [19:05:09] and it went away [19:05:48] yeah, I'll just reboot that host [19:05:49] sorry guys just give me a few more minutes here I'm looking at the dump restore possibilities as well as I can [19:06:25] yeah we'll look at this chasemp [19:06:36] valhallasw`cloud: I'm going to restart ifttt too [19:06:44] YuviPanda: why? [19:06:51] is it also killing 1401? [19:06:59] eh, 1403 now [19:07:19] so it left behind a ghost on 1405 [19:07:25] need to see if it left behind one on 1401 [19:07:34] how did you find it? [19:07:53] valhallasw`cloud: wait, when you said [19:07:59] > the processes are stuck in uninterruptible sleep ('D') [19:08:02] is that about ifttt [19:08:04] or everything? [19:08:11] 1405? because gridmaster was reporting a job it couldn't kill [19:08:34] there's ifttt processes (~10) plus a whole bunch of lsof -w -l +d /var/lib/php5 [19:08:37] hmm that instance feels hosed [19:08:46] lots of kworker and kthreadd too [19:08:56] let me reschedule the running jobs [19:09:09] or just reboot, I guess [19:09:15] let's reschedule [19:09:20] reboot, gridengine doesn't see it [19:09:22] the last time I tried [19:09:25] you'll have to disable the queue, though [19:09:38] yeah [19:09:39] it should once the exec manager is back up [19:09:40] let me do that [19:09:42] ok [19:09:47] it didn't the last time [19:09:55] it just was like 'yeah, these jobs exist, sure!' [19:10:00] I waited a good 10min [19:11:23] !log tools depooled tools-webgrid-1405 to prep for restart, lots of stuck processes [19:11:25] valhallasw`cloud: done [19:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:15:24] it's not immediately obvious to me how to safely rebuild the sge db from scratch [19:15:53] chasemp: I think db_dump db_load should just work? [19:16:06] it's just another bdb database [19:16:10] I'm all for that, I meant setting it up anew in a place that's not nfs [19:16:20] it needs some looking into [19:16:29] ok so [19:16:32] dumping the main db [19:17:14] if we decide to move off NFS (and off bdb files possibly) we should probably not try to do that today [19:17:36] agreed i was just giving it the once-over in case it really was simple [19:17:37] :) [19:17:43] spoiler, it's not [19:19:25] dumping the main db and db_load seems to work fine in test case [19:19:27] let's give it a whirl [19:19:36] cool [19:19:58] valhallasw`cloud: hey btw, I tried stracing the master to see [19:20:03] what db it was opening on throwing that error [19:20:07] no such luck [19:20:14] not sure why I didn't see it [19:20:20] maybe you have an idea? [19:20:30] in theory we can tell what it's trying to do pre-corruption warning [19:21:02] *nod* [19:21:37] did you strace -f -p ?
it might be a child process that's doing it [19:22:00] did not do -f [19:22:02] hm [19:22:31] it has 12 or so child processes [19:22:39] duh on me [19:23:13] !log tools stopped master [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:26:27] so uh [19:26:29] db_load: BDB0004 fop_read_meta: sge: unexpected file type or format [19:26:30] db_load: DB->open: sge: Invalid argument [19:27:16] right duh [19:27:16] uuuh [19:27:26] bad syntax I think [19:27:29] input/output file switched? [19:27:41] huh [19:28:06] no this should work unless I'm crazy [19:28:25] db_load -f /tmp/sge_db sge [19:28:27] from testing [19:29:12] . /tmp/sge_db doesn't exist? [19:29:23] sorry that was just a syntax example [19:29:31] ah. [19:29:38] root@tools-grid-master:/var/lib/gridengine/spool/spooldb# db_load -f /root/sge > sge [19:29:57] try it in a different directory? might be the log files messing stuff up [19:30:17] hm no [19:30:18] huh [19:30:22] the file /root/sge is bogus [19:30:31] how so? [19:30:40] (it took it doing it in /tmp fyi) [19:30:53] it contains 3 keys, 2 values, if I understand the dump format correctly [19:31:03] it's way too short for an actual db dump [19:31:11] eh [19:31:12] no [19:31:17] I'm doing something stupid [19:31:25] 3.3M /root/sge [19:31:29] no, it does look OK [19:31:44] I thought I changed my head to -n 100, but apparently mistyped and did not [19:32:24] chasemp: ah! it's db_load -f /root/sge sge not > sge [19:33:01] but I think you figured this out given that there's a new sge in spooldb [19:33:20] let me look back I think it was throwing an error w/ valid syntax [19:33:32] but I'm scrambled now and need to review my own history [19:33:35] meanwhile [19:33:39] it loaded [19:33:53] !tools master start grid master [19:33:53] There are multiple keys, refine your input: tools-admin, toolsbeta, tools-bug, toolscors, tools-equiad, toolsmigration, tools-request, tools-status, toolsvslabs, tools-web, [19:33:55] qconf -ss [19:33:57] looks ok [19:34:06] !log tools master start grid master [19:34:07] hm [19:34:09] yeah, qstat works [19:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:35:09] so I've restored both [19:35:14] let's see what happens [19:35:19] * YuviPanda tries submitting a job [19:35:23] jsub also works [19:35:28] runs ok [19:35:30] yeah [19:36:34] nfs guys :) [19:37:32] tools.tools-exec-1208.network.eth0.tx_byte - 18334 kB/s [19:37:38] if it's templatetiger again... [19:37:47] can we ban that tool for now? [19:37:51] idk [19:37:57] need to do something [19:37:57] it is [19:38:08] oh ok [19:38:12] I thought you said it is that [19:38:38] not if it is :) [19:38:38] kolossos acknowledged my message, though [19:38:38] YuviPanda: hi [19:38:39] I was just guessing and it was right [19:38:39] I can't login to tools-exec-1208, which doesn't help [19:38:39] yeah me neither [19:38:49] but there is a `sort` job running, yes [19:38:57] its jobid is 2735480 [19:39:14] YuviPanda: currently trying to upload things from labs.... and here is what I get on scp "214.1KB/s" :( [19:39:23] I deleted it [19:39:29] YuviPanda: ok, good [19:39:31] hey Kelson [19:39:45] Kelson: we're in the middle of toollabs maintenance - can you file a bug and we can take a look at it after? sorry!
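The `db_load` confusion above comes down to argument order: `-f` names the text dump, and the target database file is a positional argument, so shell redirection has no business being there. Also shown is the `strace -f` suggestion from just before, for watching which spool files the master's children open (the process name and syscall filter are assumptions):

```
# Wrong: the redirection creates an empty ./sge before db_load even runs
db_load -f /root/sge > sge
# Right: load the dump in /root/sge into the database file ./sge
db_load -f /root/sge sge

# Follow children (-f) of the master and watch for spool file opens
strace -f -p "$(pgrep -o sge_qmaster)" -e trace=open,openat 2>&1 | grep spooldb
```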
[19:40:04] YuviPanda: OK, I will [19:40:11] thanks [19:40:42] valhallasw`cloud: again, this time on bastion [19:41:01] killed it [19:41:08] it was someone doing zim stuff [19:41:19] chasemp: today has been more whack-a-mole than usual [19:41:37] * YuviPanda will keep an eye on iftop on labstore now [19:41:44] the good news is that gridengine has been running OK for 10 mins now ;-) [19:41:52] woo! [19:41:57] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970995 (10Kolossos) Is it true that "/tmp/" is locally on a server randomly chosen by jstart? So how can I use these files later in the next job to import them to the database? Why... [19:42:56] 10Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1971000 (10Steko) No, moderators have quite limited rights, they are more similar to wiki Administrators than sysadmins and have no contro... [19:45:05] 6Labs: New instance fails to mount /home - https://phabricator.wikimedia.org/T124957#1971005 (10jkroll) 3NEW [19:46:26] valhallasw`cloud: 10 minutes? you are the best ^^ [19:49:25] 6Labs: New instance fails to mount /home - https://phabricator.wikimedia.org/T124957#1971046 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This is because the nfs-exports service on labstore was dead for 3 days. I've started it now, and your next puppet run should give you /home. [19:49:45] valhallasw`cloud: what do you make of the 'unable to find job' messages [19:49:51] I'm not sure [19:49:59] jobs that had state change during the downtime I think [19:50:08] or something along those lines? [19:51:01] YuviPanda: it's something that 'just happens' and has been happening every few minutes since 2015 [19:51:06] sept 2015* [19:51:11] oh in that case [19:51:13] i.e. the beginning of messages [19:51:14] pffffffff [19:51:24] so I wouldn't worry too much [19:52:01] 6Labs, 10Labs-Infrastructure: Labs bandwidth is aleatory/low - https://phabricator.wikimedia.org/T124960#1971067 (10Kelson) 3NEW [19:52:59] the dumps that the save_sge.sh script makes are huge [19:52:59] 6.6G sge_maint_01272016 [19:53:54] chasemp: accounting file [19:54:00] ahhh [19:54:05] yes [19:54:21] honestly maybe we should truncate that [19:54:29] opening a file that big on nfs all the time [19:54:31] hm [19:55:13] it's also not very useful to have the log going back that far, tbh [19:57:26] YuviPanda: https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/daemons/qmaster/sge_follow.c#L1018 [19:57:30] that's where the error comes from [19:57:42] so something with immediate jobs (-now y) that do not get scheduled [19:58:19] fun [19:58:34] hmm I never did actually instrument qsub so can't look for immediate job invocations [20:03:48] valhallasw`cloud: chasemp at what point would you feel comfortable calling the load/dump a success?
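Truncating the accounting log, which happens a few minutes later, amounts to rotating one flat file while the master is stopped. A sketch, with the path assumed from the Debian gridengine layout:

```
service gridengine-master stop
cd /var/lib/gridengine/default/common      # assumed location of the accounting file
mv accounting accounting.01272016          # set the old 6.6G log aside
service gridengine-master start            # the master begins a fresh accounting file
```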
[20:03:59] well, at 30m I'm cautiously optimistic [20:04:25] last time I restarted at 12:23 [20:04:27] and then we saw it at [20:04:31] 12:44 [20:04:46] this time I started at [20:04:47] Wed Jan 27 19:34:37 UTC 2016 restore of sge db chase [20:04:48] it is now [20:04:52] Wed Jan 27 20:04:49 UTC 2016 [20:05:00] ok [20:05:18] I'd suppose we can watch carefully for another 30min [20:05:20] I'm also stracing to see what's really going on [20:05:30] and I have to periodically wipe the file due to low disk space [20:05:35] fyi :) [20:05:47] fun :D [20:08:53] valhallasw`cloud : someone *may* have pointed out another way of achieving the same thing as my big query with a much smaller (and faster) one - but i've examined the first 10 results (over 3000+) for arwiki, all false positives so far ^^' [20:09:41] i'll try and see how so many false poz came up with the new query when my headache recedes, so possibly not before tomorrow [20:20:49] valhallasw`cloud: YuviPanda it has been an hour [20:20:58] \o/ [20:21:05] I'm not sure if it's gone [20:21:12] but we can't prove it's there still for now [20:21:12] :) [20:21:18] and next steps are more radical [20:21:24] and probably best not today [20:21:26] thoughts? [20:21:32] +1 [20:21:38] I would like to truncate that accounting log [20:21:43] assuming by next step you mean 'get it out of NFS' [20:21:46] so we aren't opening a 6G log every time from NFS [20:21:52] ha, well sure [20:22:31] ah [20:22:37] yeah, truncating seems like a good idea [20:23:37] done [20:23:45] would you mind just running through a test job quick? [20:23:46] cool [20:23:48] yea [20:23:48] before we call it [20:24:42] !log tools master stop, truncate accounting log to accounting.01272016, master start [20:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:25:03] chasemp: wfm [20:25:18] I did a webservice restart + check, a jsub + check and a bunch of qstats [20:25:25] cool [20:28:44] YuviPanda: mind responding to that thread on labs-l? can't seem to find it atm [20:30:26] chasemp: yeah let me do that [20:32:46] valhallasw`cloud: man, seriously, thanks again -- the etherpad was super helpful [20:33:08] I would tally the beers we owe you but I only have a 64 bit register [20:36:32] chasemp: hahaha [20:37:10] I'm mostly just happy the issues seem resolved now :-) [20:49:28] 10PAWS, 7Upstream: PAWS cron functionality - https://phabricator.wikimedia.org/T124972#1971301 (10jayvdb) 3NEW [21:10:41] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971392 (10valhallasw) Yes, /tmp/ is locally on the server (that's the point -- you don't want to hit the NFS server). You can either combine the jobs (so that they run on the... [21:10:50] chasemp: YuviPanda so I figured out why templatetiger's sort was so bad ^ [21:23:31] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971454 (10valhallasw) Yeah, `/data/project/templatetiger` is full of `sort` temp files (`sortXXXXXX`), totalling about 9GB from the last run. `/tmp` is 'only' 15GB, so you mig... [21:41:58] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971499 (10valhallasw) For the record, I sent an e-mail on 7 Jun 2015 to the entire group of maintainers: > Hello Templatetiger maintainers, > > One of your SGE jobs, > > `so...
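The templatetiger fix being worked out in those Phabricator comments: GNU `sort` spills temporary merge files to disk, and in this case they were landing in the tool's NFS home. `-T` points them at node-local disk instead (the input filename here is made up):

```
mkdir -p /tmp/sort
sort -T /tmp/sort -o sorted.txt /data/project/templatetiger/huge-input.txt
```

Per valhallasw's comment, the last run produced about 9GB of temp files against a 15GB /tmp, so local disk space on the exec node is the remaining constraint.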
[22:31:01] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971680 (10Kolossos) I will try now -T /tmp/sort. So the temporal dir should be local tmp . I code this stuff, often after my regular job. So sometimes I'm fine if things se... [22:34:40] 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1971715 (10chasemp) 3NEW [22:58:22] YuviPanda: look at tools-exec-1208 would you [22:58:24] weird job there [22:58:31] * YuviPanda looks [22:58:31] 28168 be/6 tools.te 0.00 B 99.95 M 0.00 % 60.31 % sort /data/project/temp~6-01-11.txt -T /tmp/sort [22:58:35] hammering nfs I think [22:58:47] chasemp: that's templatetiger again [22:58:56] I killed it before and it came right back [22:59:18] chasemp: the user is probably actively testing it [22:59:30] chasemp: https://phabricator.wikimedia.org/T124822#1970995 [22:59:36] sure but they are sucking the life out of nfs for all [22:59:40] yup [23:00:12] I'm killing it again now [23:00:22] what can we do so this stops happening atm? [23:00:41] I can probably disable user access to that tool [23:00:59] same person, what, 4x or 5x today [23:01:02] yes [23:01:41] YuviPanda: am I right that the 'Shared Storage' section on pages like this does nothing? https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=configureproject&projectname=testlabs [23:02:03] andrewbogott: you are right [23:02:09] should die [23:02:10] ok, going to rip it out then [23:02:27] (and maybe that whole page) [23:02:32] chasemp: so we can disable that tool for now but I'm not sure what exactly to tell the user [23:02:42] if they don't have access to it they can't find alternate ways to do what they want to do [23:02:47] can they do this on the -dev host not on nfs? [23:02:48] but I guess disable first and then talk is an ok thing [23:02:51] or get a vm? [23:02:56] probably [23:03:05] yeah they are literally putting nfs in danger [23:03:14] I'm going to disable them now [23:03:23] and then we can figure out what to do/say I guess [23:03:29] ok [23:03:40] I want to help them out but we gotta stop the madness first [23:07:36] !log tools removed all members of templatetiger, added self instead, removed active shell sessions [23:07:41] chasemp: valhallasw`cloud ^ [23:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [23:08:15] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971921 (10yuvipanda) This killed NFS again just now, so I've temporarily disabled the tool. [23:08:47] YuviPanda: uh [23:09:06] That's kolossos trying with /tmp instead of NFS as swa [23:09:42] So they are working on fixing [23:09:55] So removing all members is probably overreacting [23:10:05] valhallasw`cloud: I wasn't sure how exactly to disable access [23:10:08] outside of that [23:10:12] oh god, for a second there i felt awful for sorting a 100KB file. multi GB files ?! >_< [23:10:29] Anyway, sleeping now [23:10:35] valhallasw`cloud: it was killing NFS every time they tried so not much else I could do [23:10:39] I'm writing a longer comment now [23:13:02] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971943 (10yuvipanda) >>! In T124822#1971680, @Kolossos wrote: > I will try now -T /tmp/sort. So the temporal dir should be local tmp .
> I code this stuff, often after my r... [23:13:51] chasemp: valhallasw`cloud ^ left a comment there [23:21:01] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971958 (10Andrew) 3NEW a:3Andrew [23:21:39] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971967 (10yuvipanda) Kill it. Stop offering service groups to non-tools projects - nobody uses them, and people only create them accidentally and then forget they exist. [23:27:13] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971995 (10scfc) IIRC, currently service groups are the only way to get credentials for the replica servers. [23:28:12] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971997 (10yuvipanda) Service groups in tools only - the replica server credential generator doesn't operate for any other project. It also requires you have NFS which is also not there by default for new pr... [23:30:56] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1972018 (10chasemp) Just wanted to say #2 above should be pretty straightforward and is meant for this kind of case. Thanks :) We are in #wikimedia-labs on irc if you have... [23:37:51] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other, 10DBA: tools.tools-info credentials are not functioning - https://phabricator.wikimedia.org/T105911#1972042 (10yuvipanda) 5Open>3Resolved Done and documented [23:39:16] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other, 10DBA: tools.tools-info credentials are not functioning - https://phabricator.wikimedia.org/T105911#1972047 (10yuvipanda) Documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Regenerate_replica.my.cnf, didn't paste last time [23:52:56] YuviPanda: disabling puppet for a few on tools-bastion so I can poke at a diamond collector there w/o my changes going away [23:53:12] if you need to, remove it; it's just so I can poke somewhere with actual usage, I'm a bit confused by something [23:53:26] chasemp: kk.