[00:12:50] 6Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118, 10labs-sprint-119: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1968161 (10Andrew) a:3Andrew [00:15:09] 6Labs, 10Labs-Infrastructure, 10labs-sprint-117, 10labs-sprint-118, 10labs-sprint-119: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1968169 (10Andrew) [00:34:29] 6Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 6operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1968235 (10bd808) [00:34:31] 6Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure: Setup real ssl certs for Beta Cluster using a restricted project - https://phabricator.wikimedia.org/T75919#1968234 (10bd808) [00:55:24] (03CR) 10Tim Landscheidt: [C: 032] "Did not test, but looks good enough to me." [labs/toollabs] - 10https://gerrit.wikimedia.org/r/266466 (owner: 10BryanDavis) [03:27:45] YuviPanda: what did you do today... [03:28:45] a continuous job started with -once has two running processes right now [03:29:04] job ID 2176874 state=Rr and job ID 2695870 state=r [03:31:59] 6Labs, 10Tool-Labs: Tool Labs: jsub starts multiple instances of tasks declared as "once" - https://phabricator.wikimedia.org/T62862#1968447 (10liangent) I got hit by this again after some NFS issue / maintenance: ``` job-ID prior name user state submit/start at queue... [03:56:48] liangent: happened to me twice I think [07:45:55] hi, who knows how to change the nodejs version on betacluster's sca01/02 [07:47:56] yurik: Marko and Alex have started to upgrade sc* to nodejs 4.2 (scb is already), better check with them [07:50:00] moritzm, scb ? there are no such instances in labs, are you talking about production? [08:06:04] yurik: yeah, they've been doing that in production, so best check with them wrt the updates in beta (so that it doesn't differ in unexpected ways) [08:07:07] moritzm, thx, but it already is different - since there is no node4 in the beta cluster [11:02:26] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969175 (10Phe) 3NEW [11:07:50] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969222 (10valhallasw) ``` tools-webgrid-lighttpd-1206 login: [418800.440736] INFO: task lighttpd:28011 blocked for more than 120 seconds. [418800.443108] "echo 0 > /proc/sys/... [11:10:58] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969229 (10Vituzzu) I was about to notify a guc outage, I suppose that's the reason for it. [11:11:14] here's the problem [11:12:55] 6Labs, 10Tool-Labs: user `marcmiquel` gzipping 41G file on NFS - https://phabricator.wikimedia.org/T124877#1969242 (10valhallasw) 3NEW [11:40:48] Hey guys, can I request deletion of a labs tool of mine and its associated files and configuration? [11:43:33] request deletion? [11:43:57] you probably have the technical ability to do it yourself, but why? [11:45:54] Krenair: Yes. Oh, how? [11:47:05] tool labs tool? [11:47:27] Yes [11:47:44] For an unknown reason, the log of that specific tool shows it is somehow unable to be restarted after breakages.
My other tools don't have that issue [11:48:22] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Can_I_delete_a_tool.3F <= admin permissions are required [11:48:43] what's the error? [11:49:09] zhuyifei1999_: my logs don't show any internal errors from my code [11:49:52] ? [11:50:43] 2016-01-21 20:32:55: (server.c.1558) server stopped by UID = 0 PID = 14798 [11:50:43] 2016-01-24 12:04:59: (log.c.166) server started [11:51:32] so a web tool. lighttpd? [11:51:41] Yes, just a simple one [11:52:04] I am not sure why it is not auto restarting [11:52:27] My other tools are able to [11:52:43] and even after you restart it and it crashes again? [11:53:27] if you only need auto restarts you can use the webservicemonitor or something (I don't remember the name) [11:53:31] Well, it is really not a crash of my script [11:53:53] I mean lighttpd crashing [11:54:01] 2016-01-21 19:55:13: (server.c.1558) server stopped by UID = 0 PID = 1227 [11:54:01] 2016-01-21 20:07:28: (log.c.166) server started [11:54:35] The other tool, without any configuration, is able to. I just wonder if recreating the tool may fix the issue [11:55:45] <_joe_> ebraminio: did you open a phab ticket about this issue? [11:57:50] _joe_: honestly don't know what to write there [11:58:00] no [11:58:30] <_joe_> ebraminio: you can go to phabricator.wikimedia.org, you can login with your labs login I guess, and open a ticket [11:58:57] <_joe_> add the Tool-Labs tag to it [11:59:02] <_joe_> and describe your issue [11:59:25] <_joe_> that might get the attention of one of the toollabs admins more easily [12:07:33] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969385 (10Ebraminio) 3NEW [12:07:50] _joe_, zhuyifei1999_: https://phabricator.wikimedia.org/T124884 [12:08:28] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969394 (10Ebraminio) [12:12:04] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969413 (10zhuyifei1999) [12:35:51] Hello! [12:36:24] I'm a new native French speaker [12:37:57] Discovering labs and trying to bring a project about foreign vocabulary to fr.wikiversity [12:38:07] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969461 (10Ebraminio) [12:44:31] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969466 (10valhallasw) Running `webservice stop` `webservice start` regenerated the `service.manifest` file in `/data/project/linkstranslator`. I'm not sure why that file was missing --...
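For context on the fix valhallasw describes in that last comment: on Tool Labs, the service watchdog only restarts web services that are declared in a tool's `service.manifest`, and a full stop/start cycle rewrites that file. A minimal sketch of the recovery (the tool name here is hypothetical):

```
$ become mytool                               # hypothetical tool name
$ webservice stop
$ webservice start                            # rewrites service.manifest in the tool home
$ cat /data/project/mytool/service.manifest   # should now describe the running web service
```

As Ebraminio notes later in the log, `webservice restart` apparently did not recreate the missing file, so a separate stop followed by start is the safer sequence.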
[12:45:24] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969474 (10valhallasw) 5Open>3Resolved a:3valhallasw [12:48:07] 6Labs, 10Tool-Labs: tools-webgrid-lighttpd-1206 half-dead, webservices and ssh unaccessible - https://phabricator.wikimedia.org/T124875#1969482 (10valhallasw) [12:48:09] 6Labs: Instances locking up randomly - https://phabricator.wikimedia.org/T121998#1969483 (10valhallasw) [12:48:14] 6Labs, 10Labs-Infrastructure, 7Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#1969481 (10valhallasw) [13:23:42] hi everybody [13:25:48] I'm reading around the Labs tools list [13:26:30] Trying to check if a project similar to the one I'm working on already exists [13:27:20] and discovering a few fr users [13:27:47] A quick hello to them! [15:10:15] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969784 (10intracer) 3NEW a:3intracer [15:14:31] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969809 (10intracer) [15:24:39] 6Labs, 10Tool-Labs, 10WLX-Jury: Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#1969835 (10intracer) [15:25:07] (03PS1) 10Subramanya Sastry: ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [16:02:47] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969975 (10Ebraminio) Yes. Actually I was using a node.js solution before and then converted this to a simple PHP code. So is it going to be okay from now on? That would be very nice :)... [16:10:58] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1969998 (10valhallasw) 5Open>3Resolved a:3valhallasw As long as there is a `service.manifest`, everything should be OK :-) [16:37:56] 6Labs, 10Tool-Labs: My tool is not getting restarted after tools server breakages - https://phabricator.wikimedia.org/T124884#1970063 (10Ebraminio) Thank you :) Actually I was using `webservice restart` and that probably can explain why I didn't have the file, probably webservice script worth to be fixed in a
[16:49:36] (03PS2) 10Subramanya Sastry: ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:02:34] (03CR) 10Jcrespo: [C: 04-1] ruthenium services: Add dummy secrets for parsoid-rt and parsoid-vd (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:06:47] chasemp: YuviPanda I'm preparing a plan @ https://etherpad.wikimedia.org/p/toollabs20160127 [17:07:15] (03PS3) 10Subramanya Sastry: ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:10:24] (03CR) 10Jcrespo: [C: 04-1] ruthenium services: Add testreduce::mysql password for db access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:11:37] (03PS4) 10Subramanya Sastry: ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) [17:12:42] (03CR) 10Jcrespo: [C: 032] ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:13:03] will submit this, which will allow testing the other [17:13:53] valhallasw`cloud: what's the thought behind loading the db from another directory? [17:14:12] chasemp: I'm not sure what happens if you reload to the same file in the same directory [17:14:26] basically, I want to be sure it uses an empty file to load the data [17:15:23] (03CR) 10Jcrespo: [V: 032] ruthenium services: Add testreduce::mysql password for db access [labs/private] - 10https://gerrit.wikimedia.org/r/266752 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:15:44] valhallasw`cloud: my thinking was to do a db_dump to a file and remove the existing queue db [17:15:46] and rebuild it [17:15:47] and see [17:15:49] and if not that [17:16:01] then just remove the db file and start fresh (and a new one gets built) [17:16:09] (03PS1) 10Merlijn van Deen: Report labs/private to #wikimedia-operations [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/266769 [17:16:43] chasemp: sure, that also works [17:17:04] I did that the last round in testing to see how things would cope [17:17:22] and it seemed sane and even w/ a dump and rebuild it md5'd out differently etc [17:18:23] I don't know what that all ends up as but I do believe if it's only the queue that is corrupt [17:18:31] it's all contained to sge_job [17:18:43] and if it's more than the queue then uh oh :) [17:19:21] is it also semi-possible the corruption errors are occurring on nfs flakiness and it's kind of a misnomer? [17:19:48] that's hard to say, but as we are using hard mounts, I don't think that should happen [17:20:02] but it's nfs, so.. not sure [17:20:30] I suppose we could also move it off nfs, but I'm not 100% sure which steps we then have to take (e.g. is a symlink good enough or do we need to do bind mounts) [17:21:34] yeah same [17:21:42] well let's see where the above leaves us [17:21:47] and go from there I guess [17:26:35] ok. I'm away for a bit, but I'll be back later.
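For reference, the delete-then-rebuild being planned here has roughly the following shape; it matches the one-liner quoted from the 20160115 log later in this conversation. Paths are the ones that appear further down, and this is a sketch rather than a tested runbook:

```
service gridengine-master stop
cd /var/lib/gridengine/spool/spooldb
db_dump sge_job > /root/sge_job.dump      # plain-text dump of the job database
mv sge_job /root/sge_job.bak              # keep the original BDB file aside
db_load -f /root/sge_job.dump sge_job     # rebuild a fresh BDB file from the dump
chown sgeadmin:sgeadmin sge_job           # db_load creates the file as root
service gridengine-master start
```

The point of loading into a fresh file is that `db_load` then writes clean pages from scratch instead of reusing whatever on-disk state may have been corrupted.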
[17:29:25] k [17:35:34] dhlamb: if nbanks makes today's call we'll just have to sort out logistics for ansible merging and merging what you'll have [17:35:39] oops [17:43:46] 6Labs: Periodic internal labs dns outages - https://phabricator.wikimedia.org/T124680#1970352 (10Andrew) This just happened again. Both times it happened right about 15:00 UTC -- maybe that's a clue? today the first alert fired at 15:09; last time the first was at 14:55. [17:48:23] chasemp, YuviPanda, valhallasw`cloud, I got an alert storm this morning about resolution failures. Was there any user-facing impact that you noticed? (From my point of view it looked similar to the DNS outage we had two days ago. Same time, too!) [17:48:40] andrewbogott: eeeh, not sure. [17:49:10] I didn't get anything? [17:49:11] andrewbogott: same time is super interesting [17:49:11] diamond alerts [17:49:14] yeah, I wonder what happens at 14:55 that destabilizes things [17:49:50] I saw the (tools) proxies 502 for a while, but then they came back. Dunno if related. [17:50:27] yeah, most likely the same thing — if the proxy can't resolve hostnames... [17:50:45] chasemp: I updated the plan with your suggested delete-then-rebuild protocol [17:51:07] chasemp: do you remember what we should do with the __db.* and log files? [17:51:34] they were only related to the sge db iirc [17:51:39] ...or so I thought [17:52:13] I think the log file is so large it has to be from the jobs database [17:53:04] what day was it we did this before? if I have the logs I remember we sussed it out based on trying to move things between dirs [17:53:14] and sge wouldn't start I thought as the .log file was needed by the sge db [17:53:15] but [17:53:19] not entirely positive [17:54:12] hrm. [17:54:57] YuviPanda: about? [17:58:31] 20160115.txt:[21:47:23] echo "stopping master" && service gridengine-master stop && echo "------" && ps -ef | grep grid && echo "remove foo" && rm -f foo && ls && db_dump sge_job > foo && cp -p sge_job /root/ && ls /root/ && md5sum sge_job && rm -f sge_job && db_load -f foo sge_job && md5sum sge_job && chown sgeadmin:sgeadmin sge_job && [17:58:31] service gridengine-master start; echo $? [17:59:00] madness :) [17:59:06] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20160115.txt [17:59:16] oh nice [18:00:55] hey chasemp / valhallasw`cloud [18:01:30] chasemp: do you remember where the backup of the sge config is? [18:01:33] iiuc 1800 UTC is now [18:01:38] yep [18:02:03] root/emergency_sge_dump [18:02:03] yeah [18:02:04] from last time [18:02:15] ok so, quick recap on things [18:02:20] YuviPanda: https://etherpad.wikimedia.org/p/toollabs20160127 [18:02:23] corruption messages 'boo' [18:02:34] I have an unrelated test master here [18:02:34] tool-master-05.tool-renewal.eqiad.wmflabs [18:02:37] all okay up to now? [18:02:53] yup, just reading it [18:03:12] valhallasw`cloud: <3 for setting that up :) [18:05:13] nfs seems hung there [18:05:16] and I haven't even done anything :) [18:05:17] ok [18:05:19] just slow [18:05:32] gah [18:05:32] tools.tools-exec-1208.network.eth0.tx_byte - 15620 kB/s [18:05:37] nfs is under heavy load [18:05:50] let me look at exec node [18:05:50] can we root that out quickly, do you think? some job is amuck [18:05:53] templatetiger again [18:06:09] we should salt their crons [18:06:15] err [18:06:17] comment [18:06:18] templatetiger is not cronning anything [18:06:24] oh [18:06:27] so manually submitted? [18:06:31] I think so.
[18:06:34] * YuviPanda waits for 'become' to finish [18:06:47] I qdel'ed the job [18:06:56] 24986 be/4 tools.gl 7.79 K/s 0.00 B/s 0.00 % 4.20 % ruby /data/project/glamify/.rvm/rubies/ruby-2.0.0-p598/bin/rake queue:process [18:06:59] that's what iotop shows me [18:07:01] ah [18:07:06] because you already got to templatetiger [18:07:09] yeah [18:07:20] <3 ok [18:07:38] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970464 (10valhallasw) `qdel`ed another job just now. It's putting severe stress on NFS -- please do not run another job without first consulting. [18:08:00] valhallasw`cloud: I *think* it might be spawned from the webservice [18:08:07] valhallasw`cloud: I can actually find out because we've eventlogging data now [18:08:30] if that's the case we can shut down the webservice [18:08:36] yeah, there's no become templatetiger on tools-bastion-01 today [18:08:59] ok, anyway. That should help with NFS [18:09:39] right [18:09:44] seems solid [18:09:57] I'll send another email to labs-l [18:10:51] chasemp: are you going to do the actual dump? [18:11:07] sure, mind if I drive here? I'm going to try to use the sge_dump to back up first as well [18:11:21] | 4798a5217ea95e25946ae999f62e8b29 | db130c807a2865a603aef6fa46c79c0aaadad816 | 20160127173603 | "Python-urllib/2.7" | NULL | metawiki | jsub | /usr/bin/jstart /data/project/templatetiger/public_html/sort.sh | tools-bastion-01.tools.eqiad.wmflabs | -bash | tools.templatetiger | [18:11:30] valhallasw`cloud: hmm, so that's manually started from bastion [18:11:35] and there's no history entry [18:12:01] what on earth. The master is using 50% memory and large amounts of cpu [18:12:16] maybe it's leaking memory due to the bdb issues -- that would explain the random crashes [18:12:44] it looks ok atm I think [18:12:52] ok here I go [18:12:58] stopped the master and trying to back it up with [18:13:13] root/emergency_bin/save_sge.sh [18:13:15] chasemp: !log :D [18:13:17] it's still running [18:13:27] that was the 'stop me now' warning :) [18:13:30] ok [18:14:05] !log tools grid master stopped [18:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:14:23] I can confirm the process is now gone ;-) [18:14:48] ok I had to start it...apparently the save job [18:15:05] actually uses qconf to get all settings out of the db [18:15:12] so I'm saving now [18:16:12] look in /root/sge_maint_01272016 [18:16:29] Configuration successfully saved to /root/sge_maint_01272016 directory. [18:17:04] btw I copped these save / dump scripts from that Son of Grid Engine repo valhallasw`cloud and I had to modify them a bit [18:17:08] I'll try to puppet that up today [18:17:43] !log tools SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
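The save script mentioned above came from the Son of Grid Engine repository and, as chasemp notes, works by asking the live master for its settings via `qconf` rather than copying BDB files. A stripped-down sketch of the idea (the real script covers many more object types):

```
BACKUP=/root/sge_maint_$(date +%m%d%Y)
mkdir -p "$BACKUP"
qconf -sconf > "$BACKUP/global_config"     # global cluster configuration
qconf -sel   > "$BACKUP/exec_hosts"        # list of execution hosts
for q in $(qconf -sql); do                 # dump every queue definition
    qconf -sq "$q" > "$BACKUP/queue_$q"
done
```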
[18:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:17:53] tx [18:18:53] 4.6M /root/sge_maint_pre_jobs_dump_01272016 [18:19:31] !log tools dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:20:09] !log tools master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job [18:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:20:44] chasemp: ownership is incorrect (root:root) [18:21:10] and I'm not sure what we should do to the log files, but let's keep them and see what gridengine_master says? [18:22:05] agreed [18:22:40] !log tools messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' [18:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:23:39] !log tools master sge restarted post dump and restart for jobs db [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:23:45] let's just see what it does? [18:23:45] seems to work [18:23:56] !log tools no errors in log file, qstat works [18:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:24:37] !log tools 'sleep' test job also seems to work without issues [18:24:38] valhallasw`cloud: fyi 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' was me manually putting in a marker [18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:24:44] chasemp: ah, ok [18:24:45] so I would know at what point I did it [18:25:03] *nod* [18:25:54] the rate of error messages was ~one per 5 mins, so we might have to wait for a bit before we cheer [18:26:12] agreed let's sit on it for a few [18:26:18] everybody cool w/ things so far? [18:26:39] !log tools messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate [18:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:26:53] * YuviPanda is watttchinng [18:27:22] I'm not actually doing anything, which is great. let me know if you guys want me to [18:28:14] YuviPanda: yeah, understood, mainly if we have to go another step [18:28:18] and wipe the queue [18:28:22] * YuviPanda nos [18:28:25] * YuviPanda nods [18:29:20] !log tools job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 . [18:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:29:41] tx valhallasw`cloud that job seems to be heavy often [18:30:27] madhuvishy is working on moving it to its own project [18:30:37] yeah, it's still doing stuff there now, but I'm first going to eat dinner [18:31:22] :D [18:31:31] 10Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1970569 (10AdHuikeshoven) As an admin at https://discourse.wmflabs.org/ I just upgraded the Discourse installation. Discourse has two kind... [18:31:39] YuviPanda: yeah i'll try to work on it tonight [18:31:50] madhuvishy: :D thanks!
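The execd complaint above is the same symptom liangent reported at the top of the log: two grid jobs running for what should be a single task. Spotting and clearing the stale copy looks roughly like this (job IDs taken from the log):

```
qstat -u tools.ifttt    # two entries with the same job name means a stale duplicate
qstat -j 2551539        # inspect the suspect job's details
qdel 2551539            # remove the stale copy; 2700629 keeps running
```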
[18:32:20] can we de-nfs it in that case? [18:33:01] chasemp: yes [18:33:17] oh sweet woohoo [18:34:03] hello labs users! [18:34:31] I'm a new user of wmflabs [18:35:08] hey Youni welcome [18:35:13] hello, Youni, welcome [18:35:43] trying to create a tool to collect vocabulary lists in fr.Wikiversity foreign languages [18:36:22] focused in Portuguese language learning [18:37:41] Youni: that's a pretty neat but specific quest :) I'm not sure if I could help; you may be better off sending mail to [18:37:42] https://lists.wikimedia.org/mailman/listinfo/labs-l [18:37:47] to try to grab a wider audience [18:37:56] i just created the vocabulary-index hosted tool [18:38:29] and will look at copying the scripts into [18:38:34] YuviPanda: can you run through a test tool you have and such w/ general submission etc [18:38:45] * YuviPanda does [18:38:47] want to see if activity will trigger using the queue db which would trigger the logs [18:39:35] am doing a bunch of 'em now [18:39:49] tx [18:39:55] stepping away for 2m [18:39:59] k [18:42:36] my test script is doing ok [18:43:41] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970620 (10Kolossos) Sorry, I read this too late. I will sort now to "/tmp/sort/" and hope that's ok. InnoDB seems not to have a physical sort without a primary or unique k... [18:44:07] crap just got the message again [18:44:28] yup [18:44:29] so [18:44:35] no dice [18:45:02] :'( [18:45:05] just saw that [18:45:30] want to wait for valhallasw`cloud to start a fresh queue? [18:46:27] yeah [18:54:39] chasemp: bah :( [18:54:44] we can also do one more intermediate step [18:54:50] which is dump and reload to non-nfs [18:55:12] hmmm [18:55:33] yeah, that'd be nice to try I think [18:56:27] I don't think that helps if the db is already corrupt and dump/load just carries it over [18:56:39] for the current pickle I mean [18:56:47] it's also possible it's not the queue db [18:56:57] that's a good point [18:57:02] it doesn't say what db is corrupt [18:57:05] ! [18:57:05] hello :) [18:57:06] 01/27/2016 18:43:03|worker|tools-grid-master|E|error writing object with key "USER:tools.jimmy" into berkeley database: (22) Invalid argument [18:57:18] well that is interesting [18:57:21] that does sound like the configuration database rather than the job one [18:57:24] should we dump and reload the config one too? [18:57:35] that is a bit more complex maybe [18:57:38] I'm not entirely sure [18:57:51] we can just do the bdb dump? instead of export/import [18:58:51] labstore high load again [18:58:52] ok give me a sec w/ my test setup [18:59:11] looks ok to me now [18:59:29] this whackamole stuff is for the birds tho [18:59:40] huh I see the alert [18:59:48] yeah seems fine now [18:59:50] * YuviPanda nods [19:00:06] no [19:00:11] valhallasw`cloud: templatetiger is running again [19:00:38] killed it again [19:00:52] same thing? [19:00:54] huh [19:01:30] yeah [19:03:50] ok, I don't get what's happening on tools-webgrid-generic-1405.
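Most of the whack-a-mole in this stretch is done by watching per-process and per-connection I/O; the tools named in the log are `iotop` on the exec nodes and `iftop` on the NFS server. Roughly the invocations involved (the interface name is an assumption):

```
iotop -o -b -n 3       # on an exec node: only processes doing I/O, batch mode, 3 samples
iftop -i eth0 -nNP     # on the NFS server: per-connection traffic, no DNS/port lookups
```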
It's doing 20 Mbit/s to labstore, but iotop doesn't show anything; the ifttt processes are unkillable [19:04:08] same thing happened yesterday [19:04:13] I had to reboot I think [19:04:30] the processes are stuck in uninterruptible sleep ('D') [19:04:52] no [19:04:52] it's nfs but not sure why [19:05:01] I restarted the ifttt process [19:05:06] with a webservice restart [19:05:09] and it went away [19:05:48] yeah, I'll just reboot that host [19:05:49] sorry guys just give me a few more minutes here I'm looking at the dump restore possibilities as well as I can [19:06:25] yeah we'll look at this chasemp [19:06:36] valhallasw`cloud: I'm going to restart ifttt too [19:06:44] YuviPanda: why? [19:06:51] is it also killing 1401? [19:06:59] eh, 1403 now [19:07:19] so it left behind a ghost on 1405 [19:07:25] need to see if it left behind one on 1401 [19:07:34] how did you find it? [19:07:53] valhallasw`cloud: wait, when you said [19:07:59] > the processes are stuck in uninterruptible sleep ('D') [19:08:02] is that about ifttt [19:08:04] or everything? [19:08:11] 1405? because gridmaster was reporting a job it couldn't kill [19:08:34] there's ifttt processes (~10) plus a whole bunch of lsof -w -l +d /var/lib/php5 [19:08:37] hmm that instance feels hosed [19:08:46] lots of kworker and kthreadd too [19:08:56] let me reschedule the running jobs [19:09:09] or just reboot, I guess [19:09:15] let's reschedule [19:09:20] reboot, gridengine doesn't see it [19:09:22] the last time I tried [19:09:25] you'll have to disable the queue, though [19:09:38] yeah [19:09:39] it should once the exec manager is back up [19:09:40] let me do that [19:09:42] ok [19:09:47] it didn't the last time [19:09:55] it just was like 'yeah, these jobs exist, sure!' [19:10:00] I waited a good 10min [19:11:23] !log tools depooled tools-webgrid-1405 to prep for restart, lots of stuck processes [19:11:25] valhallasw`cloud: done [19:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:15:24] it's not immediately obvious to me how to safely rebuild the sge db from scratch [19:15:53] chasemp: I think db_dump db_load should just work? [19:16:06] it's just another bdb database [19:16:10] I'm all for that, I meant setting it up anew in a place that's not nfs [19:16:20] it needs some looking into [19:16:29] ok so [19:16:32] dumping the main db [19:17:14] if we decide to move off NFS (and off bdb files possibly) we should probably not try to do that today [19:17:36] agreed i was just giving it the once-over in case it really was simple [19:17:37] :) [19:17:43] spoiler, it's not [19:19:25] dumping the main db and db_load seems to work fine in test case [19:19:27] let's give it a whirl [19:19:36] cool [19:19:58] valhallasw`cloud: hey btw, I tried stracing the master to see [19:20:03] what db it was opening on throwing that error [19:20:07] no such luck [19:20:14] not sure why I didn't see it [19:20:20] maybe you have an idea? [19:20:30] in theory we can tell what it's trying to do pre-corruption warning [19:21:02] *nod* [19:21:37] did you strace -f -p ?
it might be a child process that's doing it [19:22:00] did not do -f [19:22:02] hm [19:22:31] it has 12 or so child processes [19:22:39] duh on me [19:23:13] !log tools stopped master [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:26:27] so uh [19:26:29] db_load: BDB0004 fop_read_meta: sge: unexpected file type or format [19:26:30] db_load: DB->open: sge: Invalid argument [19:27:16] right duh [19:27:16] uuuh [19:27:26] bad syntax I think [19:27:29] input/output file switched? [19:27:41] huh [19:28:06] no this should work unless I'm crazy [19:28:25] db_load -f /tmp/sge_db sge [19:28:27] from testing [19:29:12] . /tmp/sge_db doesn't exist? [19:29:23] sorry that was just a syntax example [19:29:31] ah. [19:29:38] root@tools-grid-master:/var/lib/gridengine/spool/spooldb# db_load -f /root/sge > sge [19:29:57] try it in a different directory? might be the log files messing stuff up [19:30:17] hm no [19:30:18] huh [19:30:22] the file /root/sge is bogus [19:30:31] how so? [19:30:40] (it took it doing it in /tmp fyi) [19:30:53] it contains 3 keys, 2 values, if I understand the dump format correctly [19:31:03] it's way too short for an actual db dump [19:31:11] eh [19:31:12] no [19:31:17] I'm doing something stupid [19:31:25] 3.3M /root/sge [19:31:29] no, it does look OK [19:31:44] I thought I changed my head to -n 100, but apparently mistyped and did not [19:32:24] chasemp: ah! it's db_load -f /root/sge sge not > sge [19:33:01] but I think you figured this out given that there's a new sge in spooldb [19:33:20] let me look back I think it was throwing an error w/ valid syntax [19:33:32] but I'm scrambled now and need to review my own history [19:33:35] meanwhile [19:33:39] it loaded [19:33:53] !tools master start grid master [19:33:53] There are multiple keys, refine your input: tools-admin, toolsbeta, tools-bug, toolscors, tools-equiad, toolsmigration, tools-request, tools-status, toolsvslabs, tools-web, [19:33:55] qconf -ss [19:33:57] looks ok [19:34:06] !log tools master start grid master [19:34:07] hm [19:34:09] yeah, qstat works [19:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:35:09] so I've restored both [19:35:14] let's see what happens [19:35:19] * YuviPanda tries submitting a job [19:35:23] jsub also works [19:35:28] runs ok [19:35:30] yeah [19:36:34] nfs guys :) [19:37:32] tools.tools-exec-1208.network.eth0.tx_byte - 18334 kB/s [19:37:38] if it's templatetiger again... [19:37:47] can we ban that tool for now? [19:37:51] idk [19:37:57] need to do something [19:37:57] it is [19:38:08] oh ok [19:38:12] I thought you said it is that [19:38:38] not if it is :) [19:38:38] kolossos acknowledged my message, though [19:38:38] YuviPanda: hi [19:38:39] I was just guessing and it was right [19:38:39] I can't login to tools-exec-1208, which doesn't help [19:38:39] yeah me neither [19:38:49] but there is a `sort` job running, yes [19:38:57] its jobid is 2735480 [19:39:14] YuviPanda: currently trying to upload things from labs.... and here is what I get on scp "214.1KB/s" :( [19:39:23] I deleted it [19:39:29] YuviPanda: ok, good [19:39:31] hey Kelson [19:39:45] Kelson: we're in the middle of toollabs maintenance - can you file a bug and we can take a look at it after? sorry!
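The `db_load` confusion above comes down to argument order: `-f` names the text dump, and the target database file is a positional argument, so shell redirection has no business being there. Also shown is the `strace -f` suggestion from just before, for watching which spool files the master's children open (the process name and syscall filter are assumptions):

```
# Wrong: the redirection creates an empty ./sge before db_load even runs
db_load -f /root/sge > sge
# Right: load the dump in /root/sge into the database file ./sge
db_load -f /root/sge sge

# Follow children (-f) of the master and watch for spool file opens
strace -f -p "$(pgrep -o sge_qmaster)" -e trace=open,openat 2>&1 | grep spooldb
```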
[19:40:04] YuviPanda: OK, I will [19:40:11] thanks [19:40:42] valhallasw`cloud: again, this time on bastion [19:41:01] killed it [19:41:08] it was someone doing zim stuff [19:41:19] chasemp: today has been more whack-a-mole than usual [19:41:37] * YuviPanda will keep an eye on iftop on labstore now [19:41:44] the good news is that gridengine has been running OK for 10 mins now ;-) [19:41:52] woo! [19:41:57] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1970995 (10Kolossos) Is it true that "/tmp/" is locally on a server randomly chosen by jstart? So how can I use these files later in the next job to import them to the database? Why... [19:42:56] 10Labs-Other-Projects: Succesful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#1971000 (10Steko) No, moderators have quite limited rights, they are more similar to wiki Administrators than sysadmins and have no contro... [19:45:05] 6Labs: New instance fails to mount /home - https://phabricator.wikimedia.org/T124957#1971005 (10jkroll) 3NEW [19:46:26] valhallasw`cloud: 10 minutes? you are the best ^^ [19:49:25] 6Labs: New instance fails to mount /home - https://phabricator.wikimedia.org/T124957#1971046 (10yuvipanda) 5Open>3Resolved a:3yuvipanda This is because the nfs-exports service on labstore was dead for 3 days. I've started it now, and your next puppet run should give you /home. [19:49:45] valhallasw`cloud: what do you make of the 'unable to find job' messages [19:49:51] I'm not sure [19:49:59] jobs that had state change during the downtime I think [19:50:08] or something along those lines? [19:51:01] YuviPanda: it's something that 'just happens' and has been happening every few minutes since 2015 [19:51:06] sept 2015* [19:51:11] oh in that case [19:51:13] i.e. the beginning of messages [19:51:14] pffffffff [19:51:24] so I wouldn't worry too much [19:52:01] 6Labs, 10Labs-Infrastructure: Labs bandwidth is aleatory/low - https://phabricator.wikimedia.org/T124960#1971067 (10Kelson) 3NEW [19:52:59] the dumps that the save_sge.sh script makes are huge [19:52:59] 6.6G sge_maint_01272016 [19:53:54] chasemp: accounting file [19:54:00] ahhh [19:54:05] yes [19:54:21] honestly maybe we should truncate that [19:54:29] opening a file that big on nfs all the time [19:54:31] hm [19:55:13] it's also not very useful to have the log going back that far, tbh [19:57:26] YuviPanda: https://github.com/valhallasw/son-of-gridengine/blob/d1673d47d84fa526548657ed8e0771fd1f5cac26/source/daemons/qmaster/sge_follow.c#L1018 [19:57:30] that's where the error comes from [19:57:42] so something with immediate jobs (-now y) that do not get scheduled [19:58:19] fun [19:58:34] hmm I never did actually instrument qsub so can't look for immediate job invocations [20:03:48] valhallasw`cloud: chasemp at what point would you feel comfortable calling the load/dump a success?
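Truncating the accounting log, which happens a few minutes later, amounts to rotating one flat file while the master is stopped. A sketch, with the path assumed from the Debian gridengine layout:

```
service gridengine-master stop
cd /var/lib/gridengine/default/common      # assumed location of the accounting file
mv accounting accounting.01272016          # set the old 6.6G log aside
service gridengine-master start            # the master begins a fresh accounting file
```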
[20:03:59] well, at 30m I'm cautiously optimistic [20:04:25] last time I restarted at 12:23 [20:04:27] and then we saw it at [20:04:31] 12:44 [20:04:46] this time I started at [20:04:47] Wed Jan 27 19:34:37 UTC 2016 restore of sge db chase [20:04:48] it is now [20:04:52] Wed Jan 27 20:04:49 UTC 2016 [20:05:00] ok [20:05:18] I'd suppose we can watch carefully for another 30min [20:05:20] I'm also stracing to see what's really going on [20:05:30] and I have to periodically wipe the file due to low disk space [20:05:35] fyi :) [20:05:47] fun :D [20:08:53] valhallasw`cloud : someone *may* have pointed out another way of achieving the same thing as my big query with a much smaller (and faster) one - but i've examined the first 10 results (over 3000+) for arwiki, all false positives so far ^^' [20:09:41] i'll try and see how so many false poz came up with the new query when my headache recedes, so possibly not before tomorrow [20:20:49] valhallasw`cloud: YuviPanda it has been an hour [20:20:58] \o/ [20:21:05] I'm not sure if it's gone [20:21:12] but we can't prove it's there still for now [20:21:12] :) [20:21:18] and next steps are more radical [20:21:24] and probably best not today [20:21:26] thoughts? [20:21:32] +1 [20:21:38] I would like to truncate that accounting log [20:21:43] assuming by next step you mean 'get it out of NFS' [20:21:46] so we aren't opening a 6G log every time from NFS [20:21:52] ha, well sure [20:22:31] ah [20:22:37] yeah, truncating seems like a good idea [20:23:37] done [20:23:45] would you mind just running through a test job quick? [20:23:46] cool [20:23:48] yea [20:23:48] before we call it [20:24:42] !log tools master stop, truncate accounting log to accounting.01272016, master start [20:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:25:03] chasemp: wfm [20:25:18] I did a webservice restart + check, a jsub + check and a bunch of qstats [20:25:25] cool [20:28:44] YuviPanda: mind responding to that thread on labs-l? can't seem to find it atm [20:30:26] chasemp: yeah let me do that [20:32:46] valhallasw`cloud: man, seriously, thanks again -- the etherpad was super helpful [20:33:08] I would tally the beers we owe you but I only have a 64 bit register [20:36:32] chasemp: hahaha [20:37:10] I'm mostly just happy the issues seem resolved now :-) [20:49:28] 10PAWS, 7Upstream: PAWS cron functionality - https://phabricator.wikimedia.org/T124972#1971301 (10jayvdb) 3NEW [21:10:41] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971392 (10valhallasw) Yes, /tmp/ is locally on the server (that's the point -- you don't want to hit the NFS server). You can either combine the jobs (so that they run on the... [21:10:50] chasemp: YuviPanda so I figured out why templatetiger's sort was so bad ^ [21:23:31] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971454 (10valhallasw) Yeah, `/data/project/templatetiger` is full of `sort` temp files (`sortXXXXXX`), totalling about 9GB from the last run. `/tmp` is 'only' 15GB, so you mig... [21:41:58] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971499 (10valhallasw) For the record, I sent an e-mail on 7 Jun 2015 to the entire group of maintainers: > Hello Templatetiger maintainers, > > One of your SGE jobs, > > `so...
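The templatetiger fix being worked out in those Phabricator comments: GNU `sort` spills temporary merge files to disk, and in this case they were landing in the tool's NFS home. `-T` points them at node-local disk instead (the input filename here is made up):

```
mkdir -p /tmp/sort
sort -T /tmp/sort -o sorted.txt /data/project/templatetiger/huge-input.txt
```

Per valhallasw's comment, the last run produced about 9GB of temp files against a 15GB /tmp, so local disk space on the exec node is the remaining constraint.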
[22:31:01] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971680 (10Kolossos) I will try now -T /tmp/sort. So the temporal dir should be local tmp . I code this stuff, often after my regular job. So sometimes I'm fine if things se... [22:34:40] 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1971715 (10chasemp) 3NEW [22:58:22] YuviPanda: look at tools-exec-1208 would you [22:58:24] weird job there [22:58:31] * YuviPanda looks [22:58:31] 28168 be/6 tools.te 0.00 B 99.95 M 0.00 % 60.31 % sort /data/project/temp~6-01-11.txt -T /tmp/sort [22:58:35] hammering nfs I think [22:58:47] chasemp: that's templatetiger again [22:58:56] I killed it before and it came right back [22:59:18] chasemp: the user is probably actively testing it [22:59:30] chasemp: https://phabricator.wikimedia.org/T124822#1970995 [22:59:36] sure but they are sucking the life out of nfs for all [22:59:40] yup [23:00:12] I'm killing it again now [23:00:22] what can we do so this stops happening atm? [23:00:41] I can probably disable user access to that tool [23:00:59] same person, what, 4x or 5x today [23:01:02] yes [23:01:41] YuviPanda: am I right that the 'Shared Storage' section on pages like this does nothing? https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProject&action=configureproject&projectname=testlabs [23:02:03] andrewbogott: you are right [23:02:09] should die [23:02:10] ok, going to rip it out then [23:02:27] (and maybe that whole page) [23:02:32] chasemp: so we can disable that tool for now but I'm not sure what exactly to tell the user [23:02:42] if they don't have access to it they can't find alternate ways to do what they want to do [23:02:47] can they do this on the -dev host not on nfs? [23:02:48] but I guess disable first and then talk is an ok thing [23:02:51] or get a vm? [23:02:56] probably [23:03:05] yeah they are literally putting nfs in danger [23:03:14] I'm going to disable them now [23:03:23] and then we can figure out what to do/say I guess [23:03:29] ok [23:03:40] I want to help them out but we gotta stop the madness first [23:07:36] !log tools removed all members of templatetiger, added self instead, removed active shell sessions [23:07:41] chasemp: valhallasw`cloud ^ [23:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [23:08:15] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971921 (10yuvipanda) This killed NFS again just now, so I've temporarily disabled the tool. [23:08:47] YuviPanda: uh [23:09:06] That's kolossos trying with /tmp instead of NFS as swa [23:09:42] So they are working on fixing [23:09:55] So removing all members is probably overreacting [23:10:05] valhallasw`cloud: I wasn't sure how exactly to disable access [23:10:08] outside of that [23:10:12] oh god, for a second there i felt awful for sorting a 100KB file. multi GB files ?! >_< [23:10:29] Anyway, sleeping now [23:10:35] valhallasw`cloud: it was killing NFS every time they tried so not much else I could do [23:10:39] I'm writing a longer comment now [23:13:02] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1971943 (10yuvipanda) >>! In T124822#1971680, @Kolossos wrote: > I will try now -T /tmp/sort. So the temporal dir should be local tmp .
> I code this stuff, often after my r... [23:13:51] chasemp: valhallasw`cloud ^ left a comment there [23:21:01] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971958 (10Andrew) 3NEW a:3Andrew [23:21:39] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971967 (10yuvipanda) Kill it. Stop offering service groups to non-tools projects - nobody uses them, and people only create them accidentally and then forget they exist. [23:27:13] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971995 (10scfc) IIRC, currently service groups are the only way to get credentials for the replica servers. [23:28:12] 6Labs: Figure out what to do about servicegrouphomedirpattern - https://phabricator.wikimedia.org/T125002#1971997 (10yuvipanda) Service groups in tools only - the replica server credential generator doesn't operate for any other project. It also requires you have NFS which is also not there by default for new pr... [23:30:56] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#1972018 (10chasemp) Just wanted to say #2 above should be pretty straightforward and is meant for this kind of case. Thanks :) We are in #wikimedia-labs on irc if you have... [23:37:51] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other, 10DBA: tools.tools-info credentials are not functioning - https://phabricator.wikimedia.org/T105911#1972042 (10yuvipanda) 5Open>3Resolved Done and documented [23:39:16] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other, 10DBA: tools.tools-info credentials are not functioning - https://phabricator.wikimedia.org/T105911#1972047 (10yuvipanda) Documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Regenerate_replica.my.cnf, didn't paste last time [23:52:56] YuviPanda: disabling puppet for a few on tools-bastion so I can poke at a diamond collector there w/o my changes going away [23:53:12] if you need to, remove it; it's just so I can poke somewhere with actual usage, I'm a bit confused by something [23:53:26] chasemp: kk.