[00:00:08] valhallasw`cloud: but it's been more than that for a while now and obv it hasn't picked up. [00:00:16] I'm not sure if the heartbeat file is the only thing [00:00:23] also [00:00:25] because from strace, shadow is talking to master as well [00:00:27] does it show the same on both servers? [00:00:29] teh ts [00:00:39] nfs and all but sure [00:00:55] !log tools kill -9'd gridengine master [00:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:01:04] valhallasw@tools-grid-shadow:/data/project/.system$ ls -lh gridengine/qmaster/heartbeat [00:01:04] -rw-r--r-- 1 sgeadmin sgeadmin 6 Feb 28 2014 gridengine/qmaster/heartbeat [00:01:09] ....uh [00:01:18] valhallasw`cloud: that's wrong file. it's apparently in 'common/default' [00:01:22] ah [00:01:24] valhallasw`cloud: there's for some reason two sets of everything there [00:01:32] ffs [00:01:51] am watching shadow [00:02:13] there's also spooldb.new [00:02:16] which I don't think is 'new' [00:02:33] in another 30s if shadow doesn't pick up I'm going to manually start master there [00:02:34] YuviPanda: technically were we still having nfs issues when you rebooted? [00:02:45] chasemp: I think we were in the clear [00:02:52] chasemp: I could 'ls' on NFS before reboot [00:03:40] ok starting [00:03:48] !log tools attempting to start gridengine-master on tools-grid-shadow [00:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:03:54] lol it doesn't [00:04:16] now to hunt around for where this logs anything [00:04:17] YuviPanda: so it read heartbeat [00:04:22] and then did nothing [00:04:34] https://www.irccloud.com/pastebin/AxNToSeL/ [00:04:35] ^ [00:04:55] valhallasw`cloud: was that master on -shadow? [00:05:25] no, shadowd [00:05:26] sgeadmin 28567 0.0 0.0 19804 1248 ? S Dec03 0:30 /usr/lib/gridengine/sge_shadowd [00:05:48] right [00:05:53] but I explicitly started master [00:06:19] oh [00:06:48] and that didn't 'take [00:06:50] ' [00:07:08] I suppose it logs on NFS [00:07:11] I'm sorry, but I'm now really off to bed :( [00:07:58] yeah, should all be in /var/spool/gridengine/qmaster = /data/project/.system/gridengine/spool/gridengine/qmaster [00:08:02] yeah, 'tis ok [00:08:24] sorry, data/project/.system/gridengine/spool/qmaster [00:08:34] fwiw -rw-r--r-- 1 sgeadmin sgeadmin 6 Dec 29 23:33 /var/spool/gridengine/qmaster/heartbeat [00:08:36] yeah am looking there. no logs [00:08:44] chasemp: yup that's mounted on NFS [00:08:52] didn't we say that was the wrong heartbeat? [00:08:54] !log tools attempt to stop shadowd [00:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:09:03] chasemp: nope, the wrong heartbeat is elsewhere! [00:09:49] I'm rebooting shadow too [00:09:58] ok [00:10:00] !log tools reboot tools-grid-shadow [00:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:10:12] chasemp: the wrong one was in /data/project/.system/qmaster [00:13:28] ok, I’m an hour behind on the backscroll here (was distracted by CI) is there a quick summary or should I read everything? [00:13:41] andrewbogott: grid engine masters are not acting as master [00:13:46] qstat type commands hang [00:13:52] rebooted master vm to no effect [00:13:55] ok [00:14:28] so tools-grid-shadow reboot isn't done. 
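A quick sanity check for this stage of the debugging, sketched from the paths that come up in this log: read act_qmaster to see which host clients will try to contact, and look at the heartbeat file in the qmaster spool, which a live master is supposed to keep bumping (that counter is what shadowd watches). The SGE_ROOT path below is the one used on Tool Labs here; another install may lay things out differently.

    # Sketch: confirm which host the cell thinks is the qmaster and whether a
    # live master is still bumping the heartbeat counter (paths as in this log).
    SGE_ROOT=/data/project/.system/gridengine
    cat "$SGE_ROOT/default/common/act_qmaster"     # host that clients will contact
    ls -l --time-style=full-iso "$SGE_ROOT/spool/qmaster/heartbeat"
    cat "$SGE_ROOT/spool/qmaster/heartbeat"        # counter; should keep increasing while a master is alive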
it isn't going down even [00:14:35] at least from what I can see in the console on wikitech [00:14:39] I can hard-reboot it [00:14:44] and see what happens [00:15:20] $ qstat [00:15:20] error: commlib error: got select error (Connection refused) [00:15:20] error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error [00:15:24] that? [00:15:27] yeah [00:16:58] YuviPanda: is tools-grid-shadow still refusing to reboot? [00:17:37] no it came back up [00:18:55] nothing in the logs at all [00:19:01] the master started on -shadow [00:20:38] so we think master is there now? [00:20:51] are nodes still looking for tools-grid-master.tools.eqiad.wmflabs? [00:21:02] qstat hangs… I don’t know how things are supposed to know where the new master is [00:21:34] yes [00:21:34] root@tools-grid-shadow:/home/yuvipanda# cat /data/project/.system/gridengine/default/common/act_qmaster [00:21:34] tools-grid-shadow.tools.eqiad.wmflabs [00:21:35] ah [00:21:35] theoretically they're supposed to use that file [00:21:35] to figure out where qmaster is [00:22:09] and the service is live, I can telnet to it from tools-login [00:22:34] right [00:23:14] lots of [00:23:14] tools-grid-shadow nslcd[1136]: [a1c464] error writing to client: Broken pipe [00:23:19] mind if I restart nslcd? [00:23:24] those are known red herrings [00:23:26] I believe [00:23:28] ok [00:23:39] nslcd chokes are large ldap groups [00:23:44] it's been there for awhile and often [00:23:48] am stracing again [00:23:52] to see if it's reading the right response [00:24:29] nope [00:24:31] stat 21737 tools.wikibugs 3u IPv4 24797524 0t0 TCP tools-bastion-01.tools.eqiad.wmflabs:40638->tools-grid-shadow.tools.eqiad.wmflabs:sge-qmaster (ESTABLISHED) [00:24:33] it's talking to the right host alright [00:24:54] so on tools-bastion [00:24:56] does qstat hang? [00:24:58] yeah [00:25:13] 21737 read(3, "tools-grid-shadow.tools.eqiad.wmflabs\n", 1048576) = 38 [00:25:15] new strace just finished [00:25:59] what's the master service [00:26:02] anyway to start in teh fg [00:26:05] and try to use qstat [00:26:08] and see what it spits out [00:26:11] I don't get the logging situation here [00:26:20] yeah, the master immediately forks and daemonizes [00:26:32] don't think it can be started in fg, but should look at man-page [00:26:47] what process / service [00:26:54] gridengine-master [00:27:32] yeah [00:27:53] is there just no logfile at all? [00:28:13] so [00:28:16] the init seems to want [00:28:23] usr/sbin/sge_qmaster [00:28:25] which is a symlink to [00:28:32] ../share/gridengine/gridengine-wrapper [00:28:36] which isn't what I see in ps? [00:28:43] sgeadmin 2501 1 3 00:17 ? 00:00:24 /usr/lib/gridengine/sge_qmaster [00:28:45] theoretically in /data/project/.system/gridengine/spool/qmaster/messages [00:28:47] am I crazy here [00:29:33] 12/29/2015 22:08:21| main|tools-grid-master|E|jvm thread is not running [00:29:40] what's w/ the timestamps? [00:29:57] and is all this legacy then? [00:29:57] 12/29/2015 22:01:27|listen|tools-grid-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-lighttpd-1403.eqiad.wmflabs") [00:30:07] that's from a few hours ago, right? [00:30:14] also does it really fucking format timestamps in US notation? 
jfk [00:30:20] ok so looking at the strace [00:30:26] I see that the master is actually responding [00:30:34] (/data/project/wikibugs/fuck2) [00:30:50] so tools-grid-shadow as the master [00:30:50] it responds with [00:30:52] is responding to what? [00:30:53] 2145143465235642201741not available [00:30:56] to qstat [00:30:59] hi guys [00:31:00] from tools-bastion-01 [00:31:00] someone could give me the tool youtube2commons link, please [00:31:01] from where? [00:31:15] from tools-bastion-01 [00:31:24] fwiw I dropped NEWWed Dec 30 00:30:23 UTC 2015 [00:31:25] if you look at the strace file and start reading from 2145143465235642201741not available [00:31:26] into [00:31:27] err [00:31:35] data/project/.system/gridengine/spool/qmaster/messages [00:31:38] to see new entries [00:31:44] wtf, clipboard [00:31:48] fuck [00:31:54] what now? [00:31:59] my clipborad doesn't work [00:32:02] unrelated to actual otuage [00:32:06] am trying to paste an IP [00:32:12] andrewbogott: does the qstat you were trying work now? [00:32:15] 10.68.18.210 [00:32:28] still hangs [00:32:32] The_Photographer, maybe https://tools.wmflabs.org/video2commons/index.py [00:32:40] if you look at the strace file and start reading from that ip [00:32:47] you can see the xml exchange between client and server [00:32:54] which is all mysterious xml [00:33:26] andrewbogott: what are you try exactly and where from? [00:33:34]
233
binmessage abs\" comp=\"qmaster\" id=\"1\">0 [00:33:38] disabled [00:33:43] pengo: "video2commons" is not approved as a Connected App. Contact the application author for help. [00:33:59] chasemp: tools.morebots@tools-bastion-01:~$ qstat [00:34:18] The_Photographer, full list is here https://tools.wmflabs.org/ [00:34:19] is it a perms thing then [00:34:24] as yuvi is trying as root? [00:34:42] fwiw so far no logging period [00:34:44] all users are allowed to use qstat [00:34:45] where I know to look [00:34:58] well sure but what's different between yours and his essentially [00:35:44] chasemp: mine too hangs, except with strace it's obvious that network communication does happen [00:35:49] sorry that wasn't clear. [00:35:52] the qstat doesn't work [00:35:56] oh ok I meant, do things work [00:35:56] just that network communicaton happens [00:35:57] right ok [00:37:12] qstat still erroneous [00:40:26] !log copied and cleaned out spooldb [00:40:27] copied is not a valid project. [00:40:37] !log tools copied and cleaned out spooldb [00:40:37] is not a valid project. [00:40:43] !log tools copied and cleaned out spooldb [00:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:40:56] !log tools restarted master on grid-master [00:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:45:49] YuviPanda: have you used qping ever? Or is it known not to work? [00:46:53] have never used it [00:47:22] I can’t get it to tell me anything other than ‘connection refused' [00:47:23] i just pushed a patch for the labstore servers which should help with the ksoftirq issues, but should wait for later [00:47:43] YuviPanda: you ahve some logs now :) [00:47:45] grep -A 300 'NEWWed Dec 30 00:30:23 UTC 2015' /data/project/.system/gridengine/spool/qmaster/messages [00:47:48] unsure if useful [00:48:05] your cleanup maybe caused an issue [00:48:09] or surfaced an existing one [00:48:10] chasemp: yeah that's me messing around with the berkleydbs. the spooldb had a fuckton of files, so I was wondering if that's the problem [00:48:15] YuviPanda: do you know what ports the master and exec nodes should be listening on? [00:48:20] chasemp: but I moved all the existing required files and am at couldn't open berkeley database "sge": (22) Invalid argument [00:49:50] YuviPanda: https://serverfault.com/questions/522314/how-do-i-recover-a-berkeley-db-included-in-a-sun-grid-engine-installation/522333 [00:50:13] db_recover -c [00:50:15] :D am on the same page, reading the link [00:50:20] heh [00:50:29] might be the same reason why torrus always freezes [00:50:54] and now my shell's frozen [00:51:59] YuviPanda: so neither host is running a master now, and that’s on purpose, correct? 
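The strace-based checking above (confirming that a hanging qstat really reaches the intended master on port 6444 and sees a reply) can be captured in one self-contained command. A sketch, with the trace written to /tmp so the hung client can simply be cut off by timeout:

    # Sketch: trace a hanging qstat for 30s, keeping only network and I/O
    # syscalls; the trace shows which host/port it connects to and what comes back.
    timeout 30 strace -f -tt -s 256 \
        -e trace=connect,sendto,recvfrom,read,write \
        -o /tmp/qstat.trace qstat
    grep -E 'connect\(|6444' /tmp/qstat.trace | head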
[00:52:28] root@tools-grid-master:/data/project/.system/gridengine/bin/lx24-amd64# ls | grep recover [00:52:41] mark: so far there was a bug in wikitech/ldap that was issuing dupe id's, that lead to an openldap patch, which seems to have taken out ldap to some degree which caused nfs to go nuts which caused the usual chaos and somewhere in there gridengine died [00:52:42] andrewbogott: no, it's because theyre failing since my attempt-to-cleanup-spooldb-files [00:52:55] ah, ok :( [00:53:00] right [00:53:33] my patch is just to fix something I noticed on when glancing over labstore1001, and may help in the future [00:53:53] gridengine seems not to recover and no one knows why, sure was just catching you up as I wasn't sure how much you had seen [00:54:02] yeah thanks [00:54:10] bdb corruption sounds entirely likely [00:54:36] no that's a result of yuvi trying to clean up a seemingly insanely large spool backlog [00:54:48] I imagine to see if that was causing gridengine to choke [00:55:02] yeah so when I 'reverted' to the backed-up bdb files gridengine master starts [00:55:08] and proceeds to say absolutely nothing [00:55:13] am trying to increase debug levels now [00:55:38] ok [00:55:45] > SGE_DEBUG_LEVEL [00:55:51] > If set, specifies that debug information should be written to stderr. In addition the level of detail in [00:55:53] which debug information is generated is defined. [00:56:00] doesn't provide values, but let me try! [00:56:11] great stop the deamon and try running in console to catch a failing qstat [00:56:22] yeah [00:56:44] reading docs on gridengine is like traveling back in time [00:57:00] using nfs is like... [00:57:20] LDAP, NFS and GridEngine outage [00:57:24] I suppose we're in late 90s? [00:57:38] if you were to substitute LDAP by NIS... [00:57:44] > illegal debug level format [00:57:48] thank you, gridengine [00:57:56] now to find out wtf are legal debug levels [00:58:39] YuviPanda: http://stackoverflow.com/questions/312378/debug-levels-when-writing-an-application [00:58:46] https://blogs.oracle.com/templedf/entry/using_debugging_output [00:58:48] qmaster or other Grid Engine daemons keep crashing [00:58:48] Tell SGE daemons to not daemonize by setting the environment variable: SGE_ND [00:58:48] Start the SGE daemon from a shell. The daemon will now print debug information to stardard output. [00:58:49] 'tis crazy [00:58:50] Also, you may want to run the daemon under a debugger or strace (on Linux) to identify the location of the crash (and file a bug report!). [00:59:40] so that worked except it also managed to kill my terminal [01:01:32] oookay [01:02:03] i got it dumping debug info now! [01:02:08] except it's a *lot* of debug info [01:02:26] well, I guess try to do a qstat and see what you see [01:03:36] my battery is empty... [01:03:46] it seems to be [01:03:49] 3102 6625 trigger_thre <-- sge_copy_hostent() ../libs/uti/sge_hostname.c 659 } [01:03:51] 3103 6625 trigger_thre --> sge_copy_hostent() { [01:03:53] 3104 6625 trigger_thre 1 names in h_addr_list [01:03:55] 3105 6625 trigger_thre 0 names in h_aliases [01:03:57] 3106 6625 trigger_thre <-- sge_copy_hostent() ../libs/uti/sge_hostname.c 659 } [01:03:59] in a loop [01:04:03] dns related? [01:04:09] could this all be realted to funky hostname stuff [01:04:13] hey, qping works now! [01:04:15] ala those weird hacks in that static file [01:04:21] * andrewbogott is on his own little useless sidetrack [01:04:25] possibly. 
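Condensing the foreground-debugging approach quoted above into one hedged sketch: SGE_ND keeps sge_qmaster from daemonizing and SGE_DEBUG_LEVEL takes a space-separated list of per-layer levels (the value below is the one arrived at a little further down in this log). Depending on how the Debian wrapper normally sets things up, SGE_ROOT and SGE_CELL may also need to be exported first.

    # Sketch: start the master in the foreground with debug output on stderr,
    # teeing it to a local file so it survives a dying terminal.
    SGE_ND=1 SGE_DEBUG_LEVEL="3 0 0 3 0 0 3 0" \
        /usr/lib/gridengine/sge_qmaster 2>&1 | tee /tmp/qmaster.debug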
[01:04:36] good luck guys [01:04:43] because it also tries to contact teh master at foo.tools.eqiad.wmflabs [01:04:45] and I'm wondering [01:04:47] if that's meant to be [01:04:50] foo.eqiad.wmflabs [01:05:04] but nothing seems consistent [01:06:02] SGE_DEBUG_LEVEL="3 0 0 3 0 0 3 0" SGE_ND=1 sge_qmaster [01:06:04] is what I'm using [01:06:16] chasemp: so gridengine uses rdns for authentication [01:06:28] chasemp: and the aliases file is just telling it 'if rdns says this or this that is ok' [01:06:31] it could very well be related [01:06:48] the last time something like this happened was when our /etc/hosts file got too large for it [01:06:52] does it atuenticate by hostname? [01:07:13] yeah. [01:07:15] I'mc onfused on authenticate [01:07:18] that's...ok then [01:07:27] if you're thinking it is terrible [01:07:27] I can't hit ssh root@tools-grid-shadow.eqiad.wmnet now [01:07:28] then it is [01:07:32] chasemp: .wmflabs [01:07:40] silly fingers [01:08:02] did you take it back down or did it die? [01:08:19] oh, and now it’s back... [01:08:50] andrewbogott: yeah I Took it down and turned it back on with more logging [01:08:53] still useless [01:09:07] qping just says that it is in state ‘warning’ which we knew [01:09:10] running qstat [01:09:16] has basically no effect on the log output [01:09:18] I was hoping this would give us some kind of simple test case, but nope [01:11:16] well, andrewbogott want to try to call coren? I mean, kinda shitty but hey [01:12:01] sure [01:14:27] Left a message on his mobile, can’t remember if this home number is a real number or not... [01:15:52] YuviPanda: thoughts? [01:16:04] you know this beast best I guess what do you think [01:16:27] other than 'screwed'? :) it's mostly 'flail about until you hit something' atm [01:16:53] am going to see if I can read the bdb files [01:16:56] and see what's in there [01:17:00] so [01:17:04] teh master changed [01:17:04] tools-grid-master.tools.eqiad.wmflabs [01:17:06] at some point [01:17:35] yeah because I'm doing my restarts and debugging othe master [01:18:42] chasemp: I think he prefers people calling the landline, if you want to try that [01:18:56] stat("/var/lib/gridengine/default/common/sge_qstat", 0x7fff5e18d170) = -1 ENOENT (No such file or directory) [01:19:05] stat("/root/.sge_qstat", 0x7fff5e18d170) = -1 ENOENT (No such file or directory) [01:19:36] what's that from? master or client? [01:19:41] client [01:20:02] I think that's just a defaults file [01:20:53] Coren en route [01:22:16] YuviPanda: is /tmp/sge_messages where you’ve been logging? [01:22:17] what is [01:22:17] tools-grid-master.tools.eqiad.wmflabs. [01:22:24] andrewbogott: nope [01:22:28] connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("10.68.20.158")}, 16) = -1 EINPROGRESS (Operation now in progress) [01:22:30] ok, well, look there [01:22:32] that’s the log of last resort [01:23:05] maybe that’s nothing, it’s a few minutes old [01:23:16] yeah, could be my attempts to start it when it was failing [01:24:27] yeah so client looks at a bunch of config and resolve thigns and then says it's contacting what it read as the current master [01:24:35] and just times out until fault [01:24:44] YuviPanda: so where /are/ the logs that you’re seeing? [01:24:47] * Coren arrives [01:24:54] andrewbogott: just on my terminal, unfortunately. [01:25:02] What are the symptoms? 
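Since gridengine trusts reverse DNS plus the host-aliases file, and the earlier commlib error complained about a .tools.eqiad.wmflabs versus .eqiad.wmflabs mismatch, a loop like the following is one way to check every exec host for forward/reverse disagreement. Only a sketch: qconf -sel needs a responding master, so with the master down you would feed it a static host list instead.

    # Sketch: compare each exec host's name with what reverse DNS returns for
    # its IP; mismatches are exactly what commlib complains about.
    for h in $(qconf -sel); do
        ip=$(getent hosts "$h" | awk '{print $1; exit}')
        rdns=$(getent hosts "$ip" | awk '{print $2; exit}')
        [ "$h" = "$rdns" ] || echo "MISMATCH: $h -> $ip -> $rdns"
    done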
[01:25:03] Coren: sorry about your otherwise-peaceful evening :( [01:25:13] well Coren it's safe to say you we are grateful [01:25:23] tools master seems up but no commands work [01:25:26] qstat for example just hangs [01:25:33] we have tried rebooting and failing over masters etc [01:25:36] seems like the master is there [01:25:44] Networking? [01:25:47] clients say they connect retry until timeout [01:25:51] I can see it hit the box [01:25:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:00] Which is the current master? [01:26:17] Also, no NFS issues? [01:26:24] * Coren notes timeouts. [01:26:26] there were but have been ok for awhile [01:26:27] there were but not anymore. [01:26:30] so we think [01:26:34] but it does seem stable [01:26:40] Coren: the master is a process running on my shell on tools-grid-master [01:26:40] for a good while [01:26:41] Looks snappy. [01:26:55] YuviPanda: is debugging in console [01:26:58] no real wisdom so far [01:27:04] YuviPanda: I see it. Did you strace it yet? [01:27:08] bsaically, master doesn't act like master as it comes up [01:27:52] YuviPanda: I'm attaching to the running process, please not to kill. [01:28:02] home page is back, just another blip :/ [01:28:07] Coren: ok! [01:28:52] What the f? It's currently running like crazy reading the spool db, but there are *thousand* of transaction files/ [01:29:02] yeah I was trying to clear tha out [01:29:13] so cp'd the folder out, and copied back only the bdb files [01:29:26] no worky, it just said 'invalid argument' (you can see that in qmaster/messages) [01:29:26] YuviPanda: That won't work; you need to replay the transactions. [01:29:50] log.* are uncommited BDB transactions; they need to be replayed. [01:30:05] (I am guessing the grid master crashed leaving the db open) [01:30:14] ... there's 60G of them! [01:30:15] ooh, I see. I didn't realize that's what log.* files were [01:30:42] Holey sheets. Someone (accidentally?) DOS it it looks like. [01:30:48] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 973182 bytes in 4.782 second response time [01:31:16] YuviPanda: Stop the master; I'll fix the db [01:31:41] Hah. there are so many files ls fills my terminal backbuffer. [01:31:56] ok [01:32:15] Coren: done [01:32:23] From what I can see, gridengine would have recovered eventually; it was currently reading the transaction logs and (presumably) rebuilding - but at one file per 2-3 secs it would have taken a while. [01:32:56] I see. my strace just has a lot of '6767 gettimeofday({1451438873, 2967}, NULL) = 0' [01:33:09] Hm? [01:33:26] my strace of the master, that is [01:33:38] I got millions of pread() with the regular open() as it walked th db [01:34:25] haha that's really strange, I have 0 preads [01:34:26] YuviPanda: You made a copy of the DB directory? Do you still have that copy intact? [01:34:40] Coren: nope, I moved the copy back. [01:34:56] YuviPanda: It's possible that only one process will get the actual trace at a time; I was already attached to it. [01:34:58] Coren: everything is as it was when it started now [01:35:11] YuviPanda: kk. I'm making a backup; db_recover is one-way [01:35:36] ok [01:35:44] yeah I was going to run db_recover and couldn't find it at all [01:36:02] Actually, it would probably be *much* faster if you did the copy on the labstore so that I don't have to hit NFS twice [01:36:08] an example of accidental DOS would be e.g. submitting the same job over and over in a tight loop? 
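For reference, the BDB rescue order Coren describes above ("db_recover, db_recover -c, omgwtf, db_dump, db_load"), written out as a sketch. The spool path is an assumption pieced together from the names in this log, the binaries come from the db-util package, and because db_recover is one-way it should only ever be run with the master stopped and a backup copy in hand.

    # Sketch: Berkeley DB recovery, least to most drastic. Master must be stopped.
    SPOOL=/data/project/.system/gridengine/spool/spooldb    # assumed path
    cp -a "$SPOOL" "${SPOOL}.bak.$(date +%F)"               # db_recover is one-way
    cd "$SPOOL"
    db_recover -h . -v        # replay committed transactions from the log.* files
    db_recover -h . -c -v     # catastrophic recovery if plain recovery fails
    # Last resort: dump whatever is consistent and rebuild (repeat per db file,
    # e.g. sge and sge_job), then move the rebuilt file into place.
    db_dump sge > /tmp/sge.dump && db_load -f /tmp/sge.dump sge.new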
[01:36:20] Coren: let me do that [01:37:14] andrewbogott: I suppose, but I doubt just one loop of submission could cause that much db traffic [01:37:41] andrewbogott: I'm thinking the LDAP issues might have caused everything to try to reschedule the world at once. Like, maybe through cron, etc. [01:38:10] YuviPanda: db_recover doesn't seem to be there, it's in db-util and I don't expect that's part of standard or base [01:38:14] * Coren apt-gets it [01:38:29] ah db-util was the package [01:38:32] ok [01:38:51] YuviPanda: But you definitely had the right idea with a db_recover. [01:39:18] cp is still ongoing [01:39:20] YuviPanda: If that doesn't fix it, a db_dump / db_load would have done the trick. [01:40:14] gridengine is very resilient to the db rolling back to the past (which is what a dump/load would have done). Worst side effect is that there are some running jobs it doesn't know about so it complains and kills them. [01:41:19] ok I looked for the db_revoer stuff to no avail [01:41:20] so that explains [01:41:36] tx much and I'm slightly derailed here on minor irl emergency [01:41:46] DBD rescue normally goes db_recover, db_recover -c, omgwtf, db_dump, db_load. :-) [01:41:53] chasemp: go back to vacation, I'll call you back if needed here [01:41:53] BDB* [01:42:37] I've added a puppet patch to add db-util on all masters [01:44:28] how I could upload a video of 300 MB to commons? [01:44:28] The_Photographer: someone in #wikimedia-multimedia might know the answer better [01:44:47] * Coren takes a look at the log files to figure out what did so much things. [01:44:53] YuviPanda: thanks yuvi [01:45:32] let us know about, Coren [01:46:19] Oy. There's over 12000 of them. [01:46:21] what is "them"? [01:46:32] doctaxon: BDB log files. [01:51:07] So it would’ve recovered in just 10 hours [01:51:36] I'll probably have to db_archive the lot. [01:51:49] We'll see. I'll do the recover first. [01:53:20] Erm. NFS is being slow and stally. Copy still going, YuviPanda? [01:53:20] yes Coren [01:53:28] am copying it to spooldb.bak.30dec [01:53:35] kk [01:54:34] Coren: it's only about 10% done [01:54:43] > 12385 total files [01:54:47] 1554 [01:54:49] * Coren nods. [01:54:49] files copied [01:54:53] 60G [01:55:05] I wonder if tarring would make it go quicker [01:55:09] but probably not [01:55:11] it isn't that many files [01:56:32] Probably not, though copying it to some /other/ device might help. [01:56:41] aren’t they fairly sparse? Wouldn’t they compress a whole lot? [01:56:51] I guess tar.gzing 60 gb of files is also not quick [01:57:13] andrewbogott: Not BDB logs; but also, I don't think DBDs have been sparse in a while. [01:57:35] Besides, the 60G figure is from a 'du', so reflects allocated blocks not sum of sizes. [01:57:52] 1852 now [01:57:54] so about 15% [01:57:56] andrewbogott: do you mean, qstat and queue subst does work in at least 10 hours [02:04:17] Coren: are you doing the recovery now or are you waiting for the copy to finish? [02:04:19] YuviPanda: I think it's wiser to wait until the copy finishes. [02:04:19] * YuviPanda agrees [02:04:19] YuviPanda: I don't remember the db-utils ever /eating/ a DB but now is not a good time for firsts. [02:04:20] * YuviPanda nods too [02:04:20] it's going to be at least... 30 minutes? [02:04:20] for the cp to complete [02:04:20] that's really long time [02:04:20] sorry, but when are you thinking qstat is running again? [02:04:21] doctaxon: we don't have an ETA yet. 
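A sketch of the gentler copy discussed here: run it on the file server itself so the 60G is not dragged over NFS twice, send it to local disk, and drop it to idle I/O priority so it cannot starve the nfsd threads the way the first attempt did.

    # Sketch: idle-priority copy of the spool to local /tmp on the NFS server.
    SPOOL=/data/project/.system/gridengine/spool/spooldb    # assumed path, as above
    ionice -c3 nice -n19 cp -a "$SPOOL" "/tmp/spooldb.bak.$(date +%F)"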
[02:04:21] okay [02:04:21] that seems way too long for 60G [02:04:22] and stracing the cp hangs?! [02:04:22] wtf [02:04:22] YuviPanda: ... what? [02:04:22] NFS just died [02:04:23] * Coren headdesks. [02:04:23] During the copy? [02:04:23] copy is still running [02:04:23] since this is on the server [02:04:23] YuviPanda: The copy might be eating all the I/O bandwidth on that volume. [02:04:24] > strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted [02:04:24] yeah [02:04:25] am killing copy now [02:04:32] and going to make copy go to non NFS [02:04:54] might want to also make it ionice -c Idle too [02:05:21] yup [02:05:34] NFS just unstuck. I don't expect that was a coincidence. [02:06:08] yeah [02:06:18] am copying on to /tmp now [02:06:30] it has more than a T of free space [02:08:36] Going faster? [02:08:44] It's certainly easier on the volume itself. [02:09:27] yeah [02:09:34] Coren: it's already at 20% [02:11:56] 30% done [02:12:34] going to take a bathroom break, I'll brb shortly [02:16:56] YuviPanda: FYI, looking at the log files shows about 10M of transactions/second until 23:16 which I presume is when things just went boom [02:17:49] Ah, and it started at 21:48 [02:18:13] 99% of that 60G of transactions is between those two times. [02:18:39] Something definitely dos the living crap out of the master. [02:19:03] andrewbogott: How do those times line up with the LDAP issues? ^^ [02:20:04] nothing much happened at 21:48… it was between restarts [02:22:14] hcclab <-- culprit [02:25:04] That tool attempted to start ~2400 jobs/s [02:26:13] you think coincidentally/unrelated to ldap and nfs? [02:26:49] The timing annoys me, but I can't see why flaky ldap would get exactly one tool to go craycray [02:27:31] * Coren attempts to locate what runs under that UID where. [02:27:48] back [02:27:52] No crontab [02:27:54] Hm. [02:27:59] Coren: cp is done [02:28:14] YuviPanda: Allright; seeing what can be done to clean this. [02:29:06] \o/ [02:30:39] Fun fact: hcclab has a bit of 28 million entries in there. [02:30:54] ouch [02:31:22] Yeah; Ima thinking accidental DOS. Started at 21:48 and the master gave up the ghost at 23:16 (still not bad) [02:31:41] I'm going to disable that tool [02:31:54] and leave messages on the maintainers [02:32:24] db_archive would take hours to go through that crap. [02:32:59] can we selectively discard it? [02:33:44] To a point; I'm trying to figure out the exact boundaries and blow up the logs; from that I should be able to recover [02:34:11] We might loose a few legit transactions in the deal, but it's better than the alternative. [02:34:26] yeah, definitely [02:35:57] YuviPanda: Allright, I'll work from the end. Can you cp log.0000089* back to the spool from the backup? [02:36:07] sure [02:37:00] Coren: done [02:37:37] doing [02:39:35] !log tools remove lbenedix and ebekebe from tools.hcclab [02:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [02:41:57] Not recoverable; the whole thing has a tranaction around it. [02:42:00] @^#% [02:42:02] dump/load [02:42:17] dump is consistent. [02:43:15] fun [02:43:59] loading from dump [02:45:19] I'm reading through the bdb manual slowly and completely [02:45:57] 6Labs, 10Tool-Labs: cron fail on tools-submit - https://phabricator.wikimedia.org/T121305#1909516 (10russblau) No cron jobs have run for approximately the last six hours. 
The error messages I've been getting vary quite a bit; the most recent one was: ``` error: commlib error: got select error (Connection ref... [02:46:41] YuviPanda: Hah. The sge_job db just went from 39M to 4M [02:46:42] :-) [02:46:48] * Coren checks. [02:47:11] nice [02:47:20] Or. [02:47:25] Ow* [02:47:34] We did loose the job db. [02:47:42] It was not recoverable. [02:48:04] ow indeed [02:48:18] Coren: do you think it might recover if we let it churn through for hours? [02:48:18] Master is working, but all running jobs have been lost track of - it's going to be 'amusing' for a while. [02:48:41] Loosing the job DB isn't a catastrophe. [02:48:53] :D I guess it'll be okay-ish, since most things are on Cron. I guess continueous jobs will be lost? [02:49:59] YuviPanda: Yes; and it'll make things crappy for a while since the actual *jobs* are mostly still running. [02:50:09] gridengine lost track of them though. [02:50:33] right [02:50:42] should we like reboot all instances? [02:50:46] But there was no way to fix this that I can see; it would have taken at least 10h just to parse the transaction log with no guarantee that it would have worked at the end. [02:50:46] or something crazy like that? [02:50:56] You know, I think that's the best move. [02:51:06] It's ugly; but it guarantees a known stable state. [02:51:34] Coren: I wonder if we could've recovered the bdb files from the last tools snapshot [02:51:58] YuviPanda: That'd have been even worse: rather than have no job db you'd have all the wrong job db. :-/ [02:52:01] ah [02:52:03] fun [02:52:12] Coren: should I start rebooting the instances? [02:52:25] Lemme stop the master first. [02:52:42] Stopped. [02:52:48] ok [02:53:24] Now if you reboot all exec nodes, the net result will be tart once gridengine restarts it'll see everything empty and return with a blank slate. Then the manifest will kick the webservices back up. [02:53:43] right [02:53:45] ok [02:53:48] YuviPanda: are you just going to click on things on wikitech? Or do you want me to script something from labcontrol? [02:54:36] andrewbogott: was just trying to piece something together from labcontrol [02:54:43] andrewbogott: if you can do it that'd be wonderful [02:54:53] ok, hang on... [02:55:05] andrewbogott: basically, anything 'tools-exec' or anything 'tools-webgrid' needs to restart [02:55:19] andrewbogott: and I suppose we need to stagger the restarts to not kill the virt* hosts [02:55:23] which has happened before I think [02:55:29] but not tools-elastic? [02:55:32] andrewbogott: nope [02:55:37] andrewbogott: re your labs-l email; it wasn't 12000 job, it was 12000000 jobs. :-) [02:55:54] what about -web-static? [02:56:01] andrewbogott: nope [02:56:03] Coren: oh! [02:56:24] YuviPanda: and not -worker either, right? [02:56:34] andrewbogott: nope, not -worker either [02:56:50] afaik, all the exec nodes are either -exec-* or -webgrid-* [02:56:54] yup [02:57:44] ok, one reboot every 30 seconds? does that sound slow enough? [02:58:12] andrewbogott: yeah [02:58:19] ok, here we go [02:58:26] (ready?) [02:59:01] Coren, YuviPanda, ready for me to throw the switch? [02:59:12] andrewbogott: yup doit [02:59:27] Aye [02:59:34] ok [03:01:07] andrewbogott: Coren that's about... 25 mins? [03:01:09] I guess? 
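The reboot script being put together here boils down to something like the sketch below, assuming a configured OpenStack client on the control host; the instance-name patterns are the ones agreed above and the delay is the knob under discussion (30 seconds to start, dropped shortly after). The fancier per-hypervisor staggering would additionally need the host column from an admin-level server listing.

    # Sketch: reboot every tools exec/webgrid instance with a fixed delay
    # between reboots so the virt hosts are not hammered all at once.
    for i in $(openstack server list -f value -c Name | grep -E '^tools-(exec|webgrid)'); do
        echo "rebooting $i"
        openstack server reboot "$i"
        sleep 30    # tune down (e.g. 15) if the hypervisors cope
    done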
[03:01:20] hm, yeah [03:01:42] We might be more generous if we start as many on parallel hosts as possible [03:02:06] I think writing that script would take more than 25 minutes :) [03:02:13] yeah, ideally we want 'not more than 1 instance on host per 30s' but I don't think we can write that in 25min [03:02:15] hah :) [03:02:18] we should write that tho at some point [03:02:23] Yep, my thought exactly. [03:02:50] I’m going to drop it to 15 seconds, though, 30 is too slow [03:03:02] andrewbogott: yeah, ok [03:06:28] YuviPanda: can you write the ‘what just happened’ email? [03:06:38] andrewbogott: I just did, kindof. [03:06:45] we disabled the paging for tools-home and NFS services temp. this change can be reverted to enable it again https://gerrit.wikimedia.org/r/#/c/261610/1 [03:06:56] andrewbogott: hah, mid-air collision with your email [03:07:24] yeah, oops [03:10:04] * legoktm hugs everyone <3 [03:10:18] andrewbogott: virt hosts surviving the load? [03:10:29] seems ok so far [03:10:35] so this entire thing just happened to happen at the same time one errant tools ubmitted 12million jobs? [03:11:05] it’s not a great theory, but so far it’s the best one we have [03:11:07] entire thing -> LDAP and NFS [03:11:18] need to investigate the code of that tool and see if that can cause it to happen [03:11:19] I don't believe in coincidences. [03:11:22] vice versa could've also happened [03:11:28] in that the load from 1m jobs killed NFS [03:11:31] and LDAP was mostly unrelated [03:11:34] or [03:11:40] the load from 1m jobs killed NFS and LDAP [03:11:44] which is both totally possible [03:12:09] The jobs can't possibly have run- there is a hard per-uid limit [03:12:21] oh yeah, but the submission of the jobs themselves [03:12:27] bdb files are on NFS afterall [03:12:43] True; but afaict the LDAP issues started before that. [03:12:47] * Coren ponders. [03:13:10] exec nodes should all be back up [03:13:10] My guess: LDAP issues caused the tool to fail somehow, and the million submission hit NFS hard. [03:13:18] webgrid nodes in process [03:13:19] * Coren restarts the master [03:13:23] not [03:13:29] (sorry) [03:14:58] bigbrotherrc doesn't start jobs when they aren't running, does it? [03:15:12] it doesn't! [03:15:15] YuviPanda: No. [03:15:27] so that's 206 jobs we need to figure out how to start [03:15:46] I suppose I can just become the tool and exec that file [03:15:56] YuviPanda: It was intended to keep running jobs up; but if you rely on the .bigbrotherrc the actual invocation is in there [03:16:02] right [03:16:06] so I can just [03:16:21] sudo -u bash /data/project//.bigbrotherrc [03:16:25] if that file exists [03:16:29] and that should do the trick [03:17:15] Hm. iff there are no webservice invokations in there. You excised them all right? [03:17:21] yeah [03:17:29] webservice invocations are also harmless [03:17:33] True [03:18:17] two more minutes [03:21:28] ok [03:21:33] so as soon as we start the master back up [03:21:36] ok, now everything is rebooted [03:21:41] webservice jobs and cron will start flooding gridengine [03:21:46] Coren: sohuld we do it one at a time? [03:21:54] stop cron for a bit until webservices are done? [03:22:21] YuviPanda: I think we can rely on the load balancing from sge; it'll grind a bit at first but nothing should breath [03:22:24] !log tools stop cron on tools-submit, wait for webservices to come back up [03:22:32] Coren: better safe than sorry, I guess [03:22:36] Coren: wanna start the master back up? [03:22:53] It's up. 
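The "become the tool and exec the file" idea above, written out as a sketch. It assumes the usual Tool Labs naming (tool account tools.<name>, home /data/project/<name>) and skips empty .bigbrotherrc files, which per the discussion further down mean "stop restarting"; webservice lines inside are harmless, as noted above, but it is worth eyeballing a few files before looping over everything.

    # Sketch: re-submit continuous jobs by executing each tool's .bigbrotherrc
    # as that tool. Run from a host with job submission rights.
    for d in /data/project/*/; do
        rc="${d}.bigbrotherrc"
        tool="tools.$(basename "$d")"
        [ -s "$rc" ] && sudo -u "$tool" bash "$rc"
    done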
[03:23:19] ooook [03:23:47] With the jobs from when it went down; it'll take a bit to match up with the node status. [03:24:22] yeah [03:24:43] ooo [03:25:10] admin's back [03:25:12] Good news; the updates allowed it to notice continuous jobs that needed restarting because the restartable queues are handled separately. [03:25:22] oh? [03:25:58] Follow along on qhost -j [03:26:16] Probably not all of them, though. [03:28:39] hmm [03:28:40] Coren: true [03:29:16] Would this be a terrible time for me to go to dinner? I’m standing some people up :( [03:29:32] andrewbogott: No, go. AFAICT things are back up. [03:29:43] andrewbogott: hi5, I am too :D [03:29:55] * andrewbogott hi-5s yuvi [03:30:03] Need me further, guys? [03:30:14] YuviPanda: so maybe that means /you/ should go to dinner and I should restart jobs [03:30:31] Coren: I don’t think so. Thank you for coming to the rescue! [03:30:53] Coren: <3 thanks a lot [03:31:29] Heh, no worries guys. [03:31:36] YuviPanda: so, script to start all the big brother jobs, and then what? [03:31:39] then nothing? [03:31:55] andrewbogott: yeah, but right now - get on tools-services-02 and tail /var/log/upstart/webservicemonitor.log [03:31:58] and watch as it restarts things [03:32:03] when it's done restarting all the webservices [03:32:09] then start the cron service on tools-submit [03:32:27] and *then*the bigbrother part needs to hapen only for jobs taht haven't already been started, so that's a bit tricky [03:32:41] andrewbogott: but if you can do the first and second, that'll allow me toget dressed and out the door and I can look at #3 [03:33:03] first and second = waiting and then restarting cron service? [03:33:05] That I can do [03:33:15] what’s the cron service called? [03:33:22] andrewbogott: yes, looking at the webservice monitor and then wen it's done restarting cron [03:33:24] andrewbogott: cron [03:33:28] heh, ok :) [03:33:52] should I care about errors in webservicemonitor.log? [03:34:13] andrewbogott: only if all of them are erroring [03:34:31] many, but not all [03:35:25] andrewbogott: so the errors it'll try again on next round [03:35:31] ok, cool [03:46:42] "bash: cannot create temp file for here-document: No space left on device" [03:46:54] bd808: where’s that? [03:47:07] I'm getting that trying to tab complete on tools-bastions-02 [03:47:44] andrewbogott: back [03:47:58] well, back as in 'I am in a cab' [03:48:15] YuviPanda: webservices still starting, although most are failing [03:48:23] also, tools-bastion-02’s disk is full [03:48:27] I’m about to wipe out ‘atop’ logs [03:48:48] ok [03:49:05] * bd808 hates the damn atop logs [03:49:06] andrewbogott: I think at this point you can start cron and go. I'll take care of webservice logs [03:49:10] err [03:49:12] webservice errors [03:49:15] bd808: any better? [03:49:31] seems to be andrewbogott. thanks [03:49:49] YuviPanda: ok, started cron... [03:49:57] lemme find some more space on tools-bastion-02... [03:51:03] ok, done [03:51:23] So, YuviPanda, you can really handle this from your phone? [03:51:44] andrewbogott: am not on my phone :) [03:51:50] ok then [03:51:52] andrewbogott: I'm on my laptop + mifi [03:52:01] plus my hands are 'good enough' to not need the big keyboard [03:52:23] I’m going to head out — email me if I can do anything? I should be back at a laptop in 90 mins or so [03:52:35] andrewbogott: ok [03:52:41] thank you! 
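Side issue from above: the bastion ran out of disk and the suspected culprit is atop's binary history. A hedged sketch of the triage, assuming atop's default log location of /var/log/atop; adjust the retention to taste.

    # Sketch: see what is eating the filesystem, then prune week-old atop logs.
    df -h /
    du -xsh /var/log/* 2>/dev/null | sort -h | tail
    find /var/log/atop -name 'atop_*' -mtime +7 -print -delete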
[03:54:51] !log tools qmod -rj all tools in the continuous queue, they are all orphaned [03:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [04:06:12] !log tools delete all webgrid jobs to start with a clean slate [04:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [04:09:33] ok it's all good now [04:09:36] webservices recovering [04:09:38] foooood [04:09:40] brb [04:09:42] call me if anything happens [04:09:45] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [04:10:23] admin 503s [04:11:08] * Coren|Away restared it manually. [04:14:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 960405 bytes in 3.541 second response time [04:16:39] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909567 (10coren) >>! In T122638#1909554, @yuvipanda wrote: > I'm also wondering if more up-to-date backups of the job db would help. For backups to be useful for recovery of actually running jobs, it'd have to b... [04:19:29] 6Labs, 10Tool-Labs: mediawiki-mirror jobs generate grid errors (?) - https://phabricator.wikimedia.org/T122640#1909568 (10scfc) 3NEW [04:45:15] Can anyone bring https://tools.wmflabs.org/musikanimal/nonautomated_edits back up? [05:44:52] YuviPanda: I’m back, for a few minutes at least. Does anything need immediate attention? [06:03:00] andrewbogott: nope, all goodish [06:03:04] Zhaofeng_Li: I'll try [06:04:14] YuviPanda: Thank you! [06:04:23] Zhaofeng_Li: try now? [06:04:45] Hmm. Still "not currently serviced". [06:04:45] YuviPanda: great, then I’m going to sleep! [06:05:35] oh I see [06:05:39] Zhaofeng_Li: it was a custom webservice [06:05:43] let me bring that up [06:07:10] Zhaofeng_Li: ok it's running now [06:07:13] I suppose.. [06:07:41] YuviPanda: Nope, it's still down. [06:10:32] hmm [06:10:35] the process is running [06:10:38] I'll take a look in a bit? [06:12:22] YuviPanda: Okay, take your time. It's not urgent :) [06:12:44] it's a custom webserver, so needs more poking :) [06:12:46] thanks [07:13:58] trying to work out which table in wikidatawiki_p stores relationships like European mink (Q26559) instance of (P31) taxon (Q16521) [08:00:51] pengo: I think for those you should use query.wikidata.org [08:00:58] I don't think those relationships are in rdbms tables [08:01:11] pengo: and you can probably get better answers on #wikidata :) [08:21:07] YuviPanda, thanks.. and nope, asked there too and got nothing [08:21:39] probably too early in europe :) [08:21:50] i looked at that query language before but it scared me too much.. will try staring at it some more later [08:23:02] also have used a bunch of other wikibase db queries already so thought i should try to keep it consistant [11:19:32] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909769 (10valhallasw) Apparently hcclab has caused issues before: {T92614}, but it seems Ebekebe and Lbenedix were not contacted about this at that time. http://wm-bot.wmflabs.org/logs/%23wikimedia-labs/20150313... 
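On pengo's question above: claims such as instance-of are, as noted, not exposed in the wikidatawiki_p replica tables, which is why query.wikidata.org was suggested. A minimal sketch of that route, hitting the SPARQL endpoint from the shell for a few items that are instances of taxon (Q16521):

    # Sketch: ask WDQS for items with P31 (instance of) = Q16521 (taxon).
    curl -s -G 'https://query.wikidata.org/sparql' \
        --data-urlencode 'query=SELECT ?item WHERE { ?item wdt:P31 wd:Q16521 } LIMIT 5' \
        --data-urlencode 'format=json'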
[11:19:41] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ogmios was created, changed by Ogmios link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Ogmios edit summary: Created page with "{{Tools Access Request |Justification=I'd like to create maps for wikipedia using your instance of the openstreetmap database. |Completed=false |User Name=Ogmios }}" [11:26:26] Hi, is there anyone here who might be able to help us? We´re experiencing problems with https://nl.wikipedia.org/wiki/Wikipedia:Labels. I seem to be unable to connect to the server, and a user is getting the message ¨waiting for labels.wmflabs.org¨ which seems to be offline (error 502). Is it something that needs to be looked at, or do we just have to wait it out? [11:31:25] It seems to be working now. [11:31:38] Woodcutterty: yeah, I just restarted it [11:31:46] !log wikilabels restarted uwsgi on wikilabels-01 [11:31:49] Sweet. Thanks. [11:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master [11:31:58] Woodcutterty: np. thanks for reporting [11:35:10] YuviPanda: ah, still awake? [11:35:13] or awake again? [11:36:52] YuviPanda: is everything back to sort-of normal or is there still stuff that needs to be done? [11:40:15] valhallasw`cloud: everything is sort of normal except for some jobs that were continuous and had .bigbrotherrc won't be there [11:40:23] or rather, some continuous jobs were recovered, and some weren't [11:40:23] YuviPanda: ok [11:40:28] need to exec all .bigbrotherrc files [11:40:35] and make sure that they have -once set [11:40:37] or something like that [11:40:50] won't bigbrother just restart them anyway? [11:41:22] valhallasw`cloud: no since it doesn't start jobs if they aren't already started :D [11:41:28] ah [11:41:31] and hence is useless in our situation [11:49:34] W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/trusty-wikimedia/InRelease Unable to find expected entry 'thirdpartymain/source/Sources' in Release file (Wrong sources.list entry or malformed file) [11:49:53] does this look familiar to anyone? doing the last few salt updates on labs instances and I ran into about three with this problem [11:50:03] cleaning up /var/lib/apt/lists didn't help it any [11:50:12] apergos: yes, it looks familiar to me [11:50:22] apergos: look at /etc/apt/sources* [11:50:27] apergos: which instance is it? [11:50:35] I thought I had cleaned them all up but clearly not [11:50:36] YuviPanda: I'm also confused where the 12M jobs came from -- the hcclab maintainers haven't logged in to tools-astions this month [11:50:43] orch-client-02.orch.eqiad.wmflabs [11:50:50] there's a couple more too lemme check [11:51:43] metamorphosis-kafka-01.metamorphosis.eqiad.wmflabs and conf100.analytics.eqiad.wmflabs [11:51:59] wtf is orch [11:52:30] apergos: anyway, in cat /etc/apt/sources.list.d/wikimedia.list [11:52:34] deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdpartymain thirdpartymain thirdparty [11:52:37] it should be [11:52:41] yeah I see the thirdpartymain in there [11:52:58] I've fixed it there now [11:53:06] deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdparty [11:53:08] is what it should be [11:54:19] YuviPanda: Can you restart crosswatch? https://phabricator.wikimedia.org/T122585 [11:55:03] Glaisher: probably in about 5-10mins? 
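The apt fix above amounts to trimming the duplicated components back to the intended line. A sketch of a one-liner for any instances still carrying the broken entry (it keeps a .bak copy; hand-editing is just as good if the file differs from the example shown above):

    # Sketch: rewrite the wikimedia.list entry to the known-good form, then refresh.
    sed -i.bak 's|^deb http://apt.wikimedia.org/wikimedia trusty-wikimedia .*|deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdparty|' \
        /etc/apt/sources.list.d/wikimedia.list
    apt-get update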
[11:55:04] Glaisher: looking at it already [11:55:06] ah [11:55:09] valhallasw`cloud: is on it [11:55:11] <3 [11:55:12] :) [11:55:13] cool [11:55:25] but crosswatch is ... very non-standard, to put it mildly [11:55:46] but it might actually be a scheduling issue, it seems [11:56:45] thanks YuviPanda! [11:57:22] YuviPanda: uh, /usr/local/bin/tool-generic seems to be missing [11:57:24] valhallasw`cloud: most of that non-standardness is my fault tho [11:59:26] and wtf [11:59:27] -r-xr-xr-x 1 ssh-key-ldap-lookup ssh-key-ldap-lookup 723 Dec 17 18:53 tool-nodejs [11:59:28] ?! [11:59:42] why on earth is that owned by ssh-key-ldap-lookup [12:00:12] valhallasw`cloud: uh, is the tools-webservice package installed? [12:00:57] not sure [12:01:01] python-tools-webservice? [12:01:08] no, tools-webservice [12:01:23] because we have two packages?! [12:01:58] no, the first one was a 'shit I gotta rename this' it shouldn't exist anywhere... [12:02:13] I see [12:03:11] no, doesn't seem installed :/ [12:03:28] uhu [12:03:38] there's also toollabs-webservice [12:03:41] which is installed [12:04:09] lol [12:04:13] soooo.. my grid jobs seems to have grind to a halt.. should I kill and try to run again? [12:04:15] let me look at what the package name actually ended up being [12:04:46] YuviPanda: so toollabs-webservice doesn't contain any of that /usr/local/bin stuff [12:05:02] because that's from the old-old-webservice setup, I guess [12:06:32] > Source: toollabs-webservice [12:06:35] valhallasw`cloud: that's the actual name [12:06:37] it looks like [12:06:39] oh [12:06:40] YuviPanda: so what I don't get [12:07:04] crosswatch just uses a service.manifest web: generic thing [12:07:24] so it must be your code that calls out to /usr/local/bin/tool-generic [12:07:32] valhallasw`cloud: aaaaah, I see the confusion [12:07:35] there's no tool-generic [12:07:42] the tool-* shell stuff isn't used by webservice-new [12:08:03] those are just implemented as python classes [12:08:16] valhallasw`cloud: the only two scripts are webservice-new and webservice-runner [12:08:20] then why is webservicewatcher trying to run it?! [12:08:32] because webservicewatcher needs a patch [12:08:35] that I never wrote [12:08:56] right [12:09:00] webservice-new restart fixes it [12:09:21] well, it starts the job, anyway [12:09:34] right, and webservicemonitor doesn't know about webservice-new at all [12:09:38] and we're back [12:10:23] YuviPanda: ok, so I'm also completely confused by the gazillion webservice things [12:10:43] I thought it was rewritten once, and webservice was now what webservice-new once was, but apparently not [12:11:11] valhallasw`cloud: right, so the original webservice wasn't in python (I think?), so the rewrite was to make it python [12:11:27] valhallasw`cloud: but it still was a loose collection of things that just invoked each other via shell [12:11:41] YuviPanda: no, don't tell me now, write it down somewhere where we can find it in two months' time [12:11:45] valhallasw`cloud: webservice-new was written to bring it all into a python module, and what not. 
[12:11:55] valhallasw`cloud: and *then* the big 3day labs NFS outage happened [12:12:07] valhallasw`cloud: and this entire thing got dropped on the floor while that was fixed [12:12:16] valhallasw`cloud: I think there's a bug with all this info [12:12:19] valhallasw`cloud: let me find it [12:13:48] 6Labs, 10Tool-Labs: Unify / simplify webservice code - https://phabricator.wikimedia.org/T98440#1909788 (10yuvipanda) To recap, the code is mostly done, but the big NFS outage in June took vastly bigger priority over this. Just needs some finishing touches and support in webservicemonitor and then we can switc... [12:14:12] there [12:15:16] valhallasw`cloud: I think I'll pick that up this week/next-weekish [12:15:24] YuviPanda: that would be great [12:15:55] valhallasw`cloud: I think it just needs testing + webservicemonitor integration and then it's good to go, since testing should make sure it's fully b/c [12:15:55] 6Labs, 10Tool-Labs: Webservicemonitor doesn't understand webservice-new - https://phabricator.wikimedia.org/T122647#1909789 (10valhallasw) 3NEW [12:18:58] valhallasw`cloud: anything else for me to do before I go to sleep? [12:18:59] down to only a few instances I can't get cause no salt or ssh [12:19:00] nice [12:19:11] YuviPanda: no, thanks for your help [12:20:54] valhallasw`cloud: <3 thanks for taking care of it [12:22:49] "Continuous jobs will not be brought back up" mine are there … [12:23:16] Stigmj: yes, if they aren't that's probably a good idea [12:23:19] gifti: some survived [12:23:56] ah, ok [12:23:59] YuviPanda: I'm still confused by the bigbrother story. Why wouldn't it bring back up continuous jobs? [12:24:43] valhallasw`cloud: mostly the way it was written. [12:24:54] valhallasw`cloud: I think it looks at all running jobs first and then looks for their .bigbrotherrc files? [12:25:06] no [12:25:08] that's not what it does [12:25:29] no, it just reads all the files and restarts jobs that are not in there [12:25:32] at least, that's what I remember [12:25:57] right, but it had problems with times when a .bigbrotherrc file was removed and when you had 'clean slates' [12:26:08] I remember that I think, and I asked Coren|Away and he confirmed it [12:26:17] I just tried to read it and then my brain went 'splat' on the perl5 [12:26:18] yeah, it didn't understand 'empty file is stop restarting' [12:28:33] valhallasw`cloud: maybe I should rewrite it to be perl6? :D [12:28:36] * YuviPanda runs [12:28:43] it should actually be rolled into service.manifest too [12:28:46] YuviPanda: we're getting k8s, remember [12:28:53] all of that work was waylaid by the NFS outage [12:29:22] valhallasw`cloud: I was joking (re: perl6) [12:29:35] anyway, time to sleeeeeep [12:29:38] it's almost 5 [12:29:45] g'night! [13:17:40] Coren: Noooo. /o\ [13:17:58] 6Labs, 10Tool-Labs: Unify / simplify webservice code - https://phabricator.wikimedia.org/T98440#1909834 (10valhallasw) After thinking about this some more -- should we spend time on this, or is it better to document which webservices use 'generic' (i.e. which webservices are not restarted by webservicewatcher)... [13:18:20] Coren: Who will I go to when labs explodes now? :'( [13:18:29] a930913: I recommend Yuvi. [13:19:24] a930913: or andrew, or chase, or scfc, or me [13:19:39] but I'd suggest just using #wikimedia-labs and see who responds ;-) [13:19:47] valhallasw`cloud: Yeah, but none of you are Coren :p [13:20:13] ... [13:20:58] hehe [13:21:06] Coren: Thank you very much for all the support you gave me over the years. 
You will be sorely missed <3 [13:21:42] Heh, I'll still be around; just at more random times and durations. :-) [13:23:08] Coren: Like more English times? :p [13:23:52] I would have said "Yuvi times" :-P [13:24:42] Coren: Isn't that 12 hours offset from local or similar? [13:25:03] a930913: No, that was mostly an inside joke - sorry. [13:25:31] Oh, /me hasn't been around enough lately :( [13:25:59] Darn real world... It sucks. Nowhere near as good as the reviews said. [13:29:49] Coren: Off to pastures green then? [13:31:23] I suppose, yes. :-) [13:31:37] Though right now, winter means "white" more likely. :-P [13:31:59] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909844 (10coren) >>! In T122638#1909769, @valhallasw wrote: > What is unclear to me at this point is what commands were issued when and by whom. Was this a cronjob gone wrong? The tool owners haven't logged in to... [13:32:38] Coren: the crontab was empty when you checked it last night? [13:32:48] valhallasw`cloud: Yep. [13:32:53] weird. [13:33:01] did we save the bdb files? [13:33:11] was it perhaps related to the wrong uids being given out in that ldap bug or so? [13:33:33] valhallasw`cloud: I don't think yuvi deleted the backup we made, so they may still be around. [13:33:35] (I don't have any details) [13:34:18] mark: Well, the timing is suspicious but I'd be at a loss to make a sane hypothesis about why ldap being funky would make exactly /one/ tool go crazy. [13:34:26] ok [13:34:45] i've also just installed irqbalance on labstore1001, and later we'll enable Receive Packet Steering; should make the nfs server a bit more resilient under overload [13:35:16] * Coren googles receive packet steering. [13:35:28] https://gerrit.wikimedia.org/r/#/c/261598/ [13:36:05] we use that on LVS boxes, but haven't really tried it with this NIC chipset yet, so I'm a bit wary of deploying it right now [13:37:37] mark: RPS or RSS? The kernel doc I read says the former is a software implementation so should be chipset independent. [13:37:53] this only enables RPS [13:38:06] i wouldn't say anything involving hardware IRQs is fully chipset independent :) [13:38:23] https://www.kernel.org/doc/Documentation/networking/scaling.txt right? [13:38:33] yes [13:38:40] we can possibly try RSS later [13:41:13] Huh. Those are pretty old mechanisms, ca. 2.6.35. Never ran into them before. [13:42:44] Heh. Scaling code written by google dudes. Why am I not surprised. :-) [14:21:24] Coren: I'm starting to wonder if this is actually the march issue coming back to bite us somehow. The log.* files seem to be referring to tools-exec-09, tools-exec-13, etc?! [14:22:01] ( https://phabricator.wikimedia.org/T122638#1909769 ) [14:23:30] valhallasw`cloud: I don't think we can reach a more specific conclusion than "that tool is somehow broken and goes onto rampages" [14:24:20] I'm not convinced -- a tool whose owners have not logged in for a month, and which doesn't have a crontab? [14:24:31] Ah, yuvi took the tool over. [14:24:40] I checked the original owners [14:24:45] last did not have any recent login registered [14:24:53] on any of the tools bastions [14:25:52] valhallasw`cloud: What were the original maintainers? [14:26:01] Coren: Ebekebe and Lbenedix [14:27:18] Hm, it doesn't seem to have any actual work done - but afawk it may well have had a continuous job chugging along quietly. [14:28:14] * Coren needs to look at the logs-logs [14:28:42] * Coren does that now. 
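For context on the RPS change being discussed: per the kernel's scaling.txt linked above, enabling Receive Packet Steering at runtime comes down to writing a CPU bitmask into each receive queue's rps_cpus file. The sketch below uses placeholder interface and mask values; the actual production change went in through the puppet patch referenced above.

    # Sketch: allow the CPUs in MASK to process received packets for IFACE.
    IFACE=eth0    # placeholder
    MASK=ffff     # placeholder CPU bitmask
    for q in /sys/class/net/"$IFACE"/queues/rx-*/rps_cpus; do
        echo "$MASK" > "$q"
    done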
[14:35:31] I'm mostly wondering whether this was some kind of housekeeping job choking on the large number of jobs from back-then [14:36:50] I'm crunching the qacct logs now. [14:37:25] ok. I tried finding one of the hcclab jobids in there (8914870), but 'error: job id 8914870 not found' [14:37:49] of course, qacct only registers jobs that failed sensibly [14:39:33] Huh. Well, that tool has got 115112 seconds of CPU time logged, so it clearly ran. [14:40:00] * Coren tries to find specific jobs [14:40:16] *nod* let me grep the accounting file for hcclab [14:40:25] I'm already doing so atm [14:40:49] ok [14:42:04] Interesting. [14:42:35] So, it's not the same as the previous issue. That time, the jobs were running fairly often with high CPU and I/O requirements - that's the big load that was found and squashed. [14:42:36] anything more recent than march in there? [14:42:41] ah, ok [14:43:05] Then, nothing since. Except for the last entry, jobnumber 9100500, which is clearly broken and defective [14:43:25] But the record is broken enough that even the qsub_time is the epoch [14:43:54] iirc qsub_time epoch is what you get for a qmod -rj [14:44:07] although thta should be the end time, not the start time. [14:44:20] But the job itself, 'analyze_comments3' is one that ran in the past from that tool. [14:44:55] "failed 21 : in recognizing job" is about as useless and uninformative as it gets. [14:45:56] Oooh. [14:46:10] job ids rolled over. [14:47:34] The (normal) jobs immediately preceeding are 909xxxx in March. That 9100500 job should have been shortly after. [14:48:21] But right now the job IDs are in the very low range, so they rolled over from 9999999 recently. [14:49:01] yeah, there's something odd here. Jobids were in the 286000's at the beginning of dec, then back to 211535 the 21st [14:49:10] I don't know if it means anything, but perhaps there was a broken job entry for that tool in the DB and as a new job reached that ID rather than skip as it normally does it got confused? [14:50:17] I don't think we can know what happened for sure unless we read source a lot - no doubt the actual DB entries could tell us if we knew how to read them. [14:50:57] 0 2015-12-02 18:06:55 tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 321001 [14:50:57] 0 2015-12-02 18:33:58 tools-exec-1208.eqiad.wmflabs tools.geocommons update 9999894 [14:50:58] 0 2015-12-02 18:38:14 tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 1 [14:50:58] 0 2015-12-02 19:13:03 tools-exec-1220.tools.eqiad.wmflabs tools.cluebot cron-tools.cluebot-cbng_bot 1000 [14:51:06] last number is the jobid [14:51:17] and I'm only showing the first in every 1000 [14:51:36] so we jump from 321xxx to 9999xxx to 1 [14:52:03] Well, if there was over 12 million jobs started (or tried to start) then you'd expect a 10 million counter to roll over. [14:52:59] Though we don't know if they are all starts. "Activity". [14:54:36] Right. But it jumps by almost 10 million, but without anything showing for it in the accounting log, other than the jump in jobid [14:55:19] and this already happened 2 dec, so I find it weird we wouldn't have noticed it earlier [14:58:43] valhallasw`cloud: Perhaps it caused the same issue, but gridengine managed to survive it. [14:59:10] valhallasw`cloud: It may very well have been able to commit and flush the transactions to the DB but wasn't able to this time because ldap and NFS were not well. [14:59:32] * Coren shakes his head. 
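The accounting-file crunching in this part of the log can be reproduced with a short awk pass. This is a sketch only: the path is assumed from the cell layout used here, the field numbers (6 = job number, 11 = end time, per accounting(5)) are worth re-checking against the man page, and gawk is used for strftime.

    # Sketch: print every 1000th finished-job record (end time, host, owner,
    # job number) so jumps and rollovers in the job id sequence stand out.
    ACCT=/data/project/.system/gridengine/default/common/accounting
    gawk -F: 'NR % 1000 == 0 {
        printf "%s %s %s %s\n", strftime("%Y-%m-%d %H:%M:%S", $11), $2, $4, $6
    }' "$ACCT"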
[14:59:48] We just don't have enough information to make more than random stabs in the dark right now. [15:02:30] Coren: ah! found something [15:02:34] there was also a jobid reset at 2015-12-29 20:55:19 [15:03:34] So, basically the same thing. [15:03:46] Only this time the master didn't survive it? [15:03:53] we jump from 547758 to 9999897 then run over to 1 [15:04:53] As far as I can recall, to avoid races gridengine allocates a fresh jobid to any new attempt to start a job before it even decides if it will be allowable. [15:05:09] So failed job starts will make 'holes' in allocation. [15:05:32] yeah. And of course, the accounting log only shows completed jobs [15:06:13] Except for that one broken 9100500 job from that tool. [15:07:03] I'm actually happy we lost the job DB now; it would really seem as though there was a broken entry in it. [15:08:21] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909877 (10valhallasw) The issue seems to have coincided with a weird jump in jobids. This has happened before (at the beginning of december), and it looks roughly like this. (I'm using lighttpd-wikiviewstats beca... [15:11:13] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909882 (10coren) >>! In T122638#1909877, @valhallasw wrote: > which actually shows another interesting tidbit: there's 20 seconds between restarts, but between 547749 and 9999897 there's a two minute gap. That is... [15:11:28] Coren: yes, exactly [15:12:57] Now, to me, the interesting question is "did the gridengine itself go crazy trying to reschedule a broken job entry from its own database or did something external actually try to schedule those jobs"? [15:13:40] I'm leaning towards the former - I see no reason to believe the hcclab suddenly had phantom code running that wasn't there for months. [15:14:27] That it was hcclab-related is probably a symptom of its past (real) issue that may have left the broken entry in the first place. [15:14:40] Does that seem sane to you? [15:19:09] Coren, I'm having one hell of a time using mySQL workbench with the tools DB [15:20:06] Coren: yes, that sounds like an explanation, although it's not clear to me what we could have done/do in the future to prevent things like this [15:20:17] I get MySQL server has gone away after a minute of use. [15:20:59] Cyberpower678: set keep-alive to something lower? https://dev.mysql.com/doc/workbench/en/wb-preferences-sql-editor.html [15:21:41] valhallasw`cloud, that's never been a problem in the past. [15:31:43] valhallasw`cloud, Coren: why does the SQL server go away so quickly? [15:32:16] Cyberpower678: I don't know. I have given you a suggestion on how to solve the issue. Please try it. [15:32:57] In the time it takes to write a query, the server goes away. [15:34:17] valhallasw`cloud, I set it to 30 seconds as any longer seems to kill the connection. [15:34:43] valhallasw`cloud, that didn't fix it [15:34:48] It still went away [15:35:09] through which host are you tunneling? [15:35:19] tools-login or bastion.wmflabs? [15:35:30] tools-login [15:36:54] hm. So the ssh connection disconnects after 2 minutes or so, but that might be mysql workbench trying to reconnect [15:37:31] This hasn't been a problem in the past, and I didn't change anything. [15:42:13] Cyberpower678: odd. [15:42:56] hi. question... I seem to be missing replica.my.cnf in my home directory on toollabs... is it me doing something wrong or? [15:43:36] no, there should be one.
Please create a task in phabricator. [15:43:48] okay. will do. [15:45:13] valhallasw`cloud, any idea what's causing it. I'm going to need a reliable connection. [15:45:58] no, not really. The obvious culprits are the ssh connection timing out, or the mysql connection itself getting disconnected somehow [15:52:33] Cyberpower678: ok, I can reproduce it, but I don't have a clear argument for what's happening yet. [15:52:38] please create a bug? [15:52:45] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1909976 (10Wiki13) 3NEW [15:52:59] valhallasw`cloud, glad to know that I'm not going insane. :p [15:54:08] Ugh, [15:54:17] Now phabricator is bitching at me again. [15:54:46] I'm getting really frustrated here. [15:55:10] Why would phab dislike you? [15:56:14] It doesn't seem to be capable of keeping me logged in. [15:57:18] It seems to like to give me an invalid session error and throws me out. Then it doesn't let me log in. [15:58:10] Wiki13: I thought the replica.my.cnf for user accounts (as opposed to tool accounts) was no longer here since long ago [15:58:51] Coren, I found that waiting a bit seems to resolve the issue. [15:59:28] well, i asked zhuyifei1999_, and the admin of the toollabs I spoke to said it has to be there... so i don't think so [15:59:50] i looked around and other people have the file too, for me it's just missing [16:00:29] Well that home directory can also mean home directory for tool accounts [16:00:44] As $HOME points to them [16:01:56] my toolaccount has the file, my personal account doesn't [16:02:23] but that should become clear when reading the title of the bug report I just made [16:02:25] Also for me my /home/zhuyifei1999/replica.my.cnf is entirely useless. Those weren't updated in the pmtpa to eqiad migration [16:03:01] maybe report that as bug? [16:03:07] 6Labs, 10Tool-Labs: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1909989 (10Cyberpower678) 3NEW [16:03:13] valhallasw`cloud, ^ [16:03:54] Only replica.my.cnf of tool accounts had migrations. I thought connecting to replica via user account was no longer allowed [16:05:07] well if that's the case, the documentation on wikitech should be updated to reflect that [16:05:07] Coren: can you confirm this, or am I wrong? [16:05:29] Wiki13: which part [16:05:34] ? [16:05:43] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database [16:05:56] "Tool and Labs accounts are granted access to replicas of the production databases." [16:06:00] zhuyifei1999_: You're correct; given that quarry exists the use of per-user connections (for one-off queries) was mooted. [16:06:10] ah [16:06:10] So the documentation needs updatin' [16:06:25] that's clear, thanks Coren [16:06:38] will close the bug report then [16:06:43] as invalid [16:07:03] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1910005 (10Wiki13) 5Open>3Invalid [16:09:17] 10Tool-Labs-tools-Quentinv57's-tools: editcount tool gives a significant different/wrong number of edits - https://phabricator.wikimedia.org/T67741#1910008 (10Cyberpower678) 5Open>3declined [16:10:02] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13).
- https://phabricator.wikimedia.org/T122657#1910012 (10Wiki13) Invalid. Asked about this on IRC, and personal accounts are not used for database access anymore (you should use tool... [16:10:20] Hmm Wiki13 too long. My first skimming through the page doesn't find the section to update :/ [16:10:31] 10Tool-Labs-tools-Quentinv57's-tools: SULInfo feature request: sorting thru GET parameters or some other way - https://phabricator.wikimedia.org/T51624#1910013 (10Cyberpower678) 5Open>3declined [16:11:27] y zhuyifei1999... [16:11:44] Um never mind [16:11:45] maybe a notice of some sort at the top of the page [16:14:01] Ok updated :) [16:14:20] I see [16:14:23] thakns [16:14:29] thanks* [16:16:31] np [16:32:22] hello, I realize there are labs instability issues going on, but it seems that other webservices on tool labs are running, while mine keeps dying :( https://tools.wmflabs.org/musikanimal/ [16:32:44] MusikAnimal: What does the log say? [16:32:58] I'll attempt to reboot, just as I have in the past, and it will hang and eventually load the page but without any CSS. Then I hit refresh and it's offline again [16:33:01] the logs look normal [16:33:25] 2015-12-30 16:32:59: (configfile.c.957) source: /var/run/lighttpd/musikanimal.conf line: 554 pos: 15 parser failed somehow near here: (EOL) [16:33:26] the unicorn log suggests it booted up properly, and I even see the first HTTP GET request [16:33:32] That looks suspicious [16:33:58] huh, lighttpd is a webserver, right? [16:34:00] unicorn log? [16:34:02] I don't think I'm using that [16:34:06] yes unicorn web server [16:34:13] lol [16:34:25] hey, it's pretty awesome, not sure it's that common for non-Ruby apps tho [16:34:33] 24627 0.30000 lighttpd-m tools.musika r 12/30/2015 16:34:28 webgrid-lighttpd@tools-webgrid 1 [16:34:56] You might want to do a webservice stop if you don't want a webservice. [16:35:06] weird [16:35:13] You're clearly set up for it; and even got a configuration file for it [16:35:20] I got "Your webservice is not running" [16:35:33] with qstat I get `24357 0.30002 httpserver tools.musika r 12/30/2015 16:25:20 webgrid-generic@tools-webgrid-` [16:35:36] which looks correct [16:36:01] don't see lighttpd anywhere [16:36:06] (You have a .lighttpd.conf in your home, which you last edited Aug 13) [16:36:17] MusikAnimal: It keeps dying, that's the issue. [16:36:35] MusikAnimal: Because there is a bug in the config file. [16:36:42] ah [16:36:43] Do 'webservice stop' [16:37:15] that time it did something [16:37:21] After which, you'll probably have to restart the web server you /do/ want since the lighttpd will have taken the association on the proxy [16:38:00] You may want to remove the .lighttpd.conf in your tool's home regardless. [16:38:17] alright, restarting [16:38:51] can I just delete .lighttpd.conf and ensure it won't try to boot up by itself anymore? [16:40:06] MusikAnimal: Having done a successful 'webservice stop' will have it not try to start again, but removing the .lighttpd.conf is going to help make sure. [16:40:28] ok I shall delete it [16:40:48] (Sometimes, we rely on a configuration existing meaning you want a lighttpd running) [16:41:35] woohoo! we're back up and running :) thanks Coren [16:42:36] also very happy to report that my app stayed up and running for nearly a month and a half [16:42:43] with no issues [16:42:48] that's my longest streak I think [16:43:17] Nice.
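The lighttpd error quoted above names the exact config file it choked on, and lighttpd can re-run that parse on its own with its -t flag, which is a quick way to confirm a config bug like this one without going through the grid at all. A sketch using subprocess; the path is the one from the error message above.

    import subprocess

    CONF = "/var/run/lighttpd/musikanimal.conf"   # path taken from the parser error above

    # lighttpd -t parses the config and exits non-zero on a syntax error.
    proc = subprocess.run(["lighttpd", "-t", "-f", CONF],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    print(proc.stdout)
    print("config parses cleanly" if proc.returncode == 0 else "config has errors, see output above")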
[16:44:18] Ruby 2's garbage collection is pretty good, so I was pleased to see it never died due to memory, like xtools does 2-3 times a day [16:56:41] zhuyifei1999_: I [16:56:49] I'm pretty sure my user account still works [16:56:52] for mysql [16:59:16] valhallasw`cloud: I closed the bug report btw, because I have been told you should use your toolaccount for it. [17:01:04] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1910071 (10valhallasw) 5Invalid>3Open I'm pretty sure users are supposed to have a replica.my.cnf. For what it's worth, recently-adde... [17:14:23] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1910083 (10valhallasw) Judging from the 'client connections' panel, MySQL workbench opens four connections to the database server. Two of these disconnect after 30 se... [17:24:54] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1910096 (10valhallasw) As a workaround, click the 'reconnect to server' button (rightmost on the toolbar) before running the query. Not optimal, but at least you can... [17:25:12] Cyberpower678: ^. It's unclear to me why Workbench completely ignores keep-alives... [17:26:23] Then what's the purpose of setting it in the settings? And why is this a problem now? [17:26:30] It never was before. [17:26:40] I've been using Workbench for years now. [17:27:17] This makes my work impossible. [17:29:05] I just provided you a workaround, so 'impossible' is a bit [17:29:07] valhallasw`cloud, a new record. 15 seconds [17:29:21] You did? [17:29:21] bit over the top. 'Inconvenient', yes. [17:29:24] Where [17:29:37] In the bug. [17:30:23] I don't see it. [17:30:25] I'm pretty sure it's 30 seconds. See the Server > Client connections. [17:30:52] I set up an alter table query within 15 seconds, and hit execute and got an error. [17:31:19] Does the connection disconnect at 'Time = 15 seconds' for you, in that client connections panel? [17:31:23] Yet, if I persistently bombard the server with requests, the connection stays alive. [17:31:35] The what? [17:31:46] server > client connections [17:31:50] in mysql workbench [17:32:03] shows you the four connections it makes. Two of them time out after 30 seconds [17:32:20] How do I see that? [17:32:35] 18:31 server > client connections [17:33:41] I see -- Connection Id: 207907178 [17:33:41] -- User: s51059 [17:33:41] -- Host: 10.68.17.228:59467 [17:33:41] -- DB: None [17:33:42] -- Command: Query [17:33:42] -- Time: 0 [17:33:43] -- State: None [17:33:45] SHOW FULL PROCESSLIST [17:33:47] -- Connection Id: 207907199 [17:33:51] -- User: s51059 [17:33:53] -- Host: 10.68.17.228:59471 [17:33:55] -- DB: None [17:33:57] -- Command: Sleep [17:33:59] -- Time: 1 [17:34:01] -- State: [17:34:03] Whoops [17:34:09] I didn't know that would flood. [17:34:57] Yes. The 'Time' column is the relevant one [17:35:13] if you reconnect, you'll see four connections instead of two [17:35:24] two of them will disconnect when time is ~30 seconds [17:36:13] Yes I see them. [17:37:12] They disappear after thirty seconds [17:38:37] 6Labs, 10Tool-Labs: Not showing up as maintainer for several projects on toollabs - https://phabricator.wikimedia.org/T122456#1910104 (10Andrew) [17:39:04] valhallasw`cloud, I still don't see the workaround.
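The 'Client Connections' panel Workbench shows is essentially a view over SHOW PROCESSLIST, so the 30-second disconnects discussed above can be watched from any client. Below is a minimal sketch that does the same thing over an assumed tunnel (for example `ssh -L 3307:tools-db:3306 tools-login.wmflabs.org`), using credentials from the replica.my.cnf file discussed earlier; the port, tunnel target, and credential quoting are assumptions, not the actual setup from the log.

    import configparser
    import os
    import time
    import pymysql

    # Read user/password from the [client] section of replica.my.cnf.
    cnf = configparser.ConfigParser()
    cnf.read(os.path.expanduser("~/replica.my.cnf"))
    user = cnf["client"]["user"].strip("'\"")
    password = cnf["client"]["password"].strip("'\"")

    # Connect through the local end of the assumed SSH tunnel.
    conn = pymysql.connect(host="127.0.0.1", port=3307, user=user, password=password)

    for _ in range(12):   # watch for roughly two minutes
        with conn.cursor() as cur:
            cur.execute("SHOW PROCESSLIST")
            for conn_id, usr, host, db, command, idle, state, info in cur.fetchall():
                if usr == user:
                    print("connection %s: %s, idle %ss" % (conn_id, command, idle))
        print("---")
        time.sleep(10)

If your idle connections vanish once their Time column reaches about 30 seconds while this loop's own (active) connection survives, the drop is happening on the server or tunnel side rather than inside Workbench itself.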
[17:39:19] Cyberpower678: "As a workaround, click the 'reconnect to server' button (rightmost on the toolbar) before running the query." [17:40:29] I just refreshed the thread, it seems my cache decided to load an old copy of the bug. [17:40:37] So I didn't see your comment below. [17:40:42] Now I do. [17:40:43] Sorry [17:42:56] np [17:44:53] valhallasw`cloud, thank you for the workaround. That makes things simpler. [17:53:44] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910106 (10valhallasw) 3NEW [17:56:53] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910116 (10Andrew) 3NEW a:3MoritzMuehlenhoff [17:58:16] valhallasw`cloud: you got there first! [17:59:07] merged :-) [17:59:12] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910125 (10valhallasw) [17:59:13] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910126 (10valhallasw) [17:59:20] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910128 (10Andrew) Note that in our user record, the linux id is called 'uidNumber'. [17:59:43] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910130 (10Andrew) [17:59:44] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910129 (10Andrew) [17:59:49] aaaaaaah! ok, I'm going to remove my fingers from the keyboard now [18:00:15] andrewbogott: sorry! [18:00:26] no worries, I merged mine into yours [18:00:35] and I merged yours into mine [18:00:41] see where this goes wrong :D [18:01:00] crap [18:01:07] well, my battery is dead so I leave this in your capable hands. [18:01:09] back later! [18:01:14] ok! [18:02:32] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910131 (10valhallasw) 5duplicate>3Open Sorry for the back-and-forth spam! [18:41:24] 6Labs: Can't ssh to social-tools1 from bastion-01 - https://phabricator.wikimedia.org/T121313#1910164 (10ashley) 5Open>3Resolved a:3ashley This was fixed a while ago, thanks @yuvipanda and everyone else! [19:16:49] 6Labs, 10Tool-Labs: shinken does not warn about tools-grid-master puppet staleness - https://phabricator.wikimedia.org/T122667#1910179 (10valhallasw) 3NEW [19:18:21] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910187 (10valhallasw) 3NEW [20:05:02] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1910234 (10valhallasw) So.. good news and bad news. The good news: it seems we have a job explosion more often, and it doesn't typically bring down SGE. The bad news: it seems we have a job explosion more often....
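On the uidNumber tasks filed above: whatever mechanism ends up enforcing it (the proposed LDAP overlay or something else), the invariant being asked for is simply that no two user records share a uidNumber. Below is a client-side sketch of that check with the ldap3 library, not the overlay configuration itself; the server URI, bind, and base DN are placeholders, since the real Labs LDAP settings are not reproduced here.

    from collections import Counter
    from ldap3 import Server, Connection, ALL

    # Placeholder connection details; substitute the real directory and bind credentials.
    server = Server("ldap://ldap.example.org", get_info=ALL)
    conn = Connection(server, auto_bind=True)   # anonymous bind, purely illustrative

    conn.search("ou=people,dc=example,dc=org",
                "(objectClass=posixAccount)",
                attributes=["uid", "uidNumber"])

    counts = Counter(str(entry.uidNumber) for entry in conn.entries)
    duplicates = {num: n for num, n in counts.items() if n > 1}
    print("duplicate uidNumbers:", duplicates or "none")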
[20:22:02] !log ores create ores-cache-01 [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [20:23:18] !log ores Deployed with wb-vandalism:d940cea [20:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [20:33:57] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ogmios was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=243983 edit summary: [20:49:07] YuviPanda: where should I start debugging my python bot being able to connect to tools-elastic-* from tools-dev but not when running as a job on the grid? [20:53:42] hey bd808 [20:53:45] that's strange [20:54:01] bd808: I'd say start by sshing into a grid node and try just curling from there [20:54:13] *nod* [21:20:02] whaaa [21:22:05] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910336 (10yuvipanda) I think it can be safely re-enabled now. Am doing so and watching a run.. [21:22:39] still major changes/fixes happening? [21:23:30] no pengo [21:23:36] everything should be mostly ok [21:23:40] oh [21:23:46] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910337 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done [21:24:06] bd808: any luck? [21:24:55] YuviPanda: haven't gotten to it yet. Sidetracked on a php-yaml task I've been neglecting for a couple of weeks [21:25:27] I got an email about T122458 but can't view it.. how many people did it affect or was it just me? [21:26:21] bd808: ah ok [21:26:52] pengo: ah, it was a fair number of users unfortunately [21:26:56] been fixed now tho [21:28:06] oops i thought it was that but i just messed up my login [21:30:04] not your fault, pengo :) [21:30:42] yeah, it's too early :) [21:32:47] hey cool. i can finally write to my tool dir with my user account [21:33:05] :D [21:36:53] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910401 (10Wilfredor) 3NEW [21:37:28] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910401 (10Wilfredor) [21:37:40] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910409 (10yuvipanda) Can this be written as a tool instead? [21:40:15] sheesh what's hogging all the cpu on tools-bastion-02? [21:40:36] do we know if there's anything wrong with redis right now? I'm getting `Error connecting to Redis on tools-redis-01.tools.eqiad.wmflabs:6379` [21:44:43] YuviPanda: curl from tools-exec-1202 is working fine. I'll try running my bot again this evening and see if I can reproduce the problem I saw last night. [21:45:02] I should just be able to do `redis-cli` in the terminal to open a redis console, right? [21:45:24] 10MediaWiki-extensions-OpenStackManager, 10CirrusSearch, 6Discovery: Searching for "Hiera:" with namespace "Hiera" deselected still shows results in "Hiera:" - https://phabricator.wikimedia.org/T110377#1910438 (10Deskana) [22:00:07] going to assume there is a redis connectivity issue? because if not there's something wrong with my tools [22:43:44] MusikAnimal: hmm let me look [22:43:49] MusikAnimal: redis-cli -h tools-redis should work [22:43:52] MusikAnimal: wait [22:43:59] MusikAnimal: why are you connecting to tools-redis-01 directly? 
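For the tools-elastic question above, the "ssh into a grid node and try just curling from there" suggestion translates directly into a few lines of Python, which is convenient given the bot itself is Python. Everything below is an assumption for illustration only: the instance name, the port, and the use of plain HTTP.

    import socket
    import requests

    HOST = "tools-elastic-01.tools.eqiad.wmflabs"   # hypothetical instance name
    PORT = 80                                        # assumed port

    # Check raw TCP reachability first, then make an HTTP request if the socket opens.
    sock = socket.create_connection((HOST, PORT), timeout=5)
    sock.close()
    print("TCP connect to %s:%d OK" % (HOST, PORT))

    resp = requests.get("http://%s:%d/" % (HOST, PORT), timeout=5)
    print(resp.status_code, resp.text[:200])

Running the same snippet from tools-dev and from inside a grid job would show whether the difference is network reachability or something environmental in the job itself.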
[22:44:09] MusikAnimal: you should just use 'tools-redis' and it'll route to the appropriate instance [22:44:14] it's tools-redis-1001 right now for example [22:44:19] bd808: cool thanks [22:44:24] hmm [22:44:44] I suppose that explains it [22:45:05] let me try [22:50:06] that did it! the Redis gem I'm using has you pass in a host, I felt for sure that would entail some sort of something.something.something [22:50:35] it does seem to be a little slow, though [22:55:58] I guess it was tools-redis-01 for a really long time! [22:56:07] anyway thanks YuviPanda, tools are working now :) [22:59:04] it was tools-redis-01 for a really long time :D [22:59:06] MusikAnimal: np [23:13:04] YuviPanda|afk: thanks for solving puppet [23:16:54] YuviPanda|afk: is labstore network i/o (i.e. NFS usage) somewhere in graphite? I'd like to correlate it to the weird jumps in sge jobids [23:17:16] but graphite.wm.o only shows cpu and disk usage [23:18:00] hm, I can probably plot the sge master network usage [23:28:45] YuviPanda|afk: so actually tools-shadow had been running the show since 17/12 [23:30:45] http://i.imgur.com/p1Gdx1p.png [23:37:26] ok, bed time! [23:48:04] Is it possible to access the user-db's on tools-db via Quarry? [23:53:35] hi guys [23:54:52] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910651 (10Wilfredor) Yes yuvipanda, the right word is tool, thanks [23:55:15] it's me [23:56:11] Stigmj: no, I don't think so. That's physically a different database server, and there's no way to select a server
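To round off the redis thread above: the fix was simply to point clients at the service name rather than at a specific backend instance. A sketch of the same idea with the Python redis client (an assumption; the tool in the log uses a Ruby gem, but the host name handling is the same), including a hypothetical key purely to show a round trip.

    import redis

    # "tools-redis" routes to whichever backend instance is currently active,
    # so clients don't break when the instance behind the alias changes.
    r = redis.StrictRedis(host="tools-redis", port=6379, socket_timeout=5)
    print(r.ping())                 # True if the server behind the alias answers

    r.set("demo-key", "hello")      # hypothetical key, purely for illustration
    print(r.get("demo-key"))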