[00:00:08] valhallasw`cloud: but it's been more than that for a while now and obv it hasn't picked up. [00:00:16] I'm not sure if the heartbeat file is the only thing [00:00:23] also [00:00:25] because from strace, shadow is talking to master as well [00:00:27] does it show the same on both servers? [00:00:29] teh ts [00:00:39] nfs and all but sure [00:00:55] !log tools kill -9'd gridengine master [00:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:01:04] valhallasw@tools-grid-shadow:/data/project/.system$ ls -lh gridengine/qmaster/heartbeat [00:01:04] -rw-r--r-- 1 sgeadmin sgeadmin 6 Feb 28 2014 gridengine/qmaster/heartbeat [00:01:09] ....uh [00:01:18] valhallasw`cloud: that's wrong file. it's apparently in 'common/default' [00:01:22] ah [00:01:24] valhallasw`cloud: there's for some reason two sets of everything there [00:01:32] ffs [00:01:51] am watching shadow [00:02:13] there's also spooldb.new [00:02:16] which I don't think is 'new' [00:02:33] in another 30s if shadow doesn't pick up I'm going to manually start master there [00:02:34] YuviPanda: technically were we still having nfs issues when you rebooted? [00:02:45] chasemp: I think we were in the clear [00:02:52] chasemp: I could 'ls' on NFS before reboot [00:03:40] ok starting [00:03:48] !log tools attempting to start gridengine-master on tools-grid-shadow [00:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:03:54] lol it doesn't [00:04:16] now to hunt around for where this logs anything [00:04:17] YuviPanda: so it read heartbeat [00:04:22] and then did nothing [00:04:34] https://www.irccloud.com/pastebin/AxNToSeL/ [00:04:35] ^ [00:04:55] valhallasw`cloud: was that master on -shadow? [00:05:25] no, shadowd [00:05:26] sgeadmin 28567 0.0 0.0 19804 1248 ? S Dec03 0:30 /usr/lib/gridengine/sge_shadowd [00:05:48] right [00:05:53] but I explicitly started master [00:06:19] oh [00:06:48] and that didn't 'take [00:06:50] ' [00:07:08] I suppose it logs on NFS [00:07:11] I'm sorry, but I'm now really off to bed :( [00:07:58] yeah, should all be in /var/spool/gridengine/qmaster = /data/project/.system/gridengine/spool/gridengine/qmaster [00:08:02] yeah, 'tis ok [00:08:24] sorry, data/project/.system/gridengine/spool/qmaster [00:08:34] fwiw -rw-r--r-- 1 sgeadmin sgeadmin 6 Dec 29 23:33 /var/spool/gridengine/qmaster/heartbeat [00:08:36] yeah am looking there. no logs [00:08:44] chasemp: yup that's mounted on NFS [00:08:52] didn't we say that was the wrong heartbeat? [00:08:54] !log tools attempt to stop shadowd [00:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:09:03] chasemp: nope, the wrong heartbeat is elsewhere! [00:09:49] I'm rebooting shadow too [00:09:58] ok [00:10:00] !log tools reboot tools-grid-shadow [00:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:10:12] chasemp: the wrong one was in /data/project/.system/qmaster [00:13:28] ok, I’m an hour behind on the backscroll here (was distracted by CI) is there a quick summary or should I read everything? [00:13:41] andrewbogott: grid engine masters are not acting as master [00:13:46] qstat type commands hang [00:13:52] rebooted master vm to no effect [00:13:55] ok [00:14:28] so tools-grid-shadow reboot isn't done. 
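A quick sanity check for this stage of the debugging, sketched from the paths that come up in this log: read act_qmaster to see which host clients will try to contact, and look at the heartbeat file in the qmaster spool, which a live master is supposed to keep bumping (that counter is what shadowd watches). The SGE_ROOT path below is the one used on Tool Labs here; another install may lay things out differently.

    # Sketch: confirm which host the cell thinks is the qmaster and whether a
    # live master is still bumping the heartbeat counter (paths as in this log).
    SGE_ROOT=/data/project/.system/gridengine
    cat "$SGE_ROOT/default/common/act_qmaster"     # host that clients will contact
    ls -l --time-style=full-iso "$SGE_ROOT/spool/qmaster/heartbeat"
    cat "$SGE_ROOT/spool/qmaster/heartbeat"        # counter; should keep increasing while a master is alive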
it isn't going down even [00:14:35] at least from what I can see in the console on wikitech [00:14:39] I can hard-reboot it [00:14:44] and see what happens [00:15:20] $ qstat [00:15:20] error: commlib error: got select error (Connection refused) [00:15:20] error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error [00:15:24] that? [00:15:27] yeah [00:16:58] YuviPanda: is tools-grid-shadow still refusing to reboot? [00:17:37] no it came back up [00:18:55] nothing in the logs at all [00:19:01] the master started on -shadow [00:20:38] so we think master is there now? [00:20:51] are nodes still looking for tools-grid-master.tools.eqiad.wmflabs? [00:21:02] qstat hangs… I don’t know how things are supposed to know where the new master is [00:21:34] yes [00:21:34] root@tools-grid-shadow:/home/yuvipanda# cat /data/project/.system/gridengine/default/common/act_qmaster [00:21:34] tools-grid-shadow.tools.eqiad.wmflabs [00:21:35] ah [00:21:35] theoretically they're supposed to use that file [00:21:35] to figure out where qmaster is [00:22:09] and the service is live, I can telnet to it from tools-login [00:22:34] right [00:23:14] lots of [00:23:14] tools-grid-shadow nslcd[1136]: [a1c464] error writing to client: Broken pipe [00:23:19] mind if I restart nslcd? [00:23:24] those are known red herrings [00:23:26] I believe [00:23:28] ok [00:23:39] nslcd chokes are large ldap groups [00:23:44] it's been there for awhile and often [00:23:48] am stracing again [00:23:52] to see if it's reading the right response [00:24:29] nope [00:24:31] stat 21737 tools.wikibugs 3u IPv4 24797524 0t0 TCP tools-bastion-01.tools.eqiad.wmflabs:40638->tools-grid-shadow.tools.eqiad.wmflabs:sge-qmaster (ESTABLISHED) [00:24:33] it's talking to the right host alright [00:24:54] so on tools-bastion [00:24:56] does qstat hang? [00:24:58] yeah [00:25:13] 21737 read(3, "tools-grid-shadow.tools.eqiad.wmflabs\n", 1048576) = 38 [00:25:15] new strace just finished [00:25:59] what's the master service [00:26:02] anyway to start in teh fg [00:26:05] and try to use qstat [00:26:08] and see what it spits out [00:26:11] I don't get the logging situation here [00:26:20] yeah, the master immediately forks and daemonizes [00:26:32] don't think it can be started in fg, but should look at man-page [00:26:47] what process / service [00:26:54] gridengine-master [00:27:32] yeah [00:27:53] is there just no logfile at all? [00:28:13] so [00:28:16] the init seems to want [00:28:23] usr/sbin/sge_qmaster [00:28:25] which is a symlink to [00:28:32] ../share/gridengine/gridengine-wrapper [00:28:36] which isn't what I see in ps? [00:28:43] sgeadmin 2501 1 3 00:17 ? 00:00:24 /usr/lib/gridengine/sge_qmaster [00:28:45] theoretically in /data/project/.system/gridengine/spool/qmaster/messages [00:28:47] am I crazy here [00:29:33] 12/29/2015 22:08:21| main|tools-grid-master|E|jvm thread is not running [00:29:40] what's w/ the timestamps? [00:29:57] and is all this legacy then? [00:29:57] 12/29/2015 22:01:27|listen|tools-grid-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-lighttpd-1403.eqiad.wmflabs") [00:30:07] that's from a few hours ago, right? [00:30:14] also does it really fucking format timestamps in US notation? 
jfk [00:30:20] ok so looking at the strace [00:30:26] I see that the master is actually responding [00:30:34] (/data/project/wikibugs/fuck2) [00:30:50] so tools-grid-shadow as the master [00:30:50] it responds with [00:30:52] is responding to what? [00:30:53] 2145143465235642201741not available [00:30:56] to qstat [00:30:59] hi guys [00:31:00] from tools-bastion-01 [00:31:00] someone could give me the tool youtube2commons link, please [00:31:01] from where? [00:31:15] from tools-bastion-01 [00:31:24] fwiw I dropped NEWWed Dec 30 00:30:23 UTC 2015 [00:31:25] if you look at the strace file and start reading from 2145143465235642201741not available [00:31:26] into [00:31:27] err [00:31:35] data/project/.system/gridengine/spool/qmaster/messages [00:31:38] to see new entries [00:31:44] wtf, clipboard [00:31:48] fuck [00:31:54] what now? [00:31:59] my clipborad doesn't work [00:32:02] unrelated to actual otuage [00:32:06] am trying to paste an IP [00:32:12] andrewbogott: does the qstat you were trying work now? [00:32:15] 10.68.18.210 [00:32:28] still hangs [00:32:32] The_Photographer, maybe https://tools.wmflabs.org/video2commons/index.py [00:32:40] if you look at the strace file and start reading from that ip [00:32:47] you can see the xml exchange between client and server [00:32:54] which is all mysterious xml [00:33:26] andrewbogott: what are you try exactly and where from? [00:33:34]
233
binmessage abs\" comp=\"qmaster\" id=\"1\">0 [00:33:38] disabled [00:33:43] pengo: "video2commons" is not approved as a Connected App. Contact the application author for help. [00:33:59] chasemp: tools.morebots@tools-bastion-01:~$ qstat [00:34:18] The_Photographer, full list is here https://tools.wmflabs.org/ [00:34:19] is it a perms thing then [00:34:24] as yuvi is trying as root? [00:34:42] fwiw so far no logging period [00:34:44] all users are allowed to use qstat [00:34:45] where I know to look [00:34:58] well sure but what's different between yours and his essentially [00:35:44] chasemp: mine too hangs, except with strace it's obvious that network communication does happen [00:35:49] sorry that wasn't clear. [00:35:52] the qstat doesn't work [00:35:56] oh ok I meant, do things work [00:35:56] just that network communicaton happens [00:35:57] right ok [00:37:12] qstat still erroneous [00:40:26] !log copied and cleaned out spooldb [00:40:27] copied is not a valid project. [00:40:37] !log tools copied and cleaned out spooldb [00:40:37] is not a valid project. [00:40:43] !log tools copied and cleaned out spooldb [00:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:40:56] !log tools restarted master on grid-master [00:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:45:49] YuviPanda: have you used qping ever? Or is it known not to work? [00:46:53] have never used it [00:47:22] I can’t get it to tell me anything other than ‘connection refused' [00:47:23] i just pushed a patch for the labstore servers which should help with the ksoftirq issues, but should wait for later [00:47:43] YuviPanda: you ahve some logs now :) [00:47:45] grep -A 300 'NEWWed Dec 30 00:30:23 UTC 2015' /data/project/.system/gridengine/spool/qmaster/messages [00:47:48] unsure if useful [00:48:05] your cleanup maybe caused an issue [00:48:09] or surfaced an existing one [00:48:10] chasemp: yeah that's me messing around with the berkleydbs. the spooldb had a fuckton of files, so I was wondering if that's the problem [00:48:15] YuviPanda: do you know what ports the master and exec nodes should be listening on? [00:48:20] chasemp: but I moved all the existing required files and am at couldn't open berkeley database "sge": (22) Invalid argument [00:49:50] YuviPanda: https://serverfault.com/questions/522314/how-do-i-recover-a-berkeley-db-included-in-a-sun-grid-engine-installation/522333 [00:50:13] db_recover -c [00:50:15] :D am on the same page, reading the link [00:50:20] heh [00:50:29] might be the same reason why torrus always freezes [00:50:54] and now my shell's frozen [00:51:59] YuviPanda: so neither host is running a master now, and that’s on purpose, correct? 
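The strace-based checking above (confirming that a hanging qstat really reaches the intended master on port 6444 and sees a reply) can be captured in one self-contained command. A sketch, with the trace written to /tmp so the hung client can simply be cut off by timeout:

    # Sketch: trace a hanging qstat for 30s, keeping only network and I/O
    # syscalls; the trace shows which host/port it connects to and what comes back.
    timeout 30 strace -f -tt -s 256 \
        -e trace=connect,sendto,recvfrom,read,write \
        -o /tmp/qstat.trace qstat
    grep -E 'connect\(|6444' /tmp/qstat.trace | head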
[00:52:28] root@tools-grid-master:/data/project/.system/gridengine/bin/lx24-amd64# ls | grep recover [00:52:41] mark: so far there was a bug in wikitech/ldap that was issuing dupe id's, that lead to an openldap patch, which seems to have taken out ldap to some degree which caused nfs to go nuts which caused the usual chaos and somewhere in there gridengine died [00:52:42] andrewbogott: no, it's because theyre failing since my attempt-to-cleanup-spooldb-files [00:52:55] ah, ok :( [00:53:00] right [00:53:33] my patch is just to fix something I noticed on when glancing over labstore1001, and may help in the future [00:53:53] gridengine seems not to recover and no one knows why, sure was just catching you up as I wasn't sure how much you had seen [00:54:02] yeah thanks [00:54:10] bdb corruption sounds entirely likely [00:54:36] no that's a result of yuvi trying to clean up a seemingly insanely large spool backlog [00:54:48] I imagine to see if that was causing gridengine to choke [00:55:02] yeah so when I 'reverted' to the backed-up bdb files gridengine master starts [00:55:08] and proceeds to say absolutely nothing [00:55:13] am trying to increase debug levels now [00:55:38] ok [00:55:45] > SGE_DEBUG_LEVEL [00:55:51] > If set, specifies that debug information should be written to stderr. In addition the level of detail in [00:55:53] which debug information is generated is defined. [00:56:00] doesn't provide values, but let me try! [00:56:11] great stop the deamon and try running in console to catch a failing qstat [00:56:22] yeah [00:56:44] reading docs on gridengine is like traveling back in time [00:57:00] using nfs is like... [00:57:20] LDAP, NFS and GridEngine outage [00:57:24] I suppose we're in late 90s? [00:57:38] if you were to substitute LDAP by NIS... [00:57:44] > illegal debug level format [00:57:48] thank you, gridengine [00:57:56] now to find out wtf are legal debug levels [00:58:39] YuviPanda: http://stackoverflow.com/questions/312378/debug-levels-when-writing-an-application [00:58:46] https://blogs.oracle.com/templedf/entry/using_debugging_output [00:58:48] qmaster or other Grid Engine daemons keep crashing [00:58:48] Tell SGE daemons to not daemonize by setting the environment variable: SGE_ND [00:58:48] Start the SGE daemon from a shell. The daemon will now print debug information to stardard output. [00:58:49] 'tis crazy [00:58:50] Also, you may want to run the daemon under a debugger or strace (on Linux) to identify the location of the crash (and file a bug report!). [00:59:40] so that worked except it also managed to kill my terminal [01:01:32] oookay [01:02:03] i got it dumping debug info now! [01:02:08] except it's a *lot* of debug info [01:02:26] well, I guess try to do a qstat and see what you see [01:03:36] my battery is empty... [01:03:46] it seems to be [01:03:49] 3102 6625 trigger_thre <-- sge_copy_hostent() ../libs/uti/sge_hostname.c 659 } [01:03:51] 3103 6625 trigger_thre --> sge_copy_hostent() { [01:03:53] 3104 6625 trigger_thre 1 names in h_addr_list [01:03:55] 3105 6625 trigger_thre 0 names in h_aliases [01:03:57] 3106 6625 trigger_thre <-- sge_copy_hostent() ../libs/uti/sge_hostname.c 659 } [01:03:59] in a loop [01:04:03] dns related? [01:04:09] could this all be realted to funky hostname stuff [01:04:13] hey, qping works now! [01:04:15] ala those weird hacks in that static file [01:04:21] * andrewbogott is on his own little useless sidetrack [01:04:25] possibly. 
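Condensing the foreground-debugging approach quoted above into one hedged sketch: SGE_ND keeps sge_qmaster from daemonizing and SGE_DEBUG_LEVEL takes a space-separated list of per-layer levels (the value below is the one arrived at a little further down in this log). Depending on how the Debian wrapper normally sets things up, SGE_ROOT and SGE_CELL may also need to be exported first.

    # Sketch: start the master in the foreground with debug output on stderr,
    # teeing it to a local file so it survives a dying terminal.
    SGE_ND=1 SGE_DEBUG_LEVEL="3 0 0 3 0 0 3 0" \
        /usr/lib/gridengine/sge_qmaster 2>&1 | tee /tmp/qmaster.debug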
[01:04:36] good luck guys [01:04:43] because it also tries to contact teh master at foo.tools.eqiad.wmflabs [01:04:45] and I'm wondering [01:04:47] if that's meant to be [01:04:50] foo.eqiad.wmflabs [01:05:04] but nothing seems consistent [01:06:02] SGE_DEBUG_LEVEL="3 0 0 3 0 0 3 0" SGE_ND=1 sge_qmaster [01:06:04] is what I'm using [01:06:16] chasemp: so gridengine uses rdns for authentication [01:06:28] chasemp: and the aliases file is just telling it 'if rdns says this or this that is ok' [01:06:31] it could very well be related [01:06:48] the last time something like this happened was when our /etc/hosts file got too large for it [01:06:52] does it atuenticate by hostname? [01:07:13] yeah. [01:07:15] I'mc onfused on authenticate [01:07:18] that's...ok then [01:07:27] if you're thinking it is terrible [01:07:27] I can't hit ssh root@tools-grid-shadow.eqiad.wmnet now [01:07:28] then it is [01:07:32] chasemp: .wmflabs [01:07:40] silly fingers [01:08:02] did you take it back down or did it die? [01:08:19] oh, and now it’s back... [01:08:50] andrewbogott: yeah I Took it down and turned it back on with more logging [01:08:53] still useless [01:09:07] qping just says that it is in state ‘warning’ which we knew [01:09:10] running qstat [01:09:16] has basically no effect on the log output [01:09:18] I was hoping this would give us some kind of simple test case, but nope [01:11:16] well, andrewbogott want to try to call coren? I mean, kinda shitty but hey [01:12:01] sure [01:14:27] Left a message on his mobile, can’t remember if this home number is a real number or not... [01:15:52] YuviPanda: thoughts? [01:16:04] you know this beast best I guess what do you think [01:16:27] other than 'screwed'? :) it's mostly 'flail about until you hit something' atm [01:16:53] am going to see if I can read the bdb files [01:16:56] and see what's in there [01:17:00] so [01:17:04] teh master changed [01:17:04] tools-grid-master.tools.eqiad.wmflabs [01:17:06] at some point [01:17:35] yeah because I'm doing my restarts and debugging othe master [01:18:42] chasemp: I think he prefers people calling the landline, if you want to try that [01:18:56] stat("/var/lib/gridengine/default/common/sge_qstat", 0x7fff5e18d170) = -1 ENOENT (No such file or directory) [01:19:05] stat("/root/.sge_qstat", 0x7fff5e18d170) = -1 ENOENT (No such file or directory) [01:19:36] what's that from? master or client? [01:19:41] client [01:20:02] I think that's just a defaults file [01:20:53] Coren en route [01:22:16] YuviPanda: is /tmp/sge_messages where you’ve been logging? [01:22:17] what is [01:22:17] tools-grid-master.tools.eqiad.wmflabs. [01:22:24] andrewbogott: nope [01:22:28] connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("10.68.20.158")}, 16) = -1 EINPROGRESS (Operation now in progress) [01:22:30] ok, well, look there [01:22:32] that’s the log of last resort [01:23:05] maybe that’s nothing, it’s a few minutes old [01:23:16] yeah, could be my attempts to start it when it was failing [01:24:27] yeah so client looks at a bunch of config and resolve thigns and then says it's contacting what it read as the current master [01:24:35] and just times out until fault [01:24:44] YuviPanda: so where /are/ the logs that you’re seeing? [01:24:47] * Coren arrives [01:24:54] andrewbogott: just on my terminal, unfortunately. [01:25:02] What are the symptoms? 
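Since gridengine trusts reverse DNS plus the host-aliases file, and the earlier commlib error complained about a .tools.eqiad.wmflabs versus .eqiad.wmflabs mismatch, a loop like the following is one way to check every exec host for forward/reverse disagreement. Only a sketch: qconf -sel needs a responding master, so with the master down you would feed it a static host list instead.

    # Sketch: compare each exec host's name with what reverse DNS returns for
    # its IP; mismatches are exactly what commlib complains about.
    for h in $(qconf -sel); do
        ip=$(getent hosts "$h" | awk '{print $1; exit}')
        rdns=$(getent hosts "$ip" | awk '{print $2; exit}')
        [ "$h" = "$rdns" ] || echo "MISMATCH: $h -> $ip -> $rdns"
    done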
[01:25:03] Coren: sorry about your otherwise-peaceful evening :( [01:25:13] well Coren it's safe to say you we are grateful [01:25:23] tools master seems up but no commands work [01:25:26] qstat for example just hangs [01:25:33] we have tried rebooting and failing over masters etc [01:25:36] seems like the master is there [01:25:44] Networking? [01:25:47] clients say they connect retry until timeout [01:25:51] I can see it hit the box [01:25:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:00] Which is the current master? [01:26:17] Also, no NFS issues? [01:26:24] * Coren notes timeouts. [01:26:26] there were but have been ok for awhile [01:26:27] there were but not anymore. [01:26:30] so we think [01:26:34] but it does seem stable [01:26:40] Coren: the master is a process running on my shell on tools-grid-master [01:26:40] for a good while [01:26:41] Looks snappy. [01:26:55] YuviPanda: is debugging in console [01:26:58] no real wisdom so far [01:27:04] YuviPanda: I see it. Did you strace it yet? [01:27:08] bsaically, master doesn't act like master as it comes up [01:27:52] YuviPanda: I'm attaching to the running process, please not to kill. [01:28:02] home page is back, just another blip :/ [01:28:07] Coren: ok! [01:28:52] What the f? It's currently running like crazy reading the spool db, but there are *thousand* of transaction files/ [01:29:02] yeah I was trying to clear tha out [01:29:13] so cp'd the folder out, and copied back only the bdb files [01:29:26] no worky, it just said 'invalid argument' (you can see that in qmaster/messages) [01:29:26] YuviPanda: That won't work; you need to replay the transactions. [01:29:50] log.* are uncommited BDB transactions; they need to be replayed. [01:30:05] (I am guessing the grid master crashed leaving the db open) [01:30:14] ... there's 60G of them! [01:30:15] ooh, I see. I didn't realize that's what log.* files were [01:30:42] Holey sheets. Someone (accidentally?) DOS it it looks like. [01:30:48] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 973182 bytes in 4.782 second response time [01:31:16] YuviPanda: Stop the master; I'll fix the db [01:31:41] Hah. there are so many files ls fills my terminal backbuffer. [01:31:56] ok [01:32:15] Coren: done [01:32:23] From what I can see, gridengine would have recovered eventually; it was currently reading the transaction logs and (presumably) rebuilding - but at one file per 2-3 secs it would have taken a while. [01:32:56] I see. my strace just has a lot of '6767 gettimeofday({1451438873, 2967}, NULL) = 0' [01:33:09] Hm? [01:33:26] my strace of the master, that is [01:33:38] I got millions of pread() with the regular open() as it walked th db [01:34:25] haha that's really strange, I have 0 preads [01:34:26] YuviPanda: You made a copy of the DB directory? Do you still have that copy intact? [01:34:40] Coren: nope, I moved the copy back. [01:34:56] YuviPanda: It's possible that only one process will get the actual trace at a time; I was already attached to it. [01:34:58] Coren: everything is as it was when it started now [01:35:11] YuviPanda: kk. I'm making a backup; db_recover is one-way [01:35:36] ok [01:35:44] yeah I was going to run db_recover and couldn't find it at all [01:36:02] Actually, it would probably be *much* faster if you did the copy on the labstore so that I don't have to hit NFS twice [01:36:08] an example of accidental DOS would be e.g. submitting the same job over and over in a tight loop? 
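For reference, the BDB rescue order Coren describes above ("db_recover, db_recover -c, omgwtf, db_dump, db_load"), written out as a sketch. The spool path is an assumption pieced together from the names in this log, the binaries come from the db-util package, and because db_recover is one-way it should only ever be run with the master stopped and a backup copy in hand.

    # Sketch: Berkeley DB recovery, least to most drastic. Master must be stopped.
    SPOOL=/data/project/.system/gridengine/spool/spooldb    # assumed path
    cp -a "$SPOOL" "${SPOOL}.bak.$(date +%F)"               # db_recover is one-way
    cd "$SPOOL"
    db_recover -h . -v        # replay committed transactions from the log.* files
    db_recover -h . -c -v     # catastrophic recovery if plain recovery fails
    # Last resort: dump whatever is consistent and rebuild (repeat per db file,
    # e.g. sge and sge_job), then move the rebuilt file into place.
    db_dump sge > /tmp/sge.dump && db_load -f /tmp/sge.dump sge.new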
[01:36:20] Coren: let me do that [01:37:14] andrewbogott: I suppose, but I doubt just one loop of submission could cause that much db traffic [01:37:41] andrewbogott: I'm thinking the LDAP issues might have caused everything to try to reschedule the world at once. Like, maybe through cron, etc. [01:38:10] YuviPanda: db_recover doesn't seem to be there, it's in db-util and I don't expect that's part of standard or base [01:38:14] * Coren apt-gets it [01:38:29] ah db-util was the package [01:38:32] ok [01:38:51] YuviPanda: But you definitely had the right idea with a db_recover. [01:39:18] cp is still ongoing [01:39:20] YuviPanda: If that doesn't fix it, a db_dump / db_load would have done the trick. [01:40:14] gridengine is very resilient to the db rolling back to the past (which is what a dump/load would have done). Worst side effect is that there are some running jobs it doesn't know about so it complains and kills them. [01:41:19] ok I looked for the db_revoer stuff to no avail [01:41:20] so that explains [01:41:36] tx much and I'm slightly derailed here on minor irl emergency [01:41:46] DBD rescue normally goes db_recover, db_recover -c, omgwtf, db_dump, db_load. :-) [01:41:53] chasemp: go back to vacation, I'll call you back if needed here [01:41:53] BDB* [01:42:37] I've added a puppet patch to add db-util on all masters [01:44:28] how I could upload a video of 300 MB to commons? [01:44:28] The_Photographer: someone in #wikimedia-multimedia might know the answer better [01:44:47] * Coren takes a look at the log files to figure out what did so much things. [01:44:53] YuviPanda: thanks yuvi [01:45:32] let us know about, Coren [01:46:19] Oy. There's over 12000 of them. [01:46:21] what is "them"? [01:46:32] doctaxon: BDB log files. [01:51:07] So it would’ve recovered in just 10 hours [01:51:36] I'll probably have to db_archive the lot. [01:51:49] We'll see. I'll do the recover first. [01:53:20] Erm. NFS is being slow and stally. Copy still going, YuviPanda? [01:53:20] yes Coren [01:53:28] am copying it to spooldb.bak.30dec [01:53:35] kk [01:54:34] Coren: it's only about 10% done [01:54:43] > 12385 total files [01:54:47] 1554 [01:54:49] * Coren nods. [01:54:49] files copied [01:54:53] 60G [01:55:05] I wonder if tarring would make it go quicker [01:55:09] but probably not [01:55:11] it isn't that many files [01:56:32] Probably not, though copying it to some /other/ device might help. [01:56:41] aren’t they fairly sparse? Wouldn’t they compress a whole lot? [01:56:51] I guess tar.gzing 60 gb of files is also not quick [01:57:13] andrewbogott: Not BDB logs; but also, I don't think DBDs have been sparse in a while. [01:57:35] Besides, the 60G figure is from a 'du', so reflects allocated blocks not sum of sizes. [01:57:52] 1852 now [01:57:54] so about 15% [01:57:56] andrewbogott: do you mean, qstat and queue subst does work in at least 10 hours [02:04:17] Coren: are you doing the recovery now or are you waiting for the copy to finish? [02:04:19] YuviPanda: I think it's wiser to wait until the copy finishes. [02:04:19] * YuviPanda agrees [02:04:19] YuviPanda: I don't remember the db-utils ever /eating/ a DB but now is not a good time for firsts. [02:04:20] * YuviPanda nods too [02:04:20] it's going to be at least... 30 minutes? [02:04:20] for the cp to complete [02:04:20] that's really long time [02:04:20] sorry, but when are you thinking qstat is running again? [02:04:21] doctaxon: we don't have an ETA yet. 
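A sketch of the gentler copy discussed here: run it on the file server itself so the 60G is not dragged over NFS twice, send it to local disk, and drop it to idle I/O priority so it cannot starve the nfsd threads the way the first attempt did.

    # Sketch: idle-priority copy of the spool to local /tmp on the NFS server.
    SPOOL=/data/project/.system/gridengine/spool/spooldb    # assumed path, as above
    ionice -c3 nice -n19 cp -a "$SPOOL" "/tmp/spooldb.bak.$(date +%F)"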
[02:04:21] okay [02:04:21] that seems way too long for 60G [02:04:22] and stracing the cp hangs?! [02:04:22] wtf [02:04:22] YuviPanda: ... what? [02:04:22] NFS just died [02:04:23] * Coren headdesks. [02:04:23] During the copy? [02:04:23] copy is still running [02:04:23] since this is on the server [02:04:23] YuviPanda: The copy might be eating all the I/O bandwidth on that volume. [02:04:24] > strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted [02:04:24] yeah [02:04:25] am killing copy now [02:04:32] and going to make copy go to non NFS [02:04:54] might want to also make it ionice -c Idle too [02:05:21] yup [02:05:34] NFS just unstuck. I don't expect that was a coincidence. [02:06:08] yeah [02:06:18] am copying on to /tmp now [02:06:30] it has more than a T of free space [02:08:36] Going faster? [02:08:44] It's certainly easier on the volume itself. [02:09:27] yeah [02:09:34] Coren: it's already at 20% [02:11:56] 30% done [02:12:34] going to take a bathroom break, I'll brb shortly [02:16:56] YuviPanda: FYI, looking at the log files shows about 10M of transactions/second until 23:16 which I presume is when things just went boom [02:17:49] Ah, and it started at 21:48 [02:18:13] 99% of that 60G of transactions is between those two times. [02:18:39] Something definitely dos the living crap out of the master. [02:19:03] andrewbogott: How do those times line up with the LDAP issues? ^^ [02:20:04] nothing much happened at 21:48… it was between restarts [02:22:14] hcclab <-- culprit [02:25:04] That tool attempted to start ~2400 jobs/s [02:26:13] you think coincidentally/unrelated to ldap and nfs? [02:26:49] The timing annoys me, but I can't see why flaky ldap would get exactly one tool to go craycray [02:27:31] * Coren attempts to locate what runs under that UID where. [02:27:48] back [02:27:52] No crontab [02:27:54] Hm. [02:27:59] Coren: cp is done [02:28:14] YuviPanda: Allright; seeing what can be done to clean this. [02:29:06] \o/ [02:30:39] Fun fact: hcclab has a bit of 28 million entries in there. [02:30:54] ouch [02:31:22] Yeah; Ima thinking accidental DOS. Started at 21:48 and the master gave up the ghost at 23:16 (still not bad) [02:31:41] I'm going to disable that tool [02:31:54] and leave messages on the maintainers [02:32:24] db_archive would take hours to go through that crap. [02:32:59] can we selectively discard it? [02:33:44] To a point; I'm trying to figure out the exact boundaries and blow up the logs; from that I should be able to recover [02:34:11] We might loose a few legit transactions in the deal, but it's better than the alternative. [02:34:26] yeah, definitely [02:35:57] YuviPanda: Allright, I'll work from the end. Can you cp log.0000089* back to the spool from the backup? [02:36:07] sure [02:37:00] Coren: done [02:37:37] doing [02:39:35] !log tools remove lbenedix and ebekebe from tools.hcclab [02:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [02:41:57] Not recoverable; the whole thing has a tranaction around it. [02:42:00] @^#% [02:42:02] dump/load [02:42:17] dump is consistent. [02:43:15] fun [02:43:59] loading from dump [02:45:19] I'm reading through the bdb manual slowly and completely [02:45:57] 6Labs, 10Tool-Labs: cron fail on tools-submit - https://phabricator.wikimedia.org/T121305#1909516 (10russblau) No cron jobs have run for approximately the last six hours. 
The error messages I've been getting vary quite a bit; the most recent one was: ``` error: commlib error: got select error (Connection ref... [02:46:41] YuviPanda: Hah. The sge_job db just went from 39M to 4M [02:46:42] :-) [02:46:48] * Coren checks. [02:47:11] nice [02:47:20] Or. [02:47:25] Ow* [02:47:34] We did loose the job db. [02:47:42] It was not recoverable. [02:48:04] ow indeed [02:48:18] Coren: do you think it might recover if we let it churn through for hours? [02:48:18] Master is working, but all running jobs have been lost track of - it's going to be 'amusing' for a while. [02:48:41] Loosing the job DB isn't a catastrophe. [02:48:53] :D I guess it'll be okay-ish, since most things are on Cron. I guess continueous jobs will be lost? [02:49:59] YuviPanda: Yes; and it'll make things crappy for a while since the actual *jobs* are mostly still running. [02:50:09] gridengine lost track of them though. [02:50:33] right [02:50:42] should we like reboot all instances? [02:50:46] But there was no way to fix this that I can see; it would have taken at least 10h just to parse the transaction log with no guarantee that it would have worked at the end. [02:50:46] or something crazy like that? [02:50:56] You know, I think that's the best move. [02:51:06] It's ugly; but it guarantees a known stable state. [02:51:34] Coren: I wonder if we could've recovered the bdb files from the last tools snapshot [02:51:58] YuviPanda: That'd have been even worse: rather than have no job db you'd have all the wrong job db. :-/ [02:52:01] ah [02:52:03] fun [02:52:12] Coren: should I start rebooting the instances? [02:52:25] Lemme stop the master first. [02:52:42] Stopped. [02:52:48] ok [02:53:24] Now if you reboot all exec nodes, the net result will be tart once gridengine restarts it'll see everything empty and return with a blank slate. Then the manifest will kick the webservices back up. [02:53:43] right [02:53:45] ok [02:53:48] YuviPanda: are you just going to click on things on wikitech? Or do you want me to script something from labcontrol? [02:54:36] andrewbogott: was just trying to piece something together from labcontrol [02:54:43] andrewbogott: if you can do it that'd be wonderful [02:54:53] ok, hang on... [02:55:05] andrewbogott: basically, anything 'tools-exec' or anything 'tools-webgrid' needs to restart [02:55:19] andrewbogott: and I suppose we need to stagger the restarts to not kill the virt* hosts [02:55:23] which has happened before I think [02:55:29] but not tools-elastic? [02:55:32] andrewbogott: nope [02:55:37] andrewbogott: re your labs-l email; it wasn't 12000 job, it was 12000000 jobs. :-) [02:55:54] what about -web-static? [02:56:01] andrewbogott: nope [02:56:03] Coren: oh! [02:56:24] YuviPanda: and not -worker either, right? [02:56:34] andrewbogott: nope, not -worker either [02:56:50] afaik, all the exec nodes are either -exec-* or -webgrid-* [02:56:54] yup [02:57:44] ok, one reboot every 30 seconds? does that sound slow enough? [02:58:12] andrewbogott: yeah [02:58:19] ok, here we go [02:58:26] (ready?) [02:59:01] Coren, YuviPanda, ready for me to throw the switch? [02:59:12] andrewbogott: yup doit [02:59:27] Aye [02:59:34] ok [03:01:07] andrewbogott: Coren that's about... 25 mins? [03:01:09] I guess? 
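The reboot script being put together here boils down to something like the sketch below, assuming a configured OpenStack client on the control host; the instance-name patterns are the ones agreed above and the delay is the knob under discussion (30 seconds to start, dropped shortly after). The fancier per-hypervisor staggering would additionally need the host column from an admin-level server listing.

    # Sketch: reboot every tools exec/webgrid instance with a fixed delay
    # between reboots so the virt hosts are not hammered all at once.
    for i in $(openstack server list -f value -c Name | grep -E '^tools-(exec|webgrid)'); do
        echo "rebooting $i"
        openstack server reboot "$i"
        sleep 30    # tune down (e.g. 15) if the hypervisors cope
    done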
[03:01:20] hm, yeah [03:01:42] We might be more generous if we start as many on parallel hosts as possible [03:02:06] I think writing that script would take more than 25 minutes :) [03:02:13] yeah, ideally we want 'not more than 1 instance on host per 30s' but I don't think we can write that in 25min [03:02:15] hah :) [03:02:18] we should write that tho at some point [03:02:23] Yep, my thought exactly. [03:02:50] I’m going to drop it to 15 seconds, though, 30 is too slow [03:03:02] andrewbogott: yeah, ok [03:06:28] YuviPanda: can you write the ‘what just happened’ email? [03:06:38] andrewbogott: I just did, kindof. [03:06:45] we disabled the paging for tools-home and NFS services temp. this change can be reverted to enable it again https://gerrit.wikimedia.org/r/#/c/261610/1 [03:06:56] andrewbogott: hah, mid-air collision with your email [03:07:24] yeah, oops [03:10:04] * legoktm hugs everyone <3 [03:10:18] andrewbogott: virt hosts surviving the load? [03:10:29] seems ok so far [03:10:35] so this entire thing just happened to happen at the same time one errant tools ubmitted 12million jobs? [03:11:05] it’s not a great theory, but so far it’s the best one we have [03:11:07] entire thing -> LDAP and NFS [03:11:18] need to investigate the code of that tool and see if that can cause it to happen [03:11:19] I don't believe in coincidences. [03:11:22] vice versa could've also happened [03:11:28] in that the load from 1m jobs killed NFS [03:11:31] and LDAP was mostly unrelated [03:11:34] or [03:11:40] the load from 1m jobs killed NFS and LDAP [03:11:44] which is both totally possible [03:12:09] The jobs can't possibly have run- there is a hard per-uid limit [03:12:21] oh yeah, but the submission of the jobs themselves [03:12:27] bdb files are on NFS afterall [03:12:43] True; but afaict the LDAP issues started before that. [03:12:47] * Coren ponders. [03:13:10] exec nodes should all be back up [03:13:10] My guess: LDAP issues caused the tool to fail somehow, and the million submission hit NFS hard. [03:13:18] webgrid nodes in process [03:13:19] * Coren restarts the master [03:13:23] not [03:13:29] (sorry) [03:14:58] bigbrotherrc doesn't start jobs when they aren't running, does it? [03:15:12] it doesn't! [03:15:15] YuviPanda: No. [03:15:27] so that's 206 jobs we need to figure out how to start [03:15:46] I suppose I can just become the tool and exec that file [03:15:56] YuviPanda: It was intended to keep running jobs up; but if you rely on the .bigbrotherrc the actual invocation is in there [03:16:02] right [03:16:06] so I can just [03:16:21] sudo -u bash /data/project//.bigbrotherrc [03:16:25] if that file exists [03:16:29] and that should do the trick [03:17:15] Hm. iff there are no webservice invokations in there. You excised them all right? [03:17:21] yeah [03:17:29] webservice invocations are also harmless [03:17:33] True [03:18:17] two more minutes [03:21:28] ok [03:21:33] so as soon as we start the master back up [03:21:36] ok, now everything is rebooted [03:21:41] webservice jobs and cron will start flooding gridengine [03:21:46] Coren: sohuld we do it one at a time? [03:21:54] stop cron for a bit until webservices are done? [03:22:21] YuviPanda: I think we can rely on the load balancing from sge; it'll grind a bit at first but nothing should breath [03:22:24] !log tools stop cron on tools-submit, wait for webservices to come back up [03:22:32] Coren: better safe than sorry, I guess [03:22:36] Coren: wanna start the master back up? [03:22:53] It's up. 
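The "become the tool and exec the file" idea above, written out as a sketch. It assumes the usual Tool Labs naming (tool account tools.<name>, home /data/project/<name>) and skips empty .bigbrotherrc files, which per the discussion further down mean "stop restarting"; webservice lines inside are harmless, as noted above, but it is worth eyeballing a few files before looping over everything.

    # Sketch: re-submit continuous jobs by executing each tool's .bigbrotherrc
    # as that tool. Run from a host with job submission rights.
    for d in /data/project/*/; do
        rc="${d}.bigbrotherrc"
        tool="tools.$(basename "$d")"
        [ -s "$rc" ] && sudo -u "$tool" bash "$rc"
    done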
[03:23:19] ooook [03:23:47] With the jobs from when it went down; it'll take a bit to match up with the node status. [03:24:22] yeah [03:24:43] ooo [03:25:10] admin's back [03:25:12] Good news; the updates allowed it to notice continuous jobs that needed restarting because the restartable queues are handled separately. [03:25:22] oh? [03:25:58] Follow along on qhost -j [03:26:16] Probably not all of them, though. [03:28:39] hmm [03:28:40] Coren: true [03:29:16] Would this be a terrible time for me to go to dinner? I’m standing some people up :( [03:29:32] andrewbogott: No, go. AFAICT things are back up. [03:29:43] andrewbogott: hi5, I am too :D [03:29:55] * andrewbogott hi-5s yuvi [03:30:03] Need me further, guys? [03:30:14] YuviPanda: so maybe that means /you/ should go to dinner and I should restart jobs [03:30:31] Coren: I don’t think so. Thank you for coming to the rescue! [03:30:53] Coren: <3 thanks a lot [03:31:29] Heh, no worries guys. [03:31:36] YuviPanda: so, script to start all the big brother jobs, and then what? [03:31:39] then nothing? [03:31:55] andrewbogott: yeah, but right now - get on tools-services-02 and tail /var/log/upstart/webservicemonitor.log [03:31:58] and watch as it restarts things [03:32:03] when it's done restarting all the webservices [03:32:09] then start the cron service on tools-submit [03:32:27] and *then*the bigbrother part needs to hapen only for jobs taht haven't already been started, so that's a bit tricky [03:32:41] andrewbogott: but if you can do the first and second, that'll allow me toget dressed and out the door and I can look at #3 [03:33:03] first and second = waiting and then restarting cron service? [03:33:05] That I can do [03:33:15] what’s the cron service called? [03:33:22] andrewbogott: yes, looking at the webservice monitor and then wen it's done restarting cron [03:33:24] andrewbogott: cron [03:33:28] heh, ok :) [03:33:52] should I care about errors in webservicemonitor.log? [03:34:13] andrewbogott: only if all of them are erroring [03:34:31] many, but not all [03:35:25] andrewbogott: so the errors it'll try again on next round [03:35:31] ok, cool [03:46:42] "bash: cannot create temp file for here-document: No space left on device" [03:46:54] bd808: where’s that? [03:47:07] I'm getting that trying to tab complete on tools-bastions-02 [03:47:44] andrewbogott: back [03:47:58] well, back as in 'I am in a cab' [03:48:15] YuviPanda: webservices still starting, although most are failing [03:48:23] also, tools-bastion-02’s disk is full [03:48:27] I’m about to wipe out ‘atop’ logs [03:48:48] ok [03:49:05] * bd808 hates the damn atop logs [03:49:06] andrewbogott: I think at this point you can start cron and go. I'll take care of webservice logs [03:49:10] err [03:49:12] webservice errors [03:49:15] bd808: any better? [03:49:31] seems to be andrewbogott. thanks [03:49:49] YuviPanda: ok, started cron... [03:49:57] lemme find some more space on tools-bastion-02... [03:51:03] ok, done [03:51:23] So, YuviPanda, you can really handle this from your phone? [03:51:44] andrewbogott: am not on my phone :) [03:51:50] ok then [03:51:52] andrewbogott: I'm on my laptop + mifi [03:52:01] plus my hands are 'good enough' to not need the big keyboard [03:52:23] I’m going to head out — email me if I can do anything? I should be back at a laptop in 90 mins or so [03:52:35] andrewbogott: ok [03:52:41] thank you! 
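Side issue from above: the bastion ran out of disk and the suspected culprit is atop's binary history. A hedged sketch of the triage, assuming atop's default log location of /var/log/atop; adjust the retention to taste.

    # Sketch: see what is eating the filesystem, then prune week-old atop logs.
    df -h /
    du -xsh /var/log/* 2>/dev/null | sort -h | tail
    find /var/log/atop -name 'atop_*' -mtime +7 -print -delete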
[03:54:51] !log tools qmod -rj all tools in the continuous queue, they are all orphaned [03:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [04:06:12] !log tools delete all webgrid jobs to start with a clean slate [04:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [04:09:33] ok it's all good now [04:09:36] webservices recovering [04:09:38] foooood [04:09:40] brb [04:09:42] call me if anything happens [04:09:45] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [04:10:23] admin 503s [04:11:08] * Coren|Away restared it manually. [04:14:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 960405 bytes in 3.541 second response time [04:16:39] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909567 (10coren) >>! In T122638#1909554, @yuvipanda wrote: > I'm also wondering if more up-to-date backups of the job db would help. For backups to be useful for recovery of actually running jobs, it'd have to b... [04:19:29] 6Labs, 10Tool-Labs: mediawiki-mirror jobs generate grid errors (?) - https://phabricator.wikimedia.org/T122640#1909568 (10scfc) 3NEW [04:45:15] Can anyone bring https://tools.wmflabs.org/musikanimal/nonautomated_edits back up? [05:44:52] YuviPanda: I’m back, for a few minutes at least. Does anything need immediate attention? [06:03:00] andrewbogott: nope, all goodish [06:03:04] Zhaofeng_Li: I'll try [06:04:14] YuviPanda: Thank you! [06:04:23] Zhaofeng_Li: try now? [06:04:45] Hmm. Still "not currently serviced". [06:04:45] YuviPanda: great, then I’m going to sleep! [06:05:35] oh I see [06:05:39] Zhaofeng_Li: it was a custom webservice [06:05:43] let me bring that up [06:07:10] Zhaofeng_Li: ok it's running now [06:07:13] I suppose.. [06:07:41] YuviPanda: Nope, it's still down. [06:10:32] hmm [06:10:35] the process is running [06:10:38] I'll take a look in a bit? [06:12:22] YuviPanda: Okay, take your time. It's not urgent :) [06:12:44] it's a custom webserver, so needs more poking :) [06:12:46] thanks [07:13:58] trying to work out which table in wikidatawiki_p stores relationships like European mink (Q26559) instance of (P31) taxon (Q16521) [08:00:51] pengo: I think for those you should use query.wikidata.org [08:00:58] I don't think those relationships are in rdbms tables [08:01:11] pengo: and you can probably get better answers on #wikidata :) [08:21:07] YuviPanda, thanks.. and nope, asked there too and got nothing [08:21:39] probably too early in europe :) [08:21:50] i looked at that query language before but it scared me too much.. will try staring at it some more later [08:23:02] also have used a bunch of other wikibase db queries already so thought i should try to keep it consistant [11:19:32] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909769 (10valhallasw) Apparently hcclab has caused issues before: {T92614}, but it seems Ebekebe and Lbenedix were not contacted about this at that time. http://wm-bot.wmflabs.org/logs/%23wikimedia-labs/20150313... 
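On pengo's question above: claims such as instance-of are, as noted, not exposed in the wikidatawiki_p replica tables, which is why query.wikidata.org was suggested. A minimal sketch of that route, hitting the SPARQL endpoint from the shell for a few items that are instances of taxon (Q16521):

    # Sketch: ask WDQS for items with P31 (instance of) = Q16521 (taxon).
    curl -s -G 'https://query.wikidata.org/sparql' \
        --data-urlencode 'query=SELECT ?item WHERE { ?item wdt:P31 wd:Q16521 } LIMIT 5' \
        --data-urlencode 'format=json'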
[11:19:41] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ogmios was created, changed by Ogmios link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Ogmios edit summary: Created page with "{{Tools Access Request |Justification=I'd like to create maps for wikipedia using your instance of the openstreetmap database. |Completed=false |User Name=Ogmios }}" [11:26:26] Hi, is there anyone here who might be able to help us? We´re experiencing problems with https://nl.wikipedia.org/wiki/Wikipedia:Labels. I seem to be unable to connect to the server, and a user is getting the message ¨waiting for labels.wmflabs.org¨ which seems to be offline (error 502). Is it something that needs to be looked at, or do we just have to wait it out? [11:31:25] It seems to be working now. [11:31:38] Woodcutterty: yeah, I just restarted it [11:31:46] !log wikilabels restarted uwsgi on wikilabels-01 [11:31:49] Sweet. Thanks. [11:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master [11:31:58] Woodcutterty: np. thanks for reporting [11:35:10] YuviPanda: ah, still awake? [11:35:13] or awake again? [11:36:52] YuviPanda: is everything back to sort-of normal or is there still stuff that needs to be done? [11:40:15] valhallasw`cloud: everything is sort of normal except for some jobs that were continuous and had .bigbrotherrc won't be there [11:40:23] or rather, some continuous jobs were recovered, and some weren't [11:40:23] YuviPanda: ok [11:40:28] need to exec all .bigbrotherrc files [11:40:35] and make sure that they have -once set [11:40:37] or something like that [11:40:50] won't bigbrother just restart them anyway? [11:41:22] valhallasw`cloud: no since it doesn't start jobs if they aren't already started :D [11:41:28] ah [11:41:31] and hence is useless in our situation [11:49:34] W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/trusty-wikimedia/InRelease Unable to find expected entry 'thirdpartymain/source/Sources' in Release file (Wrong sources.list entry or malformed file) [11:49:53] does this look familiar to anyone? doing the last few salt updates on labs instances and I ran into about three with this problem [11:50:03] cleaning up /var/lib/apt/lists didn't help it any [11:50:12] apergos: yes, it looks familiar to me [11:50:22] apergos: look at /etc/apt/sources* [11:50:27] apergos: which instance is it? [11:50:35] I thought I had cleaned them all up but clearly not [11:50:36] YuviPanda: I'm also confused where the 12M jobs came from -- the hcclab maintainers haven't logged in to tools-astions this month [11:50:43] orch-client-02.orch.eqiad.wmflabs [11:50:50] there's a couple more too lemme check [11:51:43] metamorphosis-kafka-01.metamorphosis.eqiad.wmflabs and conf100.analytics.eqiad.wmflabs [11:51:59] wtf is orch [11:52:30] apergos: anyway, in cat /etc/apt/sources.list.d/wikimedia.list [11:52:34] deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdpartymain thirdpartymain thirdparty [11:52:37] it should be [11:52:41] yeah I see the thirdpartymain in there [11:52:58] I've fixed it there now [11:53:06] deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdparty [11:53:08] is what it should be [11:54:19] YuviPanda: Can you restart crosswatch? https://phabricator.wikimedia.org/T122585 [11:55:03] Glaisher: probably in about 5-10mins? 
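The apt fix above amounts to trimming the duplicated components back to the intended line. A sketch of a one-liner for any instances still carrying the broken entry (it keeps a .bak copy; hand-editing is just as good if the file differs from the example shown above):

    # Sketch: rewrite the wikimedia.list entry to the known-good form, then refresh.
    sed -i.bak 's|^deb http://apt.wikimedia.org/wikimedia trusty-wikimedia .*|deb http://apt.wikimedia.org/wikimedia trusty-wikimedia main universe thirdparty|' \
        /etc/apt/sources.list.d/wikimedia.list
    apt-get update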
[11:55:04] Glaisher: looking at it already [11:55:06] ah [11:55:09] valhallasw`cloud: is on it [11:55:11] <3 [11:55:12] :) [11:55:13] cool [11:55:25] but crosswatch is ... very non-standard, to put it mildly [11:55:46] but it might actually be a scheduling issue, it seems [11:56:45] thanks YuviPanda! [11:57:22] YuviPanda: uh, /usr/local/bin/tool-generic seems to be missing [11:57:24] valhallasw`cloud: most of that non-standardness is my fault tho [11:59:26] and wtf [11:59:27] -r-xr-xr-x 1 ssh-key-ldap-lookup ssh-key-ldap-lookup 723 Dec 17 18:53 tool-nodejs [11:59:28] ?! [11:59:42] why on earth is that owned by ssh-key-ldap-lookup [12:00:12] valhallasw`cloud: uh, is the tools-webservice package installed? [12:00:57] not sure [12:01:01] python-tools-webservice? [12:01:08] no, tools-webservice [12:01:23] because we have two packages?! [12:01:58] no, the first one was a 'shit I gotta rename this' it shouldn't exist anywhere... [12:02:13] I see [12:03:11] no, doesn't seem installed :/ [12:03:28] uhu [12:03:38] there's also toollabs-webservice [12:03:41] which is installed [12:04:09] lol [12:04:13] soooo.. my grid jobs seems to have grind to a halt.. should I kill and try to run again? [12:04:15] let me look at what the package name actually ended up being [12:04:46] YuviPanda: so toollabs-webservice doesn't contain any of that /usr/local/bin stuff [12:05:02] because that's from the old-old-webservice setup, I guess [12:06:32] > Source: toollabs-webservice [12:06:35] valhallasw`cloud: that's the actual name [12:06:37] it looks like [12:06:39] oh [12:06:40] YuviPanda: so what I don't get [12:07:04] crosswatch just uses a service.manifest web: generic thing [12:07:24] so it must be your code that calls out to /usr/local/bin/tool-generic [12:07:32] valhallasw`cloud: aaaaah, I see the confusion [12:07:35] there's no tool-generic [12:07:42] the tool-* shell stuff isn't used by webservice-new [12:08:03] those are just implemented as python classes [12:08:16] valhallasw`cloud: the only two scripts are webservice-new and webservice-runner [12:08:20] then why is webservicewatcher trying to run it?! [12:08:32] because webservicewatcher needs a patch [12:08:35] that I never wrote [12:08:56] right [12:09:00] webservice-new restart fixes it [12:09:21] well, it starts the job, anyway [12:09:34] right, and webservicemonitor doesn't know about webservice-new at all [12:09:38] and we're back [12:10:23] YuviPanda: ok, so I'm also completely confused by the gazillion webservice things [12:10:43] I thought it was rewritten once, and webservice was now what webservice-new once was, but apparently not [12:11:11] valhallasw`cloud: right, so the original webservice wasn't in python (I think?), so the rewrite was to make it python [12:11:27] valhallasw`cloud: but it still was a loose collection of things that just invoked each other via shell [12:11:41] YuviPanda: no, don't tell me now, write it down somewhere where we can find it in two months' time [12:11:45] valhallasw`cloud: webservice-new was written to bring it all into a python module, and what not. 
[12:11:55] valhallasw`cloud: and *then* the big 3day labs NFS outage happened [12:12:07] valhallasw`cloud: and this entire thing got dropped on the floor while that was fixed [12:12:16] valhallasw`cloud: I think there's a bug with all this info [12:12:19] valhallasw`cloud: let me find it [12:13:48] 6Labs, 10Tool-Labs: Unify / simplify webservice code - https://phabricator.wikimedia.org/T98440#1909788 (10yuvipanda) To recap, the code is mostly done, but the big NFS outage in June took vastly bigger priority over this. Just needs some finishing touches and support in webservicemonitor and then we can switc... [12:14:12] there [12:15:16] valhallasw`cloud: I think I'll pick that up this week/next-weekish [12:15:24] YuviPanda: that would be great [12:15:55] valhallasw`cloud: I think it just needs testing + webservicemonitor integration and then it's good to go, since testing should make sure it's fully b/c [12:15:55] 6Labs, 10Tool-Labs: Webservicemonitor doesn't understand webservice-new - https://phabricator.wikimedia.org/T122647#1909789 (10valhallasw) 3NEW [12:18:58] valhallasw`cloud: anything else for me to do before I go to sleep? [12:18:59] down to only a few instances I can't get cause no salt or ssh [12:19:00] nice [12:19:11] YuviPanda: no, thanks for your help [12:20:54] valhallasw`cloud: <3 thanks for taking care of it [12:22:49] "Continuous jobs will not be brought back up" mine are there … [12:23:16] Stigmj: yes, if they aren't that's probably a good idea [12:23:19] gifti: some survived [12:23:56] ah, ok [12:23:59] YuviPanda: I'm still confused by the bigbrother story. Why wouldn't it bring back up continuous jobs? [12:24:43] valhallasw`cloud: mostly the way it was written. [12:24:54] valhallasw`cloud: I think it looks at all running jobs first and then looks for their .bigbrotherrc files? [12:25:06] no [12:25:08] that's not what it does [12:25:29] no, it just reads all the files and restarts jobs that are not in there [12:25:32] at least, that's what I remember [12:25:57] right, but it had problems with times when a .bigbrotherrc file was removed and when you had 'clean slates' [12:26:08] I remember that I think, and I asked Coren|Away and he confirmed it [12:26:17] I just tried to read it and then my brain went 'splat' on the perl5 [12:26:18] yeah, it didn't understand 'empty file is stop restarting' [12:28:33] valhallasw`cloud: maybe I should rewrite it to be perl6? :D [12:28:36] * YuviPanda runs [12:28:43] it should actually be rolled into service.manifest too [12:28:46] YuviPanda: we're getting k8s, remember [12:28:53] all of that work was waylaid by the NFS outage [12:29:22] valhallasw`cloud: I was joking (re: perl6) [12:29:35] anyway, time to sleeeeeep [12:29:38] it's almost 5 [12:29:45] g'night! [13:17:40] Coren: Noooo. /o\ [13:17:58] 6Labs, 10Tool-Labs: Unify / simplify webservice code - https://phabricator.wikimedia.org/T98440#1909834 (10valhallasw) After thinking about this some more -- should we spend time on this, or is it better to document which webservices use 'generic' (i.e. which webservices are not restarted by webservicewatcher)... [13:18:20] Coren: Who will I go to when labs explodes now? :'( [13:18:29] a930913: I recommend Yuvi. [13:19:24] a930913: or andrew, or chase, or scfc, or me [13:19:39] but I'd suggest just using #wikimedia-labs and see who responds ;-) [13:19:47] valhallasw`cloud: Yeah, but none of you are Coren :p [13:20:13] ... [13:20:58] hehe [13:21:06] Coren: Thank you very much for all the support you gave me over the years. 
You will be sorely missed <3 [13:21:42] Heh, I'll still be around; just at more random times and durations. :-) [13:23:08] Coren: Like more English times? :p [13:23:52] I would have said "Yuvi times" :-P [13:24:42] Coren: Isn't that 12 hours offset from local or similar? [13:25:03] a930913: No, that was mostly an inside joke - sorry. [13:25:31] Oh, /me hasn't been around enough lately :( [13:25:59] Darn real world... It sucks. Nowhere near as good as the reviews said. [13:29:49] Coren: Off to pastures green then? [13:31:23] I suppose, yes. :-) [13:31:37] Though right now, winter means "white" more likely. :-P [13:31:59] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909844 (10coren) >>! In T122638#1909769, @valhallasw wrote: > What is unclear to me at this point is what commands were issued when and by whom. Was this a cronjob gone wrong? The tool owners haven't logged in to... [13:32:38] Coren: the crontab was empty when you checked it last night? [13:32:48] valhallasw`cloud: Yep. [13:32:53] weird. [13:33:01] did we save the bdb files? [13:33:11] was it perhaps related to the wrong uids being given out in that ldap bug or so? [13:33:33] valhallasw`cloud: I don't think yuvi deleted the backup we made, so they may still be around. [13:33:35] (I don't have any details) [13:34:18] mark: Well, the timing is suspicious but I'd be at a loss to make a sane hypothesis about why ldap being funky would make exactly /one/ tool go crazy. [13:34:26] ok [13:34:45] i've also just installed irqbalance on labstore1001, and later we'll enable Receive Packet Steering; should make the nfs server a bit more resilient under overload [13:35:16] * Coren googles receive packet steering. [13:35:28] https://gerrit.wikimedia.org/r/#/c/261598/ [13:36:05] we use that on LVS boxes, but haven't really tried it with this NIC chipset yet, so I'm a bit wary of deploying it right now [13:37:37] mark: RPS or RSS? The kernel doc I read says the former is a software implementation so should be chipset independent. [13:37:53] this only enables RPS [13:38:06] i wouldn't say anything involving hardware IRQs is fully chipset independent :) [13:38:23] https://www.kernel.org/doc/Documentation/networking/scaling.txt right? [13:38:33] yes [13:38:40] we can possibly try RSS later [13:41:13] Huh. Those are pretty old mechanisms, ca. 2.6.35. Never ran into them before. [13:42:44] Heh. Scaling code written by google dudes. Why am I not surprised. :-) [14:21:24] Coren: I'm starting to wonder if this is actually the march issue coming back to bite us somehow. The log.* files seem to be referring to tools-exec-09, tools-exec-13, etc?! [14:22:01] ( https://phabricator.wikimedia.org/T122638#1909769 ) [14:23:30] valhallasw`cloud: I don't think we can reach a more specific conclusion than "that tool is somehow broken and goes onto rampages" [14:24:20] I'm not convinced -- a tool whose owners have not logged in for a month, and which doesn't have a crontab? [14:24:31] Ah, yuvi took the tool over. [14:24:40] I checked the original owners [14:24:45] last did not have any recent login registered [14:24:53] on any of the tools bastions [14:25:52] valhallasw`cloud: What were the original maintainers? [14:26:01] Coren: Ebekebe and Lbenedix [14:27:18] Hm, it doesn't seem to have any actual work done - but afawk it may well have had a continuous job chugging along quietly. [14:28:14] * Coren needs to look at the logs-logs [14:28:42] * Coren does that now. 
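For context on the RPS change being discussed: per the kernel's scaling.txt linked above, enabling Receive Packet Steering at runtime comes down to writing a CPU bitmask into each receive queue's rps_cpus file. The sketch below uses placeholder interface and mask values; the actual production change went in through the puppet patch referenced above.

    # Sketch: allow the CPUs in MASK to process received packets for IFACE.
    IFACE=eth0    # placeholder
    MASK=ffff     # placeholder CPU bitmask
    for q in /sys/class/net/"$IFACE"/queues/rx-*/rps_cpus; do
        echo "$MASK" > "$q"
    done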
[14:35:31] I'm mostly wondering whether this was some kind of housekeeping job choking on the large number of jobs from back-then [14:36:50] I'm crunching the qacct logs now. [14:37:25] ok. I tried finding one of the hcclab jobids in there (8914870), but 'error: job id 8914870 not found' [14:37:49] of course, qacct only registers jobs that failed sensibly [14:39:33] Huh. Well, that tool has got 115112 seconds of CPU time logged, so it clearly ran. [14:40:00] * Coren tries to find specific jobs [14:40:16] *nod* let me grep the accounting file for hcclab [14:40:25] I'm already doing so atm [14:40:49] ok [14:42:04] Interesting. [14:42:35] So, it's not the same as the previous issue. That time, the jobs were running fairly often with high CPU and I/O requirements - that's the big load that was found and squashed. [14:42:36] anything more recent than march in there? [14:42:41] ah, ok [14:43:05] Then, nothing since. Except for the last entry, jobnumber 9100500, which is clearly broken and defective [14:43:25] But the record is broken enough that even the qsub_time is the epoch [14:43:54] iirc qsub_time epoch is what you get for a qmod -rj [14:44:07] although thta should be the end time, not the start time. [14:44:20] But the job itself, 'analyze_comments3' is one that ran in the past from that tool. [14:44:55] "failed 21 : in recognizing job" is about as useless and uninformative as it gets. [14:45:56] Oooh. [14:46:10] job ids rolled over. [14:47:34] The (normal) jobs immediately preceeding are 909xxxx in March. That 9100500 job should have been shortly after. [14:48:21] But right now the job IDs are in the very low range, so they rolled over from 9999999 recently. [14:49:01] yeah, there's something odd here. Jobids were in the 286000's at the beginning of dec, then back to 211535 the 21st [14:49:10] I don't know if it means anything, but perhaps there was a broken job entry for that tool in the DB and as a new job reached that ID rather than skip as it normally does it got confused? [14:50:17] I don't think we can know what happened for sure unless we read source a lot - no doubt the actual DB entries could tell us if we knew how to read them. [14:50:57] 0 2015-12-02 18:06:55 tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 321001 [14:50:57] 0 2015-12-02 18:33:58 tools-exec-1208.eqiad.wmflabs tools.geocommons update 9999894 [14:50:58] 0 2015-12-02 18:38:14 tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 1 [14:50:58] 0 2015-12-02 19:13:03 tools-exec-1220.tools.eqiad.wmflabs tools.cluebot cron-tools.cluebot-cbng_bot 1000 [14:51:06] last number is the jobid [14:51:17] and I'm only showing the first in every 1000 [14:51:36] so we jump from 321xxx to 9999xxx to 1 [14:52:03] Well, if there was over 12 million jobs started (or tried to start) then you'd expect a 10 million counter to roll over. [14:52:59] Though we don't know if they are all starts. "Activity". [14:54:36] Right. But it jumps by almost 10 million, but without anything showing for it in the accounting log, other than the jump in jobid [14:55:19] and this already happened 2 dec, so I find it weird we wouldn't have noticed it earlier [14:58:43] valhallasw`cloud: Perhaps it caused the same issue, but gridengine managed to survive it. [14:59:10] valhallasw`cloud: It may very well have been able to commit and flush the transactions to the DB but wasn't able to this time because ldap and NFS were not well. [14:59:32] * Coren shakes his head. 
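The accounting-file crunching in this part of the log can be reproduced with a short awk pass. This is a sketch only: the path is assumed from the cell layout used here, the field numbers (6 = job number, 11 = end time, per accounting(5)) are worth re-checking against the man page, and gawk is used for strftime.

    # Sketch: print every 1000th finished-job record (end time, host, owner,
    # job number) so jumps and rollovers in the job id sequence stand out.
    ACCT=/data/project/.system/gridengine/default/common/accounting
    gawk -F: 'NR % 1000 == 0 {
        printf "%s %s %s %s\n", strftime("%Y-%m-%d %H:%M:%S", $11), $2, $4, $6
    }' "$ACCT"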
[14:59:48] We just don't have enough information to make more than random stabs in the dark right now. [15:02:30] Coren: ah! found something [15:02:34] there was also a jobid reset at 2015-12-29 20:55:19 [15:03:34] So, basically the same thing. [15:03:46] Only this time the master didn't survive it? [15:03:53] we jump from 547758 to 9999897 then run over to 1 [15:04:53] As far as I can recall, to avoid races gridengine allocates a fresh jobid to any new attempt to start a job before it even decides if it will be allowable. [15:05:09] So failed job starts will make 'holes' in allocation. [15:05:32] yeah. And of course, the accounting log only shows completed jobs [15:06:13] Except for that one broken 9100500 job from that tool. [15:07:03] I'm actually happy we lost the job DB now; it would really seem as though there was a broken entry in it. [15:08:21] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909877 (10valhallasw) The issue seems to have coincided with a weird jump in jobids. This has happened before (at the beginning of december), and it looks roughly like this. (I'm using lighttpd-wikiviewstats beca... [15:11:13] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1909882 (10coren) >>! In T122638#1909877, @valhallasw wrote: > which actually shows another interesting tidbit: there's 20 seconds between restarts, but between 547749 and 9999897 there's a two minute gap. That is... [15:11:28] Coren: yes, exactly [15:12:57] Now, to me, the interesting question is "did the gridengine itself go crazy trying to reschedule a broken job entry from its own database or did something external actually try to schedule those jobs"? [15:13:40] I'm leaning towards the former - I see no reason to believe the hcclab suddenly had phantom code running that wasn't there for months. [15:14:27] That it was hcclab-related is probably a symptom of its past (real) issue that may have left the broken entry in the first place. [15:14:40] Does that seem sane to you? [15:19:09] Coren, I'm having one hell of a time using mySQL workbench with the tools DB [15:20:06] Coren: yes, that sounds like an explanation, although it's not clear to me what we could have done/do in the future to prevent things like this [15:20:17] I get MySQL server has gone away after a minute of use. [15:20:59] Cyberpower678: set keep-alive to something lower? https://dev.mysql.com/doc/workbench/en/wb-preferences-sql-editor.html [15:21:41] valhallasw`cloud, that's never been a problem in the past. [15:31:43] valhallasw`cloud, Coren: why does the SQL server go away so quickly? [15:32:16] Cyberpower678: I don't know. I have given you a suggestion on how to solve the issue. Please try it. [15:32:57] In the time it takes to write a query, the server goes away. [15:34:17] valhallasw`cloud, I set it to 30 seconds as any longer seems to kill the connection. [15:34:43] valhallasw`cloud, that didn't fix it [15:34:48] It still went away [15:35:09] through which host are you tunneling? [15:35:19] tools-login or bastion.wmflabs? [15:35:30] tools-login [15:36:54] hm. So the ssh connection disconnects after 2 minutes or so, but that might be mysql workbench trying to reconnect [15:37:31] This hasn't been a problem in the past, and I didn't change anything. [15:42:13] Cyberpower678: odd. [15:42:56] hi. question... I seem to be missing replica.my.cnf in my home directory on toollabs... is it me doing something wrong or? [15:43:36] no, there should be one.
Please create a task in phabricator. [15:43:48] okay. will do. [15:45:13] valhallasw`cloud, any idea what's causing it. I'm going to need a reliable connection. [15:45:58] no, not really. The obvious culprits are the ssh connection timing out, or the mysql connection itself getting disconnected somehow [15:52:33] Cyberpower678: ok, I can reproduce it, but I don't have a clear argument for what's happening yet. [15:52:38] please create a bug? [15:52:45] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1909976 (10Wiki13) 3NEW [15:52:59] valhallasw`cloud, glad to know that I'm not going insane. :p [15:54:08] Ugh, [15:54:17] Now phabricator is bitching at me again. [15:54:46] I'm getting really frustrated here. [15:55:10] Why would phab dislike you? [15:56:14] It doesn't seem to be capable of keeping me logged in. [15:57:18] It seems to like to give me an invalid session error and throws me out. Then it doesn't let me log in. [15:58:10] Wiki13: I thought the replica.my.cnf for user accounts (as opposed to tool accounts) was no longer here since long ago [15:58:51] Coren, I found that waiting a bit seems to resolve the issue. [15:59:28] well, i asked zhuyifei1999_, and the admin of the toollabs I spoke to said it has to be there... so i don't think so [15:59:50] i looked around and other people have the file too, for me it's just missing [16:00:29] Well that home directory can also mean home directory for tool accounts [16:00:44] As $HOME points to them [16:01:56] my toolaccount has the file, my personal account doesn't [16:02:23] but that should become clear when reading the title of the bug report I just made [16:02:25] Also for me my /home/zhuyifei1999/replica.my.cnf is entirely useless. Those weren't updated in the pmtpa to eqiad migration [16:03:01] maybe report that as bug? [16:03:07] 6Labs, 10Tool-Labs: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1909989 (10Cyberpower678) 3NEW [16:03:13] valhallasw`cloud, ^ [16:03:54] Only replica.my.cnf of tool accounts had migrations. I thought connecting to replica via user account was no longer allowed [16:05:07] well if that's the case, the documentation on wikitech should be updated to reflect that [16:05:07] Coren: can you confirm this, or am I wrong? [16:05:29] Wiki13: which part [16:05:34] ? [16:05:43] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database [16:05:56] "Tool and Labs accounts are granted access to replicas of the production databases." [16:06:00] zhuyifei1999_: You're correct; given that quarry exists the use of per-user connections (for one-off queries) was mooted. [16:06:10] ah [16:06:10] So the documentation needs updatin' [16:06:25] that's clear, thanks Coren [16:06:38] will close the bug report then [16:06:43] as invalid [16:07:03] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1910005 (10Wiki13) 5Open>3Invalid [16:09:17] 10Tool-Labs-tools-Quentinv57's-tools: editcount tool gives a significant different/wrong number of edits - https://phabricator.wikimedia.org/T67741#1910008 (10Cyberpower678) 5Open>3declined [16:10:02] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13).
- https://phabricator.wikimedia.org/T122657#1910012 (10Wiki13) Invalid. Asked about this on IRC, and personal accounts are not used for database access anymore (you should use tool... [16:10:20] Hmm Wiki13 too long. My first skimming through the page doesn't find the section to update :/ [16:10:31] 10Tool-Labs-tools-Quentinv57's-tools: SULInfo feature request: sorting thru GET parameters or some other way - https://phabricator.wikimedia.org/T51624#1910013 (10Cyberpower678) 5Open>3declined [16:11:27] y zhuyifei1999... [16:11:44] Um never mind [16:11:45] maybe a notice of some sort at the top of the page [16:14:01] Ok updated :) [16:14:20] I see [16:14:23] thakns [16:14:29] thanks* [16:16:31] np [16:32:22] hello, I realize there are labs instability issues going on, but it seems that other webservices on tool labs are running, while mine keeps dying :( https://tools.wmflabs.org/musikanimal/ [16:32:44] MusikAnimal: What does the log say? [16:32:58] I'll attempt to reboot, just as I have in the past, and it will hang and eventually load the page but without any CSS. Then I hit refresh and it's offline again [16:33:01] the logs look normal [16:33:25] 2015-12-30 16:32:59: (configfile.c.957) source: /var/run/lighttpd/musikanimal.conf line: 554 pos: 15 parser failed somehow near here: (EOL) [16:33:26] the unicorn log suggests it booted up properly, and I even see the first HTTP GET request [16:33:32] That looks suspicious [16:33:58] huh, lighttpd is a webserver, right? [16:34:00] unicorn log? [16:34:02] I don't think I'm using that [16:34:06] yes unicorn web server [16:34:13] lol [16:34:25] hey, it's pretty awesome, not sure it's that common for non-Ruby apps tho [16:34:33] 24627 0.30000 lighttpd-m tools.musika r 12/30/2015 16:34:28 webgrid-lighttpd@tools-webgrid 1 [16:34:56] You might want to do a webservice stop if you don't want a webservice. [16:35:06] weird [16:35:13] You're clearly set up for it; and even got a configuration file for it [16:35:20] I got "Your webservice is not running" [16:35:33] with qstat I get `24357 0.30002 httpserver tools.musika r 12/30/2015 16:25:20 webgrid-generic@tools-webgrid-` [16:35:36] which looks correct [16:36:01] don't see lighttpd anywhere [16:36:06] (You have a .lighttpd.conf in your home, which you last edited Aug 13) [16:36:17] MusikAnimal: It keeps dying, that's the issue. [16:36:35] MusikAnimal: Because there is a bug in the config file. [16:36:42] ah [16:36:43] Do 'webservice stop' [16:37:15] that time it did something [16:37:21] After which, you'll probably have to restart the web server you /do/ want since the lighttpd will have taken the association on the proxy [16:38:00] You may want to remove the .lighttpd.conf in your tool's home regardless. [16:38:17] alright, restarting [16:38:51] can I just delete .lighttpd.conf and ensure it won't try to boot up by itself anymore? [16:40:06] MusikAnimal: Having done a successful 'webservice stop' will have it not try to start again, but removing the .lighttpd.conf is going to help make sure. [16:40:28] ok I shall delete it [16:40:48] (Sometimes, we rely on a configuration existing meaning you want a lighttpd running) [16:41:35] woohoo! we're back up and running :) thanks Coren [16:42:36] also very happy to report that my app stayed up and running for nearly a month and a half [16:42:43] with no issues [16:42:48] that's my longest streak I think [16:43:17] Nice.
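The lighttpd error quoted above names the exact config file it choked on, and lighttpd can re-run that parse on its own with its -t flag, which is a quick way to confirm a config bug like this one without going through the grid at all. A sketch using subprocess; the path is the one from the error message above.

    import subprocess

    CONF = "/var/run/lighttpd/musikanimal.conf"   # path taken from the parser error above

    # lighttpd -t parses the config and exits non-zero on a syntax error.
    proc = subprocess.run(["lighttpd", "-t", "-f", CONF],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    print(proc.stdout)
    print("config parses cleanly" if proc.returncode == 0 else "config has errors, see output above")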
[16:44:18] Ruby 2's garbage collection is pretty good, so I was pleased to see it never died due to memory, like xtools does 2-3 times a day [16:56:41] zhuyifei1999_: I [16:56:49] I'm pretty sure my user account still works [16:56:52] for mysql [16:59:16] valhallasw`cloud: I closed the bug report btw, because I have been told you should use your toolaccount for it. [17:01:04] 6Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#1910071 (10valhallasw) 5Invalid>3Open I'm pretty sure users are supposed to have a replica.my.cnf. For what it's worth, recently-adde... [17:14:23] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1910083 (10valhallasw) Judging from the 'client connections' panel, MySQL workbench opens four connections to the database server. Two of these disconnect after 30 se... [17:24:54] 6Labs, 10Tool-Labs, 10DBA: MySQL connections die in less than 30 seconds using tools-login tunnels - https://phabricator.wikimedia.org/T122658#1910096 (10valhallasw) As a workaround, click the 'reconnect to server' button (rightmost on the toolbar) before running the query. Not optimal, but at least you can... [17:25:12] Cyberpower678: ^. It's unclear to me why Workbench completely ignores keep-alives... [17:26:23] Then what's the purpose of setting it in the settings? And why is this a problem now? [17:26:30] It never was before. [17:26:40] I've been using Workbench for years now. [17:27:17] This makes my work impossible. [17:29:05] I just provided you a workaround, so 'impossible' is a bit [17:29:07] valhallasw`cloud, a new record. 15 seconds [17:29:21] You did? [17:29:21] bit over the top. 'Inconvenient', yes. [17:29:24] Where [17:29:37] In the bug. [17:30:23] I don't see it. [17:30:25] I'm pretty sure it's 30 seconds. See the Server > Client connections. [17:30:52] I set up an alter table query within 15 seconds, and hit execute and got an error. [17:31:19] Does the connection disconnect at 'Time = 15 seconds' for you, in that client connections panel? [17:31:23] Yet, if I persistently bombard the server with requests, the connection stays alive. [17:31:35] The what? [17:31:46] server > client connections [17:31:50] in mysql workbench [17:32:03] shows you the four connections it makes. Two of them time out after 30 seconds [17:32:20] How do I see that? [17:32:35] 18:31 server > client connections [17:33:41] I see -- Connection Id: 207907178 [17:33:41] -- User: s51059 [17:33:41] -- Host: 10.68.17.228:59467 [17:33:41] -- DB: None [17:33:42] -- Command: Query [17:33:42] -- Time: 0 [17:33:43] -- State: None [17:33:45] SHOW FULL PROCESSLIST [17:33:47] -- Connection Id: 207907199 [17:33:51] -- User: s51059 [17:33:53] -- Host: 10.68.17.228:59471 [17:33:55] -- DB: None [17:33:57] -- Command: Sleep [17:33:59] -- Time: 1 [17:34:01] -- State: [17:34:03] Whoops [17:34:09] I didn't know that would flood. [17:34:57] Yes. The 'Time' column is the relevant one [17:35:13] if you reconnect, you'll see four connections instead of two [17:35:24] two of them will disconnect when time is ~30 seconds [17:36:13] Yes I see them. [17:37:12] They disappear after thirty seconds [17:38:37] 6Labs, 10Tool-Labs: Not showing up as maintainer for several projects on toollabs - https://phabricator.wikimedia.org/T122456#1910104 (10Andrew) [17:39:04] valhallasw`cloud, I still don't see the workaround.
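The 'Client Connections' panel Workbench shows is essentially a view over SHOW PROCESSLIST, so the 30-second disconnects discussed above can be watched from any client. Below is a minimal sketch that does the same thing over an assumed tunnel (for example `ssh -L 3307:tools-db:3306 tools-login.wmflabs.org`), using credentials from the replica.my.cnf file discussed earlier; the port, tunnel target, and credential quoting are assumptions, not the actual setup from the log.

    import configparser
    import os
    import time
    import pymysql

    # Read user/password from the [client] section of replica.my.cnf.
    cnf = configparser.ConfigParser()
    cnf.read(os.path.expanduser("~/replica.my.cnf"))
    user = cnf["client"]["user"].strip("'\"")
    password = cnf["client"]["password"].strip("'\"")

    # Connect through the local end of the assumed SSH tunnel.
    conn = pymysql.connect(host="127.0.0.1", port=3307, user=user, password=password)

    for _ in range(12):   # watch for roughly two minutes
        with conn.cursor() as cur:
            cur.execute("SHOW PROCESSLIST")
            for conn_id, usr, host, db, command, idle, state, info in cur.fetchall():
                if usr == user:
                    print("connection %s: %s, idle %ss" % (conn_id, command, idle))
        print("---")
        time.sleep(10)

If your idle connections vanish once their Time column reaches about 30 seconds while this loop's own (active) connection survives, the drop is happening on the server or tunnel side rather than inside Workbench itself.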
[17:39:19] Cyberpower678: "As a workaround, click the 'reconnect to server' button (rightmost on the toolbar) before running the query." [17:40:29] I just refreshed the thread, it seems my cache decided to load an old copy of the bug. [17:40:37] So I didn't see your comment below. [17:40:42] Now I do. [17:40:43] Sorry [17:42:56] np [17:44:53] valhallasw`cloud, thank you for the workaround. That makes things simpler. [17:53:44] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910106 (10valhallasw) 3NEW [17:56:53] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910116 (10Andrew) 3NEW a:3MoritzMuehlenhoff [17:58:16] valhallasw`cloud: you got there first! [17:59:07] merged :-) [17:59:12] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910125 (10valhallasw) [17:59:13] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910126 (10valhallasw) [17:59:20] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910128 (10Andrew) Note that in our user record, the linux id is called 'uidNumber'. [17:59:43] 6Labs: Add LDAP uidNumber uniqueness overlay - https://phabricator.wikimedia.org/T122664#1910130 (10Andrew) [17:59:44] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910129 (10Andrew) [17:59:49] aaaaaaah! ok, I'm going to remove my fingers from the keyboard now [18:00:15] andrewbogott: sorry! [18:00:26] no worries, I merged mine into yours [18:00:35] and I merged yours into mine [18:00:41] see where this goes wrong :D [18:01:00] crap [18:01:07] well, my battery is dead so I leave this in your capable hands. [18:01:09] back later! [18:01:14] ok! [18:02:32] 6Labs: Ensure unique uidNumber field in ldap user records. - https://phabricator.wikimedia.org/T122665#1910131 (10valhallasw) 5duplicate>3Open Sorry for the back-and-forth spam! [18:41:24] 6Labs: Can't ssh to social-tools1 from bastion-01 - https://phabricator.wikimedia.org/T121313#1910164 (10ashley) 5Open>3Resolved a:3ashley This was fixed a while ago, thanks @yuvipanda and everyone else! [19:16:49] 6Labs, 10Tool-Labs: shinken does not warn about tools-grid-master puppet staleness - https://phabricator.wikimedia.org/T122667#1910179 (10valhallasw) 3NEW [19:18:21] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910187 (10valhallasw) 3NEW [20:05:02] 6Labs, 10Tool-Labs: GridEngine down due to bdb issues - https://phabricator.wikimedia.org/T122638#1910234 (10valhallasw) So.. good news and bad news. The good news: it seems we have a job explosion more often, and it doesn't typically bring down SGE. The bad news: it seems we have a job explosion more often....
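On the uidNumber tasks filed above: whatever mechanism ends up enforcing it (the proposed LDAP overlay or something else), the invariant being asked for is simply that no two user records share a uidNumber. Below is a client-side sketch of that check with the ldap3 library, not the overlay configuration itself; the server URI, bind, and base DN are placeholders, since the real Labs LDAP settings are not reproduced here.

    from collections import Counter
    from ldap3 import Server, Connection, ALL

    # Placeholder connection details; substitute the real directory and bind credentials.
    server = Server("ldap://ldap.example.org", get_info=ALL)
    conn = Connection(server, auto_bind=True)   # anonymous bind, purely illustrative

    conn.search("ou=people,dc=example,dc=org",
                "(objectClass=posixAccount)",
                attributes=["uid", "uidNumber"])

    counts = Counter(str(entry.uidNumber) for entry in conn.entries)
    duplicates = {num: n for num, n in counts.items() if n > 1}
    print("duplicate uidNumbers:", duplicates or "none")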
[20:22:02] !log ores create ores-cache-01 [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [20:23:18] !log ores Deployed with wb-vandalism:d940cea [20:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [20:33:57] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ogmios was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=243983 edit summary: [20:49:07] YuviPanda: where should I start debugging my python bot being able to connect to tools-elastic-* from tools-dev but not when running as a job on the grid? [20:53:42] hey bd808 [20:53:45] that's strange [20:54:01] bd808: I'd say start by sshing into a grid node and try just curling from there [20:54:13] *nod* [21:20:02] whaaa [21:22:05] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910336 (10yuvipanda) I think it can be safely re-enabled now. Am doing so and watching a run.. [21:22:39] still major changes/fixes happening? [21:23:30] no pengo [21:23:36] everything should be mostly ok [21:23:40] oh [21:23:46] 6Labs, 10Tool-Labs: puppet disabled on tools-grid-master - https://phabricator.wikimedia.org/T122668#1910337 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Done [21:24:06] bd808: any luck? [21:24:55] YuviPanda: haven't gotten to it yet. Sidetracked on a php-yaml task I've been neglecting for a couple of weeks [21:25:27] I got an email about T122458 but can't view it.. how many people did it affect or was it just me? [21:26:21] bd808: ah ok [21:26:52] pengo: ah, it was a fair number of users unfortunately [21:26:56] been fixed now tho [21:28:06] oops i thought it was that but i just messed up my login [21:30:04] not your fault, pengo :) [21:30:42] yeah, it's too early :) [21:32:47] hey cool. i can finally write to my tool dir with my user account [21:33:05] :D [21:36:53] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910401 (10Wilfredor) 3NEW [21:37:28] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910401 (10Wilfredor) [21:37:40] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910409 (10yuvipanda) Can this be written as a tool instead? [21:40:15] sheesh what's hogging all the cpu on tools-bastion-02? [21:40:36] do we know if there's anything wrong with redis right now? I'm getting `Error connecting to Redis on tools-redis-01.tools.eqiad.wmflabs:6379` [21:44:43] YuviPanda: curl from tools-exec-1202 is working fine. I'll try running my bot again this evening and see if I can reproduce the problem I saw last night. [21:45:02] I should just be able to do `redis-cli` in the terminal to open a redis console, right? [21:45:24] 10MediaWiki-extensions-OpenStackManager, 10CirrusSearch, 6Discovery: Searching for "Hiera:" with namespace "Hiera" deselected still shows results in "Hiera:" - https://phabricator.wikimedia.org/T110377#1910438 (10Deskana) [22:00:07] going to assume there is a redis connectivity issue? because if not there's something wrong with my tools [22:43:44] MusikAnimal: hmm let me look [22:43:49] MusikAnimal: redis-cli -h tools-redis should work [22:43:52] MusikAnimal: wait [22:43:59] MusikAnimal: why are you connecting to tools-redis-01 directly? 
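For the tools-elastic question above, the "ssh into a grid node and try just curling from there" suggestion translates directly into a few lines of Python, which is convenient given the bot itself is Python. Everything below is an assumption for illustration only: the instance name, the port, and the use of plain HTTP.

    import socket
    import requests

    HOST = "tools-elastic-01.tools.eqiad.wmflabs"   # hypothetical instance name
    PORT = 80                                        # assumed port

    # Check raw TCP reachability first, then make an HTTP request if the socket opens.
    sock = socket.create_connection((HOST, PORT), timeout=5)
    sock.close()
    print("TCP connect to %s:%d OK" % (HOST, PORT))

    resp = requests.get("http://%s:%d/" % (HOST, PORT), timeout=5)
    print(resp.status_code, resp.text[:200])

Running the same snippet from tools-dev and from inside a grid job would show whether the difference is network reachability or something environmental in the job itself.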
[22:44:09] MusikAnimal: you should just use 'tools-redis' and it'll route to the appropriate instance [22:44:14] it's tools-redis-1001 right now for example [22:44:19] bd808: cool thanks [22:44:24] hmm [22:44:44] I suppose that explains it [22:45:05] let me try [22:50:06] that did it! the Redis gem I'm using has you pass in a host, I felt for sure that would entail some sort of something.something.something [22:50:35] it does seem to be a little slow, though [22:55:58] I guess it was tools-redis-01 for a really long time! [22:56:07] anyway thanks YuviPanda, tools are working now :) [22:59:04] it was tools-redis-01 for a really long time :D [22:59:06] MusikAnimal: np [23:13:04] YuviPanda|afk: thanks for solving puppet [23:16:54] YuviPanda|afk: is labstore network i/o (i.e. NFS usage) somewhere in graphite? I'd like to correlate it to the weird jumps in sge jobids [23:17:16] but graphite.wm.o only shows cpu and disk usage [23:18:00] hm, I can probably plot the sge master network usage [23:28:45] YuviPanda|afk: so actually tools-shadow had been running the show since 17/12 [23:30:45] http://i.imgur.com/p1Gdx1p.png [23:37:26] ok, bed time! [23:48:04] Is it possible to access the user-db's on tools-db via Quarry? [23:53:35] hi guys [23:54:52] 6Labs, 7Tracking: Request new project - https://phabricator.wikimedia.org/T122678#1910651 (10Wilfredor) Yes yuvipanda, the right word is tool, thanks [23:55:15] it's me [23:56:11] Stigmj: no, I don't think so. That's physically a different database server, and there's no way to select a server
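To round off the redis thread above: the fix was simply to point clients at the service name rather than at a specific backend instance. A sketch of the same idea with the Python redis client (an assumption; the tool in the log uses a Ruby gem, but the host name handling is the same), including a hypothetical key purely to show a round trip.

    import redis

    # "tools-redis" routes to whichever backend instance is currently active,
    # so clients don't break when the instance behind the alias changes.
    r = redis.StrictRedis(host="tools-redis", port=6379, socket_timeout=5)
    print(r.ping())                 # True if the server behind the alias answers

    r.set("demo-key", "hello")      # hypothetical key, purely for illustration
    print(r.get("demo-key"))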