[00:05:54] YuviPanda: if/when you are working, I’m interested in your thoughts about why https://gerrit.wikimedia.org/r/#/c/258071/ doesn’t work [00:06:17] andrewbogott: looking [00:06:49] andrewbogott: is realm set properly? [00:07:29] YuviPanda: puppet is currently failing with [00:07:30] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to determined $::labsproject at /etc/puppet/manifests/realm.pp:18 on node promethium.eqiad.wmnet [00:07:39] That’s inside an if realm == labs block [00:10:15] andrewbogott: wait [00:10:22] andrewbogott: $::labsproject is not from hiera.. [00:10:26] andrewbogott: $::labsproject is from facter [00:10:36] is that set by us anywhere? [00:11:06] YuviPanda: I encourage you to update your puppet repo :) [00:11:19] https://gerrit.wikimedia.org/r/#/c/258051/ [00:11:34] ah :) [00:12:57] It’s possible I’m misinterpreting what’s happening… I assume that that initial hiera lookup is failing, it’s falling back on the fact (which doesn’t work on metal) and erroring out [00:16:07] andrewbogott: right. I guess we can test that with a notice {} there maybe? [00:16:34] sure [00:16:37] * andrewbogott tries [00:18:29] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867916 (10Merl) Two additional reasons: First one is caused by wikimedia plan to change to ssl only, please read my comment at T105794#1574563. So I need Java 1.8 to keep my bot working. The second one is that it causes m... 
[00:19:38] 6Labs, 10Tool-Labs: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1867920 (10Merl) [00:21:48] YuviPanda: I added notify messages and I see them on a labs box but not on promethium [00:21:59] so… no idea what puppet server promethium is hitting :( [00:22:06] but that would explain a lot [00:22:09] right [00:25:02] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [00:27:02] YuviPanda: it’s hitting the right puppetmaster [00:27:15] it’s just failing too soon to get to any of the notify lines [00:28:27] andrewbogott: hmm, I guess we can't really put an ordering thing there [00:31:39] did you change something? It just started working [00:36:17] andrewbogott: no [00:36:23] andrewbogott: also hiera caches for a min or something [00:36:31] hm [00:36:32] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [00:36:34] must be something cache-related [00:36:42] although it was way more than a minute [00:37:22] andrewbogott: wait puppet also needs to run on the puppetmaster to apply the labs hiera config change [00:37:25] did that happen already too? [00:37:34] I… think so? [00:37:39] But if not, that would explain the delay [00:37:54] right [00:56:51] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Amitie 10g was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=223625 edit summary: [00:57:26] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Frisko was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=223628 edit summary: [00:57:58] hmm... is there a reason I cannot create an instance? [00:58:12] YuviPanda, ? [00:58:24] did you use up all server space? [01:07:44] YuviPanda: sorry to ping but this is driving me nuts. I'm trying to build a new instance to replace bd808-vagrant. 
I created a new project "mediawiki-vagrant" to put it in this morning. I have since tried to build 2 instances there and neither one would allow me to ssh in after puppet finished running. [01:07:57] YuviPanda: mwv-image-builder.mediawiki-vagrant.eqiad.wmflabs is the instance I haven't deleted yet with the problem [01:11:40] RECOVERY - Puppet failure on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [01:15:00] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [01:40:12] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [02:10:00] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1868319 (10bd808) 3NEW [02:25:32] 10MediaWiki-extensions-OpenStackManager, 10Echo, 3Collaboration-Team-Current, 5Patch-For-Review, 5WMF-deploy-2015-12-08_(1.27.0-wmf.8): Write presentation models for notifications in OpenStackManager - https://phabricator.wikimedia.org/T116853#1868344 (10Catrope) 5Open>3Resolved [02:31:04] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1868356 (10bd808) [02:31:08] 6Labs, 10MediaWiki-Vagrant, 15User-bd808: Create "mediawiki-vagrant" project - https://phabricator.wikimedia.org/T120982#1868357 (10bd808) [02:56:21] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:08] bd808: ugh, sorry :| [02:57:11] also wat testing-shinken- [02:57:26] fuck fuck fuck [03:02:46] ok [03:02:46] it's fine [03:02:46] just slow [03:02:47] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:47] mm [03:02:47] maybe not [03:02:47] Hi! 
i created an instance - apache-multihost-01.analytics - turned role::puppet::self on, and am trying to ssh into it - but i get denied [03:02:48] YuviPanda: I filed a bug about it. Hopefully it's something simple [03:02:48] yeah [03:02:49] I'm at an event now with super low data [03:02:57] I've called andrew and he should be here soon [03:03:00] for the tools outage [03:03:42] YuviPanda: is this another one of those ‘everything is slow for a minute’ things? [03:03:43] Or different? [03:03:46] andrewbogott: not sure. [03:03:49] (maybe you were asleep for those) [03:03:53] bd808: are you having the same problem that i am? [03:04:01] everything is recovered now, of course [03:04:01] andrewbogott: icinga / shinken paged and then it timed out when I tried it [03:04:04] augh [03:04:10] So, same as before I think [03:04:37] I see [03:04:48] madhuvishy: apache-multihost-01.analytics looks just fine to me [03:05:13] andrewbogott: interesting - i can get in now [03:05:18] It’s a mess [03:05:20] but reachable [03:05:21] i tried 10 times before [03:05:29] why is it a mess [03:05:29] I got the page too, was it transient? [03:05:32] did you apply puppet::self before logging in [03:05:33] ? [03:05:39] chasemp: same thing as this morning I think [03:05:41] andrewbogott: oh yeah i think so [03:05:55] madhuvishy: usually best to make sure puppet is running cleanly before you apply new classes [03:06:01] probably you should just delete this one and start over [03:06:20] andrewbogott: ok cool - i thought that was for other classes - not puppet::self. cool, i'll do that thanks [03:06:53] madhuvishy: /especially/ for puppet::self; it’s like fixing a car while it’s driving down the freeway. [03:07:09] Good analogy heh [03:07:25] chasemp: So those pages are going to keep happening, and I don’t know how to debug them [03:08:17] andrewbogott: I see. Thanks for explaining! 
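The advice above (make sure puppet is running cleanly before you apply a new class like role::puppet::self) can be scripted. A minimal sketch in Python, assuming the standard `puppet agent --test --detailed-exitcodes` convention (0 = no changes, 2 = changes applied, 4 = failures, 6 = changes plus failures); the helper names are mine, not an existing tool:

```python
import subprocess

# Exit codes for `puppet agent --test --detailed-exitcodes`:
#   0 = no changes, 2 = changes applied, 4 = failures, 6 = changes + failures
def puppet_run_clean(exit_code):
    """True when a --detailed-exitcodes agent run finished without failures."""
    return exit_code in (0, 2)

def safe_to_apply_new_class():
    """Run the agent once and report whether the catalog applied cleanly."""
    proc = subprocess.run(["puppet", "agent", "--test", "--detailed-exitcodes"])
    return puppet_run_clean(proc.returncode)
```

Run something like this (or just eyeball a `puppet agent --test` run) before adding classes; applying puppet::self on top of a broken run is the fixing-the-car-on-the-freeway situation described above.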
[03:08:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953411 bytes in 5.580 second response time [03:09:55] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953352 bytes in 4.140 second response time [03:12:53] andrewbogott: yeah, so seems like toollabs homepage is showing up pretty consistently [03:13:01] I guess that's because it's one of the few things we consistently check [03:13:07] yeah [03:13:09] any ideas how many times that has happened today? [03:13:17] three or four I think [03:13:20] looking at logs to see if anything makes sense now [03:13:35] but we should send an email w/ the times and how far apart and generally what the deal is to @ops [03:13:43] yeah [03:13:48] in case it happens first time and someone can catch it in the act [03:15:25] holy crap I see w/ htop slapd fill the screen w/ procs [03:15:40] and it's only at 30-40% on the 4 cores but I could easily see this spiking [03:15:52] i don't know openldap much at all [03:16:08] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:16:22] I’m trying to write a little tool to record ldap query times [03:16:52] that would be a good idea [03:17:03] assuming that it’s queries and not writes... [03:18:40] queries are definitely affected at least [03:18:53] oh? [03:19:00] I feel like it could just be ‘labs was locked up’ [03:19:07] unless you have a slow query running on a non-labs host? [03:19:09] otherwise why would regular dns be affected? [03:19:15] queries I mean [03:19:33] hm true [03:19:41] hey can you spin up a vm quick [03:19:46] that was causing this am [03:19:53] ... recovery? [03:19:57] Same as the other times? [03:20:03] Coren: same [03:20:10] (Sorry for the delay, was on the road back from the gym) [03:20:38] chasemp: doing [03:21:06] Still DNS stalls pointing to ldap? [03:21:48] well that's the symptom theory [03:21:50] hm, it’s happening I think [03:21:53] root cause unknown [03:21:54] Now?
[03:21:59] or else my local network is just slow [03:22:05] No, it might be. [03:23:31] DNS is erratic - I have delays on labs-ns1 but labs-ns0 is responsive. [03:23:32] what do you think, is labs stalling out? [03:23:34] Hm. Though I expect pdns itself does some caching so won't always hit ldap. [03:24:23] Yeah, something is definitely wrong with labs in general, but I just ruled out the FS (active, no issues there) [03:24:31] But lookups of users stall. [03:24:41] (getent passwd just hangs) [03:24:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:25:00] Most bots are happily chugging along. [03:25:10] ok, there, the storm is over [03:25:15] so it lasts just long enough to page :) [03:25:28] But anything that hits passwd or group db goes blam. [03:25:49] I can login to a vm fine now etc [03:25:59] chasemp: same here [03:26:52] But I had a ssh open on a bastion so I could do tests without hitting auth. 'touch foo -> worked', 'sleep 1 -> worked', 'ps faux -> hangs', 'getent passwd -> hangs' [03:27:01] I'll eat my had if that isn't ldap. [03:27:04] hat* [03:27:30] I mean, surely seems like ldap but why [03:27:36] Coren: but why do our shells freeze at the same time? [03:27:44] Oh, yours didn’t? [03:27:58] Mine did but I was doing :wq in a VM [03:27:59] Not as long as I didn't do anything that looked usernames up. [03:28:01] which would have hit NFS [03:28:14] Hm. [03:28:36] NFS might be partially impacted too - it does ldap lookups for group membership. But it caches aggressively. [03:29:10] andrewbogott: how often does ldap "sync" [03:29:12] between them [03:29:31] chasemp: ooooh. I like the way you think. [03:29:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 953520 bytes in 3.765 second response time [03:31:14] where is this coming from [03:31:15] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:31:19] is it main icinga? [03:31:22] or shinken?
[03:31:42] The alert? Shinken afaik. Yuvi? [03:32:22] * Coren points out that it might be a good idea for shinken-wm to say 'l.team' somewhere in its alerts. [03:33:06] I've got instrumentation on labstore and on an instance looking at processes. If it happens again I'll see if there is a NFS impact as well. [03:33:11] pretty sure it’s shinken [03:33:28] Coren: great, that saves me the trouble of setting that up [03:33:37] history since the top of the hour for tools.wmflabs.org [03:33:38] [2015-12-10 03:25:38] SERVICE ALERT: tools.wmflabs.org;tools-home;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 953415 bytes in 4.299 second response time [03:33:39] Service Ok[2015-12-10 03:25:17] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;OK;SOFT;2;HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.017 second response time [03:33:39] Service Critical[2015-12-10 03:23:48] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;SOFT;2;CRITICAL - Socket timeout after 10 seconds [03:33:41] Service Critical[2015-12-10 03:23:28] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [03:33:42] the other thing I wanted to do is gather stats about ldap query response time, and gather those stats on a non-labs box [03:33:43] Service Critical[2015-12-10 03:21:48] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds [03:33:45] Service Ok[2015-12-10 03:02:27] SERVICE ALERT: tools.wmflabs.org;tools-home;OK;HARD;3;HTTP OK: HTTP/1.1 200 OK - 953455 bytes in 9.700 second response time [03:33:47] Service Ok[2015-12-10 03:01:57] SERVICE ALERT: tools.wmflabs.org;NFS read/writeable on labs instances;OK;SOFT;3;HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.106 second response time [03:33:49] Service Critical[2015-12-10 03:00:18] SERVICE ALERT: tools.wmflabs.org;tools-home;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds [03:33:57] Coren: Is that something you can
arrange easily, or should I keep working on that? [03:34:22] chasemp: I just sent an introductory email to the Ops list; please respond with additional details as you see fit [03:34:39] ok thanks I don't have much to go on atm, I see tools.wmflabs.org flapping a lot [03:34:40] andrewbogott: I'm doing this the manual way - eyeball with tops and iptraf; do you want longer-term recording? [03:34:42] but idk why [03:34:48] would dns issues cause a check of that to die? [03:35:09] chasemp: No, but if ldap stalls the tools.wmflabs.org landing page does many lookups. [03:35:14] I want to reset the master account password for my labs-hosted wiki but the email for changing it isn't going through. Is there a special incantation I forgot to say? [03:35:18] chasemp: As it looks up group membership. [03:35:23] ah [03:35:32] well then [03:35:44] I can make a quick fix and remove the tool list from the landing page [03:35:58] But that will just hide the problem. [03:36:31] Coren, want me to cause another outage? :) [03:37:00] andrewbogott: If you have a repeatable way to trigger it, I'd love to see the effects live for once. [03:37:41] Punchline: After the crash, the first thing the software engineer says is “Let’s go back to the top of the hill and see if the brakes fail again!" [03:37:45] andrewbogott: [03:37:48] could it still be file descriptors [03:37:49] http://www.openldap.org/lists/openldap-software/200507/msg00063.html [03:37:55] Dec 10 03:37:05 seaborgium slapd[17834]: connection_input: conn=1463599 deferring operation: binding [03:37:59] and I see lots of [03:38:03] Dec 10 03:36:23 seaborgium slapd[17834]: connection_read(1855): no connection! [03:38:06] some ppl say to ignore but [03:38:19] Coren: strap in [03:38:19] some more savvy posts seem to indicate if you see openldap deferring [03:38:23] it's bouncing folks is it not? [03:38:32] If there's lots of them, that smells bad chasemp. [03:38:34] Jul 4 13:50:03 annuaire slapd[19523]: connection_read(18): no connection!
[03:38:35] That is why it is deferring the operation. A quick google on that [03:38:35] shows some tuning of idle timeouts and possibly ulimit might be [03:38:36] helpful. I would recommend that you start tracking connections and [03:38:39] file descriptors to get a feel of conditions surrounding your [03:38:40] "freeze". [03:38:47] we have a ulimit issue already didn't we? [03:39:16] chasemp: alex pushed that limit way up and we're nowhere close to it that he could see. (2000+ fd iirc) [03:39:36] Well, nevermind, it’s not going to happen this time apparently [03:39:52] Max open files 4096 4096 files [03:39:53] andrewbogott: Of course, it *had* to be an heisenbug too. [03:41:44] openldap can open a shit ton of threads and use a shitton of connections I imagine [03:41:44] If I could’ve made it happen while Moritz was watching it’d be fixed by now :) [03:41:45] especially since it's getting hit w/ web traffic to icinga and other business [03:41:45] grep deferring debug | wc -l [03:41:45] 1000 [03:42:01] ah, there we go [03:42:07] chasemp: What time period is this? [03:42:19] I’m seeing some freezing now [03:42:23] but I can still do ldap queries [03:42:51] andrewbogott: I'm seeing a reduction in NFS traffic consistent with stuff freezing, but I can now confirm that the server /itself/ remains responsive. [03:42:53] * andrewbogott is going to keep triggering that page until EVERYONE is out of bed [03:43:10] ldap looks fine to me though... [03:43:24] Most processes run without issue. [03:43:27] dafu [03:45:25] Coren: ldap working for you too? [03:45:27] andrewbogott: I'm not sure. I'm trying a few things. [03:45:27] chasemp: Are you seeing a burst of deferred in the logs? [03:45:30] hmmm [03:45:32] bursts of [03:45:34] Dec 10 03:44:39 seaborgium slapd[17834]: connection_read(1858): no connection! [03:45:34] Dec 10 03:45:07 seaborgium slapd[17834]: connection_read(1802): no connection!
[03:45:35] Dec 10 03:45:17 seaborgium slapd[17834]: connection_read(1810): no connection! [03:45:41] idk some ppl say ignore it but... [03:45:43] it's odd [03:45:50] and we’re back [03:45:52] The timing is very suspicious. [03:46:00] chasemp: did it stop doing that? [03:46:07] when we recovered? [03:46:21] (or shortly before) [03:46:31] what's recovery time? [03:46:32] idk [03:46:42] Near 03:46 [03:47:13] Like, if there are many less during 03:45 than 03:46 we might be on to something. [03:47:18] last one is Dec 10 03:45:58 seaborgium slapd[17834]: connection_read(1813): no connection! [03:47:20] now I see another one [03:47:22] Err, the other way around. [03:48:07] the tools page seems pretty happy now [03:48:20] (weird that that last go didn’t page. It seemed like a freeze to me.) [03:49:10] there is a specific message for file descriptor exhaustion and I don't see any past the 8th [03:49:14] so there is that [03:49:57] andrewbogott: The check has to happen during the stall, if it's less than 5m it may well be missed entirely. [03:50:00] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:54:38] I am sleepy and out of ideas. Are we waiting until tomorrow to pester moritz and alex, or are there more angles of attack yet tonight? [03:54:52] I'm going to increase log level on slapd [03:54:56] and see if it spits out more useful [03:55:00] and I'm looking at timeout tuning ideas [03:55:39] I honestly can't think of a new approach at this time. Unless the increased log verbosity suggests something. [03:56:42] Here’s a thing that doesn’t fit the timeline but has been bothering me for a couple of days: https://dpaste.de/wXNx [03:57:00] if internal WMF dns has been going down, that would cause all kinds of hurt, including what we’re seeing now [03:57:04] Seems far-fetched though [03:57:19] andrewbogott: Hmm. [03:57:29] PROBLEM - Host tools-master is DOWN: CRITICAL - Host Unreachable (10.68.16.9) [03:58:58] * Coren sighs.
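The "little tool to record ldap query times" mentioned earlier could start as small as timing TCP connects to the LDAP port from a non-labs box. A stdlib-only sketch, with the caveat that timing a real bind/search would need an LDAP client library; the function name is mine:

```python
import socket
import time

def ldap_connect_time(host, port=389, timeout=5.0):
    """Time a bare TCP connect to the LDAP port.

    Returns elapsed seconds, or None if the connect fails or times out.
    Note: a stalled slapd that still accept()s connections will look fine
    here, so this only catches the grossest failures; a fuller probe
    would also perform a bind and a simple search.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        return None
    return time.monotonic() - start
```

Run from cron on a non-labs host, appending (timestamp, elapsed-or-None) to a log; spikes and failures can then be lined up against the page times.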
[03:58:58] And we need a way to explain to shinken when hosts no longer exists. :-) [03:58:59] shinken itself is fine [03:58:59] this is a testing instance someone not me is running [03:58:59] Krenair: i think [03:58:59] Ah. [03:59:00] andrewbogott: Hm. I admit that while the timing doesn't work, internal dns resolution errors are worrisome. [03:59:53] ah, ‘testing-shinken’ [04:00:01] RECOVERY - Host tools-master is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [04:00:41] if someone has op they can ban it from here [04:01:37] how long has it been now? :) [04:01:52] ok ok I restarted it there [04:01:55] so that may have been my bad [04:03:47] Hm. You just restarted it? [04:03:47] And we get exactly the same symptoms. [04:03:56] no around 21:57 [04:03:56] ... is puppet restarting the daemon at every run, perchance? [04:07:33] chasemp: Are the more verbose logs being helpful, at least? [04:08:20] not particularly but there is a crazy number of logging options [04:08:26] http://www.openldap.org/doc/admin24/slapdconfig.html [04:08:32] Table 6.1: Debugging Levels [04:08:53] Meh. Set all bits to 1! [04:14:47] see if a page comes I restarted it again [04:16:36] I dunno about pages, but things just stalled. [04:16:37] other than this stall any in the last 15 minutes? [04:16:38] Yeah, cool I suppose. It seems we now know of a surefire way to cause the issue: restart slapd. [04:16:40] are the stalls only on instances with NFS? [04:16:41] so I did a few things here guys [04:16:44] I haven't gotten any ores pages and they have no NFS [04:16:45] disabled puppet on seaborgium and set a few limits in slapd.conf [04:16:46] YuviPanda: No, the general bastion also stalls and iirc that one doesn't have NFS [04:16:58] ok [04:17:21] it's just anything ldap afaiu [04:17:32] YuviPanda: The stalls are always fairly short; 2-3 min, so depending on check interval we don't always get alerts.
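The slapd "deferring operation" / "no connection!" messages being grepped here can be bucketed per minute to line the bursts up against the stall windows. A sketch; the regex just mirrors the syslog lines quoted in the channel, and the names are mine rather than an existing tool:

```python
import re
from collections import Counter

# Matches syslog lines like:
#   Dec 10 03:45:07 seaborgium slapd[17834]: connection_read(1802): no connection!
#   Dec 10 03:37:05 seaborgium slapd[17834]: connection_input: conn=1463599 deferring operation: binding
SLAPD_NOISE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}):\d{2} \S+ slapd\[\d+\]: "
    r"(?:connection_read\(\d+\): no connection!|.*deferring operation.*)"
)

def bucket_per_minute(lines):
    """Count matching slapd messages per syslog minute ('Dec 10 03:45' -> n)."""
    counts = Counter()
    for line in lines:
        m = SLAPD_NOISE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Feeding `/var/log/syslog` through this and comparing minute counts before and after a recovery is exactly the "many less during 03:45 than 03:46" comparison discussed below.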
[04:17:40] slapd sometimes decides it can't answer or honor a client or client request idk [04:18:03] I did one more thing I commented out [04:18:04] #limits dn.exact="cn=repluser,dc=wikimedia,dc=org" time=unlimited size=unlimited [04:18:16] which I believe says the replication can use as much time or query size as it wants [04:18:19] Unstalled. [04:18:34] Hm. Or not quite yet. [04:18:43] so it's still hosed up now? [04:19:04] sure enough [04:19:11] chasemp: It's erratic. I managed to log into bastion, then not. [04:19:18] o_O [04:19:42] ok I'm going to reset here w/ puppet [04:19:58] YuviPanda: what do you know about openldap? :) [04:20:02] I know...not much [04:20:06] 0 [04:20:14] me too [04:20:26] chasemp: I've deployed openldap in the past, but not with that many clients. [04:20:29] I also have an intermittent connection for the next hour before I reach stability [04:20:45] can we increase cores / ram for the ganeti instance? [04:22:26] I've always been super uncomfortable about the fact that we're using ganeti for this [04:22:27] it's not really hammering away at ram or cpu [04:22:27] I'm more suspicious of an in-slapd limit [04:22:27] ok [04:22:27] * YuviPanda is on his phone and just making stuff up [04:23:21] chasemp: Wait, are you doing evil things to ldap? I'm getting auth errors now. [04:24:22] uh yeah so yes [04:24:24] idk why it failed to start but it did [04:24:25] give it like 10s [04:24:25] Ah, kk. [04:24:26] so this is pretty much nuts [04:24:30] and unfortunately the nature of our rollout makes rollback a nightmare as well [04:24:45] Interestingly enough, you just restarted it and this time: no stall. [04:26:52] stall now? can you let me know if it starts again? [04:26:52] I'd like to lay off for a few to see what happens [04:26:52] Nope. Not stalling that I can see. [04:26:56] Oh, wait, here it comes. [04:27:03] I was probably saved by the caches. [04:27:09] nscd for the save. :-) [04:27:18] doing it now? [04:27:22] Ayup.
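For context, the knobs being toggled here are slapd.conf directives. A hedged sketch of the kind of limit/timeout settings under discussion; the values are illustrative, not what was actually deployed on seaborgium, and only the repluser line is quoted from the channel:

```
# /etc/ldap/slapd.conf (fragment; values illustrative)
idletimeout 30        # drop client connections idle for more than 30 seconds
sizelimit   5000      # max entries returned per search
timelimit   60        # max seconds slapd will spend on one search
# per-DN override, the line commented out during debugging above:
limits dn.exact="cn=repluser,dc=wikimedia,dc=org" time=unlimited size=unlimited
```

The `limits` directive exempts a specific bind DN (here the replication user) from the global size/time limits, which is why commenting it out was a plausible experiment when replication was suspected.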
[04:28:17] * andrewbogott can still ldaplist [04:30:06] Hm. All but one instance have stopped talking to NFS [04:30:07] .... and it's back [04:30:17] Ah. stracing the mountd showed it was stuck in a getgrgid() [04:31:05] Interesting fact: the only instance that wasn't stuck was quarry-runner-02.quarry.eqiad.wmflabs. [04:31:31] s/wasn't stuck/was talking to NFS/ [04:31:46] (I presume others might have been not stuck but just not talking to nfs) [04:33:01] First time I see that effect on a stall; I wonder if there is a cache timeout issue to compound things. [04:33:32] chasemp: But I note that none of the stalls have lasted long enough for shinken to pick it up. Has anything changed on your end? [04:33:53] at "I'd like to lay off for a few to see what happens" [04:33:59] I put in some limit params and left it alone [04:34:15] I'm not optimistic tho [04:34:49] I mean if this is prod we wake up moritz at this point as the main guy on this [04:34:55] idk what the protocol is here [04:35:08] but learning ldap in 30m at this time of night isn't working out [04:35:18] I'm not sure either - is this an outage or just degraded? [04:35:37] I think we can live with 'degraded' until Moritz wakes up. [04:36:29] yeah, I think chasemp should add a brain dump to the email I just sent, and leave it to moritz and alex to look at in a few hours [04:38:03] how can I prevent a SGE error from blocking further submission of -once jobs in the future? [04:38:25] Ah, because the job errored out? [04:38:33] andrewbogott: best guess is something to do w/ replication...is [04:38:33] do_syncrep2: rid=001 LDAP_RES_SEARCH_RESULT [04:38:37] when replication kicks off? [04:38:44] it happened several times that another user notifies me about my bot not working, then I qdel the job to get it working again [04:39:00] Did the job status include an 'e'?
[04:39:00] Coren: it's always the "NIS error" or something [04:39:09] Coren: yeah a big "E" [04:40:08] liangent: You could do a qmod -cj before attempting to start the job - that will clear the error condition if there is one. [04:42:03] liangent: At the cost, of course, of hiding possible issues. [04:42:09] Coren: what will happen with this command [04:42:24] the job in error state will be restarted? [04:42:55] chasemp: your guess is as good as mine. It looks replicationy [04:43:40] ehm [04:43:48] andrewbogott: yeah it's just flapping all over the place [04:43:49] Coren: labs instance has stopped working (/data/project) [04:44:11] andrewbogott: Coren i saw NFS read/writeable on labs instances flap bad/good/bad/good there [04:44:13] but this second running again [04:44:31] so what's up? I just got paged for the 400th time :P [04:44:48] chasemp: Yeah, NFS also suffers from the same symptoms because it uses ldap. :-( [04:45:44] plz don't let it crash completely [04:45:47] bblack: seems the ldap servers put in place are periodically freezing or not answering clients or so it seems [04:45:59] basically anything that hits ldap has flapping issues we believe [04:46:07] and that is a lot of stuff [04:46:25] Where "a lot" means "pretty much all" [04:46:40] andrewbogott: ok so how do we just turn off repl then [04:46:45] to see if that helps [04:46:52] we need sanity more than robustness [04:46:53] atm [04:47:38] chasemp: I have not touched either of the openldap servers. I can start flipping through the docs though :) [04:47:50] ha I'm in the same boat [04:48:01] in light of the email, can we just mark that one service in downtime while it's being worked on? [04:48:13] I don't want to ignore my phone that could have real pages on it, and I'm tired of it going off :P [04:48:35] chasemp: what would it mean to break replication?
Services will write to one or the other server and they’ll get out of sync… seems like that could cause hilarious things to happen [04:48:56] andrewbogott: I was thinking stop slapd on the secondary and have only one server not trying to replicate [04:49:09] ah [04:49:14] bblack: I'm not positive which thing paged you [04:49:27] well… presumably if we stop one of them it will stop replicating, right? :) [04:49:27] but a downtime is probably in order if it's just tools homepage or something [04:49:37] andrewbogott: well yeah but does it spin its wheels harder or not [04:49:39] :) [04:49:43] tools-home on tools.wmflabs.org [04:49:57] bblack: looking [04:49:58] it's flapped like 8 times today [04:50:04] ok I downtimed that for 4 hours I think [04:50:05] that's the only one that pages us all [04:50:19] or not I did the wrong one [04:50:39] chasemp: looks right to me [04:50:42] 4 hours is optimistic [04:51:10] yeah that's the right one in icinga [04:51:12] what time is it in germany right now? [04:51:19] UTC+2? [04:51:23] 5:50 AM [04:51:33] well, in France, I presume DE is the same [04:51:45] oh ok yeah let's just call moritz that's not bad, if I'm him and it's my deal I want to be called [04:51:53] agree / disagree? [04:52:13] I agree and also I feel bad :) [04:52:16] bblack: how current are you on ldap :) [04:52:39] chasemp: it’s an hour later in Greece [04:53:00] could call alex I think he did a lot of the ldap-ing [04:58:09] ok no dice on answer [04:59:39] moritz too? [05:00:55] I only called moritz atm, we are sitting green everywhere I look [05:01:17] and I'm watching seaborgium atm it looks ok so I'm kinda waiting a moment [05:06:04] ok andrewbogott let's see how stopping the other slapd works out [05:06:05] :) [05:06:29] ok...
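Circling back to liangent's stuck -once jobs: Coren's `qmod -cj` suggestion can be wrapped so error states are cleared before resubmission. A sketch in Python; the `qstat` column layout assumed below is illustrative rather than verified against this grid, and as Coren noted, clearing the error hides whatever made the job fail:

```python
import subprocess

def jobs_in_error(qstat_output):
    """Return job IDs whose SGE state column contains 'E' (error).

    Assumes the default `qstat` layout (job-ID  prior  name  user  state ...);
    treat the exact columns as an assumption, not gospel.
    """
    jobs = []
    for line in qstat_output.splitlines()[2:]:  # skip the two header lines
        fields = line.split()
        if len(fields) >= 5 and "E" in fields[4]:
            jobs.append(fields[0])
    return jobs

def clear_error_states(job_ids):
    """`qmod -cj <id>` clears the error so the job can be scheduled again."""
    for jid in job_ids:
        subprocess.run(["qmod", "-cj", jid], check=False)
```

Running something like `clear_error_states(jobs_in_error(...))` before each submission avoids the qdel-and-resubmit dance, at the cost Coren mentions of masking real failures.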
[05:06:46] Since this is an intermittent issue… we’ll only know if it’s worse, not if it’s better [05:07:30] yeah [05:07:53] and I’m not prepared to stay up for the 6 hours needed to form an opinion about if it’s resolved [05:08:13] me neither I'm just looking to gauge stability [05:08:21] I think the last hour has been survivable honestly [05:08:25] it seems to have at least chilled out some [05:08:34] whether that's due to settings staged on seaborgium [05:08:35] idk [05:08:37] maybe not [05:08:37] so I might argue in favor of doing nothing [05:08:42] vs. shutting down one server [05:08:52] because killing one of them enters new territory and we won’t be here to babysit [05:09:18] and the status quo is bad but I can imagine worse [05:09:52] well I had already stopped the secondary when you said that :) it went green from red (not that it means anything) [05:09:58] and I'm going to just sit and wait for a few [05:10:19] and if it has same issues then ok restart and give it a bit of time but yes the last hour or so has been not a big deal really [05:10:37] but it has flaked out pretty hard 2h ago or so [05:10:42] so I'm waiting for a few [05:11:29] ok — I’m going to go brush my teeth &c and will check back at least to say goodnight :) [05:11:35] ok same shit [05:11:37] k [05:11:42] * andrewbogott is in EST right now [05:19:41] Same here; but I'll keep an eye out as long as I can. [05:25:28] it's been about 45 minutes since 'NFS read/writeable on labs instances' flapped [05:25:33] before that [05:25:48] it was 6 times within the hour [05:26:02] and that's when I decided to sit on my changes so I have no optimism for this being an end-all [05:26:08] but if it says ok for an hour [05:26:21] I'll try to send an update and call it triaged for moritz [05:26:50] works for me. I agree that it seems solid right now [05:27:14] I’m signing off — with luck all this will be magically fixed by morning [05:27:50] Yeah, it's been visibly more stable.
[05:28:01] oh, chasemp, I’m going to mark ldap and puppet downtimed on serpens [05:28:11] doing it now :) [05:29:13] Coren: related? Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [05:29:56] chasemp: ok, we both did it then [05:30:03] ha [05:30:07] tx andrewbogott [05:30:09] good night all! [05:30:20] later man I'll call if I see things get weird [05:30:42] chasemp: Not sure. That's ldap-bound for sure. [05:35:25] chasemp: So, replication has been out for that time? [05:36:23] no down for about 22m [05:36:31] 20m or so I guess [05:36:49] well I'm not sure what is the end result of one daemon stopped, how hard does the other try [05:37:00] from what I can tell it's doing a kind of mirroring [05:37:10] but anyways yeah [05:39:24] it's hard because if this were some existing long standing service I would look at changes and hopefully a pain point or two [05:39:28] but being 1 day or so old [05:41:30] chasemp: You feel comfortable with my going to sleep? [05:41:35] Coren: did you say http://tools.wmflabs.org/ hits ldap hard for that membership info? [05:41:45] sure man go for it I can call you if it gets bad again [05:42:01] I'm going to hang it up assuming this lasts another bit here [05:42:07] 30m since real flapping or so [05:42:18] chasemp: Lemme check how hard the current version does; I moved much of it to cache in the DB instead. [05:43:15] * Coren wonders. [05:43:26] Actually, the current version doesn't hit ldap at all... [05:43:29] D'oh! [05:43:30] ha [05:43:32] the proxy does! [05:43:46] Because it maps per-user [05:44:13] I'm tired and dumb how does that work out for tools.wmflabs.org being so touchy [05:44:28] ppl are hitting urls that do hit ldap hosing it but the homepage doesn't?
[05:45:05] No, the homepage php doesn't hit ldap - but the proxy itself does to look up the user to connect /to/ [05:45:24] (The homepage is just a magic rewrite for /admin/) [05:45:35] I.e.: the tool.admin user [05:46:04] I expect if things stall, /every/ web service does because proxy. The landing page is just the only one being checked. [05:47:16] Anyways, bed if you're okay holding down the fort for a bit more? [05:47:24] sure, later on [05:47:24] You PST right now? [05:47:27] CST [05:47:33] it's late yo [05:47:36] Still better than EST. :-) [05:47:37] o/ [05:47:39] true [07:34:38] PROBLEM - Host tools-worker-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.122) [07:44:14] 6Labs, 10Tool-Labs, 6Design Research Backlog, 6Learning-and-Evaluation, and 2 others: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1868588 (10egalvezwmf) @leila, @yuvipanda, wondering how this is going? Wondering if you need support or if its possible to post a brief summ... [08:50:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:14] 6Labs, 10Tool-Labs: Update php 5 to php 7 - https://phabricator.wikimedia.org/T121022#1868704 (10valhallasw) 5Open>3declined a:3valhallasw Contraindications: - php7 is not available for trusty, precise or jessie from the official repositories, - php5 will be supported for the lifetime of trusty, which... [09:07:26] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1868742 (10valhallasw) Once merged, this will install ``` openjdk version "1.8.0_40-internal" OpenJDK Runtime Environment (build 1.8.0_40-internal-b09) OpenJDK 64-Bit Server VM (build 25.40-b13, mixed...
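Coren's point above is that the landing page is only special in that the proxy rewrites / to the tool.admin user; every tool URL goes through the same per-tool backend lookup, so a stalled lookup stalls every webservice at once. A toy sketch of that routing, with a plain dict standing in for whatever store the real proxy consults (the backend name and port here are invented):

```python
# Hypothetical registry; the real proxy resolves tool -> backend dynamically.
ROUTES = {
    "admin": "http://tools-webgrid-01:4242",   # invented backend address
}

def resolve_backend(path):
    """Map '/toolname/rest...' to that tool's backend.
    '/' is rewritten to the admin tool, like the landing page."""
    tool = path.strip("/").split("/", 1)[0] or "admin"
    # None means no such tool; this surfaces to the client as an error page
    return ROUTES.get(tool)
```

The key property the chat describes: a request for a deleted tool (like /weather/) and a request for the homepage both hit the same lookup path.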
[09:07:36] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1868743 (10valhallasw) p:5Triage>3Low [09:10:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 954646 bytes in 7.632 second response time [09:56:38] 10Quarry: Login to somebody's account - https://phabricator.wikimedia.org/T120988#1868826 (10IKhitron) I am happy, @Edgars2007, but: 1: Maybe I'm a troll. ;-) 2: Maybe somebody else was in your account, and he is a troll. [10:22:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:52] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955296 bytes in 7.694 second response time [10:35:12] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1868874 (10jcrespo) This seems to have worked. Lag is in the 0-10 range, which is acceptable. There is stil... [10:35:36] 6Labs, 10Tool-Labs, 10DBA: Certain tools users create multiple long running queries that take all memory from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601#1868877 (10jcrespo) [10:35:37] 6Labs, 10Tool-Labs-tools-Other, 10Labs-Infrastructure: s52721__pagecount_stats_p import is making labsdb1005 100% utilized (and lagging its backup slave) - https://phabricator.wikimedia.org/T120926#1868875 (10jcrespo) 5Open>3Resolved a:3jcrespo [10:51:42] 6Labs, 10Tool-Labs, 10DBA: Throttling linkwatcher tool user as it is consuming 100% CPU - https://phabricator.wikimedia.org/T121094#1868898 (10jcrespo) 3NEW [10:52:12] ^Beetstra [10:54:11] Question concerning puppet: If I remove a puppet role from an instance, is puppet now disabled for this role, or will puppet delete the directory where the files are? [10:56:50] Luke081515: just disabled.
[10:57:27] ok, that's good, so I can stop puppet from overwriting custom files [10:57:53] Thanks [11:09:34] Can someone help me? After a reboot I can't access my instance. Permission denied (publickey,keyboard-interactive). [11:20:31] Luke081515: try adding your pubkey to your project hiera in this way: https://wikitech.wikimedia.org/wiki/Hiera:Tools [11:20:48] Luke081515: afterwards, after a puppet run (which can take another ~20 mins or so), you should be able to log in as root [11:27:18] valhallasw`cloud: Thanks [11:32:24] Is anyone around that can allocate me a single IP for a labs project? Or should I file a ticket? :) [11:33:34] 6Labs, 10Attribution-Generator, 6TCB-Team: Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Addshore) 3NEW [11:33:58] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1868939 (10Addshore) [11:34:04] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1864901 (10Addshore) Many thanks! [11:34:27] 6Labs, 10Attribution-Generator, 6TCB-Team: Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868942 (10Addshore) [11:46:39] 6Labs, 10MediaWiki-extensions-Newsletter: Internal error when creating new user in newsletter-test.wmflabs.org - https://phabricator.wikimedia.org/T119945#1868966 (10Qgil) a:301tonythomas [11:56:14] valhallasw`cloud: I'm not a linux expert, can you show me how to ssh with root permissions? [11:56:59] if someone gets a message that I tried to sudo at bastion-01, that was my previous try; it failed :-/. Wrong approach [12:01:43] Luke081515: why are you sshing from bastion-01? [12:02:01] Luke081515: anyway, ssh root@your-host-name [12:02:20] addshore: why is novaproxy not good enough?
[12:04:09] valhallasw`cloud: luke081515@bastion-01:~$ ssh root@rcm-2 => Permission denied (publickey,keyboard-interactive). :-/ [12:04:36] Luke081515: and does that actually use the relevant private key? [12:04:40] use -vv [12:11:57] debug1: key_load_public: No such file or directory [12:14:33] Luke081515: sorry, I don't have the time at the moment to walk you through all the steps. In that debug output, there should be a list of keys tried. Compare that with the keys you expect to be tried (namely, those in your key agent). Is the forwarding working? Are the right keys loaded? Etc. [12:15:03] ok [12:28:46] hey ! can someone take a look at why I am getting this error : https://phabricator.wikimedia.org/T120516#1869021 [12:35:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:34] valhallasw`cloud: I can't point a custom domain at that can I? [12:37:27] addshore: ahhh. Not by default, I think, no. [12:40:52] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955843 bytes in 8.357 second response time [12:46:54] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:47:00] !log tools broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01. [12:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [12:51:54] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955787 bytes in 9.331 second response time [12:53:46] valhallasw`cloud: any idea why this one is happening up ? [12:53:50] https://phabricator.wikimedia.org/T120516#1869021 [12:54:00] the instance id changed back to newsletter-test though [12:54:22] tonythomas: no clue, but I remember some issues with LXC not being detected. Try a reboot? 
:-p [12:54:32] just rebooted now, though [12:54:58] /srv/lxc is empty here [13:29:34] tonythomas: try passing --provider=lxc, as the error suggests? :/ [13:33:48] trying that one now [13:34:43] The provider 'lxc' could not be found, but was requested to [13:34:43] back the machine 'default'. Please use a provider that exists. [13:34:44] mwvagrant@newsletter-test:/srv/mediawiki-vagrant$ [13:35:17] valhallasw`cloud: I tried to solve that issue, but the debug output says that only TCP is responding, and the other components are not :(. Can someone take a look at that? [13:35:54] Luke081515: I don't understand. [13:37:05] valhallasw`cloud: Here is the relevant output: https://phabricator.wikimedia.org/F3064458 A part of the instance looks like it is not working [13:37:08] tonythomas: iirc bd808 had seen the issue before, but I'm not sure what the solution was :/ [13:37:28] valhallasw`cloud: yeah - https://wikitech.wikimedia.org/wiki/Help_talk:MediaWiki-Vagrant_in_Labs [13:38:07] Luke081515: right, so agent forwarding is working, but your key is being denied. Are you sure you are using the right key and username? [13:38:42] Yeah, I'm using this key and this username for all my instances at that project, and I can still ssh to other instances at my project with this setting [13:47:48] Can a labs admin take a look at this instance? [13:47:53] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:53] ... odd; I'm not seeing a stall at all. [13:48:59] That short? [13:49:10] Luke081515: I'll take a peek, lemme catch up on scrollback. [13:50:58] Luke081515: What project is rcm-2 in? [13:51:01] Coren: I made a change at the srv directory and restarted it. Now I can't log in, the instance doesn't accept the key: https://phabricator.wikimedia.org/F3064458 [13:52:29] Luke081515: What project is this instance in?
[13:52:45] Coren: Project rcm [13:52:46] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 955810 bytes in 3.133 second response time [13:52:57] Alright, lemme take a look. [13:53:05] Thanks [13:53:38] Hrm. [13:53:51] Even my labswide root key isn't accepted. [13:54:22] Do you recall what the exact change you did was? [13:57:19] Coren: I moved srv/phab to srv/phab.old, and then I did this: https://phabricator.wikimedia.org/T117441#1842484 [13:57:53] I did the same yesterday at rcm-4 and couldn't log in, and wanted to do this today, but now it works at rcm-4 again, and I don't know why [13:58:08] I see absolutely no reason why that should affect auth or login at all. [13:58:17] * Coren boggles a bit. [13:58:27] Does this project use a self-hosted puppetmaster? [13:58:55] Coren: No, it uses just ::role::phabricator::labs. But I got this one time: AH00112: Warning: DocumentRoot [/srv/phab/phabricator/webroot] does not exist [13:59:07] But it seems like puppet recreated this file [13:59:21] Hm, and indeed I have no issue logging into rcm-4 [13:59:24] Now I get a fatal error at the web domain, but no 403 like before [13:59:45] That warning comes from Apache, also shouldn't affect auth. [14:00:12] Mind if I reboot rcm-2? [14:00:26] I've already done this, but I can do it again, wait a moment [14:01:19] it's rebooting now [14:04:11] Coren: Reboot finished [14:04:35] * Coren attempts to figure out what's up. [14:05:54] Hm. Still no luck. You're telling me that rcm-4 did the same thing yesterday? Do you remember what time(s)? I might be able to figure out what's going on by looking at /its/ logs. [14:09:06] Yesterday, uhm, that was between 21, or 22 UTC and 23 UTC. Then I tried it again at about 9 or 10 UTC, and then it worked [14:09:38] Ah, excellent - that gives me a good idea. [14:09:42] * Coren digs into logs.
[14:17:42] Luke081515: That's really odd - I see you connecting on Dec 9 17:53; doing the changes like you said via sudo, then no attempts at all until you logged in on Dec 10 10:59. Not even failed attempts. [14:18:48] hm, but I got Permission denied (publickey,keyboard-interactive). there too. And we webhost behave same like at rcm-2, first 403, than at lest a successful answer [14:18:55] *the [14:19:07] *at least [14:19:23] Wait, you're getting permissions denied on rcm-4 *now*? [14:21:29] coren: No, at the moment not [14:21:37] at the moment I can log in [14:22:07] but I got this yesterday [14:22:32] I'm looking at the auth log on rcm-4 and I see no connection attempts at all between your successful ones. [14:22:37] * Coren is confused. [14:31:48] Or wait, I don't have a log from yesterday of my console. One thing was the same: the 403 at the webspace. But I'm not sure if the login failed [14:31:56] maybe I mixed something up [14:32:05] But I cloned the same data to rcm-4 [14:32:26] That's almost certainly unrelated; apache and ssh share no common configuration. [14:32:31] Hm. [14:34:47] but otherwise that means that the auth at rcm-2 broke with the first reboot after the change at the srv directory [14:44:44] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869216 (10valhallasw) 3NEW [14:49:14] Coren: Do you have access to the data at rcm-2 without logging in? Otherwise maybe we can solve the problem by moving the data to a new instance (but I need the data from there first, the last backup was a few weeks ago) [14:51:10] Luke081515: Possibly, but it's a very complicated operation at best. I'm working on a different labs-wide problem atm but I should be able to give your issue more attention shortly?
[14:51:43] ok, that's good [15:03:29] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869262 (10valhallasw) Sinkholed with .lighttd.conf ``` url.rewrite-once = ( "^/[^?]*(\?(.*))?$" => "/index.php?$1&full_request=$0" ) ``` and ``` 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869275 (1001tonythomas) [15:50:52] 6Labs, 10Attribution-Generator, 6TCB-Team: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1869513 (10Tobi_WMDE_SW) p:5Triage>3High [15:51:24] 6Labs, 10Attribution-Generator, 6TCB-Team, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Tobi_WMDE_SW) [16:07:10] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869594 (10bd808) > No usable default provider could be found for your system. This usually means that the shell alias of `vagrant` has not been setup for the current shell. `... [16:16:54] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869625 (10bd808) Added a note about this issue on wikitech: https://wikitech.wikimedia.org/w/index.php?title=Help:MediaWiki-Vagrant_in_Labs&diff=224409&oldid=224404 [16:43:52] bd808: I got another confusing thing with mediawiki-vagrant [16:44:16] super happy fun time. 
Lay it on me [16:44:40] (16:42) root@localhost:[wiki]> SHOW COLUMNS; [16:44:40] ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 [16:44:49] normally that command works [16:45:45] valhallasw`cloud: can you explain your last comment on https://phabricator.wikimedia.org/T121104#1869262 ? [16:45:57] I can’t tell if you’re describing a problem or a solution [16:45:57] Luke081515: really? what would it be showing you columns from? [16:46:01] andrewbogott: problem [16:46:31] ok :( [16:46:34] andrewbogott: https://tools.wmflabs.org/weather/ was deleted, so returned a 404, but that javascript then kept retrying with ~10 req/sec [16:46:40] Since that script is on a wiki can we just fix it? [16:46:51] Luke081515: SHOW COLUMNS FROM tablefoo; seems more likely to work [16:47:00] Luke081515, SHOW COLUMNS is not used much [16:47:06] but that is the syntax [16:47:12] DESC is shorter [16:47:20] andrewbogott: yes, that is also an option, but I don't have rights for that ;-) [16:47:21] bd808: Sorry, I had a kind of error in mind...^^ [16:47:21] DESC table; [16:47:44] I want to use SHOW Tables, but.... sorry [16:48:02] or if you want something more standard [16:48:05] Luke081515: no worries. [16:48:08] I'm a bit ill at the moment.. so [16:48:18] valhallasw`cloud: I was thinking that as a wiki page maybe ‘anyone can edit' [16:48:22] SELECT * FROM information_schema.columns WHERE table_name= ''; [16:48:30] andrewbogott: gadget :-) [16:48:30] but I can’t tell which farsi tab is ‘edit' [16:48:35] https://fa.wikinews.org/w/index.php?title=%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Gadget-Weather.js&action=edit [16:48:39] jynus: Thanks too [16:48:42] aka mediawiki namespace, so sysop only [16:48:48] ah [16:48:58] Coren: do you have the needed privs? [16:49:07] andrewbogott: try &uselang=en?
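For reference, the working MySQL forms from the exchange above are `SHOW COLUMNS FROM tbl;`, `DESC tbl;`, and the standard `information_schema.columns` query; bare `SHOW COLUMNS;` is the 1064 error because the table name is mandatory. The same column inspection can be sketched with Python's stdlib sqlite3 (used here only because it needs no server; against MySQL you would run one of the statements above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")

# sqlite's rough equivalent of MySQL's SHOW COLUMNS FROM page;
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
cols = [row[1] for row in conn.execute("PRAGMA table_info(page)")]
print(cols)  # ['page_id', 'page_title']
```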
;) [16:49:22] I'm actually surprised this did not kill toollabs earlier -- this script has been running for half a year or so, and we never spotted the high load on the proxy [16:49:34] andrewbogott: I do, I think. +staff is global. [16:49:36] oh, but that's because it /worked/ earlier, so it didn't retry [16:49:42] * Coren tries to catch up on scrollback. [16:49:47] valhallasw`cloud: we have actually had instability there honestly [16:49:55] Coren: maybe perform a bit of surgery on that script? Make it sleep before retry, or something? [16:49:59] I recall responding to what looked like almost a ddos a month or so ago [16:50:00] andrewbogott: mouse over the links and look for the one that targets ?action=edit [16:50:08] or just kill the script altogether [16:50:16] andrewbogott: or just add ?action=edit to the url [16:50:16] it needs tools.wmflabs.org/weather but that tool doesn't exist anymore [16:50:18] bd808: yeah, I found it. Mostly I was just amused at the fancy farsi script [16:50:22] I think mjbmr left the WMF space? [16:51:11] is there an etiquette for just turning it off? [16:51:13] toys like this that try to turn MediaWiki into a desktop or emacs make me sad [16:51:52] andrewbogott: Retry? I'm pretty sure any failure is an issue. [16:52:48] I think https://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Closure_of_Persian_Wikinews might have something to do with this [16:53:26] Coren: I haven’t even found the part of the script that hits labs, I’m just taking valhalla’s word for this [16:54:19] andrewbogott: line 183 [16:54:57] url: '//tools.wmflabs.org/weather/api/v1', [16:56:07] post? [16:57:15] well, anyway, since that tool doesn’t exist anymore… maybe removing the reference to the script is better [16:57:25] or deleting the script entirely; it can’t be doing anything useful can it? [16:57:29] nope [16:59:28] so... [16:59:41] I’m back to thinking that maybe Coren can take a drastic step?
[17:00:41] I might - I'll be back soon after some desperately needed food and look into it? [17:01:15] I’m sure I could do it too if I re-log in as staff account. But I suspect Coren is more versed in the politics of such a move. [17:01:17] Coren, sounds good, thank you [17:01:31] we could also deny the url in the proxy [17:01:34] uselang=en [17:01:42] valhallasw`cloud: now that that page is essentially a no-op, it’s not really imposing a load on us is it? [17:02:07] Yeah, I see the ajax call. I could remove it entirely, [17:02:08] Correct [17:02:29] ok, so maybe we don’t care. I must’ve misunderstood earlier and thought this was still causing load [17:02:37] you can also ask matmarex to edit [17:02:44] since he has the .js chops [17:02:44] Until the webservice dies, or something like that, then the issue will return [17:03:13] I think Krinkle has a global right to fix crappy js too [17:03:49] I think the minute the weather tool goes dark this is back to causing issues iiuc [17:04:35] ok — then we should definitely change and/or murder the script [17:04:42] tools.wmflabs.org/weather seems to still exist though [17:04:47] it doesnt' say 404 [17:04:56] Krinkle: yes, because 404 also causes the script to retry [17:05:01] I added that page as a stopgap measure [17:05:12] ^this and thanks valhallasw`cloud [17:05:17] You're saying the tool was removed but someone here added the non-404? [17:05:27] OK [17:06:39] !log tools revoked mjmbr membership of 'weather' tool; replaced with tools.admin [17:06:45] I've disabled the gadget [17:06:49] Krinkle: thanks! [17:07:03] I'll post a message on their village pump [17:17:36] andrewbogott: Did you just get paged? [17:18:23] 6Labs, 10Tool-Labs, 10Tool-Labs-tools-Other: tools.weather gadget causing high load on tool labs - https://phabricator.wikimedia.org/T121104#1869853 (10chasemp) 5Open>3Resolved a:3chasemp @krinkle disabled the gadget after discussions on IRC and @valhallasw is posting a message to the village pump htt... 
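The root cause chasemp and valhallasw describe above is a client that retries immediately and forever (roughly 10 req/s once the tool started returning 404). The usual fix, had the gadget been patched rather than disabled, is capped exponential backoff between retries; a sketch with illustrative numbers, not taken from the actual Gadget-Weather.js:

```python
def backoff_delays(attempts, base=1.0, factor=2.0, cap=300.0):
    """Seconds to sleep before each retry: 1, 2, 4, ... capped at 5 min,
    instead of hammering the proxy at a fixed 10 req/s forever."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

After a dozen failures such a client settles at one request per five minutes instead of several hundred per minute, which is the difference between a dead tool being invisible and it looking like a DDoS on the proxy.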
[17:18:58] Coren: for what? [17:23:01] chasemp: It's really odd. webservice start and stop fail, but everything is happy atm afaict [17:23:14] valhallasw`cloud: did you have any time today to talk about that changeset for diamond deps? [17:23:19] Coren: I didn't get anything [17:23:47] chasemp: uuuh, I haven't looked at it yet. [17:24:28] valhallasw`cloud: ok no worries, idk how you feel about me changing it and merging, and I only ask as I'm bootstrapping hardware nodes and running into it [17:24:40] chasemp: please go ahead [17:24:44] ok thanks [17:26:45] chasemp: maybe the entire class should just have a require_package('python-diamond'), so that it's installed before anything else in that class runs [17:27:11] that may be simplest honestly [17:27:46] chasemp: aaaand, I got the improved. Odd that you didn't get anything - maybe those are old and for some reason got delayed for me? [17:28:08] if this is from shinken I think I don't get them [17:28:13] I only get the main icinga ones atm [17:28:13] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869880 (1001tonythomas) It did work, but now I get this. ``` mwvagrant@newsletter-test:/srv/mediawiki-vagrant$ vagrant up Bringing machine 'default' up with 'lxc' provider.... [17:29:32] Coren: no page [17:34:42] Huh. Odd. [17:35:13] Coren: you’re going to write ldap perf monitors, right? [17:35:26] I got the matching catchpoint email, mind you, so it's not an old page. [17:35:46] andrewbogott: I got an alarm written up already at https://gerrit.wikimedia.org/r/#/c/258168/ that could do with a review [17:35:52] oh, if it’s catchpoint it might just be that I’m not subscribed [17:36:00] I think chasemp was working on a graphite collector? [17:36:28] Coren: https://gerrit.wikimedia.org/r/#/c/258152/ back at you [17:37:06] 'retry_check_interval => 0'? Does that work?
[17:40:23] 6Labs, 10MediaWiki-extensions-Newsletter: Create a larger newsletter-test instance in labs - https://phabricator.wikimedia.org/T120516#1869892 (10bd808) I've seen that one too. As I recall it means that there are files in /srv/mediawiki-vagrant and/or /srv/vagrant-data that are not readable by the `mwvagrant`... [17:41:20] Coren: I don’t know. In theory it should reschedule immediately [17:41:30] the docs are unclear [17:41:47] I think it will too, from my reading. I just note that I don't think we're using this anywhere else atm. [17:42:40] how do I set up a proxy for my instance now? Seems to have disappeared from the left menu - I seem to remember YuviPanda saying it was moving somewhere [17:43:13] It should be under the "Labs Projectadmins" section of the left nav as "Manage Web Proxies" [17:43:29] jdlrobson: or just go directly to https://wikitech.wikimedia.org/wiki/Special:NovaProxy [17:43:54] yeh not showing up there bd808 [17:43:55] weird [17:46:32] Coren: Still to do, or do you have time to look again at my instance? (Just a question, I don't want to pester) [17:48:53] Luke081515: Still to do - like I said, I doubt I'll be able to recover it given the state it is in and my inability to log in at all. [17:49:52] ok [17:50:06] Luke081515: But I'm pretty sure it's not what you did that was the issue. [17:50:35] Luke081515: We switched ldap servers earlier this week and - unless your puppet worked right - that might have been the issue. [17:51:13] ... although that shouldn't have prevented root logins too. [17:53:13] So I'm not sure what happened to it except that I don't seem to be able to go figure out what happened to it. [17:53:29] Coren: Should I create a task for it, or not? [17:53:39] Luke081515: Better if you do, regardless [17:54:14] andrewbogott may be able to tell you how plausible getting files off a dead instance is, but you'll have to get a well-deserved lecture on how it should have been puppetized.
:-) [17:54:55] It’s possible but very annoying [17:54:57] what instance? [17:55:49] rcm-3.rcm.eqiad.wmflabs [17:55:59] No logins work; including roots. [17:56:18] 6Labs, 10Labs-Infrastructure: Login into rcm-2 broken - https://phabricator.wikimedia.org/T121123#1869962 (10Luke081515) 3NEW [17:56:29] Err, -2. [17:56:30] Wait! [17:56:37] *NOW* I can log in with root. [17:56:42] oh [17:56:46] dafu? [17:57:01] * Coren makes sure puppet and ldap are up to date. [17:57:09] Coren: I can log in too [17:57:13] Great [17:57:19] Holey sheets. [17:57:26] That's how a task helps :D [17:57:29] That puppet manifest does /not/ apply cleanly. [17:58:06] Luke081515: 10:1 that's the source of your worries. Make sure puppet runs cleanly. :-) [17:58:46] Ah, ok, I guess it's the phabricator role, because I changed the directory where puppet operates? [17:58:58] So I have to disable this role in this case? [17:59:35] 6Labs, 10Labs-Infrastructure: Login into rcm-2 broken - https://phabricator.wikimedia.org/T121123#1869979 (10Luke081515) 5Open>3Resolved a:3Luke081515 After 8 hours of failed logins it works now :). [17:59:46] quickly solved task ;) [18:01:16] Luke081515: That may well be it. If you run a 'puppet agent -tv' as root, you'll see the errors. [18:01:27] ok, thanks [18:02:05] * Luke081515 disables this role now [18:03:57] Coren: If we get this another time, I think I know why this was fixed: I used the "hard reboot" function at horizon, maybe that's why I can log in again [18:07:30] That doesn't seem very likely to me, but I suppose it's possible. [18:07:56] hard reboot shouldn't really change anything within the instance. [18:21:07] Coren: my root login doesn’t work [18:35:37] !log rcm rcm-2: Update to current phabricator version successful [18:35:50] morebots? [18:36:44] 13:57 -!- labs-morebots [~labs-more@208.80.155.228] has quit [Ping timeout: 260 seconds] [18:36:49] can someone restart him? [18:39:54] andrewbogott: On what instance? rcm-2?
[18:40:10] rcm-3.rcm.eqiad.wmflabs [18:40:20] I mistyped. We were talking about rcm-2 [18:40:22] :-) [18:40:37] It seems to have self-healed anyways, and now puppet runs cleanly I think. [18:52:58] great [18:53:27] btw, Coren, do you know offhand how/where a labs instance decides what project it’s in when a user logs in? [18:55:28] Not sure what you mean by 'when a user logs in'? [18:56:29] There is /etc/wmflabs-project that contains the project name. [18:56:33] Put there by puppet. [19:02:37] ah, right, that [19:02:43] hm, that should work then [19:11:15] andrewbogott: So I got the getent speed check running now - it's set to be fairly easy to trigger, but I didn't make it critical yet (so no paging) [19:11:29] great! [19:11:45] * YuviPanda waves vaguely [19:11:49] lots of backscroll! [19:11:53] what server should I look at in icinga to see that? [19:12:22] andrewbogott: Either labstores. I use labstore1001 [19:12:41] ok [19:13:24] * andrewbogott sees it [19:16:09] ldap is being quite solicited by the nfs server; I'm glad it does some fairly aggressive caching as well. [19:18:35] I wonder if the load it puts on slapd is somehow bigger than that which it put on opendj? [19:19:01] * Coren isn't sure how expensive grabbing the list of uniqueMembers of a group might be. [19:19:28] More specifically, how expensive getting the list of groups a specific cn is a uniqueMember of. [19:21:55] YuviPanda: can you help me out with pam on promethium? iirc you worked on login auth recently [19:22:02] (or was that Coren?) [19:22:12] That was me. [19:22:28] yup, that was Coren [19:22:34] What's your issue? Nothing I did should have touched !labs? [19:22:36] I have managed to not touch PAM at all :) [19:23:03] You should be able to log into promethium as root with your labs root key [19:23:21] oh right, that's the metal host. [19:23:21] once there, have a look around and see if you can explain why it says 'invalid user andrew’?
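Per Coren above, an instance learns its project from /etc/wmflabs-project, written by puppet; the $::labsproject fact from the earlier promethium trouble is derived the same way, which is why a box puppet never fully configured can't answer the question. A defensive reader for that marker file, with the path parameterized so the sketch runs anywhere (hypothetical helper, not an existing module):

```python
def read_labs_project(path="/etc/wmflabs-project"):
    """Return the project name from the marker file, or None when the
    file is absent or empty (e.g. bare metal puppet never touched)."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None
```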
[19:23:26] yeah, right [19:23:31] should be running the same auth as a labs vm [19:24:23] andrewbogott: Well, the first thing I see is that you don't have the ldap client settings in. [19:24:42] which file is missing? [19:24:49] * Coren looks. [19:25:00] and/or incorrect? [19:25:44] maybe something that comes from the base labs image isn’t puppetized but needs to be [19:26:14] Possibly. I'm looking at things now; the actual ldap config is there, trying to find why nss doesn't seem to be. [19:27:22] Most things seem in place. Hmmm. [19:27:25] * Coren digs deeper. [19:29:56] Aha. The issue is in nslcd [19:30:06] Dec 10 19:28:14 promethium nslcd[23024]: [7fa04c] ldap_start_tls_s() failed (uri=ldap://ldap-labs.codfw.wikimedia.org:389): Connect error: (unknown error code) [19:30:28] Wait, codfw. [19:30:40] Why does it then say "no available LDAP server found: Server is unavailable" [19:30:40] Coren: grabbing the list of uniqueMembers of a group should be really cheap, it's directly read as attributes from the DN [19:31:04] Dec 10 19:26:59 promethium nslcd[23024]: [b57ed4] ldap_start_tls_s() failed (uri=ldap://ldap-labs.eqiad.wikimedia.org:389): Connect error: (unknown error code) [19:32:00] andrewbogott: A restart of nslcd fixed it. [19:32:09] andrewbogott: I'm guessing it got started before its whole config was in place. [19:32:14] Coren: ok... [19:32:17] hm [19:32:38] I can log in as me now. [19:32:50] andrewbogott: Also, yow! You have sooo many groooups! [19:32:59] yeah [19:33:19] that fact breaks Horizon for me — more projects than any OpenStack dev has ever seen [19:33:20] You're certainly giving the supplemental group list a good workout. :-) [19:33:37] But, anyway--- [19:33:39] thanks for fixing! [19:33:50] Now that that box has labs puppet and labs logins, I’m going to declare victory and go to the beach [19:33:59] No worries. I wish I knew for sure why nslcd was ill. Try to reboot to make sure it comes back up right? 
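On Coren's cost question a bit earlier: listing the uniqueMembers of one group is a single entry read, but "all groups this cn is a uniqueMember of" is the reverse lookup, which is only cheap if the directory indexes the member attribute; naively it scans every group. The difference in miniature, with plain dicts standing in for LDAP entries:

```python
def groups_of(member, groups):
    """Naive reverse lookup: scan every group's member list, O(groups)."""
    return sorted(g for g, members in groups.items() if member in members)

def build_index(groups):
    """Invert the mapping once; afterwards each lookup is a dict hit,
    roughly what an index on uniqueMember buys the directory server."""
    index = {}
    for g, members in groups.items():
        for m in members:
            index.setdefault(m, set()).add(g)
    return index
```

This is also why andrewbogott's unusually long supplemental group list is a worst case for the scan-everything path.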
[19:34:05] bd808: probb on a labs instance running vagrant/mediawiki what's the easiest way to expose a node server running on another port that's web accessible [19:34:21] andrewbogott: U can haz victory! [19:34:49] Coren: yeah, I’ll reboot. I also need to rebuild this instance from 0 later [19:35:47] !log mailman deleting all instances — they’re broken and no one is around to care [19:36:58] !log mailman deleting all instances — they’re broken and no one is around to care [19:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mailman/SAL, dummy [19:37:21] Coren: can I get another link to that ‘broken by ldap’ etherpad? [19:38:02] https://etherpad.wikimedia.org/p/remaining-ldap [19:38:12] bottom red? [19:38:33] That's my list from tcpdump as of last evening. [19:39:05] right k [19:39:21] chasemp: that’s it, thanks. [19:39:24] Just deleted three more [19:39:55] andrewbogott: Coren btw I have something tomorrow afternoon [19:40:02] can we push the meeting into the later day [19:40:07] I know andrewbogott you said late day is ok for now? [19:40:16] yeah, that’s fine [19:40:43] maybe 3 central? [19:40:53] Coren: what tz are you? [19:43:13] chasemp: EST [19:43:35] 3 central is 4 here which is good enough. [19:43:39] 4 your time tomorrow then? [19:43:39] ok [19:43:42] done and done [19:53:45] 6Labs, 10Tool-Labs, 5Patch-For-Review: deploy package_builder on tool labs - https://phabricator.wikimedia.org/T111730#1870490 (10valhallasw) 5Open>3Resolved a:3valhallasw Now running on tools-packages (which is a jessie host, which means everything just works(TM)) [19:54:10] jdlrobson: The `vagrant forward-port $HOST_PORT $VM_PORT` command should let you do that [19:55:28] then you will have to setup NovaProxy to forward something in to that port on your host vm [20:04:09] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870527 (10valhallasw) Ok, so it's slightly more complicated. 
There is a trusty package in apt.wm.o, but it's outdated and currently no-one is responsible for it. I discussed the situation with #moritzm... [20:06:53] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870535 (10MoritzMuehlenhoff) Ah, that's another gotcha I forgot to mention about the openjdk packages: Not all tests of the test suite complete :-/ What I do for the updates in Debian is to compare th... [20:07:19] valhallasw`cloud: rule of thumb about openjdk: it's usually worse than anticipated :-) [20:07:33] :D [20:08:12] so... the best comparison would be the corresponding sid build, I suppose? [20:09:20] or maybe the jessie-backports log file (which would be even closer in terms of packages), let me see whether that's available [20:09:38] there's also a ppa which could be used as comparison [20:09:53] but then again, that one might just also be broken ;-) [20:09:55] it is :-) https://buildd.debian.org/status/package.php?p=openjdk-8&suite=jessie-backports [20:10:15] the jessie-backport package is widely in use, all the cassandra clusters use it [20:10:32] ah, good [20:10:38] we made the backport since the garbage collector in openjdk-7 tends to freeze/lockup under high load [20:10:48] and the one from openjdk-8 works fine [20:11:14] so, if the test result doesn't differ gravely from the logs for jessie-backports we should be fine [20:11:31] I love how https://buildd.debian.org/status/logs.php?pkg=openjdk-8&arch=amd64 is 'Maybe-Successful' [20:12:06] the jessie one in apt.wm.o is just the debian backport? [20:12:37] ah, wait, but that's sid [20:13:44] I never understood why some builds are Maybe-Successful... [20:23:11] ok, so it's worse than the i386 one [20:26:55] moritzm: I'm really confused why there is no amd64 build log [20:27:46] I can of course just build for jessie and then compare logs, though...
[20:29:08] ah, but there's a wily build log available [20:29:37] the amd64 build is usually done by the person who uploaded the package (and that log is only on that person's machine), buildd.debian.org only lists the automatically-built packages for other archs [20:29:48] and most Debian devs use amd64 on their machines [20:30:22] ah, that explains [20:30:35] ok, so compared to https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+build/8156693 it's significantly worse [20:31:05] 550 more tests failing... [20:32:00] hmm, that's quite a number. all across the board of the various test suite sections or are there hotspots where the difference is especially big? [20:33:46] compiler, gc, runtime, tools. Seems all over the place :/ [20:35:42] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870635 (10valhallasw) I compared the log file with the result from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa/+build/8156693, as well as the i386 build from https://buildd.debian.org/status/l... [20:35:45] hmm, doesn't sound very well. maybe there's a reason why there are no backported packages for trusty in Ubuntu yet :-/ [20:39:49] moritzm: Right. https://code.launchpad.net/~malte.swart/+recipe/openjdk-8-backport is basically as bad [20:41:54] 6Labs, 10Tool-Labs, 5Patch-For-Review: Update Java 7 to Java 8 - https://phabricator.wikimedia.org/T121020#1870646 (10valhallasw) https://code.launchpad.net/~malte.swart/+recipe/openjdk-8-backport is basically as bad, so I'm actually not that hopeful that this is going to be possible anymore... [20:42:00] andrewbogott: Did you ever do that ldap check you talked about? Otherwise, ima do it now. [20:42:45] valhallasw`cloud: yeah, since the jessie version is working fine, maybe users with a need for java 8 can hop on the kubernetes train? 
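The comparison moritzm describes (the trusty build against the jessie-backports and PPA logs) boils down to diffing two sets of failing test names. A rough sketch, assuming jtreg-style logs where failures appear on lines beginning with `FAILED:` (the exact log format is an assumption here and may differ):

```python
def failed_tests(log_text):
    """Collect test names from lines like 'FAILED: compiler/foo/Bar.java'."""
    return {
        line.split("FAILED:", 1)[1].strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith("FAILED:")
    }

def regression_report(our_log, reference_log):
    """Return (tests failing only in our build, tests failing only in the reference)."""
    ours, ref = failed_tests(our_log), failed_tests(reference_log)
    return sorted(ours - ref), sorted(ref - ours)
```

The first list is the interesting one: roughly 550 extra failures spread across compiler, gc, runtime, and tools is what made the trusty backport look hopeless here.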
[20:43:18] *nod*, although I do think having a less ancient java would make sense [20:44:00] I'm not sure how well suited k8s is for developing /inside/ a pod, for example [20:45:24] only one way to find out :-) [20:45:55] alternatively, copy your build somewhere and let people who're interested in java try it out on their own [20:46:10] then we can check whether it's a faulty test suite or generally broken on trusty [20:47:18] that's actually a great idea in general [20:47:35] having a tools-testing-bastion where we install newer packages so people can test them before they are deployed on the rest of the systems [20:50:42] 6Labs, 10Tool-Labs: Provide a tools-staging-bastion to allow users to test newly built packages - https://phabricator.wikimedia.org/T121146#1870664 (10valhallasw) 3NEW [20:53:12] I tired again today and new instances in the mediawiki-vagrant project are still having the ssh issue I described in https://phabricator.wikimedia.org/T121064 [20:53:25] *tried [20:55:58] andrewbogott: Did you ever do that ldap check you talked about? Otherwise, ima do it now. [21:02:53] hi bd808 [21:03:34] I can get in as root [21:03:36] looking [21:03:53] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870726 (10yuvipanda) I can get in as root, am looking [21:04:36] bd808: hmm, try now? [21:05:18] YuviPanda: nope. still ending with "Connection closed by UNKNOWN" [21:05:26] which is a weird error [21:05:31] right [21:05:32] so it's PAM [21:05:43] oooh [21:05:45] not PAM [21:05:52] it's PAM [21:05:54] Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied for user bd808 by PAM account configuration [preauth] [21:06:06] ok. ldap?
[21:06:10] Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied for user bd808 by PAM account configuration [preauth] [21:06:13] err [21:06:19] -:ALL EXCEPT (project-mediawiki-vagrant) root:ALL [21:06:23] that's right [21:06:25] bd808: no, not LDAP [21:06:35] bd808: since [21:06:36] root@mwv-image-builder:~# /usr/sbin/ssh-key-ldap-lookup bd808 [21:06:38] works [21:07:18] bd808: interesting [21:07:27] bd808: you aren't on the mediawiki-vagrant project [21:07:37] bd808: in LDAP, that is [21:08:30] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870730 (10yuvipanda) `root@mwv-image-builder:~# id bd808 | grep medi uid=3518(bd808) gid=500(wikidev) groups=1003(wmf),50062(proje... [21:08:50] YuviPanda: ah [21:09:51] bd808: is htis a new project? [21:09:53] *this [21:10:05] YuviPanda: yes. I just made it yesterday [21:10:20] andrewbogott: chasemp ^ new projects don't seem to get the associated LDAP group created [21:11:49] bd808: similar to turning it off and on, can you remove and re-add yourself? :D [21:12:00] sure [21:12:55] YuviPanda: no joy [21:15:04] 6Labs, 10Labs-Infrastructure: ssh access to instances in mediawiki-vagrant project failing with "Connection closed by UNKNOWN" response - https://phabricator.wikimedia.org/T121064#1870755 (10yuvipanda) It was a new project created yesterday. [21:15:11] now to see if all new projects have this problem [21:15:13] or just this [21:18:46] nope [21:18:47] I'm around now, YuviPanda verdict? [21:18:49] just created a new project [21:18:58] chasemp: new project creation doesn't create related LDAP group [21:19:11] this prevents people from being able to log into instances in new projects [21:19:13] I found a real bug! [21:20:08] want to make the task too?
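The pam_access line quoted above, `-:ALL EXCEPT (project-mediawiki-vagrant) root:ALL`, denies login to everyone, from any origin, except root and members of the project group, which is exactly why a user missing from the group is rejected at the PAM account phase even though the key lookup works. A deliberately simplified model of that single rule (real pam_access supports much richer syntax):

```python
def pam_access_allows(user, groups):
    """Model of '-:ALL EXCEPT (project-mediawiki-vagrant) root:ALL':
    deny all users from all origins unless the user is root or is a
    member of the parenthesised group. 'groups' is the user's group list."""
    return user == "root" or "project-mediawiki-vagrant" in groups
```

This matches what was observed: root could log in, bd808 could not until `project-mediawiki-vagrant` showed up in his group list.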
:) [21:20:21] I assume it will involve modifying some ldap acls but idk [21:20:25] https://phabricator.wikimedia.org/T121064 [21:21:10] I made one other project yesterday too... /me looks for project name [21:21:19] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870768 (10chasemp) p:5Triage>3High [21:21:20] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870771 (10yuvipanda) I created a new project (`dashiki`) and I didn't get added to a `project-dashiki` group [21:21:33] YuviPanda: what project did you just create to try? [21:21:41] chasemp: `dashiki` [21:22:32] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870775 (10bd808) The `lizenzhinweisgenerator` will probably have the same problem. I created it yesterday too (T120925). [21:23:24] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870778 (10chasemp) @yuvipanda made a `dashiki` to verify the same behavior. we see: > Dec 10 21:05:02 mwv-image-builder sshd[775]: fatal: Access denied... [21:24:23] YuviPanda: does /usr/sbin/ssh-key-ldap-lookup bd808 usually only work if the person is in the project in ldap that the VM is also in? [21:24:28] or should that always work? [21:24:32] (afaik it always works) [21:24:43] chasemp: no it always works [21:24:55] kk [21:25:11] 6Labs, 10Attribution-Generator, 6TCB-Team, 15User-bd808: Create labs projects for lizenzhinweisgenerator - https://phabricator.wikimedia.org/T120925#1870784 (10bd808) @addshore Your project will probably have the issue described at {T121064} so follow that bug for progress on the fix. 
[21:26:19] akosiaris: or moritzm: are you still around? [21:27:44] YuviPanda: bd808 how does that work on wikitech if you are not in the ldap project, should you still be able to manage VMs for the project etc [21:27:54] i.e. if not in the new project how does it work to create a VM for it? [21:28:05] chasemp: I have cloudadmin rights [21:28:11] ok [21:28:58] 6Labs, 10Labs-Infrastructure: new projects do not allow ssh as users do not get joined to the project in ldap - https://phabricator.wikimedia.org/T121064#1870810 (10chasemp) @bd808 is a cloudadmin so I imagine that is how he was able to create a VM in the project in the first place? [21:31:19] I would prefer not to go mucking about w/ openldap acls in all their arcane-ness atm, YuviPanda how do you feel about saying no new projects until this is figured out and that's not right this moment? [21:31:32] yeah, I'm ok with that [21:31:49] bd808: are you ok with putting this instance in a different project for now? [21:31:52] I ccd alex and moritz and pm'd moritz as a heads up to take a look [21:32:06] we can also hand-hack the LDAP entry for bd808's entry [21:32:24] YuviPanda: I can totally wait. The whole point was to replace another host where ldap/puppet is hosed [21:32:31] I was thinking about that too, I have no objection really it should be as simple as a normal group add yes?
[21:33:04] need to create groups too [21:34:46] oh I'm confused does the group get created and users are not added to it [21:34:51] or does the group fail to get created entirely [21:34:54] I just checked [21:35:01] getent group | grep gave me nothing [21:35:03] so I suppose the latter [21:35:20] assuming getent group shows me empty groups [21:35:31] well empty groups are not allowed I believe [21:35:43] so I wonder if it fails to create the empty group and then because that's failed [21:35:47] the user cannot join [21:35:54] yeah maybe opendj allows that [21:35:58] and openldap does not [21:36:07] speculation but yes opendj and openldap do differ in their handling of empty groups [21:36:17] I would not be surprised if this is related [21:36:23] and maybe fundamentally broken atm? [21:36:39] yeah [21:36:49] that could break new tool creation too [21:36:56] since that's similar [21:37:16] checking [21:38:12] Coren: btw, NFS is pretty slow - sshing to any instance with NFS takes a good 5-6 s for first hit to return, blocking shell until it happens [21:38:34] chasemp: yes, new tool (service account creation) is also broken [21:39:41] YuviPanda: bastion-restricted-01 has the same behaviour afaict. [21:39:51] Bah. Now both are fast. [21:39:55] * Coren curses.
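The empty-group hypothesis above has a concrete basis: in the standard `groupOfNames` object class, `member` is a mandatory attribute, so a strict server rejects an add of a member-less group with an objectClassViolation, while a more lenient server may accept it. A toy validator illustrating the difference (illustrative only, not real server code):

```python
def validate_group_of_names(entry, strict=True):
    """Reject a groupOfNames entry with no members when strict (OpenLDAP-like
    behaviour); let it through when lenient (OpenDJ-like behaviour).
    'entry' is a dict mapping attribute names to value lists."""
    if strict and not entry.get("member"):
        raise ValueError("objectClassViolation: groupOfNames requires 'member'")
    return True
```

Under the OpenDJ-era behaviour the create-empty-then-add-members sequence worked; after the OpenLDAP migration the first step fails, so the members are never added and `getent group` finds nothing.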
[21:40:29] I was doing logins on tools-bastion in parallel with bastion-restricted-01 [21:40:41] YuviPanda: when you say it breaks similarly you mean the group is never created etc [21:41:19] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870847 (10chasemp) [21:42:06] chasemp: I read through the OSM code and am pretty sure it's creating an empty group first [21:42:15] chasemp: yes, the LDAP group is never created [21:42:23] could you maybe link to that code in diffusion on the task [21:42:25] since it's first creating an empty group and then adding people [21:42:27] yeah [21:42:30] thanks YuviPanda [21:42:46] > # TODO: If project group creation fails we need to be able to fail gracefully [21:42:48] lol [21:42:54] Gak. We expected that this would cause an issue with removing the last user from a group (which was not a problem), I don't think anyone remembered the reverse. [21:43:04] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1868319 (10chasemp) root@puppet-testing>getent group | grep -i Mediawiki-vagrant; echo $? 1 [21:43:13] YuviPanda: :D [21:43:33] # TODO: If project group creation fails we need to be able to fail gracefully [21:43:43] it is checking for success though [21:43:53] where does that failure bubble up I wonder? [21:44:10] $wgAuth->printDebug( "Failed to add project group $projectGroupName: " . ldap_error( $wgAuth->ldapconn ), NONSENSITIVE ); [21:44:33] to the useless debug log! 
[21:45:05] * bd808 curses the logging statements in most of MW [21:45:14] so what I don't know is, is the decision to disallow empty groups an openldap imperative or did we think it was more sane [21:45:33] and that determines whether we have to rewrite OSM portions or just do an openldap change [21:45:44] and I have really no backstory on any of the to-this-point reasoning [21:46:15] chasemp: My understanding is that disallowing empty groups is a standard-mandated behaviour. [21:46:48] sure, but the why, and can it be/should it be fixed [21:46:49] google tells me that it is specified in rfc2256 [21:46:50] chasemp: My suggestion was to always include the service group "user" within the group so that we never had an empty group even if there were no added users - that also needs OSM code changes. [21:47:17] chasemp: Either way, the code needs to create the group with an initial user; whatever that user is. [21:47:41] well not if we say it's more sane to allow it, if we can [21:47:45] rfc's be damned :) [21:48:14] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870868 (10yuvipanda) I looked at the source, it turns out that OSM first creates an empty group and then later on tries to... [21:48:25] Earwig, question on copyvios [21:48:30] go ahead [21:48:45] in index.mako, where does `query` come from? [21:49:01] is it automatically available to the template? I'm not seeing it imported anywhere [21:49:50] it's passed in as an argument to render_template [21:49:50] https://github.com/earwig/copyvios/blob/master/app.py#L106 [21:50:11] chasemp: does this look like a possible answer?
-- https://ask.openstack.org/en/question/7097/ldap-class-groupofnames-requires-attribute-member/ [21:50:34] there is keystone config there for sticking a dummy member into groups [21:50:40] 6Labs, 10Labs-Infrastructure: Groups for new projects are not created as groups in ldap (thus users cannot join them and ssh doesn't work, etc) - https://phabricator.wikimedia.org/T121064#1870876 (10MoritzMuehlenhoff) >>! In T121064#1870868, @yuvipanda wrote: > I looked at the source, it turns out that OSM fir... [21:51:36] Earwig: thanks! [21:51:41] no problem [22:00:34] bd808: It's the deal but it sounds like the correct fix is an OSM change [22:00:41] well, the acceptable fix [22:01:48] do we have a test location for OSM these days? [22:02:33] not as far as I know so that makes it complicated [22:02:51] there's one 'working' installation of OSM in the whole wide world [22:03:21] I'm wondering if we couldn't just make the creator a member by passing the 'member' key in the $projectGroup hash [22:03:32] dn I mean [22:04:23] yeah I think that's what we should do [22:18:06] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, 7Mobile: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870936 (10Dzahn) [22:24:59] chasemp: who gets to bell this cat? [22:25:46] well I have the beginnings of it but a) unsure if right and b) unsure best way to test here, give me a sec and I'll share...we can collaborate and then suffer as one? [22:26:24] chasemp: so I think the only way to test is to just cherry pick on the live install [22:26:34] chasemp: so gerrit push, cherry pick, check, revert. [22:27:01] chasemp: and Krenair has been the awesome go-to person for mw related things (fwiw) [22:27:35] it's not as simple as https://gerrit.wikimedia.org/r/#/c/258355/ is it?
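The fix being discussed, passing a `member` key (a DN) in the `$projectGroup` hash so the group is never created empty, can be outlined as follows. This is an illustrative Python sketch of the entry being built, not the actual OSM PHP; the DN layout is taken from the `ldaplist` output later in this log:

```python
def new_project_entry(project, creator_uid, base_dn="dc=wikimedia,dc=org"):
    """Build a project group entry with the creator as the initial member,
    so a strict groupOfNames schema check can never fail on an empty group."""
    return {
        "dn": "cn=%s,ou=projects,%s" % (project, base_dn),
        "objectClass": ["extensibleObject", "groupOfNames"],
        "cn": [project],
        "member": ["uid=%s,ou=people,%s" % (creator_uid, base_dn)],
    }
```

Passing the member list at create time (rather than adding members in a second modify) is the whole point: the add operation itself then satisfies the schema.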
[22:30:00] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, 7Mobile: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870974 (10Krenair) Well, wikitech has MobileFrontend installed: https://wikitech.wikimedia.org/wiki/?useformat=mobile That domain w... [22:30:40] 6Labs, 10Reading-Web, 6operations, 10wikitech.wikimedia.org: [Regression] Unable to browse certain wikitech.wikimedia.org urls from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1870977 (10Krenair) [22:30:49] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870980 (10Krenair) [22:32:23] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1855709 (10Krenair) See T87633 [22:32:37] 6Labs, 10Wikimedia-Extension-setup, 10wikitech.wikimedia.org, 7Mobile: Install MobileFrontend on wikitech - https://phabricator.wikimedia.org/T87633#995645 (10Krenair) [22:32:39] 6Labs, 10Labs-Infrastructure, 10Reading-Web, 6operations, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1870989 (10Krenair) [22:32:56] chasemp: could be [22:32:58] try [22:33:42] YuviPanda, chasemp: Context? [22:33:49] What are you trying to achieve? 
[22:34:01] Krenair: openldap no longer allows that ldap group to be created without an initial member [22:34:24] and opendj was in the wrong here so we are thinking add the creator as an initial member [22:36:30] it looks like it could work [22:36:47] you should fix your indentation btw chasemp [22:36:49] :P [22:37:28] oh yes I should [22:45:12] 10PAWS: user's user-fixes.py not taken into account - https://phabricator.wikimedia.org/T121160#1871024 (10Rama) 3NEW [22:46:36] so does it work chasemp? [22:47:42] not sure YuviPanda help me hot test this quick? [22:47:53] had to track down my phone (2fa) there [22:48:22] chasemp: is it deployed? [22:48:29] yeah [22:48:32] ok let me try [22:51:36] chasemp: no dice, I tried creating 'test-project-a' not showing up in 'id' [22:51:54] getent also tells me it ain't there [22:52:17] hm [22:56:46] so YuviPanda I made gtest1-temp [22:56:51] and if I go and find it and filter fo rit [22:56:52] for it [22:56:59] it says it exists and I'm a member [22:57:04] that membership info...is from ldap isn't it? [22:57:11] oh [22:57:11] is this possibly a perms issue? [22:57:13] yeah [22:57:15] it is [22:57:18] what do you mean by perms issue? [22:57:29] like it's being created with perms that getent can't see [22:58:01] oh... [22:58:25] can't you just use ldaplist [22:59:06] krenair@terbium:~$ ldaplist -l projects gtest1-temp [22:59:09] dn: cn=gtest1-temp,ou=projects,dc=wikimedia,dc=org [22:59:12] Krenair: getent is using PAM which is what is denying people access, so that seems like the appropriate thing to use [22:59:14] hmmm [22:59:18] so the permission issue might be a thing [22:59:21] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [22:59:21] member: uid=rush,ou=people,dc=wikimedia,dc=org [22:59:22] Krenair: try test-project-a? [22:59:38] chasemp: actually, yeah, since novaadmin is a member of all projects already [22:59:41] I forgot...
[23:00:04] well I think it's a member post creation [23:00:16] the initial creation may still (we should verify) have been failing [23:00:27] we could just make it a member at the start then [23:00:29] hmm right [23:00:45] Krenair: I'm cool w/ that [23:01:38] chasemp: yeah, so getent can't see them but ldaplist can [23:01:42] ok I reset the hotfix [23:01:47] let's double check that's needed here [23:01:50] ok [23:03:05] yeah I don't think it is, or I just reset it and created gtest2-temp [23:03:05] and [23:03:10] dn: cn=gtest2-temp,ou=projects,dc=wikimedia,dc=org [23:03:11] objectClass: extensibleObject [23:03:11] objectClass: groupOfNames [23:03:18] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [23:03:18] member: uid=rush,ou=people,dc=wikimedia,dc=org [23:03:19] info: use_volume=home [23:03:25] info: use_volume=project [23:03:26] cn: gtest2-temp [23:03:34] although, I'm not entirely sure why that's allowed atm :) [23:04:06] why what's allowed? [23:04:27] it seems like it's creating an empty group and then adding members which isn't supposed to be ok with openldap [23:04:31] but maybe I'm misunderstanding [23:04:55] maybe it adds novaadmin directly somehow? [23:05:08] sure enough YuviPanda [23:05:10] dn: cn=mediawiki-vagrant,ou=projects,dc=wikimedia,dc=org [23:05:12] objectClass: extensibleObject [23:05:12] objectClass: groupOfNames [23:05:13] info: use_volume=home [23:05:14] info: use_volume=project [23:05:17] info: servicegrouphomedirpattern=/home/%p%u/ [23:05:19] member: uid=novaadmin,ou=people,dc=wikimedia,dc=org [23:05:21] member: uid=bd808,ou=people,dc=wikimedia,dc=org [23:05:22] member: uid=dduvall,ou=people,dc=wikimedia,dc=org [23:05:24] cn: mediawiki-vagrant [23:05:28] bd808's ^ [23:05:31] it was there [23:05:40] right [23:05:43] so getent can't see them [23:06:05] Coren: ^ do you know if the PAM stuff that was done recently could be affecting this [23:06:07] ?
[23:06:24] thanks Krenair btw [23:06:37] np [23:08:11] 6Labs, 10Labs-Infrastructure: Groups for project are created in ldap but getent cannot see them on user VMs (disallowing ssh, etc) - https://phabricator.wikimedia.org/T121064#1871509 (10chasemp) [23:11:42] 6Labs, 10Labs-Infrastructure: Groups for project are created in ldap but getent cannot see them on user VMs (disallowing ssh, etc) - https://phabricator.wikimedia.org/T121064#1871522 (10chasemp) With the idea that OSM was doing bad things I put up https://gerrit.wikimedia.org/r/#/c/258355/ and tested with a ho... [23:11:53] chasemp: neat. So the break is somewhere other than the OSM code I take it [23:12:13] I have a suspicion yes https://phabricator.wikimedia.org/T121064#1871522 [23:12:29] good and bad, patching OSM was a fun prospect but I'm not sure on the ACL fix at all [23:14:59] YuviPanda: we are back to sitting on it for a bit as a rash acl change seems silly [23:15:08] chasemp: yup [23:15:17] chasemp: I think this is a 'throw over the wall to europe' thing [23:15:43] we better bootstrap some openldap knowledge and fast my friend :) [23:36:25] 6Labs, 6Research-and-Data, 10Wikimedia-Stream: Provide useful diffs to high-volume consumers of RCStream - https://phabricator.wikimedia.org/T100082#1871596 (10yuvipanda) [23:38:54] 10PAWS, 10pywikibot-core, 5Patch-For-Review: Install developer requirements into PAWS - https://phabricator.wikimedia.org/T120860#1871599 (10yuvipanda) https://github.com/yuvipanda/paws/pull/8 [23:48:36] 10PAWS: user's user-fixes.py not taken into account - https://phabricator.wikimedia.org/T121160#1871618 (10yuvipanda) https://github.com/yuvipanda/paws/pull/7