[04:07:23] heh
[04:08:39] vgutierrez: I had assumed all this time, with how fancy all the front of it is, that it had some native ssh client library
[04:08:46] (clustershell)
[04:08:55] turns out it just shells out to /usr/bin/ssh at the back of things :)
[04:09:06] :)
[04:10:18] so with that in mind, if there's a way to pass additional options, then -oIdentityAgent=/run/keyholder/proxy.sock would work
[04:10:45] I know with the "clush" CLI driver such things are possible, so that should imply the underlying library has some argument to pass those things down too
[04:11:34] I guess this should've all dawned on me when it was taking ssh-style -oFoo options in via "clush" as well, but some naive part of me wanted to believe maybe they were just parsing those for compatibility or whatever :)
[04:12:53] the whole universe is fork+exec wrappers all the way down :P
[04:13:20] https://en.wikipedia.org/wiki/Turtles_all_the_way_down
[04:23:51] vgutierrez: I think it's like: task.set_info("ssh_options", "-oIdentityAgent=/run/keyholder/proxy.sock")
[04:23:59] yep
[04:24:02] I'm trying that as we speak
[04:24:10] on acmechief-test1001
[04:31:45] yep, that works
[04:32:00] iff I remember to spawn my python3 interpreter as acme-chief instead of vgutierrez
[04:32:03] L8 issues
[05:48:29] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui)
[06:59:37] hello! Is cp3063 under maintenance?
[07:00:49] didn't find anything, seems not reachable via ssh
[07:01:25] yet another occurrence of T238305
[07:01:26] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305
[07:01:32] elukey: I'll take care of it, thanks
[07:02:12] vgutierrez: I am in the mgmt console, seems frozen, I can powercycle if you want
[07:02:19] otherwise I'll go away and shut up
[07:02:20] :D
[07:02:34] elukey: sure.. give me one second first
[07:03:53] 10Traffic, 10Operations: cp3063 crashed - https://phabricator.wikimedia.org/T239310 (10Vgutierrez)
[07:04:04] !log depool & powercycle cp3063 - T239310
[07:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:09] T239310: cp3063 crashed - https://phabricator.wikimedia.org/T239310
[07:04:44] elukey: go ahead please, powercycle it
[07:04:58] ack
[07:05:59] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[07:09:41] it's up now
[07:10:35] yup
[07:15:29] 10Traffic, 10Operations: cp3063 crashed - https://phabricator.wikimedia.org/T239310 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez As "expected" nothing on SEL or logs, this is yet another occurrence of T238305
[07:15:31] 10Traffic, 10Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Vgutierrez)
[07:20:47] vgutierrez: qq - I am investigating a rise in connections to eventstreams, do we have the same backend->app throttling limit that we had with varnish-be?
[07:21:07] elukey: I don't think so
[07:21:14] but ema is your man
[07:22:03] I'll follow up with him
[07:26:43] elukey: even if the same throttling limit is in place, now you'd see way more backend servers
[07:27:05] cause now the backend layer in esams, eqsin and ulsfo reaches the applayer directly
[07:27:17] instead of going via eqiad/codfw
[07:27:33] and AFAIK that was a throttling limit per varnish-be instance, right?
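Circling back to the clustershell/keyholder exchange at the top of this log, here is a minimal sketch of what that looks like through the Python API. The only part confirmed above is the task.set_info("ssh_options", ...) call pointing at /run/keyholder/proxy.sock; the node name and command are illustrative, and this assumes the interpreter runs as the right user (the "L8 issues" caveat).

```python
# Minimal sketch of the clustershell approach discussed above: since clustershell just
# forks /usr/bin/ssh, plain OpenSSH -o options can be pushed down via "ssh_options".
# The set_info() call is what was tested on acmechief-test1001; the node and command
# below are illustrative only.
from ClusterShell.Task import task_self

task = task_self()
task.set_info("ssh_options", "-oIdentityAgent=/run/keyholder/proxy.sock")
task.run("hostname", nodes="acmechief-test1001")  # illustrative command/node
for buf, nodes in task.iter_buffers():
    print(nodes, buf)
```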
[07:29:14] yeah exactly, 25 for every varnish-be
[07:29:24] interesting
[07:30:27] it is true that we have never solved the connection limits for eventstreams, I am not pointing the finger at ats now, just trying to figure out if some interim solution could be applied
[07:32:14] sure, I'm just offering an explanation
[07:32:23] yep it is a very good point
[08:26:10] elukey: good morning, from which cp host do you see the increase? Is it by any chance cp3050?
[08:31:39] ema: hello! Didn't check if it was a specific host, ES saw an increase of conns from the 17th/18th more or less
[08:32:54] ok, entirely unrelated then
[08:34:48] I was checking an old task in which Andrew mentioned 25 max conns for each varnish-be -> es
[08:34:57] and then I realized that we have ATS now :D
[08:35:53] yeah :)
[08:36:42] so I did a roll restart of ES in codfw this morning, and I freed some slots https://grafana.wikimedia.org/d/000000336/eventstreams?orgId=1&from=now-6h&to=now&refresh=1m&var-stream=All&var-topic=All&var-scb_host=All
[08:37:51] but IIRC we have never really solved es' max conns issue
[08:37:55] lemme pull the last task
[08:38:16] I haven't finished the "entirely unrelated" thought above, kept it for me in my head: yesterday I disabled request coalescing on cp3050 as a performance test, and I was wondering if that could have had anything to do with the increase in connections you've seen. It is however unrelated given that you mentioned Nov 17/18
[08:38:26] https://phabricator.wikimedia.org/T226808#5300508
[09:06:56] ema: so is ATS currently limiting backend conns in some way, or can we do something for ES?
[11:02:20] elukey: ignoring you on purpose for now, let's discuss this later today :)
[11:02:32] another day, another attempt to enable tslua reloads
[11:02:38] disabling puppet on all cp-ats hosts
[12:01:30] vgutierrez: good to go for the sync script?
[12:02:34] lgtm
[13:46:49] bblack, vgutierrez: tslua reload patch applied to cp1075 and cp2001. Things look fine to me but it would be great if you can double-check :)
[14:07:02] 10Traffic, 10Operations, 10serviceops: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (10akosiaris)
[14:07:12] 10Traffic, 10Operations, 10serviceops: Investigate the remaining usage of X-Real-IP - https://phabricator.wikimedia.org/T239340 (10akosiaris) p:05Triage→03Low
[15:22:36] elukey: hi :)
[15:22:48] elukey: so no, we do not currently limit the number of connections to origins
[15:23:31] there's a setting called proxy.config.http.origin_max_connections, which we haven't touched and which defaults to 0
[15:24:58] is it per-backend though?
[15:25:21] it isn't, but it's reloadable and overridable according to the docs, so we should be able to set it in lua
[15:25:44] set it in lua... as a global affecting all backends? :)
[15:25:52] :)
[15:26:13] it's a nice-to-have feature
[15:26:30] although I think in most cases, 0 is probably better
[15:27:02] well the interesting part is what happens once the limit is reached
[15:27:35] if proxy.config.http.origin_max_connections_queue is -1 (default), then requests that would have otherwise resulted in a new connection being created get queued
[15:28:04] so maybe some combination of these settings would help with the excessive number of new connections we create to eqiad?
[15:28:56] ema: didn't get the non per-backend part.. how are connections tracked?
[15:30:35] elukey: there's a counter per-origin server. The point is, we do not want to limit all our origins to a very low value like 25 (that's what we use for ES), and the setting is global (all origins get the same max)
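For reference (as a reading aid only, not a proposed change), these are the two records.config knobs being discussed, shown with the defaults mentioned above: 0 means no per-origin connection limit, and -1 means an unbounded queue once a limit is set; the full semantics are in the ATS 8.0 docs linked just below.

```
CONFIG proxy.config.http.origin_max_connections INT 0
CONFIG proxy.config.http.origin_max_connections_queue INT -1
```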
[15:32:18] ah okok
[15:32:49] so the counter is related to a single ats-backend -> origin right? (that was the unclear part)
[15:33:03] like the 25 varnish-be -> origin that we used to have
[15:33:33] anyway I agree that a global feature is not really usable up to now
[15:33:52] https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-http-origin-max-connections
[15:35:29] elukey: so my understanding is that we need to enforce this limit, otherwise ES creates one kafka connection for every incoming tcp connection it receives
[15:37:27] the problem is not on the kafka side, but on the ES side, since it may happen that all free "slots" to pull data from kafka are exhausted and then some new clients will fail
[15:38:06] but ES can't queue on its own?
[15:38:15] nope
[15:38:32] we could put a queueing revproxy in front of ES perhaps
[15:39:04] e.g. nginx or whatever configured to allow all the connections from ats-be, and funnel them into a limited pool of conns to ES, thus queueing in that daemon.
[15:39:27] (in the nginx I mean)
[15:39:45] the main issue that I can see is that clients usually connect and then keep pulling data, so the workers on the ES side are held busy until they disconnect
[15:39:49] (this is my understanding)
[15:39:54] queueing like that (the way we were before with varnish max conns even) introduces either latency or failure as a result of course
[15:40:19] but they're pulling data with discrete http requests over a normal http conn, right?
[15:40:35] I assume, or things wouldn't have worked before
[15:42:49] it uses the SSE protocol, so (IIUC) http chunked responses that keep flowing until the client disconnects
[15:44:36] so we couldn't have blended those before anyways (by queueing in varnish)
[15:44:51] yep this is my understanding
[15:46:05] so what was happening before, is varnish was limiting you to 25 (well 25 x number of varnish-be, at most), and any new client which would cause the limit to be exceeded would effectively just stall and timeout and fail in the common case?
[15:46:40] maybe you get lucky and one of the existing clients actually closes up and finishes before one of the pending ones times out and fails
[15:46:48] but probably not often if this is the model
[15:47:08] so if in the new model the excess clients still fail, not much has changed
[15:47:41] Andrew wrote a per-node-worker x-client-ip connection limit at the time: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520068/3/hieradata/role/common/scb.yaml
[15:48:12] but all the calculations etc. were based on the 25 * varnish-be max conns
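To make the SSE point above concrete: each EventStreams client is a single long-lived chunked GET, so a server-side slot stays occupied for the whole life of the connection and there are no discrete requests to interleave or queue. A minimal consumer sketch follows; the endpoint URL is a placeholder, not taken from this log.

```python
# Minimal SSE consumer sketch: one GET, one never-ending chunked response.
# The connection (and the ES worker slot behind it) is held until the client goes away.
import requests

def consume(url):
    with requests.get(url, stream=True, headers={"Accept": "text/event-stream"}) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data: "):
                print(line[len("data: "):])  # one event payload per "data:" line

if __name__ == "__main__":
    consume("https://streams.example.org/v2/stream/recentchange")  # placeholder URL
```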
[15:48:28] that also wasn't ideal since we could have ended up with all connection slots full in ES anyway
[15:48:48] heh ok
[15:49:07] just saying, varnish wasn't really effectively providing any queueing for this scheme
[15:49:28] yep yep
[15:49:30] it was just helping in the sense of limiting the carnage of a singular crazy remote client opening way too many conns
[15:49:34] interesting things from this morning: https://grafana.wikimedia.org/d/000000336/eventstreams?orgId=1&refresh=1m
[15:50:06] I roll-restarted the es codfw backends, since they were the ones showing most of the conns, and the client(s) causing the block freed slots
[15:50:26] (but also, by doing that imperfectly and wrongly, varnish was also potentially artificially limiting clients in cases where it wasn't needed as well)
[15:50:37] it was reported by internet archive that from the 18th few events were pulled, but in the task I didn't have more info
[15:51:45] task is https://phabricator.wikimedia.org/T239220
[15:52:12] so we have some problem, and we have a pair of hacks in the existing setup (varnish-be + scb config) that kinda-solve it poorly, and now with the ATS swap we've lost one of those poor hacks
[15:52:26] to be clear, I am trying to reason out loud with you guys now to see if ATS can help in some way, not telling you that this needs to be solved by traffic :)
[15:52:59] but I think maybe we'd be best served by trying to understand the real problem and make a real solution, rather than trying to copy the varnish-level hack (which was implementation-capability-specific and incorrect) to ATS.
[15:53:09] makes sense yes
[15:54:22] as best I can guess from the convo and details so far, I'd say maybe the real problem is "ES can handle the legit load of most of our reasonable clients we want to serve, but sometimes there are remote client IPs which hog tons of connection-slot resources, and we'd like to limit their impact and make their excess connections fail"
[15:54:39] ?
[15:54:58] correct
[15:55:47] but whatever mechanism was being tried by otto on scb before was unfortunately per-worker rather than global to the given ES server
[15:55:57] correct again
[15:56:37] the varnish hack I kind of understand the impact of, I think
[15:56:41] yes, the per worker stuff is hacky
[15:56:47] (only barely following this convo)
[15:58:14] https://phabricator.wikimedia.org/T196553
[15:58:22] hello ottomata :)
[15:58:26] o/
[15:59:33] hmmm
[15:59:57] I wonder if we could build a better version of the javascript limiter, which can limit XCIP parallelism global to all workers
[16:31:08] would def be better, just need some state :(
[16:46:41] <_joe_> bblack: I am about to merge a discovery dns change
[16:46:48] <_joe_> the procedure is the same as always, right?
[16:47:11] <_joe_> puppet merge, run on all role::authdns, dns merge, authdns-update?
[16:47:37] <_joe_> I am asking as I saw movement in that area and I didn't follow what changed
[16:48:07] the only thing that changed recently is that "role::authdns" isn't the right target
[16:48:11] cumin A:dns-auth will get it
[16:48:39] <_joe_> ok
[16:48:57] (there's authdns that are role::authdns, but there's also authdns that are role::dnsbox, and there's also role::dnsbox that are not authdns, but the common factor that cumin A:dns-auth catches is that all authdns consume profile::dns::auth now)
[16:51:49] <_joe_> ack
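Going back to the "limit XCIP parallelism global to all workers" idea from earlier: the core of it is just a shared per-client-IP counter consulted on connect and decremented on disconnect. Below is a rough single-process sketch of that counting logic; a real multi-worker ES deployment would need the counters in shared state (Redis or similar), which is exactly the "just need some state" caveat, and the limit value is illustrative.

```python
# Sketch of a global per-client-IP connection limiter (the "XCIP parallelism" idea).
# Single-process illustration only: real ES workers would need this state shared
# across processes. MAX_CONNS_PER_IP is illustrative, echoing the old per-varnish-be figure.
import threading
from collections import defaultdict

MAX_CONNS_PER_IP = 25

_lock = threading.Lock()
_active = defaultdict(int)  # X-Client-IP -> number of open SSE connections

def try_acquire(client_ip: str) -> bool:
    """Reserve a slot for client_ip; refuse (so the excess connection fails) if at the limit."""
    with _lock:
        if _active[client_ip] >= MAX_CONNS_PER_IP:
            return False
        _active[client_ip] += 1
        return True

def release(client_ip: str) -> None:
    """Free the slot when the long-lived connection finally closes."""
    with _lock:
        _active[client_ip] -= 1
        if _active[client_ip] <= 0:
            del _active[client_ip]
```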
[16:57:47] (and eventually this will all get better, we're in a temporary transitional state. The goal on this front, at this stage, is to end up at 10x role::dnsbox that are combined recdns+authdns+ntp, 2 per site, and no more role::authdns)
[17:10:28] ...and then convert them to buster, and route public authdns into them with bird, and implement DoTLS and Anycast AuthDNS, all in some ill-defined order
[18:07:28] bblack: is the intent to offer DoTLS to the public?
[18:14:28] yes
[18:14:52] DoTLS, to be clear, is for our authservers, for traffic from caches
[18:14:59] unrelated to the separate subject of e.g. DoH caches
[18:15:17] (well, related, but only in the sense that both are about encrypting some kind of DNS traffic)
[18:31:16] ah okay
[19:28:06] 10Traffic, 10Operations: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) Please note that ganeti4002 and ganeti4003 are showing as 'staged' in netbox but not in puppetdb, and throwing report errors on https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ "missin...
[19:53:01] that looks like a countdown "OCSP staple validity for www.wikimania.com has 69008 seconds left"
[20:01:46] jynus: noticed that too. hmm. only 19 hours
[20:03:35] looking at "Use acme-chief provided OCSP stapling responses" https://phabricator.wikimedia.org/T232988
[20:05:18] vgutierrez: do you know if OCSP staple validity alerts on ncredir* are actually critical?
[20:08:00] I'd assume if they're new criticals, they matter
[20:09:23] since 5h8m ago by what I see currently
[20:09:37] which is probably when they got under a 24h threshold
[20:09:56] there are a bunch of WARNINGs for ncredir too
[20:10:04] e.g. SSL WARNING - OCSP staple validity for wikipedia.com has 132690 seconds left
[20:10:35] yeah so all certs 1-5 are counting down
[20:10:47] each has a large set of SNIs in it, but will be named by the main subject
[20:11:15] ack, wikimania just happens to be the first one I assumed
[20:11:22] there's also a similar alert for install[12]002's acme-chief cert for apt.wm.o
[20:11:36] so it's not just ncredir, could be most/all acme-chief certs
[20:12:07] anyways, I'll go poke at some cases and see if I can figure out what's going on
[21:11:32] acme-chief issues should go away shortly
[21:11:56] the daemon was stuck sitting on a mutex for days, so it stopped refreshing cert expiries / ocsp staples / etc
[21:12:18] it suddenly caught up as soon as it was restarted, and the next agent runs should pick up the new ocsp stuff, etc
[21:12:35] probably a python-level bug in acme-chief or its dependencies somewhere
[21:15:29] fun
[21:22:10] I wonder what getting stuck on a mutex means in python terms
[21:28:27] yeah I dunno, in the strace/gdb sense it was stuck in a futex() call
[21:28:47] the backtrace was a bunch of unmemorable python interpreter stuff
[21:29:11] it's single-threaded in every sense, so I don't imagine it makes any explicit use of mutexes
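On the "stuck on a mutex in python terms" question: any blocking lock acquisition in CPython on Linux bottoms out in a futex/sem_wait syscall, so a hung daemon shows exactly the strace/gdb picture described above even if the application never creates a thread or lock explicitly (library-internal locks are enough). A tiny reproduction, which hangs on purpose:

```python
# Deliberately deadlock on a single lock; strace on this process then shows the same
# kind of futex()/semaphore wait that the stuck acme-chief daemon exhibited.
import threading

lock = threading.Lock()
lock.acquire()   # take the lock once...
lock.acquire()   # ...and block forever trying to take it again
```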