[08:24:37] Traffic, Operations: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (elukey) p: Triage→High
[09:55:13] Traffic, Operations, Patch-For-Review: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (elukey) Current status: - dns* nodes have `LimitNOFILE=infinity` but since daemons have not been restarted, they are r...
[09:55:16] ema: --^
[09:55:44] we'll probably also need to roll-restart all gdnsds
[09:56:50] elukey: excellent, thank you so much!
[09:57:01] <3
[10:57:01] Hi traffic folks! In https://phabricator.wikimedia.org/T266702#6587811 I work from an assumption I remember being mentioned somewhere: that doing URL-based routing at the cache layer is hard or impossible. Can anyone confirm whether that is indeed the case?
[10:57:51] The desire would be
[10:57:54] cache -> /sparql -> wdqs service
[10:57:54] `---> /* -> static site
[13:07:24] elukey: thanks!
[13:08:07] sadly, this will be the first true restart in a long time, since the "replace" operation can't fix the ulimits set by the unit files
[13:09:02] (well, it already has been the first true restart in a long time, but will be again for the next cycle of them to get Infinity)
[13:14:50] I'm gonna poke at haproxy tunables and other related ulimits first, to see if anything else needs adjusting
[13:25:24] actually, having stared at all the other bits in depth, I think everything else is ok for where we're at now
[13:40:52] bblack: what do you think of adding some alert, maybe on the socket-errors metric or similar?
[13:43:10] yeah, but perhaps not specifically for this role
[13:43:22] seems like it would be a good idea fleet-wide to catch cases like this based on the typical system metrics
[13:43:46] (if possible without noise, but I'd think it would be)
[13:44:53] yeah, I was also thinking of a more general disk alert that fires when, based on the current trend, the disk is predicted to be full within N days/hours
[13:48:34] another output of this: accept4() failure causes a log_err() (which is what was observed for this ticket), but doesn't increment any kind of gdnsd-level stats counter
[13:49:06] maybe past-me had some rationale about the possibility of it being a noisy stat, but still, noisy stats are more acceptable than noisy log output :P
[13:50:23] request-level failures get stats, and any request- or connection-level thing that results in the teardown (normal or not) of a connection gets a stat
[13:50:27] and new conns have a stat
[13:50:35] but there's no acceptfail stat, that's the hole
[13:51:24] volans: a disk alert fired for authdns2001 last night, at least
[13:51:41] do we also alert on gdnsd answering TCP queries?
[13:51:42] and it was noticed when already full
[13:52:36] right, the disk stat was from the extremely-noisy syslog output I assume
[13:53:25] I do wonder why those aren't suppressed at some layer into "Last message repeated 873 times" or whatever
[13:53:34] Oct 29 00:00:08 authdns1001 gdnsd[18626]: TCP DNS: accept() failed: Too many open files
[13:53:37] Oct 29 00:00:08 authdns1001 gdnsd[18626]: TCP DNS: accept() failed: Too many open files
[13:53:40] they seem identical
[13:54:16] do we explicitly unconfigure the repeat-compression stuff? I'm not even sure which layer would be responsible, but I assume rsyslog
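(The suppression in question is the classic sysklogd "last message repeated N times" behavior: track the last emitted message, count consecutive duplicates, and flush a single summary line when the message changes. A minimal C sketch of that mechanism, purely for illustration; this is not rsyslog's or gdnsd's actual code, and all names here are made up:)

```c
#include <stdio.h>
#include <string.h>

static char last_msg[1024];
static unsigned repeat_count;

static void flush_repeats(void) {
    if (repeat_count > 0) {
        printf("last message repeated %u times\n", repeat_count);
        repeat_count = 0;
    }
}

void log_line(const char *msg) {
    if (last_msg[0] && !strcmp(msg, last_msg)) {
        repeat_count++;       /* duplicate: count it, emit nothing */
        return;
    }
    flush_repeats();          /* message changed: report suppressed dupes */
    snprintf(last_msg, sizeof(last_msg), "%s", msg);
    printf("%s\n", msg);
}
```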
[13:54:20] yes there are a ton of them
[13:55:52] at a glance, it looks like rsyslog has some config for that legacy-style repeat-suppress, but it would also affect centrallog output, which would be harmful on another level
[13:56:18] this might be a discussion for #wikimedia-observability :)
[13:56:51] from the rsyslog docs: $RepeatedMsgReduction - [...] This parameter models old sysklogd legacy. Note that many people, including the rsyslog authors, consider this to be a misfeature. See Discussion below to learn why [...]
[13:58:05] I could also move the log output for these accept-fails to the debug level
[13:58:28] the problem is, then it's hard for the user to know the errno reason for them from the stat alone, and creating stats for all possible errnos is ick
[13:59:11] you could add the errno as a label on the prom stat
[13:59:30] yeah I mean at the gdnsd level itself
[13:59:52] it only has a fixed set of stats counters with fixed labels and indices
[13:59:59] ah
[14:00:39] I could also just implement some kind of log output rate limiter in gdnsd at a basic sanity level
[14:02:53] it's kind of an interesting case at a whole different level, too, because if accept4() returns EMFILE, apparently that doesn't make the accept()ability condition go away
[14:03:37] so the daemon thread will immediately loop back to blocking on accept4() after logging that error, and even in the absence of any new traffic, it will immediately return EMFILE again for the same thing, in a tight loop
[14:04:18] (which makes a certain sense from the Linux POV: after all, a smart daemon might do something about EMFILE before trying again, and get a success)
[14:06:16] gdnsd actually has a behavior it could use here already: when the gdnsd-level configured limits on connection count are exceeded, we shoot down the most-idle of the existing conns and bump a stat counter to inform the admin (which is the "Server Close Kill" stat on our dashboards)
[14:06:31] so maybe EMFILE shouldn't be handled by the generic "any other errno" code, and should do that instead
[14:06:48] at least it would avoid the loopiness
[14:10:00] (for some common cases, anyway. you could still get odd differentials between separate TCP threads contributing to the process-global ulimit, when the thread that gets EMFILE doesn't have another idle connection to kill)
[14:10:39] so gdnsd-level outputs (which aren't an emergency or anything):
[14:11:20] 1) self-ratelimiting log outputs to some sanity level would be a win against insane spam (I saw ~5k/sec from this case in our syslog output)
[14:11:40] 2) EMFILE could at least attempt to close the most-idle connection to make room, when possible
[14:12:40] 3) there should be a stats counter for acceptfail (possibly two: one for the known benign cases that can be caused by network conditions, and one for the default/unknown more-important cases like these)
[14:19:27] Traffic, Operations: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (BBlack) All the authdns are restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets...
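(A rough sketch of how outputs 2) and 3) above could fit into an accept loop: treat EMFILE/ENFILE specially instead of lumping them in with "any other errno", so the thread frees a slot rather than spinning. Illustrative only; handle_new_conn(), kill_most_idle_conn(), the stats_* counters, and log_err_ratelimited() are hypothetical placeholders, not gdnsd's real internals:)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical placeholders for this sketch: */
extern void handle_new_conn(int fd);
extern void stats_inc_acceptfail_benign(void);
extern void stats_inc_acceptfail_other(void);
extern bool kill_most_idle_conn(void);          /* frees one fd if possible */
extern void log_err_ratelimited(const char *fmt, ...);

void accept_loop(int listen_fd) {
    for (;;) {
        int fd = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
        if (fd >= 0) {
            handle_new_conn(fd);
            continue;
        }
        switch (errno) {
        case EAGAIN:
        case EINTR:
        case ECONNABORTED:
            /* benign failures caused by network conditions: stat only */
            stats_inc_acceptfail_benign();
            break;
        case EMFILE:
        case ENFILE:
            /* out of fds: the listen socket stays readable, so retrying
             * immediately just spins; instead try to make room by killing
             * the most-idle existing connection, as the conn-limit code
             * already does for "Server Close Kill" */
            stats_inc_acceptfail_other();
            if (!kill_most_idle_conn())
                log_err_ratelimited("TCP DNS: accept() failed: %s", strerror(errno));
            break;
        default:
            stats_inc_acceptfail_other();
            log_err_ratelimited("TCP DNS: accept() failed: %s", strerror(errno));
            break;
        }
    }
}
```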
[14:21:52] addshore: https://phabricator.wikimedia.org/T266702#6588396
[14:23:14] bblack: <3 ty
[14:42:17] addshore: we also already do something similar for wdqs https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#154
[14:42:40] /bigdata/ldf -> wdqs1005.eqiad.wmnet/bigdata/ldf
[14:42:52] everything else to wdqs.discovery.wmnet
[14:42:55] aaaah nice
[14:43:12] the very important thing is that we don't rewrite the uri_path
[14:43:25] like in /bigdata/ldf -> wdqs1005.eqiad.wmnet/bigdata/ldf
[14:43:53] we don't want to deal with uri rewrites at the caching layer if at all possible
[14:46:12] addshore: so `/sparql -> wdqs_service/sparql` is great :)
[14:47:01] ack, that's probably all perfect :)
[14:47:13] I may copy these logs into the ticket so that I don't lose the links
[14:53:38] ack!
[15:12:52] Traffic, Operations: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (Vgutierrez) p: High→Medium
[15:21:04] bblack: following up on https://gerrit.wikimedia.org/r/c/operations/puppet/+/636902, I'm thinking of considering all non-TLSv1.3 cipher suites as medium security instead of strong
[15:22:02] yeah sounds reasonable, for what it affects
[15:36:25] as long as our entire fleet is able to talk TLSv1.3 we should be ok
[15:36:42] and that means everything running on buster
[15:37:36] yeah, which isn't yet true :)
[15:38:14] one thing that makes ssl_ciphersuite hard to deal with is that it affects both public and internal cases
[15:38:36] * moritzm mumbles something about 55 remaining jessie servers
[15:39:10] bblack: yeah.. but we should eat our own dog food and be using TLSv1.3 :)
[15:39:19] a bunch of core things (e.g. puppetmaster) are using 'strong', so yeah, strong had better support all the existing legacy-OS hosts :)
[15:39:53] I imagine the public removal of TLSv1.2 support is probably still a long way out
[15:42:07] yeah.. ~13% of our traffic is still TLSv1.2
[15:47:05] Traffic, Operations: ATS trying to set socket options SO_MARK / IP_TOS - https://phabricator.wikimedia.org/T265911 (Vgutierrez) Open→Resolved a: Vgutierrez `vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'apt-cache policy trafficserver|grep Installed' 72 hosts will be targeted: cp[2027-2042].cod...
[15:47:48] in any case, even jessie should support strong ciphers + ECDSA I think
[15:47:51] just not 1.3
[15:48:30] vgutierrez: wasn't a look at https://debmonitor.wikimedia.org/packages/trafficserver enough? :)
[15:48:50] volans: that's not pasteable :(
[15:49:02] 8.0.8-1wm3 72
[15:49:23] yeah.. technically I could paste the screenshot
[15:49:28] or just the cumin CLI output :)
[15:49:35] volans: either way I'm using one of your tools... ;P
[15:49:39] ahahaha
[15:50:19] bblack: that would be interesting for our (evil) plans of deprecating RSA
[15:50:36] yeah
[15:50:55] I think that transition might have some slight tricks to it on our end
[15:51:15] to fully benefit from it? yeah
[15:51:24] I think at the 100% -> Remove step for that, we actually have to stop configuring the RSA cert first, before removing the ciphers from the list
[15:51:33] I think you end up with very strange failures doing it the other way around
[15:52:21] I might have that backwards, I donno
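(On the strong-vs-medium split discussed above: with OpenSSL 1.1.1+, TLSv1.3 suites are configured through a separate API from the legacy cipher list, so "strong = TLSv1.3-only" can be expressed simply by raising the minimum protocol version. A minimal sketch under that assumption; this is not the actual puppet ssl_ciphersuite() logic, and the cipher strings are illustrative choices:)

```c
#include <openssl/ssl.h>

SSL_CTX *make_ctx(int strong_only) {
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    if (!ctx)
        return NULL;
    /* TLSv1.3 suites, configured via their own API (OpenSSL 1.1.1+) */
    SSL_CTX_set_ciphersuites(ctx,
        "TLS_CHACHA20_POLY1305_SHA256:TLS_AES_256_GCM_SHA384:"
        "TLS_AES_128_GCM_SHA256");
    if (strong_only) {
        /* "strong": refuse anything below TLSv1.3 entirely */
        SSL_CTX_set_min_proto_version(ctx, TLS1_3_VERSION);
    } else {
        /* "medium": additionally allow TLSv1.2 AEAD suites for
         * legacy-OS clients that can't speak 1.3 yet */
        SSL_CTX_set_min_proto_version(ctx, TLS1_2_VERSION);
        SSL_CTX_set_cipher_list(ctx,
            "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:"
            "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305");
    }
    return ctx;
}
```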
[15:52:46] also...
[15:52:49] willikins:puppet vgutierrez$ git grep acmecerts |grep rsa-2048 |wc -l
[15:52:49] 86
[15:52:49] willikins:puppet vgutierrez$ git grep acmecerts |grep ec-prime256v1 |wc -l
[15:52:49] 80
[15:52:58] but one way that you do it, the handshake just fails like other transitions, and the other way around it does something crazier that might look weird to users or in our logs
[15:53:18] (because it will negotiate a cipher for which we have no matching auth cert, or it will pick the cert and have no matching cipher, or whatever)
[15:53:40] what's configured as rsa-only?
[15:54:35] exim, apparently
[15:54:57] some wmcs stuff too
[15:55:13] so yeah, we'll need to fix the rsa-only cases first as well
[16:51:28] netops, Operations, cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (aborrero)
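(Background for the cert-vs-cipher ordering concern above: OpenSSL lets a single SSL_CTX carry both an ECDSA and an RSA certificate chain, and serves whichever one matches the auth type of the negotiated cipher, which is why the RSA cert and the RSA ciphers have to be removed in a coordinated order rather than independently. A hedged sketch of the dual-cert setup; the file paths and the rsa_still_offered flag are made up for illustration:)

```c
#include <openssl/ssl.h>

int load_certs(SSL_CTX *ctx, int rsa_still_offered) {
    /* ECDSA chain: selected whenever an ECDSA-auth cipher is negotiated */
    if (SSL_CTX_use_certificate_chain_file(ctx,
            "/etc/ssl/ec-prime256v1.chain.pem") != 1 ||
        SSL_CTX_use_PrivateKey_file(ctx,
            "/etc/ssl/ec-prime256v1.key", SSL_FILETYPE_PEM) != 1)
        return -1;
    if (rsa_still_offered) {
        /* RSA chain: keep loading this only while RSA-auth ciphers remain
         * in the list; per the chat above, dropping cert and ciphers in
         * the wrong order can leave a negotiated cipher with no matching
         * auth cert (or vice versa) instead of a clean handshake failure */
        if (SSL_CTX_use_certificate_chain_file(ctx,
                "/etc/ssl/rsa-2048.chain.pem") != 1 ||
            SSL_CTX_use_PrivateKey_file(ctx,
                "/etc/ssl/rsa-2048.key", SSL_FILETYPE_PEM) != 1)
            return -1;
    }
    return SSL_CTX_check_private_key(ctx) == 1 ? 0 : -1;
}
```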