[08:50:06] hello people, I'd merge https://gerrit.wikimedia.org/r/#/c/397765/9 (new vk test instance on pink unicorn) if you are ok [08:51:21] elukey: yup! [08:54:04] \o/ [09:53:28] merged the change, and of course it doesn't work [09:54:00] seems to be complaining about certificates/varnishkafka/varnishkafka.key.pem that should be there, checking [09:56:50] ahhh certificates/varnishkafka/varnishkafka.key.private.pem [09:57:59] elukey: wrong path? [09:59:35] yes exactly [10:35:26] ok the vk instance is up but there are some TLS issues reported in the logs, investigating them [10:37:43] ah there you go, varnishkafka.key.public.pem returns Expecting: TRUSTED CERTIFICATE [10:38:06] I might have used the wrong file generated by cergen [10:59:46] ssl://kafka-jumbo1002.eqiad.wmnet:9092/bootstrap: SSL handshake failed: SSL syscall error number: 5: Connection reset by peer [10:59:49] much better now [11:00:07] slowly getting there :D [11:05:32] :) [11:07:26] for example, sending TLS data to the plaintext port may not be the right move [11:10:18] probably not [11:10:49] not recommended by nine doctors out of ten [11:22:07] ema: do you want to be the first one and send some traffic over to jumbo? [11:22:18] vk seems up and running now [11:24:44] elukey: yeee [11:25:01] elukey: do you see anything?
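[Editor's note: the "Expecting: TRUSTED CERTIFICATE" error above is what `openssl x509` emits when pointed at a PEM file that isn't a certificate, e.g. a private key. A minimal sketch of catching this kind of key/cert mixup by inspecting the PEM BEGIN header; the inline PEM strings are illustrative stand-ins, not actual cergen output:]

```python
# Sketch: classify a PEM blob by its first BEGIN header, to catch the
# key-vs-certificate mixups described above. Illustrative only.

def pem_kind(pem_text: str) -> str:
    """Return the type named in the first PEM BEGIN header."""
    for line in pem_text.splitlines():
        if line.startswith("-----BEGIN ") and line.endswith("-----"):
            return line[len("-----BEGIN "):-len("-----")]
    raise ValueError("no PEM header found")

cert = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
key = "-----BEGIN PRIVATE KEY-----\nMIIE...\n-----END PRIVATE KEY-----\n"
print(pem_kind(cert))  # CERTIFICATE
print(pem_kind(key))   # PRIVATE KEY
```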
[11:25:48] with tcpdump I can see traffic flowing but in theory the 'webrequest_canary_test' should be created [11:26:20] we'd need something that generates webrequest data though [11:26:26] hitting cp1008's frontend [11:27:25] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3840062 (10ema) [11:27:44] ah wait now I can see Dec 15 11:27:26 cp1008 varnishkafka[20220]: KAFKADR: Kafka message delivery error: Local: Message timed out [11:27:53] so I guess you already did something valid [11:27:54] sigh [11:28:13] yeah I've sent a few requests to https://pinkunicorn.wikimedia.org [11:28:23] ahhh right [11:33:24] it works now! [11:33:30] I had to create the topic in kafka [11:33:33] \o/ [11:33:46] I can see "GET","uri_host":"varnishcheck","uri_path":"/wikimedia-monitoring-test" [11:33:49] etc.. [11:34:22] * elukey dances [11:35:25] nice! [11:38:34] the only thing missing now in puppet is to refactor kafka_config.rb a bit, since it assumes that only one kafka port can be specified (defaulting to plaintext/9092 if nothing is in hiera) [11:38:44] meanwhile for Jumbo we'll have both [12:01:03] ema: I'd temporarily proceed with https://gerrit.wikimedia.org/r/#/c/398446 to configure the testing vk instance properly [12:01:31] the kafka_config.rb is very delicate and needs a ton of testing before getting refactored [12:02:52] if this is ok, after this we could 1) triple check that no data is leaked between cp1008 and kafka-jumbo (for misconfig, etc..) and then think about extending the experiment to cache misc? [12:04:34] elukey: re: https://gerrit.wikimedia.org/r/#/c/398446, would it make sense to add the hostlist to hiera?
[12:05:06] I guess it might, unless the plan is to fix kafka_config.rb Soon(TM) [12:05:54] and then yes, if all is good, misc next [12:09:09] my plan would be to confine this hack only to the test instance, while Andrew and I work on kafka_config.rb [12:09:37] that must happen of course before we even think about adding TLS for webrequest (not test) [12:10:02] so in this case hiera vs inside the profile doesn't change a lot in my opinion but I have no preference [12:11:16] the other thing that we could do is add jumbo-eqiad-tls in common.yaml [12:11:26] in which we specify the port [12:11:32] that should be picked up by kafka_config [12:12:13] but for this test my hack should be enough [12:12:31] so there will be no risk that people pick up a config in common.yaml [12:18:55] all right I chose the hack, will chat with Andrew on Monday about how to proceed with kafka_config.rb [12:26:59] ok! [12:45:13] https://grafana.wikimedia.org/dashboard/db/prometheus-tls-ciphersuite-explorer?orgId=1 [12:45:16] \o/ [13:22:48] also created https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-3h&to=now&var-instance=webrequest_jumbo_duplicate [13:55:47] filed https://gerrit.wikimedia.org/r/#/c/398468/ to add the new vk instance to cache misc. I'd like to verify with you guys that we are effectively using TLS and not leaking any data by mistake before proceeding. [14:01:00] elukey: what do you mean by "effectively using TLS"? I assume it would fail completely to connect/send if TLS isn't working [14:05:12] bblack: simply paranoia, I configured librdkafka/vk to use TLS and it seems to be doing so, but I was trying to find a good way to verify that. tcpdump is a straightforward choice but I was wondering if you had any suggestion about better ways to do it [14:18:13] does the server even accept non-TLS on that port?
[14:18:57] nope it should not, 9093 is TLS only and 9092 is plaintext [14:20:37] just tried "kafkacat -P -b kafka-jumbo1003.eqiad.wmnet:9093 -t webrequest_canary_test" to confirm, and I am not able to produce anything [14:20:51] I guess another question I'd have (since I haven't looked at the details of this at all) - is there client auth as well? [14:21:21] (as in, client must authenticate with its own TLS cert, signed by a CA the server trusts) [14:22:03] yep, on cp1008 in /etc/varnishkafka/ssl there is a client cert signed by the puppet ca, and in ssl/private its private key [14:22:27] librdkafka wants both and it uses those while establishing the TLS session [14:22:40] then on the kafka brokers we are able to set ACLs on authenticated users [14:23:07] the idea would be, for example, to allow only varnishkafka to produce to webrequest topics [14:23:42] I guess at the TLS level, what I mean is: does the broker require that the client have a client certificate, and that the certificate is signed by the puppet CA? 
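[Editor's note: the client-auth setup elukey describes above maps onto librdkafka's TLS properties roughly as follows. This is a sketch only: the paths and the exact key names/prefixes used by the deployed varnishkafka config on cp1008 may differ.]

```ini
# hypothetical librdkafka client TLS settings (names from librdkafka's
# CONFIGURATION.md; paths illustrative)
security.protocol = ssl
ssl.ca.location = /path/to/puppet-ca.pem                  # CA the client trusts
ssl.certificate.location = /etc/varnishkafka/ssl/varnishkafka.certificate.pem
ssl.key.location = /etc/varnishkafka/ssl/private/varnishkafka.key.private.pem
```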
[14:25:03] yep, it is all contained in a java truststore, the kafka broker holds the certificate of the ca that it needs to trust (in our case the puppet one) [14:25:24] ok cool [14:25:47] [for optimization purposes, my brain is going to pretend it didn't see the word java above, of course] [14:26:42] :D [14:29:50] Acceptable client certificate CA names [14:29:50] /CN=Puppet CA: palladium.eqiad.wmnet [14:29:50] Client Certificate Types: RSA sign, DSA sign, ECDSA sign [14:29:50] Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1 [14:29:53] Shared Requested Signature Algorithms: ECDSA+SHA512:RSA+SHA512:ECDSA+SHA384:RSA+SHA384:ECDSA+SHA256:RSA+SHA256:DSA+SHA256:ECDSA+SHA224:RSA+SHA224:DSA+SHA224:ECDSA+SHA1:RSA+SHA1:DSA+SHA1 [14:29:57] Peer signing digest: SHA512 [14:30:00] Server Temp Key: ECDH, P-256, 256 bits [14:30:04] ^ this is what openssl is showing when I make my own connection to a jumbo broker's TLS port [14:30:51] which seems like it will probably accept SHA1 and/or RSA/DSA client certs, probably not a great idea :) [14:31:53] I wonder about kafka's TLS negotiation restrictions in general [14:33:27] New, TLSv1/SSLv3, Cipher is ECDHE-ECDSA-DES-CBC3-SHA [14:33:27] Server public key is 256 bit [14:33:30] ouch we haven't checked those [14:33:51] ^ apparently it allows 3DES, but because it only has an ECDSA (and not RSA) server cert, it won't negotiate anything that's RSA-only [14:35:01] where are the client/server side TLS params configured, or docs, etc? [14:35:16] ah, I am reading https://docs.confluent.io/current/kafka/ssl.html - ssl.cipher.suites (Optional). A cipher suite is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. [14:35:33] and also ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1.
It should list at least one of the protocols configured on the broker side [14:36:48] right [14:38:30] so on the broker side: [14:38:33] ssl.enabled.protocols = TLSv1.2 [14:39:38] ssl.cipher.suites = ECDHE-ECDSA-AES256-GCM-SHA384 [14:39:45] would be a start [14:40:24] I mean really, there are potentially better options and we control both sides, but I donno about compatibility here. That should be a safe choice though, even for stock ssl libs on trusty or whatever. [14:41:25] the same two settings are available for the kafka client config too, so set to the same values [14:41:27] we can start from those and then add more if we encounter issues with clients while porting them to TLS [14:41:51] sure, going to send the code reviews in a bit [14:41:59] I'm not sure what encoding/format/whatever ssl.cipher.suites wants for the actual list/string/whatever [14:42:34] ECDHE-ECDSA-CHACHA20-POLY1305 is an even better choice, or could put both in for now if you figure out what type of "list" it wants there [14:42:45] (an openssl string in foo:bar form? or a kafka foo, bar list? [14:42:46] ) [14:43:15] but then again if java's providing the crypto, maybe it doesn't use openssl and doesn't get chapoly support anyways (which is ok, just not the most-ideal choice) [14:45:05] yeah, doesn't appear to support chapoly, tried it [14:45:51] anyways, yeah try locking down client+server proto+suites as above for 1.2 and aes256-gcm [14:45:58] that's a good start [14:46:27] the "requested signature algorithms" part still bugs me a little, but it will need more digging to figure out if it's really a problem [14:46:59] I'm worried that either side might accept e.g. RSA/DSA+SHA1 certs from the peer if they look legit. 
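[Editor's note: the lockdown proposed above, written out as config fragments. A sketch only; on the "what format does ssl.cipher.suites want" question, the Java broker side normally expects JSSE-style suite names, while librdkafka takes OpenSSL-style names, so the same suite is spelled differently on each side.]

```ini
# broker side (server.properties) -- Java/JSSE-style naming
ssl.enabled.protocols=TLSv1.2
ssl.cipher.suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384

# client side (varnishkafka/librdkafka) -- OpenSSL-style naming
ssl.cipher.suites=ECDHE-ECDSA-AES256-GCM-SHA384
```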
[14:48:07] good point [14:48:19] going to test on cp1008 [14:51:41] for the librdkafka part, the available configs are https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md [14:53:25] it accepts ssl.cipher.suites but no mention of protocols [14:53:36] (it fails if I try to set it) [14:53:56] with ssl.cipher.suites = ECDHE-ECDSA-AES256-GCM-SHA384 it works fine afaics [14:54:56] heh [14:55:39] so the server broker config takes ssl.enabled.protocols, but not vk due to librdkafka not having it? [14:56:00] exactly [14:57:02] does ECDHE-ECDSA-AES256-GCM-SHA384 imply a TLS minimum version to support it? [14:58:29] I think so, but it's not outside the realm of possibility that it still matters to specify it. [15:00:45] in any case, it doesn't seem to be the case that one can negotiate that cipher over tlsv1 or tlsv1.1 [15:01:09] I just don't know if failing to restrict the protocol choice on the client leads to some other scenario that's undesirable [15:01:36] (thinking from the perspective of someone who's actively trying to mitm/proxy/whatever our kafka TLS streams as they go over supposedly-secure transport) [15:02:44] e.g. I wonder if the broker/client support the renegotiation SCSV stuff to avoid downgrades [15:06:07] anyways, summing all this up in the context of the current date/time: [15:06:21] it's one thing to get TLS working at all and demo it on cp1008 for non-important stats [15:06:46] but we should probably spend time on some serious security review of the TLS connections here before we rely on it in place of ipsec for our x-dc live stats traffic [15:06:57] and that review may lead to needing some upstream patches to turn on features, etc [15:07:43] and with next week being the last before xmas break, I don't see us making it through that process comfortably enough to leave our PII in the hands of this over xmas.
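[Editor's note: on the "does the cipher suite imply a TLS minimum version" question above: ECDHE-ECDSA-AES256-GCM-SHA384 is defined only for TLS 1.2, so it cannot be negotiated under TLS 1.0/1.1. A quick way to poke at this locally against OpenSSL (which librdkafka links against), sketched in Python rather than the real client config:]

```python
import ssl

# Restrict a client context to the single suite under discussion.
# Illustrative only: the real restriction belongs in the varnishkafka/
# librdkafka and Kafka broker configs, not in Python.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2      # explicit protocol floor
ctx.set_ciphers("ECDHE-ECDSA-AES256-GCM-SHA384")  # SSLError if unavailable

# Only the requested suite remains in the TLS<=1.2 cipher list; OpenSSL
# 1.1.1+ manages TLS 1.3 suites separately, so filter them out here.
tls12_names = [c["name"] for c in ctx.get_ciphers()
               if c["protocol"] == "TLSv1.2"]
print(tls12_names)  # ['ECDHE-ECDSA-AES256-GCM-SHA384']
```

Since the suite itself is TLS 1.2-only, pinning the protocol version as well is belt-and-braces, which matches the "it may still matter to specify it" hedge above.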
[15:08:07] makes sense [15:08:34] but still, it's amazing progress on very useful stuff :) [15:09:15] \o/ [15:09:31] going to open some phab tasks then [15:09:48] it's ok if we keep experimenting on pink unicorn, right? [15:09:52] yeah [15:15:10] what's interesting here on the tls security review is we really care in both directions about both sides [15:15:32] whereas for public HTTPS, basically we can only do so much to care on our side, and it's up to the end-user to use a reasonably-secure UA [15:15:53] or whereas for my laptop, I care about reasonably-secure UAs but can't do much about some servers' server-side choices. [15:16:38] we control both sides so it makes sense to make them as robust as possible [15:16:49] but in this case, we clearly care about the client-side's security params because we don't want the client to be fooled into connecting to an illegitimate "broker" and feeding it sensitive data. [15:17:28] and similarly we don't want to allow illegitimate clients to connect to the broker either (sending fake/confusing data, or as the output of some kind of proxy after capturing the real client's connection) [15:19:00] and neither side should allow negotiation tricks that might leave a true end-to-end connection between our legit clients+brokers with bad security properties that make it more sniffable than we'd expect. [15:21:19] and so in the quick review this morning, I'd think we need to (a) upstream a librdkafka patch for tls version setting (should be pretty trivial) and (b) sort out what's going on with allowed sigalgs on both sides. But I donno if there's more to look at here, it's a deep topic that we don't normally have to think about because modern stacks on both sides (e.g.
Chromium/FF+nginx+OpenSSL) take care [15:21:25] of a lot of important details automagically for us [15:21:34] whereas kafka + java TLS libraries, not so much, so we really need to think->verify->audit [15:22:13] * elukey takes notes and writes them to a task [15:23:35] I think librdkafka is actually using OpenSSL, which is nice [15:23:55] but I have no recent experience on the details of the java side. I imagine it's some native java implementation that's separate. [15:24:52] maybe the jvm is smart enough to wrap openssl, but not sure about it [15:26:51] historically, it was based on Sun's own implementation I think [15:26:55] I'm not sure about modern openjdk [15:28:38] I see some hints in googling that openjdk uses NSS for at least some parts [15:28:51] (NSS being the mozilla alternative to OpenSSL, which is also pretty decent) [15:30:09] or at least, it can? [15:30:22] it may be a configuration thing whether you use the sun or nss implementation, or a code thing [15:34:56] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3840663 (10elukey) p:05Triage>03Normal [15:35:00] there you go :) [15:36:56] hope that I got everything right [15:37:12] Cc: moritzm [15:37:22] (the above task --^) [15:39:15] the Debian builds of OpenJDK use NSS as the crypto provider [15:39:34] and I think these days it's also the default upstream, but would need to have a closer look [22:58:43] someone finally implemented the last bits of smarts in the Linux kernel for getting IRQ->daemon routing right for multi-queue cards [22:58:56] (quite a while back actually, but I hadn't noticed) [22:59:09] there's even an nginx patch uploaded but not accepted at: https://trac.nginx.org/nginx/attachment/ticket/1437/SO_INCOMING_CPU.patch [22:59:32] it's not an ideal patch, because there's no new config flag to turn it on, and some setups using worker_affinity might not want SO_INCOMING_CPU
[22:59:57] but we should really make a slightly better optional version of that patch, and then turn it on when we turn on all the RPS/RSS stuff [23:00:07] (and cut our worker affinities back to one-per-real-core) [23:00:50] you can also now attach BPFs to the REUSEPORT routing for more-advanced behavior, but SO_INCOMING_CPU is the most "obvious" easy answer without resorting to BPF code.
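[Editor's note: for concreteness, the SO_INCOMING_CPU mechanism the nginx patch above relies on boils down to one setsockopt per worker on its SO_REUSEPORT listener: a worker pinned to CPU N sets SO_INCOMING_CPU = N, and the kernel's reuseport lookup then prefers the socket whose recorded CPU matches the one the packet arrived on. A minimal, Linux-only (kernel >= 4.4) Python sketch; the numeric fallback 49 is the Linux value of SO_INCOMING_CPU for Pythons that don't expose the constant, and the CPU id 0 is illustrative:]

```python
import socket

# Fallback for Python builds that don't export the constant (Linux value: 49).
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)

# One listener per worker, all sharing the port via SO_REUSEPORT; each
# worker tags its socket with the CPU it is pinned to.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, 0)  # "my" CPU id
sock.bind(("127.0.0.1", 0))
sock.listen(16)
assigned_cpu = sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU)
print(assigned_cpu)
sock.close()
```

An optional nginx version of this would want a config flag, as noted above, so deployments relying on a different worker_affinity layout could leave it off.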