[08:53:20] I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178880 to reconfigure throttling via nftables on the https port, please lmk if it impacts ux on gerrit or gitlab! [09:40:08] tappof: hey, it seems like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1174729 broke puppet on all of cloud vps [09:41:19] taavi: ack, I'm going to check [09:41:22] `prometheus::instances_defaults` is not defined in cloud hiera, and prometheus::alert::rule requires that to be set via prometheus::instances() [09:44:51] tappof: do you have a quick fix or should we revert for now and retry later? [09:46:03] taavi: No quick fix, reverting. [09:47:53] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180520 [09:48:17] thanks! [13:59:28] after you brouberol [14:00:33] oops, sorry. looking [14:01:59] sry, I forgot that changes to labs/private need to be puppet-merged [14:07:42] Does anyone know how to troubleshoot IRC notifications? Ref https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-data-platform/stat_host.yaml . The latter alert is brand-new and is only notifying via email, while the older alert also notifies in IRC. The receiver is defined here (I think) [14:07:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/alertmanager/template [14:07:42] s/alertmanager.yml.erb#145 [14:12:17] inflatador_: what's notable to me: in stat_host.yaml you use severity "critical" while in alertmanager.yml.erb you have matches for severity "page" and "task" but not for critical. it should match a combination of team and severity, so "data-platform" + "critical" (instead of "page" ?) [14:13:44] (doesn't explain why they behave differently when they both have the same setting but still seems like it has to match something there) [14:15:14] yeah, I didn't look too closely at other teams' routes, let me do that [14:16:29] The last one is a catchall and should trigger for critical [14:17:13] looks like there is a `/var/log/icinga/irc-team-data-platform.log` on alert1002, but it's already rotated [15:01:10] taavi: I’ve just merged a patch that should fix the behavior we saw this morning on VPS instances. Could you run Puppet on some of the instances that showed the error earlier today and confirm whether the fix works? [15:26:10] I have a reprepro question, I added a thirdparty component to the trixie repo here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180584 , do I have to add an `Updates` section too with that thirdparty to import the packages? (maybe taavi or moritzm might know off the top of their heads) [15:28:59] tappof: I at least don't see the massive alert storm we had earlier! [15:29:15] dcaro: yes [15:29:29] taavi: thanks! I'll do that [15:29:31] dancy: yes, you need to add a matching definition in the "updates" file [15:29:56] I assume that was meant for dcaro. [15:30:20] xd [15:32:33] also left a comment on the patch, there's also a missing entry for distributions-wikimedia as well [15:34:48] part of me wonders whether it'd be worth it to generate those files from hiera that we could have better CI checks for [15:40:26] thanks! [15:42:46] sounds like a good idea :-) [17:22:06] anyone know how I would create TLS certificates that are then to be used by a zookeeper cluster? I am not sure any existing zookeeper cluster we have has used client auth before. I think I could configure it but mainly wondering what to use to create the actual certs.
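For context on the client-auth side of that zookeeper question: whichever tool ends up issuing the certs (cfssl or otherwise), the client side needs three artifacts, its own cert, the matching private key, and the CA chain used to verify the server. A minimal Go sketch of that wiring, purely illustrative; the file paths, the hostname, and the 2281 TLS port are assumptions rather than anything from the actual nodepool/zookeeper setup:

    package main

    import (
        "crypto/tls"
        "crypto/x509"
        "log"
        "os"
    )

    func main() {
        // Client cert + key issued by the (hypothetical) zookeeper intermediate.
        clientCert, err := tls.LoadX509KeyPair("/etc/zuul/zk-client.pem", "/etc/zuul/zk-client.key")
        if err != nil {
            log.Fatal(err)
        }

        // CA bundle containing the root plus the intermediate, used to verify
        // the certificate presented by the zookeeper server.
        caPEM, err := os.ReadFile("/etc/zuul/zk-ca-chain.pem")
        if err != nil {
            log.Fatal(err)
        }
        pool := x509.NewCertPool()
        if !pool.AppendCertsFromPEM(caPEM) {
            log.Fatal("no CA certs parsed from bundle")
        }

        cfg := &tls.Config{
            Certificates: []tls.Certificate{clientCert}, // presented for client auth
            RootCAs:      pool,                          // verifies the server chain
        }

        // 2281 is the conventional ZooKeeper TLS client port (an assumption here).
        conn, err := tls.Dial("tcp", "zookeeper1001.example.org:2281", cfg)
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        log.Printf("negotiated TLS %x with %s", conn.ConnectionState().Version, conn.RemoteAddr())
    }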
[17:23:02] we need it for new zuul, where nodepool is to be configured to communicate with a zookeeper [17:23:49] cwhite, jhathaway: (unless someone objects) I'm going to deploy a new sessionstore container shortly. No code changes, but in upgrading to Bookworm there are a bunch of new transitive dependencies (including Cassandra driver), so it seems worth making everyone aware of it. [17:24:31] mutante: I think you would want to make a new intermediary in cfssl for this zookeeper cluster, and then use cfssl-generated certs (which could be provisioned via puppet) [17:24:57] https://wikitech.wikimedia.org/wiki/PKI/Clients should help [17:25:19] cdanis: thank you! will get into that [17:25:58] that's the model used for any cert generated on k8s, for instance [17:26:08] ACK!:) [17:28:11] Thanks for the heads up urandom! [17:47:51] Ok, yeah, this is going badly [17:48:13] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.08.20?id=oiSWyJgBAlpnrixJ5y7d [17:48:32] > Error connecting to Cassandra: gocql: unable to create session: unable to discover protocol version: x509: certificate signed by unknown authority [17:49:32] 🥺 [17:50:32] I started with codfw (much less traffic), and I assume helm is doing the right thing, and rolling back [17:52:53] urandom: I think it's gonna wait until its timeout in helmfile.yaml to roll back [17:53:44] cdanis: yeah, is there a better course of action other than "waiting it out"? [17:53:47] which I think will be in another minute-ish, per `kubectl get event -n sessionstore --sort-by .lastTimestamp` [17:54:55] yeah there it goes [17:55:17] urandom: I don't have a good answer for you, in theory a ctrl-c should DTRT but I think there have been cases where that was very much not true [17:56:06] it doesn't seem to have created impact. I assume it's not advancing until it gets a successful startup on the pod or pods it's already done [17:56:36] yeah that's right [17:56:47] `kubectl get rs` at the time: [17:56:51] NAME DESIRED CURRENT READY AGE [17:56:52] kask-production-5784b89f9b 6 6 6 58d [17:56:54] kask-production-c7679cd74 4 4 0 7m18s [17:57:15] the old replicaset was only scaled down by 2, until some of the new pods became ready (which they didn't) [17:58:17] awesome. [17:59:01] you can leave something like `kubectl get event -n sessionstore --sort-by .lastTimestamp -w | ts` running in another terminal [17:59:24] oh, i think those k8s events are also in logstash if that's your jam [17:59:53] nice [18:00:18] the console probably works better in this case [18:00:26] one less thing to find in the moment [18:00:34] less context switching [18:00:50] yeah, it's what I usually do [18:01:12] now to figure out why certificate verification fails (and why it didn't in staging) [18:01:45] I feel like tls is becoming the new dns [18:02:30] where are the certs in question coming from? are these the certs of the cassandra servers? [18:04:16] yeah, it's the driver [18:04:50] so I assume it's the client (driver) complaining about the server certs [18:05:06] * swfrench-wmf missed all this until now [18:05:13] that would be my guess as well, yeah [18:05:28] I'm guessing that's the `cassandra` CA intermediate heh [18:05:33] these are the custom certs we load from `cassandra.tls.ca`? [18:08:22] how do you create those intermediate CAs? totally unrelated I was just staring at this because I need one: https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Adding_a_new_intermediate [18:08:54] swfrench-wmf: so... 
there is a ca cert supplied I think (client-side), but I don't think we verify it on Cassandra. I think the driver expects it for encrypted connections? [18:09:21] Cassandra isn't configured to verify client certificates [18:09:23] mutante: yeah that's how you create them 😅 I can help you if you need [18:09:30] cool:) thanks [18:10:45] I do see cassandra in the list of intermediates.. so there is that. [18:11:29] which cassandra servers does staging connect to, urandom ? [18:11:39] cassandra-dev2* [18:11:59] cassandra-dev200[1-3] [18:12:23] wow, so sessionstore production is running a quite old version of kask, one that predates this: https://gitlab.wikimedia.org/repos/mediawiki/services/kask/-/commit/a890c083f3d88d2af9652cac61c562014463347f [18:12:44] now that *is* in staging, but it never made it into production [18:13:50] so maybe —if I'm not thinking about this backward— maybe it was never good, and we're just now actually attempting to verify? [18:14:30] or maybe we never gave it the right CA or intermediate signatures to verify with [18:14:36] (and it didn't matter before) [18:14:56] are you giving the cassandra server the full chain? [18:16:40] uhh... I'm not sure. We have a keystore (a Java'ism) [18:17:33] everything we have is bundled within, and eluke.y did the PKI setup [18:18:22] somewhere on a TODO list I have an item for "come up to speed on PKI stuff" [18:18:51] (long since buried in other things I haven't done either) [18:19:45] sslcert::x509_to_pkcs12 { "cassandra_keystore_${tls_hostname}" : [18:19:47] owner => 'cassandra', [18:19:49] group => 'cassandra', [18:19:51] public_key => $tls_files['chained'], [18:19:54] that's something [18:24:59] oh geez [18:25:27] apparently an earlier version of me knew about this [18:25:55] urandom: how can I point an `openssl s_client` at cassandra and cassandra-dev ? [18:26:17] ah nvm I just had to not do it from a cumin host [18:29:51] Ok, so an early version of me tried to roll out the changeset above, encountered https://phabricator.wikimedia.org/T352647#9715110, and then disabled verification (in staging) [18:30:11] that error isn't quite what we're seeing in production today, though [18:30:37] https://phabricator.wikimedia.org/P81602 [18:30:51] seems like it's the same intermediate [18:31:49] I mean, the reason it worked in staging, but not production seems pretty straightforward [18:31:52] oh: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/sessionstore/values-staging.yaml#12 [18:32:13] vs. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/sessionstore/values.yaml#39 [18:32:22] good catch [18:32:41] i.e., in production, we use what's in the configmap populated from certs.cassandra.cs [18:32:43] *ca [18:33:18] swfrench-wmf: but look at the following line (for staging) [18:33:47] I'm guessing that if you tried again in production with the WMF CA bundle, but with verification still enabled, you would get a different error message [18:34:00] oh, yeah, ok [18:34:06] that sounds right [18:34:19] gah [18:34:22] what a mess [18:35:02] urandom: indeed, yeah - what I mean is, we're using totally different certs in the two environments, where the staging one should "clearly be valid" if the server is presenting a chained cert [18:36:04] swfrench-wmf: yeah, that makes sense.
So like cdanis said, fixing the ca cert would get past this error, and presumably leave us with the previous one that caused us to disable verification in staging [18:36:25] are those IPs VIPs? [18:36:33] yup, that makes sense [18:36:48] cdanis: the cassandra contact nodes? no [18:38:25] but once the driver dials a cassandra node, it asks it for a full node list to fill its pool, and the cassandra node returns IPs (which can't be verified because the certs use hostnames) [18:38:45] I think... this is starting to come back to me [18:38:48] are those IPs already in puppet hiera? [18:38:53] no [18:39:12] we're using hostnames there as well [18:40:19] urandom: ... it looks like profile::cassandra::instances has a `listen_address` key ? [18:40:35] oh, that's right [18:40:43] my bad [18:40:56] it won't be there long term though [18:41:09] ok [18:41:21] networking hates that (me too), so we're going to get rid of that [18:41:41] but cassandra clients would still be using IP addresses? [18:41:41] though, if *this* is any indication, you can probably rely on there being IPs there for *years* [18:41:53] so [18:42:07] well the driver retrieves an IP list for the node directly [18:42:13] s/for/from/ [18:42:13] you should be able to modify the profile::pki::get_cert() call in instance.pp to also include the instance's listen_address [18:42:19] for a SAN [18:42:36] imo doing that is better than disabling verification :) but maybe it's harder than it seems [18:42:39] this also sounds familiar... I think eluke.y wasn't fond of that [18:42:43] ok :) [18:42:58] apparently we haven't been doing verification all along [18:44:10] we're already not doing verification, and this change "fixed" that :) [18:44:36] fair enough [18:51:39] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1180622 [18:51:49] +1 :) [18:52:34] thanks! [18:56:35] yeah, that's looking better [18:56:38] less crashy [19:00:16] nice :) [19:01:58] session traffic in codfw is elevated, but that's just me (generating some synthetic load before upgrading eqiad) [19:02:14] in case someone sees that an low-key panics [19:02:19] s/an/and/ [19:36:01] Is there a meaningful difference between `ts.client_request.set_url_host` and `ts.client_request.header['Host'] =` in an ATS plugin? [19:36:10] I see instances of both. [20:05:36] cdanis: maybe we should revisit IP SAN. I think e.lukey's concern was that we weren't doing that elsewhere, that it was exceptional (which is fair). But solving it otherwise seems...worse. [20:06:16] yeah I just saw what you said re: HostDialers [20:06:37] those IPs are in netbox, right? [20:06:45] and the eventual goal would be for them to only be in netbox and not in puppet hiera? [20:07:28] well... the eventual goal would be that those IPs (those are secondary IPs) would go away, and we'd just bind all of the instances to different ports on the host IP [20:07:38] ahhh [20:08:39] that's only recently become possible, and it's going to be a lot of work to undo the years' worth of assumptions we've made in the mean time, so I'm not sure when that's going to happen [20:09:03] well, any way that that works out, I don't think you're making more of a mess for yourself in the future for adding the already existing in hiera listen_address to the cert SANs list [20:09:11] it's been a fairly high priority for a couple of years, so probably any day :) [20:09:11] s/for adding/by adding/ [20:11:01] when that day comes, would we have to add the host IP to hiera? [20:11:19] are you using IP SAN anywhere else at this point?
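Roughly what the driver side of this boils down to, as a minimal gocql sketch (not kask's actual code; the contact points, keyspace, and CA path are made up): with EnableHostVerification on, the driver verifies every node it discovers, so the server certs need SANs covering whatever addresses the driver dials, whether hostnames or the listen_address IPs, and otherwise verification has to stay off.

    package main

    import (
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        // Hypothetical contact points; the driver discovers the rest of the
        // ring from these and dials nodes by the addresses Cassandra advertises.
        cluster := gocql.NewCluster("sessionstore-a.example.org", "sessionstore-b.example.org")
        cluster.Port = 9042
        cluster.Keyspace = "sessions"

        cluster.SslOpts = &gocql.SslOptions{
            // CA bundle that must contain the intermediate which signed the
            // Cassandra server certs (the part that was wrong in production).
            CaPath: "/etc/kask/tls/ca.pem",
            // Verify the server cert against the dialed host; this is what
            // fails when discovered IPs aren't covered by any SAN.
            EnableHostVerification: true,
        }

        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatalf("unable to create session: %v", err)
        }
        defer session.Close()
    }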
[20:11:33] I think the answer was No a year ago [20:11:53] s/are you/are we/ [20:12:04] I don't think we are using IP SAN anywhere else, but I'm not 100% on that [20:12:54] Krinkle: the difference I think is that set_url_host changes the address of the origin server, while header['Host'] strictly changes the header value (keeping the same connection) [20:14:41] while I am not the expert on where to use what, from what I have seen in reviews, you would use set_url_host for a different hostname whereas header['Host'] is usually used in conjunction with other wrangling [20:15:02] urandom: and, for when that day comes, you have either the basic facts that come from the machine, or you have the data about the machine (including IPAM) in netbox, which makes its way into puppet via the sre.puppet.sync-netbox-hiera cookbook [20:15:28] oh right, ofc, the host's IP is already a fact [20:16:23] Krinkle: https://docs.trafficserver.apache.org/en/9.2.x/admin-guide/plugins/lua.en.html#ts-client-request-set-url-host [20:16:47] but yeah, note that I am strictly basing this only on reviewing code and not actually having written it [20:16:56] sukhe: I see, makes sense. so when we decide between primary/secondary dc (multi-dc) or rest-gateway, we use set_url_host, but when changing the vhost for a MediaWiki request (if X-Subdomain, set Host=dt-host) we just change the header in-place. [20:17:07] Krinkle: I think so, yes
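Tying back to the IP SAN question above: a small, hypothetical Go check for whether a given cert on disk covers an IP or a hostname via its SANs, the same check the driver ends up doing; the path and the two candidate names are placeholders.

    package main

    import (
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "log"
        "os"
    )

    func main() {
        // Placeholder path to a PEM-encoded Cassandra server certificate.
        pemBytes, err := os.ReadFile("/etc/cassandra-a/tls/cert.pem")
        if err != nil {
            log.Fatal(err)
        }
        block, _ := pem.Decode(pemBytes)
        if block == nil {
            log.Fatal("no PEM block found")
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            log.Fatal(err)
        }

        // VerifyHostname accepts either a DNS name or an IP literal; an IP
        // only matches if the cert carries a corresponding IP SAN.
        for _, candidate := range []string{"sessionstore-host.example.wmnet", "10.64.0.10"} {
            if err := cert.VerifyHostname(candidate); err != nil {
                fmt.Printf("%s: not covered (%v)\n", candidate, err)
            } else {
                fmt.Printf("%s: covered by SANs\n", candidate)
            }
        }
    }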