[00:34:44] 3~3~/win 38
[01:40:53] netops, Operations, Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (faidon) This sounds a lot like T133387, which we reported a while back and had ATAC and engineering involved...
[09:02:33] paravoid, I didn't really understand your 'conditional to the domain' thing re ferm
[09:03:29] hi Krenair
[09:07:06] so ferm has this "domain" thing
[09:07:09] hey
[09:07:17] you can say
[09:07:38] domain ip { chain INPUT { proto tcp dport ssh ACCEPT } }
[09:07:46] braces are optional, so this is equivalent to
[09:07:50] domain ip chain INPUT proto tcp dport ssh ACCEPT
[09:08:01] so that creates an iptables output
[09:08:03] but you can also say
[09:08:06] domain ip6 chain INPUT proto tcp dport ssh ACCEPT
[09:08:13] which would create the ip6tables equivalent
[09:08:17] and then...
[09:08:21] domain (ip ip6) chain INPUT proto tcp dport ssh ACCEPT
[09:08:24] would create both
[09:08:39] the (foo bar) notation is standard in ferm, you can also do
[09:08:47] interesting okay. so in domain ip6 it should know to look up AAAA records for @resolve too
[09:08:47] domain (ip ip6) chain INPUT proto tcp dport (ssh smtp) ACCEPT
[09:09:07] yeah, so IMHO @resolve shouldn't always resolve 'A', but it should vary that depending on which domain we're under
[09:11:13] Krenair: $ echo 'domain (ip ip6) table filter chain INPUT proto tcp dport (ssh smtp) ACCEPT;' | /usr/sbin/ferm --noexec --lines --slow -
[09:11:28] (doesn't change anything on the system, doesn't require root)
[09:12:06] so
[09:12:09] echo 'domain (ip ip6) table filter chain INPUT saddr @resolve(bast1002.wikimedia.org) proto tcp dport (ssh smtp) ACCEPT;' | /usr/sbin/ferm --noexec --lines --slow -
[09:12:13] doesn't DTRT
[09:12:27] IMHO that's a bug
[09:13:10] and that has caused the proliferation of @resolve(..., AAAA) in our tree
[09:13:27] e.g.
[09:13:27] srange => "(@resolve((${pybaltest_hosts_ferm})) @resolve((${pybaltest_hosts_ferm}), AAAA))",
[09:13:35] one potential problem around this is that you can specify NS or SRV types there
[09:13:43] but I think if we can just change the default depending on what type of domain...
[09:13:48] should probably be okay
[09:13:49] yes exactly
[09:13:56] could make it support multiple types at the same time etc.
[09:13:58] the default is the problem, not having the rrtype as a parameter
[09:17:36] IIRC the fix was a bit complicated, because the resolving happened at the first pass (the tokenizer) which doesn't know the domain yet
[09:17:45] but I haven't looked at it in 3 years :P
[09:24:26] hm okay
[09:24:55] likely much easier to just fix up the current situation and have @resolve() @resolve(, AAAA) everywhere
[09:25:01] but might be nice to do properly later
[09:30:17] hm
[09:30:19] I think I fixed it
[09:41:34] Krenair: https://github.com/MaxKellermann/ferm/pull/38
[09:45:04] sorry :P
[09:46:06] heh ok
[09:46:17] I'm in class right now but will take a look soon
[10:14:28] Traffic, Operations: Update certspotter - https://phabricator.wikimedia.org/T204993 (faidon) einsteinium and tegmen still run jessie and I didn't build a version for jessie-wikimedia. I believe they're being migrated to stretch as we speak, so maybe we should just wait for that.
[10:39:32] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (ema)
[10:39:59] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (ema)
[10:44:07] paravoid, bah, the hosts are jessie, of course.
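The interim workaround mentioned at 09:24:55 (pairing every @resolve() with an @resolve(, AAAA)) can be sketched in Python as a small string builder. This is only an illustration: the helper name `dual_stack_srange` is hypothetical and is not part of ferm or of the puppet tree. It simply reproduces the shape of the srange example quoted at 09:13:27.

```python
def dual_stack_srange(hosts):
    """Build a ferm srange string that resolves the hosts for both A and AAAA.

    Hypothetical helper for illustration only; it mirrors the srange
    example from the log and does not call ferm itself.
    """
    if not hosts:
        raise ValueError("at least one hostname is required")
    joined = " ".join(hosts)
    # The first @resolve relies on the default record type (A, per the log);
    # the second one requests AAAA explicitly.
    return f"(@resolve(({joined})) @resolve(({joined}), AAAA))"


if __name__ == "__main__":
    print(dual_stack_srange(["bast1002.wikimedia.org"]))
```

If the default record type ever followed the enclosing domain, as proposed in the discussion, the second @resolve would presumably become unnecessary.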
[10:45:09] Traffic, Operations: Update certspotter - https://phabricator.wikimedia.org/T204993 (Krenair) {T202782} for einsteinium, not sure about tegmen
[10:45:22] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Nikerabbit) >>! In T191183#4098656, @demon wrote: > `avatars-gravatar` I am not installing. I did once, and was very quickly told not remove it. While that Wordpress proxy plugin s...
[11:07:50] paravoid, so when we use &R_SERVICE in ferm config...
[11:07:54] will this be okay with that?
[11:08:29] yes, why not :)
[11:09:22] just conscious about what you said about how this is all interpreted given that R_SERVICE is a function that produces a domain block
[11:11:04] like &R_SERVICE(tcp, 22, @resolve(bast1002.wikimedia.org));
[11:11:24] will produce a domain block out of &R_SERVICE, but if @resolve is evaluated first...
[11:13:03] ha! you're right
[11:15:05] that's a very good catch
[11:16:21] when you said @resolve was evaluated in the tokenizer, that's when the alarm bells started going off :)
[13:15:38] Krenair: BTW, we should ensure that /etc/certcentral/accounts/${account_id}/ exists before dropping files there, right?
[13:36:26] vgutierrez, ... true
[13:36:28] good point
[13:38:04] vgutierrez, fixed, hopefully.
[13:43:40] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (hashar) >>! In T191183#4651560, @Nikerabbit wrote: >>>! In T191183#4098656, @demon wrote: >> `avatars-gravatar` I am not installing. I did once, and was very quickly told not remov...
[13:46:04] 0666?
[13:46:14] we should grant at least 0770? right?
[13:46:19] to the directory itself I mean
[13:47:09] +x to allow certcentral to go inside the dir :)
[13:48:47] do we want to allow certcentral to write?
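The directory modes being compared in this exchange can be decoded with Python's standard `stat` module. This is a rough sketch for reading the bits, not the puppet change itself; the helper names are made up for illustration, and it assumes the service account is the owner or group of the directory.

```python
import stat


def describe_dir_mode(mode):
    """Return the ls-style permission string for a directory with these mode bits."""
    return stat.filemode(stat.S_IFDIR | mode)


def can_traverse(mode):
    """True if both owner and group have the execute (search) bit on the directory."""
    return bool(mode & stat.S_IXUSR) and bool(mode & stat.S_IXGRP)


def can_write(mode):
    """True if owner or group may create or remove entries in the directory."""
    return bool(mode & (stat.S_IWUSR | stat.S_IWGRP))


if __name__ == "__main__":
    for mode in (0o666, 0o770, 0o550):
        print(oct(mode), describe_dir_mode(mode), can_traverse(mode), can_write(mode))
```

Without the execute bit a directory cannot be entered, which is why 0666 does not work for a directory, while a mode without write bits still allows entering and listing it.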
[13:48:59] am thinking 550
[13:49:15] hmmm actually 0550 should be enough
[13:53:50] regarding hieradata for our production nodes, the eqiad and the codfw nodes are going to be identical in terms of configured accounts, right?
[14:05:54] vgutierrez, I assume so
[14:06:03] could give them different accounts I guess
[14:06:21] might be beneficial for rate limiting further down the line?
[14:07:08] I don't think so, only one of them is going to be working
[14:09:12] Krenair: BTW, the naming central/client, could be slightly better/more standard to have server/client or master/client
[14:09:36] vgutierrez, I think we could have them both get certs simultaneously and either have a global pointer at a single host or point clients at their local certcentral server?
[14:09:48] as long as we're using dns-01 challenges
[14:10:19] IIRC it has been defined as active/passive in the goal
[14:10:23] ok
[14:10:40] +1 that certcentral::central is confusing and doesn't give the idea that is referring to the server|master part of it
[14:12:39] it's not like the others are replicas
[14:12:51] the only other part of this inside our infra is the clients
[14:13:11] but that's more or less just some puppet resources on each varnish etc. host that tells them where to get the cert
[14:13:39] Traffic, Operations, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (Ottomata) Ahm, afaict this is very different and I am likely very ignorant here... buuuuut just in case you don't know about [[ https://github.com/wikimedia/cergen |...
[14:18:47] so it's just a server? :)
[14:19:18] if it listens to a port and answers to client's connections I'd say it's a server :-P
[14:19:37] that's true at least for the API :)
[14:20:00] if then it acts as a client for LE API is still a server that has a client inside to call external resources
[14:20:02] that's how certs are served to the clients internally, so yeah, server looks (and feels) right
[14:20:30] if those two pieces are detached one from the other you can call them server and le_client I guess
[14:22:27] Traffic, Operations, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (Krenair) It looks like that has some code involving interacting with X509 certs, but not ACME APIs or the Puppet fileserver API. It seems to have something to do wit...
[14:26:56] I don't think we separate them in the new manifests
[14:27:19] it all just sits under central.pp
[14:27:44] yup, we are just suggesting a s/central/server/g change :)
[14:30:16] ok
[14:32:07] vgutierrez, volans: PS43
[14:32:23] * vgutierrez wondering what would happen if we reach PS99
[14:32:44] sorry PS42 is the right one, by definition :-P
[14:32:55] hahahah
[14:33:03] gerrit overflow
[14:39:54] I'm sure someone can find out with gsql what the highest number of patch sets a change has had in wikimedia gerrit
[14:41:34] Traffic, Operations, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (Ottomata) Yeah, its mostly for new certificate generation from CAs. Puppet CA is optional; a Letsencrypt (or ACME?) Certificate Signer class could be implemented....
[14:42:20] hmmm also... profile::certcentral::server::available_certs::accounts feels kinda weird
[14:42:34] something like profile::certcentral::server::config::accounts
[14:42:45] and profile::certcentral::server::config::certificates
[14:42:51] looks better IMHO
[14:43:42] Traffic, Operations, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (Krenair) >>! In T194962#4652052, @Ottomata wrote: > Just saying it should be considered! Maybe if this had been suggested 3-5 months ago this could be considered. T...
[14:43:53] historical
[14:43:57] will rename
[14:44:00] ack
[14:45:09] PS44
[14:45:17] and growing
[14:45:18] :D
[14:57:14] Krenair: so.. certcentral/manifests/server.pp expects the account config in the parameter accounts, but it isn't being set in profile/manifests/certcentral/server.pp
[14:58:47] uh right
[14:58:51] we should have the profile pull that from hiera
[14:59:33] yup, a hash with account ids and regr.json contents, right?
[15:01:28] yeah
[15:01:59] ema: re https://phabricator.wikimedia.org/T201039#4649384 is it possible to check if there is the same issue anywhere else in eqiad?
[15:02:02] would you like to amend that in for prod or do a follow-up commit?
[15:03:00] I'll amend it with hieradata/role/common/certcentral/server.yaml
[15:03:11] ok
[15:06:00] XioNoX: https://grafana.wikimedia.org/dashboard/db/pybal provides that information somehow, looking for a prometheus way to better highlight servers which are marked as down but pooled
[15:07:07] ema: I managed to use cumin for a one time check
[15:08:42] a dashboard should be the way
[15:08:55] if it's not possible right now, we should improve that :)
[15:11:27] bblack: do you have the (cumin?) command you used to check the hosts with a failed v6 neighbor entry?
[15:23:51] XioNoX: I don't know that I have one that identifies just the switch issue. The one I had looked for any host neighbor table with FAILED entries, but there could be other "legitimate" causes for them that are unrelated.
[15:23:55] sudo cumin '*' 'ip -6 neigh show|grep FAILED'
[15:24:03] it returns quite a lot of data to sift through
[15:24:11] bblack: yeah, I'm working on it
[15:24:17] bblack: some are false positive for sure
[15:24:38] eg. lvs1002 have 2620:0:861:103:10:64:32:97 as failed
[15:25:09] also, there could be broken hostpairs we'd never naturally observe. Scenarios where hostX and hostY never have a reason to talk to each other ever, normally, but they're affected by whatever this bug is and would fail to talk if they ever tried.
[15:25:12] which doesn't exist since the host has been re-purposed as spare and lost its static v6 IP (not now have a dynamic one)
[15:26:15] s/not//
[15:27:08] Krenair: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/46/hieradata/role/common/certcentral/server.yaml --> that looks sane to you?
[15:27:25] yes
[15:27:33] but
[15:28:13] you'll want to set up the dns-01 challenge stuff later
[15:28:29] until then it won't work
[15:28:41] thing is that doesn't get set up until the next commit, so
[15:29:54] ack
[15:32:04] so uh
[15:32:25] do we have a service that count be used to test this somewhere bblack?
[15:32:26] could*
[15:33:10] we should probably do a test-run first on a fake service before we accidentally break certs for a real one
[15:33:39] yeah
[15:34:01] as in, deploy the certcentral and authdns-hookup stuff, then define some puppetized bullshit cert for qwerty.wikimedia.org deployed to cp1008 or something and see if it flies
[15:34:11] we can set up certcentral to get a cert for a domain that isn't actually getting certs from certcentral yet
[15:34:29] (and use the testing LE endpoint of course)
[15:36:08] well... we could get one for pinkunicorn right?
[15:36:24] I guess? I don't think it has to be an existing or defined name though
[15:36:44] (although it would be interesting to see if it fails if the CN is a non-existent hostname in the DNS I guess heh)
[15:37:02] there's not really a good use-case for it, but good to know
[15:37:29] hmmm good point, that's not covered in my tests :)
[15:37:39] my fake DNS server resolves everything to 127.0.0.1
[15:38:28] in theory with dns-01 validation, nothing should need the CN to exist in DNS
[15:38:41] but I wouldn't be shocked if something on their side or ours pointlessly tries to check it
[15:38:50] yup
[15:39:40] I don't think it will look for an A record
[15:39:52] just for the relevant TXT
[15:41:36] so.. I'll move the server.yaml to the authdns integration commit, asking for certcentraltest.wikimedia.org
[15:42:05] and as bblack pointed out, we can set cp1008 as a client for that certificate
[15:43:33] how about we leave certificates empty at first
[15:43:50] then after the DNS stuff goes in, in a new commit, we add the test cert
[15:44:17] and then once we're happy with that, in a new commit we can make cp1008 actually get the cert
[15:45:48] ack
[15:47:57] vgutierrez, I think you meant certificates: {}
[15:48:25] that looks like it will just set certificates to empty string
[15:50:08] XioNoX, bblack: so does the IPv6 situation in eqiad look stable enough for the switchback tomorrow?
[15:50:20] Krenair: gotcha
[15:50:21] I'd say yes
[15:51:08] which will probably cause certcentral to error
[15:51:23] fixed
[15:51:29] (PS48)
[15:56:41] vgutierrez, alright
[15:56:44] what next?
[15:58:06] hmmm
[15:58:14] we still lack a certcentral role right?
[15:58:37] to be referenced in site.pp for the certcentral[12]001 nodes
[15:59:00] right...
[15:59:07] class role::certcentral {
[15:59:18] indeed
[15:59:24] system::role { 'certcentral': description => 'Central certificates service' }
[15:59:33] include ::profile::certcentral::server
[15:59:54] + ::standard + ::profile::base::firewall
[16:00:05] ok
[16:00:10] one moment
[16:00:19] basically the spare::system role + the certcentral::server class
[16:02:09] profile :)
[16:03:12] PS49 looking good, if you want, feel free to modify the site.pp and assign that role to the certcentral instances
[16:03:39] maybe the change needs to be rebased though
[16:04:05] but right now in the production branch we already have the node definition with role(spare::system)
[16:05:16] which certcentral instance should I assign it to vgutierrez?
[16:05:21] eqiad or codfw?
[16:06:45] hmm both should be puppetized, but we should have a hiera boolean setting one as active and one as passive
[16:06:51] (aka not starting the service)
[16:08:03] and if I'm right, we should have a svc dns entry pointing to the one acting as active
[16:12:10] vgutierrez, and the puppet certs on these hosts will have SANs matching that?
[16:12:59] we can generate puppet certs with certgen for that
[16:13:47] ok
[16:13:59] I have no idea where to find the DNS stuff for svc.eqiad.wmnet and co. btw
[16:15:12] vgutierrez: are you using LVSes
[16:15:31] hmmm not right now
[16:15:42] do you plan to?
[16:15:53] if it's only one server per DC it's probably overkill
[16:16:09] btw
[16:16:10] and I guess we could use a "fake" discovery DNS entry as we already have others
[16:16:13] on a semi-related note
[16:16:15] I did some grepping
[16:16:21] $ git grep ganeti02
[16:16:22] templates/1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa:9.4.0.0.0.0.0.0.4.6.0.0.0.1.0.0 1H IN PTR kubestagetcd1003.eqiad.wmnet. ; VM on ganeti02.svc.eqiad.wmnet
[16:16:26] is ganeti02 a thing?
[16:16:34] yup.. I was suggesting that fake discovery DNS entry
[16:16:41] see debmonitor for an example
[16:16:50] volans: LVS for 1 backend server per DC looks crazy
[16:16:56] yeah
[16:17:46] Krenair: how would that be related?
[16:18:14] it's got svc in the name volans :)
[16:18:37] I was going to say unrelated
[16:18:41] I think this is the kind of thing we can manage with simple aliasing if we need to flip
[16:19:02] yep like debmonitor and others
[16:19:04] it's not runtime-critical. if it fails, we have ~1 month to fix it before someone's cert expires
[16:19:19] yup... anything more complicated like that looks like a clear example of overengineering :)
[16:19:58] really the only part where we need a failover-able alias, is whatever entrypoint hostname is used by client machines to e.g. fetch cert/key updates
[16:20:10] (I guess puppet fileserver access and such)
[16:20:17] indeed
[16:20:34] right which is the part we're discussing
[16:20:39] right
[16:20:45] or at least I thought we were
[16:20:54] yes that's what I was referring to
[16:20:58] I don't know, I'm in the midst of a meeting and haven't read backscroll :)
[16:21:03] ok :)
[16:21:13] the use case is very similar to debmonitor internal API for server reporting
[16:21:34] but I'd expect a schema something like ccentral1001.eqiad.wmnet + ccentral2001.codfw.wmnet for the actual hosts, and ccentral.svc.wmnet that CNAMEs over to one or the other I guess.
[16:21:51] I was suggesting certcentral.discovery.wmnet bblack
[16:21:56] modulo whatever bikeshedding
[16:22:12] as svc are usually LVSes services and we have other CNAME based discovery entries
[16:22:27] I think there's already some non-discovery names in discovery, but I think it's probably a bad idea even if it works
[16:22:51] ?
[16:22:52] svc is generic, it just means it's some kind of virtual service hostname
[16:23:16] we don't have svc.wmnet, only svc.$DC.wmnet
[16:23:46] yeah I guess, but we should maybe. there's a lot to bikeshed in this area from different POVs
[16:23:54] the non-discovery discovery entries so far are:
[16:23:54] ; Will become a proper discovery endpoint once we add more registries
[16:23:54] debmonitor 300 IN CNAME debmonitor1001.eqiad.wmnet.
[16:23:54] docker-registry 300 IN CNAME darmstadtium.eqiad.wmnet.
[16:24:14] so it's not like we're creating a new problem if we stick a certcentral one in there now
[16:24:38] that's what I was suggesting, we already have that and we know we want to move that to something that we don't know yet what will be
[16:24:46] but I suspect putting non-discovery things in the discovery domain might not be the best long-term idea, e.g. if we end up having that whole subzone auto-generated by something actually specific to some discovery tooling
[16:25:09] (or even, delegated to another piece of software)
[16:25:39] related topics came up in how we're going to structure things for ATS
[16:26:10] for e.g. appservers.discovery.wmnet or whatever, we want ATS to consume the discovery hostnames, too. In any case, we don't want the eqiad/codfw switches to be up in ATS config itself.
[16:26:37] but does that imply we should create fake discovery hostnames for all of the (~50-ish?) non-discovery services that exist behind varnish today?
[16:27:21] things pollute pretty quickly if our standard is "any standardized/aliased service hostname should be under .discovery whether the actual discovery tooling is involved or not"
[16:28:17] okay
[16:28:19] so the outcome of this is what
[16:28:20] on this I agree and I'd be ok to add another subdomain for this
[16:28:32] but not sure if it's worth to block certcentral for that ;)
[16:28:41] no svc/discovery stuff, just use the standard certcentral1001/certcentral2001 and flip with puppet?
[16:28:56] prior to the existence of .discovery, I would've argued that a datacenter-neutral alias would go under foo.svc.wmnet without the dcname part
[16:29:31] flipping with puppet would be a good idea in this particular case, too, I think
[16:29:50] since all access is via puppet fileserver reference hostnames anyways, inside of puppet
[16:29:54] yes seeing as by flipping all your clients end up with different certs
[16:29:57] how many clients?
[16:30:07] however many public-facing web servers we have
[16:30:16] number of varnishes plus gerrit plus...
[16:30:22] we could define it just once though
[16:30:31] oh plus mail servers
[16:30:43] some hieradata certcentral::current_fileserver: certcentral1001.eqiad.wmnet or whatever
[16:30:44] and various other misc stuff
[16:30:51] but varnishes are gonna be the main thing I expect
[16:31:01] puppet doesn't seem the right solution IMHO, but I might be biased
[16:31:12] Uh well I called it certcentral::cert::certcentral_host bblack
[16:31:20] that's fine
[16:33:07] volans: so this service hostname we're discussing (e.g. what could be certcentral.discovery.wmnet): its only use, is to be the hostname used inside a puppet manifest for a file reference like file { ..., source => puppet://certcentral.discovery.wmnet/certs/foo } or something like that
[16:33:40] instead of hardcoding certcentral.discovery there and having a DNS commit flip it, why not have hieradata flip the hostname stuck into all of those?
[16:34:32] no reference to this hostname whatsoever outside of puppet?
[16:34:43] I don't think there's a reason to, no
[16:35:08] then if it depends on a puppet run anyway seems to make sense
[16:35:10] the two certcentral hosts operate independently. the centralized bit is "which one do clients fetch certs from?"
[16:35:19] I thought it would end up in some config file somewhere in the hosts
[16:35:23] (so that they're not flipping randomly between two compatible but different certs)
[16:35:40] no volans
[16:35:48] we're using the puppet fileserver api
[16:36:03] take a look at the certcentral::cert class
[16:36:06] anyways
[16:36:27] in particular these bits: source => "puppet://${host}/acmedata/${title}/${type}.crt",
[16:36:36] ack, then puppet seems fine
[16:36:46] so this just stays in the puppet catalog
[16:38:24] and gets used to tell puppet clients where to connect to, but I don't think they'll save the path to disk
[18:30:33] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Ottomata) Just added the `accept` field to the varnishkafka generated webrequest logs. @JAllemandou I haven't done this in a while, I'll ping you in m...
[18:38:18] netops, Operations: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (ayounsi) I took the 16 hosts unable to reach the eqsin anchor over v6 during the last measurement (https://atlas.ripe.net/measurements/11645088/) and ran traceroutes from them to the e...
[20:39:48] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Tgr) >>! In T191183#4651560, @Nikerabbit wrote: > Really? As far as I know all gravatar integrations send hashes of emails. This means they don't know which email it is, unless it...
[20:52:26] netops, Operations, Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (ayounsi) Open>Resolved a:ayounsi That should be good enough to make the alerts useful by removing the "false positive". Please reopen if still too no...
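The hieradata flip discussed between 16:29 and 16:38 can be illustrated with a short Python sketch. It only mimics the string interpolation of the `source` line quoted at 16:36:27 and is not the puppet code from the change under review. The function name `cert_source`, the example certificate title, and the codfw hostname are assumptions made for the illustration.

```python
def cert_source(host, title, kind):
    """Mimic the quoted template: puppet://${host}/acmedata/${title}/${type}.crt"""
    return f"puppet://{host}/acmedata/{title}/{kind}.crt"


def client_sources(active_host, titles, kind="rsa-2048"):
    """Every client URL is derived from one setting, so changing it flips them all.

    The default kind is an arbitrary placeholder, not a value from the log.
    """
    return [cert_source(active_host, title, kind) for title in titles]


if __name__ == "__main__":
    titles = ["certcentraltest"]  # assumed certificate title
    # Assumed values for the single hiera-controlled hostname.
    for active in ("certcentral1001.eqiad.wmnet", "certcentral2001.codfw.wmnet"):
        print(client_sources(active, titles))
```

Because the hostname only appears inside the compiled catalog, a flip of that one value would reach clients on their next puppet run, which matches the conclusion in the log that no DNS alias is needed for this.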
[21:34:14] Domains, Traffic, Operations, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (CRoslof) It sounds like adding a delegation for wikimediaeesti would just allow you to change the nameservers, but we can change the nameservers now without ad...
[23:34:31] Traffic, Gerrit, Operations, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (thcipriani) >>! In T191183#4649245, @Dereckson wrote: >>>! In T191183#4647075, @Krinkle wrote: >> Gerrit wants 100x100px square thumbnails. > > The 100x100 size isn't what curren...