[07:36:43] 10Certcentral: certcentral crashes on network errors - https://phabricator.wikimedia.org/T209980 (10Vgutierrez) p:05Triage>03Normal
[09:36:59] 10Traffic, 10Operations, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[14:44:36] bblack: what do you think about issuing one cert for gerrit/gerrit-slave.wm.o instead of two separate certs?
[14:46:09] of course we could issue two different certs and change their current puppetization: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/gerrit/server.pp#L77
[14:49:58] 10netops, 10Operations: Remove neodymium/sarin from router ACLs - https://phabricator.wikimedia.org/T210612 (10MoritzMuehlenhoff)
[14:54:21] vgutierrez: I think the whole reason they did it this way was because we don't reasonably support x-dc failover (or really, two endpoint hosts of any kind anywhere) with our old LE puppetization. In many other cases, if there was a codfw backup host, it was just left in a self-signed and/or broken state.
[14:54:39] vgutierrez: those others are easy to just "fix" and give both of them valid certs.
[14:55:46] vgutierrez: for this case, yes, I think it's probably best to put both gerrit and gerrit-slave hostnames together as a SAN set and issue it to both for now. Then they can decide how they want to proceed in the future based on their other requirements, and maybe (or maybe not!) we remove one of the SANs later.
[14:55:57] ack
[15:01:32] 10Certcentral: puppet still restarts certcentral on config changes instead of reloading it - https://phabricator.wikimedia.org/T209976 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez After applying change XXXX certcentral gets reloaded instead of restarted: ` Nov 28 14:59:24 certcentral1001 systemd[1]: Rel...
[15:01:48] damn...
I forgot to paste the change number in my comment
[15:03:18] change XXXX is always the bad one
[15:09:08] agreed, always so mysterious
[15:11:38] so it turns out that ATS does not have the equivalent of obj.hits, which we need for the n-hit-wonder admission policy (and in general, we recently used that as a measure of popularity for webp)
[15:12:13] I took the opportunity to see how to write an ATS C++ plugin, this is what I've produced https://phabricator.wikimedia.org/P7858
[15:12:44] it adds `X-ObjHits: 42` to the response on fresh hits
[15:13:22] using an unordered_map in memory as a data structure
[15:14:50] at a basic level it seems to be working, but I'm happy to hear your thoughts about it
[15:18:47] is there a RAII-style object for TSMutexLock?
[15:20:26] cdanis: not AFAIK, I've followed the style of all the other plugins
[15:20:37] oh wow
[15:22:01] please tell me they don't use exceptions
[15:24:06] there's a few `throw`s here and there :)
[15:24:20] that's... very unsafe
[15:25:42] i just really don't understand it; a RAII object for this is trivial to write, saves you from having to sprinkle 'goto' everywhere for every early termination, and also gives you exception safety if something you call throws
[15:26:03] not your fault ofc, just baffled about making this decision in a C++ codebase in 2018
[15:26:48] well... ATS is not that new :)
[15:27:31] like, 1995?
[15:27:58] was it always a C++ application?
[15:30:59] it was opensourced as C++ AFAIK
[15:31:09] before that... who knows what happened at Yahoo!
[15:32:03] I think Inktomi developed it in C++ from the get-go
[15:32:08] I perhaps have strong opinions about C++ code (always use RAII, always use smart pointers)
[15:34:37] do you want such a thing for your code ema? I will write you one right now haha
[15:36:05] cdanis: I'm sure the Traffic Server folks would love a few patches :)
[15:37:35] here's a question -- do folks usually make a separate GitHub acct for their WMF work?
[15:37:59] nope
[15:38:19] or at least, nobody in the SRE team has a separate one I think
[15:38:26] 10Traffic, 10Operations, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez)
[15:40:10] I just tacked on my @wikimedia.org as a second email for my existing account when I joined here, it's worked out ok
[15:40:31] minus occasional non-issue confusion where I upload patches to personal projects using the wrong email address for the commitmsg or whatever :)
[15:42:22] 10Wikimedia-Apache-configuration, 10Operations, 10Patch-For-Review: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10jcrespo) 05Open>03Resolved a:03ArielGlenn @Hello903hello @Urbanecm this seems to be working after Ariel's patch, resolving. Plea...
[15:42:31] ema: I think, at a more abstract level, things like cache perf hacks we shouldn't port directly from VCL->ATS anyways. Just the functional parts. It may turn out that we need entirely different hacks in ATS, or that ATS's default algorithms are fine, or that we want to implement n-hit-wonder or the exponential-size policy more deeply inside of some ATS module (or core!) instead of as higher-level hacks around a replication of Varnish's hit counters, etc...
[15:42:43] yeah.. I've added my @wikimedia.org email to my personal gh account
[15:43:23] ema: but yeah, I understand there's some tricky bits here on the frontend/backend barrier, where the varnish fe wants info from the formerly-varnish be to do perf hacks...
[15:45:28] right, our current implementation of n-hit-wonder on the frontend relies on information from the backend
[15:45:55] we could always replace nhw with the exponential policy of course (I don't even remember why we stalled on it?)
:)
[15:49:19] so certcentral managed to get the gerrit/gerrit-slave issue on the first attempt (first cert with several SANs issued in production)
[15:49:27] s/issue/cert/
[16:00:31] ema: because the exponential policy needs size data, and I don't think we ever convinced ourselves we could get size data for cache_text for all responses
[16:01:08] ema: I think maybe we tried by making the requests to the applayer non-streaming if there was no CL header in the initial response, and then that broke something and we reverted it? It's hard to remember.
[16:02:12] ema: but in any case, n-hit-wonder was just a minor optimization really. We could always just break it or kill it.
[16:03:13] (at least for the transition period, and then re-evaluate all related things once we're all-ATS again)
[16:27:59] https://github.com/cdanis/trafficserver/commit/b029dd62354caf3451073d72a77ce071ad67d6db
[16:28:27] it's so simple to write this, and it saves so much boilerplate and so many silly mistakes
[16:29:44] https://phabricator.wikimedia.org/P7859
[16:30:35] it would not be too hard to do the same for everything else you have in the cleanup goto
[16:39:11] ok, apologies for getting nerdsniped on this, i will go back to doing actually-helpful things :)
[16:41:10] :)
[16:43:05] ema: maybe a good place to start would be to kill the n-hit-wonder code in the frontend now and see if you can see a big impact in the graphs over a few days or a week. If we can't really tell for sure, then it's not worth keeping given the complexity it adds to the ATS transition. Or maybe we can tell, but it's like a 1% hitrate move off the frontend to the backend, and meh.
[16:43:15] ema: I suspect it's not going to cause a huge net loss in hitrate
[16:45:39] vgutierrez, that announcement should probably go to wikitech-l rather than ops
[16:45:46] cdanis: haha thanks :) Maybe open a PR if you think that would be generally useful in ATS
[16:46:18] yeah, i think i may convert one or two other users of TSMutexLock as examples as part of the PR
[16:46:22] bblack: +1
[16:46:48] I feel pretty strongly that it is just the right way to do things in C++
[16:46:56] in C you don't have options better than "goto cleanup;"
[16:48:00] in C++ you do, and they're simple classes you write once; then, in maintaining code that uses them, the lazy/thoughtless thing (like adding an early return) becomes the correct thing
[16:50:06] Krenair: sure, I can send it there too if it's really necessary
[16:50:35] I don't know if it's necessary at all tbh
[16:50:48] you're only swapping the cert and reloading apache
[16:51:13] yep, I checked with bblack before sending that one and ops was enough apparently
[16:51:50] also the rollback is easy enough, so we should be OK
[16:52:58] ok
[16:56:12] e.g. right now mutante is doing an actual restart :)
[16:56:46] or possibly just did
[17:38:05] vgutierrez: Krenair: the gerrit issue gets into all kinds of deep territory. From production's POV, making invisible labs things work too isn't really a requirement, and shouldn't spawn a bunch of complexity or realm conditionals on the prod side.
[17:38:30] I know that seems crappy, but this issue crops up all the time in other contexts
[17:39:23] As far as I'm concerned, it's possible to make this work within labs with the current puppet manifest.
[17:39:50] it's probably also not reasonable to ask every labs-deployed copy used to test some production service, all of which will presumably eventually have HTTPS and thus use LE, to spin up separate puppet+CC setups for themselves, unless that's highly automated+abstracted
[17:39:55] needs some extra work on the part of project admins but that's okay
[17:40:08] it needs a fairly complex amount of extra work on eventually every useful project
[17:40:55] I think if you're using labs to test a production service it's not unreasonable to depend on puppet+CC
[17:41:16] if you're not testing a production service you'll probably find it a heck of a lot easier to sit behind the labs proxy, which handles TLS termination for you
[17:41:46] well, ok
[17:41:54] soo
[17:41:57] I think this setup is fine
[17:42:14] but still, this probably won't be the last time we find ourselves in this boat, and the two easy answers at present aren't very palatable, those being:
[17:42:27] 1) Put in $realm conditionals to avoid screwing up a running, working labs setup, or:
[17:42:46] 2) Break their stuff without warning and tell them they need to sort out some puppet/CC infrastructure on their end to get it working again
[17:43:33] yes, these come up all the time and are by no means limited to the latest instance
[17:44:07] but this is the first time it's come up with our prod CC migrations I think?
[17:44:11] yes
[17:44:36] at least, that we've heard about. maybe arzhel has some netbox instance in labs that's broken now too but doesn't care, yet :)
[17:45:17] icinga was done too, dunno if there's a test instance of that lying around
[17:45:25] (using the prod puppet manifest)
[17:45:31] other stuff likely to run into this:
[17:45:36] mailman/lists?
[17:45:43] mx*?
[17:46:10] for many cases, sharing roles/profiles between prod and some labs test instance works out fine, with hieradata and/or the alternate private repo sorting out data differences.
[17:46:49] but this is a little different, right? because CC is a service that won't exist commonly or by default over in some labs project, and we can't just do a hieradata change to work around that.
[17:47:08] we're adding a new infrastructure dependency
[17:47:13] right
[17:47:18] that infrastructure is possible to set up within labs with a little work
[17:47:26] it's not like we're saying "cannot be used without LVS" or something
[17:47:31] :)
[17:48:14] some of these problems where operations/puppet.git patches break puppet in labs instances are as easy as changing hiera
[17:48:30] some of them are "oh, you have to set up some infrastructure"
[17:48:49] so, another approach we could take with this is a higher level of abstraction, possibly
[17:49:14] some universal class like deploy::a::damn::cert { 'gerrit': .... }
[17:49:19] to properly test the production system it'd a good idea to try to follow the same infrastructure
[17:49:24] it'd be*
[17:49:58] which can go in roles/profiles, and hieradata can switch between production cases like CC and/or manual certs, and some test-junk variant which just generates a self-signed snakeoil cert or whatever works simplistically to get a labs test running.
[17:51:14] difficult to do if we're not prepared to leave branches in for http-01 handling etc.
[17:51:23] (and if they want real certs, then sure, they can go deploy a CC server in their project and set their project hieradata to use it, but it provides a non-broken baseline default for other cases)
[17:51:48] no http-01, I mean the default for a random labs instance really just issues a local self-signed cert
[17:52:28] actually, adding a self-signed-certs mode to certcentral is pretty easy
[17:53:05] it already does it to allow services to start if for some reason it fails to get the real one
[17:53:29] bblack: "maybe arzhel has some netbox instance in labs that's broken now too but doesn't care, yet :)" not sure I understand, my lab instance of netbox is not puppetized, and uses the *.wmflabs.org catch-all https
[17:53:31] so we could add a config variable and stop the process there
[17:53:39] (if that's the question/context)
[17:54:06] if we really just want self-signed certificates for this use case we could ignore certcentral entirely
[17:54:08] keep it very simple
[17:54:14] target node signs its own cert
[17:54:32] that too
[18:00:44] maybe something where our prod hieradata defines the real cert issuance, and if it's deployed in labs somewhere without any custom hieradata for that labs deployment (the default?) it would just use the local self-signed cert and be done, yeah