[00:19:31] volans: so this used to work: https://netbox.wikimedia.org/api/extras/topology-maps/1/render/ and this still works: https://netbox.wikimedia.org/api/extras/topology-maps/3/render/
[00:21:36] nevermind: https://github.com/digitalocean/netbox/issues/2745
[09:18:10] vgutierrez: are we mostly done with the certcentral rename now?
[09:18:23] mostly :)
[09:19:01] https://phabricator.wikimedia.org/T207389
[09:19:35] so I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/492744
[09:19:55] and fight a little bit with gerrit to create a new repo called acme-chief, and that should be it
[09:21:02] legoktm: remember me to get you a beer next time we meet in the same physical location ;P
[09:25:36] vgutierrez: yay! I assume you already know that we can't rename gerrit repos, just create new ones and copy everything over?
[09:25:47] indeed
[09:26:03] I'll create a new one and push the master and debian branches
[09:42:16] XioNoX: eh :(
[10:02:08] vgutierrez, also a version branch?
[10:02:33] hmm I pushed the tags as well yes
[10:03:15] https://gerrit.wikimedia.org/g/operations/software/acme-chief
[10:04:12] Acme-chief, Patch-For-Review: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[10:05:45] Krenair: BTW, I changed the new project merge policy to ff-only
[10:05:52] I hated those merge commits by gerrit
[10:06:00] cool
[10:06:13] the old one is already set to read-only
[10:06:22] but we will keep it there
[10:06:48] wonder where the phab mirror is
[10:07:15] oh maybe it doesn't get auto-created?
[10:08:20] um, vgutierrez
[10:08:37] the gerrit rights are a bit weird
[10:08:41] maybe I can fix this though, one sec
[10:08:50] weird, why?
[10:09:02] they're basically inherited from the certcentral repo
[10:09:24] yeah that's what's weird about them
[10:09:31] this repo should inherit from operations/software
[10:12:24] hm nope apparently only admins can fix this
[10:12:51] https://gerrit.wikimedia.org/r/#/c/operations/software/acme-chief/+/492983/
[10:14:17] we should create another group instead of keeping that certcentral one I guess
[10:15:15] Acme-chief, Patch-For-Review: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[10:37:36] sorted out https://phabricator.wikimedia.org/source/operations-software-acme-chief/
[10:38:06] cool thx
[11:08:01] Traffic, ExternalGuidance, Operations, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (dr0ptp4kt) Thanks, @BBlack, will give the heads up once the date is set. The "Desktop" link...
[11:44:32] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (akosiaris) `wtp1028-1030` can indeed be moved at anytime, provided they are shutdown gracefully. `ores1002` can indeed be moved at anytime, provided it is shutdown gracefully. `ganeti1008` This...
[13:28:50] Acme-chief, Patch-For-Review: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[13:43:00] Acme-chief, Patch-For-Review: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Vgutierrez)
[14:09:41] Krenair: right now the only references to certcentral in the puppet codebase are in the file hieradata/labs/deployment-prep/common.yaml
[14:09:51] Krenair: should I get rid of those or rename them to acme-chief ones?
[14:10:18] vgutierrez, get rid of them... I'll deal with it later, potentially not involving the puppet repo at all
[14:10:34] ack
[14:12:51] Krenair: https://gerrit.wikimedia.org/r/c/operations/puppet/+/493027
[14:22:50] Acme-chief: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389 (Vgutierrez) Open→Resolved
[14:22:54] * vgutierrez dances
[14:22:56] \o/
[14:26:06] \o/
[14:28:15] (59 commits later, the task is resolved)
[14:36:08] :)
[14:36:50] Acme-chief: Improve expected exceptions logging in certcentral - https://phabricator.wikimedia.org/T208326 (Vgutierrez) p:Triage→Low
[14:38:06] Acme-chief: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295 (Vgutierrez) p:Triage→High a:Vgutierrez
[14:39:47] Acme-chief: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295 (Vgutierrez)
[14:39:49] Acme-chief, Traffic, Operations, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez)
[14:41:20] Acme-chief, Traffic, Operations, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Krenair) Possible other subtasks: {T207294} {T203423}?
[14:42:19] Krenair: regarding T207294, we already have icinga checks that make sure that acme-chief-backend and acme-chief-api are working
[14:42:20] T207294: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294
[14:42:47] vgutierrez, sure but presumably we need icinga checks covering stuff like cert status?
[14:42:48] and that replication of the certificates between the master and the slave nodes is being done
[14:43:09] cert status check is being performed by the users of each certificate
[14:43:13] ok
[14:43:22] what else do we need to add?
[14:43:27] of course we could do the check on the acme-chief side as well
[14:43:35] but it's kinda redundant
[14:43:45] so far, nothing else
[14:46:17] Acme-chief, Icinga, Operations, monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (Vgutierrez) Open→Resolved a:Vgutierrez
[14:46:47] vgutierrez, ok
[14:46:52] vgutierrez, what about revocation?
[14:47:15] yeah, I think we can provide that as part of this quarter goal
[14:57:23] bblack: I'm thinking about T207295, there is no actual need of performing the OCSP stapling N times on the clients, right?
[14:57:24] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[15:16:35] hmmm so... I'm thinking the following, now that we can stage a certificate for a configurable amount of seconds before it's copied to the live_certs directory, we can deploy those adding the suffix .new to those files
[15:16:49] that way the clients can play with the new certs as they want
[15:16:57] performing OCSP stapling, and so on
[15:17:30] at the same time, I believe that acme-chief should be able to offer OCSP stapling without requiring it to be done by the clients
[15:17:43] of course is brand new code, nobody trusts it and the current one just works(TM)
[15:17:58] so after implementing exposing the new certs to the clients
[15:18:13] I'll implement and monitor OCSP stapling on acme-chief side
[15:18:33] and I'd just keep it running & being monitored till we gain some trust on it
[15:18:51] hmm gain trust or build trust?
[15:18:54] E_ENGLISH
[15:56:38] hmmm nginx 1.15.9
[15:56:45] this could become handy at some point
[15:56:49] Feature: variables support in the "ssl_certificate" and
[15:56:49] "ssl_certificate_key" directives.
[16:01:48] vgutierrez: if we're doing OCSP on the acme server side, are we doing it there forever?
[16:02:13] bblack: it looks more efficient
[16:02:23] it is I think, I'm just trying to understand the model
[16:02:33] so the proposed model is:
[16:02:51] 1) acme-chief is responsible for keeping up-to-date staples for all valid certs it has
[16:03:19] 2) clients pull ocsp staple data like they pull keys/certs (meaning they'll routinely get new OCSP data via a puppet run for a file resource)
[16:04:14] 3) clients don't actually need their own ocsp stapler script/infra (probably want to make sure the deploy paths for acme-derived ocsp differ from the existing script, so they mix well in cases where we still have legacy commercial certs + update-ocsp script)
[16:04:40] right?
[16:04:51] on the long term yes
[16:05:24] short term acme-chief will also offer the new cert before being deployed as the live cert to let the clients perform OCSP stapling in advance
[16:05:34] aka T207295
[16:05:35] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[16:05:37] that's going to be tricky to puppetize
[16:05:43] (the short term plan)
[16:05:47] I don't think so
[16:05:54] ok
[16:06:01] we just deploy those as /etc/acmecerts/blabla.new
[16:06:23] but the staple data prefetching is only useful if it matches both the .new path and the final path
[16:07:01] but you can fix that in update-ocsp config I guess
[16:07:30] Certificates=/etc/acmecerts/blabla.new; Output=/var/cache/ocsp/blabla.ocsp
[16:07:52] and then when it moves to live use, the Certificates config gets updated there, but the ocsp path is already correct?
[16:08:03] but that all assumes this isn't a renewal where the "blabla" is unchanging
[16:08:11] hmmm
[16:08:57] hmm I see your point
[16:10:21] I mean in theory this is a trivial problem, but the fact that the interface is puppet file stanzas makes it a bit tricky
[16:10:22] that could get sorted out via update-ocsp though, cause it's triggered on every nginx reload, right?
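To make the renewal concern raised at 16:08:03 concrete, here is a minimal Python sketch. It is not taken from update-ocsp. It treats each config entry as a (certificate, output) pair and reports any OCSP output path that more than one certificate would write to. The helper name `output_collisions` and the live path `/etc/acmecerts/blabla` are assumptions made for illustration; only the `.new` path and the `.ocsp` output path appear in the discussion.

```python
from collections import defaultdict


def output_collisions(entries):
    """Return OCSP output paths that more than one certificate would write to.

    `entries` is an iterable of (certificate_path, ocsp_output_path) pairs,
    a hypothetical stand-in for parsed update-ocsp style config entries.
    """
    by_output = defaultdict(list)
    for cert, output in entries:
        by_output[output].append(cert)
    return {out: certs for out, certs in by_output.items() if len(certs) > 1}


# Renewal case: the live cert (assumed path) and the staged ".new" cert
# share the unchanging name "blabla", so both map to the same staple file.
renewal = [
    ("/etc/acmecerts/blabla", "/var/cache/ocsp/blabla.ocsp"),
    ("/etc/acmecerts/blabla.new", "/var/cache/ocsp/blabla.ocsp"),
]

for output, certs in output_collisions(renewal).items():
    print(output, "is shared by", ", ".join(certs))
```

On a first issue there is no live `blabla` yet, so the list has a single entry and the function returns an empty dict. The collision only shows up when a renewal reuses the name.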
[16:10:50] other way around, update-ocsp runs from cron (or gets triggered by puppet), and it reloads nginx
[16:11:03] checking that the configured cert gets the expected OCSP stapling data
[16:11:16] hmmm
[16:11:35] but nginx failed to start in some nodes cause we had issues with the update-ocsp-all due to a expired certificate
[16:11:40] (this was a few weeks ago)
[16:11:49] failed to reload?
[16:12:04] failed to start in cp4025 after a node restart IIRC
[16:12:59] ah right we inject an ExecStartPre
[16:13:06] ExecStartPre=/usr/local/sbin/update-ocsp-all
[16:13:17] right, so that's executed only on nginx start but not on reload, right?
[16:13:27] that's to make sure if the host was offline for days/weeks, it gets fresh ocsp before letting nginx start exposing things to users
[16:13:30] right
[16:13:32] ack
[16:13:47] normally it's the other way around, and update-ocsp is run by puppet or cron and it tells nginx to reload after
[16:15:43] what makes the existing scheme with manual commercial certs + update-ocsp-all + etc... work out ok in practice right now, is we don't reuse certificate pathnames
[16:16:25] every time we do a renewal, it's a new name (e.g. digicert-2017.pem then digicert-2017.pem), so the certs, keys, and ocsp outputs can all co-exist, and we decide which one is in live use by changing the nginx config to point at one or the other of the fs-deployed and stapled certs.
[16:16:31] err
[16:16:38] digicert-2017.pem then digicert-2018.pem
[16:16:47] right
[16:17:02] we could work in something like that for acme-chief
[16:17:07] but doing that for the acme case with the current puppetization is probably challenging, since it hooks in via file
[16:17:22] that would solve the race condition I mentioned to you during the all hands
[16:17:42] but yeah, in theory you could rework the puppetization a bit to put e.g. datestamps in cert names
[16:18:10] yep... I'll give a think to that
[16:18:13] thx :D
[16:18:49] (don't ask me how though, that's tricky, since the puppet layer doesn't really know the set of datestamps for the set of certs that are overlapping on a renewal or whatever)
[16:19:22] or come up with some other way around this mess
[16:20:01] we didn't have OCSP for the old acme stuff either of course
[16:20:09] only for those manual unified certs
[16:20:39] another improvement of using acme-chief \o/
[16:20:54] :)
[16:21:42] of course OCSP stapling should be optional
[16:22:02] paravoid: mentioned me a bug regarding OCSP in our mx servers
[16:22:18] I think at the acme-chief level it needs to be, because there are cases yeah like MX servers that might not even be capable
[16:22:26] indeed
[16:22:48] although, you could also argue that it doesn't hurt to staple everything to the filesystem level, and just not configure stapling where you don't need it.
[16:23:15] as in the MXes would still get a /some/where/mx2001.wikimedia.org.ocsp pulled and sent to them, but the server software might not make use of it.
[16:23:49] it actually has some upside, because otherwise we have no monitoring of whether the ocsp is failing because the cert has actually been revoked
[16:24:24] (if we don't give OCSP stapled, end-clients might still be checking OCSP for themselves unstapled and failing)
[16:25:37] [also, the words "server" and "client" are always very confusing when discussing anything related to acmechief puppetization]
[16:26:07] yes
[16:26:20] I need to get some motivation to write some documentation and come with proper nomenclature
[16:26:37] * vgutierrez blames himself
[16:28:26] hmmm, thinking a bit about the cert-naming problem
[16:29:03] you could replace the puppet-level file resource with one that's actually a directory that's synced? but I don't know if acmechief implements that
[16:29:16] I was thinking exactly the same
[16:29:22] and checking the API to see if that's possible
[16:29:38] in that case, we'd need to implement that on the acme-chief-api
[16:29:52] yeah it's tricky too
[16:30:09] or, we might be able to fix this with just an A/B thing hardcoded
[16:30:42] e.g. cert puppetization always has a "foo-a" and "foo-b", and on first issue they both put out the same file, but on renewals you're flipping the a/b part around.
[16:30:55] I don't know, there's something missing along that path too
[16:31:39] or get rid of the fs-emulation part and just deploy an acmechief-client script that puppet executes, which pulls whatever it needs to directly via HTTP reqs to the acmechief server
[16:32:00] (that would allow datestamped names easier, anyways)
[16:32:19] apparently with something like file { '/etc/acmecerts/mx': ensure => directory, recurse => true, purge => true, source => puppet://....}
[16:32:40] so all the files for the mx cert would be dropped in a directory called /etc/acmecerts/mx
[16:34:02] so on the acme-chief-api side I'm assuming we'd need to implement some sort of ls operation
[16:34:15] to be able to provide with a list of files in the so-called directory
[16:39:03] bblack: something like https://puppet.com/docs/puppet/4.8/http_api/http_file_metadata.html#basic-search
[16:48:48] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (RobH) I neglected to update this task with the email I sent out yesterday. > This task is being tracked on: https://phabricator.wikimedia.org/T212348 > > TL;DR: If you got this email, look...
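The datestamped-names idea from the 16:16 to 16:34 exchange can be sketched in a few lines of Python. This is an illustration under stated assumptions, not acme-chief or update-ocsp code: the file names (`mx-20190226.pem` and so on), the `.pem`/`.key` suffix convention, the `staple_plan` helper and the default directories are all hypothetical. The `filenames` argument stands in for whatever the "ls operation" mentioned above would return for the synced directory.

```python
from pathlib import PurePosixPath

# Hypothetical layout chosen for illustration only.
CERT_DIR = PurePosixPath("/etc/acmecerts/mx")
OCSP_DIR = PurePosixPath("/var/cache/ocsp")


def staple_plan(filenames, cert_dir=CERT_DIR, ocsp_dir=OCSP_DIR):
    """Pair each datestamped certificate with its own OCSP output path.

    Only entries ending in ".pem" are treated as certificates; anything
    else in the listing (keys, for example) is skipped.
    """
    plan = {}
    for name in sorted(filenames):
        if not name.endswith(".pem"):
            continue
        stem = PurePosixPath(name).stem
        plan[str(cert_dir / name)] = str(ocsp_dir / f"{stem}.ocsp")
    return plan


# A renewal overlap: the old and new certs coexist under distinct names.
listing = ["mx-20190226.pem", "mx-20190527.pem", "mx-20190226.key"]
for cert, staple in staple_plan(listing).items():
    print(cert, "->", staple)
```

Because every certificate name is unique, each one gets its own staple file, so an old and a new cert can both be stapled ahead of time. Choosing which of them is live would still be a separate config decision, as in the digicert-2017/digicert-2018 scheme described in the log.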
[16:59:36] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (RobH) So, the list in my email is too long, and some of those hosts were previously moved by Chris in advance of my email (likely well in advance, awhile ago, via independent projects.) The act...
[18:08:04] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (RobH) Ok, @ayonsi double checked my ports and migration plan on the gsheet, and I've started to make the needed port changes (he setup ganeti1008 for me as its a special case.) ` robh@asw2-a-e...
[18:17:05] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (RobH) So I neglected to remove those above from disabled group, did so in next update and removed all the others that were also in disabled but need to be used for this: ` robh@asw2-a-eqiad# s...
[18:30:58] vgutierrez: if you still around https://gerrit.wikimedia.org/r/c/operations/puppet/+/493063
[20:02:00] netops, Operations, ops-eqiad: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (ayounsi) >>! In T212348#4985425, @RobH wrote: > I thought it was odd some of them that I ran the command to remove from disabled, turned out to not be in use, but not disabled, like xe-7/0/37....
[22:23:30] netops, Operations, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) eqsin, the private-peer term has been removed a while back to do traffic engineering specific to this site. `lang=diff [edit policy-options policy-statement BGP_c...
[22:27:47] netops, Operations, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) ~80Mbps traffic shift to transit too.
[22:33:12] Acme-chief, Traffic, Operations, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Krenair)
[22:33:15] Acme-chief, Traffic, Operations: certcentral: Provide script for certificate revocation - https://phabricator.wikimedia.org/T203423 (Krenair)
[22:49:50] netops, Operations: ulsfo <-> router1.corp BGP sessions down - https://phabricator.wikimedia.org/T217207 (ayounsi) p:Triage→High