[06:23:41] netops, Operations: AS36351 BGP session down on cr2-eqiad - https://phabricator.wikimedia.org/T229085 (elukey) p:Triage→Normal
[08:32:59] looks like the puppet failures in cp-eqsin are due to missing files
[08:33:01] unified]/Acme_chief::Cert[unified]/Exec[unified-new-rsa-2048-create-ocsp]/returns) x509: Cannot open input file /etc/acmecerts/unified/new/rsa-2048.crt, No such file or directory
[08:33:15] ema: ^
[08:35:59] godog: mmh, any recent commit that might explain that?
[08:36:32] it doesn't look like
[08:37:02] no idea, I took a deeper look out of curiosity
[08:37:28] so, eqsin is the only DC with profile::cache::ssl::unified::acme_chief: true
[08:37:47] which helps explaining why the issue affects eqsin only
[09:22:09] acme-chief disabled in cp-eqsin, running puppet now
[09:22:18] Traffic, Operations, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) This happens at the same time that the unified cert is being renewed: ` Jul 26 08:00:02 acmechief1001 acme-chief-backend[8198]: Number of certific...
[09:22:34] yup, it looks like there is some kind of issue with the new certificate staging period
[09:22:48] the unified cert should be moved from staged in new to live in 1 week
[09:23:10] ema: puppet is happy now?
[09:23:22] vgutierrez: indeed
[09:23:40] nice.. so I'm take a shower cause I stink after 147km on a bike (heavy rain included)
[09:23:45] *taking
[09:23:48] and then I'll debug this
[09:23:50] see you soon
[09:23:54] vgutierrez: thanks for showing up! Enjoy the rest of your vacations :)
[09:24:37] and thanks godog too
[09:29:55] np!
[09:41:41] ohh... I think I already know what's the issue :)
[09:42:26] right
[09:43:13] so before the certificate is considered live, acme-chief only keeps the chained version of each cert: ec-prime256v1.chained.crt and rsa-2048.chained.crt
[09:43:42] and it generates the chain only version and cert only version when it's moved to live
[09:44:07] at the same time, our lovely update-ocsp script needs the cert only version (rsa-2048.crt and ec-prime256v1.crt)
[09:44:14] so that's why it's failing
[09:46:59] Traffic, Operations, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) update-ocsp is configured to use the certificate only version to perform the OCSP stapling: `` vgutierrez@cp5001:/etc/update-ocsp.d$ cat unified-n...
[09:50:43] Traffic, Operations, Patch-For-Review: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) p:Normal→High This is a big issue, cause right now due to the invalid state of update-ocsp/acme-chief, nginx cannot be restarted in the cp...
[10:38:04] Traffic, Operations: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) p:High→Normal So, I've manually generated the missing versions on acmechief1001: ` >>> cert = Certificate.load('/var/lib/acme-chief/certs/unified/new/rsa-2048.c...
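Editor's note: the failure described above comes down to a staged certificate existing only as a chained file (for example rsa-2048.chained.crt) while update-ocsp expects the cert-only file (rsa-2048.crt). As a rough illustration of how the cert-only and chain-only versions relate to the chained one, here is a minimal standard-library sketch. It is not acme-chief's implementation and does not use its `Certificate` class; the function name `split_chained_pem` is hypothetical, and the sketch assumes the chained file lists the leaf certificate first, followed by the intermediates.

```python
import re
from typing import Tuple

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_chained_pem(chained_pem: str) -> Tuple[str, str]:
    """Split a chained PEM bundle into (cert_only, chain_only).

    Hypothetical helper, not part of acme-chief. Assumes the leaf
    certificate comes first and the intermediates follow it.
    """
    blocks = PEM_CERT_RE.findall(chained_pem)
    if not blocks:
        raise ValueError("no PEM certificate blocks found")
    # The first block is the leaf certificate (the "cert only" version).
    cert_only = blocks[0] + "\n"
    # Everything after it is the issuing chain (the "chain only" version).
    chain_only = "".join(block + "\n" for block in blocks[1:])
    return cert_only, chain_only
```

Under these assumptions, feeding the contents of a chained file to `split_chained_pem` yields the text for the two files that were missing during the staging period. The block contents are treated as opaque, so nothing is validated cryptographically.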
[10:40:00] Acme-chief, Traffic, Operations: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (Vgutierrez)
[10:42:04] Traffic, Operations: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (Vgutierrez)
[10:43:27] Acme-chief, Traffic, Operations: Provide ensure => absent support for acme_chief::cert define - https://phabricator.wikimedia.org/T229097 (Vgutierrez) p:Triage→Normal
[13:39:09] Traffic, DC-Ops, Operations, decommission, ops-eqiad: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH) a:RobH→None
[14:04:49] Traffic, DC-Ops, Operations, decommission, ops-eqiad: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH)
[14:24:23] Traffic, Operations: acme-chief failing in puppet with "Cannot open input file" - https://phabricator.wikimedia.org/T229091 (Vgutierrez) As it's been done with unified, wikibase required the same patch: ` >>> from acme_chief.x509 import Certificate >>> from acme_chief.acme_chief import CERTIFICATE_TYPES...
[14:43:24] Traffic, DC-Ops, Operations, decommission, ops-eqiad: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH)
[14:45:57] Traffic, DC-Ops, Operations, decommission, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH) a:ayounsi @ayounsi, Per your request, we are assigning this to you for the switch configuration removal for lvs100[1-6]. All of the systems h...
[15:26:24] Traffic, DC-Ops, Operations: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (RobH)
[16:26:55] Acme-chief, Traffic, Operations: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (herron) p:Triage→Normal
[17:48:06] Traffic, Operations, Core Platform Team (Services Operations): Have Varnish set the `X-Request-Id` header for incoming external requests - https://phabricator.wikimedia.org/T221976 (WDoranWMF)
[18:45:23] Traffic, DC-Ops, Operations: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (ayounsi) Open→Resolved lvs100[1-6] removed from switches.