[00:12:26] netops, Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (ayounsi) a:ayounsi ` Apr 16 23:20:49 cr4-ulsfo kernel: spin lock 0xfffff80012ce73c0 (turnstile lock) held by 0xfffff8000941d560 (tid 100012) too long Apr 16 23:20:49 cr4-ulsfo kernel: panic: spin lock h...
[00:12:38] netops, Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (ayounsi) p:Triage→Normal
[03:11:17] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (Andrew)
[08:33:46] Krenair: I just pushed 3 CRs: https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/504510/ moves ACMEChiefConfig to acme_chief.config, https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/504511/ moves DNS queries to acme_chief.dns
[08:34:58] https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/504512/ it's the main one, adds new two settings for a certificate, the intent here is using it for the non-canonical-redirect certificates, and skip no longer controlled domains without breaking the renewal by trying to get a certificate that's impossible to validate
[08:35:54] that will be good for the rate limit control as well, as LE imposes a hard limit on failed challenges
[08:37:36] Krenair: so when you have the chance, let me know what do you think about those and any potential concern, thanks!
[08:37:44] * vgutierrez switching focus to ATS now
[10:24:54] Traffic, Operations: Allow running several ATS instances in the same server - https://phabricator.wikimedia.org/T221217 (Vgutierrez)
[10:25:18] Traffic, Operations: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez)
[10:25:20] Traffic, Operations: Allow running several ATS instances in the same server - https://phabricator.wikimedia.org/T221217 (Vgutierrez)
[13:41:51] Traffic, Operations: Allow running several ATS instances on the same server - https://phabricator.wikimedia.org/T221217 (ema) p:Triage→Normal
[13:56:30] netops, Operations, fundraising-tech-ops: configure switch ports for frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T221232 (Jgreen)
[14:24:17] vgutierrez, thanks, will look soon
[14:24:27] vgutierrez, one thing came up last night you may be interested in - https://phabricator.wikimedia.org/T221171
[14:25:03] am testing an acme-chief cert for the beta unified cert and some java clients have started having problems. might be related, might not
[14:25:33] hmm the server is delivering the proper certificate chain?
[14:25:43] cert + intermediate CA?
[14:26:44] yes
[14:27:01] what's the endpoint that the faulty client is trying to connect to?
[14:27:05] 0 s:/CN=*.wikimedia.beta.wmflabs.org
[14:27:06] i:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
[14:27:06] 1 s:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
[14:27:06] i:/O=Digital Signature Trust Co./CN=DST Root CA X3
[14:27:24] one of them is using https://commons.wikimedia.beta.wmflabs.org/
[14:27:51] main one is https://deployment.wikimedia.beta.wmflabs.org
[14:29:00] well
[14:29:10] wasn't there just a root switch announcement?
[14:29:25] maybe those "some java clients" lack the newer root?
[14:29:55] oh it's just a pre-announcement with future dates, nevermind me
[14:29:59] that's from July 8th
[14:30:08] I saw the headline earlier but hadn't read the post :)
[14:30:11] https://letsencrypt.org/2019/04/15/transitioning-to-isrg-root.html
[14:30:13] and we're serving the DST intermediate, nothing with ISRg
[14:30:29] ssllabs is happy about the certificate chains for ECDSA+RSA certs even for Java clients
[14:31:08] yeah, thing is at this point it's multiple people across multiple different devices/OSes
[14:32:21] hmm right
[14:32:26] This Update: Mar 29 09:00:00 2019 GMT
[14:32:26] Next Update: Apr 5 09:00:00 2019 GMT
[14:32:42] it looks like it's serving a stalled OCSP response
[14:33:07] ah there we go
[14:33:09] 'OCSP STAPLING ERROR: OCSP response expired on Fri Apr 05 09:00:00 UTC 2019 '
[14:33:18] well that would explain one of the errors seen
[14:33:18] what's doing the stapling there?
[14:33:31] it should be our old friend update-ocsp
[14:33:34] should be the standard puppetisation, let's see
[14:34:17] ssl_stapling_file /etc/acmecerts/unified/live/rsa-2048.client.ocsp;
[14:34:17] ssl_stapling_file /etc/acmecerts/unified/live/ec-prime256v1.client.ocsp;
[14:34:47] both of which were modified today
[14:35:06] hm I'm not sure what tools interest with these stapling files
[14:35:09] interact
[14:35:46] FWIW, in production OCSP stapling seems happy for wikiba.se
[14:35:53] https://www.irccloud.com/pastebin/nqdYxLbk/
[14:36:30] Krenair: the standard puppetization does a cronjob to update the staple files routinely
[14:36:30] Krenair: so, in the production cluster we have a cronjob prefetching OCSP stapling responses every 12 hours using update-ocsp-all
[14:36:40] after ending the process, nginx MUST be reloaded
[14:36:55] but that's handled by update-ocsp-all itself
[14:37:02] if stapling is failing, the cronjob should be failing it's exit code and/or leaving behind traces of bad responses in the ocsp directory
[14:37:12] there is this root cronjob: 44 6,18 * * * /usr/local/sbin/update-ocsp-all 2>&1 | logger -t update-ocsp-all
[14:37:22] try running it manually as root and see what happens?
[14:37:36] (without the redirect and logger)
[14:37:37] Krenair: can you check if /etc/update-ocsp.d/hooks/nginx-reload is there?
[14:38:09] prints nothing, return code 0
[14:38:22] and the .ocsp files got touched
[14:38:38] ah, it's not: cat: /etc/update-ocsp.d/hooks/nginx-reload: No such file or directory
[14:38:40] ok
[14:38:43] that's the culprit
[14:38:45] if you reload nginx manually
[14:38:49] we should get proper OCSP stapling
[14:40:26] Looks like I need to set do_ocsp but it has this 'Does not work for ACME (letsencrypt) yet!'
[14:43:42] so I failed to mimick the hook setup in acme_chief::cert
[14:44:16] it isn't a big problem for the production cluster because it's already there cause it's needed for the goold-old unified cert
[14:44:40] but in your environment, where the only certs are acme-chief based, the system lacks de hook
[14:44:44] s/de/the
[14:45:20] ah
[14:46:30] yeah the puppet manifest is not right for this setup
[14:47:30] Notice: /Stage[main]/Tlsproxy::Ocsp/Sslcert::Ocsp::Hook[nginx-reload]/File[/etc/update-ocsp.d/hooks/nginx-reload]/ensure: defined content as '{md5}e60d4922fc9d2c78ee2697b0e41f7f89'
[14:48:53] vgutierrez, FYI I did manually run the monitoring command for this before and it looked finne
[14:48:54] fine
[14:49:33] wonder if it should check nginx got reloaded properly
[14:49:47] so there should be two checks in place
[14:49:48] okay ssllabs is happy now
[14:49:55] one that checks that the OCSP stapling response file is fresh
[14:50:13] and another one that performs the TLS handshake and checks that the served OCSP stapling response is fresh
[14:51:24] check_ssl_unified_sni_letsencrypt should do that
[14:51:31] $USER1$/check_ssl --warning 15 --critical 7 -H $HOSTADDRESS$ -p 443 --ocsp must-staple --authalg '$ARG1$' --cn '$ARG2$' --sans '$ARG3$'
[14:51:47] ugh right
[14:51:56] * Krenair really needs to fix up monitoring
[14:52:13] and that's the one that we're using for wikiba.se in the production environment
[14:53:50] btw, I think re: https://letsencrypt.org/2019/04/15/transitioning-to-isrg-root.html , we may need to add the new intermediates in some place(s)
[14:54:08] IIRC we have the existing X3/X4 intermediates puppetized somewhere to build chains with or verify against or something
[14:54:47] yeah
[14:54:52] that's right
[14:54:55] modules/acme_chief/manifests/init.pp and modules/letsencrypt/manifests/init.pp
[14:54:59] have sslcert::ca stuff
[14:55:44] sounds like our first renewals after that date are going to be fun
[14:55:52] Krenair: hmm require doen't need the if !defined wrap IIRC
[14:56:04] *doesn't
[14:56:14] I think the defined check is because of acme_chief vs letsencrypt modules?
[14:56:31] (can we kill the old letsencrypt module yet?)
[14:56:57] once my all of my changes to get beta running on acme-chief are done sure
[14:57:06] ok
[14:57:14] right now it's running on cherry-picks
[14:57:37] the defined check is to ensure we don't try to require the same class twice but vgutierrez might be right that's fine
[14:59:59] bblack, oh you might have some fun getting people to drop these ones though:
[15:00:00] modules/profile/manifests/toolforge/mailrelay.pp: letsencrypt::cert::integrated { $cert_name:
[15:00:06] modules/toolserver_legacy/manifests/init.pp: letsencrypt::cert::integrated { 'toolserver':
[15:00:15] well, relatedly, we have a general set of intermediates defined in the base module too
[15:00:29] modules/profile/manifests/base/certificates.pp
[15:00:54] arguably the X3/X4 intermediates from acme_chief/letsencrypt, and the newly-rooted ones to come, should all go there
[15:01:02] also profile::mail::smarthost seems to unconditionally use the old LE module
[15:01:38] I think they were just in the letsencrypt module before to keep all that work separate, but at this point it's just one of our standard intermediates we should probably have everywhere, and will help smooth over any issues with other clients in our infra that lack the new root somehow.
[15:07:59] Krenair: running PCC as we speak against 504571
[15:08:06] thanks
[15:15:37] https://puppet-compiler.wmflabs.org/compiler1002/15856/
[15:17:34] so
[15:17:57] actually in some cases this is fixing stuff missing in prod?
[15:18:12] oh no because those will still get it all from unified
[15:18:14] ok
[15:18:16] indeed
[15:18:58] it's not being added as a new resource in their catalogs
[15:19:14] nope, only the relationship is added
[15:19:50] yeah
[15:25:26] Krenair: if you're not in a rush I'll merge that tomorrow EU morning
[15:27:43] just to play on the safe side of things
[15:30:06] no rush
[15:31:49] ack
[15:32:41] Krenair: BTW, when you have the time, please rebase https://gerrit.wikimedia.org/r/c/operations/puppet/+/501461 to get the latest changes in production
[15:45:00] I'm getting git fatal errors trying to do so, asking in #git
[15:58:52] https://phabricator.wikimedia.org/P8415
[15:59:28] * vgutierrez trying to reproduce
[16:01:30] https://www.irccloud.com/pastebin/21XXUwba/
[16:01:43] different versions of git?
[16:01:52] 2.21.0 here
[16:02:09] 2.17.1
[16:03:08] should I push the rebase for you Krenair? :)
[16:03:19] I can leave it like that if you want to debug any further though
[16:04:39] vgutierrez, if you could do the rebase that'd be great
[16:04:54] all the debugging on my end could still be done even if you push a new patchset
[16:05:00] ack
[16:08:01] 2.19.0 git release notes seem to fix a couple of crashers in specific situations for "git pull -r" and "git rebase"
[16:08:28] in general from glancing at their release notes, it doesn't look like they backport non-security fixes very far if at all
[16:08:56] (those fixes never made it to a 2.18.N or 2.17.N release, for instance)
[16:09:20] * "git rebase -i", when a 'merge ' insn in its todo list
[16:09:20] fails, segfaulted, which has been (minimally) corrected.
[16:09:48] (although maybe these fixes are for code that only exists in 2.19, I donno)
[16:10:02] * "git pull --rebase -v" in a repository with a submodule barfed as
[16:10:02] an intermediate process did not understand what "-v(erbose)" flag
[16:10:03] meant, which has been fixed.
[16:10:20] * "git pull --rebase" on a corrupt HEAD caused a segfault. In
[16:10:20] general we substitute an empty tree object when running the in-core
[16:10:23] equivalent of the diff-index command, and the codepath has been
[16:10:25] corrected to do so as well to fix this issue.
[16:10:29] ^ those were the 3 in 2.19 that stood out as possibles
[16:12:51] netops, Operations, fundraising-tech-ops: configure switch ports for frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T221232 (ayounsi) Open→Resolved a:ayounsi
[16:13:24] Well I'm not using -v
[16:13:30] And I don't think it's a segfault?
[16:15:26] no idea :)
[16:15:38] could dig deeper, but probably the answer is just upgrade
[16:15:46] what is the fatal output?
[16:16:07] oh I see, pasted earlier
[16:17:41] try "git fsck"?
[16:18:41] git fsck --no-dangling might reduce pointless output noise if looking for real issues
[16:18:55] dunno if fsck will do much on a fresh repo but ok
[16:19:07] yeah who knows
[16:20:23] interestingly I can rebase on top of FETCH_HEAD just fine
[16:20:56] But if I checkout f94d2f8c09 again and `git pull --rebase origin production` it complains about that bad object
[16:35:07] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn)
[16:39:01] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) time frame 16:27 UTC, 12:27 PST: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 12:27 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqor...
[16:40:16] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) Indeed, just got a notification: "We have an outage which is suspected to be caused by a cable fault. Our NOC is investigating and activating local resources. We will provide more informati...
[16:44:26] bblack, I guess we should make a task about getting rid of the old LE puppet module
[16:48:35] yeah, low priority, but it's nice to clean up after ourselves
[17:33:41] Traffic, Operations: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair)
[17:33:50] Traffic, Operations: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair) p:Triage→Low
[17:34:02] done ^
[17:35:00] Traffic, Operations: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair) ` modules/archiva/manifests/proxy.pp: # regsubst is needed due to letsencrypt::cert::integrated's naming modules/profile/manifests/gerrit/server.pp: letsencrypt::cert::integra...
[17:37:21] Traffic, Operations: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Paladox) I use " modules/profile/manifests/gerrit/server.pp: letsencrypt::cert::integrated { 'gerrit':" for gerrit.git.wmflabs.org and gerrit.gerrit.wmflabs.org as the acme service does not work in WMCS...
[17:38:03] Traffic, Operations: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair) It does work in WMCS, with some puppet cherry-picks and some credentials generated by WMCS admins to allow modification of designate DNS records from within instances.
[17:52:19] Krenair: the main blocker for wmcs is the designate script review, right?
[17:52:30] regarding acme-chief
[17:54:29] vgutierrez, well, sort of
[17:54:35] Unfortunately WMCS doesn't quite have the capabilities of AWS
[17:54:53] can't just make an instance and hand it a role that lets it modify DNS
[17:55:11] ohh ok
[17:55:51] gotta register a user on wikitech - which right now is not self-service - get it added to a list in puppet.git, get it added to a certain role that gives it DNS modification abilities
[17:57:02] and on top of that you have to be sure you can trust everyone who has access to the hosts containing these credentials, it's not within the standard project membership rights
[17:57:27] From a puppet standpoint the list of blockers is https://phabricator.wikimedia.org/T182927#5087304
[17:57:56] Plus I imagine most people would not care for running it with a standby host
[17:58:56] other major problem is that it relies upon puppet certs which of course is useless under an autosigning puppetmaster
[18:00:07] I do wonder if we should just end the autosigning of puppet certificates within labs
[18:00:40] We already have the mechanism to clean up certs from the central puppetmaster when an instance is deleted, why not the reverse?
[18:03:32] within deployment-prep we have an non-autosigning project puppetmaster, we have the cherry-picks on that, and we have a user with the correct rights at the openstack keystone/designate level
[18:03:55] are you following https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497430/?
[18:04:09] with a separate role for labs I mean
[18:05:21] vgutierrez, I do wonder if we should just add the profile and tell people to add the profile directly
[18:05:26] well, add the profile to puppet.git, and tell people to apply the profile to instances directly
[18:06:39] hmm according to the puppet style guide you should apply a role to an instance
[18:06:56] I don't have strong opinions for labs tough
[18:07:08] you do know it way better than me
[18:07:39] heh, good luck enforcing the puppet style guide over how people configure labs instances
[18:07:53] may just end up with more labs instances that run puppet but have everything on them unpuppetised
[18:08:10] I'll add a role
[18:15:14] ack.. I'll go through that list of CRs tomorrow morning
[18:16:45] thanks
[18:47:25] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (colewhite) p:Triage→High
[19:00:39] Traffic, Operations, Patch-For-Review: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Krenair)
[20:17:38] Traffic, Operations, Patch-For-Review, Puppet: Remove old letsencrypt puppet module - https://phabricator.wikimedia.org/T221268 (Peachey88)
[20:34:21] Traffic, DNS, Mail, Operations: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (Krenair)
[20:42:29] Traffic, DNS, Mail, Operations: wiki-mail DKIM failing - https://phabricator.wikimedia.org/T221290 (Krenair)
[20:43:29] Traffic, DNS, Mail, Operations: Phabricator SPF record contains internal addressing for phab[12]001 - https://phabricator.wikimedia.org/T221288 (Krenair) {T216714} may be related here
[22:21:41] HTTPS, Traffic, Operations, Goal, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (Dzahn) re: the next checkbox above " Prioritize which "junk" domains should be in the primary (works for non-SNI) S...
[22:37:09] netops, Operations: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (Dzahn) " Field tech isolated the fault location and is en route to perform a survey of the damage. " Wed, 17 Apr 2019 23:04 : Field tech is still working at the site to run OTDR and isolate the lo...
[23:19:03] vgutierrez: is it known that the icinga.wm.org cert will expire in 9 days? seems soon for normal LE renewals
[23:19:16] and LE did send a mail about it
[23:19:56] think it expires in 39 days mutante
[23:20:30] I'm guessing LE emailed about the previous cert that didn't get renewed
[23:20:36] and is no longer in use
[23:21:51] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (production)$ openssl s_client -connect icinga.wikimedia.org:443 2>&1 | openssl x509 -noout -enddate
[23:21:51] notAfter=May 27 15:58:05 2019 GMT
[23:22:00] Krenair: indeed. May 27, 2019
[23:22:08] i was just in my inbox
[23:22:18] it must be like you said, about old cert
[23:23:35] if you want to be really sure it's possible the old cert is still on the box
[23:25:52] with something like openssl x509 -in /etc/acme/cert/icinga.crt -noout -enddate
[23:27:21] oh, other possible explanation is that's a cert from an old version of the acme-chief service named certcentral
[23:27:45] might find it at /etc/acmecerts/icinga.rsa-2048.crt
[23:28:38] or /etc/centralcerts/icinga.rsa-2048.crt
[23:28:49] it's moved around a bit lately, should be stable from here on though hopefully
[23:28:53] there is /etc/acmecerts/icinga/ but not thaat ^
[23:29:24] 'live' and 'new' link to the same target
[23:29:49] that's normal unless it's part-way through the issuance process
[23:30:08] yes, i am just confirming it looks like one cert.. the new one
[23:30:23] openssl x509 -in /etc/acme/cert/icinga.crt -noout -enddate
[23:30:23] notAfter=Feb 11 16:28:51 2019 GMT
[23:30:30] this is a third one
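
The expiry comparison at the end of the log can be reproduced from the two `notAfter=` lines pasted above. What follows is a minimal sketch, not anything run in the channel: the helper name `time_until_expiry` is hypothetical, the log date is assumed to be 17 April 2019 with timestamps in UTC, and `%b` parsing assumes an English/C locale.

```python
from datetime import datetime, timedelta, timezone


def time_until_expiry(enddate_line: str, now: datetime) -> timedelta:
    """Return the time left before a certificate's notAfter date.

    `enddate_line` is one line of `openssl x509 -noout -enddate` output,
    e.g. "notAfter=May 27 15:58:05 2019 GMT". A negative result means
    the certificate has already expired.
    """
    prefix = "notAfter="
    if not enddate_line.startswith(prefix):
        raise ValueError(f"unexpected line: {enddate_line!r}")
    # openssl prints the date in GMT; %b assumes an English/C locale.
    not_after = datetime.strptime(
        enddate_line[len(prefix):].strip(), "%b %d %H:%M:%S %Y GMT"
    ).replace(tzinfo=timezone.utc)
    return not_after - now


# Time of the s_client check in the log (date and UTC are assumptions).
now = datetime(2019, 4, 17, 23, 21, 51, tzinfo=timezone.utc)

served = time_until_expiry("notAfter=May 27 15:58:05 2019 GMT", now)
on_disk = time_until_expiry("notAfter=Feb 11 16:28:51 2019 GMT", now)

print(served.days)               # whole days left on the served certificate
print(on_disk < timedelta(0))    # True if the on-disk file has already expired
```

Under those assumptions the served certificate has 39 whole days left, which matches the estimate given in the channel, while the file under /etc/acme/cert/ is long expired.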