[03:53:51] vgutierrez, re https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/446618/ - do you want to work on replacing the acme_tiny call with that in a separate commit?
[03:53:56] if so I think I'll merge this today
[06:41:38] Krenair: yey.. replacement will be on a separate commit
[06:42:11] Krenair: but let's give volans a chance to do a final review of that code before merging it
[07:10:59] vgutierrez: thx, saw the latest PS, I'll review it shortly
[07:15:16] Krenair: BTW.. regarding ACME Accounts... certcentral should get from a CLI parameter the ID of the account that it's going to use and just load() it, cause ACME accounts should be persisted in the private repo as a secret and provisioned on the server where certcentral is running by puppet
[08:51:44] vgutierrez: review done
[09:00:13] Traffic, Operations: Discard of cold, labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (ema) Open>Resolved a:ema No more cold VCLs, [[https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445357/ | workaround ]] working fine. Closing.
[09:22:15] Traffic, Operations, Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Vgutierrez)
[09:22:20] Traffic, Operations: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Vgutierrez) Open>Resolved a:Vgutierrez
[09:53:57] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5003.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[10:43:46] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5003.eqsin.wmnet'] ``` and were **ALL** successful.
[10:49:42] mmmh the first puppet run after -auto-reimage fails with an interesting error
[10:49:46] Error: /Stage[main]/Profile::Base::Certificates/Sslcert::Ca[wmf_ca_2017_2020]/File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]: Could not evaluate: Could not retrieve file metadata for puppet:///modules/base/ca/wmf_ca_2017_2020.crt: execution expired
[10:50:06] see /var/log/wmf-auto-reimage/201808060952_ema_27699_cp5003_eqsin_wmnet_cumin.out on neodymium
[10:50:17] <_joe_> ema: that's a puppetmaster failure, 99%
[10:50:27] protip: less -R
[10:50:34] <_joe_> go check the puppetmaster logs for that file :)
[10:50:47] <_joe_> bbl
[10:50:54] ok it seems a consistently reproducible error, at least reimaging eqsin hosts
[10:58:09] vgutierrez, hm okay, so at some point someone generates an account and then stores it in puppet-private
[10:58:17] indeed
[10:58:46] Krenair: BTW, before adopting acme_requests I'm working on some config refactor
[10:58:50] with proper tests
[10:58:59] ok
[10:59:11] BTW, we need to get rid of get_certs from the CertCentral class
[10:59:31] in the meantime, to be able to import certcentral without instantiating CertCentral() I've adopted the following hack
[10:59:38] app.cert_manager = CertCentral()
[10:59:39] app.cert_manager.run()
[10:59:39] app.run()
[11:00:36] yeah good call
[11:45:28] Krenair: question, I'm using deployment-certcentral02 as a config example
[11:45:38] ok
[11:45:39] right now in /etc/certcentral/conf.d we have a file with the following contents
[11:45:56] cat conf.d/authorisedhost_testing__deployment-certcentral-testclient02.deployment-prep.eqiad.wmflabs.yaml
[11:45:59] certname: testing
[11:46:02] hostname: deployment-certcentral-testclient02.deployment-prep.eqiad.wmflabs
[11:46:16] Krenair: what would be the syntax for specifying several hostnames?
[11:46:39] just specifying a list instead of a string in hostname?
[11:46:47] cause it looks like it's not currently implemented
[11:46:51] so maybe I'm missing something
[11:47:28] you don't specify multiple hostnames
[11:47:40] puppet makes a file for each host
[11:48:03] ack, got it, thx
[11:48:07] basically each host exports a resource that says it wants certificate x, and the central machine collects them all
[11:49:20] sticks them in /etc/certcentral/conf.d
[12:41:57] vgutierrez, ew american spelling changes
[13:16:45] volans, _joe_: opened T201317, I could not find anything useful on the puppetmaster apache logs for those files
[13:16:46] T201317: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317
[13:16:47] ema: is T201317 something related to the reimage that I should look at or is a plain puppet issue?
[13:16:57] sametime :)
[13:22:22] volans: it's a puppet issue consistently happening at reimage time :)
[13:24:18] ema: my hunch would be it's shortly after puppet executes a command that blips the network interface. I don't recall seeing it actually do the "execution expired" thing, but I know I've seen significant pauses before in those cases on cache/lvs.
[13:24:57] basically the network interface blips because of some ethtool or sysctl or interface-rps thing, and then the live TCP connection to the master lags out and puppet gets stuck in the midst of a file transfer for a long time.
[13:25:12] (or to a fileserver maybe, rather than the master?)
[13:26:17] if you repro running puppet with -tvd you can probably see the command it executes just before the hang.
[13:36:40] ok, let's see what happens with --debug
[13:39:50] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp5004.eqsin.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20...
[13:52:24] ema: thoughts on starting to roll in the new eqiad stretch boxes to their pools? I was thinking 1-for-1 swap (depool 1x old + pool 1x new quickly, then wait a little for settling, repeat), re minimal disturbance
[13:53:19] bblack: sounds good!
[13:53:52] ok, I'll probably try the first of those in a few
[14:28:48] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5009.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[14:36:12] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5004.eqsin.wmnet'] ``` and were **ALL** successful.
[14:56:59] not my lucky day with reimages, cp5009 is stuck somewhere
[14:57:30] oh, "Error: Could not find a certificate for cp5009.eqsin.wmnet"
[14:59:55] scratch that, the first puppet run is now running with --debug
[15:11:01] XioNoX: so I was going to start slowly pooling up new eqiad caches today. They're all on the new switch stacks (4 nodes each on asw2-[abcd]-eqiad). Should I avoid any of those for now, because they're one of the few on that particular row's new switch stack and still possibly debugging/rebooting things?
[15:11:35] my vague recollection of details from friday makes me think avoid A
[15:12:40] bblack: it seems you're right about the 'execution expired' thing, it happens right after 'ip token set ; ip -6 addr flush'
[15:13:28] if that's the only such case that causes such a problem (makes sense, as it probably changes the IPv6 IP that was being used to communicate, vs just blipping an interface and keeping the IP)
[15:13:40] is there maybe a puppet agent argument to use only v4?
[15:13:45] kinda like "ssh -4"
[15:14:12] probably not
[15:14:28] we could do something else to force the v6 off before running the first agent run though, perhaps
[15:14:39] (like, unconfigure the existing autoconf v6 IP)
[15:15:45] bblack: asw2-a5-eqiad will be worked on, not expecting any issues for the other members though. So to be extra safe you can avoid row A, but don't have to
[15:15:49] a "puppet agent -4" flag seems like a decent feature request for early runs on new hosts, but I imagine it's not a quick turnaround
[15:15:59] is asw2-a-eqiad still flapping?
[15:18:36] err not flapping I guess, looking at the tickets, that was the one that needs a cable replaced
[15:19:04] bblack: cable + flapping https://phabricator.wikimedia.org/T201145#4478170
[15:19:06] yeah...
[15:19:21] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5009.eqsin.wmnet'] ``` and were **ALL** successful.
[15:25:15] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) Dell finally replied back to me (3 days later) giving me a list of 4 engineers to go onsite. They keep doing that (listing more than are going.) So now I have to figure ou...
[15:58:02] https://letsencrypt.org/2018/08/06/trusted-by-all-major-root-programs.html
[16:02:21] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (Andrew) eth0 and eth1 are both up on labtestnet2002. On labtestnet2003 eth1 still shows as down: ``` root@labtestnet2003:~# ip addr 1: lo:
netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (aborrero) The cable is unplugged or the switch port is down? ``` aborrero@labtestnet2003:~ $ sudo ethtool eth1 | grep Link Link detected: no aborrer...
[16:15:04] smooth so far on cp1090 in upload. comparing to cp1073, the most interesting result is that 1073's iowait is averaging ~0.2% CPU, and so far 1090 (even while initially filling the disk cache) is bouncing around between 0 and 0.002%
[16:15:13] which seems to somewhat confirm the benefits of nvme :)
[16:15:21] \o/
[16:16:15] hopefully nvme's raw io performance will make mailbox lag almost impossible (even if it is a crappy way to engineer around the problem)
[16:18:06] vgutierrez: did you share the blogpost on hackernews?
[16:18:29] I didn't
[16:18:48] we were minorly concerned last week about the fact that all phab traffic through cache_misc, including phame, is uncacheable in varnish
[16:19:14] so if we slashdot it, we might melt phab heh
[16:19:30] but I doubt the number of clicks from HN is going to be that awful
[16:19:56] I didn't share it cause I had the same concern
[16:20:10] relatedly, now that I've been poking around in phame a little, I saw it can be configured with custom domainnames
[16:20:12] bblack: can I ask why phame instead of our official technology blog?
[16:20:37] mostly because it's more likely to be followed and has RSS feeds, that I didn't find in Phame looking for 30s :)
[16:20:42] e.g. we could set up trafficblog.wikimedia.org in DNS to alias cache_misc and add it to the domains backending to phab, and configure it in the blog dashboard, and it probably "just works" for nicer URIs
[16:22:02] volans: I don't know really, it's possibly a candidate
[16:22:41] or foundation's blog? :)
[16:23:10] volans: there's a lot of thorny issues there. techblog/blog.wikimedia.org are run by comms not us, there's a different level of review and editorial standards and scheduling, vs phame being a bit lower-key and published directly by tech persons
[16:23:32] true
[16:23:39] volans: also, it's a story about TLS termination properties for Foundation sites, which techblog/blog.wikimedia.org don't share (or even come close to) since they're 3rd party hosted
[16:24:11] so it would kinda be inviting negative comments about how ssllabs on techblog shows shitty/different properties than advertised in the blog post :)
[16:24:18] eheheh, also true
[16:24:43] I'm kinda up in the air about related long term things in terms of practice and policy
[16:25:08] maybe we shift to using phame more by default (and fixup caching and give the blogs real hostnames for prettier URLs?)
[16:25:24] I understand, I was more thinking about having a single tech place for those things that might be easier to advertise and be knoen
[16:25:28] *known
[16:25:29] or maybe we publish there by default, but look at occasionally using it in a pipeline sort of way up to techblog
[16:25:50] (if it looks good and seems like it should get broader notice, ask comms to lift it from phame and do whatever editorial cleanup and push a copy on techblog?)
[16:26:32] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (Papaul) switch port is enabled and link is up ``` show interfaces ge-1/0/17 descriptions Interface Admin Link Description ge-1/0/17 up...
[16:26:46] could be an option, the main concern is the absence of RSS feeds (if confirmed) and the fact that you need to register on phab to subscribe
[16:27:11] there is some kind of feed on phame I think, just not pretty/memorable URIs
[16:27:35] atom
[16:27:41] I was just looking for something to add to a feed reader
[16:27:48] instead of subscribing on phab
[16:27:55] directly as it requires an account
[16:28:14] supposedly this is the atom URI: https://phabricator.wikimedia.org/phame/blog/feed/11/
[16:28:35] I have no idea how much the URI layout changes if we add a custom domain
[16:29:52] right, that works
[16:30:44] but it's ugly and long, and what should be some memorable blog name is instead "11"
[16:30:46] so :P
[16:31:38] lol
[16:31:52] there are settings for "Full Domain", "Parent Domain", and "Parent Site" that I assume mangle URLs to use some more-memorable scheme, that we could make consistent across the various phame blogs
[16:32:06] have to look into how it really works
[16:34:49] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (Cmjohnson) I checked the log today and the error has not returned.
[16:35:27] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (BBlack) Ok I'll take a stab at another imaging today and see how it goes, thanks!
[16:35:36] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (aborrero) >>! In T199821#4481716, @Papaul wrote: > switch port is enable and link is up something is now working as expected. ``` aborrero@labtestn...
[16:36:19] <_joe_> phame is really a dumb name for a blog system
[16:36:22] <_joe_> :P
[16:36:40] it makes me think of the David Bowie song every time I see it :)
[16:38:15] [ https://www.youtube.com/watch?v=J-_30HA7rec ]
[16:39:41] <_joe_> we should move your blog to wordpress.com
[16:40:50] only if we use the windows port of wordpress and have a 3rd, 4th, and 5th parties help us install in Azure for hosting.
[16:41:27] hmm I'm all in for deploying the blog in a honeypot O:)
[16:42:02] * volans heard that oracle cloud is better than Azure
[16:43:36] Krenair: either we don't allow CSRs without SANs, or we fix _find_wildcard
[16:44:01] I'd vote for not allowing it, it will only cause pain when someone finds a way to trigger it
[16:44:09] you're right that SAN is optional in X509
[16:44:24] (and maybe in the case that the input set has only 1x subject name and no san list, just auto-create a 1-entry san list as a copy)
[16:44:45] in any case, I think that _find_wildcard is better with that try/except block avoiding it to fail
[16:45:59] actually.. the let's encrypt testing server (pebble) fails on the validation with a pretty weird error if you submit a CSR without SANs
[16:46:17] but, Let's Encrypt production environment does what bblack just described
[16:46:32] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) No flows logged since at least the last 3 days. So it looks fine to me. I removed the syslog statement to minimize noise while...
[16:46:36] relatedly in the news today: https://letsencrypt.org/2018/08/06/trusted-by-all-major-root-programs.html
[16:46:38] it just gets the subject name and appends it to the SAN list
[16:47:12] yeah would expect LE to enforce the baseline requirements
[16:47:41] which makes it required
[16:48:12] LE goes the friendly way and does it for you
[16:48:24] but yeah :) I'll submit a CR making it mandatory
[16:48:27] so they take a CSR without a SAN and generate a cert with it?
[16:48:32] yes
[16:48:34] ok
[16:48:44] but in the certificate you get the subject common name in the SAN as we ll
[16:48:47] *well
[16:49:00] bblack: I'll give some love to T196691 tomorrow EU morning
[16:49:00] T196691: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691
[16:49:12] it's pretty similar to dns2001
[16:51:00] Traffic, DNS, Operations: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (Vgutierrez) I'll handle this from here @RobH, thanks :)
[16:51:08] ok thanks
[16:51:17] Traffic, DNS, Operations: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (Vgutierrez) a:BBlack>Vgutierrez
[16:51:31] we have a slew of those from the hw refreshes in both eqiad and codfw (new lvses, authdnses, recdnses, etc) that are languishing a bit
[16:52:53] feel free to send some in my way
[16:53:44] T196560
[16:53:56] stashbot is lazy :)
[16:53:59] T196560: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560
[16:53:59] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[16:58:53] vgutierrez, so we've got to replace acme_tiny with acme_requests calls, make it do renewals, other stuff?
[16:59:15] move get_certs out of certcentral
[16:59:36] also we need to sync with bblack regarding dns-01
[16:59:49] without dns-01 wildcard certificates are out of the picture
[17:00:10] how are you even testing dns-01 so far?
[17:00:14] (or are you?)
[17:00:32] so.. I've tested it manually with wmflabs.org dns zone
[17:00:50] in our integration tests, we just configure the ACME integration server to mark as valid every challenge
[17:01:06] ok
[17:01:31] so there's two levels of dns-01 issue to sort out here, basically
[17:01:54] 1) Some sane interface from gdnsd for publishing arbitrary TXT records (it is TXT contents, right?)
[17:02:17] 2) How we'll have the certcentral hosts actually execute pushing the update to the list of authdns servers
[17:02:50] indeed, it's a TXT record
[17:03:08] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (BBlack) First attempt to reboot for PXE install stops now with: ```UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B5 is disabled because of in...
[17:03:12] we talked a bit about 2 a couple of weeks ago
[17:03:27] also, where is the TXT record at? I haven't looked at the dns-01 standard at all really
[17:03:45] (as in, how does the requested SAN map to the TXT record lookup name?)
[17:04:37] https://tools.ietf.org/html/draft-ietf-acme-acme-13
[17:04:47] so if you need to validate a certificate for wikipedia.org and *.wikipedia.org you will need to set up two TXT records in _acme-challenge.wikipedia.org
[17:04:52] _acme-challenge.example.org. 300 IN TXT "gfj9Xq...Rg85nM"
[17:05:15] for m.wikipedia.org _acme-challenge.m.wikipedia.org., and so on :)
[17:05:20] I have a customised acme_tiny that does this stuff btw
[17:06:34] ok
[17:06:56] I assume both a SAN "example.org" and a wildcard SAN "*.example.org" both check _acme-challenge.example.org. ?
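The SAN-to-record mapping described above (a plain name and its wildcard both validate at the same `_acme-challenge.` label) can be sketched in a few lines of Python. This is only an illustration of the naming rule from the chat; the function names `challenge_record_name` and `group_by_record` are hypothetical and are not part of certcentral or acme_tiny.

```python
from typing import Dict, Iterable, List


def challenge_record_name(san: str) -> str:
    """Return the DNS name looked up for a dns-01 challenge on this SAN.

    A wildcard SAN such as "*.example.org" is validated at the same
    label as "example.org", so the leading "*." is dropped first.
    """
    name = san.strip().rstrip(".").lower()
    if name.startswith("*."):
        name = name[2:]
    return f"_acme-challenge.{name}."


def group_by_record(sans: Iterable[str]) -> Dict[str, List[str]]:
    """Group SANs by the TXT record name each one needs.

    Every SAN still gets its own challenge, so a record name that maps
    to two SANs needs two TXT values published at the same label.
    """
    groups: Dict[str, List[str]] = {}
    for san in sans:
        groups.setdefault(challenge_record_name(san), []).append(san)
    return groups


if __name__ == "__main__":
    sans = ["wikipedia.org", "*.wikipedia.org", "m.wikipedia.org"]
    for record, names in group_by_record(sans).items():
        print(record, names)
```

Grouping by record name makes the "two TXT records at one label" case explicit, which is the situation any publishing interface has to support.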
[17:07:03] right
[17:07:21] ok
[17:07:22] if both are present in the CSR, you need to solve 2 challenges
[17:07:33] one for example.org and another one for *.example.org
[17:07:33] right
[17:07:58] and obviously in the general case, we will have some that do many SANs across many domains, some multiple for the same DNS label, for a single CSR
[17:08:16] indeed
[17:08:33] https://phabricator.wikimedia.org/P7428
[17:09:05] at the gdnsd level, I'd talked before about just doing some generic dynamic-txt thing using statefiles pushed to disk by whatever script/mechanism is pushing the data
[17:09:07] hm that doesn't help this question much because it just takes the one from the server
[17:09:49] the downside of that approach, is we need to pre-provision _acme-challenge labels into the zonefiles (e.g. via commits) for all SANs we might use with this in the near future, so that it's even looking for those TXT data statefiles to flesh them out with.
[17:09:55] which doesn't sound ideal for integration
[17:10:30] could also just add acme challenge support more directly, and make it independent of the zonefiles
[17:10:53] so we'd have special records that tell gdnsd to look at the data files?
[17:11:02] as in, some core code actually watches for incoming query hostnames that start with "_acme-challenge.", and diverts them off to custom support rather than using the real zonefile data at all
[17:11:50] the other ??? in the nebulous earlier plans was statefiles vs controlsock commands
[17:12:26] e.g. "gdnsdctl add-acme-challenge-data example.org 600 awijaliewjfalwiejflawiejf", would create such a TXT record for the next 10 minutes and then let it expire out
[17:13:05] vs having the tooling write the current set of challenges to some file like /var/lib/gdnsd/acme-data with a list of them, and having the daemon scan that file for updates (inotify)
[17:13:44] the control socket variant is simpler, but it wouldn't be stateful across daemon restarts if someone was concurrently pushing a daemon-restarting gdnsd config change while a challenge was ongoing, whereas a statefile could be
[17:14:16] but IIRC we agreed that ACME will retry so would not be a big issue
[17:14:59] and given the transient nature of the challenges it seemed more simple to keep it in memory only and not having persistence
[17:14:59] I assume we'll be retrying the entire certificate renewal/issue process until it succeeds yeah
[17:15:00] "ACME will retry" would have to mean: challenge fails, returns failure from LE -> certcentral, certcentral re-attempts the process starting with re-pushing challenges to the DNS servers again
[17:15:16] yes
[17:15:17] ok
[17:15:42] vgutierrez: would it be the same CSR or a new one? (that I don't know)
[17:16:00] could probably make it do either
[17:16:30] so, yeah, I tend to lean in that direction then (gdnsdctl adding temporary data, and no zonefile involvement, just some custom core-code support for special handling of query names that match ^_acme-challenge\. )
[17:16:49] I remember a chat about it based on LE behaviour vs pebble for the same data request generates the same CSR or a new one
[17:17:09] you can probably use the same CSR, you'll just get new challenge data
[17:17:31] there was some issue with one of the two, but cannot remember now, valentin knows for sure
[17:18:10] maybe whether you re-use the CSR or make a new one affects one of the ratelimits in some positive or negative way
[17:18:31] bblack: would the ^_acme-challenge\. magic be enabled for all domains or you still need to setup something beforehand in the zonefile to enable it?
[17:19:12] netops, Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) USB method isn't working either, followed up with JTAC, if no quick resolution, we have spare EX4300 to swap it with. ``` loader> install --format file:///jinstall-ex-4300-14.1...
[17:20:55] volans: I'm thinking configuration-wise, at most a binary flag to turn the whole feature on or off. If the feature's on, "gdnsdctl add-acme-challenge ..." works for whatever domainnames you happen to specify in the commands
[17:21:50] sounds good for a first implementation
[17:21:53] so then we have some script running that takes remote input from certcentral and runs gdnsdctl add-acme-challenge?
[17:22:02] (or can avoid adding new configuration, and just don't run the gdnsdctl command if you don't want the challenges)
[17:22:25] netops, Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) @Cmjohnson please swap the switch with another EX4300, and only connect console and the usb drive. Once it's ready to join the virtual chassis, I'll need you to connect the VC c...
[17:22:36] I don't see much reason to have a per-zone setting
[17:23:30] Krenair: so, at challenge time, presumably certcentral will need to issue an external command of some kind however it's integrated. "/usr/local/bin/publish-acme-dns-01 .... ... ... .... .."
[17:23:54] and we'll need that script or whatever to handle dispersion to authdnses somehow
[17:24:16] which I imagine ends up being a list of authserver hostnames from puppet and some ssh commands over to them that execute the gdnsdctl stuff
[17:24:38] I don't know if we'll be shelling out but yeah
[17:25:07] yeah, I guess you could integrate deeper too
[17:25:15] it's just that part will necessarily be fairly wmf-specific :)
[17:25:22] yeah
[17:25:31] (suck up the list of authdns from a puppet-controlled configfile, and do the SSH-ing directly)
[17:25:45] are there existing cases where our systems talk to each other like this using SSH, other than cumin?
[17:26:16] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Here they are: 1/ Add logs statements to know if the packets are getting in/out of the fabric ``` Filter on the ingress interface: set firewall family ethernet-switching f...
[17:26:17] yeah, our current "authdns-update" mechanism for pushing zonefile updates does so, and it's on the same authdns servers to boot. We could probably even re-use the same system of keys.
[17:26:37] ok
[17:26:46] (authdns-update gets run on any 1/N of the authdns servers, it picks up new dns zonefile changes from git, then sshes around to all the others and gets them to pull the same update)
[17:30:11] we can make the command synchronous too, it will make life easier
[17:30:19] is TSIG something that has been considered?
[17:30:43] (so that it's guaranteed if you wait on "gdnsdctl add-acme-challenge ..." to exit 0, any query after that point will see the new data)
[17:30:51] Krenair: in what context?
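The push step discussed above (read a list of authdns hosts, then SSH to each and run the gdnsdctl command) could be sketched roughly as follows. Note the assumptions: `gdnsdctl add-acme-challenge` is only the command shape proposed in this conversation, not a confirmed gdnsd interface, and the host names, argument order, and function names here are hypothetical.

```python
import shlex
import subprocess
from typing import Callable, Iterable, List


def build_push_commands(hosts: Iterable[str], name: str, ttl: int, txt: str) -> List[List[str]]:
    """Build one ssh argv per authdns host.

    The remote command mirrors the *proposed* "gdnsdctl add-acme-challenge"
    syntax from the chat; it is an assumption, not a real CLI reference.
    """
    remote = shlex.join(["gdnsdctl", "add-acme-challenge", name, str(ttl), txt])
    return [["ssh", host, remote] for host in hosts]


def push_challenge(
    hosts: Iterable[str],
    name: str,
    ttl: int,
    txt: str,
    run: Callable = subprocess.run,
) -> List[str]:
    """Run the push on every host and return the hosts that failed.

    Relies on the command being synchronous, as suggested in the chat:
    exit status 0 is taken to mean the record is already being served.
    """
    failed = []
    for argv in build_push_commands(hosts, name, ttl, txt):
        result = run(argv)
        if result.returncode != 0:
            failed.append(argv[1])
    return failed


if __name__ == "__main__":
    # Print the commands instead of executing them; host names are placeholders.
    for argv in build_push_commands(["ns0.example", "ns1.example"], "example.org", 600, "token"):
        print(argv)
```

Passing `run` as a parameter keeps the SSH call swappable, so the fan-out logic can be exercised without touching real servers.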
[17:31:20] oh you mean to push these updates
[17:31:20] for pushing acme challenges into auth dns servers
[17:31:23] yeah
[17:31:31] no, probably not realistic for gdnsd
[17:31:42] ok
[17:32:19] I mean, we could do it in theory, but it wouldn't be for general use, just for this kind of case with temporary acme-challenge TXT records
[17:33:08] hm
[17:33:11] doing general purpose dynamic dns updates like that and persisting state, just doesn't jive well with the rest of the design, and people would start asking about it or filing bugs about why that doesn't work when they see TSIG supported in general, etc
[17:33:41] so easier to special-case acme-challenge and have our own way of pushing updates in then
[17:33:53] far easier, probably
[17:34:14] as it is, I'm worried I won't push into a stable gdnsd release in time for all of this by EOQ
[17:34:42] there's a huge raft of breaking changes for gdnsd's 3.x release already in a prototype branch, which is taking forever to finish up
[17:34:49] ok
[17:34:56] so
[17:35:16] I think I'm just going to have to scale that back to the minimum good/easy bits and toss this in, and release a less-crazy 3.x and push the rest to a relatively-quick 4.x followup later
[17:35:33] vgutierrez, volans: certcentral-wise, we need some backend system so we can configure how to push and validate challenges for either type of challenge
[17:36:30] so a generic http-01 write-to-file-and-get-once-from-verification-domain type thing
[17:36:59] or dns-01 by issuing SSH commands to the gdnsd servers
[17:37:04] or dns-01 by talking to designate
[17:37:45] netops, Operations, cloud-services-team, ops-codfw: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (aborrero) Open>Resolved I didn't press the right buttons: ``` aborrero@labtestnet2003:~ $ sudo ip link set dev eth1 up aborrero@labtestnet20...
[17:38:02] maybe with some mechanism for verifying that all DNS or HTTP servers are serving up the challenge, rather than just a simple one-off check against any random one
[17:38:09] agreed?
[17:39:15] for checking an entry in all authdns server is pretty simple, we already have the code
[17:39:19] for that bit
[17:39:40] in practice the http-01 won't work very well for our desired redundant certcentral setup, as there'd be some tricky bit about having the caches or nginxes or whatever internally re-route /.well-known/acme to the 1/2 certcentral hosts that's actually executing the given challenge
[17:40:57] is that actually tricky?
[17:41:06] (or we do a more push-like model with that too, but then that requires a local fs path to serve challenge data from on all the endpoints, regardless of apache/nginx/varnish/whatever, and again + config)
[17:41:41] it seems tricky, if both certcentrals are concurrently running various challenges on various randomized renewal timelines
[17:41:54] but only one of the two has the challenge data for any given challenge
[17:42:53] netops, Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (Cmjohnson) I swapped out the ex4300 with a current spare wmf7314. @ayounsi can you give me the details of the RMA and the shipping label. Please email.
[17:43:06] so we are going to have multiple active certcentral servers at any given time?
[17:44:05] that's what I last remember from , it seems the most simple and stateless model, aside from the http-01 routing challenge.
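The "verify on every server, not just a random one" idea raised above might look something like the sketch below. It is not the existing code volans mentions; the per-server lookup is passed in as a callable (`query_txt`, a hypothetical name) because the Python standard library has no resolver that can target a specific nameserver, so a real version would plug in a DNS library of your choice.

```python
from typing import Callable, Iterable, List


def servers_missing_challenge(
    servers: Iterable[str],
    record: str,
    expected: str,
    query_txt: Callable[[str, str], Iterable[str]],
) -> List[str]:
    """Return the servers that do not (yet) serve the expected TXT value.

    query_txt(server, record) must ask that one server directly and return
    the TXT strings it answers with. A lookup error counts as "missing".
    """
    missing = []
    for server in servers:
        try:
            values = set(query_txt(server, record))
        except OSError:
            values = set()
        if expected not in values:
            missing.append(server)
    return missing


if __name__ == "__main__":
    # Canned answers stand in for real DNS queries; all names are placeholders.
    canned = {"ns0.example": ["tok"], "ns1.example": []}
    print(servers_missing_challenge(
        canned, "_acme-challenge.example.org.", "tok",
        lambda server, record: canned[server],
    ))
```

Only when the returned list is empty would the caller tell the ACME server to start validation; otherwise it can wait and re-check, or re-push to the listed servers.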
[17:44:54] they can run independently and unaware of each other, and maintain their own sets of issued certs for all the configured things, assuming doubling the rate doesn't run afoul of ratelimits (shouldn't)
[17:45:47] and on the puppetization of the clients (the real TLS servers), I guess we have some switch that controls whether they're pulling their keyfiles from server01 or server02, so that we can switch in case of outage
[17:46:02] but either way there's a full set of keyfiles/certs there waiting
[17:47:23] it avoids all the challenges of syncing the created keys/certs between the two servers, or alternatively avoids almost certainly hitting ratelimits if we had to issue them all freshly on the spot on failover.
[17:48:12] or races on who's doing which next renewal if we're syncing bidirectionally them as they're generated (although I guess with that handled by rsync, you could designate one as the master who's doing all the renewals)
[17:49:37] no matter which way you structure things, there will be state/race issues I think. it's just a question of which corner you want to sweep those problems into.
[18:01:26] alright
[18:01:30] volans, where is the code for checking all authdns servers?
[18:17:44] netops, Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (ayounsi) Replacing fpc5 didn't solve the issue... Following up...
[19:06:19] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (Imarlier) @JKatzWMF Definitely makes sense to test this before pushing it everywh...
[19:54:43] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown, SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (JKatzWMF) Great! Thank you for confirming, @Imarlier and, again, I am really exc...
[21:25:12] netops, Operations: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (ayounsi) @Papaul, can you pre-populate the following interfaces with SFP+-10G-LR? ``` xe-0/1/0 {#11399} xe-0/1/1 {#11401} xe-0/1/3 {#11389} xe-0/1/4 {#11403} xe-0/1/6 {#11397} ``` You can store a few spares...
[21:28:56] netops, Operations: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (ayounsi) The current guess is that those errors were side effects of the other switches failing. "DDOS_PROTOCOL_VIOLATION" syslog seem to be a red herring. Still monitoring, let me know if...
[21:37:56] netops, Operations: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (ayounsi) Planning on doing the swap on August 14th, 11am CDT, 9am PDT, 4pm UTC. 1h. Pending no planned maintenance from redundant links.
[21:44:22] netops, Operations: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (Papaul)
[21:48:10] netops, Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (ayounsi)
[22:02:17] netops, Operations: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (ayounsi)