[00:04:58] LCF, there is also a load of lines at the bottom of that geo-maps file I linked
[00:05:10] that appear to be overrides for specific networks
[00:05:44] not just WM addresses, there's some ISPs in there and FB
[00:06:44] (though FB has been trouble lately so they might not be there for good reasons)
[00:07:02] you might be able to get added there? idk
[00:08:17] oh
[00:08:22] could you just add one there?
[00:09:35] it's not up to me, I have no production server access let alone the ability to mess with stuff like that
[00:09:46] as I said you'll want to talk to bb.lack
[00:09:48] 74.120.189.0/24 => [eqiad, codfw, ulsfo, esams, eqsin], # Wikia
[00:09:54] that would be super awesome
[00:10:05] k k
[03:01:21] 10HTTPS, 10Traffic, 10Operations, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Shizhao) Firefox Nightly now supports ESNI: https://blog.mozilla.org/security/2018/10/18/encrypted-sni-comes-to-firefox-nightly/
[03:43:49] 10HTTPS, 10Traffic, 10Operations, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Bawolff) >>! In T205378#4613684, @Aklapper wrote: > @Shizhao: Is this a [[ https://www.mediawiki.org/wiki/How_to_report_a_bug | feature request ]]? Currently it looks like a...
[07:03:18] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) a:03ayounsi
[09:23:18] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468368/1/modules/certcentral/files/gdnsd-sync.py that was tricky :)
[09:24:40] heh
[09:25:49] what happened with the v6 issue?
[09:25:54] cause it's working right now
[09:28:34] vgutierrez, the IP was not configured on the VM
[09:28:40] so it had an automatic ipv6 address
[09:28:47] which wasn't in DNS so wasn't in the authdns ferm rules
[09:29:01] faidon added the interface::ip6_mapped thing
[09:29:56] cool :)
[09:30:42] thx faidon <3
[09:30:48] er
[09:30:49] not faidon
[09:30:50] sorry
[09:30:56] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468370/ - it was brandon
[09:31:10] faidon appeared in channel asking about the problem at some point
[09:31:48] also it was interface::add_ip6_mapped
[09:36:16] IIRC you already had the commit for the cert test, right?
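(Editor's note: a rough sketch of the v4-to-v6 mapping being discussed, assuming interface::add_ip6_mapped follows the convention of embedding the host's IPv4 octets verbatim as the low four groups of the v6 /64 prefix; the prefix and address below are illustrative, not the real host's.)

    import ipaddress

    def mapped_v6(prefix64: str, v4: str) -> ipaddress.IPv6Address:
        # e.g. ('2620:0:861:101', '10.64.0.5') -> 2620:0:861:101:10:64:0:5
        # i.e. the v4 octets become the last four 16-bit groups verbatim,
        # so the v6 address is readable next to its v4 counterpart
        return ipaddress.IPv6Address(prefix64 + ':' + v4.replace('.', ':'))

    print(mapped_v6('2620:0:861:101', '10.64.0.5'))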
[09:37:20] yep
[09:37:28] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468315/
[09:37:39] great
[09:39:01] note the wildcard in there
[09:39:45] yep
[09:40:49] looking good: https://puppet-compiler.wmflabs.org/compiler1002/13084/certcentral1001.eqiad.wmnet/
[09:46:42] I'm merging it
[09:46:49] let's see how certcentral behaves
[09:53:22] hmmm the puppet change didn't trigger a reload of the certcentral service
[09:53:25] Notice: /Stage[main]/Certcentral::Server/Uwsgi::App[certcentral]/Base::Service_unit[uwsgi-certcentral]/Exec[systemd reload for uwsgi-certcentral]: Triggered 'refresh' from 1 events
[09:53:29] Notice: /Stage[main]/Certcentral::Server/Uwsgi::App[certcentral]/Base::Service_unit[uwsgi-certcentral]/Service[uwsgi-certcentral]: Triggered 'refresh' from 1 events
[09:53:32] only of the API
[09:53:33] we need to fix that
[09:53:47] I'm issuing the SIGHUP manually now
[09:55:58] uploading a fix
[09:56:05] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468537
[09:56:11] brb
[10:01:01] nice, we are missing the proxy config in certcentral
[10:01:11] so certcentral cannot reach the LE servers right now
[10:01:54] and we've a problem generating the config right now
[10:01:58] trailing spaces in the SNIs
[10:02:24] + not quoting the wildcard SNI makes the yaml parsing fail
[10:03:34] back
[10:03:36] damn
[10:03:38] can we pass the http_proxy env variable when starting the service?
[10:03:40] maybe we have to overwrite the package's service file with our own?
[10:04:14] that's one approach, the other one is making it configurable
[10:04:34] and having our application code pick it up and send it through to the right API calls?
[10:05:06] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468538
[10:05:30] and yeah I don't think there's a way around the quoting requirement with asterisks
[10:05:49] we can quote all of them
[10:06:12] ok
[10:07:34] vgutierrez, should I edit the hieradata to always have quotes around those?
[10:07:50] https://puppet-compiler.wmflabs.org/compiler1002/13087/certcentral1001.eqiad.wmnet/
[10:08:06] hm
[10:09:37] try this one
[10:10:12] ok
[10:10:18] hieradata already has the quotes
[10:10:22] for the wildcard SNI
[10:10:24] yes
[10:10:29] but they don't end up in the config.yaml file
[10:10:36] ?
[10:10:36] so I guess we must hardcode them in the template
[10:10:45] oh
[10:10:48] you mean
[10:11:01] the quoting not only has to pass hiera but it needs to end up in the resulting config yaml
[10:11:02] fun
[10:11:05] indeed
[10:11:05] one sec
[10:11:21] I still think we shouldn't be generating yaml in a template like this.
[10:11:25] same behaviour here --> https://puppet-compiler.wmflabs.org/compiler1002/13088/certcentral1001.eqiad.wmnet/
[10:11:34] should build the data structure in the puppet manifest and then ordered_yaml it
[10:12:41] vgutierrez, is that PS2?
[10:13:13] yup
[10:13:14] commit 0b111f757ed3a492549bf3d0dbd35136bdc3330d
[10:13:35] HEAD is now at 0b111f757e... certcentral: don't muck with whitespace after SNI entries
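(Editor's note: the quoting failure above is easy to reproduce with PyYAML — a bare `*` at the start of a scalar is scanned as a YAML alias, so an unquoted wildcard SNI is a parse error. A minimal sketch; the key name is made up for illustration:)

    import yaml

    unquoted = "snis:\n  - *.pinkunicorn.wikimedia.org\n"
    quoted = "snis:\n  - '*.pinkunicorn.wikimedia.org'\n"

    try:
        yaml.safe_load(unquoted)
    except yaml.YAMLError as exc:
        # '*' starts an alias token, and '.' is not legal in an alias name
        print("unquoted wildcard fails to parse:", type(exc).__name__)

    print(yaml.safe_load(quoted))  # {'snis': ['*.pinkunicorn.wikimedia.org']}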
[10:14:31] try this one
[10:16:21] https://puppet-compiler.wmflabs.org/compiler1002/13089/certcentral1001.eqiad.wmnet/
[10:16:25] that looks promising
[10:16:52] I'm testing 468539
[10:16:56] and if it works as expected
[10:17:01] I'm merging the three of them
[10:17:03] ok
[10:17:19] we still need to do proxy
[10:17:23] indeed
[10:17:23] when you said
[10:17:30] <vgutierrez> that's one approach, the other one is making it configurable
[10:17:40] you mean read it in as part of the certcentral config and pass it around?
[10:17:43] I was suggesting adding it as a config parameter in config.yaml
[10:17:43] yes
[10:18:13] WTF
[10:18:14] https://puppet-compiler.wmflabs.org/compiler1002/13090/certcentral1001.eqiad.wmnet/
[10:18:47] the quotes commit is messing up the previous one
[10:18:47] oh
[10:18:58] right
[10:19:54] one sec
[10:20:37] vgutierrez, btw, don't we have the acme library do the HTTP calls for us?
[10:21:44] yes
[10:21:53] and they're using requests internally
[10:22:03] does that provide a way for us to pass in proxies?
[10:22:07] I'm checking that right now :)
[10:22:14] it doesn't look like it does
[10:22:43] alex@alex-laptop:/usr/local/lib/python3.6/dist-packages/acme$ grep prox * -r
[10:22:43] messages.py: Conveniently, all challenge fields are proxied, i.e. you can
[10:22:43] messages_test.py: def test_proxy(self):
[10:22:43] alex@alex-laptop:/usr/local/lib/python3.6/dist-packages/acme$
[10:24:45] vgutierrez, try the new commit in PCC
[10:26:34] ack
[10:26:38] right.. they don't support it right now
[10:26:48] so we have to pass http_proxy?
[10:26:49] we could submit a PR here: https://github.com/certbot/certbot/blob/master/acme/acme/client.py
[10:26:55] https://github.com/certbot/certbot/blob/master/acme/acme/client.py#L882
[10:27:03] that should be the way to do it
[10:27:10] but right now we must set the env variables
[10:27:14] ok
[10:27:29] so we do this by overwriting the packaged service file with our own in puppet right?
[10:27:59] that or making certcentral code mess with the env variables
[10:28:23] https://puppet-compiler.wmflabs.org/compiler1002/13096/certcentral1001.eqiad.wmnet/ --> nice
[10:28:33] I'm merging those
[10:28:39] ok
[10:31:42] so... messing with the environment within certcentral doesn't look clean
[10:31:54] yeah
[10:32:03] so I guess we go for including our own unit file
[10:32:37] can we make the Service resource take a unit file template?
[10:32:48] or do we need to change it to systemd::service?
[10:34:02] hmm the refresh of certcentral
[10:34:06] actually issued a restart
[10:34:08] instead of a reload
[10:34:09] Oct 19 10:32:48 certcentral1001 systemd[1]: Stopping Central Certificates Service...
[10:34:13] Oct 19 10:32:48 certcentral1001 systemd[1]: Stopped Central Certificates Service.
[10:34:16] Oct 19 10:32:48 certcentral1001 systemd[1]: Started Central Certificates Service.
[10:34:32] systemctl reload though?
[10:34:58] the reload behaves as expected
[10:34:59] Oct 19 10:34:46 certcentral1001 systemd[1]: Reloading Central Certificates Service.
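(Editor's note: the reload path exercised here — systemctl reload delivering a SIGHUP, the daemon logging "SIGHUP received" and re-reading its config — is the usual Python daemon pattern. A minimal sketch under that assumption, not certcentral's actual code; all names are illustrative:)

    import signal

    class Daemon:
        def __init__(self):
            self.reload_requested = False
            # systemctl reload typically runs the unit's ExecReload,
            # e.g. /bin/kill -HUP $MAINPID
            signal.signal(signal.SIGHUP, self._on_sighup)

        def _on_sighup(self, signum, frame):
            # keep the handler tiny: set a flag, act on it in the main loop
            print("SIGHUP received")
            self.reload_requested = True

        def tick(self):
            if self.reload_requested:
                self.reload_requested = False
                self.load_config()  # re-read config.yaml

        def load_config(self):
            pass  # placeholder for the real config re-read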
[10:35:02] Oct 19 10:34:46 certcentral1001 certcentral-backend[17240]: SIGHUP received
[10:35:49] but maybe that's because the timeouts are making certcentral crash
[10:36:07] we need to capture those properly
[10:36:17] so let's fix the proxy stuff
[10:36:27] yeah working on it
[10:38:27] https://gerrit.wikimedia.org/r/468544 + parent
[10:39:17] hmmm we need to set https_proxy as well
[10:39:36] or http_proxy behaves as a fallback?
[10:39:42] (for https_proxy)
[10:40:53] IIRC it's a fallback
[10:41:30] can you try it out? don't happen to have such a network handy
[10:41:39] http://docs.python-requests.org/en/master/user/advanced/#proxies
[10:41:59] I can try with the certcentral tests :)
[10:42:00] one sec
[10:44:06] HTTP_VALIDATOR_PROXIES = {
[10:44:06]     'http': os.getenv('HTTP_PROXY'),
[10:44:06]     'https': os.getenv('HTTPS_PROXY'),
[10:44:06] }
[10:44:09] we have that in our code
[10:44:14] we need to explicitly set both of them
[10:47:07] oh.. and env. variables are case sensitive..
[10:47:10] so HTTP_PROXY
[10:48:05] huh
[10:48:13] I seem to recall curl taking lowercase ones
[10:48:14] ok
[10:48:42] dunno about curl right now, but requests doc is using them in upper case
[10:49:48] ok
[10:52:09] LOL
[10:52:11] https://curl.haxx.se/docs/manpage.html
[10:52:16] http_proxy but HTTPS_PROXY
[10:52:19] gotta love the consistency
[10:52:43] and [url-protocol]_PROXY
[10:52:50] so I guess HTTP_PROXY is also valid in curl
[10:53:07] https://gerrit.wikimedia.org/r/468544
[10:55:19] Krenair: we could fetch the proxy from hiera/common.yaml http_proxy variable
[10:55:57] 665:http_proxy: "http://webproxy.%{::site}.wmnet:8080"
[10:55:59] it's already there
[10:56:11] vgutierrez, yeah but then we have to access hiera from a template?
[10:56:30] yup.. not from the template but in the puppet code
[10:56:37] hm ok
[10:56:38] fetch it to a local variable
[10:56:42] and use the local variable in the template
[10:57:50] new ps
[10:59:19] 10:58:35 modules/certcentral/manifests/server.pp:70 wmf-style: Found hiera call in class 'certcentral::server' for 'http_proxy'
[10:59:20] vgutierrez
[10:59:48] HHAHAHAHAHAHA
[10:59:54] (crazy laugh)
[11:00:02] fricking wmf-style :)
[11:00:09] at this point I'm not even surprised
[11:00:44] lint:ignore:wmf_styleguide?
[11:01:05] _joe_: do you have a millisecond for us?
[11:01:26] <_joe_> yes
[11:01:40] <_joe_> do the hiera call in the profile, pass the parameter along
[11:01:42] ack
[11:01:44] thx :)
[11:02:06] <_joe_> that was fast :D
[11:02:11] I told you :D
[11:03:06] <_joe_> there is a specific reason why we want to do hiera calls only in profiles, and it is that it makes it easy to see all the variable configuration you inject
[11:03:55] it'd be nice to see a wmf_styleguide run on the entire repo rather than just new things
[11:04:08] <_joe_> oh you can do it
[11:04:13] and fix them all
[11:04:14] <_joe_> it's a tad depressing though :P
[11:04:16] so we can:
[11:04:26] <_joe_> Krenair: fix them all is hard
[11:04:27] * actually rely on existing patterns to pass the lint check
[11:04:41] <_joe_> very hard in some cases, take the mediawiki module
[11:04:48] * ensure any new style rules are sane by forcing whoever adds it to fix the existing ones
[11:04:51] 8 modules/profile/manifests/certcentral.pp:45 wmf-style: Found hiera call in class 'profile::certcentral' for 'http_proxy'
[11:04:54] wut?
[11:06:09] <_joe_> ?
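(Editor's note: a sketch of the two proxy-passing options weighed above. requests accepts an explicit proxies mapping per request, and, because trust_env defaults to True, it also falls back to HTTP_PROXY/HTTPS_PROXY from the environment — which is why the unit-file approach works at all. The proxy URL matches the hiera value quoted at 10:55:57 with eqiad filled in; the target URL is just an example:)

    import os
    import requests

    # option 1: env vars; requests picks these up by default (trust_env=True)
    os.environ["HTTP_PROXY"] = "http://webproxy.eqiad.wmnet:8080"
    os.environ["HTTPS_PROXY"] = "http://webproxy.eqiad.wmnet:8080"

    # option 2: explicit per-request proxies, as in the HTTP_VALIDATOR_PROXIES
    # snippet quoted above -- note both keys must be set explicitly here,
    # there is no http->https fallback in this form
    proxies = {
        "http": os.getenv("HTTP_PROXY"),
        "https": os.getenv("HTTPS_PROXY"),
    }
    resp = requests.get("https://acme-staging-v02.api.letsencrypt.org/directory",
                        proxies=proxies, timeout=10)
    print(resp.status_code)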
[11:06:26] wmf-style doesn't like your solution _joe_
[11:06:27] we did the hiera call in the profile
[11:06:34] still says nope
[11:07:18] <_joe_> that's a bug then
[11:07:25] <_joe_> either in the plugin or in CI
[11:07:28] <_joe_> lemme check myself
[11:08:20] <_joe_> oh yes, it needs to be a class parameter
[11:08:39] <_joe_> stupid limitations imposed by playing nice with the horizon UI :/
[11:09:27] ack
[11:09:43] <_joe_> Krenair: use https://wikitech.wikimedia.org/wiki/Puppet_coding#Organization as a reference
[11:10:02] Krenair: add it as a parameter instead of ignoring the linter plz
[11:10:17] yeah the latest PS does that
[11:10:36] thx :D
[11:11:03] checking pcc right now
[11:12:06] https://puppet-compiler.wmflabs.org/compiler1002/13102/certcentral1001.eqiad.wmnet/change.certcentral1001.eqiad.wmnet.err
[11:12:12] <_joe_> also another tip
[11:12:12] https://puppet-compiler.wmflabs.org/compiler1002/13102/certcentral1001.eqiad.wmnet/change.certcentral1001.eqiad.wmnet.err
[11:12:15] arg
[11:12:24] Error: Failed to compile catalog for node certcentral1001.eqiad.wmnet: Evaluation Error: Error while evaluating a Resource Statement, Systemd::Service[certcentral]: parameter 'ensure' expects a match for Wmflib::Ensure = Enum['absent', 'present'], got 'running' at /srv/jenkins-workspace/puppet-compiler/13102/change/src/modules/certcentral/manifests/server.pp:71 on node
[11:12:30] certcentral1001.eqiad.wmnet
[11:12:49] this will be the parent commit which changed from Service to Systemd::Service
[11:13:08] how are you supposed to ensure running a Systemd::Service then
[11:13:22] <_joe_> using service_parameters
[11:14:03] ah so we pass ensure => present, service_params => {ensure => running}
[11:14:42] <_joe_> let me check the code
[11:15:18] <_joe_> no, you don't need to do it
[11:15:33] <_joe_> ensure => present will automagically guarantee the service is ensure => running
[11:16:08] <_joe_> modules/systemd/manifests/service.pp L47-53
[11:16:28] ok
[11:16:56] vgutierrez, try this
[11:18:52] looking good: https://puppet-compiler.wmflabs.org/compiler1002/13103/certcentral1001.eqiad.wmnet/
[11:20:41] thx _joe_ :D
[11:22:45] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) Does this still affect us? If so, which concrete subnets are affected?
[11:26:06] ok...
[11:26:09] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10Krenair) I can't seem to access eqiad1.bastion.wmflabs.org right now but: ```krenair@bastion-01:...
[11:26:17] now it's crashing when it tries to submit the DNS challenges
[11:26:43] hmm nope
[11:26:51] it fails on DNS challenge validation
[11:27:46] https://phabricator.wikimedia.org/P7698
[11:28:36] vgutierrez, oh I think we changed from listing IPs of resolvers to listing hostnames of resolvers?
[11:28:44] that probably broke it
[11:28:54] hmmm
[11:29:04] that could be it :)
[11:31:27] running a test.. with localhost instead of 127.0.0.1
[11:31:46] can you PCC https://gerrit.wikimedia.org/r/468551 ?
[11:32:03] seriously I should really be able to run stuff through PCC myself :/
[11:34:28] sure
[11:34:33] same crash
[11:34:36] in the test
[11:34:59] so you reproduced that it's hostnames vs. IPs in the verification DNS server list causing the problem?
[11:35:05] indeed
[11:35:22] but I don't know if your solution is the best approach
[11:36:04] DNS lookup at puppet compile time
[11:36:11] VS at certcentral config parsing
[11:36:36] yeah
[11:36:36] or even before attempting the DNS validation
[11:36:36] well
[11:36:44] to avoid restart-less DNS changes
[11:36:46] my laptop is low on battery and I need to go get lunch and walk home
[11:36:53] sure, don't worry
[11:37:01] enjoy your lunch :)
[11:40:47] 10Certcentral: Allow validation_dns_servers to be a list of hostnames - https://phabricator.wikimedia.org/T207457 (10Vgutierrez)
[11:40:49] 10Certcentral: Allow validation_dns_servers to be a list of hostnames - https://phabricator.wikimedia.org/T207457 (10Vgutierrez) p:05Triage>03High
[12:18:38] vgutierrez, back, you working on that ^ ?
[12:23:36] I've just submitted the patch
[12:23:45] https://gerrit.wikimedia.org/r/468554
[12:23:56] BTW, we need to open a task to fix something else
[12:24:05] right now we check 1 of the configured dns servers
[12:24:07] not all of them
[12:27:46] huh
[12:29:38] vgutierrez, not socket.gethostbyname ?
[12:30:38] 10Certcentral: Validate DNS-01 challenges against every DNS server - https://phabricator.wikimedia.org/T207461 (10Vgutierrez)
[12:30:54] 10Certcentral: Validate DNS-01 challenges against every DNS server - https://phabricator.wikimedia.org/T207461 (10Vgutierrez) p:05Triage>03Normal
[12:30:58] Krenair: gethostbyname only returns IPv4 addresses
[12:31:03] and we are in a dual stack environment
[12:31:23] from python doc
[12:31:24] gethostbyname() does not support IPv6 name resolution, and getaddrinfo() should be used instead for IPv4/v6 dual stack support.
[12:33:31] I've already tested that the authdns servers answer DNS queries on IPv6 addresses
[12:34:04] wonder why python returns a standard tuple as the 5th item in each getaddrinfo entry
[12:34:12] instead of named tuples
[12:36:35] I think your 53, proto=socket.IPPROTO_UDP gets ignored, but anyway
[12:36:47] yep, but trims the output a little bit
[12:39:46] Krenair: are you submitting the new package release to the debian branch?
[12:39:50] vgutierrez, https://gerrit.wikimedia.org/r/468565
[12:39:51] nice :)
[12:40:17] I think you'll need to add me to the list of maintainers
[12:40:23] or lintian is going to complain
[12:42:06] hmmm
[12:42:12] we forgot to bump the version in setup.py as well
[12:42:53] do you need to set V+2 in this repo?
[12:43:57] hmmm we need to do it in master as well
[12:44:06] the setup.py version change
[12:44:15] vgutierrez: use setuptools_scm ;)
[12:44:48] volans|off: CRs are always welcome
[12:45:00] sorry I'm off :-P
[12:45:09] <3
[12:45:44] 10Certcentral, 10Patch-For-Review: Allow validation_dns_servers to be a list of hostnames - https://phabricator.wikimedia.org/T207457 (10Krenair) 05Open>03Resolved
[12:49:45] BTW, cherry-picking with -x flag adds the cherry-picked commit ID to the commit message automatically
[12:50:25] so where we at?
[12:50:36] LCF: still around?
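(Editor's note: a sketch of the dual-stack lookup the fix settles on, per the Python documentation quoted at 12:31:24 — getaddrinfo() instead of the IPv4-only gethostbyname(). The hostname is illustrative:)

    import socket

    def resolve_all(hostname: str):
        """Return every A/AAAA address for hostname, deduplicated."""
        infos = socket.getaddrinfo(hostname, 53, proto=socket.IPPROTO_UDP)
        # each entry is (family, type, proto, canonname, sockaddr); sockaddr
        # is a plain tuple whose first element is the address string for
        # both AF_INET and AF_INET6
        return sorted({sockaddr[0] for *_, sockaddr in infos})

    print(resolve_all("ns0.wikimedia.org"))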
[12:50:47] bblack: fixing a small issue regarding DNS-01 challenge validation
[12:51:00] bblack: using hostnames instead of IP addresses made certcentral crash
[12:51:01] yeah I saw some stuff above about DNS vs IPs, couldn't really follow
[12:51:18] it's already fixed
[12:51:21] hi bblack I am still around, LCF will be later
[12:51:30] we're releasing 0.2 now in a few minutes
[12:51:34] basically one of the libraries we were using baulked at getting nameservers names
[12:51:38] it needed IPs
[12:51:48] er
[12:51:50] resolvers names
[12:52:13] TK-999: so with the wikia IP block and geoip, often part of the problem is that your effective DNS resolution pathway is different than what you think, too.
[12:53:01] yeah that's possible, for the geoip part LCF has already submitted a correction request to maxmind but ofc that will take some time to be processed
[12:53:02] TK-999: when your software does a DNS lookup on commons.wikimedia.org, what local resolvers/caches does it use, and what IPs those end up hitting the internet from.
[12:53:28] Krenair: don't forget about the tag :)
[12:53:35] I'm going to add a reflector to our DNS (I thought we had one before) right quick that will make it easier to debug
[12:54:00] gbp:error: 0.2 is not a valid treeish
[12:54:04] how often does wikimedia update the copy of the maxmind db?
[12:54:08] vgutierrez, one sec
[12:54:26] done
[12:54:47] as for local resolvers one sec
[12:56:35] Krenair: often!
[12:56:36] if they're your own, operating on your servers from your network block, it should all be fine, but sometimes that's not the case
[12:56:36] bblack: it should be basically 74.120.189.0/24
[12:56:50] maxmind currently puts that into San Francisco, correction request is pending
[12:56:52] Krenair: basically we're always up-to-date
[12:58:09] ok
[12:58:14] Krenair: https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/468554/ this didn't get merged
[12:58:18] :(
[12:58:31] dammit
[12:58:46] will fix tag
[12:59:04] ack
[12:59:12] done
[13:00:07] bblack: we saw there were some IP/subnet specific overrides in the configuration so forcing eqiad for that subnet would probably instantly improve things on our side, assuming of course that adding more overrides there is doable :)
[13:00:32] yeah we can do that
[13:01:40] vgutierrez@boron:~/certcentral$ git log --oneline 0.2
[13:01:40] 1f68397 dns_validation: Allow hostnames as DNS validation servers
[13:01:41] 4a67086 Detect when cert config changes and re-issue
[13:01:49] latency reston -> eqiad is sub 1ms compared to 60ms for reston -> ulsfo
[13:01:50] Krenair: so it's missing the 0.2 bump commit
[13:01:55] :/
[13:02:01] ugh
[13:02:34] I've changed the 0.2 tag to cf53c0d2b82e4328c24672b4c60b3339083e3016
[13:03:03] which should have that vgutierrez
[13:03:37] dpkg-source: info: local changes detected, the modified files are:
[13:03:37]  certcentral/certcentral/acme_requests.py
[13:03:37]  certcentral/tests/test_certcentral.py
[13:03:37] dpkg-source: error: aborting due to unexpected upstream changes, see /tmp/certcentral_0.2-1.diff.L8iTUG
[13:03:40] dpkg-source: info: you can integrate the local changes with dpkg-source --commit
[13:03:43] gbp:error: 'git-pbuilder -j8 -us -uc -sa' failed: it exited with 2
[13:03:45] it's failing to build the package
[13:03:56] why are they unexpected?
[13:04:15] we've changed the tag and the version and done the changelog
[13:04:54] cause in the diff, the dns fix commit is missing
[13:04:56] :/
[13:05:35] wut
[13:05:45] yeah, the diff is showing those lines as the offending ones
[13:06:05] the ones from 1f68397774b0ca3d5299f6efaaf48b5ab508b9aa
[13:06:12] gotta go to lunch right now
[13:06:18] do you have the old tag or something?
[13:06:25] mope
[13:06:28] *nope
[13:06:30] I updated it
[13:06:42] vgutierrez@boron:~/certcentral$ git pull --tags
[13:06:43] Already up-to-date.
[13:07:28] what does git log --oneline 0.2 say?
[13:08:22] you probably shouldn't move tags around that are already pushed out to other shared repos :)
[13:08:27] much like amending history
[13:08:59] I don't know if gerrit pushes the tags
[13:09:40] I just made the change directly in gerrit's web ui
[13:12:16] bblack: so yeah, I think for now an override there would help greatly, we can create a ticket for us to follow up & remove it once the geolocation fix has been propagated to geoip services
[13:31:13] TK-999: it should be working now (the override in our GeoDNS)
[13:31:35] bblack: yeah just checked - awesome, thank you!
[13:31:45] np!
[13:34:26] Krenair: cf53c0d Merge "dns_validation: Allow hostnames as DNS validation servers"
[13:35:09] yeah
[13:38:46] hmm
[13:38:46] dpkg-source: info: building certcentral using existing ./certcentral_0.2.orig.tar.gz
[13:38:50] caching...
[13:39:43] nothing like some chicken tajine to be able to see the issue
[13:40:10] 10Traffic, 10Operations, 10Continuous-Integration-Config: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) Here is the full context **Current** The `operations-dns-lint` job runs on **Jessie** WMCS instances, they are provisioned by puppet and eventu...
[13:43:22] so
[13:43:22] E: certcentral source: maintainer-address-malformed Alex Monk , Valentin Gutierrez
[13:43:26] W: certcentral source: changelog-should-mention-nmu
[13:43:28] W: certcentral source: source-nmu-has-incorrect-version-number 0.2-1
[13:43:33] lintian is not happy
[13:44:11] changelog-should-mention-nmu --> that would be fixed as soon as the maintainer-address-malformed is fixed
[13:46:14] how do you add multiple maintainers then?
[13:47:07] hmm right, what we do in other packages like pybal is list one maintainer and several uploaders
[13:47:37] 10Traffic, 10Operations, 10Continuous-Integration-Config: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) Shouldn't the container be able to puppetize from `authdns::lint` directly, which would provide all the pathways for updating the package/config/...
[13:47:42] so keep yourself as Maintainer and add me and yourself to the Uploaders list
[13:47:45] 10Traffic, 10Analytics, 10Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns) a:05elukey>03mforns
[13:47:51] vgutierrez, and an uploader can appear in the changelog?
[13:48:04] apparently yes
[13:48:16] cause I'm there for pybal and lintian is happy
[13:49:23] see new commit
[13:49:54] dunno if you should be there as an uploader as well
[13:49:55] let's try
[13:50:44] that is correct, changes by the Maintainer + Uploaders are what are considered "maintainer" uploads
[13:50:51] don't ask why, legacy :)
[13:53:06] yey
[13:53:11] now lintian is happy
[13:53:27] we're back to the usual W: certcentral source: newer-standards-version 4.2.1 (current is 3.9.8)
[13:53:30] and that's expected
[13:53:39] I'm merging the commit
[13:53:40] that means you're using an old version of lintian
[13:53:49] yup
[13:53:53] boron
[13:54:50] what is boron running? jessie?
[13:55:11] boron?
[13:55:41] oh that's confusing, that name used to belong to a frack box :)
[13:56:13] it got reused due to being short on element names
[13:56:22] yeah maybe we should have newer-than-stretch debian packaging tools there
[13:56:25] it runs stretch, but I'd argue that for package-builder lintian should be pinned to ${::lsbdistcodename}-backports
[13:56:29] I ran into that with the gdnsd packaging as well
[13:56:34] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/380461/
[14:07:21] so now it's running... certcentral marks the challenge as validated but LE doesn't agree with us for some reason
[14:09:12] uh wow
[14:09:23] alex@alex-laptop:~/Development/Wikimedia/Operations-DNS (master)$ dig _acme-challenge.pinkunicorn.wikimedia.org TXT +short | wc -l
[14:09:23] 63
[14:09:38] yep
[14:09:46] certcentral is too aggressive
[14:09:57] haha
[14:11:07] uh
[14:11:21] yeah you should slow that down maybe. I mean the code is beta after all :)
[14:12:00] ;; MSG SIZE rcvd: 9366
[14:12:10] ^ ~9K of response size so far of TXT records
[14:13:34] if I think of this in more-general terms: probably we shouldn't assume in certcentral that a challenge failure implies re-patching DNS will fix it.
[14:13:42] or if we do, there should be some kind of holdoff timer on retrying it
[14:13:52] how does gdnsd wipe the challenges?
[14:14:20] there's an explicit command "gdnsdctl acme-dns-01-flush" that we can use on the 3x authservers to wipe them all out
[14:14:47] otherwise, they automatically drop off 10 minutes after they're injected by default
[14:15:04] ok
[14:15:11] (we could configure that smaller, it seemed a reasonable default to make sure a successful process finishes with anyone's janky integration)
[14:15:34] maybe gdnsd should also have a cap of max challenges you can insert?
[14:15:47] a configurable one, maybe?
[14:16:04] can't imagine the use case during which 70 would make sense for anyone?
[14:16:15] someone will! :)
[14:16:18] heh
[14:16:35] assuming no code bugs, which is a tricky asterisk to parse, it doesn't actually fail or anything
[14:16:58] they'll just build up in daemon datastructures to whatever maximum you can place in the sliding 10-minute window of history
[14:17:34] and if the response (for one given challenge hostname) is >16KB, it will simply not add all of the records to the response at that point, because gdnsd refuses to output responses bigger than 16KB, even over TCP
[14:17:59] normal UDP rules (512 or edns0 size up to server-configured limit) truncate over to TCP in general
[14:18:42] so if you configure 10K challenge responses, the first N that fit in a 16KB TCP response will get sent, the rest won't (until some older ones expire and new ones shift into the window of 16K visibility)
[14:18:50] could we wipe those plz? :)
[14:18:55] sure
[14:20:22] done
[14:20:39] FTR: bblack@neodymium:~$ sudo cumin 'C:role::authdns::server' 'gdnsdctl acme-dns-01-flush'
[14:20:58] <3 thx
[14:21:17] (note eeden is still in that set in puppetdb because it didn't get reimaged yet, and fails, but that's ok)
[14:24:50] so the 16KB sliding output window implies ~285 challenges or so, depending on what other things fill the last bits of space (length of query name, edns options, etc)
[14:25:15] it might not be unreasonable to just hard-cap challenges-per-hostname around that value and reject further updates until some expire.
[14:25:54] at least it avoids someone's runaway script doing really crazy things and configuring, say, 100M of them in memory
[14:26:04] which is a bunch of pointless churn for the daemon's main thread and memory usage
[14:26:45] why would you cap it at 285 and not, say, 5 or 10 in that case?
[14:27:03] if you're instituting a cap in the first place, I mean
[14:27:05] because 5 or 10 would almost certainly be unreasonable for some real use-cases
[14:27:13] and/or retries
[14:27:21] s/C:role::authdns::server/O:authdns::server/ ;)
[14:27:28] what's O?
[14:27:33] like what? I guess I'm missing background on how ACME DNS challenges work
[14:27:42] shortcut for Class = role::
[14:27:46] oh cool :)
[14:27:48] rOle I think, because R was reserved
[14:28:10] https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection :D
[14:28:26] yep that's the reason, R is for general resources, sorry for the interruption :)
[14:28:40] paravoid: so in general, you're going to have 1x challenge-response needed, for every SAN of a cert that hits that name, including any wildcard as the name under the wildcard.
[14:28:48] and every cert is separate
[14:29:31] so, for example, if you're issuing ECDSA+RSA matching certs, and the CN/SAN set contains "*.example.com, example.com", that's 4x challenges right there just for one attempt with one provider.
[14:29:42] from one DC
[14:30:05] all under _acme-challenge.example.com, right?
[14:30:09] right, so then we have redundant certcentrals issuing independently and hitting the same DNS servers, so say 8x challenges at a time for that scenario
[14:30:12] paravoid: yes
[14:30:47] and then, in a future world where an alternate-LE exists and you do redundant certs for getting around OCSP failure (like we do with unified and GS+digicert), then that's 16x.
[14:31:15] well
[14:31:17] and let's say you have operational needs for two different certs, one of which just has that pair of SANs, and another which has those two plus 4 others on other names. So now you're up to 32x you could need at once
[14:31:20] would you do all of them at the same time?
[14:31:39] I suppose it's possible, but a bit of a corner case extreme, right?
[14:31:44] you would probably have systems that auto-renew on swizzled timestamps related to next expiries
[14:31:54] but do you want your system to fail because
[14:32:28] why would we want to impose arbitrary limits that prevent our systems doing all of them at the same time?
[14:33:08] I'm just saying: over all the possible operational scenarios for + +
[14:33:27] + the possibility that you may be starting fresh on all your certs (brand new certcentral instances?) and issue all of them all at once
[14:33:43] the limit where the solution should just fall over shouldn't be very low
[14:33:50] you probably should splay that anyway, shouldn't you?
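(Editor's note: rough wire-format arithmetic behind the ~285 figure above, assuming each additional TXT answer costs a 2-byte compressed name pointer, the fixed 10-byte RR header, and the RDATA of one length byte plus the 43-character challenge string; as noted in the log, the exact ceiling shifts with query name length, EDNS options, etc.:)

    # approximate per-record cost in a DNS response:
    name_ptr = 2           # compressed pointer back to the question name
    rr_fixed = 10          # type(2) + class(2) + ttl(4) + rdlength(2)
    txt_rdata = 1 + 43     # character-string length byte + challenge token
    per_record = name_ptr + rr_fixed + txt_rdata   # 56 bytes

    budget = 16 * 1024     # gdnsd's 16KB response cap
    print(budget // per_record)  # 292, before header/question/EDNS overhead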
[14:34:09] well yes, but we're talking about limits at the gdnsd level, not wmf or certcentral
[14:34:32] some other organizations' scripts may not do so, and they might still reasonably expect it not to unnecessarily, artificially fail
[14:36:00] 10Traffic, 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) None of the containers get provisioned via puppet. For CI puppet was used mostly to provide a list of packages. The variou...
[14:37:09] but it probably is reasonable to reject the gdnsdctl command if you're adding more in a TTL-window than can be output in our 16KB output limits anyways
[14:37:24] because that's going to partially fail anyways, in practice, as not all the records will be output
[14:38:19] and ~250 or so seems like a high enough limit that most would never get there with reasonable operational constraints anyways, it almost has to be a runaway script
[14:40:34] other thing I wanted to ask you was... is there anything special on the gdnsd side for ACME?
[14:40:51] or is this just ephemeral records
[14:41:11] I think it depends on what you mean by that...
[14:41:11] I guess it chooses _acme-challenge as the label, but beyond that?
[14:41:50] other stuff uses DNS for challenges, e.g. we have Google Search Console records in ops/dns I think, and someone filed a task about adding something for GitHub
[14:41:51] it also constrains the data to only match what acme-dns-01 specifies. e.g. you can't use this command to just define arbitrary TXT data. it actually has to be TXT data that's 43 bytes long and from the set of legal base64url characters, etc
[14:42:00] so it's acme-specific in that sense
[14:42:03] not sure if these are meant to be ephemeral or permanently set
[14:43:10] but other than constraining the data to the use-case, it's not doing anything else special: it is just ephemeral TXT data for req->resp purposes, and gdnsd has no idea if the challenge data is actually legitimate beyond the length and content constraints above.
[14:44:11] they sound ephemeral-ish, but maybe not quite as automated?
[14:44:23] we could expand the use-case, it just makes some things a little more complicated to validate
[14:44:32] and we do need some kinds of constraints
[14:44:55] can't just store the DNS records in a database and provide an API for changing them? :)
[14:44:59] * Krenair hides
[14:44:59] haha
[14:45:44] sure, so long as you don't mind dropping the next 50K requests that come in while the thread is stalled on a SQL query :)
[14:47:13] HAHAHAHHA
[14:47:16] "detail": "CAA record for *.pinkunicorn.wikimedia.org prevents issuance",
[14:47:23] * vgutierrez throws keyboard
[14:47:38] * vgutierrez goes after the keyboard
[14:47:47] so.. the CAA record is messing with us :)
[14:47:51] why don't you just test a non-wildcard first? pinkunicorn.wikimedia.org itself doesn't need CAA updates, we already authorize LE
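(Editor's note: on the length/content constraint described at 14:41:51 — per RFC 8555, the DNS-01 TXT value is the unpadded base64url encoding of a SHA-256 digest, and 32 hashed bytes always encode to exactly 43 base64url characters, so a validator can be this strict. A sketch:)

    import re

    # 32 bytes -> 44 base64 chars including one '=' pad; 43 once padding is
    # stripped; the base64url alphabet is A-Z a-z 0-9 '-' '_'
    DNS01_TXT_RE = re.compile(r'[A-Za-z0-9_-]{43}')

    def valid_dns01_txt(value: str) -> bool:
        return DNS01_TXT_RE.fullmatch(value) is not None

    print(valid_dns01_txt("x" * 43))      # True
    print(valid_dns01_txt("nope! bad"))   # False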
[14:48:08] somebody got greedy with the tests :)
[14:48:14] but I can test that of course :)
[14:48:40] of course, it would be easier to see and mess with the CAA records if they were in CAA record format in our repo instead of hex bytes
[14:48:49] nah, don't worry
[14:49:18] but that's stalled on authdns CI being able to run new versions of gdnsd so it doesn't fail https://gerrit.wikimedia.org/r/c/operations/dns/+/462693
[14:49:49] the wildcard was my idea
[14:52:38] https://phabricator.wikimedia.org/P7700
[14:52:39] <3
[14:53:40] is that with it configured only for the non-wildcard version?
[14:53:52] indeed
[14:53:54] https://phabricator.wikimedia.org/P7700#45106
[14:54:02] that's the certificate that we got
[14:54:47] ah the reason it fails the CAA check is we're using the staging API?
[14:54:55] nope
[14:55:05] our CAA record doesn't allow LE to issue wildcard certificates
[14:55:11] (right now)
[14:55:15] ... why?
[14:55:39] cause we don't use LE to issue wildcard certificates yet
[14:55:46] ok
[14:55:46] so the safest thing to do is forbid them
[14:56:37] I'm gonna start the certcentral daemon
[14:56:45] it should detect the rsa-2048 certificate as valid
[14:56:52] and request the ec-prime256v1 one
[14:57:48] Oct 19 14:57:35 certcentral1001 certcentral-backend[25902]: Pushing the new certificate for pinkunicorn / ec-prime256v1
[14:57:52] it worked like a charm
[14:58:21] ok
[14:58:24] what else can we do today
[14:58:24] I'd pop a beer open right now.. it's pretty sad I'm in a muslim country
[14:58:31] lol
[14:58:34] :D
[14:58:59] set up an LE prod account vgutierrez ?
[14:59:05] err nope
[14:59:14] I'd like to fix some issues in the certcentral codebase
[14:59:16] ok
[14:59:26] plus it's friday, no sense triggering ratelimiters or whatever
[14:59:31] indeed
[14:59:58] so how should certcentral react if a certificate gets rejected for admin reasons?
[15:00:10] rejected by what now?
[15:00:29] so... LE was rejecting the certificate due to the CAA check
[15:00:53] certcentral was seeing that as a validation error and it was respawning the whole process
[15:01:12] so we were flooding the auth servers with new DNS challenges
[15:01:24] would be good to get notified somehow
[15:01:36] and hitting the LE API hard
[15:01:47] so we should be able to detect that and take the certificate out of the issuance loop
[15:01:52] or at least throttle it
[15:02:53] nice, so I'll create a test with this and I'll work on a patch
[15:03:31] https://gerrit.wikimedia.org/r/468595
[15:05:38] 10Certcentral, 10Traffic, 10Operations: Create production LE account - https://phabricator.wikimedia.org/T207476 (10Krenair) p:05Triage>03Normal
[15:08:33] 10Certcentral: Avoid infinite attempts on issuing a certificate on permanent LE side errors - https://phabricator.wikimedia.org/T207478 (10Vgutierrez)
[15:15:13] Krenair: can you change the parent of https://gerrit.wikimedia.org/r/468595 ?
[15:18:09] done
[15:19:10] 10Traffic, 10Analytics, 10Analytics-Kanban, 10Operations: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10mforns)
[15:20:58] sigh.. I can't merge it
[15:21:07] hm
[15:21:34] try now?
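(Editor's note: on the gripe above about CAA records living in the repo as hex bytes — CAA RDATA is just flags(1) + tag-length(1) + tag + value per RFC 6844, so a small decoder makes the hex readable. A sketch; the sample bytes are constructed here for illustration, not copied from the repo. Wildcard issuance being refused while the plain name works is consistent with the issuewild tag governing wildcards whenever one is present:)

    def decode_caa(rdata: bytes):
        flags = rdata[0]
        tag_len = rdata[1]
        tag = rdata[2:2 + tag_len].decode("ascii")
        value = rdata[2 + tag_len:].decode("ascii")
        return flags, tag, value

    # 0 issue "letsencrypt.org", encoded by hand:
    sample = bytes([0, 5]) + b"issue" + b"letsencrypt.org"
    print(decode_caa(sample))                       # (0, 'issue', 'letsencrypt.org')
    print(decode_caa(bytes.fromhex(sample.hex())))  # same, via a hex round-trip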
[15:28:01] yup
[15:28:05] meanwhile in codfw
[15:28:06] Oct 19 15:27:07 certcentral2001 certcentral-backend[18341]: Unexpected return code spawning DNS zone updater: 1
[15:32:42] vgutierrez: you probably need to arm the keyholder
[15:32:50] it's a manual process as root after any reboot of the instance
[15:32:54] it's armed
[15:32:59] I did it yesterday
[15:33:12] oh, well next thing someone else said was "restart the keyholder-proxy service" I think
[15:33:53] yup
[15:33:55] volans :)
[15:34:06] if that doesn't work, I donno, we're back to debugging
[15:34:12] lol
[15:34:16] I'm around if needed
[15:34:32] I can try some of the debugging I did yesterday on the other
[15:34:59] Oct 19 15:34:50 certcentral2001 certcentral-backend[20885]: acme.messages.Error: urn:ietf:params:acme:error:malformed :: The request message was malformed :: Order's status ("valid") is not acceptable for finalization
[15:35:05] really?
[15:35:18] well
[15:35:24] is the zone updater still returning an error?
[15:35:33] or you're past that now with proxy restart?
[15:35:54] yep, that worked
[15:35:59] now we have another issue :)
[15:36:12] but that will wait till Monday
[15:36:20] I'm listening to _joe_ now
[15:36:29] ok
[15:36:43] don't leave it spamming gdnsd requests if it is, please :)
[15:37:08] yup, I've stopped it
[15:37:21] thanks!
[15:37:25] and puppet is disabled there
[15:37:27] so no worries
[15:37:34] not that I don't think it can handle it, but it is the weekend and it is an untested scenario heh
[15:38:26] sure, there is no need to complicate the weekend
[15:40:34] this test has been pretty useful
[15:40:45] including testing the wildcard issue
[15:40:49] yeah
[15:40:52] that discovered some issues generating the config
[15:41:08] and the smashing of the API/gdnsd
[15:41:13] so it's been useful
[16:24:05] 10netops, 10Operations, 10ops-eqiad: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10RobH) So I just gave @papaul full rights to access CH2, identical to his rights in DA1 (so he should have no issues with access.)
[16:29:42] thanks bblack for that dns fix
[16:32:29] np!
[16:38:22] 10netops, 10Operations: cr2-esams - BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10ayounsi) 05Open>03Resolved a:03ayounsi Thanks for following the doc! That BGP peer has been deactivated as it never replied to our notification.
[16:58:08] 10netops, 10Operations, 10fundraising-tech-ops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10ayounsi) a:03ayounsi From IRC, the Juniper seems to include more than what's mentioned in the description, at least: saiph renaming and betelgeuse removal I can...
[17:05:51] 10Traffic, 10Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) Opened https://github.com/digitalocean/netbox/issues/2531
[17:44:09] 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) This is also the reason we have to have the following route on cr1/2-eqiad `static rout...