[01:47:11] 10Traffic, 10Operations, 10Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (10BBlack) Yeah I think @Bawolff's explanation seems plausible. If there's a DNS hijacking "transparent" proxy which returns the... [05:50:05] 10netops, 10Operations, 10Patch-For-Review: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (10ayounsi) 05Open>03Resolved the re-numbering went as expected, BGP sessions are back up. The failover tests were not done, as the exact links needs to be properly identified on t... [07:47:11] 10Traffic, 10Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) All the power connections have been imported, note that some data can't be imported such as cable length and cable ID. See https://netbox.wikimedia.org/dcim/power-connections/?site=eqsin [08:10:54] 10Traffic, 10Operations: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899 (10MoritzMuehlenhoff) Are we ready to deprecate this now? We have disk health checks in place via Icinga for a while now which would warn us about faulty disks. [08:19:05] morning [08:19:17] morning vgutierrez [08:19:19] Krenair: I've just spotted a warning on the package building process that's going to bite us in the ass [08:19:22] I: dh_python3 pydist:220: Cannot find package that provides josepy. Please add package that provides it to Build-Depends or add "josepy python3-josepy" line to debian/py3dist-overrides or add proper dependency to Depends by hand and ignore this info. [08:19:48] hm [08:20:42] it's pretty weird though [08:20:54] josepy it's a dependency of python3-acme [08:21:03] yeah... [08:21:11] and it's provided as a debian package so.. [08:21:58] 10Traffic, 10Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10ayounsi) About the other links, the only less straightforward interfaces are the server's uplinks as their name can't be derived from the device/port table. I'll keep the task open until at least... [08:22:17] vgutierrez, interesting thing [08:22:27] I just looked at the package I built for testing in deployment-prep [08:22:28] and it's actually installed on certcentral1001 [08:22:31] ii python3-josepy 1.0.1-1~bpo9+1 all JOSE implementation for Python 3.x [08:22:33] it has a dependency on python3-josepy [08:22:42] like directly [08:23:21] really? [08:23:23] Depends: init-system-helpers (>= 1.18~), python3-acme, python3-cryptography, python3-dnspython, python3-flask, python3-openssl, python3-requests, python3-yaml, python3:any (>= 3.3.2-2~), adduser [08:23:42] yep [08:23:49] Depends: python3-acme, python3-cryptography (>= 1.7.1), python3-dnspython, python3-flask (<< 1.0.0), python3-flask (>= 0.12.1), python3-josepy, python3-openssl, python3-requests, python3-yaml, python3:any (>= 3.3.2-2~), adduser [08:24:04] hmm boron package it's slightly different [08:24:19] due to that warning I guess [08:24:55] didn't we change one of the versions of something used to build it all at one point? [08:25:23] debhelper?
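For reference, the fix the dh_python3 warning itself suggests is a one-line override mapping the Python distribution name to the Debian package that provides it — a minimal sketch, assuming python3-josepy (which the Depends comparison above confirms is the providing package):

    # debian/py3dist-overrides
    # format: "<python dist name> <debian package>"; tells dh_python3 which
    # binary package satisfies the josepy requirement, silencing pydist:220
    josepy python3-josepy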
[08:31:24] also [08:31:42] I built it on ubuntu bionic if that changes anything [08:47:18] 10Traffic, 10DNS, 10GitHub-Mirrors, 10Operations, 10Release-Engineering-Team (Kanban): Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10hashar) [08:47:55] 10Traffic, 10DNS, 10GitHub-Mirrors, 10Operations, 10Release-Engineering-Team (Kanban): Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10hashar) [10:07:47] 10netops, 10Cloud-Services, 10Operations: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10aborrero) Agreed. [10:19:32] vgutierrez, so [10:19:43] tasks for the open commits [10:19:55] investigate weird packaging warning [10:20:18] get brandon to +1 the authdns commit maybe? [10:20:27] yup [10:26:00] 10Certcentral: Investigate weird packaging warning - https://phabricator.wikimedia.org/T207371 (10Krenair) [10:27:01] 10Certcentral: Add simple script for account creation - https://phabricator.wikimedia.org/T207372 (10Krenair) [10:27:25] 10Certcentral, 10Patch-For-Review: Add simple script for account creation - https://phabricator.wikimedia.org/T207372 (10Krenair) a:03Krenair [10:28:44] 10Certcentral: Remove maximum version dependencies - https://phabricator.wikimedia.org/T207373 (10Krenair) [10:28:50] 10Certcentral: Remove maximum version dependencies - https://phabricator.wikimedia.org/T207373 (10Krenair) a:03Krenair [10:31:12] 10Certcentral: Check for expired/outdated certs in the main loop - https://phabricator.wikimedia.org/T207374 (10Krenair) [10:32:18] vgutierrez, uh https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/457933/8/certcentral/cli.py is a certcentral commit [10:32:35] it's really geared more towards the idea that we would make certcentral available for others to use [10:33:21] it just uses the existing acme_requests.ACMEAccount.save function which is incompatible with our puppet setup [10:33:41] maybe we can provide both appraoches? [10:33:45] *approaches [10:34:33] dump the account details in stdout (that would be suitable for our puppetization), or persist in in disk with the .save() method [10:35:17] so provide a param to call the save function? [10:35:20] that wikimedia doesn't use [10:35:23] indeed [10:35:35] or the other way around [10:35:41] I don't care about what's the default [10:36:01] but I think it could be useful for both words (WMF and outside) [10:36:06] *worlds [12:25:51] vgutierrez, so I fixed that, probably [12:26:11] when we get the authdns commit in we should decide which domain to test on [12:26:31] I don't know if there's any point using a test domain for it, should be safe to try to get a cert for a domain we'll actually use right? [12:56:06] Krenair: I'd do the test with pinkunicorn.wm.o [12:56:50] eh [12:56:51] ok [12:57:42] it won't hurt/break anything and it's in use [12:57:58] On an entirely unrelated note, I hate git submodules. [12:58:12] Mainly when other people are changing ones I don't care about [13:01:58] vgutierrez, tempted to also throw in *.pinkunicorn [13:05:25] vgutierrez, yay or nay? [13:06:04] agreed re: submodules, I've been hating them slightly less now that git grep can be configured to recurse in submodules [13:07:24] godog, oooh how? [13:08:11] git config submodule.recurse true # git >= 2.14 [13:08:23] ok, the authdns+CC commit! 
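A quick sketch of the submodule tip godog gives above (git >= 2.14): submodule.recurse flips the default for commands that take --recurse-submodules, git grep included.

    # one-off
    git grep --recurse-submodules 'some_pattern'
    # or per repository, so plain git grep recurses from then on
    git config submodule.recurse true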
[13:09:26] godog, I have 2.17 [13:09:35] godog, also <3 [13:11:30] heheh you're welcome Krenair [13:11:54] 1) Can we get rid of the hardcoded IPs in hieradata/common.yaml and use the same pattern for ferm+ssh that the authdns server role uses for itself for authdns-update? [13:12:21] that would be ideal wouldn't it [13:13:08] [13:13:42] basically, replace "profile::authdns::certcentral_target::certcentral_hosts" with "certcentral_servers" that's just an array of the two hostnames. [13:14:17] does it need to be global? [13:14:21] i.e. non-namespaced? [13:14:33] good question! [13:14:45] it'll only get used in that class [13:14:47] but if it's not, it doesn't need to be common.yaml either [13:14:53] so pick a path :) [13:14:56] true [13:15:40] so yeah currently it's even used like a proper profile hieradata param: [13:15:43] $certcentral_hosts=hiera('profile::authdns::certcentral_target::certcentral_hosts') [13:16:46] so we should be able to place it there I think? in hieradata/common/profile/authdns/certcentral_target.yaml ? [13:17:32] so yeah I guess try moving the data there, and making it an array of the two hostnames [13:18:18] and then the srange line, in the other similar case that's using hostnames instead of IPs, looks like: [13:18:23] srange => "(@resolve((${authdns_ns_ferm})) @resolve((${authdns_ns_ferm}), AAAA))", [13:18:32] where authdns_ns_ferm is the result of a join of the array of hostnames [13:18:58] hieradata/role/common/authdns/server.yaml ? [13:19:34] yeah, that was my example, which uses a different hostname list to authorize ssh between authdns servers themselves [13:22:13] just trying to figure out this security::access::config thing [13:22:31] on to point 2, which is the whole thing I can never remember all the arguments or context on, about the 3 different lists of nameservers! [13:22:44] oh, rewind! [13:23:02] what is security::access::config again? [13:23:12] keyholder? [13:23:15] no [13:23:23] the sudo rules? [13:23:26] the thing that allows us to let certcentral into a labs instance without it being in the relevant ldap group [13:23:45] actually [13:23:51] I was just using gdnsd inside labs to test this [13:23:55] make it labs only? 
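Pulling together point 1 as discussed above — profile-namespaced hieradata array, a join, and the @resolve srange — a sketch assuming the class keeps its current name; the second hostname is illustrative, not taken from the actual patch:

    # hieradata/common/profile/authdns/certcentral_target.yaml
    # profile::authdns::certcentral_target::certcentral_servers:
    #   - certcentral1001.eqiad.wmnet
    #   - certcentral2001.codfw.wmnet   (illustrative)
    class profile::authdns::certcentral_target (
      $certcentral_servers = hiera('profile::authdns::certcentral_target::certcentral_servers'),
    ) {
      $certcentral_ferm = join($certcentral_servers, ' ')
      ferm::service { 'certcentral-ssh':
        proto  => 'tcp',
        port   => '22',
        # resolve both A and AAAA, as in the authdns-to-authdns rule quoted above
        srange => "(@resolve((${certcentral_ferm})) @resolve((${certcentral_ferm}), AAAA))",
      }
    }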
:) [13:24:05] doesn't really solve the problem [13:24:06] but [13:24:15] going forward I expect to have deployment-prep talk to designate instead of my testing gdnsd instance [13:24:18] so I can just remove that resource [13:24:35] ok great [13:24:43] also I just found that we seem to be able to use domains in there (or at least toollabs does so for clushuser) instead of IPs [13:25:01] anyways, so on to point 2 about the lists of nameservers [13:25:27] so there's a canonical list of the "live nameservers" already in global hieradata, which authdns itself uses to find its own set of servers to update data on, etc [13:26:00] which is hiera('authdns_servers') [13:26:08] we are aware [13:26:19] and then there's the 2x lists in the current patch [13:26:27] one of which matches that and one of which is the ns[012] hostnames [13:26:47] vgutierrez, be nice :) [13:27:34] vgutierrez: sure go ahead, explain it to me :) [13:28:03] we did look into referencing the existing list [13:28:10] unfortunately it's a list rather than a string [13:28:24] if it were a string we could have hiera do something like %{hiera('authdns_servers')} [13:28:27] but there's this annoying line in the docs [13:28:41] that valentin found - https://puppet.com/docs/hiera/3.2/variables.html#the-hiera-lookup-function [13:28:55] 'The result of the lookup must be a string; any other result will cause an error.' [13:29:00] thx Krenair [13:29:21] so [13:29:36] how is that possibly true? [13:30:05] that limitation only applies to interpolation in hiera data files [13:30:08] not to puppet code [13:30:10] we have structures/arrays all over our hieradata that are fetched with hiera() [13:30:22] I had that reaction at first [13:30:32] so I guess we have to pivot through puppet code to get that working [13:30:41] without actually duplicating the list [13:30:42] it's when you go some_key: "%{hiera('authdns_servers')}" [13:30:43] oh, you mean you wanted to put the hiera call inside your new hieradata? [13:30:50] indeed [13:31:04] it seemed the simplest way to me [13:31:14] but again I was wrong [13:31:25] so since 'dns-01' and its keys are explicit anyways, you can just kill both of those nameserver lists in the patch's hieradata [13:31:31] and move the hiera call right into the config template too [13:31:52] _joe_ discouraged that in one CR [13:31:54] so then you have to have a dns-01 config and you can't use our puppet module without it [13:32:05] that's not true [13:32:21] it still relies on the top-level bit being defined in hieradata or it doesn't configure dns-01 [13:32:51] oh it looks like we do have checks in there for challenges being empty etc. [13:32:56] and missing dns-01 [13:33:00] ok [13:33:21] the other layer of this onion, aside from "at least one of these lists shouldn't duplicate the authdns_servers data" [13:33:33] is whether the two lists should be identical for push/verify [13:34:13] I honestly can never remember what I said the time before or what was said the time before every time this topic comes up (I think we're on like round 4 of this now, at least) [13:34:29] but it seems to me at present like they should be the same list from the same source [13:34:43] so we'd do [13:34:53] yup, no problem with that [13:35:04] because ns[012] are virtual ideas that can be re-routed at any time, and may not in any case route the same for CC-verify that they do for the ACME provider.
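The restriction quoted above only bites when hiera() is interpolated inside a data file, which is exactly what was attempted; a sketch of the dead end (key name illustrative):

    # hieradata: fails, because authdns_servers is an array and
    # interpolation tokens in data files must resolve to strings:
    #   "The result of the lookup must be a string; any other result
    #    will cause an error."
    profile::certcentral::sync_dns_servers: "%{hiera('authdns_servers')}"

    # the same lookup from a manifest is fine, arrays included:
    #   $servers = hiera('authdns_servers')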
[13:35:08] - sync_dns_servers: <% @challenges['dns-01']['sync_dns_servers'].each do |dns_server| %> [13:35:08] + sync_dns_servers: <% scope.function_hiera(['authdns_servers']).each do |dns_server| %> [13:35:18] or was that the old way of doing function calls in templates, I forget [13:35:32] well I don't know about the syntax detail there without looking myself [13:36:02] but yes, that's the idea, for both lists, and then kill just those sub-stanzas in the patch's hieradata that we don't need anymore (but leave top-level dns-01 definition and its configured script path) [13:36:29] from the previous CR... https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441991/64/modules/profile/manifests/certcentral.pp@18 [13:36:30] well, the only other thing is, I think the linter will still ding you for directly referencing a global hieradata like that, right? [13:36:36] "Please don't suggest hiera calls from templates. [13:36:39] we just ignore that? [13:36:52] I think this is why the pattern that we ran into in the earlier commit [13:36:59] to do a realm.pp matching hiera [13:37:18] realm.pp also has: [13:37:22] $certcentral_host = hiera('certcentral_host') [13:37:23] vgutierrez, I think that was because I was using it to evade one of the puppet style checks [13:37:31] which pulls global hieradata up to a global variable [13:37:46] it seems like authdns_servers fits this same pattern [13:38:05] so I'd promote it to a global in realm.pp, and then just use it as a regular global var in the templates [13:38:13] perfect [13:38:52] I don't know why that's better, but I think it meets the rules and has been qausi-ok'd as a solution anyways [13:38:53] 10Traffic, 10Operations: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10faidon) Awesome, thanks! No field for Cable IDs or labels is a bit disappointing :( It doesn't look like we can do it with a custom field either, but I'm not 100% sure. We should file an upstream... [13:39:06] s/qausi/quasi/ heh [13:39:39] bblack, so we drop sync_dns_servers from hiera and pull that into the template by accessing the global and not calling hiera? [13:40:56] yes, and same thing for the other list for verify [13:41:29] I think it makes sense for CC-the-software to have two lists, because they may be needed elsewhere, but I think for us we just want identical syncd lists here [13:42:52] the ns[012] routing is fairly static, but they can route differently depending on the source network, and later on at some point ns[012] would be replaced by anycast which makes the situation even vague-er [13:43:33] basically you can't actually know you're verifying against the same gdnsd instances that LE hits. But you can know that you've verified all the ones you claimed to push to, which if correctly configured should affect all public queries [13:44:57] I have a meeting soon, but let me drop in the Bomb while I'm here: [13:45:26] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459809/19..20 like this? [13:45:56] almost [13:46:03] authdns_servers twice [13:46:11] we don't need the external_fqdns right now [13:46:16] certcentral ends up being a poor name choice, because it's already the name of a commercial trademarked software that overlaps in functionality confusingly, which means user confusion and/or legal woes [13:46:20] https://www.digicert.com/mpki/ [13:46:24] vgutierrez, what?
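Setting the naming question aside for a moment, a sketch of the realm.pp pattern settled on above — promote the key to a global, mirroring the quoted certcentral_host line, and read it from the template as a plain variable rather than a hiera() call; the template fragment and the scope[] access form are assumptions, not the final patch:

    # manifests/realm.pp
    $authdns_servers = hiera('authdns_servers')

    # config.yaml.erb (illustrative fragment)
    sync_dns_servers:
    <%- scope['::authdns_servers'].each do |dns_server| -%>
      - <%= dns_server %>
    <%- end -%>
    validation_dns_servers:
    <%- scope['::authdns_servers'].each do |dns_server| -%>
      - <%= dns_server %>
    <%- end -%>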
[13:46:43] so we're going to have to rename things, but I think we can stall doing it a little to get through the current stuff and make it work, and then deal with that next week or whatever [13:46:45] Krenair: set both sync_dns_servers and validation_dns_servers to ::authdns_servers [13:46:54] 10netops, 10Cloud-Services, 10Operations: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10faidon) 05Open>03Resolved a:03faidon Perfect! As far as I can see, there a few pending tasks, but are or should probably be covered in other tasks. Specificall... [13:46:55] bblack, FFS [13:47:03] yeah, I know :P [13:47:18] bblack: give us a name and it will be done :) [13:47:34] but let's ignore that a few more days and get the puppet shit working [13:47:38] vgutierrez, :| [13:47:40] just putting it out there [13:47:50] I might have not given it the best name but jeez [13:47:53] Krenair: I'm not gonna argue about naming honestly [13:48:06] I like the name you chose [13:48:10] but I see bblack's point [13:48:16] sure and I'm willing to change [13:48:49] it's a PITA, but it's not the end of the world and we know it (REM) [13:49:41] yeah [13:49:51] I'm just not thrilled about you saying bblack: give us a name and it will be done :) [13:49:54] esp w/ digicert being one of our cert vendors, it's hard for the foundation to argue innocent ignorance later heh [13:49:57] did you wanna talk about that first? [13:50:14] talk about it later, worry about the puppet stuff now. I just wanted you to be aware :) [13:50:28] Krenair: oh sorry, I didn't want to bypass you by saying that [13:50:36] it's just I don't care about the name [13:50:45] it's fine by me if you get to pick the new one too [13:51:34] I'm not saying I won't accept one from bblack, it's more the principal [13:51:51] anyway [13:52:03] Krenair: set both sync_dns_servers and validation_dns_servers to ::authdns_servers [13:52:10] we can consensus something anyways. I'm pretty bad at naming as well, and tend to name things functionally, which is exactly how you end up with such a name. [13:52:37] f.aidon was suggesting in another chat that this is a good reason to pick abstract names for projects with no semantic meaning :) [13:52:41] so [13:52:51] was I? [13:52:54] I don't remember that! [13:52:57] but sure, that sounds good :) [13:53:07] yesterday, the discussion of LE's original naming, which was right after bringing this up [13:53:17] we'd be changing the list of verification servers to use the same IPs as are used for SSH [13:53:17] project chocolate etc [13:53:21] ah [13:53:25] does gdnsd listen on those? [13:53:28] it does [13:53:49] sorry, I've been feverish for a couple of days now, can't say that I'm very focused on what I'm saying :P [13:53:53] it does because that's how we icinga-check them independently of ns[012] routing, too [13:54:08] alright [13:56:30] bblack: another CR for you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/468320 [13:56:47] about allowing the same cert to be used by multiple proxies. [13:56:55] bblack, vgutierrez: updated the authdns-CC commit [13:56:58] running pcc against PS21 [13:56:59] I'm not sure how much sense it makes for LE certs [13:57:12] bblack, is it okay if I open a task about the naming? [13:57:47] I guess, scoped to doing the work of the actual rename? [13:58:35] what about deciding the name? 
[13:59:03] that too, I'm just saying debating naming isn't very tasklike, we probably just hash that out in IRC and then the legwork is more tasky [13:59:10] ok [13:59:14] but whatever, it can be a bullet point there too that it has to happen :) [13:59:35] looking good: https://puppet-compiler.wmflabs.org/compiler1002/13052/certcentral1001.eqiad.wmnet/ [14:00:02] btw [14:00:11] will we have to rename the VMs? [14:00:37] I think it makes sense to respawn them with proper names [14:00:52] and it's a pretty fast thing [14:01:17] BTW, I'm not pretty fond of /etc/certcentral to drop an executable file [14:01:40] /usr/local/bin feels way better [14:02:34] 10Certcentral: Rename this project - https://phabricator.wikimedia.org/T207389 (10Krenair) [14:02:46] true [14:03:56] good point [14:04:08] PS22 [14:05:33] 10Certcentral: Rename this project - https://phabricator.wikimedia.org/T207389 (10Vgutierrez) p:05Triage>03Normal [14:07:11] right [14:07:29] I'm off to campus for a bit, will be back later [14:07:36] ok [14:08:09] we will be in meetings since 16 till 18 WEST [14:08:24] bblack even more :) [14:34:44] 10Certcentral: Rename the Certcentral project - https://phabricator.wikimedia.org/T207389 (10Aklapper) [14:39:18] vgutierrez: so we should be good to deploy the patch? looking again [14:41:52] vgutierrez: I have one final thought: we don't need gdnsdctl execution as root technically. We could limit the scope of possible future sec problems here a bit by having sudo use the "gdnsd" user the dns daemon runs as to execute gdnsdctl. [14:42:16] vgutierrez: assuming that can be made to work with sudo, given 'gdnsd' is likely a nologin user [14:43:07] nologin meaning "has a shell of /bin/false" [14:43:25] I can try it manually from my user and see how sudo works out [14:49:01] yes, it works [14:49:08] I'll amend right quick [14:51:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/459809/22..23/modules/profile/manifests/authdns/certcentral_target.pp [14:51:32] I'm kind of around for a it [14:51:34] bit [14:52:09] bblack, uh you'll need to change the command it runs too then [14:52:17] so it does sudo -u gdnsd instead of just sudo [14:52:33] oh right [14:54:06] anyways, I did manually test. I temporarily pulled ops out of the nopasswd root stuff (so I don't override with blanket access), and stuff that rule in for user bblack, and was able to do: [14:54:10] bblack@authdns1001:~$ sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 example.com 0123456789012345678901234567890123456789012 [14:54:13] info: ACME DNS-01 challenges accepted [14:54:49] yeah [14:55:23] at least this way when someone finds a buffer overflow in gdnsdctl it will only affect the dns daemon and not take root with it :) [14:55:48] although [14:55:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/459809/22..24/ [14:56:09] `dig _acme-challenge.example.com TXT @authdns1001.wikimedia.org` doesn't return anything [14:56:16] oh wait refused [14:56:18] it's not supposed to, it's ok [14:56:37] if, before the default 10 minute timer on that expired, we defined an /etc/gdnsd/zones/example.com zone and hit reload-zones though [14:56:38] I assume it refuses for anything for a domain it doesn't have a zone file on disk for? 
[14:56:41] the challenge would suddenly appear too [14:56:45] yes [14:57:10] 'only' the dns daemon :D [14:57:23] well there's little way to effectively avoid that :) [14:57:27] yeah [14:58:02] so I think we're done [14:59:40] I think so too [14:59:44] meeting time, but +1'd :) [15:00:07] nice :D [15:01:25] I should edit the response message on the CLI there to just say [15:01:28] Challenge accepted! [15:01:55] :D [15:02:12] late as usual, making one more coffee before I join [16:01:23] back soon [16:28:48] back [16:28:50] vgutierrez, we doing it tomorrow? [16:29:46] good Q [16:29:58] Krenair: friday morning? [16:30:09] I could push it now too, but I'd really like him to be here to look at all related things as they happen, so it depends whether he's done for the day now basically [16:30:13] timezones are awful! [16:30:31] yeah they're awful :) [16:30:45] I'm still here for another 30minutes though [16:31:07] want to give it a stab and see if it seems to basically apply without huge problems? [16:31:19] we can always save testing a cert for later, and I can handle rollback cleanup if necc [16:31:45] sure, let's merge this today, and check tomorrow morning cert issuance [16:31:50] ok [16:32:00] playing with authdns servers on Fridays is not the best idea [16:32:01] :) [16:32:01] so [16:32:11] since the puppetization hits the authdns, I'm going to puppet-disable those ahead [16:32:29] let's see if it rolls on cc1001 and such successfully first, and if it does I'll release those one at a time and see if they work out ok too [16:33:27] ok [16:33:29] they're all disabled now [16:33:43] I'll do the +2 then [16:33:49] go for it! [16:36:45] and it's merged [16:36:47] running puppet in cc1001 [16:36:52] between the CC patch and the authdns patch we have 92 total PS count [16:36:56] what could possibly go wrong? :) [16:40:35] puppet seems happy [16:40:43] but I've spotted a small bug in the config.yaml [16:40:43] accounts: [16:40:43] - id: 6e01c693ed6e9d9a6b5930923ecef104 [16:40:43] directory: https://acme-staging-v02.api.letsencrypt.org/directory [16:40:46] certificates: {} [16:40:52] that naughty tab in certificates [16:42:28] I don't see a tab? [16:42:35] yeah I've got spaces here [16:42:41] but it shouldn't be indented [16:42:44] sorry, the spaces [16:42:45] yeah [16:42:57] one sec [16:42:58] easy fix anyways [16:43:01] I'm seeing the .erb and it doesn't makes sense... [16:43:06] trying one of the authdns to see how it flies there [16:43:28] probably the <%- / -%> related issues and how they can mess with whitespace/newlines? [16:43:42] that's what I'm thinking [16:44:11] oh I see [16:44:26] after the directory line [16:44:34] there's spaces before the next if block [16:44:36] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, secret(): invalid secret ssh/authdns-certcentral.pub at /etc/puppet/modules/profile/manifests/authdns/certcentral_target.pp:12:20 on node authdns1001.wikimedia.org [16:44:44] so missing secret, and also why is the public key in secret()? [16:44:46] then it does nothing more before adding certificates: [16:45:03] I didn't catch that before [16:45:05] bblack, originally it was hieradata but apparently all the public keys are in secret? 
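Recapping the sudo design from earlier in the afternoon: the whole thing rests on one sudoers grant, which surfaces verbatim later in the log when the password prompt gets debugged (the drop-in path is an assumption):

    # /etc/sudoers.d/certcentral (assumed path; rule as quoted below)
    # certcentral may run exactly this command, as gdnsd, with no password
    certcentral ALL = (gdnsd) NOPASSWD: /usr/bin/gdnsdctl -- acme-dns-01 *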
WTF [16:45:11] so we moved it for consistency [16:45:19] authdns-certcentral.pub [16:45:26] I guess ok if it seems to be standard [16:45:27] in pcc that got changed to authdns_certcentral.pub [16:46:02] is in the private repo as authdns_certcentral and authdns_certcentral.pub [16:46:04] for manual-deploy TLS we tend to only put privkeys in secret and pubkeys in the main puppet repo, but if ssh keys seem to go all-secret I'm ok with following whatever the pattern is [16:46:40] so does the repo need fixing or the puppet code? [16:46:49] I mean I guess puppet code is better at this point [16:47:45] puppet code [16:47:52] damn [16:47:54] secret('ssh/authdns-certcentral.pub') [16:48:00] we missed the whole change to keyholder there [16:48:12] I'm making the CR now [16:48:15] ok [16:48:40] I think it would've been easier to build the hash in the manifest and then ordered_yaml it into a file, rather than this template [16:49:07] but I've uploaded https://gerrit.wikimedia.org/r/468358 which might hopefully fix our config indentation problem [16:49:29] by changing <% end -%> to <% end %> when it is indented [16:49:52] https://gerrit.wikimedia.org/r/#/c/468359/ [16:50:02] alternatively we could unindent all blocks but then we'd have a mess [16:50:47] vgutierrez: yeah shove the patch through and we'll see if it unbreaks the things [16:50:58] assuming jerkins is happy [16:50:58] ack [16:51:10] it is [16:51:23] [ 2018-10-18T16:51:08 ] INFO: Compilation failed for hostname authdns1001.wikimedia.org in environment prod. [16:51:28] pcc nope [16:53:07] ? [16:53:09] I guess that means the labs/private secret paths don't match the prod secret paths? [16:53:23] that's prod.. sigh [16:53:26] https://puppet-compiler.wmflabs.org/compiler1002/13066/authdns1001.wikimedia.org/ [16:53:26] https://puppet-compiler.wmflabs.org/compiler1002/13066/ <- I see no-op? [16:53:39] yep [16:54:17] I'm confused, now the patch is good or the patch fails? [16:54:45] prod fails.. so I guess pcc fails to see the changes? [16:54:51] s/see/detect/ [16:55:04] how did you detect the fail on the unmerged patch? [16:55:19] I see jenkins +1 and pcc no-op [16:55:24] yeah [16:55:30] pcc output says prod is failing [16:55:31] not the change :) [16:55:35] I misread the output [16:55:39] ok, expected!
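The mechanics of that indentation bug, sketched: spaces in front of an ERB tag are emitted literally, and a trailing -%> then trims the newline so those spaces glue onto the next output line; keeping the plain %> leaves only a harmless near-empty line, which is what the fix does (template fragment assumed):

    # before: four literal spaces are emitted, -%> eats the newline,
    # so the next output line renders as "    certificates: {}"
        <% end -%>
    certificates: {}

    # after: %> keeps the newline, the stray spaces land on their own line
        <% end %>
    certificates: {}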
[16:55:40] let's try [16:55:41] indeed [16:55:43] merging it [16:56:42] done [16:57:09] I get a non-failing no-op run, trying a real run [16:58:11] 10netops, 10Analytics, 10Analytics-Kanban, 10Operations: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10fdans) [16:58:33] Krenair: they do match, but we didn't run pcc against authdns before [16:58:33] success, so hitting the other two [16:58:37] bblack: nice [16:59:05] one thing we could try (in a couple mins when these agent runs finish) [16:59:26] is manually executing the challenge script from cc1001, without CC driving it, provisioning some fake challenge for a real domain [16:59:44] like foo.wikimedia.org 0123456789012345678901234567890123456789012 [16:59:49] ah [16:59:57] and see that end of the integration run ok [16:59:59] yeah [17:00:06] [done with puppetization on authdns] [17:00:07] we were thinking pinkunicorn [17:00:18] sure, but not a real cert yet, just test the script [17:00:35] and for tomorrow's full test of CC I've got a commit that adds config for pinkunicorn and *.pinkunicorn: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468315/ [17:01:32] I can stab at it right quick [17:01:54] Krenair: https://puppet-compiler.wmflabs.org/compiler1002/13069/certcentral1001.eqiad.wmnet/ [17:01:58] Krenair: your change looks good [17:02:15] is --remote-servers comma separated or one per server or? [17:02:27] space separated iirc [17:02:29] it adds an empty line but is not pretty bad :) [17:03:01] how does it even parse that? [17:03:11] parser.add_argument('params', metavar='PARAM', nargs='+', help='Parameter to pass to the remote command.') [17:03:14] parser.add_argument('--remote-servers', metavar='REMOTE_SERVER', nargs='+', required=True, help='Remote servers to send command to.') [17:03:27] that reads to me like: [17:03:47] --remote-servers foo bar baz challengedomain challengedata <- where does one list end and another begin? [17:03:52] maybe it only works the other way around [17:04:00] we always pass --remote-servers in afterwards [17:04:03] ok [17:04:17] 10Traffic, 10DNS, 10GitHub-Mirrors, 10Operations, and 2 others: Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10Reedy) [17:04:44] no wait [17:04:51] huh [17:05:00] maybe that was my original implementation [17:05:28] it does seem to work that way, but something doesn't quite go right with all the ferm/ssh/sudo stuff I think [17:05:31] it currently seems to do --remote-servers a b c -- challenge.domain challenge_validation [17:05:45] errr I didn't change that [17:05:47] :) [17:05:48] root@certcentral1001:~# sudo -u certcentral /usr/local/bin/certcentral-gdnsd-sync.py pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 --remote-servers authdns1001.wikimedia.org authdns2001.wikimedia.org multatuli.wikimedia.org [17:05:52] authdns1001.wikimedia.org ['pinkunicorn.wikimedia.org', '0123456789012345678901234567890123456789012'] [17:05:59] but hangs there, presumably waiting for ssh to go through [17:06:02] checking... [17:06:10] we need to arm keyholder there [17:06:12] keyholder? [17:06:15] nope? [17:06:19] vgutierrez, I think I changed it at some point [17:06:41] bblack: careful.. keyholder should only allow certcentral and not root to use that ssh agent [17:06:58] yeah but I did sudo the command down to CC? [17:07:04] oh ok, sorry [17:07:06] without SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh.... 
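A self-contained sketch of the argparse question above: two nargs='+' lists are only unambiguous when the positionals come first, or when a -- terminator fences them off (both orderings show up working later in the log):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('params', metavar='PARAM', nargs='+',
                        help='Parameter to pass to the remote command.')
    parser.add_argument('--remote-servers', metavar='REMOTE_SERVER', nargs='+',
                        required=True, help='Remote servers to send command to.')

    # positionals first: unambiguous, the option consumes the remainder
    parser.parse_args(['example.org', 'token', '--remote-servers', 'a', 'b'])

    # option first: '--' stops the greedy nargs='+', the rest is positional
    parser.parse_args(['--remote-servers', 'a', 'b', '--', 'example.org', 'token'])

    # option first without '--': --remote-servers swallows everything and
    # argparse errors out ("the following arguments are required: PARAM")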
lol [17:07:16] you're not using keyholder though ; [17:07:17] volans: that's set by the script [17:07:17] ;) [17:07:22] hardcoded? [17:07:27] yes [17:07:31] ah... [17:07:33] ewww [17:07:36] after my output above, eventually it said: [17:07:38] Could not create directory '/nonexistent/.ssh'. [17:07:39] Password: [17:07:50] maybe it's my root -> sudo part that confuses it though [17:07:51] maybe it was asking for confirmation of fingerprint [17:07:54] but I don't see how else to easily test it [17:07:55] for the remote host [17:08:00] oh, did we arm keyholder yet? [17:08:54] should I do that? [17:09:06] trying [17:09:08] on my way [17:09:09] oops [17:09:11] :) [17:09:11] * vgutierrez holds [17:09:18] of course, I don't know the passphrase [17:09:23] bblack: pwstore [17:09:23] meant to be in ops pw repo? [17:09:26] ok [17:09:26] it is [17:09:31] as all keyholder ones [17:09:42] checking [17:09:42] volans: you trust me too much [17:09:49] vgutierrez: no I checked [17:09:51] hahahahaha [17:09:53] it's always nice to have a clueless manager run through this anyways, it's good validation [17:09:53] <3 [17:10:00] vgutierrez: you didn't update the https://wikitech.wikimedia.org/wiki/Keyholder page though ;) [17:10:06] lol [17:10:20] volans: I was waiting to see if that actually worked [17:10:30] <3 [17:11:11] so keyholder arm worked, but my attempt at root sudo -> certcentral failed. it might still just be that that's not a possible way to test this [17:11:12] * volans about to enter a meeting, will soon lose track [17:11:20] bblack: restart the proxy [17:11:23] keyholder-proxy [17:11:30] hrm ok [17:11:32] is a known bug for first arm/installation [17:11:42] so usually I do that before even starting to debug [17:11:49] just to avoid to lose time on nothing [17:12:03] still hanging [17:12:05] and the best way to test is [17:12:05] I see in ps: [17:12:06] keyhold+ 13288 1 0 17:11 ? 00:00:00 python3 /usr/local/bin/ssh-agent-proxy --bind /run/keyholder/proxy.sock --connect /run/keyholder/agent.sock [17:12:10] certcen+ 13299 13298 0 17:11 pts/2 00:00:00 /usr/bin/ssh -l certcentral authdns1001.wikimedia.org /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [17:12:36] sudo -u certcentral SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh $USER@$HOST [17:12:40] ^ [17:12:45] volans beat me to it [17:12:48] that way you test that keyholder works fine with the permissions [17:12:49] and all [17:12:57] if that works, then is the script :D [17:13:33] yeah that doesn't work either [17:13:46] what does it do? just sit there? [17:13:47] root 13719 8603 0 17:13 pts/2 00:00:00 sudo -u certcentral SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh certcentral@authdns1001.wikimedia.org [17:13:49] what if you add -vvv ? [17:13:50] certcen+ 13720 13719 0 17:13 pts/2 00:00:00 ssh certcentral@authdns1001.wikimedia.org [17:13:53] yeah just sits there [17:13:59] ferm? [17:14:07] auth keys? [17:14:10] yeah hangs at [17:14:11] debug1: Connecting to authdns1001.wikimedia.org [2620:0:861:2:208:80:154:134] port 22. [17:14:17] but I looked at the ferm rules, hmmmm [17:14:19] checking again [17:14:21] smells fw [17:14:31] what about the other network access controls around? [17:14:44] root@authdns1001:~# iptables -vnL|grep 10.64.32.26 0 0 ACCEPT tcp -- * * 10.64.32.26 0.0.0.0/0 tcp dpt:22 [17:14:46] some of the routers do restrict some traffic right?
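Condensing the keyholder checklist from the exchange above into one recipe (arm syntax per the wikitech Keyholder page; the proxy restart works around the known first-arm quirk volans mentions):

    # arm the agent with the passphrase from pwstore, then bounce the proxy
    keyholder arm
    systemctl restart keyholder-proxy

    # verify the unprivileged user can authenticate through the proxy socket
    sudo -u certcentral SSH_AUTH_SOCK=/run/keyholder/proxy.sock \
        ssh certcentral@authdns1001.wikimedia.org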
root@authdns1001:~# ip6tables -vnL|grep 32:26 0 0 ACCEPT tcp * * 2620:0:861:103:10:64:32:26 ::/0 tcp dpt:22 [17:15:05] the ferm->iptables is there, but no accept is being logged in iptables stats either [17:15:06] bblack: any block at router level? [17:15:10] something's not getting somewhere for sure heh [17:15:14] possibly [17:15:46] root@certcentral1001:/var/log# nc -zv authdns1001.wikimedia.org 22 [17:15:47] authdns1001.wikimedia.org [208.80.154.134] 22 (ssh) open [17:16:08] it doesn't look like FW issues [17:16:13] I see the counts in iptables from yours, hmmm [17:16:21] maybe it's an FW issue only for v6? [17:16:35] yup [17:16:48] in the other window, the test ssh eventually succeeded, after it fell back to ipv4 [17:17:05] * volans meeting ttyl [17:17:05] so we missed ipv6 in the ferm puppetization [17:17:11] no, it's there [17:17:16] uh... [17:17:20] I mean, there must be a router rule that's only blocking this port 22 for v6, but not v4 [17:18:14] yeah no the ip6tables rule above checks out [17:18:24] it should be fine unless another firewall is blocking stuff [17:18:32] hmmm [17:18:47] that aside, even if I edit the integration script to use ssh -4, there's still a password prompt [17:19:08] so two issues then [17:19:46] but this does work and gives me a certcentral-uid prompt: [17:19:46] root@certcentral1001:~# sudo -u certcentral SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -4 certcentral@authdns1001.wikimedia.org [17:20:03] certcentral@authdns1001:/$ [17:20:10] so I can ignore v6 for now and focus on that [17:20:27] thought you just said you got a password prompt? [17:20:43] or was that with the whole command? [17:21:30] this part works too: [17:21:32] certcentral@authdns1001:/$ /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [17:21:36] info: ACME DNS-01 challenges accepted [17:21:45] so doing it manually with the sequence of two separate commands I just pasted, works [17:22:00] alex@alex-laptop:~$ dig _acme-challenge.pinkunicorn.wikimedia.org TXT @authdns1001.wikimedia.org +short [17:22:00] "0123456789012345678901234567890123456789012" [17:22:45] cool [17:22:58] just need to sort out why I can't test the script, at least from a sudo from root [17:25:32] so you have to add -4 to the ssh command for now, ok [17:25:38] it's still quite possible this all works from the actual CC user [17:25:52] and that it's just the extra bits about sudoing down from root that are breaking it [17:25:57] well [17:26:03] sudo -u certcentral -i [17:26:19] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -4 certcentral@authdns1001.wikimedia.org /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789013 [17:26:20] has a /bin/false shell, can't :) [17:26:26] ugh [17:26:38] I can fix that with another argument though [17:27:02] -s /bin/bash ? [17:27:05] sudo -u certcentral -s [17:27:38] not sure how it figures out what shell to use but I don't need to provide /bin/bash to make it work [17:28:07] but then it doesn't get the full environmental switch like a login either [17:28:15] e.g.
$HOME is still /root/ [17:28:35] gotta go now [17:28:38] I'll check the log [17:29:14] yeah don't worry, I can stab at this for a while on my own [17:29:16] thanks for the help [17:29:19] root@certcentral1001:~# sudo -u certcentral -H -s /bin/bash [17:29:27] certcentral@certcentral1001:/root$ SSH_AGENT_SOCK=/run/keyholder/proxy.sock /usr/bin/ssh -4 -l certcentral authdns1001.wikimedia.org /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [17:29:31] Could not create directory '/nonexistent/.ssh'. [17:29:33] Password: [17:30:16] hmmm [17:31:03] the password: comes from the sudo to gdnsd right? [17:31:09] did we forget a nopasswd thing? [17:31:22] hm nope [17:31:26] + 'ALL = (gdnsd) NOPASSWD: /usr/bin/gdnsdctl -- acme-dns-01 *', [17:31:39] so the password prompt is from that bit yeah [17:31:41] not from ssh [17:32:15] the sudoers file says: [17:32:16] certcentral ALL = (gdnsd) NOPASSWD: /usr/bin/gdnsdctl -- acme-dns-01 * [17:32:25] but it's fine if you ssh first and then run the command? [17:32:28] yeah [17:32:31] instead of trying to do all in one? [17:32:34] so something environmental [17:32:37] I'm suspicious about the -- [17:32:45] let me repro again [17:34:43] wonder what happens if you try to put the entire command after the ssh hostname in quotes [17:34:44] i.e. [17:34:53] SSH_AGENT_SOCK=/run/keyholder/proxy.sock /usr/bin/ssh -4 -l certcentral authdns1001.wikimedia.org "/usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012" [17:35:47] so this succeeds: [17:35:50] root@certcentral1001:~# sudo -u certcentral SSH_AUTH_SOCK=/run/keyholder/proxy.sock /usr/bin/ssh -4 -l certcentral authdns1001.wikimedia.org /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [17:36:08] and from the /usr/bin/ssh onwards, that's exactly what my (modified for -4) script is doing too [17:36:21] executing: [17:36:22] sudo -u certcentral /usr/local/bin/certcentral-gdnsd-sync.py pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 --remote-servers authdns1001.wikimedia.org authdns2001.wikimedia.org multatuli.wikimedia.org [17:36:33] hangs on a sudo password prompt, and the running ssh command is: [17:36:39] certcen+ 19080 19079 0 17:34 pts/2 00:00:00 /usr/bin/ssh -4 -l certcentral authdns1001.wikimedia.org /usr/bin/sudo -u gdnsd /usr/bin/gdnsdctl -- acme-dns-01 pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [17:36:57] which seems identical to what I succeeded with manually, and the script does set the socket env var... [17:37:13] but yeah maybe this is a matter of arguments and the subprocess() call? [17:37:30] can you check the sudo logs on the authdns host? [17:37:35] yeah good idea [17:37:42] wondering if that tells us what command it's actually trying to run [17:37:48] if the -- cuts stuff off, it won't match the sudo rule [17:38:50] why the -4 btw? 
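For context on the subprocess() suspicion above, a minimal sketch of the per-server call the sync script plausibly makes — the command layout is taken from the ps output, everything else (function name, error handling) is assumed. Note that ssh only consults SSH_AUTH_SOCK; exporting the socket under a near-miss name is silently ignored and auth falls back to prompting, which is where this session is headed:

    import subprocess

    def push_challenge(server, domain, validation):
        """Provision one ACME dns-01 challenge on a remote authdns server."""
        cmd = [
            '/usr/bin/ssh', '-4', '-l', 'certcentral', server,
            '/usr/bin/sudo', '-u', 'gdnsd', '/usr/bin/gdnsdctl',
            '--', 'acme-dns-01', domain, validation,
        ]
        # ssh reads the agent socket strictly from SSH_AUTH_SOCK
        subprocess.check_call(
            cmd, env={'SSH_AUTH_SOCK': '/run/keyholder/proxy.sock'})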
[17:39:11] Oct 18 17:38:29 authdns1001 sshd[110622]: Connection from 10.64.32.26 port 48650 on 208.80.154.134 port 22 [17:39:14] Oct 18 17:38:29 authdns1001 sshd[110622]: Postponed keyboard-interactive for certcentral from 10.64.32.26 port 48650 ssh2 [preauth] [17:39:26] paravoid, we're not sure yet but we think there's a firewall rule somewhere blocking that [17:39:29] yeah it doesn't really log the attempted command, but it does log that it's trying keyboard [17:40:00] yeah I suspect we have some inconsistency at the router level on blocking port 22 over v4/v6 on the authdns servers' public IPs or something, but bypassing with -4 for now to look at other issues [17:41:04] it has to be the quoting / arguments / -- stuff [17:41:05] somehow [17:41:11] trying a few edits to how subprocess() args look... [17:48:29] it's not sudo that's blocking it, it is an ssh-level issue [17:48:39] that password prompt is from ssh, not from sudo [17:49:55] that's weird [17:49:58] so [17:50:24] can you make it ssh -vvv and look at the output? [17:50:34] I wonder if it's not talking to the agent for some reason [17:50:44] that might explain why it resorts to password auth [17:51:07] come to think of it are wikimedia servers sshd supposed to allow password auth? [17:52:04] it's not really an indication that they allow password auth, just that they allow interactive prompting that will ultimately fail, I guess [18:00:13] yeah somehow it's not even the quoting/-- stuff either [18:00:22] it's just failing to get AUTH_SOCK stuff working right, still debugging [18:02:24] success! [18:02:34] holy hell, what a deep hole to find such a simple bug [18:03:08] bblack, what was it? [18:04:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/468368 [18:04:41] bah [18:04:43] ^ my eyes glazed past the difference in that variable name between the script and what I was trying on the commandline like 400 times during that [18:04:43] that's my fault [18:04:48] sorry [18:04:52] it's ok :) [18:07:01] so now just the v6 blockage remains [18:12:05] oh [18:12:12] the v6 blockage is not a router/fw issue [18:12:24] it's that cc1001 doesn't have a static-mapped ipv6 in its live config, even though it does in DNS [18:12:43] so [18:12:43] its source address is an ephemeral auto-address, not the one we expect in ip6tables [18:12:54] we can fix that [18:13:02] the auto-address is routable? [18:13:41] yes, we commonly still have those in use, if the server's v6 ip isn't important for e.g. service destinations or ferm rules [18:14:03] we have to explicitly add a puppet statement to get a static-mapped v6 for the host or role (which is present for a lot of things, but not all) [18:14:20] ah yes the add_ip6_mapped thing [18:17:39] ssh spam output is slightly annoying, but I think we have full puppetized success now [18:17:59] ssh spam output?
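The static-v6 fix referenced above, sketched — interface::add_ip6_mapped is the puppet define that gives a host a deterministic IPv6 address derived from its IPv4 one, so the live source address matches DNS and the @resolve AAAA ferm rules (exact resource title and placement assumed):

    # in the certcentral host's role/profile (illustrative placement)
    interface::add_ip6_mapped { 'main': }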
https://phabricator.wikimedia.org/P7693 [18:21:36] ^ the 'could not create' 'could not mkdir' crap, because the minimal user certcentral doesn't have a useful homedir [18:21:40] it's ignorable [18:22:38] oh [18:22:47] certcentral does this when calling that gdnsd-sync script [18:22:53] stdout=subprocess.DEVNULL, [18:22:53] stderr=subprocess.DEVNULL, [18:22:58] so no problem [18:24:29] right [18:24:38] and yes, the arguments work the other way around too like CC does it [18:24:49] sudo -u certcentral /usr/local/bin/certcentral-gdnsd-sync.py --remote-servers authdns1001.wikimedia.org authdns2001.wikimedia.org multatuli.wikimedia.org -- pinkunicorn.wikimedia.org 0123456789012345678901234567890123456789012 [18:24:53] ^ works too [18:25:00] so we should be good on this level now [18:25:09] will leave the cert testing for tomorrow morning! [18:25:13] yeah [18:25:25] turns out I have a busy schedule tomorrow morning, I may or may not be around [18:25:37] np! [20:44:50] 10netops, 10Operations, 10fundraising-tech-ops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10cwdent) [21:45:36] would like to run puppet on eeden to let it change NRPE config.. but see the comment "reimaging shortly".. [21:45:52] i guess i can just fix that config file manually [21:57:08] same for lvs1010, lvs1011. maybe i can help with decom? [22:37:25] 10netops, 10Operations: BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10Dzahn) [22:37:48] 10netops, 10Operations: cr2-esams - BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10Dzahn) [22:48:11] mutante: yeah, sorry, that's me falling behind on tons of things [22:48:44] eeden is meant to be reimaged to just role(test), it's already that way in site.pp. [22:49:25] any remaining servers from the lvs1007-12 range are meant to be decommed [22:54:37] I think x.ionox confirmed before that none of the remaining lvs1007-12 have any router-side config anyways. that's the main reason they're disabled, was fear of them speaking bgp to routers and the routers listening. [22:57:04] bblack: i started editing the nagios NRPE config because that was easy.. but then i realized i also have to edit ferm constants.. and messing with that seemed a bit bad. the only reason i am doing this is because i am looking at icinga1001 and trying to eliminate all remaining issues that are not the same on einsteinium [22:57:47] ah [22:58:00] hypothetically, the lvses don't have ferm anyways, unless they've already been moved to spare or something [22:58:01] i can talk to x.ionox and have it double confirmed and then reinstall them [22:58:09] oh, they have [22:58:15] eeden is definitely safe [22:58:19] ok [23:32:35] Hi, I'm trying to understand better how WMF distribute traffic across different geo-regions, in my case I see traffic from Reston heading to SFO but AFAIK there is closer location also in Virginia [23:33:03] anyone could help me with that [23:33:57] basically we're using commons as a foreign image repository for some wikis & it seems that commons.wikimedia.org is resolved to text-lb.ulsfo.wikimedia.org which is suboptimal latency wise [23:34:32] we were wondering if we could do something to get served from eqiad instead [23:45:16] interesting [23:46:04] if I'm understanding this file right it looks like you should be getting eqiad [23:46:39] https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/geo-maps$167 [23:46:40] we use HE over there [23:46:52] 4.
AS6939 209.51.171.33 0.0% 20 97.7 79.0 0.5 106.8 27.5 [23:46:55] 5. AS6939 184.105.222.41 0.0% 20 59.5 59.5 59.4 59.8 0.0 [23:46:58] 6. AS??? 198.32.176.214 0.0% 20 60.8 63.9 60.5 88.6 6.7 [23:47:01] 7. AS14907 198.35.26.96 0.0% 20 60.6 60.6 60.3 62.1 0.4 [23:49:07] oh so it's based on DNS [23:49:15] yeah [23:49:50] if I understand correctly, which DC you end up getting directed to is based on your resolver's IP when talking to DNS [23:50:42] I see [23:51:31] that makes sense we never advertise that /24 class as a different location [23:52:21] might be your own IP if your resolver does edns client subnet [23:54:05] really the person you want to talk to is bbl.ack [23:55:01] I think the dns part is enough for now, at least I know it's not based on bgp [23:55:48] and based on the fact it's maxmind db I can work on it [23:55:52] thanks! [23:55:56] np [23:56:10] btw [23:56:37] that IP you're connecting to IRC from yourself [23:56:46] geo locates to California, which will obviously get ulsfo [23:56:52] according to maxmind [23:57:41] oh yeah that's a different network [23:57:47] ok [23:57:48] https://www.maxmind.com/en/geoip-demo [23:57:53] I can reproduce the issue here [23:57:54] yeah that's what I was looking at
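A client-side way to reproduce what's being described, using plain dig; the answers shown are illustrative, though 198.35.26.96 is the same text-lb.ulsfo address that appears as the final hop in the mtr above:

    # what your configured resolver is steered to
    $ dig +short commons.wikimedia.org
    text-lb.ulsfo.wikimedia.org.
    198.35.26.96

    # ask an authoritative server directly -- gdnsd applies the geo-map to
    # the source address of whoever is asking (or its EDNS client subnet)
    $ dig +short commons.wikimedia.org @ns0.wikimedia.org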