[05:55:30] 10Traffic, 10Operations, 10conftool, 10Patch-For-Review, and 2 others: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972 (10Joe) 05Open→03Resolved
[05:55:33] 10Traffic, 10Operations, 10discovery-system, 10services-tooling: Create a tool to sync static configuration from a repository to the consistent k/v store - https://phabricator.wikimedia.org/T97978 (10Joe)
[06:51:28] 10netops, 10Operations, 10Patch-For-Review: netflow2001 kafkatee-webrequest restart loop - https://phabricator.wikimedia.org/T249176 (10ayounsi) 05Open→03Resolved a:03ayounsi Removed!
[07:11:19] 10Traffic, 10Analytics, 10Operations: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) ` scala> spark.sql("select count(*) from wmf.netflow where year=2020 and month=05 and day=31 and hour=18").show(); 20/06/04 07:09:37 WARN Utils: Truncated the string repre...
[07:29:56] 10netops, 10Operations: Homer: manage transit BGP sessions - https://phabricator.wikimedia.org/T250136 (10ayounsi) The changes/cleanup done with that CR: `name=everywhere, lang=diff [edit protocols bgp group Transit4 family inet] + unicast; - any; ` `any` includes unicast + multicast and we don't...
[07:53:09] 10netops, 10Operations: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) This is now set up on SFMIX's end and up: > On your side please plumb 206.197.187.82/24 and 2001:504:30::ba01:4907:1/64. Usual sane BGP peering rules apply - no broadcast traffic (DHCP, CDP, etc),...
[08:52:28] 10Traffic, 10Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn)
[08:52:46] 10netops, 10Operations: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10ayounsi) 05Open→03Resolved a:03ayounsi Everything is done, and we're peering with the RS. Next is to send peering requests.
[08:53:13] 10Traffic, 10Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn)
[08:54:05] 10Traffic, 10Operations: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn) p:05Triage→03High
[08:54:40] 10Domains, 10Traffic, 10DNS, 10Operations: Create diff.wikimedia.org subdomain - https://phabricator.wikimedia.org/T253807 (10Dzahn) p:05Triage→03High
[09:14:22] 10netops, 10Analytics, 10Operations: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Dzahn) p:05Triage→03Medium
[09:19:20] 10Traffic, 10DC-Ops, 10Operations: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10Dzahn) p:05Triage→03Medium
[09:20:55] 10Traffic, 10Operations, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Prevention): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Dzahn) p:05Triage→03Medium
[09:21:52] 10Traffic, 10Operations, 10Pybal: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (10Dzahn) p:05Triage→03High
[09:24:16] 10Traffic, 10Analytics, 10Operations, 10Readers-Web-Backlog (Tracking): Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Dzahn) p:05Triage→03Medium
[09:40:44] mutante: why is the SNI-only one a high priority task? is it blocking something?
[09:41:58] 10Traffic, 10Operations, 10Pybal: PyBal ProxyFetch failure when talking to Envoy in SNI-only mode - https://phabricator.wikimedia.org/T253527 (10Dzahn) p:05High→03Medium
[09:43:05] vgutierrez: because "ProxyFetch gets a ConnectionLost" sounded kind of important. but really.. because I don't know, and it probably makes no sense that clinic duty is supposed to triage them
[09:43:15] fixed
[13:15:08] 10Domains, 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: Create diff.wikimedia.org subdomain - https://phabricator.wikimedia.org/T253807 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez `willikins:dns vgutierrez$ dig diff.wikimedia.org. ; <<>> DiG 9.10.6 <<>> diff.wikimedia.org. ;; global opt...
[13:42:25] XioNoX: cr1-eqsin soooo slloowwwwwwwww
[13:44:29] cdanis: https://phabricator.wikimedia.org/T253246 :)
[13:44:52] I know :)
[13:45:08] XioNoX: do you have a oneliner for concisely checking if a bgp session is up?
[13:45:23] cdanis: which one?
[13:45:38] any given one, I just did the peerings for DigitalOcean in SFMIX/EQSIN
[13:45:48] show bgp summary | match
[13:46:21] oh okay, and 'Establ' is there, nice
[14:37:34] vgutierrez: are you up/working?
[14:38:43] (or, alternatively, anyone who uses the 'traffic' cloud-vps project)
[14:38:49] yep, I am
[14:38:57] hello!
[14:39:23] A weird thing happened this morning, where all local secrets on cloud project puppetmasters were wiped.
[14:39:35] uh...
[14:39:38] I know that you had local secrets for acme-chief, and I'm going to try to restore those right now.
[14:39:47] ack
[14:39:53] Do you know if there are other things that were likely lost?
[14:40:25] AFAIK everything acme-chief related
[14:40:43] ok. I'll see what I can rescue and then will ping you again to test
[14:40:48] thx
[14:43:06] mercifully, acme-chief account info seems to be stored in an unpuppetized dir on the acme-chief hosts
[14:43:34] we need the cloud DNS API key secret as well
[14:43:40] to be able to inject dns-01 challenges
[14:44:36] yep, I'm just going to generate a new one of those. Do you have that password stored anywhere outside of that puppet repo?
[14:45:47] hmmm nope AFAIK
[14:46:02] but yeah, a new one should be enough :)
[14:49:16] 10Traffic, 10Analytics, 10Operations: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) The hole is now gone, but we discovered a major problem in T254383 :(
[14:49:35] 10Traffic, 10Analytics, 10Operations: missing wmf_netflow data, 18:30-19:00 May 31 - https://phabricator.wikimedia.org/T254161 (10elukey) 05Open→03Resolved
[15:17:29] vgutierrez: I think these are back the way they were now, but I don't feel totally qualified to judge. Is it possible for you to do an end-to-end test and make sure that the designate + LE auth is working?
[15:17:39] * vgutierrez testing
[15:18:29] The last Puppet run was at Thu Jun 4 10:18:24 UTC 2020 (299 minutes ago). Puppet is disabled. /labs/private recovery
[15:18:36] I guess that's expected
[15:18:45] yes, go ahead and re-enable
[15:18:52] Jun 04 15:00:05 traffic-acmechief01 acme-chief-backend[22514]: Aborting new certificate. Prevalidation failed for CN wikipedia.com.traffic.wmflabs.org for non-canonical-redirect-1 / rsa-2048
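(A note on the failure above: the DNS side of it can be checked from any host with plain dig. The hostname comes straight from the log line, and the expected output is an assumption; the discussion below narrows the problem down further, but the resolution checks look like:

    # does the CN from the failed prevalidation resolve at all?
    dig +short A wikipedia.com.traffic.wmflabs.org
    # walk up the tree: an NXDOMAIN on an intermediate label
    # (here com.traffic.wmflabs.org) covers everything beneath it
    dig A com.traffic.wmflabs.org

If the zone is in the state described below, the second query's status header should read NXDOMAIN.)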
[15:19:00] let me restart acme-chief just in case :)
[15:19:02] we disabled it everywhere before doing our forensics
[15:19:03] thx
[15:20:06] hmm we're having some issues on acme-chief but I don't know if they're strictly related to this incident
[15:20:17] let me debug it and I'll come back to you
[15:20:58] andrewbogott: wmflabs.org is still the valid cloud TLD, right? :)
[15:20:58] thanks — let me know if I messed up one of the secrets because I'm hoping to use the same process elsewhere
[15:21:07] it should be!
[15:22:03] maybe just too many labels
[15:22:36] not to mention that doesn't resolve publicly
[15:23:03] I salvaged the private key from /etc/acme-chief/accounts on the acme-chief host, where it seems to live apart from puppet
[15:23:10] com.traffic.wmflabs.org is NXDOMAIN (and thus everything underneath is too)
[15:23:45] hmm weird
[15:23:54] that seems probably unrelated to the things we just did :)
[15:24:33] bblack: I get a NXDOMAIN for wikimedia.traffic.wmflabs.org as well, and acme-chief issued the "unified" labs cert on May 1st
[15:26:01] If you decide that the cloud's DNS infrastructure is also broken let me know! That would surprise me but it wouldn't surprise me /that/ much considering how today is going
[15:33:21] vgutierrez: LE would reject an NXDOMAIN on the challenge hostname
[15:33:46] yep, but it's not even reaching that point
[15:33:59] the prevalidation checks the CAA and the NS records
[15:34:14] wmflabs.org doesn't have a CAA record so that can't be it
[15:34:32] the CAA and NS for... wmflabs.org?
[15:35:24] it tries to find the most specific SOA
[15:35:55] but it looks like our acme-chief config is just outdated
[15:36:09] it's expecting to find cloud-ns0.wikimedia.org. and cloud-ns1.wikimedia.org. in the NS records
[15:36:15] not ns1.openstack.eqiad1.wikimediacloud.org. and ns0.openstack.eqiad1.wikimediacloud.org.
[15:36:25] well, there's also some strangeness with the NS config in this, but I don't know if it's really causative here
[15:36:53] hmmm
[15:37:01] https://www.irccloud.com/pastebin/tlAhHP0P/
[15:37:23] andrewbogott: ^^ could it be that our beloved DNS zone traffic.wmflabs.org is a little bit outdated regarding NS records? :)
[15:37:26] digging NS for wmflabs.org gives 4x nameservers: cloud-ns[01].wikimedia.org + ns[01].openstack.eqiad1.wikimediacloud.org., with 86400 TTLs
[15:37:51] but digging NS for traffic.wmflabs.org *also* gives the same NS records, for "traffic.wmflabs.org NS ...", with 120s TTLs
[15:37:59] bblack: weird.. from the labs instance I only get the wikimediacloud.org ones
[15:38:07] root@traffic-acmechief01:/var/lib/acme-chief/certs/unified/live# host -t ns wmflabs.org
[15:38:07] wmflabs.org name server ns1.openstack.eqiad1.wikimediacloud.org.
[15:38:07] wmflabs.org name server ns0.openstack.eqiad1.wikimediacloud.org.
[15:38:09] which makes it look like a delegation-to-self
[15:38:28] (which doesn't work in the general case)
[15:38:34] bblack: yeah.. we got an API account on designate with rights to handle traffic.wmflabs.org but not wmflabs.org
[15:39:06] the names like ns0.openstack.eqiad1.wikimediacloud.org are current
[15:39:07] it seems to be set up like a delegated subzone, but delegated to the same servers as the parent delegating zone, is what I mean
[15:39:29] which always causes at least some DNS responses to be paradoxical/inconsistent/broken, from some perspective. You have to delegate to different servers if you delegate at all.
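(The delegation-to-self bblack describes can be seen with two plain dig queries against the same server; the server name comes from the discussion above, and the expected output shape is an assumption:

    # ask the authoritative server for the parent zone's NS set
    dig +norecurse NS wmflabs.org @ns0.openstack.eqiad1.wikimediacloud.org
    # ask the same server for the "delegated" subzone's NS set
    dig +norecurse NS traffic.wmflabs.org @ns0.openstack.eqiad1.wikimediacloud.org

If the second query returns an authoritative answer listing the same servers, rather than a referral to different ones, then one server is acting as both parent and child of the zone cut, which is the inconsistency in question.)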
[15:39:43] hmmm do I have permissions to fix that andrewbogott?
[15:40:46] the upstream .org registrar only has the ns[01].openstack pair of NS records for wmflabs.org
[15:40:47] for traffic.wmflabs.org? Sure.
[15:41:01] but ns[01].openstack themselves are the ones returning all 4 records (cloud-ns[01] as well)
[15:41:22] the cloud-ns1.wikimedia.org names are deprecated. It's quite possible they didn't get cleaned up properly
[15:42:12] https://phabricator.wikimedia.org/P11401
[15:43:15] yeah.. wmflabs.org zone management is out of my scope for sure :)
[15:43:41] but that aside: the existence of an NS record at any level automatically implies a "zone cut", meaning a delegation happened.
[15:44:01] so having NS records at all for traffic.wmflabs.org means (in the dns protocol sense) that it was delegated from one set of servers to another.
[15:44:45] but across such a delegation boundary, the RFCs require the parent and child dns servers to return different responses for the same query, so if they're the same server, it can't be correct.
[15:45:54] andrewbogott: so on horizon I can't find the option to edit the NS records for traffic.wmflabs.org
[15:46:05] ok, I'm looking too
[15:46:08] thx
[15:49:13] (but continuing that thread: for some clients, and some sequences of queries, no issue may become apparent in such a scenario. so it can be subtle, and it can "work", if the server never does actual delegation-style responses and just emits the extra NS records when queried explicitly. But still, some software somewhere is bound to get confused by the situation when it makes wrong assumptions
[15:49:19] based on the standard)
[15:51:16] yeah, the cli is telling me "Updating a root zone NS record is not allowed"
[15:51:54] I can definitely change it in the database but that scares me a bit :)
[15:52:21] it might be interesting to run some kind of select query on the database to find other such cases (see the sketch at the end of this log)
[15:52:33] (any NS records for any zones that still reference cloud-ns[01])
[15:53:13] IIRC, sometime a few years back I edited the database directly to fix some issue, and editing it via SQL worked fine, so long as you don't mess up the data :)
[15:57:10] I'm sure there are several, I think this is a designate bug where it doesn't purge before updating
[15:57:42] I removed those records from the designate db; let's see if things clear up as caches run down
[16:01:46] at least now on Horizon I see just 2 NS servers as expected :)
[16:02:21] yeah — I'm not clear on how that's cached or for how long
[16:03:50] NS records? quite a bit I'm afraid
[16:08:18] yeah who knows
[16:08:26] the dig output still shows all four publicly, fwiw
[16:18:29] 10netops, 10Analytics, 10Operations: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Milimetric) p:05Medium→03High
[16:19:17] so, to confirm my understanding: the acme-chief process has probably been broken for weeks and was not broken additionally by the puppet nonsense?
[16:19:41] indeed
[16:20:55] great, I won't wait for confirmation of a fix then. LMK if there are other things I can do to unscramble things after DNS catches up
[18:14:49] 10netops, 10Operations, 10netbox, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10ayounsi)
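(A sketch of the select query suggested at 15:52, to find leftover NS records that still reference cloud-ns[01]. The Designate table and column names used here (zones, recordsets, records, zone_id, recordset_id, type, data) are assumptions based on a typical Designate schema, not verified against this deployment; the query is read-only either way:

    # run against the designate database; schema names are assumptions
    mysql designate -e "
      SELECT z.name AS zone, rs.name AS recordset, r.data
      FROM records r
      JOIN recordsets rs ON r.recordset_id = rs.id
      JOIN zones z ON rs.zone_id = z.id
      WHERE rs.type = 'NS'
        AND r.data LIKE 'cloud-ns%';"

Any rows returned beyond the wmflabs.org / traffic.wmflabs.org entries already cleaned up would be other zones with the same stale-NS problem.)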