[08:32:47] Traffic, Operations, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (Volans) @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it needs all of them and parses that one, so no...
[09:13:08] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6564734, @Gilles wrote: > cp3054 seems to be consistently a little faster for miss and pass, and overall a little slower...
[09:30:51] https://fly.io/blog/bpf-xdp-packet-filters-and-udp/
[09:35:21] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) The telemetry of both ATS-TLS and Varnish is also blind to app-level, OS-level or hardware-level delays and buffering internal to our...
[09:45:37] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (ema) >>! In T264398#6570924, @Gilles wrote: > Before we dig into that, the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/635276/ |...
[09:48:03] Traffic, Operations, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (akosiaris) >>! In T265904#6570744, @Volans wrote: > @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it...
[10:22:17] Traffic, Operations, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (jbond) >>! In T265904#6570744, @Volans wrote: > @jbond thanks for looking into this, unfortunately the data used in the Netbox import comes from the `networking` fact because it need...
[10:32:03] Traffic, Operations, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (Volans) @jbond yeah I agree we can implement this assuming that all v6 addresses are mapped, with a sane fallback into the `networking[ip6]` one. @crusnov could you implement that lo...
[11:13:26] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I think that a major caveat with that 6-hour window counter-example is that it's looking at hit rate over everything and RUM data is...
[11:54:31] Traffic, Operations, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (jbond) > Ill take a quick look today to see how to override the structured fact I had a quick look and i can't work out how to override the networking fact. I seem to be hitting the...
[13:19:16] https://github.com/varnishcache/varnish-cache/commit/a79e90ffe12e4656f125344d0dcfaf190a7ede26
[13:19:19] https://github.com/varnishcache/varnish-cache/commit/5ee0e2573a77c1508faf0dccf2d6432c7d3aa561
[13:19:24] https://github.com/varnishcache/varnish-cache/commit/718918663cacd1159efa550da1fce8aed669feb3
[13:19:28] https://github.com/varnishcache/varnish-cache/commit/0f0464411a7db6395fefec185e2a24df42ddb2b3
[13:19:42] these were merged in 6.0.2 ^
[13:19:56] I wonder how many things could have gone wrong
[13:29:09] one of the best parts is:
[13:29:12] Inspired by: #2856
[13:29:42] not "Fixes: #2856", that would be crazy for a point release :P
[13:31:34] * ema shakes his head
[13:32:33] all that was merged in 6.0.3 actually -> https://github.com/varnishcache/varnish-cache/blob/6.0/doc/changes.rst#varnish-cache-603-2019-02-19
[13:32:53] anyway, I kinda buy the "perf impact is from the missing hits" argument, and if I try to loosely stitch that together with these grace/keep-level changes, it paints a possible portrait of the missing hits being hits that used to happen in grace/keep-like windows but no longer do.
[13:33:07] or at least that's my guess at what "Many cache lookup optimizations." might mean in the changelog, there's no link to the commits
[13:33:17] which you'd think would be a minor edge-case, but for hot shorter-lived objects (think resourceloader), they might be very significant
[13:36:18] so the 1/N commit at least seems to claim to be a nop refactor, but I haven't tried to vet that fully yet
[13:36:51] 2/N calls itself a "refactor" as well, possibly intended as a functional nop
[13:37:20] 4/4 sounds like it might be intended to be more source-level than functional as well
[13:37:28] but 3/N sounds like there was an intentional behavior change in it
[13:37:46] just going by the commit messages
[13:38:13] I'm going to spend a little time staring at these and see if I can find a clear case where the functional behavior might have changed
[13:39:12] I think there could possibly be some logic bug in there somewhere
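For context on the grace/keep speculation above, a minimal sketch of how those windows are expressed in VCL; the durations and the lone default backend are purely illustrative, not our production settings:

```vcl
vcl 4.0;

backend default { .host = "198.51.100.1"; }  # placeholder backend

sub vcl_backend_response {
    set beresp.ttl   = 1h;   # fresh window: normal cache hits
    set beresp.grace = 10m;  # stale-but-servable: lookups can still count as
                             # hits here while a background fetch revalidates
    set beresp.keep  = 1d;   # kept only so later misses can send conditional
                             # (IMS/INM) fetches to the backend
}
```

The "missing hits" theory above amounts to: lookups that previously resolved inside the grace window (and counted as hits) may no longer do so after the 6.0.3 lookup changes.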
[13:39:28] also something we haven't explained at all is the difference in n_objecthead
[13:39:46] well, I think one of the links from yesterday discussed that
[13:40:06] yeah but it was supposed to be about hfm I think
[13:40:07] about how hf[pm] cases were creating excess pointless objheads, and something was done about it?
[13:41:26] I thought hfp was implied as well. In any case the two things are very related in the code (and you'll recall during the vacillations of versions 3->4->5, first HFP was simply changed to be HFM instead on the grounds that HFM behavior was universally better than HFP behavior, then later HFP was brought back to give us two alternatives)
[13:44:09] here it is https://github.com/varnishcache/varnish-cache/issues/2754
[13:45:02] https://varnish-cache.org/docs/6.0/whats-new/changes-5.0.html#hit-for-pass-is-now-actually-hit-for-miss
[13:45:19] then:
[13:45:21] https://varnish-cache.org/docs/6.0/whats-new/changes-5.1.html#hit-for-pass-has-returned
[13:45:23] yeah I remember that dance
[13:46:13] hmm yeah the bug does seem to explicitly call out hfm and not hfp
[13:47:09] there's this though about hfp: https://github.com/varnishcache/varnish-cache/pull/2760
[13:48:15] part of 6.0.2 according to the changelog: https://github.com/varnishcache/varnish-cache/blob/6.0/doc/changes.rst#varnish-cache-602-2018-11-07
[13:48:24] > Fix and test objhead refcount for hit-for-pass (2654, 2754, 2760)
[13:49:13] well the message says hfp and the first two issues are about hfm, but let's just assume they said "for hfp/hfm"
[13:49:51] re-reading the hfp drama above, I'm reminded of this about HFM that I'd forgotten, too:
[13:49:55] "A consequence of this is that fetches for uncacheable responses cannot be conditional in the default case. That is, the backend request may not include the headers If-Modified-Since or If-None-Match, which might cause the backend to return status "304 Not Modified" with no response body. Since the response to a cache miss might be cached, there has to be a body to cache, and this is true of
[13:50:01] hit-for-miss as well. If either of those two headers were present in the client request, they are removed from the backend request for a miss or hit-for-miss."
[13:50:24] yeah that's one of the reasons for bringing back hfp
[13:50:56] so if we chose, as in my proposal, to use HFM to handle large_objects_cutoff, one of the unintended consequences would be the conversion of IMS/INM 304 cases into full fetches, for the fe->be leg of things and beyond.
[13:52:23] (which is maybe-ok if you were using HFM to hopefully eventually bring a hit back into cache, but in this case we're intentionally persisting the HFM case as an HFP alternative, and it would probably be a bad thing to strip away conditionalness)
[13:54:27] it's even maybe-ok for the "exp" code (that we're not presently using) to use HFM, because it will only be causing this conditional-stripping-without-really-caching issue for relatively-rare requests. If something becomes hot enough, "exp" will eventually let it back into cache.
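To make the HFM-vs-HFP distinction above concrete, a minimal sketch of how each is created in vcl_backend_response; the size test, durations, Cache-Control regex and default backend are illustrative stand-ins, not our actual VCL:

```vcl
vcl 4.0;
import std;

backend default { .host = "198.51.100.1"; }  # placeholder backend

sub vcl_backend_response {
    if (std.integer(beresp.http.Content-Length, 0) > 262144) {
        # Hit-for-miss: remember "don't cache this" for a while, but keep
        # trying to cache it later. As the quoted docs say, IMS/INM
        # conditionals from the client are stripped on the backend fetch,
        # since a future cacheable response would need a body.
        set beresp.ttl = 600s;
        set beresp.uncacheable = true;
        return (deliver);
    }
    if (beresp.http.Cache-Control ~ "(no-cache|no-store|private)") {
        # Hit-for-pass: a true pass for the duration; conditional headers are
        # left intact, because the response will never be cached anyway.
        return (pass(600s));
    }
}
```

The trade-off the channel is weighing: HFM keeps the door open to caching later but breaks 304 revalidation on the fe->be leg, while HFP preserves conditionals but carries no extra state of its own.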
[13:54:31] I think the request would become a full fetch only for the fe->be leg
[13:54:52] yes, but also for the client
[13:55:22] ah, that's what you meant with "beyond"
[13:55:31] I thought you were thinking of the ats-be<->origin part
[13:55:50] the client may hold a cached copy of the resource, and issue a conditional request to the fe for the oversized object, the fe passes off to the be which has a matching object and could've said "304", but now it will return the whole object with 200 to the client
[13:55:58] yeah sorry, that was bad word choice :)
[13:58:10] so, this is good news, because it's some progress on the constraints of the solution space for the large_objects_cutoff ticket
[13:58:17] :)
[13:58:23] HFM isn't a viable solution, and we have to find a way to make it work with HFP
[13:59:19] one question I thought I knew the answer to, but maybe should double-check: can we set any kind of metadata (of our own choosing) on hfp objects?
[13:59:51] as in: obj.Something=1 just before return(pass(120)), so that later hits on the hfp object, in vcl_pass, can check obj.Something to decide about pass_random?
[13:59:57] I don't think we can, but it would be great
[14:00:21] it's the easiest way to get everything we want, but yeah I've been assuming it's not possible
[14:00:30] maybe I should stare harder
[14:01:14] but barring that angle of attack working out, we're basically left with 3 possible paths, none of them completely-ideal:
[14:01:59] 1) Leave this situation as-is, causing all objects >= large_objects_cutoff to be randomly spread over all be caches, which hurts hitrate, initial burst-through on new stuff, and probably has a huge impact on the effective total be storage size.
[14:02:21] 2) Get rid of large_objects_cutoff and hope the FE can cope sanely with the few that exist, in the text case
[14:03:03] 3) Get rid of pass_random and possibly suffer the consequences (seen many times in the past!) of an uncacheable URI getting tons of traffic and lasering a BE instance and causing harm there.
[14:04:02] upload doesn't really have a problem as presently configured, because it has chosen path #3 and basically-everything there is cacheable.
[14:04:55] well, there's also a minor improvement you can make to path #2
[14:05:27] 2.1) Keep large_objects_cutoff, but set it to the same as the backend-most caches' upper limit (we have one on upload, do we have one for text?)
[14:06:21] hmm yeah we do, because they're both the same config in ats-be land
[14:06:22] -- Do not cache files bigger than 1GB
[14:06:43] so even in path #2, it would make sense to keep doing a pass_random pass on objects above that threshold in the fe
[14:11:41] really, ideally if we had the ability to set hfp metadata, we'd use it to randomize the >=1GB case for upload (while still keeping its smaller non-random pass cutoff, too)
[14:12:05] because "basically-everything" of course excludes those >=1GB objects...
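For reference, "pass_random" in path #3 above means spreading pass traffic across all backends instead of using the normal object-affinity hashing. A rough, self-contained sketch of the idea using vmod_directors; the backend addresses and director names are hypothetical, not our actual config:

```vcl
vcl 4.0;
import directors;

# Hypothetical backends standing in for the real backend cache layer.
backend be1 { .host = "192.0.2.1"; .port = "3128"; }
backend be2 { .host = "192.0.2.2"; .port = "3128"; }

sub vcl_init {
    new be_chash = directors.hash();     # normal path: object affinity
    be_chash.add_backend(be1, 1.0);
    be_chash.add_backend(be2, 1.0);

    new be_random = directors.random();  # the "pass_random" path
    be_random.add_backend(be1, 1.0);
    be_random.add_backend(be2, 1.0);
}

sub vcl_recv {
    # Default: hash on the URL so each object consistently lands on one be.
    set req.backend_hint = be_chash.backend(req.url);
}

sub vcl_pass {
    # Pass traffic gets no benefit from affinity, so spread it out; otherwise
    # one hot uncacheable URI can laser a single hash-chosen backend.
    set req.backend_hint = be_random.backend();
}
```

Path #1 is what happens when the large_objects_cutoff pass and this random spreading combine: every oversized-but-cacheable object ends up randomly sprayed across the backend tier.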
[14:17:45] ah well:
[14:17:47] https://github.com/varnishcache/varnish-cache/blob/varnish-6.0.6/bin/varnishd/cache/cache_req_fsm.c#L531
[14:18:30] ^ when lookup hits a hitpass object, it asserts that oc and busy (the two possible output objects of the lookup) are NULL, and then sets some flags on req and returns.
[14:18:52] so even if you could store metadata to an hfp object, there's clearly no object available at the time of hitting the hfp object
[14:19:08] it exists only to signal the passing action, not to pass data around
[14:20:38] we could patch such a mechanism into the source, but ewwww
[14:21:50] it seems like a useful thing to have, but yes, ewwww :)
[14:24:07] maybe if it were a simple patch, and genuinely upstreamable in a generically-useful way to others
[14:24:26] the cases of uncacheable URIs focusing on one backend and causing trouble, do we know if those could have been spotted at request time?
[14:24:34] I don't think we could go so far as to have hitpass actually return the hfp object to VCL somehow
[14:24:47] but we could add some kind of req flag, much like the current req.is_hitpass
[14:25:00] (which is derived from some flag set on the hfp object itself)
[14:25:17] for our purposes, a binary flag to control pass_random is enough, but that's hardly generic enough to justify upstreaming
[14:25:38] I don't see a good design path here that avoids us having a one-off varnish-wmf patch
[14:26:24] ema: I don't recall the specifics of past cases, but in general I don't think they're request-time predictable
[14:27:44] we don't have any reliable corpus of "these URI patterns return uncacheability headers to anonymous requests" that keeps up with the times and various APIs, etc. We rely a lot on HFP from response headers for them.
[14:28:36] yeah
[14:29:45] maybe a single string field would be generic enough to upstream and still simple to implement
[14:30:03] and a syntax patch for an optional hfp reason string?
[14:31:33] return(pass(120, "because-objsize-too-big")); ---> sub vcl_pass { if (req.is_hitpass) { if (req.hitpass_reason != "because-objsize-too-big") { pass_random } } }
[14:32:01] something along those lines might be upstreamable
[14:32:55] I was hoping we could do something simpler than hfm, and now we're gonna have to add state to hfp :D
[14:33:32] jokes aside, I agree, let's do it
[14:33:42] related question:
[14:33:55] do we use vcl_pass on the actual request that triggered hfp creation when there was none before it?
[14:34:03] I think not
[14:34:12] need that state diagram link again heh
[14:34:20] https://varnish-cache.org/docs/6.0/reference/states.html#reference-states
[14:35:12] it seems to me that the red arrow from recv does go to pass
[14:35:24] I guess it doesn't matter too much, it should be rare, and hfp should self-refresh during grace in hot cases, I think
[14:36:42] what I meant was: when we first do a recv->miss, and then a fetch, and then in vcl_backend_response we choose to do return(pass(120)) to create an HFP, that request doesn't go through vcl_pass
[14:36:52] and probably counts statistically as a miss, not a pass
[14:37:00] ah!
[14:37:49] oh, we do set the X-CDIS string from backend_response in those cases, just to fix our stats
[14:37:50] ok no, I think from _backend_fetch there's no way to jump back to _pass
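For contrast, what stock 6.0 already exposes is just the boolean req.is_hitpass mentioned above: VCL can tell "this pass came via an hfp object" but not why the hfp was created. A small fragment meant to slot into the earlier sketch (the header name is purely illustrative):

```vcl
sub vcl_pass {
    if (req.is_hitpass) {
        # We arrived here via a hit-for-pass object, but "objsize too big"
        # and "response said uncacheable" are indistinguishable, so we can't
        # apply pass_random to only one of those cases.
        set req.http.X-Pass-Kind = "hitpass";  # illustrative marker only
    } else {
        set req.http.X-Pass-Kind = "pass";
    }
}
```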
[14:39:09] really I guess if the reason string is optional and req.hitpass_reason ends up undefined in the default case, varnish still lets you do compares IIRC
[14:39:32] return(pass(120, "because-objsize-too-big")); ---> sub vcl_pass { if (req.hitpass_reason != "because-objsize-too-big") { pass_random } }
[14:39:37] so we don't need to wrap conditionals
[14:39:53] and the cost to non-users is the size of a string pointer to an undefined varnish string in every hfp object
[14:52:12] ema: in the meantime, we could do a 'temporary' patch attached to that ticket which raises cache_text large_objects_cutoff to 1G and see how that flies? It's not ideal, but it's very likely better than the current state of affairs in the net.
[14:55:25] (thus it would match the ats-be cacheability cutoff, and we'd be back to only doing pass_random on things that pass both layers, at the cost that we could be unaware that there are a ton of cacheable, very large sub-GB objects in the text set that now disrupt the fe memory pool)
[14:55:45] maybe turnilo could tell us if there are any popular such cases, actually
[15:02:27] interestingly there's a bunch of /w/load.php traffic that is over 256KiB
[15:02:48] bblack: https://w.wiki/i5e
[15:08:14] even expanding to the past week, and narrowing to GETs, nothing jumps out at me as something that "shouldn't" be cached in the frontend
[15:36:51] cdanis: searching over 30d for much larger objects (I tried >100MB) turns up a lot of curious rabbitholes :)
[15:38:02] in the 10M->100M range, we also have mediawiki tarballs coming through cache_text, too
[15:38:11] and dumps
[15:40:26] interesting that the releases tarballs don't seem to cache
[15:40:32] (even in the BE, they seem to constantly miss)
[15:40:43] but they're probably not high volume, either
[15:49:43] yeah, something else I was thinking about was that you lose insight by picking too long a range at once -- for this kind of thing you care about the hour-to-hour / minute-to-minute bursts of popular content
[15:49:46] but eh
[15:50:07] it doesn't seem obviously harmful to just try it and see what happens
[15:50:21] yeah
[15:50:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/635852/
[15:50:37] I also think you could pick something between the two extrema (maybe a couple MiB?) to catch the very-obviously-helpful-to-cache cases like the load.php stuff and [[Donald Trump]] etc
[15:50:49] but I don't feel strongly about that either :)
[15:51:15] I think we'll hopefully prove that we don't need a tunable here for text, just protection against extremes
[15:51:28] that would be a very good outcome :)
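Fleshing out the 14:31/14:39 one-liners above into fuller form. To be clear, none of this syntax exists in stock Varnish; it is the hypothetical hfp-reason patch being proposed, and be_random is the same hypothetical random director as in the earlier pass_random sketch:

```vcl
import std;

sub vcl_backend_response {
    if (std.integer(beresp.http.Content-Length, 0) > 1073741824) {
        # Hypothetical extended syntax: attach a reason string to the hfp
        # object at creation time (stock VCL only allows a bare duration).
        return (pass(120s, "because-objsize-too-big"));
    }
}

sub vcl_pass {
    # Hypothetical req.hitpass_reason: unset unless the hfp object we hit
    # carried a reason. Comparing against an unset string is still legal
    # VCL, so no req.is_hitpass wrapper is needed.
    if (req.hitpass_reason != "because-objsize-too-big") {
        set req.backend_hint = be_random.backend();  # i.e. pass_random
    }
}
```

The design appeal is that the cost to anyone not using the feature is just an extra (usually NULL) string pointer per hfp object, which is what makes it plausibly upstreamable.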
[15:59:46] ema: I'm gonna merge the 1G cutoff idea, just so you see this later/tomorrow. it seems to hit a good intersection of bold and potentially beneficial right now :)
[16:02:57] it will take a while for the effect of this patch to smear itself into place anyways, as it won't destroy all the existing HFP objects
[16:03:07] so we'll have some time to see if it's getting bad, slowly, I think
[16:03:59] well, we might see some fairly-rapid improvement just from RL though, since RL objects have a shorter TTL to begin with, the oversized ones' HFPs are also short
[16:04:06] but most of the HFPs will have fairly long TTLs
[16:21:07] Traffic, Operations, Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (Gilles) I've compared 1+ million miss requests on cp3052 and cp3054 and looking at the most frequent miss URLs, there's no distinguishable pa...
[16:21:15] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) Hi @Tsevener -- wanted to check in about something. Is the version...
[16:21:31] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (CDanis) p: High→Medium
[17:10:32] bblack: so, the eqiad DNS patches are ready, given their size compared to the PoPs one I have a 99.3% confidence compared to the 100% of the others and it's starting to be a bit late, but it's now or next Thu. at the least because of switchover. Any thoughts? :) cc chaomodus
[17:19:19] volans: what about the latest "this must stay" comments?
[17:19:45] Already addressed, I commented them as resolved
[17:20:04] to avoid the double spam
[17:20:16] ok
[17:20:43] I'd say it's safer now than post-switchover, really
[17:20:43] double checking them
[17:21:42] yep confirmed all addressed
[17:22:04] either way there will be a small risk, but at least it's not so much to the wikis with eqiad depooled
[17:22:21] sure
[17:23:36] chaomodus: you around?
[17:23:44] yep
[17:23:56] I can start spamming IRC alerting people :)
[17:24:27] haha want me to do the merging then?
[17:24:42] wait a sec :)
[17:24:51] yes after warnings of course
[17:28:01] waiting for an ack from DCops to avoid issues
[17:28:17] preparing patches for the cookbooks/netbox scripts to mark eqiad as migrated
[17:31:56] ok patches ready, IRC spammed
[17:32:20] I guess we just need to decide if merging both at the same time or doing private, 1H, public, or something in the middle
[17:32:26] I vote for something in the middle
[17:32:36] like give ~20 minutes after the first
[17:33:20] that sounds reasonable
[17:33:31] is that enough delay to get any obvious errors?
[17:35:20] the TTL of all records is 1H but statistically speaking any large issue should arise way earlier as I expect records to expire their TTL randomly, is that a fair assumption bblack ?
[17:35:58] I will also manually clear the cache of a couple of records in the recursors to double verify their resolution
[17:36:23] roger
[17:36:25] it's hard to predict really
[17:37:02] there will be a splay of natural expiries for sure, but which of those would stumble on (a) a diff in the data that was unintended + (b) one that causes notable fallout... who knows when during the ~1H that happens, if at all
[17:38:01] yeah a specific record might be resolved only once a month for all we know
[17:38:43] the case we'd really care about is an important record that's spam-resolved every few seconds, but happened to pick up a fresh 1H TTL just before deploy, and then is broken by the change
[17:39:43] if you wanted to be sure you've seen fallout rapidly, you could also just clear eqiad caches
[17:40:08] in the current config, they refill from localhost, so it's not like it causes some big latency hit or whatever.
[17:40:38] you mean recdns?
[17:40:59] yeah, use rec_control wildcards and go after our local domains (to avoid wiping external cache entries)
[17:41:30] it's one way you could accelerate any fallout, while probably not causing other fallout :)
[17:41:38] the https://wikitech.wikimedia.org/wiki/DNS#How_to_Remove_a_record_from_the_DNS_resolver_caches only talks about specific hosts :)
[17:41:50] or you can just decide you've already done a ton of due diligence and trust it's all fine until proven otherwise :)
[17:42:48] https://doc.powerdns.com/recursor/manpages/rec_control.1.html
[17:42:50] ah with something like eqiad.wmnet$ ?
[17:42:51] ^ near the bottom
[17:43:01] yeah was looking there
[17:43:03] yeah
[17:43:48] sure we can do that too
[17:44:27] if the whole bash/cumin doesn't eat the $ :D
[17:45:42] are we ready?
[17:45:45] chaomodus: let's proceed
[17:45:50] private only first
[17:45:59] okay doing priv
[17:45:59] and SAL first
[17:46:40] bblack: this looks ok to you? sudo cumin 'A:dns-rec and A:eqiad' 'rec_control wipe-cache eqiad.wmnet$'
[17:47:07] will include mgmt too that we don't need but can't avoid it AFAICT
[17:47:17] volans: yeah, assuming as you said that $ doesn't fail on some intermediate quoting step, but I don't think it would.
[17:47:24] it shouldn't
[17:47:42] you could argue for removing the A:eqiad conditional
[17:47:55] after all, the broken case might be e.g. an esams cache resolving a .eqiad.wmnet hostname, or whatever
[17:48:33] I was planning first that one and then if nothing screams in 5 minutes also the others, but yeah can do all at once :D
[17:48:45] they have their own gdnsd to query right?
[17:48:56] *local gdnsd
[17:49:19] right, all recdns have a gdnsd they query over localhost
[17:49:28] priv done
[17:49:52] ok clearing cache
[17:50:47] [s/localhost/loopback/ , to be pedantic about it]
[17:50:48] {done}
[17:50:52] ack
[17:51:29] there is no easy wildcard for reverse records if I read it correctly
[17:51:46] you can scope them on 8-bit boundaries in the case of ipv4
[17:51:48] if you wanted to
[17:52:02] 2.2.10.in-addr.arpa$ and similar
[17:52:22] but it's a PITA to isolate out the eqiad parts that way
[17:52:37] and almost nothing functionally relies on PTR records in any important way, AFAIK
[17:52:43] yeah agree
[17:54:21] so far I don't see anything screaming in icinga, manual checks look good
[17:57:22] no news is good news around here :)
[17:57:33] ehehe :)
[17:58:37] chaomodus: let's wait a few more minutes and I guess we can merge the public one too
[17:59:04] for that one I'll invalidate wikimedia.org$
[17:59:16] okay
[17:59:43] bblack: sure it shouldn't harm to clear all wikimedia.org cache?
[17:59:55] sounds potentially harmfl :)
[18:00:00] *harmful
[18:00:25] breathing can be harmful too, but we keep doing it!
[18:00:36] lol
[18:00:38] but more-seriously, the biggest risk is probably the potential to trigger some powerdns bug
[18:00:58] in which case we'll restart the cache daemons if systemd doesn't do it for us
[18:00:59] I trust gdnsd will DTRT, from experience
[18:01:28] gdnsd won't even know what happened with rec_control, other than it gets a few extra queries briefly
[18:02:01] ok, tell the truth, you want us to do that so that we test this large wipe-cache for you and you can blame the netbox migration if it goes awry :D
[18:02:39] yes!
[18:02:47] bonus points if you document it too
[18:03:05] eqiad.wmnet probably had more hot cache entries than wikimedia.org anyways though
[18:03:19] yeah some reported 15k
[18:04:19] https://grafana.wikimedia.org/d/000000399/dns-recursors?orgId=1&from=now-30m&to=now
[18:04:54] https://grafana.wikimedia.org/d/Jj8MztfZz/authoritative-dns?viewPanel=2&orgId=1&from=now-1h&to=now
[18:05:07] ^ eqiad authdns not caring. there's some kinda little bump there, but it's within the usual noise
[18:06:06] chaomodus: let's go ahead with the public one
[18:06:08] nice
[18:06:32] okay
[18:10:23] authdns-update completed?
[18:11:00] chaomodus ^^^
[18:11:18] Ah yes, sorry it is complete
[18:11:43] bblack: OCSP for wikiworkshop.org warnings are unrelated right? just popped up on icinga
[18:12:36] wiped cache, less records indeed, more around 2k per host
[18:12:41] should be unrelated, looking
[18:18:45] volans: was the ocsp alert you saw a crit or a warning?
[18:18:53] warnings
[18:19:16] yeah ok, I think it's all fine
[18:19:49] we may need to revisit the overlap of acmechief LE renewal timelines and the icinga check thresholds
[18:21:12] ack, race between renew and check
[18:23:16] yeah, a fairly tight one. it looks like it lasts for roughly one full cycle of puppet agent runs
[18:23:26] the first agents to update don't spend so long in the warning state
[18:24:15] got it
[18:31:25] bblack: as a reward for your help :) https://wikitech.wikimedia.org/wiki/DNS#How_to_remove_an_entire_domain_from_the_DNS_resolver_caches
[18:32:10] (just fixed a small formatting issue)
[18:33:04] icinga still happy
[18:34:19] chaomodus: let's keep a close eye on icinga for another ~40m to make sure we go over the 1H TTL anyway, as we didn't wipe-cache everything
[18:34:31] but all looks good so far
[18:35:22] okay
[18:37:07] I'm preparing something for dinner, but I have the laptop nearby
[19:02:47] i'm going to lunch but i can be pinged, alerts look okay i think
[19:48:55] Traffic, Operations, observability, Performance-Team (Radar), Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (Ladsgroup)
[20:10:22] nothing exploded apparently?
[20:10:34] lol, I was writing kinda the same thing right now
[20:10:54] all seems good so far, I think I can call it a day
[20:11:09] but call / fix / revert if anything explodes
[20:13:04] oke doke
[20:13:08] thanks :)
[20:15:35] thank you :)
[22:07:09] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) @CDanis darn - yes, that's the right version, and we limited it to...
[22:11:24] Traffic, Operations, Wikipedia-iOS-App-Backlog, iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (Tsevener) @CDanis Sorry I just noticed you do mention in the description th...