[07:15:30] 10netops, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ayounsi)
[08:33:06] https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=13&orgId=1&var-cluster=upload&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1618225705030&to=1618389134340
[08:33:32] it looks like I've murdered stashbot and wikibugs lol
[08:33:53] anyways: +3% frontend hitrate in upload@eqsin with the exp policy
[08:36:41] I think something we should look into a bit before extending the policy to the rest of upload is nuke_limit: the default is 50, but we bumped it to 1K in text@everywhere and upload@eqsin
[08:38:17] the open questions I have are: how many objects are we usually currently nuking in practice? what's the drawback of having a Very Large nuke_limit?
[13:30:47] so, two surprising findings: (1) we do reach n_lru_limited fairly often (in bursts) on upload@esams (2) that does not result in 50x errors as I thought
[13:31:24] see for example https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from=1618340012231&to=1618355928021
[13:31:43] and no relevant 50x spike: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=71&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from=1618340012231&to=1618355928021
[13:33:00] if anyone was in my brain when I thought that lru_limited implies 50x please speak up :)
[13:34:44] hm
[13:34:52] is it just lru_limited plus pass traffic together that is the issue?
[13:35:02] (sorry, not really awake yet)
[13:41:05] cdanis: almost! it's lru_limited plus streaming
[13:41:10] aha
[13:41:21] which presumably we enable for pass in upload
[13:42:35] it's the default, we have it on for everything
[13:43:29] so yeah we don't see 50x errors anywhere, but clients do indeed get only a partial transfer
[13:44:40] "transfer closed with outstanding read data remaining" is the specific varnish error
[13:46:09] lovely
[13:52:16] s/varnish error/curl error/
[13:56:21] ema: I wonder if Chromium calls that http.response.invalid.content_length_mismatch https://logstash.wikimedia.org/goto/5880494fe6a6fcf51c1811810d683cf2
[13:58:37] hmm Chromium should trigger that if CL header value is higher than what the UA gets on the socket, so it could be the same
[14:14:17] that's the same issue as https://phabricator.wikimedia.org/T266373, isn't it?
[14:14:45] indeed
[14:55:04] nice akosiaris, yeah
[14:56:57] ;-)
[15:32:55] 10Traffic, 10SRE, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Apparently we do [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from...
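For context on the nuke_limit / n_lru_limited discussion above, here is a rough sketch of how one might inspect the relevant values on a cache host with standard Varnish tooling. The counter names assume a reasonably recent Varnish, and the curl URL is a placeholder rather than a real test object:

    # Show the configured nuke_limit (default 50; bumped to 1000 on text everywhere
    # and on upload@eqsin, per the discussion above).
    varnishadm param.show nuke_limit

    # LRU counters: n_lru_nuked counts objects evicted to make room,
    # n_lru_limited counts fetches that hit the nuke_limit ceiling.
    varnishstat -1 -f MAIN.n_lru_nuked -f MAIN.n_lru_limited

    # Client-side symptom of hitting the limit while streaming is a truncated body;
    # curl reports "transfer closed with outstanding read data remaining" and exits
    # with code 18 (placeholder URL, for illustration only).
    curl -sS -o /dev/null 'https://upload.wikimedia.org/some/object'; echo "curl exit: $?"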
[15:40:48] 10netops, 10SRE: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10jbond) proposal seems fine to me however it would put it theses routes above PEER_INTERNAL which is probably fine but feels wrong ~~That said Im also curious why PEERING_ROUTE and PEERING_ROUTE_PRIMARY hav...
[15:46:50] what's the current state of the art for rebooting DNS recursors? https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) says to simply pool/depool, but when I did that with dns4001 it triggered BGP alerts, probably caused by the anycast work
[15:47:05] are these okay to ignore or is there something else that needs to be done?
[15:50:07] There's not a great way to suppress those BGP alerts, I think for now our best option is to just ACK them (or let them just self-resolve if it's a quick reboot)
[15:50:41] in general systemd should take down the service in an appropriate way and the BGP peer loss is an indication that it was effectively depooled from service at the router
[15:51:09] ack, ok, I'll simply sit them out, then. the reboots only take 2-3 minutes
[15:51:27] there's some other poorly-documented stuff to be aware of though:
[15:51:49] authdns* are not self-depooling and need some manual hand-holding
[15:52:01] and authdns is esams is actually on the dns300X machines
[15:52:04] moritzm: ^
[15:52:09] s/is/in/
[15:52:26] yes i just checked and the timers are low enough that the recursors should stop receiving traffic about ~1s after bird is stopped. it may be worth adding systemctl stop bird; sleep 1; to the restart instructions but probably overkill
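A minimal sketch of what that reboot sequence could look like on a recursor host, assuming the Debian unit names bird and pdns-recursor; the explicit bird stop plus a short sleep is the (possibly overkill) precaution suggested above, not the currently documented procedure:

    # Withdraw the anycast recdns route first, then stop the resolver and reboot.
    sudo systemctl stop bird
    sleep 2                           # give the router a moment to converge / drain traffic
    sudo systemctl stop pdns-recursor
    sudo reboot

    # After the host comes back, check both services before treating it as pooled.
    systemctl is-active bird pdns-recursor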
[15:53:51] bblack: ack, thanks. I was aware of authdns* being special (Arzhel redirects routers for those), but the dns3* situation was in fact new to me
[15:54:04] I think it's badly documented on our side in general
[15:54:22] I'll deal with them tomorrow and will update the Service restarts page accordingly when done
[15:54:26] but basically we don't have a separate authdns box in esams, and the router-level stuff is sending "ns2.wikimedia.org" to one of the dns300X's
[15:54:41] actually it might be hashing it to both of them
[15:55:28] ack, as long as we have ns2 drained, I'll simply do both before it gets live again
[16:03:35] it's probably better, in the esams case, to move ns2 between the dns300[12] hosts as we go
[16:03:52] the only other way to "drain" it would be to forward it over transport back to eqiad, and that adds a ton of latency.
[16:04:08] ok
[16:05:35] looking at the dashboard now, yes, I think it's hashing it over both
[16:08:16] bblack: we could have ns2 advertise its VIP to the routers as well
[16:09:58] yeah
[16:10:08] that's all kinda tied up in the anycast related things too :)
[16:10:36] in the moment, for a reboot happening now, I don't think we want to make design changes though
[16:10:58] but we can just turn off one as a destination on the router, do that reboot, and then switch to the other, etc
[16:11:57] last I looked at the advertising thing, the tricky part is the multi-layer dependency issues for bird and the routes
[16:12:19] (e.g. we don't want to withdraw an authdns route just because pdns-recursor needs a restart)
[16:13:31] but: even if authdns routes did withdraw along with recdns ones on a pdns-recursor restart, that would maybe be better than what we have today (manual routes)
[16:13:36] right, I remember now
[16:13:48] since pdns-rec restarts are still relatively-rare
[16:14:06] and shouldn't happen on both nodes at the same time
[16:15:25] but right now, if we push a config change that requires a pdns-recursor restart, I think there's already such a danger, sadly
[16:15:54] (because we're relying on agent timing, which can collide easily. we've observed cases of bad agent run cron overlaps before)
[16:19:50] it seems you need a cookbook to orchestrate the reboots ;)
[16:22:55] well, I'd argue we need better system designs to avoid the necessity of the cookbook :)
[16:24:13] the cron agent run collision pattern is really enlightening as to the root of it: that nothing is coordinating the higher-level view of cluster changes that require service outages.
[16:24:47] eheh :)
[16:25:21] this is sort of like what pybal is trying to paper over with its depool threshold
[16:25:34] but there's no control or feedback part in that mechanism
[16:26:06] (to push back on the depooler and ask them to wait until something else is repooled)
[16:27:11] + the notional difference between health/outage depools vs intentional maintenance that can be planned, delayed, and/or canceled.
[16:36:26] (+stacking of reasons for depooling, to complete the set of inputs: a host can be depooled for multiple health and/or planned reasons, and canceling one of those shouldn't cancel them all)
[18:49:28] 10Traffic, 10RESTBase, 10SRE, 10Page-Previews (Tracking), and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534 (10Jdlrobson)
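To make the "stacking of reasons for depooling" idea from 16:36:26 concrete, here is a purely hypothetical shell sketch: the reason directory and the depool/repool hooks are placeholders, not an existing tool. The point is only that a host stays depooled until every recorded reason has been cleared, so canceling one maintenance or health reason cannot accidentally repool a host that is still down for another.

    # Hypothetical sketch only: stacked depool reasons, as discussed above.
    REASON_DIR="/var/lib/depool-reasons"

    depool_for() {               # record one named reason, e.g. "reboot" or "health"
        mkdir -p "$REASON_DIR"
        touch "$REASON_DIR/$1"
        echo "depooled (reason added: $1)"        # a real depool hook would go here
    }

    repool_after() {             # clear one reason; only repool when none remain
        rm -f "$REASON_DIR/$1"
        if [ -z "$(ls -A "$REASON_DIR" 2>/dev/null)" ]; then
            echo "all reasons cleared: repooling"  # a real repool hook would go here
        else
            echo "still depooled for: $(ls "$REASON_DIR" | tr '\n' ' ')"
        fi
    }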