[07:15:30] 10netops, 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10ayounsi)
[08:33:06] https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=13&orgId=1&var-cluster=upload&var-site=eqsin&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=1618225705030&to=1618389134340
[08:33:32] it looks like I've murdered stashbot and wikibugs lol
[08:33:53] anyways: +3% frontend hitrate in upload@eqsin with the exp policy
[08:36:41] I think something we should look into a bit before extending the policy to the rest of upload is nuke_limit: the default is 50, but we bumped it to 1K in text@everywhere and upload@eqsin
[08:38:17] the open questions I have are: how many objects are we usually currently nuking in practice? what's the drawback of having a Very Large nuke_limit?
[13:30:47] so, two surprising findings: (1) we do reach n_lru_limited fairly often (in bursts) on upload@esams (2) that does not result in 50x errors as I thought
[13:31:24] see for example https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from=1618340012231&to=1618355928021
[13:31:43] and no relevant 50x spike: https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=71&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from=1618340012231&to=1618355928021
[13:33:00] if anyone was in my brain when I thought that lru_limited implies 50x please speak up :)
[13:34:44] hm
[13:34:52] is it just lru_limited plus pass traffic together that is the issue?
[13:35:02] (sorry, not really awake yet)
[13:41:05] cdanis: almost! it's lru_limited plus streaming
[13:41:10] aha
[13:41:21] which presumably we enable for pass in upload
[13:42:35] it's the default, we have it on for everything
[13:43:29] so yeah we don't see 50x errors anywhere, but clients do indeed get only a partial transfer
[13:44:40] "transfer closed with outstanding read data remaining" is the specific varnish error
[13:46:09] lovely
[13:52:16] s/varnish error/curl error/
[13:56:21] ema: I wonder if Chromium calls that http.response.invalid.content_length_mismatch https://logstash.wikimedia.org/goto/5880494fe6a6fcf51c1811810d683cf2
[13:58:37] hmm Chromium should trigger that if CL header value is higher than what the UA gets on the socket, so it could be the same
[14:14:17] that's the same issue as https://phabricator.wikimedia.org/T266373, isn't it?
[14:14:45] indeed
[14:55:04] nice akosiaris, yeah
[14:56:57] ;-)
[15:32:55] 10Traffic, 10SRE, 10Patch-For-Review: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 (10ema) Apparently we do [[https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=97&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3065&from...
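For context on the nuke_limit / n_lru_limited discussion above, here is a rough sketch of how one might inspect the relevant values on a cache host with standard Varnish tooling. The counter names assume a reasonably recent Varnish, and the curl URL is a placeholder rather than a real test object:

    # Show the configured nuke_limit (default 50; bumped to 1000 on text everywhere
    # and on upload@eqsin, per the discussion above).
    varnishadm param.show nuke_limit

    # LRU counters: n_lru_nuked counts objects evicted to make room,
    # n_lru_limited counts fetches that hit the nuke_limit ceiling.
    varnishstat -1 -f MAIN.n_lru_nuked -f MAIN.n_lru_limited

    # Client-side symptom of hitting the limit while streaming is a truncated body;
    # curl reports "transfer closed with outstanding read data remaining" and exits
    # with code 18 (placeholder URL, for illustration only).
    curl -sS -o /dev/null 'https://upload.wikimedia.org/some/object'; echo "curl exit: $?"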
[15:40:48] 10netops, 10SRE: BGP: prioritize directly connected peers - https://phabricator.wikimedia.org/T280054 (10jbond) proposal seems fine to me however it would put it theses routes above PEER_INTERNAL which is probably fine but feels wrong ~~That said Im also curious why PEERING_ROUTE and PEERING_ROUTE_PRIMARY hav...
[15:46:50] what's the current state of the art for rebooting DNS recursors? https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) says to simply pool/depool, but when I did that with dns4001 it triggered BGP alerts, probably caused by the anycast work
[15:47:05] are these okay to ignore or is there something else that needs to be done?
[15:50:07] There's not a great way to suppress those BGP alerts, I think for now our best option is to just ACK them (or let them just self-resolve if it's a quick reboot)
[15:50:41] in general systemd should take down the service in an appropriate way and the BGP peer loss is an indication that it was effectively depooled from service at the router
[15:51:09] ack, ok, I'll simply sit them out, then. the reboots only take 2-3 minutes
[15:51:27] there's some other poorly-documented stuff to be aware of though:
[15:51:49] authdns* are not self-depooling and need some manual hand-holding
[15:52:01] and authdns is esams is actually on the dns300X machines
[15:52:04] moritzm: ^
[15:52:09] s/is/in/
[15:52:26] yes i just checked and the timers are low enough that the recursors should stop receiving traffic about ~1s after bird is stopped. it may be worth adding systemctl stop bird; sleep 1; to the restart instructions but probably overkill
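A minimal sketch of what that reboot sequence could look like on a recursor host, assuming the Debian unit names bird and pdns-recursor; the explicit bird stop plus a short sleep is the (possibly overkill) precaution suggested above, not the currently documented procedure:

    # Withdraw the anycast recdns route first, then stop the resolver and reboot.
    sudo systemctl stop bird
    sleep 2                           # give the router a moment to converge / drain traffic
    sudo systemctl stop pdns-recursor
    sudo reboot

    # After the host comes back, check both services before treating it as pooled.
    systemctl is-active bird pdns-recursor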
[15:53:51] bblack: ack, thanks. I was aware of authdns* being special (Arzhel redirects routers for those), but the dns3* situation was in fact new to me
[15:54:04] I think it's badly documented on our side in general
[15:54:22] I'll deal with them tomorrow and will update the Service restarts page accordingly when done
[15:54:26] but basically we don't have a separate authdns box in esams, and the router-level stuff is sending "ns2.wikimedia.org" to one of the dns300X's
[15:54:41] actually it might be hashing it to both of them
[15:55:28] ack, as long as we have ns2 drained, I'll simply do both before it gets live again
[16:03:35] it's probably better, in the esams case, to move ns2 between the dns300[12] hosts as we go
[16:03:52] the only other way to "drain" it would be to forward it over transport back to eqiad, and that adds a ton of latency.
[16:04:08] ok
[16:05:35] looking at the dashboard now, yes, I think it's hashing it over both
[16:08:16] bblack: we could have ns2 advertise its VIP to the routers as well
[16:09:58] yeah
[16:10:08] that's all kinda tied up in the anycast related things too :)
[16:10:36] in the moment, for a reboot happening now, I don't think we want to make design changes though
[16:10:58] but we can just turn off one as a destination on the router, do that reboot, and then switch to the other, etc
[16:11:57] last I looked at the advertising thing, the tricky part is the multi-layer dependency issues for bird and the routes
[16:12:19] (e.g. we don't want to withdraw an authdns route just because pdns-recursor needs a restart)
[16:13:31] but: even if authdns routes did withdraw along with recdns ones on a pdns-recursor restart, that would maybe be better than what we have today (manual routes)
[16:13:36] right, I remember now
[16:13:48] since pdns-rec restarts are still relatively-rare
[16:14:06] and shouldn't happen on both nodes at the same time
[16:15:25] but right now, if we push a config change that requires a pdns-recursor restart, I think there's already such a danger, sadly
[16:15:54] (because we're relying on agent timing, which can collide easily. we've observed cases of bad agent run cron overlaps before)
[16:19:50] it seems you need a cookbook to orchestrate the reboots ;)
[16:22:55] well, I'd argue we need better system designs to avoid the necessity of the cookbook :)
[16:24:13] the cron agent run collision pattern is really enlightening as to the root of it: that nothing is coordinating the higher-level view of cluster changes that require service outages.
[16:24:47] eheh :)
[16:25:21] this is sort of like what pybal is trying to paper over with its depool threshold
[16:25:34] but there's no control or feedback part in that mechanism
[16:26:06] (to push back on the depooler and ask them to wait until something else is repooled)
[16:27:11] + the notional difference between health/outage depools vs intentional maintenance that can be planned, delayed, and/or canceled.
[16:36:26] (+stacking of reasons for depooling, to complete the set of inputs: a host can be depooled for multiple health and/or planned reasons, and canceling one of those shouldn't cancel them all)
[18:49:28] 10Traffic, 10RESTBase, 10SRE, 10Page-Previews (Tracking), and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534 (10Jdlrobson)
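To make the "stacking of reasons for depooling" idea from 16:36:26 concrete, here is a purely hypothetical shell sketch: the reason directory and the depool/repool hooks are placeholders, not an existing tool. The point is only that a host stays depooled until every recorded reason has been cleared, so canceling one maintenance or health reason cannot accidentally repool a host that is still down for another.

    # Hypothetical sketch only: stacked depool reasons, as discussed above.
    REASON_DIR="/var/lib/depool-reasons"

    depool_for() {               # record one named reason, e.g. "reboot" or "health"
        mkdir -p "$REASON_DIR"
        touch "$REASON_DIR/$1"
        echo "depooled (reason added: $1)"        # a real depool hook would go here
    }

    repool_after() {             # clear one reason; only repool when none remain
        rm -f "$REASON_DIR/$1"
        if [ -z "$(ls -A "$REASON_DIR" 2>/dev/null)" ]; then
            echo "all reasons cleared: repooling"  # a real repool hook would go here
        else
            echo "still depooled for: $(ls "$REASON_DIR" | tr '\n' ' ')"
        fi
    }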