[05:46:33] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations, 10Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (10Majavah)
[06:11:49] cdanis: storage wiping can only be done offline. See https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Cheatsheet
[06:22:46] 10Traffic, 10Operations, 10Repository-Admins: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10ema) >>! In T249606#6036981, @Dzahn wrote: > Please see https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests Done, thanks @Dzahn.
[08:33:42] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations, 10Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:34:11] 10Traffic, 10Operations, 10Repository-Admins: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:31:16] 10Traffic, 10OpenRefine, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Anomie) There was no change to MediaWiki with respect to output of Set-Cookie headers. For...
[13:34:14] purged is now running on cp3050 and it seems to be working alright. Some initial observations:
[13:34:38] purge "hitrate" is awful, as we know
[13:34:55] purged_http_requests_total{layer="backend",status="200"} 10160
[13:34:55] purged_http_requests_total{layer="backend",status="404"} 1.078386e+06
[13:36:19] I was semioptimistically thinking that concurrency=4 would have been enough to avoid backlog given that this was the case in my tests on cp2009. In esams it's not enough, I went all in with -concurrency 48 and backlog is 0 so far
[13:36:32] well at times it grows a little, but it quickly goes back to 0
[13:37:15] see curl -s 127.0.0.1:2112/metrics | grep ^purged
[13:41:08] varnish-fe and ats-be cpu usage has increased significantly now that they get all the purges
[13:45:16] 10Traffic, 10OpenRefine, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Pintoch) Thank you very much for the investigation. We have patched Wikidata-Toolkit, rel...
[13:49:18] 10Traffic, 10OpenRefine, 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10CDanis) This seems likely related to us switching from using nginx as our...
[13:53:35] 10Traffic, 10Operations, 10Patch-For-Review: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (10ema) Deployed on cp3050. Compared to vhtcpd, CPU usage is ~ 3x - 4x which I'd say is good enough for now. {F31743497} Here's the 10s CPU profile with production traffic which might suggest...
[14:14:02] ema: wow, <1% hitrate is even worse than I expected
[14:14:52] cdanis: it might be a bit higher on frontends
[14:15:10] we can't really find out though, varnish says 204 regardless of whether the object was actually purged or not
[14:15:23] ahh
[14:16:25] s/varnish/our VCL/ - there might be a way to distinguish and return 404
[15:14:14] I think there is
[15:14:28] it's hard to say whether it would be better or worse
[15:14:51] the backends have the disadvantage of chashing (at best, 1/8 of the possible URLs would ever actually exist on a given node for a hit)
[15:15:04] but the frontends have the disadvantage of smaller total storage, too
[15:15:32] but mostly, it's that we issue lots and lots of purges for things that aren't even remotely hot in various cache layers/locations :)
[15:22:07] bblack: hi!
[15:22:13] we have an interesting problem :)
[15:23:00] ?
[15:23:37] transient memory on various upload nodes is giving trouble, see eg:
[15:23:39] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=now-2d&to=now
[15:24:09] and
[15:24:10] https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?panelId=8&fullscreen&orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3051&from=now-7d&to=now
[15:24:24] 10Traffic, 10OpenRefine, 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Pintoch) Sure, it's your call. I am sure you have more important things to...
[15:24:51] for some reason varnish-fe does not seem to be able to free it?
[15:25:12] any theories on the start transition from normalcy to badness there?
[15:25:15] s/start/stark/
[15:26:35] a couple of days ago I upgraded to a patched version of varnish including my patch for https://github.com/varnishcache/varnish-cache/issues/3273
[15:27:01] ah
[15:27:14] but that doesn't line up with that transition point either, which is quite recent
[15:27:30] yeah, but it took ~ 1 day for the regular malloc storage to fill up
[15:27:48] it was ~24h ago
[15:29:12] ok yeah scratch the comment about malloc storage taking 1d to fill up
[15:29:15] perhaps, it's possible that the new error-handling block leaks some resource when cleaning up there?
[15:29:42] it's possible, yeah
[15:30:48] mmh so the cleanup for a regular "Could not get storage" does:
[15:30:51] bo->htc->doclose = SC_RX_BODY;
[15:30:54] VFP_Close(bo->vfc);
[15:30:56] VDI_Finish(bo->wrk, bo);
[15:31:21] my patch calls VFP_Close and VDI_Finish, but does not set doclose
[15:31:52] hmmm
[15:32:11] right
[15:32:19] what's bo->htc again? I know bo is busyobj
[15:32:55] http_conn apparently
[15:33:02] yeah
[15:34:09] looking at the code that actually reads htc->doclose
[15:34:40] I think setting bo->htc->doclose = causes the TCP connection to be closed, otherwise it would be recycled for reuse.
[15:35:12] the TCP conn to the backend, I believe?
[15:35:16] yeah
[15:35:20] yeah
[15:35:27] it seems wise to close it after something like that happens
[15:36:19] ah look
[15:36:19] https://github.com/varnishcache/varnish-cache/pull/3278
[15:39:01] so the merged one, replaces your:
[15:39:01] + VFP_Close(bo->vfc);
[15:39:02] + VDI_Finish(bo->wrk, bo);
[15:39:04] with:
[15:39:10] vbf_cleanup(bo);
[15:39:14] I think?
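(On the 14:15/14:16 exchange above, where the frontend VCL answers 204 whether or not anything was actually invalidated: one way to distinguish is the vmod-purge bundled with Varnish 5.2 and later, which may not be available on the 5.x frontends discussed here. The backend, ACL and header name below are illustrative placeholders rather than the production configuration; this is a minimal sketch of the idea, not WMF's VCL.)

```vcl
vcl 4.0;

import purge;   # bundled with Varnish >= 5.2; assumed available for this sketch

# Illustrative placeholders, not the production config.
backend default {
    .host = "127.0.0.1";
    .port = "3128";
}

acl purge_allowed {
    "127.0.0.1";
}

sub vcl_recv {
    if (req.method == "PURGE") {
        if (client.ip !~ purge_allowed) {
            return (synth(405, "Method Not Allowed"));
        }
        # Look the object up instead of return(purge), so that vcl_hit/vcl_miss
        # can report whether anything was actually removed.
        return (hash);
    }
}

sub vcl_hit {
    if (req.method == "PURGE") {
        # purge.hard() invalidates the cached object(s) for this hash and
        # returns how many were removed.
        set req.http.X-N-Purged = purge.hard();
        return (synth(204, "Purged"));
    }
}

sub vcl_miss {
    if (req.method == "PURGE") {
        return (synth(404, "Not in cache"));
    }
}
```

(With something like this, the frontend purge hit rate could be read off the same 204-vs-404 per-layer counts that purged already exports.)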
[15:39:29] no, vbf_cleanup() is a recent function, we don't have it in 5.x
[15:39:34] right, that's master
[15:39:58] my backport of the patch to 5.x might have something wrong, but I can't find it now
[15:40:20] (other than not closing the backend conn, which according to nigoroll is fine)
[15:40:31] see https://github.com/varnishcache/varnish-cache/pull/3274#issuecomment-608563773
[15:41:19] also what's not being freed is transient, which shouldn't be affected by not closing the backend conn I think?
[15:41:40] lots of negations in that sentence :)
[15:42:29] transient might also include various temporary buffers related to the conn? I don't recall.
[15:42:43] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?panelId=5&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=frontend&from=now-7d&to=now
[15:42:54] also, it's possible the real leak isn't in transient, and that transient explosion is just a fallout from something else going wrong.
[15:43:18] we do definitely have a sharp increase in fe<->be connection rate
[15:45:01] right
[15:45:22] it probably doesn't hurt to set doclose :)
[15:45:22] which maybe implies that even though we're not setting doclose, something about that error cleanup is causing those connections to die regardless
[15:45:50] hmm
[15:45:55] does the VSL give any hints?
[15:45:57] ah, wait
[15:46:12] what's failing frequently is getting normal storage
[15:46:41] well because mbox lag has ramped off and we have lots of nuke/kill traffic
[15:47:03] e.g. in a random bit of varnishlog output on 3051 fe:
[15:47:05] - Fetch_Body 3 length stream
[15:47:05] - ExpKill LRU_Cand p=0x7fce89c28120 f=0x0 r=1
[15:47:05] - ExpKill LRU x=160106851
[15:47:05] - ExpKill LRU_Cand p=0x7fb2ee86c1a0 f=0x0 r=1
[15:47:07] - ExpKill LRU x=954388921
[15:47:10] - BackendReuse 727 vcl-71d41936-29ec-40bd-b31f-ab5bc1646fd5.be_cp3059_esams_wmnet
[15:49:43] the few LRU_Fail I see are failures to allocate transient, but I'm just looking at random items out of a log
[15:50:58] it is also possible that the issue is not in the cleanup code, just a situation we never got into because (1) in the past we did not limit transient on upload and (2) since we do limit it, varnish would have crashed eventually due to https://github.com/varnishcache/varnish-cache/issues/3273
[15:52:08] we've been limiting it on upload a while right?
[15:52:51] less than a month?
[15:53:26] https://github.com/wikimedia/puppet/commit/1a8df5d49172f60f475cf90a94af64be49a30830
[15:53:36] 23 days :)
[15:53:52] ah
[15:54:17] it's quite likely that upload needs a lot more
[15:54:19] we started doing that to deal with OOM shootings like T185968
[15:54:19] T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968
[15:54:42] bblack: agreed, yeah
[15:55:03] so maybe leave the patchwork alone for now and just try raising it to
[15:55:16] I guess looking at historical graph?
[15:55:57] yeah the esams upload nodes are much spikier on transient history
[15:56:02] the spikes are very narrow, but go quite high
[15:56:08] in the past 90 days cp3051 never went above 8G apparently
[15:57:11] but 3053 had a spike to 23G back on Mar 12
[15:57:19] some others have some short tall spikes like that too
[15:58:38] 3053 had a spike of 93G as well, but that would seem a bit excessive :)
[16:00:31] note that by default transient is unlimited, so I wouldn't be surprised to find that dealing with limited transient is something varnish has trouble with under special circumstances
[16:02:05] bblack: I've gotta go soon, shall we start by doubling the limit and setting it to 10G for now?
[16:02:49] yeah sounds good
[16:03:32] when I was looking at varnishlog for some of the failing transient allocation cases, they were puzzling to me
[16:04:02] e.g. one of them was for the common case of a regular cacheable thumbnail
[16:04:15] it was a backend hit, coming through the frontend, with a CL size of ~12MB
[16:04:36] but v-fe chose to do a pass, which caused transient to be used, which then failed at allocation
[16:04:53] I know we had some size-based passing and the statistical one too
[16:04:55] hmmm
[16:05:04] yeah, pass is affected
[16:05:11] also the hfp rate is now lower than usual
[16:07:12] so our current admission_policy is 'none'
[16:07:43] so we're not doing the n-hit-wonder code, and we're not doing the exp code
[16:08:09] which means the remaining related chunk of code that's active on upload is the one that does:
[16:08:13] // hit-for-pass objects >= 256K size
[16:08:21] so that's why that 12MB image is a pass
[16:08:44] mmh
[16:08:55] there's probably a lot of stuff > 256K
[16:09:37] well yeah
[16:09:39] bblack: shall we? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/587551/
[16:09:41] that was intentional though
[16:10:00] +1
[16:10:17] so the 256K cutoff was what we started doing a very long time ago I think, before the nhw/exp experiments
[16:10:32] I just don't recall exactly when or why we went back to 'none' instead of nhw/exp
[16:10:56] I guess nhw might be out because we're not getting hitcount reports from ats-be like we were v-be?
[16:11:35] but the triggering condition could just be that some >256K objects got very popular and lots of parallel pass traffic through the fe ties up transient, too
[16:11:51] whereas exp might've caught it, maybe
[16:12:06] (or nhw, I guess)
[16:12:42] we stopped the exp experiments (ha) because of lack of time and resources I think
[16:13:02] does ats-be report hit/42 or whatever?
[16:13:08] nope
[16:13:16] we cannot do nhw but we can do exp
[16:13:39] right, and the time and resources part would've been to keep updating the statistic to derive the magic exponent from
[16:15:18] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#13
[16:15:24] bblack: ok to proceed with upload frontend restarts?
[16:15:45] ema: ack, +1
[16:16:00] so our last stats run from dsb was late 2016
[16:16:06] ~3.5 years old
[16:16:23] but still, it's probably not a horribly unreasonable thing to assume that those numbers are still in the ballpark of useful
[16:18:56] yeah
[16:19:05] (and we could monitor whether they're useful in two senses: (a) whether fe hitrate improves over time by switching from <256K to "exp", and (b) whether this decreases transient storage utilization/spikes. It's probably possible we can get either or both or neither.
[16:19:09] )
[16:19:37] :)
[16:19:45] the raw hitrate could stay ~same or even get slightly worse, but some of the new hits are on important objects for avoiding transient usage spikes, for instance.
[16:20:29] (e.g. a handful of hot objects that are >256K and causing transient issues, but take up the room that 100 other hittable objects could've used)
[16:21:16] anyways, will see how the 10G bump plays out first
[16:24:00] if we retry "exp", we should probably resurrect this block too (10M limit, so that exp doesn't try to fe-cache some huge object relative to storage): https://phabricator.wikimedia.org/T144187#2775914
[16:24:05] but fe-only, as noted there
[16:49:42] bblack: which block?
[16:50:36] oh bad link heh
[16:51:04] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/393227/2/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[16:51:09] ^ that from an old abandoned patch of yours
[16:51:19] (which needs some work to be fe-specific)
[16:51:29] well it can only be fe, right?
[16:51:45] oh, yeah :)
[16:51:51] :)
[16:52:15] I guess we never wrote a VCL plugin for ATS :)
[16:52:25] * bblack hides
[16:54:17] gotta go, see you tomorrow! restarts are carrying on happily in screen on cu.min1001
[16:54:37] . to avoid summoning r.i.c.c.a.r.d.o
[16:56:39] cya
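(For the size-cutoff discussion above — the 16:08 "hit-for-pass objects >= 256K" behaviour and the ~10MB frontend ceiling proposed in T144187#2775914 / the abandoned change 393227 — here is a minimal, frontend-only sketch of both cutoffs. The two byte thresholds come from the conversation; the Content-Length fallback value, the hit-for-pass TTL, the backend definition and the use of beresp.uncacheable are illustrative assumptions rather than WMF's actual VCL, and the probabilistic "exp" admission itself is not implemented here.)

```vcl
vcl 4.0;

import std;

# Illustrative backend; on a real frontend this would point at the local ats-be.
backend default {
    .host = "127.0.0.1";
    .port = "3128";
}

sub vcl_backend_response {
    # Treat a missing Content-Length (e.g. chunked responses) as "very large"
    # via the fallback value, so such responses are not admitted either.
    if (std.integer(beresp.http.Content-Length, 1048576000) > 10485760) {
        # ~10MB ceiling along the lines of T144187#2775914: keep huge objects
        # out of the small frontend cache entirely (hit-for-miss here).
        set beresp.uncacheable = true;
        return (deliver);
    }
    if (std.integer(beresp.http.Content-Length, 262145) > 262144) {
        # The existing ">= 256K becomes hit-for-pass" cutoff: remember the
        # decision for a while so later requests pass instead of coalescing.
        return (pass(10m));   # hit-for-pass TTL is an illustrative value
    }
}
```

(The point of the 10MB branch is only to keep a revived "exp" policy from admitting objects that are huge relative to the small frontend malloc storage; whether to express that as hit-for-miss, as sketched, or as hit-for-pass is a separate choice.)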