[10:59:21] 10Varnish: latest varnishkafka fails to build on Debian - https://phabricator.wikimedia.org/T186250#3938638 (10Liuxinyu970226) @Jrdnch thank you for reporting, but as you can see, the #varnish project is archived now, wondering if you could pick another project? Or otherwise we can just decline or invalid this t...
[13:41:10] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Prtksxna) >>! In T185282#3914281, @Dzahn wrote: > @Volker_E Gotcha! So you are requesting web space to clone to from Gi...
[14:23:41] no mbox lag ramps in ulsfo yet, so far so good :) But we're probably being helped by the fact that everything restarted Friday, so none of them have had >half-week uptimes yet.
[14:24:36] what's interesting, though, is they seem to slowly be building up a baseline amount of mbox lag in the single digits
[14:24:54] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=1517712730805&to=1517840574533&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&var-server=All
[14:25:40] that seems like a possible indicator of a minor bug that will effectively leak mbox slots over time, but probably acceptable on a weekly (or even monthly, if the tiny rate holds) restart schedule.
[14:38:40] yup, so far so good in ulsfo
[14:39:24] cp3036 (v5, upload-esams) was lagging this morning though, so I don't think v5 fixed the issue altogether
[14:39:57] it does not seem to have made it worse, which is something :)
[14:40:05] * volans' mind cannot help but make a mental parallel between mbox lag and swap usage every time :-P
[14:46:03] mbox lag is tricky to think about, because we've abstracted out a metric that represents an indicator of one or more issues that aren't directly related, etc...
[14:46:40] it's not necessarily bad that there's mailbox lag, but we've found that when mbox lag metrics look pathological, almost certainly other bad things are happening that we care about :)
[14:46:51] yeah!
[14:48:00] looking at it from the varnish source code perspective, it can be roughly paraphrased like this (from memory, possibly with minor factual incorrectness!):
[14:49:12] (1) The reclamation of cache space from naturally-expired objects, and the reconciliation of various metadata from immediately-evicted-for-space-reasons objects, happens in a single thread for all of varnishd.
[14:50:12] (2) The rest of varnishd uses hundreds of threads (in a way that I hate), which is a thread-per-connection sort of model instead of thread-per-CPU-core, which tends to cause lock-scaling problems if you're not careful.
[14:51:06] (3) All of these other connection-handling threads have to send notifications to the singular expiry thread when they need to evict space, so it can update metadata. They do this through a serial message-passing queue with a pthread lock protecting updates on both sides.
[14:51:38] (4) Varnish logs two metrics about this queue: how many messages are enqueued by all the connection threads, and how many the singular expiry thread has received and acted on.
[14:52:43] (5) So we log "mailbox lag" as (sent - received), indicating how many backlogged messages are in that queue, either because the expiry thread is overworked and can't keep up, or because lock contention to write to the queue is making the whole scheme perform horribly (or perhaps because the eviction rate has crazy spikes that no reasonable scheme like this can keep up with)
[14:54:14] the main problem we were tracking in the "original" case that led to us paying attention to mbox lag
[14:55:01] was that the backend storage engine "file" we're using for disk caches isn't very scalable/efficient. In particular, it pays little attention to fragmentation-related issues and doesn't handle them well.
[14:55:44] e.g. cache is full of 1K-sized objects, new object arrives of size 512MB -> must evict 512K separate objects on the spot (each one a message to the expiry thread...)
[14:56:08] (that's an oversimplification, but captures the gist of it)
[14:56:12] got it
[14:59:24] so a steadily rising or suddenly-spiking mailbox lag almost always means some kind of pathological perf issue somewhere in varnishd, but it's not always easy to figure out why.
[15:00:51] and I guess that lock contention is the worse of the two possible causes
[15:01:25] well really there's 3 basic causes (although knowing which of the 3 is still only a partial answer about root causes):
[15:01:49] 1) The expiry thread is overworked and simply can't drain the queue fast enough (e.g. a CPU- or I/O-limited singular expiry thread)
[15:02:24] 2) Lock contention on the queue itself, as in the workload is reasonable on both sides but artificially constrained by pthread scaling or priority-inversion-like issues.
[15:02:52] 3) Unreasonable/overwhelming amounts of evictions/expiries.
[15:03:47] I guess 3 and 1 seem like re-wordings of the same thing. I think of (1) as being more like "the connection threads are putting things into the queue ok without stalling themselves, but the expiry thread isn't draining fast enough"
[15:04:07] 3 is more 1+2 :D
[15:04:15] and (3) more like "the connection threads are stalling out on other operations because they're inserting so much stuff into the queue"
[15:05:08] I guess a lot of it's perspective. But probably the most important perspective is whether connection threads are successful in answering user requests with reasonable latency.
[15:05:24] there's also an interesting increase in read operations disk-wise, see e.g. earlier today on cp3036
[15:05:27] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?var-server=cp3036&var-datasource=esams%20prometheus%2Fops&orgId=1&from=1517819630017&to=1517838059986
[15:05:47] ~25% iowait
[15:06:16] yeah, I think once the mbox grows large, the process of inserting and deleting events may not scale well either, even in the single-threaded sense.
[15:07:33] a guess at how all the above can correlate with heavy read iowait would be that it's scanning existing cache storage to find evictable objects, or loading up their metadata to process the eviction event, etc
[15:09:22] aside from the large design issues about thread-per-connection and about the singular expiry thread as a contention point, one thing that really seems to be missing here is some analog to the Linux kernel's /proc/sys/vm/min_free_kbytes
[15:10:19] i.e. the expiry thread should pre-expire objects in an attempt to maintain some configurable X amount of free space for new insertions, so that it doesn't become the case that once the cache is full, almost every insertion has to go find objects to evict while handling a request.
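
To make the (1)-(5) description concrete, here is a minimal toy sketch in C of such a mailbox, assuming it is a mutex-protected producer/consumer queue with a pair of sent/received counters. This is not Varnish's actual expiry code; the structure and names are illustrative, and the exp_mailed / exp_received references in the comments are the varnishstat counters the lag graphs are understood to subtract.

    /* Toy model of the expiry "mailbox" -- NOT Varnish's actual exp code, just
     * an illustration of why lag = sent - received and where the failure modes
     * come from. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct exp_msg {
        struct exp_msg *next;
        uintptr_t objcore;        /* stand-in for the object being expired/evicted */
    };

    struct mailbox {
        pthread_mutex_t mtx;      /* single lock shared by all producers and the consumer */
        pthread_cond_t cond;
        struct exp_msg *head, *tail;
        uint64_t sent;            /* roughly varnishstat's MAIN.exp_mailed */
        uint64_t received;        /* roughly varnishstat's MAIN.exp_received */
    };

    /* Called by any of the hundreds of connection threads when a request forces
     * an eviction: every call takes the one global lock. */
    static void
    mbox_post(struct mailbox *mb, uintptr_t objcore)
    {
        struct exp_msg *m = malloc(sizeof *m);

        if (m == NULL)
            abort();
        m->next = NULL;
        m->objcore = objcore;
        pthread_mutex_lock(&mb->mtx);
        if (mb->tail != NULL)
            mb->tail->next = m;
        else
            mb->head = m;
        mb->tail = m;
        mb->sent++;
        pthread_cond_signal(&mb->cond);
        pthread_mutex_unlock(&mb->mtx);
    }

    /* The singular expiry thread: drains one message at a time and does the
     * (potentially I/O-heavy) storage/metadata bookkeeping outside the lock. */
    static void *
    exp_thread(void *arg)
    {
        struct mailbox *mb = arg;
        struct exp_msg *m;

        for (;;) {
            pthread_mutex_lock(&mb->mtx);
            while (mb->head == NULL)
                pthread_cond_wait(&mb->cond, &mb->mtx);
            m = mb->head;
            mb->head = m->next;
            if (mb->head == NULL)
                mb->tail = NULL;
            mb->received++;
            pthread_mutex_unlock(&mb->mtx);
            /* ... reclaim storage and update metadata for m->objcore here ... */
            free(m);
        }
        return NULL;
    }

    /* "Mailbox lag" as graphed: how far the consumer is behind the producers. */
    static uint64_t
    mbox_lag(struct mailbox *mb)
    {
        pthread_mutex_lock(&mb->mtx);
        uint64_t lag = mb->sent - mb->received;
        pthread_mutex_unlock(&mb->mtx);
        return lag;
    }

The three causes map onto this sketch directly: cause 1 is the consumer loop spending too long per message, cause 2 is everyone serializing on mb->mtx, and cause 3 is mbox_post() simply being called far too often (the 512K-evictions-for-one-insert scenario). A min_free_kbytes-style analog would add a background pass in the expiry thread that pre-evicts until free space rises above a configurable watermark, so that mbox_post() is rarely on a request's critical path.
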
[15:15:33] anyways, the malloc backend seems to perform well. Hopefully we'll stop using "file" at some point in the medium-term future :)
[16:17:07] 10Traffic, 10Operations, 10Accessibility, 10Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3946497 (10TheDJ) 05Open>03Invalid I'm hereby closing this ticket as invalid (works as intended), as the communic...
[17:13:02] ema: I assume cp3036 self-resolved? I wonder if it caused lag or failed fetches, though. Also, it seems to have done the same thing 24h earlier, with an eerily similar shape on the graph even.
[17:13:30] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=now-2d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All
[17:21:13] bblack: yup, it did fail a few fetches but it had already self-resolved by the time I started looking
[17:22:19] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3946656 (10RobH) a:05BBlack>03Cmjohnson Ok, this is ready to have the 4 new LVS systems racked, one per row.
[17:22:40] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3946659 (10RobH)
[17:25:13] interesting how the objhdr lock operations graphs also look very similar (yesterday vs today)
[17:26:15] maybe (not now, but maybe later) we could/should retry the pthread-level attempted fixup, with v5?
[17:28:02] exp_thread_rt + exp_lck_inherit stuff I mean (I think we still have the patch in v4, I doubt we brought it to v5, not sure?)
[17:29:06] yes we have forward-ported it to v5!
[17:30:21] oh, cool :)
[17:31:13] I only vaguely recall when we tried it before with v4. I know it didn't end up fixing anything in practice. I don't think it did anything bad, either? But maybe with other improvements/changes having happened, it's capable of making a difference in the remaining parts of the problem.
[17:36:55] yes that's what I remember too: no big improvement, nothing bad either. I was just now going through T145661 again to see if we did mention anything re: the RT patch (man, how much work we've done on this!)
[17:36:56] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[17:38:18] there's no specific reason mentioned on the ticket for why we stopped the RT experiment though
[17:39:00] I think it was just a general case of "if it's not fixing anything, let's not pointlessly customize the default behavior in ways that may eventually bite us"
[17:39:57] especially with both booleans turned on, there's a real possibility it causes an even worse performance meltdown in a new and novel way, in exchange for the old one :)
[18:00:51] so cp3036 is bound to be weekly-restarted soon, perhaps it would be a good candidate for giving the RT experiment another shot?
[18:33:40] 10Traffic, 10netops, 10Operations: Anycast recdns - https://phabricator.wikimedia.org/T186550#3947034 (10BBlack) p:05Triage>03Normal
[18:36:10] ema: yeah couldn't hurt, so long as we keep track of it.
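
For context on the two booleans: judging purely by their names and the priority-inversion discussion above, exp_thread_rt presumably puts the expiry thread on a real-time scheduling policy, and exp_lck_inherit presumably enables priority inheritance on the contended lock. The sketch below shows the standard POSIX calls such a patch would likely wrap; it is a guess at the mechanism, not the actual patch.

    /* A guess at what the two knobs toggle -- standard POSIX primitives only,
     * not the actual patch carried in the Varnish packages. */
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>

    /* exp_thread_rt: run the expiry thread under a real-time scheduling policy,
     * so the hundreds of connection-handling threads can't starve it of CPU. */
    static int
    make_self_realtime(void)
    {
        struct sched_param sp;

        memset(&sp, 0, sizeof sp);
        sp.sched_priority = sched_get_priority_min(SCHED_RR);
        return pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
    }

    /* exp_lck_inherit: create the mailbox mutex with priority inheritance, so a
     * lower-priority worker holding the lock gets boosted instead of blocking
     * the (now real-time) expiry thread -- the usual priority-inversion fix. */
    static int
    init_mbox_lock(pthread_mutex_t *mtx)
    {
        pthread_mutexattr_t attr;
        int r;

        if ((r = pthread_mutexattr_init(&attr)) != 0)
            return r;
        if ((r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)) != 0) {
            pthread_mutexattr_destroy(&attr);
            return r;
        }
        r = pthread_mutex_init(mtx, &attr);
        pthread_mutexattr_destroy(&attr);
        return r;
    }

One practical note: putting a thread on SCHED_RR normally needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, and a busy real-time thread can monopolize a core, which fits the worry above about trading the old meltdown for "an even worse performance meltdown in a new and novel way".
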
[18:45:40] 10Traffic, 10Operations, 10Accessibility, 10Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3947075 (10Volker_E) Just for completion, I haven't heard back from the company for the last two weeks on the email r...
[19:05:54] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3947117 (10BBlack) Meta-update since this is quite stalled out now. I'll try to line up all the explanatory bits here that are affecting proces...
[19:44:20] 10Traffic, 10Operations, 10Accessibility, 10Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3947244 (10Cameron11598) @TheDJ No problem! I think accessibility is important so I tend to grab these tickets when...
[20:45:12] 10Traffic, 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3947432 (10Ottomata) Thanks @bblack, it's at least good to know that we'll need to do the IPSec thing or this will block us for a long while. I...