[07:34:41] 10netops, 10Operations: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) 05Open→03Resolved a:03elukey
[09:47:47] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) @CDanis I had a chat with my team as promised, and we are +1 to add the new field to varnishkafka but -1 to add it all the way to the webreq...
[09:48:04] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10elukey) a:05Ottomata→03elukey
[12:59:10] 10Traffic, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: include the 'Server:' response header in varnishkafka - https://phabricator.wikimedia.org/T224236 (10CDanis) SGTM @elukey, thanks!
[13:19:25] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3049.esams.wmnet'] ` The log can be found i...
[14:07:20] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3049.esams.wmnet'] ` and were **ALL** successful.
[14:53:56] each time I depool a node for the ATS conversion, one or more of the others suffers from mailbox lag at least temporarily, due to the influx from re-chashing or whatever
[14:54:11] cp3039 is looking pretty bad now at ~1M lag
[14:54:57] eh, I was chatting with ema about https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3039&service=Check+Varnish+expiry+mailbox+lag
[14:54:59] will wait a few and see if it recovers with the repool, but if not I might just convert it today too and see how that goes
[14:55:29] https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All
[14:55:42] note the ramp-up there starts right when I took 3049 offline for reimage
[14:56:12] I think 3043 (which was less affected) got saved by a cronjob restart it was already due for
[14:58:33] it's interesting that the fallout isn't what you'd expect from chash re-distribution in general
[14:58:46] I tend to suspect it's really because of rare large-object cache hits
[14:59:51] as in a scenario where 3049 (which I depooled for reimage) had a hittable object with some traffic, which was ultra-huge. And chash then sent that particular rare file to 3039, which got lost in mailbox lag initially due to bringing that object in and doing crazy evictions, and then the problem runs off into the sunset on its own from there.
[15:00:14] but that's just a rough guess from high-level patterns, who knows
[15:02:25] you'd think if it were a broader effect with several large-ish files, they'd land on many different caches and the problem would be more diffuse and maybe not even a problem.
[15:04:19] anyways, I think I'll just do a quickie restart on 3039 for now. I mean it's tempting to just reimage, but it's too soon after 3049, which is still filling its caches and doing lots of misses
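A toy sketch of the consistent-hashing behaviour being reasoned about above, not varnish's actual chash director: the node names, replica count, and key names below are made up. The point it illustrates is that depooling one node only re-routes that node's own share of objects, spread across the remaining nodes, so a single node lagging badly after a depool is more consistent with one rare, very large re-chashed object than with a uniform redistribution effect.

```python
# Toy consistent-hash ring to illustrate the re-chashing effect discussed above.
# Illustration only; varnish's chash director differs in its details, and the
# node names / replica count / keys here are arbitrary.
import bisect
import hashlib

def h(s):
    """Hash a string to an integer position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=100):
    """Place each node at several pseudo-random points on the ring."""
    ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(replicas))
    return [p for p, _ in ring], [n for _, n in ring]

def lookup(points, owners, key):
    """A key is owned by the next node point clockwise from its hash."""
    i = bisect.bisect(points, h(key)) % len(points)
    return owners[i]

nodes = [f"cp30{n}" for n in (34, 39, 43, 49)]
keys = [f"/wikipedia/commons/thumb/{i}.jpg" for i in range(20000)]

points, owners = build_ring(nodes)
before = {k: lookup(points, owners, k) for k in keys}

# "depool" cp3049 and see which keys move to a different node
points, owners = build_ring([n for n in nodes if n != "cp3049"])
after = {k: lookup(points, owners, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)}/{len(keys)} keys moved; all were on cp3049:",
      all(before[k] == "cp3049" for k in moved))
```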
[15:08:10] but if you're varnish-backend-restarting 3039, you're still clearing its caches ;)
[15:09:08] now that I'm going through things, I am wondering: what is the mailbox lag?
[15:10:29] you should read bb.lack's essay in #_sec from two days ago for the full story, but tldr is "an indicator that a varnish backend is possibly unhealthy or handling weird traffic"
[15:14:14] ok ok I will go through the logs
[15:14:19] tx!
[15:16:30] lol
[15:17:03] yeah, still clearing its caches, but with a very fast repool into service, so it minimizes the disturbance to all the chashed traffic
[15:17:10] vs a reimage taking it out of the pool for an hour+
[15:17:48] "mailbox lag" has many layers of understanding
[15:18:31] but to dig into what it really really means:
[15:18:59] Varnish has a single maintenance thread which is responsible for evicting expired cache objects (vs the many parallel threads servicing user requests)
[15:19:23] the many user-facing threads of course also drag in new storage on cache misses, and possibly need to do immediate evictions to make space
[15:20:00] the maintenance thread that does the expiry evictions has to keep a data structure that sorts all the objects by their natural expiry time, so that it can keep itself in a loop of sleeping until the next natural eviction and then handling it.
[15:20:26] but all these user-facing threads are also constantly modifying cache contents with insertions and evictions, so they need to tell the maintenance thread about the changes they make, so it can update the data it's tracking
[15:20:39] so there's an IPC mechanism called the mailbox
[15:21:05] as the user-facing threads do things to cache storage, they write messages into this mailbox for the expiry thread to consume and modify the data structure it uses to direct its own expiry operations
[15:21:21] and the expiry thread consumes messages from the mailbox as part of its complex and lock-infested loop of operations
[15:21:50] "mailbox lag" is our statistic for the backlog of messages in that mailbox, which under normal/sane conditions should remain small because the expiry thread can totally keep up with its job...
[15:22:29] when we have fragmentation-driven eviction problems from the real storage design woes, mailbox lag tends to run up over time and eventually reach bad levels as things go haywire
[15:23:07] but also in general if varnish is just overloaded in any of a number of other ways, you tend to see the same effect (it falls behind on the expiry mailbox because of random lock contention and/or cpu exhaustion or whatever)
[15:24:02] hence it's a general problem indicator too, sort of like watching any other general-problem graph. Saying "omg the mailbox is super-lagged" is not that different from "omg the CPUs are saturated" or whatever. It means something is wrong, but isn't super helpful on its own in telling you what exactly is wrong.
[15:24:28] but in theory, some backlog in the mailbox is not itself problematic, and is intended as part of the design.
[15:26:15] we've tried to set some reasonable limits, which are in the current icinga alerting on mailbox lag
[15:26:47] but there are no good cutoff values really. No matter what number you pick, sometimes there will be problems before mbox lag reaches icinga critical levels, and sometimes icinga can reach critical on mbox lag and nothing's actually really wrong yet.
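A minimal sketch of the mailbox pattern described at [15:18:59]–[15:21:50], not Varnish's actual code: many worker threads mutate a cache and post messages to a single queue, while one expiry thread drains that queue to keep its expiry-ordered heap up to date. "Mailbox lag" is just the queue backlog; if the consumer thread stalls (lock contention, CPU exhaustion), the backlog grows without bound, which is what the icinga check watches for on the real nodes. All names and numbers below are invented for illustration.

```python
# Minimal sketch of the expiry-mailbox pattern described above -- NOT Varnish's
# implementation.  Worker threads insert objects and notify a single expiry
# thread via a "mailbox" queue; the expiry thread keeps an expiry-ordered heap
# and evicts whatever has naturally expired.
import heapq
import queue
import random
import threading
import time

mailbox = queue.Queue()          # the IPC channel: workers -> expiry thread
cache = {}                       # hypothetical object store: key -> expiry time
cache_lock = threading.Lock()

def worker(n_ops=1000):
    """User-facing thread: inserts objects on 'miss' and tells the expiry thread."""
    for _ in range(n_ops):
        key = f"obj-{random.randrange(10000)}"
        expiry = time.time() + random.uniform(1, 30)
        with cache_lock:
            cache[key] = expiry
        mailbox.put(("insert", key, expiry))   # message the expiry thread consumes

def expiry_thread(stop):
    """Single maintenance thread: drains the mailbox and evicts expired objects."""
    heap = []                                  # (expiry, key), ordered by expiry time
    while not stop.is_set():
        try:
            _op, key, expiry = mailbox.get(timeout=0.1)
            heapq.heappush(heap, (expiry, key))
        except queue.Empty:
            pass
        now = time.time()
        while heap and heap[0][0] <= now:      # evict whatever has naturally expired
            _, key = heapq.heappop(heap)
            with cache_lock:
                cache.pop(key, None)

stop = threading.Event()
threading.Thread(target=expiry_thread, args=(stop,), daemon=True).start()
workers = [threading.Thread(target=worker) for _ in range(8)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print("mailbox lag (messages not yet consumed):", mailbox.qsize())
stop.set()
```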
[15:27:46] can an article that is being heavily edited, like the dead celebrities we discussed yesterday
[15:27:51] it just occurred to me -- we *don't* have any alerting for a particular varnish maxing out on its connections-to-a-backend limit, right? perhaps we should
[15:27:53] cause mailbox lag?
[15:29:34] cdanis: yes, that might be wise. It's a little tricky to implement, too, though.
[15:29:50] the limits are per-backend-service, per-cache-node
[15:30:10] so e.g. cp1075 might have a 1K limit for connections to appservers.svc, and a separate 10K limit for connections to restbase.
[15:30:42] you'd have to pull those limits out of the varnish config, and compare them to the per node->service connection data that's coming into... I guess prometheus now?
[15:31:00] but yeah, an alert on any node maxing out its connection pool to any service would be helpful
[15:31:35] I think you would implement it by also exporting into prometheus the service limits, named the same way as they are in the service names coming from varnish's exports
[15:31:36] (and really, we set the limits much higher than the normal parallelism load to allow for small spikes and stuff. Probably even crossing ~50% of the allowed limit should at least warn if not crit)
[15:31:54] then you can have a simple subtraction in the alerting rule
[15:32:51] I'll file a ticket, this is worth doing -- in my limited experience, lots of real problems at the traffic and/or appserver layer correlate with connections-maxed-out very well
[15:34:10] thank you brandon :D
[15:43:29] 10Traffic, 10Operations: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis)
[15:45:35] 10Traffic, 10Operations: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends - https://phabricator.wikimedia.org/T224738 (10CDanis) p:05Triage→03Normal
[18:25:54] 10Domains, 10Traffic, 10Operations, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10reosarevok) We're waiting until we have control over the domain so that we can release our updated website. Is there any particular reason that makes just tran...
[20:06:12] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3034.esams.wmnet'] ` The log can be found i...
[20:36:07] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) >>! In T221507#5219523, @faidon wrote: > - On the device types errors, I can't help but think that we're looking at th...
[20:57:01] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3034.esams.wmnet'] ` and were **ALL** successful.
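A hedged sketch of the check proposed in the [15:29:34]–[15:31:54] discussion and filed as T224738: compare each cache node's current connection count toward an applayer service against its configured per-node, per-service limit. The data below is hard-coded and the thresholds, node names, and service names are taken or extrapolated from the conversation, not from any real exporter output; in practice both series would be exported to Prometheus with matching labels and the comparison done as a simple division or subtraction in the alerting rule itself.

```python
# Sketch of the comparison behind T224738: warn/crit when a cache node's
# connection count to a backend service approaches its configured limit.
# Values and names are illustrative only.
WARN_RATIO = 0.5   # "even crossing ~50% of the allowed limit should at least warn"
CRIT_RATIO = 0.8   # assumed critical threshold for this sketch

# hypothetical varnish-config limits, per (cache node, backend service)
limits = {
    ("cp1075", "appservers.svc"): 1000,
    ("cp1075", "restbase.svc"):   10000,
}

# hypothetical current connection counts from the same node
current = {
    ("cp1075", "appservers.svc"): 620,
    ("cp1075", "restbase.svc"):   900,
}

for (node, service), limit in limits.items():
    conns = current.get((node, service), 0)
    ratio = conns / limit
    if ratio >= CRIT_RATIO:
        print(f"CRITICAL: {node} -> {service}: {conns}/{limit} connections")
    elif ratio >= WARN_RATIO:
        print(f"WARNING: {node} -> {service}: {conns}/{limit} connections")
```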