[03:57:28] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791253 (Imarlier) What's the log format of the message field? I'm guessing the first three fields are start_timestamp, total_time, be_time?
[10:58:42] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791910 (Gilles) I've found the explanation [[ https://feryn.eu/blog/varnishlog-measure-varnish-cache-performance/#The_Timestamp_tag | here ]]. What's log...
[11:09:31] mark: hey :)
[11:10:02] it's not fully clear to me why we've got 3 class attributes in BGPFailover, of which one seems to be used as an instance attribute instead?
[11:10:34] peerings, that is
[11:11:33] also we do instantiate BGPFailover in main.py, but the instance is not used
[11:12:17] bgpannouncement = BGPFailover(configdict)
[11:12:37] pybal/main.py:127: local variable 'bgpannouncement' is assigned to but never used
[11:14:31] in ipvs.py instead we use the class methods only
[11:16:56] perhaps there's a reasoning I don't understand behind the choice of defining some methods and attributes as class-bound and others as instance-bound in the BGPFailover case; if so, let's document it :)
[12:04:42] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792142 (Gilles) Slight correction: the fields start at 1, so Timestamp:Resp[2] is correct for total time spent in Varnish. We could log more, though.
[12:20:03] Traffic, Operations, Performance-Team, Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792183 (Gilles) Going back to the context of this task, the request that took more than a minute in the HAR should have been seen by...
[12:30:11] bblack: looking at https://wikitech.wikimedia.org/wiki/LVS_and_Varnish, is "For :443 to the nginxes, it uses a hash of the client IP to help with SSL session persistence" still accurate?
[12:31:13] that doesn't seem to be my experience; in a given pageload I hit all sorts of different varnish frontends
[13:49:18] gilles: yes, that's still accurate
[13:49:28] we use the sh scheduler http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.ipvsadm.html#SH-scheduler
[13:50:58] but what is it connecting consistently to? the rest of the doc states that nginx is running locally on the varnish frontends
[13:51:01] what isn't accurate any longer is the mention of random backend choice on :80; we do use sh for 80 as well, since 28ac9d01b907fd467fe5bb8f0d3ef476283a0aec
[13:52:38] but I'm not seeing any of that from a client perspective. in the same http/2 connection the requests are handled by random frontends
[13:53:06] gilles: do you have an example? Perhaps you're reading X-Cache left-to-right instead of right-to-left?
[13:53:24] https://wikitech.wikimedia.org/wiki/Varnish#X-Cache
[13:53:25] aaaah, gotcha
[13:53:42] never realized it was RTL
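To make the two points above concrete, here is a minimal sketch; the service and realserver addresses, hostnames and hit counts are placeholders, not taken from production configuration:

    # source-hashing (sh) scheduler on the :443 service, roughly what the LVS
    # setup described above amounts to (placeholder addresses):
    ipvsadm -A -t 198.51.100.10:443 -s sh
    ipvsadm -a -t 198.51.100.10:443 -r 10.2.3.11:443 -g

    # X-Cache is read right to left: the rightmost entry (cp3030 here) is the
    # frontend that served the client, the leftmost is the cache closest to the
    # application layer (hostnames and hit counts are made up):
    X-Cache: cp1066 hit/3, cp2019 miss, cp3030 hit/1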
[13:54:21] speaking of that, in your slow log, is it from the edge-most servers' perspective?
[13:55:11] there are two different "programs", in logstash terminology: varnish-fe-slowreqs and varnish-be-slowreqs (fe=frontend, in-memory, and be=backend, on disk)
[13:55:21] fe is always the edge-most servers' perspective
[13:55:26] thanks
[13:55:47] be could be any of the backends used to route traffic through the cdn
[13:56:53] it would be awesome to add some structured data to those log entries, as you were suggesting
[13:57:03] I'm not sure how to do that though
[13:58:42] at the moment we're just sending the 'message' via rsyslog, and the only fields being populated are program, message, timestamp and the like
[14:01:50] yeah, you'd need another adapter to pipe the output from the varnish command to logstash
[14:01:57] with structured fields
[14:04:53] ah cool, so the keyword here is 'adapter'
[14:05:16] I'll unleash my google fury then :)
[14:35:19] gilles: thanks for https://gerrit.wikimedia.org/r/#/c/393751 !
[14:35:31] do we really want Timestamp[3] though?
[14:35:43] that's "Time since last timestamp"
[14:36:02] (I think)
[14:36:15] it's the time spent in the current stage
[14:36:25] VSL:Timestamp:Fetch[3] is the amount of time spent fetching
[14:36:30] etc
[14:37:11] it's described here https://feryn.eu/blog/varnishlog-measure-varnish-cache-performance/#The_Timestamp_tag
[14:39:13] [2] is the sum of the current stage and all previous stages. useful when looking at the last one, which gives you the grand total of time spent in Varnish (that's Resp[2], which you correctly filter on)
[14:39:54] if you want the breakdown you need to look at [3], or I guess you could subtract timestamps and get the same number, but I think the deltas are more legible than a series of timestamps
[14:40:31] you're very right!
[14:40:39] we could output everything but it'd be redundant and lengthy, imho
[14:41:35] let's ship it!
[14:43:11] I've tried the varnishncsa command by hand and it looks good, change merged
[14:43:29] yeah, I have access to varnish boxes, I tested it
[14:43:54] running puppet on cp3030
[14:45:23] varnish-{frontend,backend}-slowreqs.service restarted fine, thanks again!
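For reference, a varnishncsa invocation along the lines of what is being discussed might look like this; the instance name, threshold and exact format string are a sketch, not the contents of the Gerrit change:

    # log requests that spent more than 60s total in Varnish, emitting the
    # cumulative total (Resp[2]) plus the per-stage deltas (Fetch[3],
    # Process[3], Resp[3]); "-n frontend" assumes the in-memory frontend
    # instance is named "frontend"
    varnishncsa -n frontend \
      -q 'Timestamp:Resp[2] > 60.' \
      -F '%t %{VSL:Timestamp:Resp[2]}x %{VSL:Timestamp:Fetch[3]}x %{VSL:Timestamp:Process[3]}x %{VSL:Timestamp:Resp[3]}x "%r"'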
[16:55:44] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793143 (Fjalapeno)
[17:23:54] netops, Operations: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3793312 (Cmjohnson) a: RobH This did not get set up before shipping to Singapore... assigning to @RobH
[18:04:55] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793547 (JoeWalsh) Confirmed that we can remove the `*.m.wikipedia.org` from the config and the OS will stop re...
[18:22:20] Traffic, netops, Operations, ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3793642 (Cmjohnson) p: High>Low
[18:46:40] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793768 (Fjalapeno) @Mholloway spoke with @JoeWalsh about the timing and targeting the first week of January seems...
[18:56:08] hi, I intend to purge all the Ganglia things from prod, so that includes cache servers. I have 3 patches: cache::canary (cp1008) (https://gerrit.wikimedia.org/r/393674), cache::misc (https://gerrit.wikimedia.org/r/#/c/393675/), cache::text/cache::upload (https://gerrit.wikimedia.org/r/#/c/393676/). this is because I went by $cluster in the Ganglia web view and out of an abundance of caution. when I did this for
[18:56:14] appservers it caused some icinga puppet false positives, but that's just when it races. what it actually does is https://gerrit.wikimedia.org/r/#/c/393707/2/modules/ganglia/manifests/monitor/decommission.pp
[19:30:13] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (RobH)
[19:31:22] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (RobH)
[19:33:07] netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3794094 (RobH)
[20:09:01] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3794209 (Mholloway) Will do. Thanks, @JoeWalsh & @Fjalapeno.
[21:42:04] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3650187 (ema) The approach we have in mind can roughly be summed up with `varnishncsa | mtail`. We can group the six scrip...
[21:52:17] gilles: ema: interesting stuff on the slow-query logs. I'm not yet sure what to make of it. The new patch is definitely moving in the right direction...
[21:53:01] I suspect, though, that the high Resp[3] values commonly logged now probably indicate that a bunch of excess time is spent transmitting response bytes to the client.
[21:53:20] which might just be "normal" in our setup in certain cases where the client disappears
[21:53:54] and thus the overall long Resp[2]>60 we're filtering on, if it includes that, is also relatively useless for what we're looking for.
[21:55:30] (well, there are many things that qualify as "what we're looking for" with these logs, but if it's just "hey, the client TCP connection died a slow death and it took forever for various bits of the stack to time out and give up", there's not much to care about there)
[21:56:37] assuming the above, perhaps we want to filter on the deltas for the other steps to catch actual backend slowness
[21:58:19] a filter like "Fetch[3]>60 || Process[3]>60" would isolate just the requests where fetching the backend response or processing it took excessive time
[21:58:36] (which should be much rarer and more important than a long send-time to a client)
[22:01:11] or maybe "Process[2]>60" - looking at the total time accumulated up through the Process stage, but not the Resp part. That's a cleaner "something unrelated to sending a response to the client took longer than 60s"
[22:27:39] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794607 (RobH)
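In vsl-query syntax (where boolean OR is spelled "or"), the tighter filters sketched above would look roughly like the following; these are illustrative only, not a merged change:

    # current filter: total time in Varnish, which also counts a slow send to the client
    varnishncsa -q 'Timestamp:Resp[2] > 60.'

    # only slow backend fetches or slow processing
    varnishncsa -q 'Timestamp:Fetch[3] > 60. or Timestamp:Process[3] > 60.'

    # alternatively: cumulative time up through the Process stage, excluding Resp
    varnishncsa -q 'Timestamp:Process[2] > 60.'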