[03:57:28] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791253 (Imarlier) What's the log format of the message field? I'm guessing the first three fields are start_timestamp, total_time, be_time?
[10:58:42] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3791910 (Gilles) I've found the explanation [[ https://feryn.eu/blog/varnishlog-measure-varnish-cache-performance/#The_Timestamp_tag | here ]]. What's log...
[11:09:31] mark: hey :)
[11:10:02] it's not fully clear to me why we've got 3 class attributes in BGPFailover, of which one seems to be used as an instance attribute instead?
[11:10:34] peerings, that is
[11:11:33] also we do instantiate BGPFailover in main.py, but the instance is not used
[11:12:17] bgpannouncement = BGPFailover(configdict)
[11:12:37] pybal/main.py:127: local variable 'bgpannouncement' is assigned to but never used
[11:14:31] in ipvs.py instead we use the class methods only
[11:16:56] perhaps there's a reasoning I don't understand behind the choice of defining some methods and attributes as class-bound and others as instance-bound in the BGPFailover case; if so, let's document it :)
[12:04:42] Traffic, Operations, Performance-Team: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792142 (Gilles) Slight correction: the fields start at 1, so Timestamp:Resp[2] is correct for total time spent in Varnish. We could log more, though.
[12:20:03] Traffic, Operations, Performance-Team, Patch-For-Review: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3792183 (Gilles) Going back to the context of this task, the request that took more than a minute in the HAR should have been seen by...
[12:30:11] bblack: looking at https://wikitech.wikimedia.org/wiki/LVS_and_Varnish, is "For :443 to the nginxes, it uses a hash of the client IP to help with SSL session persistence" still accurate?
[12:31:13] that doesn't seem to be my experience; in a given pageload I hit all sorts of different varnish frontends
[13:49:18] gilles: yes, that's still accurate
[13:49:28] we use the sh scheduler http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.ipvsadm.html#SH-scheduler
[13:50:58] but what is it connecting consistently to? the rest of the doc states that nginx is running locally on the varnish frontends
[13:51:01] what isn't accurate any longer is the mention of random backend choice on :80; we do use sh for 80 as well, since 28ac9d01b907fd467fe5bb8f0d3ef476283a0aec
[13:52:38] but I'm not seeing any of that from a client perspective. in the same http/2 connection the requests are handled by random frontends
[13:53:06] gilles: do you have an example? Perhaps you're reading X-Cache left-to-right instead of right-to-left?
[13:53:24] https://wikitech.wikimedia.org/wiki/Varnish#X-Cache
[13:53:25] aaaah, gotcha
[13:53:42] never realized it was RTL
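To make the two points above concrete, here is a minimal sketch; the service and realserver addresses, hostnames and hit counts are placeholders, not taken from production configuration:

    # source-hashing (sh) scheduler on the :443 service, roughly what the LVS
    # setup described above amounts to (placeholder addresses):
    ipvsadm -A -t 198.51.100.10:443 -s sh
    ipvsadm -a -t 198.51.100.10:443 -r 10.2.3.11:443 -g

    # X-Cache is read right to left: the rightmost entry (cp3030 here) is the
    # frontend that served the client, the leftmost is the cache closest to the
    # application layer (hostnames and hit counts are made up):
    X-Cache: cp1066 hit/3, cp2019 miss, cp3030 hit/1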
[13:54:21] speaking of that, in your slow log, is it from the edge-most servers' perspective?
[13:55:11] there are two different "programs", in logstash terminology: varnish-fe-slowreqs and varnish-be-slowreqs (fe=frontend, in-memory, and be=backend, on disk)
[13:55:21] fe is always the edge-most servers' perspective
[13:55:26] thanks
[13:55:47] be could be any of the backends used to route traffic through the cdn
[13:56:53] it would be awesome to add some structured data to those log entries, as you were suggesting
[13:57:03] I'm not sure how to do that though
[13:58:42] at the moment we're just sending the 'message' via rsyslog, and the only fields being populated are program, message, timestamp and the like
[14:01:50] yeah, you'd need another adapter to pipe the output from the varnish command to logstash
[14:01:57] with structured fields
[14:04:53] ah cool, so the keyword here is 'adapter'
[14:05:16] I'll unleash my google fury then :)
[14:35:19] gilles: thanks for https://gerrit.wikimedia.org/r/#/c/393751 !
[14:35:31] do we really want Timestamp[3] though?
[14:35:43] that's "Time since last timestamp"
[14:36:02] (I think)
[14:36:15] it's the time spent in the current stage
[14:36:25] VSL:Timestamp:Fetch[3] is the amount of time spent fetching
[14:36:30] etc
[14:37:11] it's described here https://feryn.eu/blog/varnishlog-measure-varnish-cache-performance/#The_Timestamp_tag
[14:39:13] [2] is the sum of the current stage and all previous stages. useful when looking at the last one, which gives you the grand total of time spent in Varnish (that's Resp[2], which you correctly filter on)
[14:39:54] if you want the breakdown you need to look at [3], or I guess you could subtract timestamps and get the same number, but I think the deltas are more legible than a series of timestamps
[14:40:31] you're very right!
[14:40:39] we could output everything but it'd be redundant and lengthy, imho
[14:41:35] let's ship it!
[14:43:11] I've tried the varnishncsa command by hand and it looks good, change merged
[14:43:29] yeah, I have access to varnish boxes, I tested it
[14:43:54] running puppet on cp3030
[14:45:23] varnish-{frontend,backend}-slowreqs.service restarted fine, thanks again!
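For reference, a varnishncsa invocation along the lines of what is being discussed might look like this; the instance name, threshold and exact format string are a sketch, not the contents of the Gerrit change:

    # log requests that spent more than 60s total in Varnish, emitting the
    # cumulative total (Resp[2]) plus the per-stage deltas (Fetch[3],
    # Process[3], Resp[3]); "-n frontend" assumes the in-memory frontend
    # instance is named "frontend"
    varnishncsa -n frontend \
      -q 'Timestamp:Resp[2] > 60.' \
      -F '%t %{VSL:Timestamp:Resp[2]}x %{VSL:Timestamp:Fetch[3]}x %{VSL:Timestamp:Process[3]}x %{VSL:Timestamp:Resp[3]}x "%r"'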
[16:55:44] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793143 (Fjalapeno)
[17:23:54] netops, Operations: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3793312 (Cmjohnson) a: RobH This did not get set up before shipping to Singapore... assigning to @RobH
[18:04:55] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793547 (JoeWalsh) Confirmed that we can remove the `*.m.wikipedia.org` from the config and the OS will stop re...
[18:22:20] Traffic, netops, Operations, ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3793642 (Cmjohnson) p: High>Low
[18:46:40] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3793768 (Fjalapeno) @Mholloway spoke with @JoeWalsh about the timing and targeting the first week of January seems...
[18:56:08] hi, I intend to purge all the Ganglia things from prod, so that includes cache servers. I have 3 patches: cache::canary (cp1008) (https://gerrit.wikimedia.org/r/393674), cache::misc (https://gerrit.wikimedia.org/r/#/c/393675/), cache::text/cache::upload (https://gerrit.wikimedia.org/r/#/c/393676/). this is because I went by $cluster in the Ganglia web view and out of an abundance of caution. when I did this for
[18:56:14] appservers it caused some icinga puppet false positives, but that's just when it races. what it actually does is https://gerrit.wikimedia.org/r/#/c/393707/2/modules/ganglia/manifests/monitor/decommission.pp
[19:30:13] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (RobH)
[19:31:22] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (RobH)
[19:33:07] netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#3794094 (RobH)
[20:09:01] Traffic, Wikimedia-Apache-configuration, Discovery, Operations, and 4 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3794209 (Mholloway) Will do. Thanks, @JoeWalsh & @Fjalapeno.
[21:42:04] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3650187 (ema) The approach we have in mind can roughly be summed up with `varnishncsa | mtail`. We can group the six scrip...
[21:52:17] gilles: ema: interesting stuff on the slow-query logs. I'm not yet sure what to make of it. The new patch is definitely moving in the right direction...
[21:53:01] I suspect, though, that the high Resp[3] values commonly logged now probably indicate that a bunch of excess time is spent transmitting response bytes to the client.
[21:53:20] which might just be "normal" in our setup in certain cases where the client disappears
[21:53:54] and thus the overall long Resp[2]>60 we're filtering on, if it includes that, is also relatively useless for what we're looking for.
[21:55:30] (well, there are many things that qualify as "what we're looking for" with these logs, but if it's just "hey, the client TCP connection died a slow death and it took forever for various bits of the stack to time out and give up", there's not much to care about there)
[21:56:37] assuming the above, perhaps we want to filter on the deltas for the other steps to catch actual backend slowness
[21:58:19] a filter like "Fetch[3]>60 || Process[3]>60" would isolate just the requests where fetching the backend response or processing it took excessive time
[21:58:36] (which should be much rarer and more important than a long send-time to a client)
[22:01:11] or maybe "Process[2]>60" - looking at the total time accumulated up through the Process stage, but not the Resp part. That's a cleaner "something unrelated to sending a response to the client took longer than 60s"
[22:27:39] netops, Operations, ops-eqsin: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794607 (RobH)
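In vsl-query syntax (where boolean OR is spelled "or"), the tighter filters sketched above would look roughly like the following; these are illustrative only, not a merged change:

    # current filter: total time in Varnish, which also counts a slow send to the client
    varnishncsa -q 'Timestamp:Resp[2] > 60.'

    # only slow backend fetches or slow processing
    varnishncsa -q 'Timestamp:Fetch[3] > 60. or Timestamp:Process[3] > 60.'

    # alternatively: cumulative time up through the Process stage, excluding Resp
    varnishncsa -q 'Timestamp:Process[2] > 60.'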