[07:06:39] <_joe_>	 hey, I had to restart the varnish backend on cp3030, it was throwing 503s
[09:44:03] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (10Lucas_Werkmeister_WMDE) Is this just blocked on the...
[10:30:14] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634895 (10akosiaris) There are other comments from yours truly...
[10:42:10] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634952 (10daniel) @akosiaris I replied on the patch. Basically...
[11:02:12] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634979 (10Dereckson) As there is no argument for 303, we only...
[11:03:36] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (10Dereckson) p:05Normal>03High [ Setting priority...
[11:12:09] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634992 (10daniel) *sigh* 301 it is, then. I wrote some more on...
[13:06:42] <wikibugs_>	 10Traffic, 10Continuous-Integration-Infrastructure, 10DNS, 10Operations, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3635184 (10hashar) 05Open>03Resolved a:03hashar Apparently that was transient or puppet was not willing to c...
[14:17:40] <bblack>	 _joe_: ack, thanks, looking at the stats and stuff
[14:21:01] <ema>	 bblack: I've looked into it briefly earlier on, same thing happened yesterday at the same time 
[14:21:04] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506313923634&to=1506411882980
[14:21:09] <ema>	 see 06:15ish
[14:22:13] <bblack>	 it's not very clear what the issue is though
[14:22:21] <bblack>	 looked like FE had more issues than BE
[14:22:35] <ema>	 at least in the case of cp3040, it seems suspicious that it would start having issues exactly at the same time by coincidence. Perhaps some bot?
[14:23:57] <bblack>	 I see (on 3030 around the window just before the restart) some increased disk-read activity, and I see some large spikes in the backend conns of the FE instance (~1K)
[14:24:07] <_joe_>	 bblack, ema: so what I saw, while having my coffee, was a ton of 503s with cp3030 as a backend
[14:24:12] <bblack>	 (also some FE fetch fails)
[14:24:13] <_joe_>	 I decided to just restart it
[14:24:41] <bblack>	 was 3030 actually the backend? it could've been the frontend
[14:24:51] <bblack>	 (with nothing else to the left, because it failed there)
[14:25:12] <ema>	 the graph plots backend fetch failures
[14:25:31] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844
[14:25:36] <bblack>	 are we looking at the same thing ^ ?
[14:26:08] <bblack>	 oh I don't think the links preserve the host setting heh
[14:26:19] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844&var-server=cp3030&var-datasource=esams%20prometheus%2Fops\
[14:26:23] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844&var-server=cp3030&var-datasource=esams%20prometheus%2Fops
[14:26:27] <bblack>	 bleh
[14:27:11] <ema>	 we aren't looking at the same thing :) I meant the varnish-failed-fetches plot I've posted above
[14:28:29] <ema>	 the FE fetch failures on cp3030 don't seem very significant (max 0.45)
[14:29:11] <ema>	 it doesn't look like cp3030 threw any 503s actually?
[14:29:24] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506401768304&to=1506410178786
[14:31:03] <bblack>	 maybe a server name got mentally transposed somewhere
[14:35:09] <bblack>	 so yeah, 3040 shows a quick-ramp mailbox lag problem around the same time both mornings
[14:35:35] <bblack>	 perhaps we're hitting a sudden deluge of expiring objects at the ~24h mark after restart?
[14:35:59] <bblack>	 (causing expiry thread to lock up locks and lag out the runtime threads waiting on related locks, causing fetch fails, etc?)
[14:36:45] <ema>	 ah, interesting theory
[14:37:09] <bblack>	 but then again, the whole cause+effect issue is present here.  with the mbox lag ramping so quickly, it makes me wonder if it's just an effect of some other general disfunction
[14:38:23] <bblack>	 it does seem that timelines implicate the lag though
[14:38:32] <bblack>	 the lag ramp starts a little bit before the fetch failures do
[14:39:19] <bblack>	 expiry lockops actually drop off as the lag rises.  presumably because the expiry thread is locking up that lock too much, other threads are stalling on it, which leads to a lower rate of overall lockops
[14:40:15] <bblack>	 do we still have those pthread lock-related patches?
[14:40:21] <bblack>	 (running live, I mean)?
[14:42:25] <ema>	 bblack: code is still patched but I don't think we've got 'cache::exp_thread_rt' set anywhere in hiera
[14:42:54] <bblack>	 right
[14:43:24] <bblack>	 it would be a random stab to try turning it on for text here.  very random.  it might make it worse
[14:45:34] <bblack>	 that lead me over to our patches directory in general, now I'm trying to find all my mental context about the extrachance patch
[14:45:42] <bblack>	 we currently set it to zero in the initscript
[14:46:37] <bblack>	 with or without the patch, it seems like you only get extra chances on VBC_STATE_STOLEN
[14:47:21] <bblack>	 we currently run with zero extra chances, meaning even if it is VBC_STATE_STOLEN, we don't allow any retry
[14:48:42] <bblack>	 reminding myself: VBC_STATE_STOLEN means we reused some other existing backend tcp connection from the pool (as opposed to starting a fresh one)
[14:49:18] <bblack>	 the reason for retrying on a failure in that state, is that there's an unavoidable chance of a close-race (where the pool connection we picked up is being closed for idle timeout by the peer just as we start sending a request down it)
[14:51:17] <bblack>	 if the state (in that loop) is other than VBC_STATE_STOLEN, our patched extrachance param doesn't matter, the value gets explicitly dumped down to zero and there's no retry
[14:51:37] <bblack>	 so what we're limiting (to zero, presently) is those close-race retries
[14:52:11] <bblack>	 maybe 1 (the patch's default) is better? some of these symptoms could be related, if there's some pattern going on related to idle timeouts of backend conns, etc..
[14:52:35] <bblack>	 but surely we thought about all this before, I just don't remember the rationale for setting it all the way down to zero
[14:55:12] <ema>	 I think https://phabricator.wikimedia.org/T150247 might shed some light
[15:00:27] <bblack>	 hmmm yeah, thanks
[15:01:00] <bblack>	 so I guess our overall logical position here is let the stolen-state close-race happen, and let the frontend outer retry fix it up
[15:02:00] <bblack>	 (which kinda reminds me in general about all related things: a low rate of random backend fetch fails doesn't necessarily equate to user-facing 503s, as we do have that frontend retry papering over transient issues)
[15:04:20] <bblack>	 perhaps we should be adding some randomization to ttl and keep values in general?
[15:04:44] <bblack>	 it still seems likely that in some indirect way, the mbox lag spike on text -> fetch fail is going to be due to a whole lot of object expiries lining up on the same time mark
[15:07:40] <bblack>	 that grace-within-ttl patch I was working on before has some related concepts
[15:07:43] <bblack>	 https://gerrit.wikimedia.org/r/#/c/364606/1/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[15:07:57] <bblack>	 but it was going quite a bit further and trying other things too, and I think it's fully thought-through.
[15:09:13] <bblack>	 err I meant to say:
[15:09:19] <bblack>	 but it was going quite a bit further and trying other things too, and I think it's *not* fully thought-through.
[15:11:01] <bblack>	 and really, I guess you don't have to randomize keep separately, since keep is always effectively an offset past the randomized ttl
[15:11:34] <ema>	 true
[15:12:35] <bblack>	 and that patch likely doesn't rebase anymore either, since it was contingent on the fixed-keep-value stuff that we later modified and then reverted
[15:13:07] <ema>	 so the backend object expiration rate did spike, but after the fetch failures
[15:13:30] <ema>	 exp rate spike: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=19&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1506398306983&to=1506412854840
[15:13:39] <ema>	 (06:23)
[15:13:52] <ema>	 fetch failures: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1506398306983&to=1506412854840
[15:13:59] <ema>	 (06:11)
[15:15:11] <ema>	 however expiration rate went down before the fetch failures started (06:08)
[15:15:17] <bblack>	 yeah, but you could also argue that when the timing of a ton of objects collide in the expiry list and cause the lockups->fetchfails, you might not see an expiry rate bump (yet) because the mutex lockups are slowing down the actual expiry process
[15:16:13] <bblack>	 (the exp rate stat bumps on actually executing an expiry, relying on getting through the mutex contention)
[15:16:22] <ema>	 right
[15:17:55] <ema>	 the rate of backend objects made went up apparently before fetch failures, both today and yesterday
[15:17:59] <ema>	 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=now-2d&to=now&panelId=45&fullscreen
[15:18:32] <ema>	 (the graph shows negative values as I haven't had the time of figuring out the proper prometheus query for it, but I think the increase should be legit)
[15:19:48] <ema>	 that might indicate some client pulling in lots of objects into that specific backend at that specific time every morning or something
[15:23:55] <bblack>	 yeah maybe
[15:24:09] <bblack>	 maybe very short-lived ones too, which then causes a quick expiry spike after?
[15:24:17] <bblack>	 e.g. 5m-cacheable API queries or something
[15:24:51] <bblack>	 (I know there are some cacheable-but-short-TTL responses from the MW /w/api.php stuff at certain query params)
[15:26:06] <bblack>	 yeah that graph looks like some kind of counter-vs-gauge mismatch
[15:33:18] <ema>	 oh: ip_early_demux patch merged today at 09:14, no significant change in resource usage on the LVSs as far as I could tell
[15:36:54] <bblack>	 it may be that the utilization is too low to notice the diff heh
[15:37:59] <bblack>	 you'd think it would shift slightly one way or the other.  either the early_demux operation was helping ipvs (like it does for end hosts) and we see some cpu% increase from disabling the optimization.  or it hurts (redundant pointless work, like the router case) and disabling it reduces cpu% a bit.
[16:00:03] <wikibugs_>	 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review, 10Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3635971 (10Lucas_Werkmeister_WMDE) The URL https://commons.wiki...
[16:20:25] <ema>	 bblack: http://bit.ly/2xue79C
[16:20:53] <ema>	 that IP there stood out in the 503 graph for cp3040 this morning
[16:22:33] <ema>	 of course there's no indication that the specific IP triggered anything, just that it was accessing the API quite a lot during the timeframe
[16:23:31] <wikibugs_>	 10Traffic, 10Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3636060 (10BBlack)
[16:23:34] <wikibugs_>	 10Traffic, 10Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3636059 (10BBlack)
[16:23:37] <wikibugs_>	 10Traffic, 10Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3636061 (10BBlack)
[16:23:40] <wikibugs_>	 10Traffic, 10Operations: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3636056 (10BBlack) 05Open>03Resolved a:03BBlack Equinix was selected by the process, and we've negotiated and signed and sent in the specific order at this point.
[16:24:41] <wikibugs_>	 10Traffic, 10Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3636068 (10BBlack)
[16:24:44] <wikibugs_>	 10Traffic, 10Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3636069 (10BBlack)
[16:24:47] <wikibugs_>	 10Traffic, 10Operations: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3636070 (10BBlack)
[16:24:50] <wikibugs_>	 10Traffic, 10Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#2962007 (10BBlack) 05Open>03Resolved a:03BBlack `eqsin` is the site name (Vendor: Equinix, Airport: [[ https://en.wikipedia.org/wiki/Singapore_Changi_Airport | SIN ]] )
[17:33:28] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636236 (10RobH) @ayounsi fixed the cr3, and we have a serials from it: JN1263160AF, ACRH4681, and CAHX6845.  Updated task description to remove t...
[17:33:55] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636238 (10RobH)
[17:37:40] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636242 (10faidon) @mark confirmed that S/N: TA3717090152 and S/N: TA3717090331 are the new QFX5100 that were delivered at esams a few weeks ago (...
[17:45:13] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636251 (10ayounsi) == Failover tests==  Rate: 1 ping per second |Tested item|ping from bast1001 to tellurium|ping from external to pfw3|ping from rigel to tellurium|Notes| |cr1 – pfw3a...
[18:05:12] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-eqiad, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636284 (10RobH) >>! In T176338#3624714, @Cmjohnson wrote: > @robh I want to confirm that I do have 2 spares  still in their original packaging in...
[18:06:37] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636287 (10RobH)
[18:07:20] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3621935 (10RobH)
[18:09:24] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636318 (10ayounsi)
[18:22:04] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams, 10procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636354 (10RobH)
[18:23:06] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-eqiad, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636355 (10RobH)
[18:23:47] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636356 (10ayounsi)
[20:59:56] <wikibugs_>	 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3637202 (10dr0ptp4kt)
[21:01:54] <wikibugs_>	 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2675091 (10dr0ptp4kt)
[22:00:42] <wikibugs_>	 10netops, 10DC-Ops, 10Operations, 10ops-esams: cr2-esams temperature warning - https://phabricator.wikimedia.org/T176816#3637416 (10ayounsi)