[07:06:39] <_joe_> hey, I had to restart the varnish backend on cp3030, it was throwing 503s
[09:44:03] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (Lucas_Werkmeister_WMDE) Is this just blocked on the...
[10:30:14] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634895 (akosiaris) There are other comments from yours truly...
[10:42:10] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634952 (daniel) @akosiaris I replied on the patch. Basically...
[11:02:12] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634979 (Dereckson) As there is no argument for 303, we only...
[11:03:36] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3214712 (Dereckson) p:Normal>High [ Setting priority...
[11:12:09] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3634992 (daniel) *sigh* 301 it is, then. I wrote some more on...
[13:06:42] Traffic, Continuous-Integration-Infrastructure, DNS, Operations, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3635184 (hashar) Open>Resolved a:hashar Apparently that was transient or puppet was not willing to c...
[14:17:40] _joe_: ack, thanks, looking at the stats and stuff
[14:21:01] bblack: I've looked into it briefly earlier on, same thing happened yesterday at the same time
[14:21:04] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506313923634&to=1506411882980
[14:21:09] see 06:15ish
[14:22:13] it's not very clear what the issue is though
[14:22:21] looked like FE had more issues than BE
[14:22:35] at least in the case of cp3040, it seems suspicious that it would start having issues exactly at the same time by coincidence. Perhaps some bot?
[14:23:57] I see (on 3030 around the window just before the restart) some increased disk-read activity, and I see some large spikes in the backend conns of the FE instance (~1K)
[14:24:07] <_joe_> bblack, ema: so what I saw, while having my coffee, was a ton of 503s with cp3030 as a backend
[14:24:12] (also some FE fetch fails)
[14:24:13] <_joe_> I decided to just restart it
[14:24:41] was 3030 actually the backend? it could've been the frontend
[14:24:51] (with nothing else to the left, because it failed there)
[14:25:12] the graph plots backend fetch failures
[14:25:31] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844
[14:25:36] are we looking at the same thing ^ ?
[14:26:08] oh I don't think the links preserve the host setting heh
[14:26:19] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844&var-server=cp3030&var-datasource=esams%20prometheus%2Fops\
[14:26:23] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=1506395286713&to=1506419940844&var-server=cp3030&var-datasource=esams%20prometheus%2Fops
[14:26:27] bleh
[14:27:11] we aren't looking at the same thing :) I meant the varnish-failed-fetches plot I've posted above
[14:28:29] the FE fetch failures on cp3030 don't seem very significant (max 0.45)
[14:29:11] it doesn't look like cp3030 threw any 503s actually?
[14:29:24] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&from=1506401768304&to=1506410178786
[14:31:03] maybe a server name got mentally transposed somewhere
[14:35:09] so yeah, 3040 shows a quick-ramp mailbox lag problem around the same time both mornings
[14:35:35] perhaps we're hitting a sudden deluge of expiring objects at the ~24h mark after restart?
[14:35:59] (causing the expiry thread to lock up locks and lag out the runtime threads waiting on related locks, causing fetch fails, etc?)
[14:36:45] ah, interesting theory
[14:37:09] but then again, the whole cause+effect issue is present here. with the mbox lag ramping so quickly, it makes me wonder if it's just an effect of some other general dysfunction
[14:38:23] it does seem that timelines implicate the lag though
[14:38:32] the lag ramp starts a little bit before the fetch failures do
[14:39:19] expiry lockops actually drop off as the lag rises. presumably because the expiry thread is locking up that lock too much, other threads are stalling on it, which leads to a lower rate of overall lockops
[14:40:15] do we still have those pthread lock-related patches?
[14:40:21] (running live, I mean)?
[14:42:25] bblack: code is still patched but I don't think we've got 'cache::exp_thread_rt' set anywhere in hiera
[14:42:54] right
[14:43:24] it would be a random stab to try turning it on for text here. very random. it might make it worse
[14:45:34] that led me over to our patches directory in general, now I'm trying to find all my mental context about the extrachance patch
[14:45:42] we currently set it to zero in the initscript
[14:46:37] with or without the patch, it seems like you only get extra chances on VBC_STATE_STOLEN
[14:47:21] we currently run with zero extra chances, meaning even if it is VBC_STATE_STOLEN, we don't allow any retry
[14:48:42] reminding myself: VBC_STATE_STOLEN means we reused some other existing backend tcp connection from the pool (as opposed to starting a fresh one)
[14:49:18] the reason for retrying on a failure in that state, is that there's an unavoidable chance of a close-race (where the pool connection we picked up is being closed for idle timeout by the peer just as we start sending a request down it)
[14:51:17] if the state (in that loop) is other than VBC_STATE_STOLEN, our patched extrachance param doesn't matter, the value gets explicitly dumped down to zero and there's no retry
[14:51:37] so what we're limiting (to zero, presently) is those close-race retries
[14:52:11] maybe 1 (the patch's default) is better? some of these symptoms could be related, if there's some pattern going on related to idle timeouts of backend conns, etc.
[14:52:35] but surely we thought about all this before, I just don't remember the rationale for setting it all the way down to zero
[14:55:12] I think https://phabricator.wikimedia.org/T150247 might shed some light
[15:00:27] hmmm yeah, thanks
[15:01:00] so I guess our overall logical position here is let the stolen-state close-race happen, and let the frontend outer retry fix it up
[15:02:00] (which kinda reminds me in general about all related things: a low rate of random backend fetch fails doesn't necessarily equate to user-facing 503s, as we do have that frontend retry papering over transient issues)
[15:04:20] perhaps we should be adding some randomization to ttl and keep values in general?
[15:04:44] it still seems likely that in some indirect way, the mbox lag spike on text -> fetch fail is going to be due to a whole lot of object expiries lining up on the same time mark
[15:07:40] that grace-within-ttl patch I was working on before has some related concepts
[15:07:43] https://gerrit.wikimedia.org/r/#/c/364606/1/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[15:07:57] but it was going quite a bit further and trying other things too, and I think it's fully thought-through.
[15:09:13] err I meant to say:
[15:09:19] but it was going quite a bit further and trying other things too, and I think it's *not* fully thought-through.
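[editor's note] For reference, a minimal VCL sketch of the ttl-randomization idea floated at 15:04 above (distinct from the grace-within-ttl patch just linked). The 10% jitter window and the 1h threshold are arbitrary illustrations, and it assumes vmod_std's std.random() plus VCL's ability to scale a DURATION by a REAL; it is not production VCL, only a sketch of the concept.

    import std;

    sub vcl_backend_response {
        # De-cluster expiries: shave a random 0-10% off longer TTLs so that
        # objects cached together don't all expire on the same time mark
        # (e.g. ~24h after a cache restart).
        if (beresp.ttl > 1h) {
            set beresp.ttl = beresp.ttl * std.random(0.90, 1.0);
            # keep (and grace) are offsets past ttl, so they inherit the
            # jitter without needing separate randomization.
        }
    }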
[15:11:01] and really, I guess you don't have to randomize keep separately, since keep is always effectively an offset past the randomized ttl
[15:11:34] true
[15:12:35] and that patch likely doesn't rebase anymore either, since it was contingent on the fixed-keep-value stuff that we later modified and then reverted
[15:13:07] so the backend object expiration rate did spike, but after the fetch failures
[15:13:30] exp rate spike: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=19&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1506398306983&to=1506412854840
[15:13:39] (06:23)
[15:13:52] fetch failures: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=1506398306983&to=1506412854840
[15:13:59] (06:11)
[15:15:11] however expiration rate went down before the fetch failures started (06:08)
[15:15:17] yeah, but you could also argue that when the timing of a ton of objects collides in the expiry list and causes the lockups->fetchfails, you might not see an expiry rate bump (yet) because the mutex lockups are slowing down the actual expiry process
[15:16:13] (the exp rate stat bumps on actually executing an expiry, relying on getting through the mutex contention)
[15:16:22] right
[15:17:55] the rate of backend objects made went up apparently before fetch failures, both today and yesterday
[15:17:59] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops&from=now-2d&to=now&panelId=45&fullscreen
[15:18:32] (the graph shows negative values as I haven't had the time to figure out the proper prometheus query for it, but I think the increase should be legit)
[15:19:48] that might indicate some client pulling in lots of objects into that specific backend at that specific time every morning or something
[15:23:55] yeah maybe
[15:24:09] maybe very short-lived ones too, which then causes a quick expiry spike after?
[15:24:17] e.g. 5m-cacheable API queries or something
[15:24:51] (I know there are some cacheable-but-short-TTL responses from the MW /w/api.php stuff at certain query params)
[15:26:06] yeah that graph looks like some kind of counter-vs-gauge mismatch
[15:33:18] oh: ip_early_demux patch merged today at 09:14, no significant change in resource usage on the LVSs as far as I could tell
[15:36:54] it may be that the utilization is too low to notice the diff heh
[15:37:59] you'd think it would shift slightly one way or the other. either the early_demux operation was helping ipvs (like it does for end hosts) and we see some cpu% increase from disabling the optimization. or it hurts (redundant pointless work, like the router case) and disabling it reduces cpu% a bit.
[16:00:03] Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review, Wikidata-Sprint: Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content - https://phabricator.wikimedia.org/T163922#3635971 (Lucas_Werkmeister_WMDE) The URL https://commons.wiki...
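[editor's note] To illustrate the "frontend outer retry" referred to at 15:02 above: a frontend can restart a request once when a fetch comes back as a 503, so a single transient backend fetch failure doesn't reach the user. This is only a sketch of the general pattern, not the actual production VCL; the GET-only and single-restart conditions are illustrative assumptions.

    sub vcl_deliver {
        # Retry a transient 503 once from the frontend before letting it
        # reach the client; req.restarts guards against retry loops.
        if (resp.status == 503 && req.restarts == 0 && req.method == "GET") {
            return (restart);
        }
    }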
[16:20:25] bblack: http://bit.ly/2xue79C
[16:20:53] that IP there stood out in the 503 graph for cp3040 this morning
[16:22:33] of course there's no indication that the specific IP triggered anything, just that it was accessing the API quite a lot during the timeframe
[16:23:31] Traffic, Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3636060 (BBlack)
[16:23:34] Traffic, Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3636059 (BBlack)
[16:23:37] Traffic, Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3636061 (BBlack)
[16:23:40] Traffic, Operations: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3636056 (BBlack) Open>Resolved a:BBlack Equinix was selected by the process, and we've negotiated and signed and sent in the specific order at this point.
[16:24:41] Traffic, Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#3636068 (BBlack)
[16:24:44] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3636069 (BBlack)
[16:24:47] Traffic, Operations: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3636070 (BBlack)
[16:24:50] Traffic, Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#2962007 (BBlack) `eqsin` is the site name (Vendor: Equinix, Airport: [[ https://en.wikipedia.org/wiki/Singapore_Changi_Airport | SIN ]] )
[17:33:28] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636236 (RobH) @ayounsi fixed the cr3, and we have serials from it: JN1263160AF, ACRH4681, and CAHX6845. Updated task description to remove t...
[17:33:55] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636238 (RobH)
[17:37:40] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636242 (faidon) @mark confirmed that S/N: TA3717090152 and S/N: TA3717090331 are the new QFX5100 that were delivered at esams a few weeks ago (...
[17:45:13] netops, DC-Ops, Operations, ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636251 (ayounsi) == Failover tests == Rate: 1 ping per second |Tested item|ping from bast1001 to tellurium|ping from external to pfw3|ping from rigel to tellurium|Notes| |cr1 – pfw3a...
[18:05:12] netops, DC-Ops, Operations, ops-eqiad, procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636284 (RobH) >>! In T176338#3624714, @Cmjohnson wrote: > @robh I want to confirm that I do have 2 spares still in their original packaging in...
[18:06:37] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636287 (RobH)
[18:07:20] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3621935 (RobH)
[18:09:24] netops, DC-Ops, Operations, ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636318 (ayounsi)
[18:22:04] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3636354 (RobH)
[18:23:06] netops, DC-Ops, Operations, ops-eqiad, procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3636355 (RobH)
[18:23:47] netops, DC-Ops, Operations, ops-eqiad: Move eqiad frack to new infra - https://phabricator.wikimedia.org/T174218#3636356 (ayounsi)
[20:59:56] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3637202 (dr0ptp4kt)
[21:01:54] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2675091 (dr0ptp4kt)
[22:00:42] netops, DC-Ops, Operations, ops-esams: cr2-esams temperature warning - https://phabricator.wikimedia.org/T176816#3637416 (ayounsi)