[03:33:20] even under current non-problem conditions, I've already seen the 1799 evidence in varnishlog sporadically, after looking at esams only a few minutes [03:34:32] to be specific, I hit that error log message 13x times between all the esams text fe+be instances inside of ~3 minutes looking [03:34:45] e.g.: [03:34:48] bblack@cp3040:~$ varnishlog -n frontend -g raw -q VCL_Error 39223803 VCL_Error c vcl_hit{} returns miss without busy object. Doing pass. [03:35:54] most likely there's a larger burst of them when problems hit. it seems the most-solid theory remaining at present, anyways. [03:36:45] (the problem with these is that they theoretically should be treated like true misses and coalesce if they're hot, but they fail to coalesce, so many hot requests flow through the floodgates briefly without coalesce protection of what should be a cacheable fetch) [03:37:39] so, I'm gonna go ahead and merge my 2x patches pointed out above, and see if we have a quiet(er?) esams morning [03:54:20] note those went live ~ 03:40-45-ish as the puppetization rolled out [03:54:28] there was a failedfetch burst around 03:30 just before [04:00:52] I've been logging esams text nodes for the VCL_Error above (actually any VCL_Error at all), and nada since the vcl changes. it's already been longer than it took me to see 13x of them before. [04:03:04] * bblack crosses fingers [04:07:52] https://grafana.wikimedia.org/dashboard/db/varnish-caching?panelId=8&fullscreen&orgId=1&from=1521599423804&to=1521605239966&var-cluster=All&var-site=All [04:08:07] ^ interesting that both text and upload true-hitrate also rose in the wake of those pair of VCL changes [04:32:26] still no logged VCL_Error [04:34:21] there's still minor evidence of little fetchfail ramps and mbox lag ramps in esams as the morning progresses, but none seem to have caused major fallout or runaway conns [04:34:31] we'll see how the rest of the morning goes! [04:38:12] 04:36 <+icinga-wm> PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [04:38:16] 04:38 <+icinga-wm> RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [04:38:40] ^ possibly-related, although I fail to fathom whether that's a bad thing or not (maybe now the right things are failing and not taking down everything else with them) [04:43:24] hmm a thin 5xx spike through cp3033 correlated with that RB alert. [04:43:41] (and cp3033 did suffer some runaway connection-count stuff to all the eqiad text boxes during) [04:43:50] still is... [04:44:12] 04:43 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min [04:44:17] 04:43 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min [04:47:03] seems to be recovering on its own fine though [05:11:27] 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067490 (10TerraCodes) [05:15:26] 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067501 (10TerraCodes) After trying some other domain names, it looks like more than just this domain has this error. [05:20:58] 10Wikimedia-Apache-configuration: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067490 (10Jayprakash12345) Same here [08:11:22] bblack: the late expiry thread case? 
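For reference, the VCL_Error above is emitted when vcl_hit ends in an explicit return (miss) for an object that is still findable in the cache; since there is no busy object to fetch into, Varnish downgrades the miss to a pass, which never coalesces. A minimal sketch of the general pattern (illustrative only, not the actual pair of patches referenced above; the placeholder backend is an assumption so the fragment compiles):

```
vcl 4.0;

# Placeholder origin, only so the fragment is loadable standalone.
backend default { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_hit {
    if (obj.ttl >= 0s) {
        # Fresh object: ordinary hit.
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        # Stale but within grace: deliver rather than return (miss).
        # A miss from vcl_hit has no busy object in Varnish 4.1/5.1, so it
        # logs "vcl_hit{} returns miss without busy object. Doing pass."
        # and turns what should be a coalescable cache fill into an
        # uncoalesced pass.
        return (deliver);
    }
    # Fully expired objects fall through to the builtin vcl_hit.
}
```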
Indeed lots of gems there [08:13:31] on the flapping backends side, we can still get useful info with 'logger_name:varnishospital AND transition:"Back healthy"' and avoid the varnishospital spam from old VCLs being discarded [08:14:05] cp3032 seems to be the one in trouble this morning [08:15:20] it began flapping at 08:06 [08:15:51] or even earlier, that's just that default kibana limit [08:16:15] depooling [08:18:55] (cp3033 is flapping now, actually) [08:25:41] depooling cp3033 had the result of allowing the expiry thread to catch up and the flapping to stop [08:25:53] depooling cp3033's backend, that is [08:26:25] catch up might be too strong, let's say lag stopped climbing like crazy [08:26:42] see: [08:26:42] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=13&fullscreen&orgId=1&from=1521612830987&to=1521620791437&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=cp3033&var-layer=backend [08:33:16] bblack: VCL_Error seems like another interesting thing to send to logstash BTW [08:36:15] <_joe_> ema: I think we're having some issues right now, I've seen an alert on text-lb.esams ipv6 [08:39:05] _joe_: what did the alert say? Where have you seen it? [08:39:39] <_joe_> ema: on icinga, it went away after one minute [08:39:43] <_joe_> so before paging [08:39:57] good, yeah, see my depools on #-operations [08:40:00] <_joe_> it was text-lb.esams https ipv6 [08:40:39] <_joe_> ok, I'm currently on another fire, I just saw the alert popping up and wanted to let you know :) [08:41:06] yes, thank you _joe_! [08:43:01] so it seems that when a given -be's expiry thread is heavily under pressure, healthchecks for that -be begin failing (but occasionally) [08:44:11] that has negative fallout for other backends indeed due to the fast set-sick/set-healthy cycles [08:44:27] depooling a given backend altogether seems to help [08:44:50] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521614652467&to=1521621869837 [08:45:24] cp3033 depooled at 08:20 [08:45:40] cp3033 repooled at 08:35 [08:45:52] cp3030 depooled at 08:35 [08:46:17] now cp3033 is again misbehaving [08:55:06] for the record, no output from `varnishlog -n frontend -g raw -i VCL_Error` (on cp3030 at least) [09:06:50] 10Domains, 10Traffic, 10Wikimedia-Apache-configuration, 10Operations: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067737 (10Peachey88) [09:44:38] so this morning the most problematic backends have been 3030, 3032 and 3033 [09:44:51] s/have been/are/ [09:45:41] they all have >24M objects, all other backends <11M [09:47:15] temporarily depooling the misbehaving one does seem to have beneficial effects in the short term avoiding flapping, in the medium term though issues move to the pooled ones [09:48:35] 3032 is currently depooled and 3030/3033 show growing lag/failed fetches [09:52:37] 3030's lag is in a downwards trajectory [10:07:04] when the fire is over I've a quick question for anyone around [10:09:32] volans: what is it? 
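For reference, the lag being watched above is the expiry-thread mailbox backlog; a rough way to read it directly on one node (assuming the stock MAIN counters, which is presumably what the linked Grafana panels are built from):

```
# Backend instance (the frontend would need -n frontend). The backlog is
# objects mailed to the expiry thread but not yet processed by it.
varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received \
  | awk '{v[$1]=$2} END {print "mbox lag:", v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'
```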
[10:10:41] lvs1008/9 have puppet disabled and will disappear completely from Icinga if we don't do a single run [10:10:45] because of the puppetdb migration [10:11:50] I don't think we care very much about them at the moment [10:12:09] ok, just to let you know they will disappear from Icinga ;) [10:12:19] yeah I think we're ok with that [10:14:53] ack, thanks [10:32:20] varnishospital visualization: https://logstash.wikimedia.org/goto/ff5e258640c51ef6203f4c1e4fb25c94 [10:34:27] > the (rare but observed) case where the expiry thread have not yet [10:34:28] evicted an object [10:34:30] gh [10:42:48] oh cool, c12a3e5e2bc978edafaccfe1c586e5d5dab01fcb seems to apply cleanly on top of 5.1+0012-oh-leak.patch [10:46:43] https://gerrit.wikimedia.org/r/420987 [10:51:01] that's crazy [10:51:07] we will end up with 5.1 on steroids [10:53:48] yeah, pretty much all patches that they backport to 4.1.x are potentially important for 5.1.x too [10:53:51] https://github.com/varnishcache/varnish-cache/blob/4.1/doc/changes.rst [10:55:05] did they merge that in 5.2? [10:57:02] I think it's time to give up trying to understand their policy in that regard [10:57:05] weird, there is no mention of #1799 in https://github.com/varnishcache/varnish-cache/blob/5.2/doc/changes.rst [10:57:09] 6.0 did get the fix, 5.2 did not [10:57:13] ack [10:57:42] so yeah.. we should do that ourselves :( [11:05:02] oh, our commit logs get added to the bottom of https://github.com/varnishcache/varnish-cache/issues/1799 [11:05:24] good to know, we can send hidden messages [11:07:41] trying wm7 on cp1008 [11:11:48] yeah but if we're getting no VCL_Error, most likely that patch isn't going to fix our problem [11:11:59] (although it's still helpful to have it!) [11:13:10] bblack: so you did observe some VCL_Errors yesterday, and they vanished after the VCL patches? [11:13:18] right [11:13:59] but in terms of fixing our biggest problems, it would only have helped if those VCL_Errors were bursting during the issue (flooding excess false pass requests and avoiding coalescing would explain our burst of backend conns) [11:14:08] so perhaps they aren't doing so [11:14:39] (but it's still good to clean up another possible biting case) [11:14:50] right [11:15:14] current situation in text@esams: 3030/3033 still occasionally in trouble [11:15:23] so when I ignore other indicators and just look at the top few graphs of: https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now [11:15:34] 3032's be still depooled [11:15:46] it seems that mbox lag consistently leads the failed-fetches [11:15:55] (and the retries) [11:16:47] it also just-barely leads the increase in the be<->be conn graph [11:17:43] (mbox lag ramping up from its flat value, the flat value probably being the new "zero", as apparently v5 leaks away some counting there) [11:18:45] I don't know if the #1799/#2519 mentions about the expiry thread falling behind were also meant to imply a source of the same VCL_Error messages, or not. [11:19:01] (or a separate silent source of a similar problem) [11:20:04] my understanding is the VCL_Error about hit->miss with no busyobject happened in the case where you hit an object in vcl_hit because it was within grace, but then logically decided to reject the grace and return miss. [11:20:58] but keep or expiry, I don't know if those actually come through vcl_hit, and thus may not trigger the same VCL_Error msg?
[11:21:29] you mean if the object is still there because the expiry thread hasn't managed to get rid of it vs still there because of keep [11:21:33] (the expiry case would be that, due to mailbox lag, an object that should be completely-expired and useless is still in storage, gets hit by a request, then rejected for use) [11:22:18] yeah, I'm really not 100% sure whether either of these other two cases (keep-induced or expiry-lag induced) of 1799-like behavior should produce that VCL_Error message when they happen or not. [11:22:39] the vcl_hit->return(miss) case does though, and my vcl patch killed those and we don't seem to have any more of them [11:22:46] right [11:24:20] afaik there's no patch for the expiry-related one anywhere. but our lack of vcl_hit->return(miss) + the 2519 one you just added, should fix cases other than expiry-induced... [11:25:13] (we could also switch to a fixed keep=0s temporarily to avoid the keep one, vs a code patch. but then we wouldn't get the full effect until all objects with keeps had left storage, which takes days or many restarts) [11:27:14] so, another aspect of this that's interesting to me, is that the flood of connections actually starts at the frontend, right? [11:27:41] or did I misread some graph earlier? [11:27:56] (hmmm maybe, I did) [11:28:54] fe<->be connection rate does increase, if that's what you mean [11:28:55] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?from=now-24h&to=now&panelId=5&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend [11:30:10] bblack: ok to upgrade misc@esams to wm7 for testing? [11:31:22] yes [11:31:31] so, yeah, that is what I mean [11:32:19] it seems like the spike in fe->be conns (from various fes to all bes) precedes even the mbox ramp on a to-be-problematic backend (which then precedes its connection spike to other backends + failed fetches) [11:33:32] but there's no mbox lag to speak of there to explain a pass-leak or similar like 1799 [11:34:41] and fe failed-fetches lead be failed-fetches, too [11:34:51] (and retries) [11:35:18] there's no matching spike on the inbound side of the frontend, though [11:35:37] so it's not like it's a sudden influx. it's just that suddenly the fe needs a lot more connections and retries out the back side of itself [11:36:31] it could be that, aside from being exacerbated by the ease with which load->mbox-lag problems happen, the be is acting reasonably and the fes' actions are the problem [11:40:37] things that could make the fe spike traffic to the be in the absence of an inbound spike: [11:41:00] 1) sudden expiry/eviction of lots of hot things? [11:41:45] 2) sudden expiry/eviction of 1-few hot things which take an unreasonable time to fetch and don't coalesce? [11:42:08] 3) failed fetches being retried? [11:42:46] 4) bad behavior of intentional hfp/hfm/miss conditions in VCL (e.g. we hfp something that should've been cacheable, due to some one-shot issue...) [11:44:16] and yeah failed-fetch retries could too, but we graph that. the order of indicators appearing is: fe->be conns ramp up, then fe->be failedfetch, then be's own conns/failfetches start going up. [11:44:59] so I think the problem originates in something happening in the FE, while the be is initially doing ok things until the situation escalates far enough [11:45:27] (but I could be wrong. everything is inter-related!) [11:46:22] so when it hits the FEs, it hits them all equally [11:46:59] perhaps this is all overdetermined even!
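For reference, the temporary keep=0s idea floated at 11:25 would be a one-line change in existing VCL along these lines (a sketch only, not something that was deployed), with the caveat already noted above that objects stored earlier with a nonzero keep would still linger until they age out or storage is restarted:

```
sub vcl_backend_response {
    # Stop retaining expired objects for IMS/INM revalidation, so stale
    # copies can no longer be found by later lookups and feed the
    # 1799-style hit-on-expired path.
    set beresp.keep = 0s;
}
```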
[11:48:25] food, bbiab [11:50:18] what I wouldn't give for a varnishlog entry that recorded a human wallclock time easily for single-line outputs :P [11:54:27] I have a stupid logging experiment going on cp3030-fe [11:55:27] it's just recording BereqURL-only (single-line from rawlog) for all bereqs coming out of the frontend, once per second in a file named for that timestamp. [11:55:42] (probably not all from that second either, more like a shmlog full of sampling, whatever's there) [11:56:18] maybe when another wave of excess fe->be happens, I can see if there's some statistic that stands out there (that certain berequrl's are dominating vs normal) [11:59:48] of course, there may be no more good examples left in today [12:02:00] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4068343 (10Vgutierrez) I just updated the [[ https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching | Prometheus Varnish caching dashboard... [12:05:42] vgutierrez / ema: seems I broke unit testing on travis-ci with my recent bgp attributes move, interestingly not locally though [12:05:47] i just pushed a change which I hope fixes it [12:11:31] ack [12:30:19] ok so I went back to look at logstash, hoping to find interesting slowlogs from during this morning's problem-times [12:30:29] before I even got that far, I saw interesting recent things there [12:31:43] https://logstash.wikimedia.org/goto/85241eb905c8113b071105138f79d163 [12:31:59] ^ this is a 12h view of lots of slowlog entries for one particular URL that keeps cropping up [12:32:15] URL is: [12:32:19] https://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp|content|ids|user&rvdir=newer&rvlimit=5&titles=Template%3ADid+you+know%2FPreparation+area+1 [12:33:10] the response from MW seems to consistently be to stall for 60s, then emit a "200 OK" with no-cacheable headers, with 162 bytes of contents that says: [12:33:15] [12:34:01] my first question is: why is this timeout error a 200 OK? [12:35:17] maybe another is: is it ok that this basically times out and fails persistently? is our intent that certain titles+arguments to that API result in a consistent 60s wait -> fail? [12:36:26] maybe a third is: why is this showing up fairly consistently the past ~15h or so? (the slowlog entries for it) [12:38:40] no cookies, no referer [12:39:55] seems to be all one client driving it [12:40:25] no UA [12:40:49] anyways, moving on [12:54:18] so I've got a mini-burst coming through cp3033 that looks like a tiny non-problematic version of the usual failed-fetches grafana pattern [12:54:30] it lines up with slowlog entries that are spamming for a single URL [12:55:02] t time-apache-delay     21.589461 [12:55:03] t time-fetch     146.737375 [12:55:35] pl.wikipedia.org/wiki/Kasza_gryczana [12:55:44] seems like maybe a burst of logged-in users hitting it... [12:56:10] correction, a burst of a single IP address hitting it over and over with authentication [12:56:54] UA is supposedly FF56, user seems legit, etc [12:58:20] of course now it comes back fine by the time I could manually test the delay to appservers.svc... [13:00:22] do those requests show up particularly during the EU morning? [13:00:33] which? 
[13:00:50] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Pybal should be able to advertise to multiple routers - https://phabricator.wikimedia.org/T180069#4068603 (10mark) 05Open>03Resolved a:03mark This is now in the latest PyBal releases, so resolving this ticket. [13:01:02] those 60s ones timing out with 200 OK at the end [13:01:23] not really. I was looking at one particular case, and it's low-rate and fairly constant for the last ~15h [13:03:05] so the burst from the one logged-in user is interesting [13:04:28] https://logstash.wikimedia.org/goto/0eabe77df61d9e8e0cb234f3a6975d91 [13:04:40] ^ these 74 requests, all for that one plwiki page from one IP [13:05:19] 74 requests in a ~8 second window [13:05:35] they're all for a real logged-in users at that IP with a real UA, apparently [13:05:52] they also all have a referer to Special:Search stuff... [13:06:24] https://pl.wikipedia.org/w/index.php?search=Krupniok+%C5%9Bl%C4%85ski&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns102=1&ns103=1&ns828=1&ns829=1&searchToken= [13:06:39] does the search page trigger UAs to spam reqs like this? [13:06:56] that, +logged-in, can make for quite some mini-bursts of pass-requests like these [13:07:26] maybe in the EU morning is when we have more such things happening concurrently from many logged-in users, I donno [13:08:16] we might not have even noticed if the 74 requests were quick. but they all happened to also stall on a time-fetch in the 2mins+ range with an apache delay around 20s as well [13:08:33] so it's not like the UA was retrying a failure, it actually opened them all in parallel without waiting for the first response [13:08:42] so one of those requests went through cp3033's fe at 12:52:03 [13:09:07] at that point 3033's fe did show some fetch failures [13:09:08] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521633211503&to=1521637635683&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=frontend [13:09:23] I only see them in slowlog for 12:52:18 - 12:52:26 [13:09:38] but yeah, the whole reason I was looking in this range in slowlog was because of the above link you pasted :) [13:09:56] what happened at 12:52:03? [13:10:42] there's one slowlog entry with response-Date: Wed, 21 Mar 2018 12:52:03 GMT [13:10:50] I guess not this same URL [13:11:11] oh duh, ignore me about that [13:11:24] 1-50/74, I wasn't looking at the last 24 entries when considering the range :) [13:12:15] ok so the 74 reqs are over a 23 second window [13:12:24] something like 3 reqs/sec [13:12:31] ok yeah, it's the first entry I got when clicking on https://logstash.wikimedia.org/goto/0eabe77df61d9e8e0cb234f3a6975d91 [13:12:44] different sort maybe [13:13:10] so, maybe we can repro this behavior logged-in and using special:search somehow. it's probably something other people know about heh. 
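For reference, the per-second BereqURL logging experiment described at 11:54-11:56 amounts to something like the loop below (a sketch; as noted above, each dump is whatever happens to be in the shmlog, i.e. sampling rather than a complete capture):

```
# Snapshot the BereqURL records currently in the frontend shmlog once per
# second, into timestamped files, to compare URL frequency across seconds.
while true; do
    varnishlog -n frontend -d -g raw -i BereqURL > "bereqs.$(date +%s)"
    sleep 1
done
```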
[13:13:26] (that there's some JS that fetches previous of the results or something like that) [13:13:32] s/previous/previews/ [13:15:39] so, while trying to keep this morning's fire under control I did depool/pool the 3 misbehaving text_esams backends [13:15:53] 3032 in particular is still depooled [13:15:57] ok [13:16:11] and it does not seem to recover from mbox lag [13:16:12] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&orgId=1&var-server=cp3032&var-datasource=esams%20prometheus%2Fops [13:16:35] in general, I keep swinging back to the basic idea that: if there are lots of truly slow requests at the application layer, none of the rest of this should be surprising. [13:16:43] and I suspect there are, as evidenced by slowlog [13:16:57] and I suspect apache's D= timing claims are unreliable as a witness to it being just a varnish problem [13:18:45] I thought the expiry thread would eventually find a way to evict stuff? It seems that without traffic going through it (hence no need to nuke?) it just doesn't try [13:19:04] you mean the fact that it never goes to zero anymore? [13:19:27] that, and this: [13:19:27] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3032&var-datasource=esams%20prometheus%2Fops&panelId=13&fullscreen [13:19:34] that's new with v5, it used to always reach zero if it recovered under v4 [13:19:55] we don't know why it doesn't reach zero [13:20:17] it could be a benign accounting problem though. maybe the actual queue is empty, but queue in/out stats are wrong and some "out" are not counted [13:20:37] but yeah, it could also be that there's that much sitting in the bottom of the queue not being picked up [13:22:30] I'd repool 3032's be and see what happens, unless you have objections [13:22:46] there were quite a lot of changes to cache_expire.c from 4.1.8->5.1 [13:22:49] go for it [13:24:20] lag immediately climbing up [13:28:57] objects expiration rate went to 0 (?) [13:29:08] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3032&var-datasource=esams%20prometheus%2Fops&from=now-1h&to=now&panelId=19&fullscreen [13:29:36] rate(varnish_main_n_expired{instance=~"$server:.*"}[2m]) [13:32:49] does that mean that before re-pooling it was managing to expire ~20 objects/s ? [13:35:20] it does not look like that, the number of cached objects did not change while the backend was depooled [13:35:23] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3032&var-datasource=esams%20prometheus%2Fops&from=now-3h&to=now&panelId=13&fullscreen [13:37:36] and indeed it began flapping [13:37:37] https://logstash.wikimedia.org/goto/ff5e258640c51ef6203f4c1e4fb25c94 [13:38:38] failed fetches, retries, 503 spike [13:39:05] depooling it again [13:42:11] very interesting [13:43:12] somehow it seems that, even when other conditions relieve pressure (eg: backend being depooled for a long time), there's no recovery attempt from the expiry thread [14:14:47] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4068806 (10kchapman) Thanks @Anomie my information might be old. Moving to TechCom-RFC Inbox for discussion. [14:54:34] ema: you appear to be the creator|unique updater of https://grafana.wikimedia.org/dashboard/db/client-requests?orgId=1 [14:54:39] ema: should we care about it? 
[14:54:54] it looks a little bit abandoned [15:08:33] vgutierrez: burn it [15:08:38] ema: <3 [15:09:17] now only 15 dashboards depend on varnishreqstats on graphite... [15:09:43] a lot of them owned by other teams [15:11:21] https://grafana.wikimedia.org/dashboard/db/experimental-backend-5xx --> created by anonymous and updated by anonymous [15:11:32] but smells like traffic anonymous [15:11:53] and it looks burnable as well [15:12:27] lol @ traffic anonymous [15:13:05] O:) [15:13:11] heavily varnish focused [15:15:58] the graph named "Diff Thingy" makes me think that anonymous in this case could very well be bblack [15:18:15] godog and Krinkle deserve a cold beer for pointing out and developing the search-grafana script [15:18:24] it's making my life way easier [15:19:00] * godog did nothing of note, but will take the cold beer [15:19:40] godog: you made me aware of the existence of the script :P [15:21:09] heheh fair! [15:22:16] so that experimental-backend-5xx was mine [15:22:33] because graphite had the data, and sometimes that data is interesting in outages and such [15:22:37] but I never spent time making it useful [15:23:03] so it's burnable in some sense, but it's also probably a useful thing to try to make a much better version of at some point [15:23:45] *sigh* [15:24:18] basically, if we're getting a bunch of 5xx on cache_text (or whichever), and they're not varnish-internal but coming from the application, we need some data source that's telling us which application backends are throwing the 5xx's [15:25:13] yup.. makes sense [15:25:16] I think without such a graph, the only other way is manually digging logs (oxygen, logstash?) [15:25:31] but then we'd rather have varnish's pov than the app's own logging pov, to [15:25:34] *too [15:25:59] I guess varnish does not probe every app backend. right? [15:26:12] we only define probes for varnish<->varnish [15:26:30] because everything else is abstracted through pybal/lvs service IP and pybal's doing the monitoring/depooling/etc [15:26:48] and varnish considers it a single ultra-reliable backend source IP [15:26:55] right [15:27:44] the varnish<->varnish probes are there mostly to control shard director pooling [15:27:56] bblack: quickly back to cp3032, do you want to examine its state? Otherwise I'll varnish-backend-restart it and repool (it should have been restarted today by cron but was depooled) [15:28:08] go ahead [15:28:57] ema: what I really want to find at this point somehow, is proof-positive of a slow-fetch with a low apache time and high varnishfetch time where apache was the problem and the apache time was unreliable. [15:29:25] because I think that's the real cause here (or similar from non-MW backends) [15:29:41] but we don't have easy data to prove it with [15:31:22] (e.g. apache's D= time says <1s, varnish-fetch time says 124s, but it's because apache immediately emitted headers but took forever to finish emitting data) [15:32:06] (or the same case for RB, but those are pretty easy to find examples of, I just don't know if they rise to problematic levels at the problematic times) [15:33:46] I imagine apache always streams the application response, rather than buffering it all and then responding to varnish [15:34:02] we keep chasing varnish-internal problems and not solving the issue. at the end of the day, if there's enough slow app responses to hot and/or uncacheable items, varnish will melt by design and the problem is the slow responses.
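For reference, one way to chase the "low apache time, high varnish fetch time" cases described above is the backend-side Timestamp records, which split a fetch into time-to-response-headers (Beresp) and time-to-body-complete (BerespBody); a sketch, with the 10-second threshold chosen arbitrarily:

```
# Backend transactions only. Timestamp:Beresp[2] is seconds from the start
# of the fetch until response headers arrived; Timestamp:BerespBody[2] is
# until the body finished. A small Beresp delta with a large BerespBody
# delta points at slow body delivery from the origin rather than a
# varnish-side stall.
varnishlog -b -q 'Timestamp:BerespBody[2] > 10.0' \
    -i BereqURL -i BerespStatus -i Timestamp
```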
[15:34:55] ema: yes, from looking before it seems likely that headers+body are separate, too. it's quite likely that in some cases the headers including a short timing report come first, and the body starts (or finishes) much later. [15:35:36] but then that problem falls into a hole where we can't prove it one way or the other [15:36:06] was varnish's long fetch time because varnish is misbehaving, or because apache was feeding it slowly? there's no reliable metric that differentiates the two cases. [15:37:08] well, arguing the other direction: [15:37:56] we have had past examples in varnishslowlog, where e.g. esams-be had a slow fetch time from eqiad-be, and eqiad-be was a cache hit for that request, which seemed to blame varnish since an application fetch was never involved. [15:38:18] but they're rare, and it may be that they're fallout of varnish-level problems piling up from other causes, I donno [15:39:24] but again... [15:40:03] for both MW and RB, it seems disturbingly-easy to find examples of URLs that take a full 60s to timeout and return some form of error. [15:40:23] it doesn't take many such requests to tie up a lot of resources on the varnish side and drag everything down [15:41:41] hieradata/role/common/cache/text.yaml: first_byte_timeout: '63s' [15:42:29] it makes me want to reduce that number pretty dramatically. but, there are probably API calls that routinely finish in, say, 48 seconds and that some tools somewhere rely on getting useful output from, and turning those all into 503s would probably be frowned on by someone or other. [15:42:57] modules/profile/manifests/cache/text.pp: 'between_bytes_timeout' => '33s' [15:43:23] ^ that one too, which would be what applies to a pause between header output and body (and then between each chunk of body) [15:44:09] so if the application is slow, and apache immediately sends response headers, between_bytes_timeout would kick in [15:44:29] yes [15:44:34] should, anyways [15:44:38] so we can't lower that one either [15:45:09] the headers are from the app too. if headers are sent to varnish, the application sent headers to apache [15:45:23] right [15:45:26] it's about whether the application has significant delays between initial header output and full body output completion [15:45:52] the "200 OK-but-not-really" output I was looking at before was a firstbyte case, though [15:46:09] apache doesn't emit anything for 60s, then the 200 OK + headers + xml error message [15:46:55] (but in that case, apache does correctly self-report a long D= timing) [15:49:00] apache must be able to log the time between initial header and full body reception [15:49:05] right? 
:) [15:49:35] in other words, I guess we need apacheslowlog [15:50:11] * ema summons elukey who must know [15:50:20] full body send to the wire [15:50:28] yes [15:50:46] but we also need a foo-slowlog for all foos then [15:50:53] they're not all behind apache [15:51:24] once you start getting into the specifics of blaming though, there's always a hole somewhere [15:51:50] apache could think it sent it all to "the wire" quickly, but there's a network tuning problem on the host and it sat in the host's tcp buffers and spooled out slowly [15:52:58] (or there's an lvs1003 that delays packets on one side of the connection and harms TCP metrics, shrinking in-flight windows and thus causing the host to correctly spool out that buffer slowly regardless of its own host-level tuning) [15:53:18] ema: http://httpd.apache.org/docs/current/mod/mod_log_config.html#formats contains all the possible usable variables in the logs [15:55:17] so '%{ms}T', assuming that "the time taken to serve the request" means what we want it to mean [15:55:46] :) [15:56:09] it doesn't [15:56:15] (mean what you want it to mean) [15:56:38] %T is the same as %D just different units [15:56:43] %D The time taken to serve the request, in microseconds. [15:57:11] and %D is what's in apache's D= header output. it only records the time from request-reception to headers-emitted-to-wire. [15:57:27] (in theory in an actual log, but when we put it in a header itself, it clearly, logically, can't even mean that) [15:58:21] well maybe, I'm assuming the two %D's are the same here [15:59:43] for once, I'd be happy to find out that someone used the same identifier for two different things [16:00:17] hmmm maybe they did [16:00:37] https://serverfault.com/questions/589768/is-there-a-way-to-log-apaches-request-time-in-milliseconds [16:00:43] some 2008 blog post claims mod_log_config's %D does cover transfer time of body http://www.ducea.com/2008/02/06/apache-logs-how-long-does-it-take-to-serve-a-request/ [16:00:43] > the problem with %D and %T is they tell you how long it took to send data to the client, not how long the page took to build [16:01:49] better: [16:01:52] https://stackoverflow.com/questions/4735032/d-field-in-apache-access-logs-first-or-last-byte [16:02:36] ok [16:03:03] so yeah, build an apache->logstash pipeline that records slow %D's from mod_log_config instead of mod_header? [16:03:13] do we have any kind of apachelog->logstash already? [16:03:42] I have no idea [16:04:20] we do have apache logs in general on mw* hosts, with rotation [16:07:16] if I remember correctly we have it for the error logs, not the access one [16:07:19] *ones [16:12:10] we do have access [16:12:17] other_vhosts_access [16:18:31] ah nice! Didn't know it :) [16:43:22] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4069474 (10ayounsi) [16:44:44] sigh... prometheus+graphite have defeated me [16:45:39] prom: https://grafana-admin.wikimedia.org/dashboard/db/prometheus-varnish-http-requests-wip?panelId=4&fullscreen&orgId=1&from=1521640800000&to=1521642600000 [16:46:17] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4069488 (10ayounsi) 1st part completed, server's specific configuration (loopback IP) is not puppetized yet. From outside: ``` $ ping ping-test.eqiad.wikimedia.org PING ping... 
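For reference, the apacheslowlog idea sketched at 15:49-16:03 would lean on mod_log_config's %D, which (per the links above) is measured through the last byte sent, unlike the D= response header that mod_headers necessarily stamps before the body goes out. A hypothetical access-log format for feeding such a pipeline (names and fields are assumptions, not the production config):

```
# %D is microseconds to the last byte; %B is body bytes sent.
LogFormat "%{%Y-%m-%dT%H:%M:%S}t.%{usec_frac}t %v %>s %D %B \"%r\" \"%{User-Agent}i\"" slowlog_candidate
CustomLog /var/log/apache2/slowlog-candidate.log slowlog_candidate
# Filtering for "slow" (large %D) would have to happen downstream, e.g. in
# the log shipper, since mod_log_config cannot condition on %D itself.
```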
[16:47:16] graphite: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1&from=now-90d&to=now&panelId=4&fullscreen [16:47:29] for some reason I'm getting way more requests in the prometheus version [16:47:36] pretty sure that's a layer8 issue on my side... [16:49:31] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090#4069519 (10ayounsi) [17:19:06] <_joe_> vgutierrez: 60x as many? [17:19:18] <_joe_> vgutierrez: IIRC that's requests per minute [17:23:01] gotta love the BGP RFC [17:23:03] "A BGP implementation MUST connect to and listen on TCP port 179 for [17:23:03] incoming connections in addition to trying to connect to peers." [17:23:09] a couple paragraphs later: [17:23:24] "When a BGP speaker is configured as active, [17:23:24] it may end up on either the active or passive side of the connection [17:23:24] that eventually gets established. " [17:24:48] peer-to-peer protocols are hard [17:25:07] they should've just built BGP on top of blockchain, that would solve all of this [17:25:09] <_joe_> vgutierrez: no, it's actually that in prometheus you're graphing a counter [17:25:15] <_joe_> you need to plot the rate() [17:25:31] <_joe_> bblack: we'd just need a couple nuclear reactors per-dc [17:25:36] <_joe_> simple [17:25:46] <_joe_> to power the mining rigs, I mean [17:26:10] as long as it's green power [17:26:37] it will be, because we'll buy carbon offsets for using the speculative value of our bgp currency [17:27:00] <_joe_> bgpcoin [17:27:37] it can be used to settle transit bills while advertising the transited prefixes at the same time! [17:28:40] <_joe_> bblack: here is the whitepaper: https://i1.wp.com/www.techweez.com/wp-content/uploads/2018/01/PonziCoin.jpg?fit=3614%2C2798 [17:36:22] ema: reverting the 1s probe interval stuff, since it's not causative and slows down other reactions [17:53:01] 10netops, 10Operations: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317#4069851 (10ayounsi) p:05Triage>03Normal [18:05:59] 10netops, 10Operations, 10ops-codfw: Switch port configuration for ms-be204[0-3] - https://phabricator.wikimedia.org/T190322#4069950 (10Papaul) p:05Triage>03Normal [18:16:33] _joe_: well.. the graphite one is not plotting the rate but the sum... [18:17:01] basically the number of GETs, not the rate of GETs per minute [18:17:29] 10netops, 10Operations, 10ops-codfw: Switch port configuration for ms-be204[0-3] - https://phabricator.wikimedia.org/T190322#4069994 (10RobH) 05Open>03Resolved all four systems have had their switch port descriptions updated, enabled, and placed into the private vlan for each row. [18:17:43] <_joe_> vgutierrez: total GETs since varnish startup?
[18:20:16] total GETs since varnishreqstats.mtail has been there, if I understand it correctly [18:20:40] https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/varnishreqstats.mtail [18:21:31] but the graphite equivalent works pretty similarly: https://github.com/wikimedia/puppet/blob/production/modules/varnish/files/varnishreqstats [18:21:38] that's why I'm failing to see the difference [18:27:49] 10netops, 10Operations: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323#4070028 (10ayounsi) [21:04:22] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#4070660 (10Krinkle) [23:22:38] 10netops, 10Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (10ayounsi) [23:24:49] 10netops, 10Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (10EBernhardson) For elasticsearch we are expecting to switch all servers to 10G on their standard server refresh schedule. I'm not sure what exactly that schedule is though. [23:26:53] 10netops, 10Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071259 (10ayounsi)
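For reference, the counter-vs-sum confusion at 18:16-18:21 resolves the way _joe_ suggested at 17:25: the Prometheus side exports a monotonically increasing counter, so the panel needs rate() rather than the raw value. A sketch, with the metric name assumed rather than checked against varnishreqstats.mtail:

```
# Hypothetical metric name -- substitute whatever varnishreqstats.mtail
# actually exports. rate() gives requests/second; *60 makes it comparable
# to the per-minute numbers in the graphite dashboard.
sum(rate(varnish_requests[5m])) * 60
```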