[18:03:44] 10Traffic, 10Operations, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10colewhite) p:05Triage→03Normal
[18:50:44] bblack: around?
[18:51:49] jynus: it is mostly mailbox lag on cp1085 this time, i am thinking about restarting varnishbe on that server
[18:53:30] so it's really not related to the eventgate deploy? it all started 1 minute after that, and the timeline from the previous incident sounded just like it too?
[18:53:47] there wasn't any deploy
[18:53:51] only staging is being messed with
[18:53:56] mw isn't talking to that at all
[18:54:30] mutante: the eventgate dashboards looked much different in that incident
[18:54:35] and we did not see any issues at the varnish layer
[18:54:43] it does look like these 500s are coming from varnish and not its backends AFAICT
[18:55:02] ok, alright
[18:55:30] mailbox lag at cp1085 was 200k+ and rising, I have depooled it
[18:56:57] There doesn't appear to be a corresponding rise in error logs.
[18:57:14] mediawiki error logs?
[18:57:19] right
[18:57:30] https://logstash.wikimedia.org/goto/f702d59fdb4aad6ece7df10f5ebbdb85
[18:57:34] yeah the issue looks to be in varnish
[18:57:56] very similar to the long-ongoing issue talked about in https://phabricator.wikimedia.org/T145661
[18:57:59] I'm here
[18:58:16] bblack: https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All
[18:58:21] ok
[18:58:23] I've just depooled cp1085
[18:58:31] backends see to hit a hard limit
[18:58:33] *seem
[18:58:39] not a single host
[18:58:44] there's also https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=now-1h&to=now but maybe that's just a symptom?
[18:59:03] right, it can be
[18:59:08] increased 503s, all datacenters
[18:59:26] "mailbox lag" isn't a real problem, it's just a symptom we've grown accustomed to tracking, which sometimes points at a single-varnish problem
[18:59:36] ok
[18:59:51] bblack: it seems to spike every 10 minutes approx
[19:00:07] but given the patterns I can see so far, I'd say this is likely induced by something else
[19:00:28] that is why we called you, when we discarded mw and the easy stuff 0:-)
[19:01:10] I can help dig, but I doubt it's actually a problem in the cache layer
[19:01:21] oh
[19:01:40] where do you think it is?
[19:01:49] I don't know, it could be many things
[19:02:25] Is it concerning that 4xx responses are going up at the rate they are?
[19:03:46] they don't look too far above what they were 3 hours ago shdubsh
[19:04:55] for now I'm just going to restart a couple of obviously-affected varnishes, but I'm not sure yet what's going on
[19:05:02] it may help if we see the symptoms shift around to new hosts
[19:05:09] it is a single instance every few minutes: https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?panelId=3&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All
[19:05:18] what's the usual command for that bblack? just using systemctl to restart varnish.service? (is that one the backend varnish?)
[19:05:24] which is weird, but I quickly discarded puppet
[19:06:00] the pattern is what weirds me out. the spikes seem to correspond with the 500s, and that stair-step pattern...
[19:06:17] cdanis: there's a script that coordinates depool->restart->repool called "varnish-backend-restart", but it exits out if already depooled, so doing its stuff manually
[19:06:23] shdubsh: ignore the stair, that is the granularity
[19:06:23] ah
[19:07:01] ok
[19:07:14] bblack: should I search for more people to help?
[19:08:35] it would probably be productive to start looking for other leads
[19:09:48] so I've restarted the two that are most-recently most-bad at mailbox lag symptoms in eqiad-text
[19:10:16] and seeing some improvement. but... it still looks more like it's influenced by actual external traffic and/or applayer behavior, still digging
[19:10:35] it may just move around from one backend to another
[19:11:14] seems like ~18:26 was the first icinga noticing it?
[19:11:19] can I help somehow?
[19:11:33] bblack: 18:15 was the first spike
[19:11:43] but a smaller one
[19:11:58] then it repeated at 25, 35 etc
[19:12:06] spike of?
[19:12:19] 503
[19:12:23] It looks like grafana goes through the caches and I'm getting 503s
[19:12:50] bblack: https://logstash.wikimedia.org/goto/f702d59fdb4aad6ece7df10f5ebbdb85
[19:15:09] grafana appears back now
[19:16:52] i'm failing to find anything interesting at the appserver layer. plenty of free apache workers, nothing standing out in the mediawiki fatal/errors logs
[19:17:38] it seems to be ok now since 19:09:30?
[19:17:46] well, some of the first few 503s I was looking at were all wdqs queries of the nature:
[19:17:49] "GET","uri_host":"www.wikidata.org","uri_path":"/wiki/Q33810188","uri_query":"?action=constraintsrdf&nocache=1555439375947"
[19:18:16] and there was some mediawiki-config traffic related to this "constraints" thing not long before the problems started, too...
[19:18:31] you think like a huge eviction or something?
[19:19:18] or a self inflicted long running connection?
[19:19:18] no, I don't think so
[19:19:43] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=26&fullscreen&orgId=1&from=now-3h&to=now
[19:20:07] it could be a red herring too, but it seems like our own wdqs was making a ton of those queries above back into our public caches around the time, and again some hints about mw-config movement in the same area...
[19:20:29] cdanis: from the timings on that, it's hard to say if it's a symptom or a cause heh
[19:20:32] yeah indeed
[19:20:38] but it correlates very well
[19:21:43] it's not a very big percentage of the initial 503s though :/
[19:22:17] looking at the also-correlated spikes on https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=now-1h&to=now
[19:22:34] it is not like wdqs is a disproportionate amount of the elevated number of backend connections, either
[19:23:00] yeah, but it could be disproportionately slow in responding, too
[19:23:11] (wikidata I mean, responding to those constraints queries)
[19:23:34] slow responses = backend connection pileup => easily causes all the rest
[19:23:49] bblack: if things are ok now, we won't bother you more
[19:23:57] we can debug later
[19:24:03] I don't know if they are. things are better than they were a few minutes ago anyways
[19:24:08] but the problem may move or recur
[19:24:14] so the WDQS request flow is varnish->WDQS--many requests to gather data?-->varnish-->something?
[19:24:38] cdanis: yeah maybe? I think some of it's not even really externally-driven
[19:24:52] or maybe it's that one wikidata page can make a bunch of WDQS queries?
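(A sketch of the manual sequence bblack describes at 19:06 — the depool -> restart -> repool that varnish-backend-restart normally coordinates. The pool/depool wrapper commands, the unit name varnish.service for the backend instance, and the drain wait are assumptions for illustration, not a transcript of what the script actually runs:)

    sudo depool                              # assumed conftool wrapper: remove the host from the LVS pools
    sleep 30                                 # arbitrary pause to let in-flight requests drain
    sudo systemctl restart varnish.service   # assumed to be the backend instance (frontend assumed to be varnish-frontend.service)
    systemctl is-active varnish.service      # confirm the backend came back before repooling
    sudo pool                                # assumed conftool wrapper: put the host back in the pools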
[19:25:02] but wdqs itself does end up looping back into varnish and then down to mediawiki api stuff
[19:25:13] cdanis: that, hopefully, shouldn't happen
[19:25:19] or wikidata API I guess I should say
[19:25:43] there's no obvious statistical pattern in the initial burst of 503s
[19:26:09] yeah, that is why I thought of cache first, because it was a very natural pattern
[19:26:17] caused connection parallelism from the eqiad caches -> applayer backends to spike up and self-limit, which then causes artificial 503s for all kinds of random queries trying to come through
[19:27:02] it could be just Bad Varnish, but it can also be any of a number of other external or internal causes.
[19:27:12] indeed
[19:27:20] did you repool cp1085 btw bblack? if not i can
[19:27:30] cdanis: yeah I did, as part of the restart
[19:27:40] I was just hoping it had happened before and you knew :-(
[19:28:06] in general, if it does look like a "bad varnish backend" problem (which is usually limited to one at a time, and rare to crop up randomly like this), "run-no-puppet varnish-backend-restart" is the way to go.
[19:28:21] it will depool, restart, repool, which will cure the single-node issue if the root cause is varnish
[19:28:43] I do see some allocstalls on the cp1* hosts in this time interval
[19:28:50] but if the root cause isn't a misbehaving varnish backend, then it's basically a bandaid
[19:29:09] there will be all kinds of correlations, but again it's hard to track causes
[19:29:12] so my worry is that it looked like more than 1 server, or at least that is what I thought
[19:29:12] yeah
[19:29:42] usually the biggest hint about whether it's a misbehaving-varnish-backend sort of thing, is that it will usually only be one cp host
[19:30:03] and this was more than one, right?
[19:30:16] I am trying to learn if I did something wrong
[19:30:16] you can parse through the 5xx.log on weblog1001, and use jq to pull out the .x_cache value, and see that almost all the 503s implicate a single varnish backend as "int"
[19:31:07] e.g.:
[19:31:12] bblack@weblog1001:/srv/log/webrequest$ grep 'T18:2' 5xx.json |grep '"503"'|jq .x_cache
[19:31:13] T18: Plan to migrate code review from Gerrit to Phabricator - https://phabricator.wikimedia.org/T18
[19:31:29] lol ty stashbot
[19:31:30] is my awful way to ask it for the x_cache headers for the 18:2x timeframe's 503s
[19:31:54] and at the start of that output, you see a whole lot that have "cp1079 int".
[19:32:10] which is exactly the kind of pattern we can see if cp1079 were just being a Bad Varnish
[19:32:19] but as you move through time the problem just keeps moving around in those logs
[19:32:28] first it's 1079 for a while, then 1081 for a while, etc, etc....
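(For completeness, a slightly expanded version of that one-liner as a sketch: same 5xx.json field bblack pulls out with jq at 19:31, with a sort | uniq -c aggregation added here for illustration — the grep-on-raw-JSON filtering is as crude as the original:)

    # count 503s per X-Cache string in the 18:2x window, most frequent first
    grep 'T18:2' 5xx.json | grep '"503"' | jq -r .x_cache | sort | uniq -c | sort -rn | head
    # output dominated by a single "cpNNNN int" entry points at one backend;
    # the dominant host shifting over time matches the moving pattern described above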
[19:32:35] exactly what I saw
[19:32:42] I just used logstash instead
[19:32:59] that tends to mean some non-varnish problem is steamrolling through our varnishes one by one and hurting them
[19:33:08] I see
[19:33:42] what typically happens is some problem-traffic is consistently-hashed to 1079, and that traffic has problems (it's abusive in rate from the outside, or the applayer is super-slow responding, or whatever)
[19:34:21] it maxes out cp1079's backend connections, cp1079 starts throwing up all kinds of issues and random 503s, the other caches see it failing healthchecks and auto-depool it for cache<->cache traffic, and the problem traffic just shifts off to the next node it hashes to
[19:34:40] so the pattern just keeps repeating and moving around to different nodes
[19:34:42] which looks pretty much like what we observed
[19:35:04] and in the case today of cp1085 and cp1083, I think beyond that, it eventually induced internal misbehavior and they needed restarts anyways to recover
[19:35:29] (the mailbox lag / storage stuff in our varnish backends is fragile, so abusive patterns can then set it off and make things stay bad on that node)
[19:35:40] so restarting those two helped clear that up
[19:35:45] any ideas on how to find the traffic that is the underlying cause?
[19:35:51] but I don't think the problem started in the realm of Bad Varnish Behaviors.
[19:38:00] so yeah, tracking down the cause is tricky, unless there's a really-obvious pattern of external abuse
[19:38:24] like would it be helpful to, when a varnish host is stuck, grab like... a count of outstanding requests broken down by backend? maybe with time spent waiting by varnish?
[19:38:28] (e.g. millions of queries in some ddos or whatever, that form an obvious pattern in 1/1K sampling or whatever, which I don't see here?)
[19:38:31] does such data even exist?
[19:38:52] there was some such data at one point, but it was experimental and iffy, in grafana
[19:39:08] if it's backend-slow-response induced, I think we do have some stuff going to logstash about that
[19:40:30] I'm trying to dig up some better sources, but I'm slow at some of this now, I don't use it often enough
[19:40:41] fair enough
[19:41:21] no urgency on that ofc, things seem stable now (thankfully)
[19:41:27] varnishhospital is I think the thing that sends the logstashes I was thinking of
[19:41:42] err
[19:41:44] varnishospital
[19:41:56] and another called varnishslowlog
[19:43:51] yeah those are both available under "visualize" in logstash
[19:44:17] nothing looking unusual in varnishslowlog for the time interval in question today
[19:44:48] yeah
[19:45:11] for varnishospital there's... limited overlap at best
[19:45:23] I'm trying to get at how to look at that data now
[19:46:03] I don't understand logstash at all :\
[19:46:49] what really frustrates me in logstash is the super-slow autocompletion popups
[19:46:59] you can get something by going to 'discover' and just putting [varnishospital] in the search box
[19:47:51] but so far i'm just seeing varnishes complaining about other varnishes
[19:48:00] ah logger_name
[19:49:00] logger_name:varnishospital AND layer:backend --> no matches?
[19:50:30] it's logger_name:varnishhospital sigh
[19:55:05] still have not found anything aside from varnishes complaining about varnishes, so I dunno
[19:55:50] there's lots of them that aren't about varnishes, but that's also "normal"
[19:55:58] trying to find any unusual correlating pattern though
[19:56:20] I must be missing something then
[19:56:32] I am going to take a quick break and then start writing what I can of an incident report
[19:58:09] cdanis: what are you identifying "varnish complaining about varnish" from?
[19:58:28] so it's possible I just don't know how to read these events at all
[19:58:46] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.16/logback?id=AWonrQqygtdVElVzEp21&_g=h@97fe121
[19:59:07] if this is *from* be_cp1085, then I can't tell what it is about
[19:59:21] lol now the kibana UI is throwing errors at me
[19:59:39] lol mine too when i reloaded that page jeez
[19:59:51] but yes, generally I think you're looking at "from"
[20:00:11] there's some fields related to the time it took be_cp1085 to fetch, and some fields that can implicate mediawiki, etc
[20:00:33] if I can get the UI back I can say more
[20:01:32] oh sorry, part of our back and forth above is I had moved on to slowlog and you were still talking about hospital heh
[20:01:47] I think hospital is just showing varnish<->varnish healthchecks of each other, yes!
[20:02:28] anyways, in slowlog, down at the bottom of the discover fields list are the interesting bits
[20:02:47] there's one called backendtiming or something like that, which is apache self-reporting how long it took to respond to varnish
[20:02:52] and varnish's conception of the Fetch time
[20:04:41] I can dig through those and find e.g. 120s response delays by www.wikidata.org -> varnish during the time in question, and other such nonsense
[20:05:05] but the tricky thing is there's a fair amount of slow responses in the normal flow, and finding some new pattern that's explanatory
[20:05:32] (I'm not sure if we can query for delay values in a numeric sense, e.g. graph those >= 55s or whatever)
[20:06:07] I've tried clearing session and a fresh browser instance and kibana just stays broken now :/
[20:07:03] mine too!
[20:07:08] that's infuriating
[20:07:44] yeah logstash needs some love
[20:08:02] I vaguely recall SRE was going to take ownership of fixing it up, but I'm lost on the details
[20:08:38] clearing cache and shift-refreshing eventually got there for me
[20:43:31] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson)
[20:44:48] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) @ayounsi lvs1013 and 1014 on-site work has been completed. I did not add the LVS vlan....I will leave that to you. I still need to run the cross-connects bu...
[22:19:01] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi)
[23:34:36] 10netops, 10Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10Dzahn)
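(On the open question at 20:05 about querying slowlog delays numerically: if the timing fields are indexed as numbers rather than strings, a Lucene-style range query in the Kibana search box should do it. The field name below is a guess based on the "backendtiming or something like that" remark, not checked against the real varnishslowlog mapping:)

    logger_name:varnishslowlog AND time_firstbyte:[55 TO *]

If the field is mapped as a string this will not match numerically, and the index mapping would need adjusting first.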