[18:03:44] 10Traffic, 10Operations, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10colewhite) p:05Triage→03Normal
[18:50:44] bblack: around?
[18:51:49] jynus: it is mostly mailbox lag on cp1085 this time, i am thinking about restarting varnishbe on that server
[18:53:30] so it's really not related to the eventgate deploy? it all started 1 minute after that, and the timeline from the previous incident sounded just like it too?
[18:53:47] there wasn't any deploy
[18:53:51] only staging is being messed with
[18:53:56] mw isn't talking to that at all
[18:54:30] mutante: the eventgate dashboards looked much different in that incident
[18:54:35] and we did not see any issues at the varnish layer
[18:54:43] it does look like these 500s are coming from varnish and not its backends AFAICT
[18:55:02] ok, alright
[18:55:30] mailbox lag at cp1085 was 200k+ and rising, I have depooled it
[18:56:57] There doesn't appear to be a corresponding rise in error logs.
[18:57:14] mediawiki error logs?
[18:57:19] right
[18:57:30] https://logstash.wikimedia.org/goto/f702d59fdb4aad6ece7df10f5ebbdb85
[18:57:34] yeah the issue looks to be in varnish
[18:57:56] very similar to the long-ongoing issue talked about in https://phabricator.wikimedia.org/T145661
[18:57:59] I'm here
[18:58:16] bblack: https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All
[18:58:21] ok
[18:58:23] I've just depooled cp1085
[18:58:31] backends see to hit a hard limit
[18:58:33] *seem
[18:58:39] not a single host
[18:58:44] there's also https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=now-1h&to=now but maybe that's just a symptom?
[18:59:03] right, it can be
[18:59:08] increased 503s, all datacenters
[18:59:26] "mailbox lag" isn't a real problem, it's just a symptom we've grown accustomed to tracking, which sometimes points at a single-varnish problem
[18:59:36] ok
[18:59:51] bblack: it seems to spike every 10 minutes approx
[19:00:07] but given the patterns I can see so far, I'd say this is likely induced by something else
[19:00:28] that is why we called you, when we discarded mw and the easy stuff 0:-)
[19:01:10] I can help dig, but I doubt it's actually a problem in the cache layer
[19:01:21] oh
[19:01:40] where do you think it is?
[19:01:49] I don't know, it could be many things
[19:02:25] Is it concerning that 4xx responses are going up at the rate they are?
[19:03:46] they don't look too far above what they were 3 hours ago shdubsh
[19:04:55] for now I'm just going to restart a couple of obviously-affected varnishes, but I'm not sure yet what's going on
[19:05:02] it may help if we see the symptoms shift around to new hosts
[19:05:09] it is a single instance every few minutes: https://grafana.wikimedia.org/d/000000478/varnish-mailbox-lag?panelId=3&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All
[19:05:18] what's the usual command for that bblack? just using systemctl to restart varnish.service? (is that one the backend varnish?)
[19:05:24] which is weird, but I quickly discarded puppet
[19:06:00] the pattern is what weirds me out. the spikes seem to correspond with the 500s, and that stair-step pattern...
[19:06:17] cdanis: there's a script that coordinates depool->restart->repool called "varnish-backend-restart", but it exits out if already depooled, so doing its stuff manually
[19:06:23] shdubsh: ignore the stair, that is the granularity
[19:06:23] ah
[19:07:01] ok
[19:07:14] bblack: should I search for more people to help?
[19:08:35] it would probably be productive to start looking for other leads
[19:09:48] so I've restarted the two that are most-recently most-bad at mailbox lag symptoms in eqiad-text
[19:10:16] and seeing some improvement. but... it still looks more like it's influenced by actual external traffic and/or applayer behavior, still digging
[19:10:35] it may just move around from one backend to another
[19:11:14] seems like ~18:26 was the first icinga noticing it?
[19:11:19] can I help somehow?
[19:11:33] bblack: 18:15 was the first spike
[19:11:43] but a smaller one
[19:11:58] then it repeated at 25, 35 etc
[19:12:06] spike of?
[19:12:19] 503
[19:12:23] It looks like grafana goes through the caches and I'm getting 503s
[19:12:50] bblack: https://logstash.wikimedia.org/goto/f702d59fdb4aad6ece7df10f5ebbdb85
[19:15:09] grafana appears back now
[19:16:52] i'm failing to find anything interesting at the appserver layer. plenty of free apache workers, nothing standing out in the mediawiki fatal/errors logs
[19:17:38] it seems to be ok now since 19:09:30?
[19:17:46] well, some of the first few 503s I was looking at were all wdqs queries of the nature:
[19:17:49] "GET","uri_host":"www.wikidata.org","uri_path":"/wiki/Q33810188","uri_query":"?action=constraintsrdf&nocache=1555439375947"
[19:18:16] and there was some mediawiki-config traffic related to this "constraints" thing not long before the problems started, too...
[19:18:31] you think like a huge eviction or something?
[19:19:18] or a self inflicted long running connection?
[19:19:18] no, I don't think so
[19:19:43] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=26&fullscreen&orgId=1&from=now-3h&to=now
[19:20:07] it could be a red herring too, but it seems like our own wdqs was making a ton of those queries above back into our public caches around the time, and again some hints about mw-config movement in the same area...
[19:20:29] cdanis: from the timings on that, it's hard to say if it's a symptom or a cause heh
[19:20:32] yeah indeed
[19:20:38] but it correlates very well
[19:21:43] it's not a very big percentage of the initial 503s though :/
[19:22:17] looking at the also-correlated spikes on https://grafana.wikimedia.org/d/000000439/varnish-backend-connections?orgId=1&from=now-1h&to=now
[19:22:34] it is not like wdqs is a disproportionate amount of the elevated number of backend connections, either
[19:23:00] yeah, but it could be disproportionately slow in responding, too
[19:23:11] (wikidata I mean, responding to those constraints queries)
[19:23:34] slow responses = backend connection pileup => easily causes all the rest
[19:23:49] bblack: if things are ok now, we won't bother you more
[19:23:57] we can debug later
[19:24:03] I don't know if they are. things are better than they were a few minutes ago anyways
[19:24:08] but the problem may move or recur
[19:24:14] so the WDQS request flow is varnish->WDQS--many requests to gather data?-->varnish-->something?
[19:24:38] cdanis: yeah maybe? I think some of it's not even really externally-driven
[19:24:52] or maybe it's that one wikidata page can make a bunch of WDQS queries?
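(A sketch of the manual sequence bblack describes at 19:06 — the depool -> restart -> repool that varnish-backend-restart normally coordinates. The pool/depool wrapper commands, the unit name varnish.service for the backend instance, and the drain wait are assumptions for illustration, not a transcript of what the script actually runs:)

    sudo depool                              # assumed conftool wrapper: remove the host from the LVS pools
    sleep 30                                 # arbitrary pause to let in-flight requests drain
    sudo systemctl restart varnish.service   # assumed to be the backend instance (frontend assumed to be varnish-frontend.service)
    systemctl is-active varnish.service      # confirm the backend came back before repooling
    sudo pool                                # assumed conftool wrapper: put the host back in the pools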
[19:25:02] but wdqs itself does end up looping back into varnish and then down to mediawiki api stuff
[19:25:13] cdanis: that, hopefully, shouldn't happen
[19:25:19] or wikidata API I guess I should say
[19:25:43] there's no obvious statistical pattern in the initial burst of 503s
[19:26:09] yeah, that is why I thought of cache first, because it was a very natural pattern
[19:26:17] caused connection parallelism from the eqiad caches -> applayer backends to spike up and self-limit, which then causes artificial 503s for all kinds of random queries trying to come through
[19:27:02] it could be just Bad Varnish, but it can also be any of a number of other external or internal causes.
[19:27:12] indeed
[19:27:20] did you repool cp1085 btw bblack? if not i can
[19:27:30] cdanis: yeah I did, as part of the restart
[19:27:40] I was just hoping it had happened before and you knew :-(
[19:28:06] in general, if it does look like a "bad varnish backend" problem (which is usually limited to one at a time, and rare to crop up randomly like this), "run-no-puppet varnish-backend-restart" is the way to go.
[19:28:21] it will depool, restart, repool, which will cure the single-node issue if the root cause is varnish
[19:28:43] I do see some allocstalls on the cp1* hosts in this time interval
[19:28:50] but if the root cause isn't a misbehaving varnish backend, then it's basically a bandaid
[19:29:09] there will be all kinds of correlations, but again it's hard to track causes
[19:29:12] so my worry is that it looked like more than 1 server, or at least that is what I thought
[19:29:12] yeah
[19:29:42] usually the biggest hint about whether it's a misbehaving-varnish-backend sort of thing, is that it will usually only be one cp host
[19:30:03] and this was more than one, right?
[19:30:16] I am trying to learn if I did something wrong
[19:30:16] you can parse through the 5xx.log on weblog1001, and use jq to pull out the .x_cache value, and see that almost all the 503s implicate a single varnish backend as "int"
[19:31:07] e.g.:
[19:31:12] bblack@weblog1001:/srv/log/webrequest$ grep 'T18:2' 5xx.json |grep '"503"'|jq .x_cache
[19:31:13] T18: Plan to migrate code review from Gerrit to Phabricator - https://phabricator.wikimedia.org/T18
[19:31:29] lol ty stashbot
[19:31:30] is my awful way to ask it for the x_cache headers for the 18:2x timeframe's 503s
[19:31:54] and at the start of that output, you see a whole lot that have "cp1079 int".
[19:32:10] which is exactly the kind of pattern we can see if cp1079 were just being a Bad Varnish
[19:32:19] but as you move through time the problem just keeps moving around in those logs
[19:32:28] first it's 1079 for a while, then 1081 for a while, etc, etc....
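(For completeness, a slightly expanded version of that one-liner as a sketch: same 5xx.json field bblack pulls out with jq at 19:31, with a sort | uniq -c aggregation added here for illustration — the grep-on-raw-JSON filtering is as crude as the original:)

    # count 503s per X-Cache string in the 18:2x window, most frequent first
    grep 'T18:2' 5xx.json | grep '"503"' | jq -r .x_cache | sort | uniq -c | sort -rn | head
    # output dominated by a single "cpNNNN int" entry points at one backend;
    # the dominant host shifting over time matches the moving pattern described above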
[19:32:35] exactly what I saw
[19:32:42] I just used logstash instead
[19:32:59] that tends to mean some non-varnish problem is steamrolling through our varnishes one by one and hurting them
[19:33:08] I see
[19:33:42] what typically happens is some problem-traffic is consistently-hashed to 1079, and that traffic has problems (it's abusive in rate from the outside, or the applayer is super-slow responding, or whatever)
[19:34:21] it maxes out cp1079's backend connections, cp1079 starts throwing up all kinds of issues and random 503s, the other caches see it failing healthchecks and auto-depool it for cache<->cache traffic, and the problem traffic just shifts off to the next node it hashes to
[19:34:40] so the pattern just keeps repeating and moving around to different nodes
[19:34:42] which looks pretty much like what we observed
[19:35:04] and in the case today of cp1085 and cp1083, I think beyond that, it eventually induced internal misbehavior and they needed restarts anyways to recover
[19:35:29] (the mailbox lag / storage stuff in our varnish backends is fragile, so abusive patterns can then set it off and make things stay bad on that node)
[19:35:40] so restarting those two helped clear that up
[19:35:45] any ideas on how to find the traffic that is the underlying cause?
[19:35:51] but I don't think the problem started in the realm of Bad Varnish Behaviors.
[19:38:00] so yeah, tracking down the cause is tricky, unless there's a really-obvious pattern of external abuse
[19:38:24] like would it be helpful to, when a varnish host is stuck, grab like... a count of outstanding requests broken down by backend? maybe with time spent waiting by varnish?
[19:38:28] (e.g. millions of queries in some ddos or whatever, that form an obvious pattern in 1/1K sampling or whatever, which I don't see here?)
[19:38:31] does such data even exist?
[19:38:52] there was some such data at one point, but it was experimental and iffy, in grafana
[19:39:08] if it's backend-slow-response induced, I think we do have some stuff going to logstash about that
[19:40:30] I'm trying to dig up some better sources, but I'm slow at some of this now, I don't use it often enough
[19:40:41] fair enough
[19:41:21] no urgency on that ofc, things seem stable now (thankfully)
[19:41:27] varnishhospital is I think the thing that sends the logstashes I was thinking of
[19:41:42] err
[19:41:44] varnishospital
[19:41:56] and another called varnishslowlog
[19:43:51] yeah those are both available under "visualize" in logstash
[19:44:17] nothing looking unusual in varnishslowlog for the time interval in question today
[19:44:48] yeah
[19:45:11] for varnishospital there's... limited overlap at best
[19:45:23] I'm trying to get at how to look at that data now
[19:46:03] I don't understand logstash at all :\
[19:46:49] what really frustrates me in logstash is the super-slow autocompletion popups
[19:46:59] you can get something by going to 'discover' and just putting [varnishospital] in the search box
[19:47:51] but so far i'm just seeing varnishes complaining about other varnishes
[19:48:00] ah logger_name
[19:49:00] logger_name:varnishospital AND layer:backend --> no matches?
[19:50:30] it's logger_name:varnishhospital sigh
[19:55:05] still have not found anything aside from varnishes complaining about varnishes, so I dunno
[19:55:50] there's lots of them that aren't about varnishes, but that's also "normal"
[19:55:58] trying to find any unusual correlating pattern though
[19:56:20] I must be missing something then
[19:56:32] I am going to take a quick break and then start writing what I can of an incident report
[19:58:09] cdanis: what are you identifying "varnish complaining about varnish" from?
[19:58:28] so it's possible I just don't know how to read these events at all
[19:58:46] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.16/logback?id=AWonrQqygtdVElVzEp21&_g=h@97fe121
[19:59:07] if this is *from* be_cp1085, then I can't tell what it is about
[19:59:21] lol now the kibana UI is throwing errors at me
[19:59:39] lol mine too when i reloaded that page jeez
[19:59:51] but yes, generally I think you're looking at "from"
[20:00:11] there's some fields related to the time it took be_cp1085 to fetch, and some fields that can implicate mediawiki, etc
[20:00:33] if I can get the UI back I can say more
[20:01:32] oh sorry, part of our back and forth above is I had moved on to slowlog and you were still talking about hospital heh
[20:01:47] I think hospital is just showing varnish<->varnish healthchecks of each other, yes!
[20:02:28] anyways, in slowlog, down at the bottom of the discover fields list are the interesting bits
[20:02:47] there's one called backendtiming or something like that, which is apache self-reporting how long it took to respond to varnish
[20:02:52] and varnish's conception of the Fetch time
[20:04:41] I can dig through those and find e.g. 120s response delays by www.wikidata.org -> varnish during the time in question, and other such nonsense
[20:05:05] but the tricky thing is there's a fair amount of slow responses in the normal flow, and finding some new pattern that's explanatory
[20:05:32] (I'm not sure if we can query for delay values in a numeric sense, e.g. graph those >= 55s or whatever)
[20:06:07] I've tried clearing session and a fresh browser instance and kibana just stays broken now :/
[20:07:03] mine too!
[20:07:08] that's infuriating
[20:07:44] yeah logstash needs some love
[20:08:02] I vaguely recall SRE was going to take ownership of fixing it up, but I'm lost on the details
[20:08:38] clearing cache and shift-refreshing eventually got there for me
[20:43:31] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson)
[20:44:48] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) @ayounsi lvs1013 and 1014 on-site work has been completed. I did not add the LVS vlan....I will leave that to you. I still need to run the cross-connects bu...
[22:19:01] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi)
[23:34:36] 10netops, 10Operations: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10Dzahn)
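(On the open question at 20:05 about querying slowlog delays numerically: if the timing fields are indexed as numbers rather than strings, a Lucene-style range query in the Kibana search box should do it. The field name below is a guess based on the "backendtiming or something like that" remark, not checked against the real varnishslowlog mapping:)

    logger_name:varnishslowlog AND time_firstbyte:[55 TO *]

If the field is mapped as a string this will not match numerically, and the index mapping would need adjusting first.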