[01:17:37] 07HTTPS, 10Traffic, 06Operations: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2266282 (10BBlack) In the settings, we can see that http://status.wikimedia.org/ is also available at http://status.asm.ca.com/8777 . There don't appear to be any TLS-related settings :( W... [01:52:21] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#2266306 (10Danny_B) Neither "Requirements" nor "Things we can't do" mentions JavaScript... Any statement about it? [03:09:15] hi [03:09:56] 03:09 < chasemp> Does restbase never return failure to the user and move on or is tgis some automated client repeating the request? [03:10:03] I don't think it's being spammed at us from the outside [03:10:20] can we get gwicke on .. i don't know much about restbase and how to stop it from hitting parsoid with those requests. [03:10:28] let me call him. [03:10:32] thanks! [03:11:30] emailing outage-notification with something for now [03:13:10] from wtp1023 .. a snippet of the parsoid.log https://gist.githubusercontent.com/subbuss/4e68ed795ef7b0079a70cfd51e9ed7e0/raw/09b584a72473fba2fa1d3aabc73ddae74a146488/gistfile1.txt [03:13:25] looks like parsoid cluster is getting dos-ed [03:13:28] Can we write straight to the DB and null-out that article? [03:13:34] so, yes, changeprop [03:14:20] right .. so, it is some bad thing going on with changeprop. [03:14:35] repeating some: [03:14:37] 03:02 < bblack> best guess so far: [03:14:37] 03:03 < bblack> someone ask restbase about an old frwiki article from 2014, it's not in cache, it asked parsoid to parse it. parsoid is having trouble with it (hangs on trying to parse it?), and so RB eventually times out and asks again or something, and is spamming that through all the parsoid servers killing them [03:14:42] 03:04 < ori> https://fr.wikipedia.org/api/rest_v1/page/html/%EA%9E%80/106801025?redirect=false is the restbase request for that revision [03:14:45] 03:05 < bblack> and some wtp are recovering intermittently, probably because when they go crazy LVS/pybal depools them and they get to eventually recover and re-enter the pool [03:14:49] 03:05 < andrewbogott> why would that throw puppet alerts? [03:14:51] 03:13 < subbu> from wtp1023 .. a snippet of the parsoid.log https://gist.githubusercontent.com/subbuss/4e68ed795ef7b0079a70cfd51e9ed7e0/raw/09b584a72473fba2fa1d3aabc73ddae74a146488/gistfile1.txt [03:14:55] 03:13 < subbu> looks like parsoid cluster is getting dos-ed [03:14:58] 03:13 < andrewbogott> Can we write straight to the DB and null-out that article? 
[03:15:03] gwicke, ^ [03:15:19] I'm seeing cassandra errors in logstash [03:15:37] since I don't have rights to do anything myself, I'll try to page urandom [03:15:47] heh [03:15:53] hang on [03:16:12] on wtp1*, in /srv/deployment/parsoid/deploy/src/lib/api/routes.js [03:16:22] 377 var v1Wt2html = function(req, res, wt) { [03:16:22] 378 var env = res.locals.env; [03:16:22] 379 var p = apiUtils.startWt2html(req, res, wt).then(function(ret) { [03:16:23] gwicke: you can also dictate commands here if you have clear steps in mind [03:16:24] 380 if ( ret.oldid === 106801025 ) { return false; } [03:16:28] that last line i just inserted [03:16:41] and restarted parsoid on wtp* [03:17:15] I see some promising recoveries maybe-starting [03:17:16] that's subtler than wiping the db record, but not much subtler :) [03:18:55] I wish the NRPE checks wouldn't even mention the check they were trying to execute when the host is flailing [03:18:58] it's confusing [03:19:02] ori .. looks like another page now .. https://gist.githubusercontent.com/subbuss/b2a83cd89d81ec2cd567621ba0019c79/raw/4e76701a4b85d419339ecbd44f62cecb7314be09/gistfile1.txt [03:19:27] "userAgent":"ChangePropagation/WMF" [03:19:35] can we stop changeprop? isn't that new? [03:20:02] maybe changeprop is what's spamming wtp, and indirectly causing RB to fall over because wtp is abused? [03:20:28] i don't know anything about changeprop .. maybe deployed today? [03:20:36] you could stop it, it's probably service 'changeprop' on sca1001 and sca1002 [03:20:45] i tried it [03:20:46] sudo salt 'scb*' cmd.run 'puppet agent --disable ; service changeprop stop' [03:21:15] but? [03:21:38] i mean, i just tried it, was skimming the changeprop puppetization to figure out how [03:21:53] works for me if it works! [03:22:14] judging from https://grafana.wikimedia.org/dashboard/db/eventbus, there isn't much traffic from changeprop any more [03:22:22] wtps are happier, restbase hosts are not [03:23:19] didn't changeprop cause issues in beta earlier on? [03:23:41] tailing on wtp1023 and wtp1018 ... i am still seeing changeprop UA requests. [03:23:45] it may be that RB will take a while to recover from the extended parsoid crazy [03:24:34] so parsoid was taken out by repeated requests for a specific title? [03:25:23] gwicke, https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Parsoid+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [03:25:32] not just 1 title as far as i can tell. [03:26:00] subbu: fresh requests ,or requests that could have been in flight when i stopped the service? [03:26:18] i can try another parsoid restart, to kill any in-flight changeprop requests [03:26:31] might be worth a shot [03:26:44] doing [03:26:47] https://gist.githubusercontent.com/subbuss/b2a83cd89d81ec2cd567621ba0019c79/raw/05d49e61856cb5a4aae3af448a9556dc93be399d/gistfile1.txt [03:26:55] sure. [03:27:15] there was a spike of internal requests just before the slow-down [03:27:31] to RB or to parsoid? [03:27:32] looks like load is beginning to drop .. [03:27:36] which should exclude changeprop requests [03:27:42] to RB [03:28:05] this resulted in a slight increase in requests to parsoid [03:28:17] maybe we have created some kind of loop? [03:28:21] it's user-facing, btw: Is the wikipedia server hiccupping? Taking forever to save/load VisualEditor :p [03:28:32] where a req to RB can generate a req to parsoid that generates a changeprop event that generates... [03:28:36] nope ... still coming in ... 
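(For reference, a minimal sketch of the emergency hot-patch described at 03:16 above, reconstructed from the routes.js fragment pasted there. Only the quoted lines 377-380 are known; the surrounding function body and the final return are assumptions, not the real production file:)

    // Sketch only: lib/api/routes.js as hot-patched on the wtp* hosts.
    // The guard on oldid 106801025 short-circuits the parse so workers stop
    // burning minutes re-parsing that one revision. Everything past the guard
    // is assumed context, not the actual Parsoid code.
    var v1Wt2html = function(req, res, wt) {
        var env = res.locals.env;
        var p = apiUtils.startWt2html(req, res, wt).then(function(ret) {
            if (ret.oldid === 106801025) { return false; }  // inserted guard
            // ... normal wt2html handling continues here in the real file ...
        });
        return p;
    };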
[03:28:56] I really have no idea how changeprop is plumbed to other things [03:28:58] {"name":"parsoid","hostname":"wtp1018","pid":29169,"level":30,"logType":"info","wiki":"kmwiki","title":"ភពពុធ","oldId":181275,"reqId":"8fdf83c1-1207-11e6-88ec-53a99069692c","userAgent":"ChangePropagation/WMF","msg":"completed parsing in 3515 ms","longMsg":"completed parsing in 3515 ms","time":"2016-05-05T03:28:48.585Z","v":0} [03:28:58] {"name":"parsoid","hostname":"wtp1018","pid":29437,"level":30,"logType":"info","wiki":"kmwiki","title":"ភពពុធ","oldId":181275,"reqId":"8e1460b1-1207-11e6-90e5-0e0553f59eca","userAgent":"ChangePropagation/WMF","msg":"started parsing","longMsg":"started parsing","time":"2016-05-05T03:28:48.606Z","v":0} [03:28:58] {"name":"parsoid","hostname":"wtp1018","pid":29169,"level":30,"logType":"info","wiki":"frwiki","title":"Ꞁ","oldId":106801025,"reqId":"8fdf83c1-1207-11e6-88ec-53a99069692c","userAgent":"ChangePropagation/WMF","msg":"started parsing","longMsg":"started parsing","time":"2016-05-05T03:28:48.941Z","v":0} [03:29:14] fg [03:29:52] why do we keep seeing unicode titles in the problem logs? [03:29:54] but, load has dropped to 1200% from 2000% or so .. [03:29:56] https://grafana.wikimedia.org/dashboard/db/cassandra-restbase-eqiad is not looking good [03:30:13] I called urandom, he should be on his way [03:30:27] here, trying to get caught up... [03:30:33] I mean, is it my imagination, or are the spam things in parsoid logs never plain ascii titles? [03:31:13] i think it has quietened down in the parsoid logs now. [03:31:35] could be a classic unicode attack [03:31:40] still trickling in .. but way more slowly. [03:31:55] gwicke: theory: something (changeprop service probably) requested many old revisions which were not in cassandra; restbase requested those from parsoid, causing intermittent parsoid overload, but also many successful responses, which restbase is storing in cassandra, triggering compaction / gc [03:31:58] I was thinking maybe just a new unicode related bug of our own, not really attack [03:32:22] I'm seeing hinted hand-off activity in cassandra logs [03:32:39] bblack: yah, German joke.. [03:32:47] ok :) [03:33:22] geez, the spike in network traffic on the cassandra cluster is nuts: https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-system?from=now-24h&to=now [03:34:29] mostly on 7, 8, and 9 [03:34:35] all different racks [03:34:45] that looks like a very hot partition [03:34:46] another theory: after many writes to the summary and definition end points, cassandra started a major compaction, and GC wasn't up to it [03:35:18] request rates to parsoid were not very high [03:36:01] I think the outage is over, I am able to load VE for old revisions I would not expect RB to have in storage [03:36:07] load is definitely coming down. [03:36:07] yeah but parsoid did start falling apart too. if it was just RB choking up, wouldn't parsoid have been mostly ok? [03:36:48] yes, something must have been especially expensive on the parsoid side [03:36:55] yes, i think parsoid was choking (as in taking too long) on those particular titles ... and they were repeatedly being requested [03:37:18] we have a lot of metrics and logs covering this, so I'm sure we'll get to the bottom of this [03:37:19] let me grep the logs for those and see what is going on with those. [03:37:36] when RB aborts a request to parsoid because its timeout expires, does parsoid stop parsing? 
[03:37:42] cassandra throughput is coming back up [03:38:13] parsoid has a 2/3 (I'll look it up) timeout and the worker is killed if it doesn't respond in that time. [03:38:32] * 2/3 min .. [03:39:01] and RB should only retry if there is no response (timeout), or a 503 with retry-after [03:39:02] parsoidConfig.timeouts.cpu = 3*60*1000; // 3 minutes [03:39:10] and only once [03:39:24] it is changeprop that was retrying those heavily .. [03:39:28] what is changeprop now? [03:39:34] and did it get deployed recently? [03:40:21] but yes, there are and will be titles on which parsoid will choke and timeout. [03:40:50] shall we use https://etherpad.wikimedia.org/p/ve-2016-05-05 for notes? [03:41:26] 10362 "title":"ភពពុធ" [03:41:27] 8276 "title":"" [03:41:37] ("ve" because that was the user-facing part) [03:41:38] those many requests to those 2 titles on just wtp1023 [03:42:08] 10360 "wiki":"kmwiki","title":"ភពពុធ" [03:42:08] 8274 "wiki":"frwiki","title":"" [03:42:41] changeprop is a new service to gradually replace some job queue use cases [03:42:43] not all of the wtps are recovered in icinga yet, but they may get there slowly [03:43:30] the first deploy was targeted at the summary end point (not parsoid backed), and definitions for enwiki (partly parsoid backed, but low volume & the backlog for those had been processed around noon SF time) [03:43:38] gwicke: when was that deployed? ~12:30 UTC? https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-cf-latency-rate?from=1462333396504&to=1462419796504&panelId=25&fullscreen&var-datacenter=1&var-node=All&var-keyspace=local_group_wikipedia_T_parsoid_html&var-columnfamily=data&var-quantiles=99percentile [03:43:50] the parsoid request rates reflect this [03:44:18] urandom: yes [03:44:37] 17 "wiki":"kmwiki","title":"ភពពុធ","msg":"completed parsing in 447 ms" [03:44:37] 16 "wiki":"kmwiki","title":"ភពពុធ","msg":"completed parsing in 441 ms" [03:44:37] 15 "wiki":"kmwiki","title":"ភពពុធ","msg":"completed parsing in 456 ms" [03:44:41] definitions are enwiktionary only, and generally low volume [03:44:55] so, those are parsing fairly quickly. maybe it is the frwiki title that was choking. [03:45:16] yeah I still think unicode is involved somehow in the choking part [03:46:13] indeed .. that is the title that was causing the crawl .. it didn't help that it got 10s of thousands of parse requests. [03:46:20] "wiki":"frwiki","title":"","msg":"completed parsing in 129917 ms" [03:46:20] "wiki":"frwiki","title":"","msg":"completed parsing in 132780 ms" [03:46:20] "wiki":"frwiki","title":"","msg":"completed parsing in 136190 ms" [03:46:29] wow [03:46:48] it got 8274 requests on wtp1023 .. so, x 24 [03:47:08] that has to be internal [03:47:22] something has a short timeout, and if parsoid doesn't respond, it just hits it again and again and again [03:47:23] indeed .. that is the title that was causing the crawl [03:47:28] most timeout user agents are: RB update job (job queue) followed by OCG [03:47:29] or ពុធ ? [03:47:37] it is not uniform though ... [03:47:38] "wiki":"frwiki","title":"","msg":"completed parsing in 8203 ms" [03:47:38] "wiki":"frwiki","title":"","msg":"completed parsing in 8051 ms" [03:47:38] "wiki":"frwiki","title":"","msg":"completed parsing in 8258 ms" [03:48:17] clear winner for 500 is 'Ruby' -- sigh [03:48:34] but, all from an internal (prod / labs) IP [03:48:44] 10.68.16.87 [03:48:54] that would be labs [03:48:57] heh [03:49:32] labs-ns1 says: [03:49:33] 87.16.68.10.in-addr.arpa. 60INPTRfixmebot.globaleducation.eqiad.wmflabs.
[03:49:36] ori sorry .. one sec. [03:49:41] fantastic hostname [03:49:46] 60INPTRfixmebot? [03:49:52] fixmebot.globaleducation.eqiad.wmflabs [03:49:53] bad pasting [03:49:57] gwicke, ori https://gist.githubusercontent.com/subbuss/95c4ac7cc100d071520db28d312a7bbd/raw/6645a7b4272cc0f250db6089925d1cda2e6ce952/gistfile1.txt [03:49:58] ah, yes [03:50:02] ok, lemme see who that is [03:50:24] (although obviously a misbehaving client shouldn't be able to etc. etc.) [03:50:34] so, the second # in that column is # of secs for parsing that title .. first column is how many parses took that long. [03:51:19] https://wikitech.wikimedia.org/wiki/Nova_Resource:Globaleducation lists a bunch of people, I'm not sure who's currently maintaining it [03:51:21] actually, nevermind.. it looks like all those requests are for pageviews [03:51:32] so, the vast majority of parses were in the low 10s ... let me try the other kmwiki title. [03:51:34] which are timing out often (rotating disks) [03:51:36] so probably it initially took 8s to parse, but whatever requested that decided 8 seconds was too long and hit it again before the first one finished [03:51:46] repeat until everything slows to a crawl, including those requests? [03:52:20] given that it's ruby, maybe User:Ragesoss ? :) [03:52:25] think he does some ruby [03:52:35] that instance has existed since march, created by (since departed) Sage Ross [03:52:39] Krenair: yep! [03:52:46] https://gist.githubusercontent.com/subbuss/7c78cf0840741061c67ae19d3352b242/raw/9de64df137b63076e8338cdae2f33bc83cb0dce2/gistfile1.txt [03:52:46] yeah but he moved to Wiki Edu [03:52:49] this might not be the culprit [03:53:06] so, for the kmwiki title .. vast majority of parses finished in < 5 seconds. [03:54:07] ori gwicke anyway .. etherpad time then? https://etherpad.wikimedia.org/p/ve-2016-05-05 ? [03:54:08] so what's clear is that RB saw cassandra timeouts, and cassandra was busy [03:54:34] just graphed the Ruby 500 responses, and they look quite stable over the last days [03:54:45] so not related [03:55:10] gwicke: I think I've been seeing those too, but hadn't dug into them. they come in on/off bursts [03:55:35] yeah, it has been a known problem for a while [03:55:49] the aqs servers will hopefully get SSDs soon [03:56:21] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1462409729366&to=1462414176425&var-site=All&var-cache_type=text&var-status_type=5 [03:56:29] those little low rate 5xx spikes on and off [03:57:55] ori, should i collect anything else from the logs? [03:58:11] wtp1004 is taking longer to recover than the rest [03:58:23] but it's finally clearing out in icinga now, way later than the rest [04:00:13] * subbu is glad we got all clients logging user agent strings and passing them through to parsoid in the last month [04:00:53] subbu: are you seeing a lot of slow requests from the Mobile Content Service? [04:01:28] i didn't check. [04:01:44] icinga's fairly normal now, except of course scb's puppet-disable and changeprop-dead [04:02:00] I'm pretty confident that the spike of internal requests came from MCS [04:02:44] a lot of them for /fr.wikipedia.org/v1/page/html/%EA%9E%80?redirect=false [04:02:48] redirect loop once more? 
[04:03:08] second most popular /km.wikipedia.org/v1/page/html/%E1%9E%97%E1%9E%96%E1%9E%96%E1%9E%BB%E1%9E%92?redirect=false [04:03:27] that would explain a lot [04:04:56] looks like a self-redirect: https://fr.wikipedia.org/api/rest_v1/page/html/%EA%9E%80?redirect=false [04:04:59] href="./Ꞁ" [04:05:21] I had to go AFK, catching up [04:05:23] gwicke: that reminds me... https://phabricator.wikimedia.org/T118548 [04:05:46] MCS had several of those bugs before, which is why we moved redirect handling to RB [04:05:57] I thought we were going to need VCL support there, which I never actually did yet, but now it's closed with "The new redirect handling code has been deployed in production." last week? [04:06:02] but it looks like that code is still active in MCS [04:06:43] bblack: yeah, that's the task; VCL support is needed to improve caching for ?redirect=false requests, but it's not critical [04:06:54] ok [04:09:39] https://km.wikipedia.org/api/rest_v1/page/html/ភពពុធ?redirect=false does not look circular [04:11:15] * paravoid is catching up [04:11:16] 10Traffic, 06Operations, 10Parsoid, 10RESTBase, and 3 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2266427 (10BBlack) [04:11:28] I suppose outage is over? [04:11:50] yeah, metrics look nominal again [04:11:57] paravoid: impact is over, but still investigating cause. also, we still have the new changeprop service shut down. [04:12:11] TimStarling, http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-traffic/20160505.txt [04:12:13] thanks [04:12:17] (which was a candidate cause, or at least involved in whatever eventual bad behavior) [04:12:56] we moved over here because the spam in -ops was bad [04:13:13] got it [04:15:39] https://gist.github.com/atdt/6059abf5549997f38a7af4634d0b3bd7 has most popular revisions requested from parsoid during outage, most popular user-agents, and most popular user-agents for the two most popular revisions [04:15:50] also: https://etherpad.wikimedia.org/p/ve-2016-05-05 [04:16:45] oldIds 181275, 106801025 were the most popular (4-5 times more than the third most popular) and the user-agent for those was almost exclusively ChangePropagation/WMF [04:17:37] ori, add a "grep started" or "grep completed" to count # of requests instead of log lines .. i made that mistake as well .. and updated it on the etherpad. [04:17:54] something is either looping through changeprop, or timing out and retrying a slow/failing/buggy query aggressively until meltdown [04:17:56] oh, right [04:19:29] are these the redirect pages? 
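(A small illustrative version of the check being done by hand around 04:04 above: look at the Parsoid HTML fetched with ?redirect=false and see whether its redirect link points back at the page itself. This is not MCS or RESTBase code; the exact shape of Parsoid's redirect link, and the attribute order in the regex, are assumptions based on the href="./Ꞁ" snippet quoted above:)

    // Illustrative only. Returns true if the Parsoid HTML for `title`
    // declares a redirect whose target is the page itself.
    function isSelfRedirect(title, parsoidHtml) {
        var m = /rel="mw:PageProp\/redirect"[^>]*href="\.\/([^"]+)"/.exec(parsoidHtml);
        if (!m) { return false; }                          // not a redirect page
        var target = decodeURIComponent(m[1]).replace(/_/g, ' ');
        return target === title.replace(/_/g, ' ');        // redirects back to itself
    }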
[04:19:50] the frwiki example we started with was a redirect page, I think [04:19:57] yup, looks like [04:20:04] both are [04:22:08] /srv/log/changeprop/main.log is a mess [04:22:42] lots of: [04:22:43] {"name":"changeprop","hostname":"scb1001","pid":10,"level":50,"message":"Retry count exceeded","event":{"meta":{"topic":"change-prop.retry.resource_change","schema_uri":"retry/1","uri":"http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE","request_id":"891a92f1-1207-11e6-9964-5ad07d937bab","id":"95e39d6c-126a-11e6-81a7-9fb31907bc51","dt":"2016-05-05T02:39:27.193Z","domain":"fr.wikipedia.org"},"emitter_id":"change-prop#summary_rerender","retries_left":0,"original_event":{"meta":{"domain":"fr.wikipedia.org","dt":"2016-05-05T02:34:37.284Z","id":"e91700cd-1269-11e6-8bb5-51af0f3be924","request_id":"891a92f1-1207-11e6-9964-5ad07d937bab","schema_uri":"resource_change/1","topic":"resource_change","uri":"http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE"},"tags":["restbase"]},"reason":"Error in Cassandra table storage backend"},"levelPath":"error/summary_rerender","msg":"Retry count exceeded","time":"2016-05-05T02:39:27.193Z","v":0} [04:23:17] so apparently it has a retry limit of some kind [04:24:27] yeah, currently it should log events whose retry limit is exceeded [04:24:29] & drop them [04:24:41] we could be seeing something that's not necessarily infinitely-retrying, but just multiplied-retries from various services invoking each other [04:25:37] normally, those requests should not even reach parsoid after the first parse [04:25:40] so something is odd [04:28:39] the frwiki redirect taking >10s to parse is odd [04:30:28] gwicke, https://fr.wikipedia.org/wiki/%EA%9E%80?oldid=106801025 doesn't look like a redirect [04:30:29] * bblack beats dead horse about how probably nothing should ever retry for inter-service calls internally. when in doubt, fail through fast to your caller. [04:31:02] but, in any case, it taking even 8 seconds seems odd .. [04:31:29] wonder if we have inefficiencies in our unicode handling. [04:31:50] subbu, view-source:https://fr.wikipedia.org/api/rest_v1/page/html/%EA%9E%80/125284517?redirect=false [04:31:53] I suspect so, that's probably one of the catalysts in all of this [04:32:12] that's the *latest* version [04:32:41] but the old version is indeed not a redirect [04:32:52] ok ... parsoid got requests to %EA%9E%80?oldid=106801025 however [04:33:28] so... [04:33:42] well that thought went nowhere [04:34:16] bblack, sorry .. which thought? i was staring at the page. [04:34:20] my thought [04:34:53] is there some disconnect between identifying that a page is a direct by looking at latest rev, even when we're really dealing with an older rev that isn't one? [04:34:53] bblack: I'm sure the retries from the changeprop service didn't help, but that alone shouldn't account for more than a single-digit number of requests (unless there is a bug) [04:34:54] oh .. missed that .. unicode being the catalyst .. possible .. i'll parse that locally to see what is going on. [04:35:14] s/is a direct/is a redirect/ [04:35:40] which would cause some loopy behavior of trying to follow a non-existent redirect? [04:36:04] RB saw a lot of requests for /fr.wikipedia.org/v1/page/html/%EA%9E%80?redirect=false from MCS, which would be the latest revision [04:36:17] which is a redirect to itself [04:36:31] oh, why? [04:37:00] why a redirect to itself?
[04:37:09] yeah [04:37:30] it turns out that things on wikipedia aren't always 100% consistent ;) [04:37:46] MCS has been good at finding such loops [04:37:50] so is some code in one of these services following that redirect-to-self indefinitely? [04:38:56] gwicke: "Theory: MCS entered redirect loop (as previously) " -- where / what? is there a postmortem or some other record for this somewhere? [04:39:00] it did so recently (in a request spike at the start of the outage), and based on the request volumes I think that's a plausible candidate [04:39:27] (which maybe needs a redirect depth limit where it just gives up, either that or explicit loop detection by comparing to all the article names looped through so far) [04:39:31] the previous redirect loops didn't lead to outages [04:39:52] but it sped up https://phabricator.wikimedia.org/T118548 [04:40:16] however, the MCS implementation is still active I think [04:40:18] we did have an ios app release today too right? [04:40:38] or sometime in the past day or two [04:41:02] I remember godog noting a 404 spike, which turned out to be /.well-known/apple-crap, which we then figured was related to an ios wikipedia app release [04:42:26] https://phabricator.wikimedia.org/T131416 is one previous MCS redirect loop instance [04:44:03] jurging by the trace logs in RB it is a redirect loop in MCS [04:44:22] but, the relevant code in MCS has not been touched since the last fixes [04:45:04] Pchelolo: yeah, that was my suspicion too [04:45:12] https://etherpad.wikimedia.org/p/ve-2016-05-05 [04:45:27] so revised theory: MCS redirect loop on a redirect-to-self article, plus when RB tried to get parsoid to parse that particular article (as opposed to past redirect loopers), it hit a parsoid slowdown/bug on utf-8 titles that greatly exacerbated impact until meltdown ? [04:46:28] edited a page to remove a loop [04:46:58] I don't have an explanation yet for why parsoid would even see most of those requests [04:47:01] and the title in question partitions to the Cassandra nodes that were overwhelmed, btw [04:47:18] which is what we saw on the last redirect loop [04:47:20] bblack, yes ... i am too tired to investigate why that page takes ~8-10 seconds to parse what seems like a fairly small page. [04:47:27] unless the first request somehow timed out right away [04:47:38] gwicke: I think it did [04:48:09] but, something for tomorrow. [04:48:20] I think even the very first request for it took 8s due to some parsing bug (probably unicode related), and that was too long. and some impatient code kept giving up and retrying and it kept taking longer due to overload [04:48:39] 8s is not *that* long, though [04:48:47] that on top of the spam of requests from MCS slowing down both layers [04:48:56] it's just going to build on itself [04:49:02] popular articles like Obama take significantly longer [04:49:18] yeah but they don't have redirect loops for MCS to go nuts on either [04:49:27] it's surprisingly long for a small page, but normally nothing that would cause requests to fail [04:49:57] as MCS looped, it would be creating parallel queries ->RB->parsoid? [04:50:11] sequential [04:50:18] the effects cascade surprisingly far, it's very hard to reason about [04:50:20] hmmm [04:50:33] e.g. how is the pdf cluster involved? https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=PDF+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [04:50:42] gwicke, yes 8s is not very long .. the only reason it mattered here is the huge volume of requests to parsoid. 
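(A self-contained sketch of the safeguard suggested at 04:39 above: a redirect depth limit plus explicit loop detection over the titles already visited. This is illustrative only, not the actual MCS or RESTBase redirect-following code; fetchHtml is a hypothetical stand-in for the service's HTTP client, assumed to resolve with an object whose redirectTarget is null for non-redirect pages:)

    // Sketch only: follow redirects with a depth cap and a "seen titles" set,
    // failing fast instead of looping forever on a page that redirects to itself.
    function followRedirects(title, fetchHtml, maxDepth) {
        maxDepth = maxDepth || 5;
        var seen = new Set();
        function step(current, depth) {
            if (depth > maxDepth) {
                return Promise.reject(new Error('redirect depth limit exceeded at ' + current));
            }
            if (seen.has(current)) {
                return Promise.reject(new Error('redirect loop detected at ' + current));
            }
            seen.add(current);
            return fetchHtml(current).then(function(res) {
                if (res.redirectTarget) {
                    return step(res.redirectTarget, depth + 1);   // keep following
                }
                return res;                                       // final, non-redirect page
            });
        }
        return step(title, 0);
    }

A self-redirect like the frwiki Ꞁ page would trip the seen-set check on the second step instead of generating another round trip through RB and Parsoid.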
[04:51:09] ori: OCG is a big api user [04:51:32] oh, so this hammered the mediawiki API and affected all clients of that? [04:52:00] i'm still doing the stupid work of noting spikes and ordering them chronologically [04:52:01] it's mostly hitting RB and parsoid, afaik [04:52:12] i don't think ocg is involved in this .. [04:52:37] but, yes, anything that ties up the cluster in that period will exacerbate the effect. [04:53:00] so, ocg is involved in so far as it is sometimes a source of large parse requests. [04:53:10] yeah, I think it's more likely changeprop -> RB -> MCS (possibly loop) -> RB -> Parsoid [04:53:28] but, that is no different from any other time .. i.e. no change in request rate to parsoid from ocg as far as i can tell. [04:55:15] generally, RB -> parsoid rates weren't out of the ordinary [04:55:52] mean parsoid latency went up sharply, though [04:56:39] median went from ~600ms to 47s at :45 [04:56:47] https://grafana.wikimedia.org/dashboard/db/restbase [04:57:45] yeah but by :32 or so the RB req rate was dropping off like a rock [04:57:57] something was already happening well before that jump in parsoid latency [04:58:20] (notably, the same rate that had spiked up on changeprop deploy) [04:58:43] parsoid latency started at ~35 [04:59:24] just after a spike of internal requests at :32 [04:59:41] which would be MCS [04:59:53] 500+/s [05:00:58] parsoid has 360 workers on the cluster .. so, @ 8s/request, you need 45 requests/sec for those ids to tie down all the workers ... and from then on, it just spirals out with increasing latencies. [05:01:00] compactions start to spike at 03:28 [05:01:15] so, what made this MCS loop worse than past ones, and why didn't https://gerrit.wikimedia.org/r/#/c/280789/2/lib/parsoid-access.js fix it? [05:01:41] err, 02:28, that is [05:01:58] ori: yeah, even a bit earlier [05:02:20] is the smaller spike at 2:19 related? [05:03:08] 2:16? [05:04:12] https://logstash.wikimedia.org/#dashboard/temp/AVR_TLQ9CsPTNesWPPTj is a view of org.apache.cassandra.db.compaction.CompactionTask log events, you can click and drag to select a slice of the timeline to zoom into [05:04:14] erm, sorry- got the time wrong; there was a much bigger compaction spike at 09:27 [05:04:33] recent compactions look quite normal to me [05:04:43] hmmm also in the 'Public entry points, non-WMF clients' graph further down in https://grafana-admin.wikimedia.org/dashboard/db/restbase [05:05:13] restbase.internal.v1_page_html_-title.GET.ALL [05:05:18] kinda tracks similarly [05:05:41] bblack: yeah, I think it's pretty clear that those requests were from MCS [05:05:47] ok [05:06:03] I filtered the trace logs in logstash on internal in that period, and it's basically all MCS [05:07:17] but, there are several bits that don't make sense yet [05:10:07] https://gist.githubusercontent.com/subbuss/9ecec9ea63325912f1a6da35dabd40a0/raw/1df8240beb984c16c9a6c7526bba61125fe2df9d/gistfile1.txt shows request rate to the frwiki title. [05:10:42] this is only on wtp1023 .. and that is per minute req rates. [05:12:21] so, ~50 reqs / sec just for that one title across the cluster at times .. so, given that the title took 8-10 seconds or maybe longer at that time ... it was probably enough to bring the cluster to a crawl. [05:13:25] so we're still back to basically: MCS loop (which was supposed to be fixed, but isn't) + much slower render of self-redirect that's causing the loop, making fallout to parsoid worse. 
[05:14:11] if parsoid could have handled that title much faster, or if the request rate to that title was smaller .. parsoid wouldn't have ground down to a halt. [05:14:23] the request rate to the kmwiki title is even higher than the frwiki one. [05:14:34] right, which is why the past MCS loop events were tolerable [05:14:40] they were on faster-parsing articles [05:14:47] https://gist.githubusercontent.com/subbuss/c6c7e68045a0ce22ead79f4f0f963f5c/raw/53c24990d0951905d9cb434e91467513df40562d/gistfile1.txt [05:14:52] oh, in the beginning changeprop was involved too. At 2:36 it starts logging 'retry count exceeded' for the same title over and over again [05:14:54] https://km.wikipedia.org/api/rest_v1/page/html/ភពព្រះពុធ [05:14:56] 10Traffic, 10DNS, 10Mail, 06Operations, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2266493 (10Dzahn) ping, we need reviews/merge for https://gerrit.wikimedia.org/r/#/c/280644/ [05:15:04] bblack, right. [05:15:31] Pchelolo: gwicke was indicating there was a path for changeprop->RB->MCS->RB, to let changeprop kick off the problem on an edit [05:15:33] OK, this sounds convincing enough that it should be recapitulated in complete sentences in the postmortem [05:15:49] (I think) [05:15:52] also, what is the impact of changeprop being down? [05:16:11] it is still disabled (as is puppet, on scb* hosts). Who is affected? [05:16:18] it's no impact (yet), as it was just a test deploy which double-processed events [05:16:20] ori: no impact right now, all updates are still processed elsewhere [05:16:28] awesome [05:16:57] OK, I'm going to puppetize the change to disable the service [05:17:02] there are several gaps yet in this theory [05:17:05] no wonder my purge rates went up then, that was changeprop going live :P [05:17:18] 1) why did parsoid see so many requests for the same title? [05:17:32] (why didn't RB serve them from storage?) [05:17:41] because RB never got a successful parse back that it liked, so it kept having to ask again? [05:17:48] maybe it's not only slow but broken [05:17:53] (1) that is a different qn .. the theory still explains why parsoid cluster fell down. [05:18:25] only if those requests really failed & never made it to storage [05:18:35] bblack: if it never got a successful parse back it wouldn't start the loop in MCS. MCS is contacted only after a revision is stored [05:18:36] but 10s parse time doesn't sound bad enough [05:18:48] ok [05:19:35] so if there was only one MCS loop, and the looped requests are serial (asserted earlier), and parsoid takes 8s to parse, shouldn't the MCS loop be running at like 1 redirect per 8 seconds? [05:19:40] ok .. so, there are missing gaps on the mcs - changeprop - rb end ... but, i am satisfied on the parsoid end as to why the cluster went down .. spike in repeated concurrent requests to a high-latency page. [05:19:47] how do we get 500/sec reqs from the MCS loop? [05:20:36] it's possible that the initial burst was enough to swamp the parsoid cluster [05:20:53] separately, we need to figure out on the parsoid end if we can improve the parse latency on that page, but it still doesn't prevent a DOS if we hit this loopy scenario on another page. [05:20:55] if there were hundreds of requests in the first 10s, then all of those would actually hit parsoid [05:20:58] what initial burst, if the loop is serial and waits on RB that waits on parsoid? [05:21:06] so, the looping and retrying needs plugging. 
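(Back-of-the-envelope check of the saturation math above, using only numbers quoted in this log: 360 Parsoid workers, roughly 8 s per parse of the frwiki title, and ~50 req/s observed for that one title:)

    // Rough saturation estimate; the inputs come from the log above,
    // the calculation itself is just arithmetic.
    var workers = 360;              // parsoid workers across the cluster
    var secondsPerParse = 8;        // observed parse time for the frwiki title
    var maxSustainableRate = workers / secondsPerParse;   // = 45 req/s
    var observedRate = 50;          // req/s seen for that single title
    console.log('cluster saturates at ~' + maxSustainableRate + ' req/s;',
                'observed ~' + observedRate + ' req/s for one title alone');
    // Once every worker is tied up, queueing makes each parse slower, which
    // lowers the sustainable rate further -- the spiral described above.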
[05:21:19] bblack: yeah, that part doesn't make sense [05:21:25] MCS loop should never reach parsoid (in theory) [05:21:26] no looping without a 200 response [05:21:38] and from what I seem most responses were indeed 200s [05:21:41] *see [05:21:44] parsoid is completing those parses .. so it is returning http 200s [05:21:47] there were clearly hundreds of requests, they clearly came from loopy behavior [05:22:47] does MCS have a shorter timeout than RB does? [05:22:57] aha: most responses have the same etag, so were stored from storage [05:23:06] *served from storage [05:24:22] https://logstash.wikimedia.org/#dashboard/temp/AVR_X3d6jK4nptUtwUh_ the loop started on the changeprop side [05:24:30] so, 2:35 is when parsoid starts timing out massively on those titles .. https://logstash.wikimedia.org/#/dashboard/temp/AVR_XxhmjK4nptUtwUQE [05:25:41] Pchelolo: yeah, that's a very high rate [05:25:46] do you guys still need me for helping with any of the investigation on the parsoid end? I can hang around if so. [05:26:24] bblack: MCS's timeout should be longer than RB's [05:26:52] and the client it uses is the same as the RB one, and configured to be very conservative about retries [05:27:25] subbu: I think we are good for now [05:27:35] ok, I know the reason [05:27:41] subbu: thank you, also for pinging me! [05:28:07] changeprop seems to have a hard limit of 5 retries though, per change? [05:28:47] Pchelolo: could this be the race condition with commits we talked about earlier? [05:29:05] gwicke: nono, I'm about to write a lot of text. hold on [05:29:13] * gwicke holds still [05:29:27] Pchelolo: you might want to write directly to https://etherpad.wikimedia.org/p/ve-2016-05-05 [05:30:26] ok, writing there [05:32:39] * subbu is watching the etherpad [05:33:08] aha! the no-cache forwarding seems to be the missing puzzle piece [05:35:52] gwicke: it might be time to re-visit your RfC from a year ago on the subject of request timeouts and retries (https://phabricator.wikimedia.org/T97204). The standardization of nodejs tooling for WMF services should include a common library for cross-service requests. [05:36:23] that's pretty much implemented & used in all the node services [05:36:33] i am not following the changeprop/mcs/restbase details .. and i am drowsy .. will catch up tomorrow once the narrative is established. [05:36:42] except for timeout staggering [05:37:00] (although we are close on that as well, especially between RB & parsoid) [05:37:14] gwicke: done. My explanation is at the bottom. Sounds sane? [05:37:15] feel free to wake me up again if necessary :) but hopefully it won't be necessary. [05:37:17] Pchelolo just solved it [05:37:19] that's great, gwicke [05:37:28] see the etherpad [05:37:48] Pchelolo: that makes a lot of sense [05:38:07] yes [05:38:15] Pchelolo: nice [05:38:54] so we have many bugs here - RB forwards no-cache where it shouldn't, change-prop follows a redirect (and potentially preserves a no-cache), parsoid takes long time to parse that page, same-html check doesn't work [05:38:55] * gwicke wonders where the MCS request spike came from [05:39:10] I'm still on my dead-horse statement. retries mask errors and multiply problems more often than they help. bubble up errors and timeouts, don't retry things in the general case. [05:39:13] gwicke: see #7 in the explanation [05:39:35] from the user-facing POV, we still have 1x retry of 503s, which can in theory paper over any transient issue that happened 10 layers deeper. 
[05:40:23] bblack: current default is no retries except for a) connection timeout (1x in that case), or b) 503 with explicit retry-after [05:41:12] in a), the timeout is also increased by some factor on retry [05:41:25] https://github.com/wikimedia/change-propagation/blob/1db7e80efefd38cc19e0a64e092bd22bf56dfede/lib/rule.js#L7 [05:41:35] ^ seems to indicate there's some internal retry logic there? [05:42:07] yeah, that's not at the http level [05:42:22] bblack: yep. but in this case retry logic wasn't really the root cause. even without the retry logic this problem would've happened [05:42:39] it still multiplies the meltdown [05:42:47] with changeprop, there is no user or vanish to retry [05:42:53] *varnish [05:43:24] a major issue we talked about earlier is that those retries aren't paced right now [05:43:42] they are executed asap, rather than after some delay [05:44:00] my beef with (a) is you've already waited too long and the user no longer cares and you've failed anyways. also is "connection timeout" failure to establish TCP, or failure to get a response in time? [05:44:11] if i'm not needed, i think i'll slink away (i have to get up in a few hours) [05:44:14] with changeprop, there is no user [05:44:21] a delay would probably be even less intuitive to debugging... [05:44:23] the point of those requests is to update storage & caches [05:44:27] and my beef with (b) is: in what scenario does a service return an explicit 503 and have some sane idea when itself will be fixed for retrying? [05:44:44] but in any case, how many bugs do we see here? [05:44:53] bblack: basically never, which is why that's more a theoretical case we don't need to worry about ;) [05:45:51] so, you set sane timeouts with an eye towards cascading them shorter and shorter as you dig deeper in the service->service->service stack. if you hit a timeout or an error, you immediately throw the error upstream. [05:46:01] no retries. no retry-after. [05:46:19] if you've waited for any sane timeout length already, you've already lost the user when it expires. why bother retrying? [05:47:25] paravoid: at least 2 [05:47:29] and if we're worried about 0.001% transient little things causing rare unavoidable errors during rolling deploys or from a dropped packet somewhere... the outer 1x 503 retry the varnish frontend does on behalf of the user should fix it [05:47:33] possibly 3 [05:47:53] which ones? :) [05:47:55] because it's at the outermost layer of everything, it doesn't multiply anything [05:48:22] paravoid: two are already marked in Pchelolo's summary on the etherpad [05:48:33] the third is plausible, but not yet verified [05:49:00] I'm assuming above is my random stab with: (could be this is where the unicode html and/or title bug comes into play?)
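(An illustrative sketch of the two remedies argued over above: pacing retries with a backoff delay instead of firing them asap, and shrinking the timeout budget as a request descends through service layers. doRequest and err.isConnectTimeout are hypothetical names; none of this is the actual preq or change-prop implementation:)

    // Sketch only; helper names are assumptions.
    function delay(ms) {
        return new Promise(function(resolve) { setTimeout(resolve, ms); });
    }

    // At most `maxRetries` retries, only on connection timeout, and each
    // retry waits for an exponential backoff instead of going out immediately.
    function requestWithPacedRetry(doRequest, timeoutMs, maxRetries) {
        var attempt = 0;
        function tryOnce() {
            attempt += 1;
            return doRequest({ timeout: timeoutMs }).catch(function(err) {
                if (attempt > maxRetries || !err.isConnectTimeout) {
                    throw err;                          // fail fast back to the caller
                }
                var backoffMs = Math.min(1000 * Math.pow(2, attempt - 1), 10000);
                return delay(backoffMs).then(tryOnce);  // paced, not immediate
            });
        }
        return tryOnce();
    }

    // Cascading budgets: each layer keeps a slice of its caller's deadline and
    // hands the remainder down, so an inner call can never outlive the outer
    // timeout -- the "shorter and shorter as you dig deeper" idea.
    function childTimeout(parentTimeoutMs) {
        return Math.max(0, Math.floor(parentTimeoutMs * 0.8) - 100);
    }

Whether retrying at all is worth it is exactly the disagreement above; the sketch only shows what "paced" and "cascading" would look like mechanically.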
[05:49:22] bblack: yep [05:49:26] we've still got a long parse of a very short page with unicode in it at the parsoid layer, that subbu can probably figure out after sleeping [05:49:28] yeah, something might cause the unchanged-html optimization to fail [05:49:54] slow parse & possibly non-deterministic output [05:49:58] the two things may be related: slow parse of these unicode-y pages + unchanged-html fail [05:50:14] *nod* [05:50:28] encoding issues with URI vs HTML entity encoding of the utf8 bytes and whatever [05:50:53] notably they didn't just have unicode content, but also unicode in titles (in urls) [05:51:11] yeah, unicode normalization perhaps [05:51:22] I need CR for https://gerrit.wikimedia.org/r/#/c/287050/ and https://gerrit.wikimedia.org/r/#/c/287051/ so I can re-enable puppet on scb hosts [05:51:45] looking [05:53:31] oh sorry, just saw the "looking" [05:53:37] I've already merged but not puppet-merged [05:54:24] see -ops [06:02:24] I think we are all a bit too tired to write a proper post-mortem tonight, but we'll investigate in-depth & write a proper one within 24 hours [06:03:04] ori, bblack: thank you both for the help! [06:05:59] thank you gwicke [06:09:00] gwicke: yeah, that makes sense [06:09:16] gwicke: happy to help too, we can also discuss on our weekly later today too [06:09:21] (or I guess tomorrow for you :P) [06:10:13] yup, pretty soon [16:17:29] 10Traffic, 06Operations, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2267511 (10BBlack) We're overdue to circle back to this, but there's also a lot of investigating and thinking left to do, and IMHO the varnish4 transition as well as the Surrogate-Con... [16:24:07] <_joe_> bblack: I have a varnish question [16:24:59] <_joe_> say one requires a url that is a cache miss but should supposedly be cached, and fetching it takes a lot of time; what would be of subsequent requests for the same url? will they be enqueued to the first one completing? [16:25:33] <_joe_> we had a pretty strange situation today with wdqs (so, misc-web) that could be explained by this [16:26:54] yes, they'll all queue up and wait on the first miss until it returns [16:33:45] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835#516941 (10BBlack) Note also that both CC and SC have grace-mode information as well. In CC: our max-age is `s-maxage` with fallback to `maxage`, and our grac... [17:31:23] 10Traffic, 10Varnish, 06Operations, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2267690 (10BBlack) Did some further testing on an isolated test machine, using our current varnish3 package. * Got 2833-byte test file from uncorrupted (--... [18:45:55] 10Traffic, 06Operations, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2268070 (10BBlack) So, this the current state of affairs and the complex bits, using text as an example cluster: * hieradata `cache::route_table` for `role::cache:text` de... 
[19:21:38] 07HTTPS, 10Traffic, 06Operations, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2268225 (10Danny_B) [20:37:25] 07HTTPS, 10Traffic, 07Tracking: Requests for resources through a non-canonical address over HTTPS redirect to the canonical address on HTTP (tracking) - https://phabricator.wikimedia.org/T38952#2268407 (10Danny_B) [20:43:38] 10Traffic, 10Varnish, 10Librarization, 10MediaWiki-extensions-CentralNotice, and 3 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#2268416 (10chasemp) p:05Triage>03Normal [20:43:57] 10Traffic, 10Varnish, 06Discovery, 10Maps, and 3 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2268417 (10chasemp) p:05Triage>03High [20:44:17] 10Traffic, 10Varnish, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, and 2 others: Spike: CentralNotice: Verify that our Special:HideBanners cookie storm works as efficiently as possible - https://phabricator.wikimedia.org/T117435#2268418 (10chasemp) p:05Triage>03Normal [20:44:32] 10Traffic, 10Varnish, 10MediaWiki-API, 06Operations: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867#2268419 (10chasemp) p:05Triage>03Normal [20:44:59] 10Traffic, 10Varnish, 10MediaWiki-Vagrant, 06Operations: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#2268421 (10chasemp) p:05Triage>03Low [20:45:12] 10Traffic, 10Varnish, 06Operations, 06Performance-Team: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2268422 (10chasemp) p:05Triage>03Normal [20:46:46] 10Traffic, 06Operations, 05codfw-rollout: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404#2268428 (10chasemp) p:05Triage>03Normal [20:47:10] 10Traffic, 10MediaWiki-ResourceLoader, 06Operations: commons.wikimedia.org home page has 404s loaded from JS (RL?) - https://phabricator.wikimedia.org/T134368#2268429 (10chasemp) p:05Triage>03High [20:47:20] 10Traffic, 10Varnish, 10Education-Program-Dashboard, 06Operations, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2268430 (10chasemp) p:05Triage>03Normal [20:47:35] 10Traffic, 10Varnish, 10Analytics, 06Operations, 07Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#2268432 (10chasemp) p:05Triage>03Normal [21:30:12] 10Traffic, 10Varnish, 10MediaWiki-Vagrant, 06Operations: Make Varnish port configurable using hiera - https://phabricator.wikimedia.org/T124378#1954287 (10BBlack) We don't actually use port 6081 in production puppetization anyways, so I'm not sure what this is about? Related is T119396 which is a bit stal... 
[21:31:30] 10Traffic, 06Operations, 13Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack) [21:31:54] 10Traffic, 06Operations, 13Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1825233 (10BBlack) [21:33:21] 10Traffic, 10Varnish, 10Education-Program-Dashboard, 06Operations, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#1855172 (10BBlack) Can you give me some idea what URLs we're talking about for "education dashboard pages"? I'm lost. [21:52:36] 10Traffic, 10Varnish, 10Analytics, 06Operations, 07Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#2268606 (10BBlack) IMHO, we should define better what we need. There's a lot of grey area and disagreement in the discussion so... [22:23:33] 10Traffic, 10Varnish, 10Education-Program-Dashboard, 06Operations, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2268675 (10awight) @BBlack Oops--these are currently hosted on labs, at http://outreachdashboard.wmflabs.org/ and https:... [23:22:10] 10Traffic, 10Varnish, 06Discovery, 10Maps, and 3 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2268732 (10Yurik) Some stats to consider: the db update script reports 5-6 million z16 affected tiles per day. That's about 7million (5m/4 + 5m/16 + 5m/64...) with all...