[00:35:08] we're all wondering in hangout if 600 seconds is a magic interval anywhere in the invalidation logic anywhere [01:02:32] robla: 5 minutes sounds like a magic number to me [01:03:10] bd808: 10 minutes [01:03:11] hi! [01:03:38] bd808: how far down the rabbithole would you like to go? [01:03:43] ah math. I've got a degree in it so I can't do things like multiple and divide [01:03:46] We're in https://hangouts.google.com/hangouts/_/wikimedia.org/cache if you want. [01:03:57] LEave that to the interims :p [01:04:00] [sic] [01:04:20] * robla sometimes pretends he can math [01:04:51] awight robla sorry power outage, just came back on tho [01:05:16] is AaronSchulz around? He's a cache genius [01:05:47] Krinkle usually seems to know things like this too [01:05:52] * awight waves a large piece of cardboard reading "AaronSchulz" [01:06:14] yeah, it would be cool if he was here, especially because this commit is part of my really reckless speculation: https://phabricator.wikimedia.org/rMWe99bb1b7dcc4c52a74b60ca445f20b8db649222c [01:06:21] awight: find a Husky or a cute cat to wave instead. It will get his attention quicker :) [01:09:49] I just had a pitbull in hangouts [01:12:59] * bd808 needs food more than cache invalidation answers [01:13:18] good luck folks. Maybe you can trick ori into helping you [01:14:41] doubtful, it's Noam's bed-time, plus I got nerd-sniped by http://tstarling.com/blog/2016/09/laser-speckle-contrast-imaging/ [01:30:42] I've been engaged in all sorts of reckless debugging-by-fingerpointing. they're planning to rewrite MediaWiki, but only for meta ;-) [01:52:15] anyone know have any tips on dev setup to debug objectcache more or less as it runs on prod? [01:52:22] I see a memcached vagrant role... [01:53:13] ori bd808 AaronSchulz Krinkle ^ ? [01:53:26] * Krinkle pops up for a minute before shutting down [01:53:42] like a cell phone with a dying battery [01:53:58] that's the worst, when your phone boots up only to show the "Shutting down..." message. [01:53:59] slightly larger device, but same basic principle. [01:54:03] awww [01:54:04] :D [01:55:12] the 10 min window after invalidation where things work and then they don't, that sounds like WANObjectCache's 'tombstone' and/or 'lockTSE' logic. [01:55:31] Krinkle: here's what robla meant by reckless fingerpointing-debugging... wondering if this change might have affected fast availability of changes Mediawiki NS content (aka banners) https://phabricator.wikimedia.org/rMWe99bb1b7dcc4c52a74b60ca445f20b8db649222c [01:55:36] Maybe the slave or underlying cache is borked, but during the first 10 minutes, it's bypassed. [01:55:56] In which case WANObjectCache doesnt' cause the bug, it's rather surpressing the bug for a short time. [01:56:01] Krinkle: yeah that was the suspicion, but we already found another 600 second timer that seems to be the issue [01:56:17] Or rather the cause of the 600 second delay [01:56:23] Not sure yet tho [01:56:51] Maybe somehow the null response of a not-yet existent message is cached? [01:56:53] https://phabricator.wikimedia.org/rMWe99bb1b7dcc4c52a74b60ca445f20b8db649222c doesn't look suspicious to me in this context. [01:57:34] Where in core are the contents of a NS_MEDIAWIKI article written to MessageCache? [01:57:52] Where can I see what caching options and layers would be used for that on prod? [01:58:09] What CN does is edit an article in that NS [01:58:32] and assume that it'll be available as a message... basically anytime [01:58:49] AndyRussG: Is the problem that, ignoring the first 10min, eventually the previous revision shows up, or that no content at all shows up? E.g. if you try to reproduce it with changing a page and creating a page, does it always end up failing on not existing or just on whatever the previous state is. [01:59:51] If so, then it's indeed most likely caused by something in CentralNotice saving the name of the message somewhere, distributing it somewhere which in turn will ask for it, and ends up asking for it from a slave before it has been replicated, and then caches it in a way that will stay until the relevnat message is changed again (which won't happen since it [01:59:51] already happened). [01:59:51] Krinkle: I think the reports we have so far, both in the wild and haphasardly reproduced on prod, are of only empty messages showing up [02:00:03] This is usually mitigated by considering slave lag in the wanCache getWithSetCallback options. [02:01:15] Krinkle: we're no longer testing by seeing if the banner content returned is the name of the message. [02:01:20] This is it now: [02:01:23] $bannerContentMessage = $this->context->msg( $bannerKey )->inLanguage( $parentLang ); [02:01:24] if ( !$bannerContentMessage->exists() ) { [02:01:26] // Translation subsystem failure [02:01:28] throw new RuntimeException( [02:01:30] "Banner message key $bannerKey could not be found in {$parentLang->getCode()}" [02:01:32] ); [02:01:34] } [02:02:09] So it really is MessageCache thinking the message doesn't exist, as far as I can tell [02:02:39] AndyRussG: As you may know, we have a static localisation cache (=cache of .json files maintained in git, not including NS_MEDIAWIKI overrides and extra "messages"). That's cached somewhere. And there's a layer atop that, MessageCache. This one is cached in various layers, and invalidated when a page is edited/created/delete etc. in NS_MEDIAWIKI. [02:02:44] If I could get a local setup with MessageCache operating similar to prod, I might be able to reporduce localy [02:02:49] AndyRussG: When MediaWiki creates or edits a WikiPage in NS_MEDIAWIKI, eventually MessageCache::replace() ends up being called, which will invalidate the master cache key for said cache layer [02:03:02] so that the next request for a message will be forced to rebuild that cache [02:03:13] $this->wanCache->touchCheckKey( wfMemcKey( 'messages', $code ) ); [02:03:41] Krenair: file/line? [02:03:47] Sorrry meant Krinkle ^ [02:03:50] See also wfMessage -> Message -> Message::fetchMessage -> MessageCache [02:04:05] $this->wanCache->touchCheckKey( wfMemcKey( 'messages', $code ) ); is in MessageCache::replace() which is called by WikiPage when a page is edited/created/ [02:05:13] Oooh K thanks that's what I was looking for [02:07:09] Krinkle: also some suspicion that the problem response may happen sometimes in some datacenters and not others [02:07:26] This is all I got for now. [02:07:28] o/ [02:07:50] thanks Krinkle ! [02:08:25] Krinkle: thx!!! bye :) [02:12:44] awight: robla: FWIW here's the full method mentioned above ^ https://github.com/wikimedia/mediawiki/blob/master/includes/cache/MessageCache.php#L530-L598 [02:17:55] Krinkle: Here's the spell we invoke to write the banner content to the db: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/includes/Banner.php#L759 [02:21:38] AndyRussG: This is cheap. What about tweaking the wgNotice cache expiry to 700 seconds, so we give WANObjectCache time to settle down? [02:22:20] Or is that the opposite of what we want to do, and we would always cache miss? [02:23:28] awight: huh? I dunno. Remember like u noted it's apparently not stampede-correlated at all [02:24:19] ok right, the logged exceptions happen before stampede time [02:27:57] awight: what we need to know is why is Message::exists() returning false even though CN did the edit on the NS_MEDIAWIKI article... unless.. somehow it is a CN problem, and CN is actually for some reason not making the edit? That can't be, because eventually the message content rights itself [02:28:07] CN is definitely writing the message [02:28:25] I mean the article content, making the edit to the article on that NS [02:31:02] MessageCache::get() has to be returning false [02:32:01] that all sounds right. exists relies on fetchMessage, relies on MessageCache::singleton()->get( $key, true /*usedb*/, 'en' ) [02:34:08] aaarg now I've lost where it tries to fetch from ObjectCache [02:35:28] It's strange that MessageCache::getMessageForLang tries the database before cdb, I guess that's all so that the database can override cdb, cos it's the opposite of a performant thing to do. [02:36:39] Just a side note, it looks like the-wub and spatton are using the fallback language code paths during the kind of previewing they do [02:36:57] they also fetch 'en', but like to preview in many languages [02:37:24] ok getMsgFromNamespace is what calls load() [02:38:59] whuu... MessageCache::$mCache has a magic value "!NONEXISTENT for negative caching." [02:39:32] err "!NONEXISTENT". I put the rest in quotes to make it more clear that I didn't write something I don't understand ;) [02:40:14] I really want to get my hands on $where for debugging. [02:40:33] $info = implode( ', ', $where ); [02:40:33] wfDebugLog( 'MessageCache', __METHOD__ . ": Loading $code... $info\n" ); [02:40:40] awight: nice! yeah interesting. So the non-existence of a key can be cached, right? [02:40:42] can we turn that on perhaps? [02:41:57] looks like there's a lot of prod debugging options there... [02:42:41] awight: the fishy thing about chasing down ghosts in MessageCache is that I didn't see any changes in that file that coincide with the appearance of the bug [02:43:19] I don't think it has to come from that file itself [02:43:29] yeah [02:43:45] reading between the lines of ^ Krinkle's comments, I wonder if there might not also be an issue with old banners being served longer than they should sometimes [02:43:53] and we just don't notice it [02:44:01] We could also determine that by looking at response_size [02:44:10] that could also throw a wrench into some A/B tests [02:44:27] Interesting comments around loadFromDBWithLock [02:44:54] what does "non-blocking" cache slam protection look like? [02:45:28] wat? [02:46:02] I'm trying to understand the comment at MessageCache line 323 [02:46:16] especially, "Note that the DB slam protection lock here is non-blocking." [02:46:33] What does that mean for threads that aren't getting the fresh stuff? [02:46:35] Also interesting to note that $useDB doesn't actually force a DB lookup. It really means, don't use the static file thing. Even with that true, the getfromNS thingy uses cache [02:47:44] I guess we still recover from failure in loadFromDBWithLock, so that's probably not the bug [02:48:24] I thought useDB really did live up to its name [02:48:25] awight: does the message object that we get in CN store any info about where it looked for stuff? some additional tidbits we could throw in the log? [02:48:38] hmm! [02:48:44] it's a ruse [02:48:53] also, MEssageCache is a singleton, so maybe we can poke it [02:49:13] I don't see any useful state vars though [02:49:15] huh also [02:49:27] $where seems the most promising for that [02:50:01] Here's another atrocious idea--what if we call msg->get in a loop and give it 3 attempts. [02:50:14] We can also enable MessageCache debug logging for the 2nd and 3rd attempts [02:50:24] * awight dirty [02:50:30] *plays dirty [02:50:31] hahahah [02:52:02] # This used to throw an exception, but that led to nasty side effects like [02:52:04] # the whole wiki being instantly down if the memcached server died [02:52:38] anyone know how to fly this thing? [02:54:52] awight: can't we just enable or ask for the enabling of the MessageCacheError debug log? [02:56:57] I'd like to [02:57:20] I'm also noticing that loadFromDB is public, and includes an ad-hoc debugging struct [02:57:55] AndyRussG: also enable MessageCache debugging IMO [02:58:13] have you tried to reproduce in deployment-prep (aka beta cluster)? [02:58:34] Unfortunately it's rare and we don't know the steps to reproduce yet [02:58:37] bd808: no not yet [02:59:42] a "log all the things" and hope to find a hint in prod will melt all kinds of stuff [03:00:24] if you can detect in your code then the best bet would be to add a new logging channel just to debug this and add code to dump the state that you need to look at when things go wrong [03:00:25] yeah, I bet MessageCache::get is called from more than one place [03:01:40] the reason that db lookup (actually cache, then db) for messages comes before the cdb is that is the only way to make site message overrides work [03:01:51] yeah that makes sense [03:02:03] bd808: btw, it's not possible to change $wgMonologChannels after initialization, is it? [03:02:15] the cdbs are just the i18n/*.json files smashed in to a better format for lookup [03:02:21] It would be nice to call the offending function a second time, with more debug [03:02:37] it is possible, but gross :) [03:02:43] That's my middle names [03:03:06] you'd have to modify the live object graph of Monolog [03:03:18] and it would be very very WMF prod specific code [03:03:23] I'll find another way ;) [03:05:00] our whole message cache/lookup system is crufty in various ways. the cdbs are a perf hack that may or may not actually be needed in the modern stack [03:05:27] nobody wants to dig into it bad enough to build a better mousetrap yet though [03:05:35] bd808: does anyone reproduce WMF messagecache stuff on their local dev box? would vagrant memcached role work? [03:05:40] I can't tell if I'm tempted by the cheese or not [03:06:19] * AndyRussG goes to fridge [03:07:22] I think we have a role that uses cdbs. the memcached role shouldn't really be needed. mw-v has redis for cache by default and for this it should be basically the same [03:07:41] AndyRussG: side note--here's the log of when the banner content was changed. [03:07:45] https://meta.wikimedia.org/w/index.php?title=Special:CentralNoticeLogs&offset=&limit=500&log=bannercontent [03:08:25] It was created at 21:42, and updated at 21:43. The window of it working correctly was 22:11-22:21, for reference. [03:08:51] c. 30 minutes, but I'm not going to get superstitious about that. [03:09:11] huh. I need to pull webrequest data for that 21:00 hour. [03:09:46] (running) [03:11:29] all 87 bytes for that hour. [03:14:54] more accurate timestamps fwiw, the banner content was created at 20160927214240 and content changed at 20160927214342 [03:15:29] I don't see magic numbers. [03:18:25] nasty--the public methods I wanted to prod on MessageCache rely on stateful side effects of other methods to have already happened [03:18:28] awight: not sure exactly how and where the 10-minute cache thing is expected [03:18:55] hmmm [03:19:08] I think all that controls is that varnish is ready to let us recalculate the response for SpecialBannerLoader after that interval [03:20:24] Shouldn't that mean that we shouldn't expect logs for a given banner/language/project/some-other-URL-param combination more than once every 10 min? [03:20:35] yeah [03:20:39] K [03:20:49] * awight peeks between fingers at webrequests [03:22:02] yeah and that http responses sizes in webrequests for identical bannerloader URL should only vary once every 10 minute [03:23:40] k I'm following, so the 10 minutes doesn't mean anything about the *cause* of the missing message, it's just how it would look if anything were fallible about the recalculation [03:27:48] awight: yeah I think so [03:28:58] So we found the 22:21:07 request where it was recalculated as bad. Lemme look for the first request after banner creation, where it threw an exception [03:29:31] 2016-09-27T21:43:59 [03:30:52] 79s after the banner was created, 17s after the modification. Not looking very racy to me [03:31:37] But after that point, the bad content was varnish-cached. lessee the next request where it was rebuilt [03:32:40] hmmm [03:33:14] They come closer together than 10m [03:33:26] Pretty much every request to the preview URL [03:37:09] awight: but are the URLs identical? [03:38:07] Trying to see now, I'm filtering the requests for cache misses [03:38:58] * awight gets stuck cross-eyed [03:43:16] AndyRussG: It just got weird. I have no cache misses between 22:00 and 22:49?? [03:44:03] that's the haunting hour [03:44:06] awight: [03:44:07] https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-7d,mode:quick,to:now))&_a=(filters:!((%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%27logstash-*%27,key:_type,negate:!f,value:mediawiki),query:(match:(_type:(query:mediawiki)))),(%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%27logst [03:44:09] ash-*%27,key:level,negate:!t,value:INFO),query:(match:(level:(query:INFO,type:phrase)))),(%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%27logstash-*%27,key:level,negate:!t,value:DEBUG),query:(match:(level:(query:DEBUG)))),(%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%27logstash-*%27,key:channel,negate:!t,value:runJobs),query:(match:(channel:(query:runJobs,type:phra [03:44:11] se))))),options:(darkTheme:!f),panels:!((col:7,id:Event-Level,panelIndex:3,row:3,size_x:3,size_y:3,type:visualization),(col:10,id:Top-Domains,panelIndex:5,row:6,size_x:3,size_y:3,type:visualization),(col:1,id:Trending-Messages,panelIndex:6,row:9,size_x:12,size_y:4,type:visualization),(col:10,id:MediaWiki-Versions,panelIndex:8,row:3,size_x:3,size_y:3,type:visualization),(col:1,id:Trending-Backtrace-Fil [03:44:13] e,panelIndex:10,row:13,size_x:12,size_y:3,type:visualization),(col:1,id:Top-Channels-table,panelIndex:11,row:3,size_x:6,size_y:3,type:visualization),(col:1,id:Top-Hosts-table,panelIndex:12,row:6,size_x:6,size_y:3,type:visualization),(col:7,id:Top-Wikis-table,panelIndex:13,row:6,size_x:3,size_y:3,type:visualization),(col:1,id:Events-Over-Time,panelIndex:14,row:1,size_x:12,size_y:2,type:visualization),( [03:44:15] col:1,columns:!(level,channel,host,wiki,message),id:MediaWiki-Events-List,panelIndex:15,row:16,size_x:12,size_y:16,sort:!(%27@timestamp%27,desc),type:search)),query:(query_string:(analyze_wildcard:!t,query:CentralNotice)),title:mediawiki-errors,uiState:(P-3:(spy:(mode:(fill:!f,name:!n)),vis:(legendOpen:!f)),P-5:(spy:(mode:(fill:!f,name:!n)),vis:(legendOpen:!f)),P-6:(spy:(mode:(fill:!f,name:!n))))) [03:44:17] aaaarg nice URL [03:44:19] onesec [03:44:52] http://tinyurl.com/hfa9dgb [03:46:00] {"id":"V@01GApAEKsAAFI6lRQAAAAK","type":"MWException","file":"/srv/mediawiki/php-1.28.0-wmf.20/includes/Revision.php","line":1555,"message":"Content of revision (CNBanner:B16WMDE_01_160927_size2-no-interval-message/en) could not be loaded for validation! [03:48:02] it's beautiful! [03:48:37] I guess that's our other bug, https://phabricator.wikimedia.org/T146880 ? Could be related? [03:50:53] I can't seem to run this interface tho, getting "visualization not available" [03:51:21] huh [03:51:39] just go to the mediawiki-errors dashboard and input "CentralNotice" into the search bar [03:52:03] That worked, but zooming in to my hour of interest is giving no results etc [03:52:16] Ah just put last 7 days [03:52:23] But I think this is the actual exceptions bug [03:52:49] not the core-or-infrastructure-suspected banner bug [03:53:47] yah translated messages... it seems to be missing $revID [03:54:14] includes/Revision.php:1554: "Content of revision $revId ($prefixedDBkey) could not be loaded for validation!" [03:57:37] AndyRussG: ignore me about the lack of cache misses, I had forgotten to sort the logs [04:02:41] ah K no worries [04:03:11] awight: I'd like to continue but have morning in 8 hours [04:03:38] AndyRussG: +1 [04:03:56] This doesn't mean anything, but for some reason these lists don't match up perfectly: https://tools.wmflabs.org/paste/view/975e3056 [04:04:00] but I'm done for the night too [04:04:10] Thanks, fun investigation! [04:04:13] huh interesting [04:04:17] awight: it's been fun!!! thx much!!! :) [14:26:52] Krenair: Do you have more of a backtrace for that? I don't see where it might be being called from, unless you have BlueSpiceExtensions, Html2Wiki, SemanticForms, or SemanticWatchlist installed. [14:42:54] Krenair: Oh, I see https://gerrit.wikimedia.org/r/#/c/313518/ fixed it already. [20:24:12] So after I whined at FF for telling me to delist the WikimediaDebug extension I got a new reviewer who says they don't think it should be delisted and will be glad to get it reviewed :) [20:25:37] bd808 even microsoft edge doin't do that [20:25:43] just convert chrome into edge [20:25:51] with a simple app in the windows store which is free [20:26:22] bd808 managed https://github.com/paladox/EdgeWikimediaDebug to create that by converting the chrome one from chrome to edge :) [20:26:35] with a few simple changes the conversion script did [21:10:54] bd808: style around by any chance? [21:11:37] bd808: would like to log using something like: $logger->info( "{type} foo", [ 'type' => 'somekind' ] ) [21:11:54] so the log would get in logstash a metadata 'type' == 'somekind' [21:12:02] but it seems we already use 'type' internally in logstash [21:14:10] yeah, don't use words that clash with these -- https://github.com/Seldaek/monolog/blob/master/src/Monolog/Formatter/LogstashFormatter.php#L64-L101 [21:14:29] ahhhh [21:14:41] datetime, @timestamp, @version, host, message, type, channel, level, extra, context [21:15:08] thank you :] [21:16:03] bd808: and another one. Given: "some message", [ 'command' => 'nuke' ] [21:16:18] looks like handling of HttpError is broken :( [21:16:23] since "command" is not part of the message template, would it still be passed forward to the log handlers / end up in logstash ? [21:16:26] anybody knows anything aboit this? [21:16:54] hasharAway: yes, it would. the context is passed no matter if it's used as a replacement data or not [21:17:11] bd808: that is wonderful !!! [21:17:15] SMalyshev: nope. Logging of HttpError or something spookier? [21:17:41] bd808: spookier... if you make HttpError it ignores the code and creates 500 error [21:17:58] which breaks cases where it should be 404 etc. [21:18:39] I *assume* it was supposed to be handled not in a way that any HttpError produces 500 with nasty message [21:18:43] hmmm... AaronSchulz was cleaning up some of the error handler code a week or two ago [21:19:07] that might be the cause :) [21:19:55] summoning AaronSchulz :) [21:20:02] SMalyshev: I'd start tracing with the commits that introduced includes/exception/MWExceptionRenderer.php [21:20:32] looks like there are 6, but the first is the place to start looking [21:20:53] looking [21:22:40] looks like HttpError->report knows how to handle codes, but MWExceptionRenderer ignores that [21:22:54] and never calls $e->report [21:25:03] not sure if I can figure out how to fix it without Aaron though... any idea if he's in/around today? [21:28:35] I haven't seen him active for hours. [21:32:42] hasharAway: I added some notes about the "reserved" words to https://www.mediawiki.org/wiki/Manual:Structured_logging#Add_structured_data_to_logging_context [21:32:59] not that anyone will read that, but it's worth a shot [21:36:11] SMalyshev: I probably merged the breaking changes, so I'll see if I can figure anything out [21:36:11] https://twitter.com/badedgecases/status/781928309059969029 [21:36:12] have you filed a bug yet? [21:36:21] https://phabricator.wikimedia.org/T147098 [21:36:49] looks like it also breaks logging of all HttpError's and that's why it wasn't in logstash [21:36:57] fun! [21:37:21] legoktm did the +2, but I +1'd. SHould be fixable [21:37:49] yeah the problem is MWError::report seems to not be called anymore [21:37:58] and HttpError does all the job there [21:38:10] bd808: I do read doc. When pointed to them :] [21:38:20] bd808: that wiki page has actually been quite helpful [21:40:52] SMalyshev: yup. looks like MWExceptionRenderer::output needs special handling for HttpError like is has for DBConnectionError [21:43:01] bd808: I could make a simple patch that just special-cases HttpError but I think there should be a better fix since just ignoring existing API for MWException is probably not a good idea, there may be more trouble lurking there [21:45:48] SMalyshev: yeah. there are others that this looks like it probably breaks [21:47:59] the fact that it outputs 500 before even the handling starts is totally broken :( [21:51:20] MWExceptionRenderer::output seems to be always called with AS_PRETTY. I wonder why the param then [21:51:40] ohhh [21:52:05] I have some ongoing madness about the handleException() that always uses AS_PRETTY [21:52:13] unless you get $wgCommandLine = true; [21:56:49] good luck with exceptions :(( [21:57:02] gotta sleep. Have a good week-end all [23:41:34] SMalyshev: did bryan's patch fix the problem you had?