[00:00:29] I see Krinkle added a subtask
[00:00:40] twentyafterfour: this one, or a different one that I removed again?
[00:00:46] twentyafterfour: can you double-check I got the 14-digit timestamp right?
[00:00:46] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/631318/
[00:00:56] I tend to blind myself in those
[00:01:02] I'll check
[00:02:33] shouldn't it be 20200930? was it already after midnight UTC by the time this happened?
[00:02:34] twentyafterfour: dirty diff wikiversions.json on deploy1001
[00:02:52] twentyafterfour: you're right, thanks
[00:02:54] Krinkle: yeah, I never committed and pushed the rollback, sorry
[00:03:11] want me to fix that now?
[00:03:37] CR for you
[00:03:48] I'll wait this time :)
[00:03:59] twentyafterfour: yeah, please undo or commit, however it should be
[00:06:08] man, gerrit is running real slow
[00:06:44] oops, I did a git pull
[00:06:49] aborting merge
[00:07:28] it's taking forever to resolve deltas on what should be a tiny change to wikiversions
[00:07:31] not sure what's happening
[00:08:17] ok, I aborted my push
[00:08:30] yeah, that doesn't look like the right tree
[00:09:21] Krinkle: not sure how to fix this?
[00:09:39] twentyafterfour: git reset origin/master perhaps?
[00:09:44] my commit was just to put the repo in sync with the state on disk
[00:09:45] then redo whatever you want to change
[00:09:47] ok yeah, that's fine
[00:09:59] this will default to keeping the dirty stuff in staging
[00:10:11] git reset --hard origin/master will undo it for real and keep only what's in the upstream repo
[00:10:36] ok, so it looks like your second patch with the date has a merge conflict?
[00:10:46] oh, not anymore
[00:11:11] awaiting your push/merge
[00:11:49] wtf, it won't let me push
[00:12:03] ! [remote rejected] HEAD -> refs/for/master/1.36.0-wmf.11 (branch master/1.36.0-wmf.11 not found)
[00:12:05] error: failed to push some refs to 'ssh://twentyafterfour@gerrit.wikimedia.org:29418/operations/mediawiki-config'
[00:12:13] I generally don't like editing/pushing from the deploy host. I've always preferred local commits, +2 first with tests, and then pull/deploy. Perhaps changing wikiversions could be a simple command in the local bin/ directory, like mwexpanddblist, that we run locally and then push as a normal commit.
[00:12:29] also doesn't require storing gerrit credentials in prod, which seems preferable
[00:12:31] I prefer that too, but this was an emergency
[00:12:36] yeah
[00:12:38] fair :)
[00:12:47] but we don't have this for non-emergency either, right?
[00:12:52] faster rollback if you don't have to wait for gerrit+jenkins
[00:13:02] right
[00:14:44] I can repro on commons /w/index.php?title=Image:SouthAfricanStub.png&action=render&uselang=he via mwdebug2001
[00:14:48] nobody purge that
[00:15:14] nice
[00:17:11] Cached time: 20201001001638
[00:17:13] still happens
[00:17:16] this is a newer entry
[00:18:15] twentyafterfour: "master/1.36.0-wmf.11"?
[00:18:21] Should that not be wmf/1.36.0-wmf.11?
[00:18:34] uh
[00:18:45] Oh wait, nm
[00:18:50] I see the full context now
[00:18:59] the version was just a topic
[00:19:05] Yeah, gotcha
[00:19:24] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002
[00:19:35] OK, I don't know what's going on.
[00:19:58] I was able to reproduce the warning after the config patch applied there
[00:20:03] and the cached entry is from after the rollback completed
[00:20:13] wth
[00:20:20] https://commons.wikimedia.org/w/index.php?title=Image:Soy-whey-protein-diet.jpg&action=render&uselang=es specifically
[00:20:56] uh
[00:20:57] so it looks like wmf.10 is producing new corruptions
[00:21:06] the page looks broken but that's normal
[00:21:09] this is a raw call without a skin
[00:21:35] the timing really coincided with the rollout of wmf.11 to group1 pretty perfectly ... I don't know what else it could be?
[00:21:41] wmf.10 on mwdebug2001 did a fresh parse and produced the same error
[00:21:52] weird
[00:22:04] and it's unlikely to have opcache corrupted here in the same exact way again
[00:22:16] maybe something that was backported right before the train
[00:22:34] Maybe
[00:22:40] or maybe it's not the parser cache that's the issue but something in memcached
[00:22:42] and somehow it just didn't show up on kibana until I sync'd? not sure how that would be
[00:22:50] you guys were talking about BagOStuff/memcached earlier
[00:22:51] hmm
[00:22:57] is there a separate key involved for something here?
[00:23:10] I thought CacheTime was a sub-object of ParserCache
[00:23:17] RoanKattouw: are you caught up more or less?
[00:23:34] I was on a call, just came back
[00:23:37] $optionsKey = $this->cache->get(
[00:23:37] $this->getOptionsKey( $wikiPage ), BagOStuff::READ_VERIFIED );
[00:23:37] if ( $optionsKey instanceof CacheTime ) {
[00:23:42] OK. I understand this better now
[00:23:49] I had no idea this code existed.
[00:24:12] wait, no, this cache is the parser cache
[00:24:13] it's not memc
[00:25:09] but the error happens in getKey()
[00:25:21] before RejectParserCacheValue runs in get()
[00:25:42] * twentyafterfour gets coffee, brb
[00:25:53] It still doesn't explain, however, how it is that an hour after we rolled back, new bad CacheTime objects were still being put into the parser cache
[00:26:08] but at least it explains why the config patch will not be able to purge that
[00:30:41] Relevant branch diff: https://github.com/wikimedia/mediawiki/compare/wmf/1.36.0-wmf.10...wmf/1.36.0-wmf.11#diff-a0afa6d40f254cfbd7c1de51ce0e8bc2
[00:31:06] anyway, we know the wmf.11 code is bad. it made it private, which is not backward-compatible
[00:31:20] the question is now, how do we make MW ignore those entries.
[00:31:37] RoanKattouw: leaning toward a new wmf.10-only patch in core that checks for it being set and returns false if not.
[00:32:48] It made what private?
[00:32:54] Sorry, I'm not caught up on the issue here
[00:34:10] RoanKattouw: see the wmf.10..11 diff above. CacheTime::mUsedOptions was made private. Which is not backward-compatible with wmf.10.
[00:34:25] also, even while on wmf.11 there were other issues with it
[00:34:28] we rolled back
[00:34:41] now we have 2 hours of broken parser cache entries causing 100s of errors every minute
[00:35:15] I tried to purge them with RejectParserCacheValue in wmf-config, but that didn't work because the failures are from before that hook runs.
[00:36:15] also, there is still the unanswered question of how it can be that new broken versions of this cache key were persisted to the parser cache after we rolled back. I kind of want to know how that is possible, because if we do add an isset() there and cast them to expired/false, and if it is still happening with currently deployed code and newly parsed pages, that would effectively disable the parser cache
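A minimal sketch of the kind of isset() guard being discussed for wmf.10's ParserCache::getKey(), building on the snippet quoted above. It is illustrative only and assumes a plain "treat as expired" fallback; it is not necessarily the change that ends up in the actual patch:

    $optionsKey = $this->cache->get(
        $this->getOptionsKey( $wikiPage ), BagOStuff::READ_VERIFIED );
    if ( $optionsKey instanceof CacheTime ) {
        // Entries written while CacheTime::mUsedOptions was private (wmf.11)
        // unserialize on wmf.10 without a usable public mUsedOptions.
        // Treat them as expired so the page is re-parsed instead of passing
        // null around and triggering warnings downstream.
        if ( !isset( $optionsKey->mUsedOptions ) ) {
            return false; // behaves like a cache miss
        }
        $usedOptions = $optionsKey->mUsedOptions;
        // ... continue building the exact parser cache key from $usedOptions ...
    }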
[00:38:19] Oh, so you have evidence of new PC entries being created without mUsedOptions while wmf.10 was running?
[00:38:50] Indeed
[00:38:50] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002
[00:38:51] The diff link didn't highlight anything for me, it's just showing me a commit list
[00:39:06] RoanKattouw: ah, click on "Files" then ctrl-f to "parser/"
[00:39:15] I assumed the permalink pointed there
[00:39:36] This Kibana URL shows wmf.10 (old version, all wikis now) and me accessing a commons URL that triggers the issue
[00:39:47] and view-source shows it was cached a few minutes before I made that request
[00:39:56] an hour after we rolled back already
[00:40:10] it was host:mwdebug2001, so it was a pc cache miss
[00:40:26] visible still at https://commons.wikimedia.org/w/index.php?title=Image:Soy-whey-protein-diet.jpg&action=render&uselang=es
[00:40:36] Parsed by mwdebug2001
[00:40:36] Cached time: 20201001001638
[00:40:43] I don't understand how that's possible.
[00:43:44] untested patch - https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631324/
[00:43:46] Hmm
[00:43:57] again, if somehow new entries are made with wmf.10, this would be a bad patch
[00:43:58] I guess we don't have a reliable way of finding out what was actually on mwdebug2001 at that time
[00:44:15] It was wmf.10 per the stack trace
[00:44:21] I had just scap-pulled the wmf-config patch there
[00:44:34] for rejecting entries from the 19:00-21:00 timeframe
[00:44:44] Oh right
[00:44:58] Yeah, your patch makes sense, and also yes, it would be bad if wmf.10 still generated entries like these
[00:45:05] well, I suppose the stack trace isn't definitive per se, since it's on cache-access
[00:45:12] so there really was an opcache corruption coinciding with this, right?
[00:45:26] I don't actually know how that came into being either. I did open it multiple times
[00:45:30] it certainly seemed to be an opcache bit-flip corruption that caused the majority of errors
[00:45:51] yeah, those were fatals for various Math and Wikibase classes not existing.
[00:45:56] OK, so here's what I propose
[00:46:08] Let's search in logstash to see if there are other PC corruptions like this
[00:46:24] https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors is producing 2000+ per minute
[00:46:25] idk how easy it will be to distinguish wmf.10-generated ones from wmf.11-generated ones though
[00:46:29] Oh yikes
[00:46:30] but that is cache-access, not necessarily a new parse
[00:46:54] we can go through a few URLs with logged-out/canonical preferences and see if some are from Oct 1
[00:47:20] Yeah, I guess, but we won't easily be able to prove that your example is the only one that was generated by wmf.10
[00:48:06] https://en.wiktionary.org/wiki/luhodluhod / Parsed by mw2282 / Cached time: 20200930210119 /
[00:48:07] We could try running your patch for a little while and see if it wreaks havoc, but that would also be hard to tell. We'd expect to see a bunch of parser cache misses early on; the real problem would be if the same page keeps missing
[00:48:23] this is the generator of the cache-access
[00:48:32] the timestamp is from when it was wmf.11, so this one makes sense
[00:48:32] Right
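For context on the wmf-config patch mentioned above (the one rejecting entries from the 19:00-21:00 timeframe), a RejectParserCacheValue handler along these lines would do it. This is a rough sketch assuming the hook's standard signature; the cutoff timestamps are illustrative and it is not a copy of the deployed change:

    // In wmf-config (e.g. CommonSettings.php): reject parser cache entries
    // saved during the window when the bad wmf.11 code was serving group1,
    // so they are treated as misses and re-parsed.
    $wgHooks['RejectParserCacheValue'][] = function ( $value, $wikiPage, $popts ) {
        $cachedTime = $value->getCacheTime(); // TS_MW, e.g. "20200930193000"
        if ( $cachedTime >= '20200930190000' && $cachedTime <= '20200930210000' ) {
            return false; // reject: forces a fresh parse
        }
        return true;
    };

As noted above, this alone was not enough here, because the warnings are raised in getKey() before get() ever reaches this hook.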
[00:48:42] What was the time window that we had wmf.11 deployed?
[00:48:52] it ended 19:30 - 21:05 is "bad"
[00:48:57] started
[00:49:01] Gotcha
[00:49:09] So this was a few minutes before the rollback
[00:49:37] So if your fears are true, we should be able to find more articles that have a broken ParserCache entry that was saved before 19:30 or after 21:05
[00:49:49] commons /w/index.php?title=Image:Cyanid-Ion.svg&action=render&uselang=sr - another one that makes sense
[00:50:03] RoanKattouw: that's right.
[00:50:10] I assume only after, because before we had no PHP errors
[00:50:25] I can stage the patch on mwdebug2001 and just browse some random urls
[00:50:32] see if the log message gets triggered
[00:50:56] and yeah, we could do a quick deploy and stand by
[00:50:58] so the first scap, which was for group0, finished at 19:12:42
[00:51:48] Oh hmm, yeah, maybe let's try that
[00:51:55] I'll gather a few URLs that are currently broken
[00:52:23] We'll stage your patch on mwdebug2001, then purge those pages on mwdebug2001, and check if the issue goes away
[00:52:35] ack
[00:52:51] I've force-merged the patch as it would take 25-35 min otherwise
[00:52:56] will let jenkins run while we test
[00:53:09] If we can get a dozen confirmed broken pages to be fixed by rerendering them on mwdebug2001, and none of them stay broken, then I will be comfortable enough to deploy the patch
[00:53:19] We'd still need to carefully watch the parser cache miss rate and be prepared to revert
[00:54:09] Live on mwdebug2001.codfw
[00:54:17] * Krinkle looks at Grafana
[00:54:27] Uhh wait, I found another broken one that was generated by wmf.10
[00:54:29] https://commons.wikimedia.org/w/index.php?title=Image:Matabeleland_-_Zimbabwe.svg&action=render&uselang=ang
[00:54:42] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-2d&to=now
[00:55:31] RoanKattouw: so, it looks like there are two parser cache keys involved with a parser cache fetch, is that right? One that stores a small CacheTime and one that stores the ParserOutput?
[00:55:48] Although, maybe not, the time of the error and the time of that cache key are identical
[00:55:53] I don't know why it is this way, but that's the way it is, right?
[00:55:55] That's possible
[00:56:01] Let me look at the code
[00:56:13] could be that the CacheTime is surviving longer or separately for some reason
[00:56:24] and that these are only new ParserOutput objects
[00:56:33] I don't know why they would be 1:many, but if that's the case, it would explain it
[00:57:17] so per Grafana it looks like our hit rate has not suffered thus far, so the theory that these errors are causing *something* to be cached is correct.
[00:57:54] Oh hmm, yeah, that would be interesting
[00:57:58] well, likely anyway. it's also possible that the time window is too small to cause a big drop
[00:58:01] The actual issue is with CacheTime objects, right?
[00:58:05] that's right
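To illustrate the two-key layout being described here (and which the code reading that follows points at), ParserCache::save() writes roughly the following. This is a simplified sketch from memory, not a verbatim copy of the wmf.10 code; $expire, $cacheTime and $parserOutputKey stand in for locals that method builds:

    // A small "options" key records which ParserOptions were actually used,
    // so that getKey() can reconstruct the exact output key on the next fetch.
    $optionsKey = new CacheTime;
    $optionsKey->mUsedOptions = $parserOutput->getUsedOptions();
    $optionsKey->updateCacheExpiry( $expire );
    $optionsKey->setCacheTime( $cacheTime );
    $this->cache->set( $this->getOptionsKey( $wikiPage ), $optionsKey, $expire );

    // A second key stores the full ParserOutput itself, stamped with the
    // same cache time.
    $parserOutput->setCacheTime( $cacheTime );
    $this->cache->set( $parserOutputKey, $parserOutput, $expire );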
[00:59:36] " and timestamp $cacheTime" [00:59:42] cacheTime looks like it is given to both [00:59:56] so upon save() we do generally seem to set both [01:02:50] patch live on mwdebug2001.codfw [01:05:20] https://logstash.wikimedia.org/goto/ab5bedef86f74c9058f22f768f6125f4 [01:05:31] Kibana view on mwdebug2001 reduced to PHP erorrs and ParserCache channel [01:05:40] (given this mesasge rrequires me to enable verbose mode) [01:07:17] I'm rolling out the CommonSettings.php to other servers now first as well [01:07:20] that was still staged [01:07:24] Thanks, I was just trying to figure out how to do that (in logstash [01:07:31] which rejects the 19:00-21:00 range [01:07:49] I tried a couple of known bad articles, purging them on mwdebug2001 fixed the error both on mwdebug and in productoin [01:07:57] Oh OK so that'll help [01:08:05] We could try that first and see what errors remain [01:11:32] I think scap is stuck [01:11:43] twentyafterfour: https://logstash.wikimedia.org/goto/90e5bd41093f68c87e6f0eb428c936a0 [01:11:53] nothing since Finished sync-proxies (duration: 00m 01s) [01:11:59] sync-apaches: 0% (ok: 0; fail: 0; left: 367) [01:12:52] ok now it went through all at once after I scrolled back down in my screen [01:12:54] that's.. weird [01:13:02] process blocked by output? [01:13:04] cli output [01:13:07] anyway [01:13:14] stack.pop() stack.pop() [01:13:29] hmm [01:13:45] did you hit enter a few times and then it worked? [01:13:56] (wasn't that a thing?) [01:14:03] I just scroll back down in the ctrl-a scroll mode thingy in screen [01:14:09] Krinkle: we've had it get blocked before and hitting enter would unblock it [01:14:18] great [01:14:22] I'll remember that :) [01:14:29] never can reproduce it [01:15:01] I'm seeing the expected blib of forced expired entries at https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-6h&to=now [01:15:05] tried and tried to understand what would cause it and couldn't even after studying the source of python stdlib and cpyton [01:15:13] "Miss (rejected)" [01:16:16] twentyafterfour: maybe screen copy/select/scroll buffer creates the circumstances in an easiier to repro way. or maybe not. I haven't seen it happen before, but I also haven't tried scrolling before during a sync I htink [01:17:00] 01:12 UTC is the time of finishing this config sync [01:17:05] e.g. for logstash seeing of errors [01:18:01] 2020-10-01T01:17:34 mw2272 commonswiki /wiki/File:Antonio_Casale_in_%22The_Good,_the_Bad_and_the_Ugly%22,_1966.jpg ErrorException from line 116 of /srv/mediawiki/php-1.36.0-wmf.10/extensions/Math/src/MathHooks.php: PHP Warning: in_array() expects parameter 2 to be array, null given [01:19:53] I saw one for https://commons.wikimedia.org/wiki/File:Giovanni_Paolo_Panini_-_Alexander_the_Great_at_the_Tomb_of_Achilles_-_Walters_37510.jpg too [01:20:12] But when I visited it, I got a pcache timestamp that was later than the error message [01:20:38] And it didn't error again [01:21:32] that one, logged-out, gives