[01:26:32] TimStarling: I believe the general hypothesis, and I've heard Aaron describe this numerous times, that in a multi-dc scenario, there's not a reason to assume a previous value exists in both DCs, so MW assumes upon filling an existing checkkey'ed key that it may have changed recently. [01:26:53] for the first iteration of those kinds of keys. [01:27:36] In practice this may happen quite often of course if things fall out of cache, and I imagine it's fairly rare in practice. But there's also not an obvious way that such scenario would self-correct. I think if there's a self-correcting hypthesis, then the rest is just optmisation that we can kill off if not proven. [01:31:18] Maybe we could instead e.g. shorten the TTL to something like 10 seconds the first time (for the value, not the check key, otherwise the scenario keeps repeating itself). [01:31:22] why would a missing check key imply that the value does not exist in both DCs? [01:32:05] why is it only check keys that imply this? If the check key option is not given, nothing like this happens [01:33:12] I need a moment to remember. Note I'm not defending it per-se, but I do want to try to recall and understand the edge case for what it is. [01:33:24] I believe with normal keys there is no issue because we write tombstones. [01:33:44] so under a cross-dc race, WAN won't simply overwrite it and keep it. [01:34:20] MW does a merge() when storing values, and also uses add() when it's a known new value to avoid overwriting a tombstone that arrived cross-dc [01:34:46] I think there's a scenario we can construct where for a check key, this means an outdated value is cached long-term. [01:35:02] becuase the tombstone wouldn't be under every N key variants, only the check key. [01:35:30] I'm having trouble imagining the scenario right now though. [01:38:10] If you could write it up on Phabricator tomorrow, that would be helpful [01:38:21] assuming your workaround works, there's no rush [02:23:22] TimStarling: I think I've convinced myself it's fine to remove. I'll write it up later, but feel free to get a head start on drafting whatever removal you think makes sense. [02:29:02] In a nut shell: This logic comes from the initial commit which is importantly before we adopted mcrouter for broadcasting. That combined with the fact that we use merge() and add() appropiately both for the value key and the check key, I think means there are no situations in which a cross-dc race from either empty or non-empty scenario on either side can result in a stale value winning over a new one. I considered tnhre scenarios: 1) [02:29:02] Codfw empty and Eqiad going from value=1, check=1 to value=tombstone,check=2 and then this replicating while Codfw warms up value=1. and 2) Codfw calling its cache miss before broadcast, and finishing its value=1 cache set after the broadcast. I expect this to have its cache SET denied by merge() and its check key ADD denied by memcached. [02:30:48] I considered variants of 1 where it get-sets before and after the broadcast. If before it's correclty losing to the broadcast. If it's after, then tombstone prevents it storing a normal value (we get a very shorted lived interim value I believe) [02:31:07] That seems pretty robust. [02:32:32] What this does not consider, of course, is broadcasts getting lost on the network. That's in general something we don't try to add complexity for as it seems kinda endless. What do do is that some seemingly important places adapt their TTL based on a last modified timestamp so as to cache new data for less long initially. But that's aside from all this, and none of that is specific to check key scenarios. [18:17:03] Ah, well, protection doesn't matter since that inserts a revision and parser output varies by revision ID naturally, it doesn't need to do anything for it to cause a cache miss on page views. [18:17:25] But if e.g. someone varies category or page content more significantly on protection state, that could be interesting, let me try that. [18:23:14] Huh, indeed. [18:23:18] `{{PROTECTIONLEVEL:edit}} = {{PROTECTIONLEVEL:edit}}` [18:23:19] `{{#ifeq:{{PROTECTIONLEVEL:edit}}|sysop| [[Sysop]] [[Category:Sysop]] }}` [18:23:43] Viewing that, and then protecting the page as sysop, and then refreshling WhatLinksHere/Sysop and Category:Sysop, it doesn't self-correct. [18:23:51] Logs show no update or job queued [18:24:27] This seems too obvious, given our community and this being a fairly obvious example that protection templates would make use of and rely on, to have been a bug "forever" [18:24:39] Perhaps the refactoring to DerivedPageUpdator and RevisionStore lost this at some point along the way [18:34:42] that is curious re:page protection. In terms of timeline, we noticed a problem and added the hook on UploadComplete in apr 2021 (T278209). I'm not clear on when the timeline of those other refactors was, i took a quick glance at the git history for DerivedPageUpdator but not sure when it was refactored. There was a big on for MCR in 2018, [18:34:43] T278209: MediaSearch results not updated 12 hours after overwriting image - https://phabricator.wikimedia.org/T278209 [18:46:27] ebernhardson: ref T344285 [18:46:29] T344285: RevisionStore::newNullRevision (or equivalent higher-level method) missing LinksUpdate - https://phabricator.wikimedia.org/T344285