[15:22:40] Krinkle: on the off chance that you are working today, please have a look at https://phabricator.wikimedia.org/T235188 [15:23:13] Looks like a bug in WANObjectCache has been causing cache corruption. [15:27:31] duesen_: master for next week or prod too? [15:27:55] Krinkle: production [15:28:13] mainly TWN, but apparently also on WMF sites [15:28:31] duesen_: I saw a patch earlier that fixes a multi key pregen bug [15:28:35] Is that related? [15:28:38] hot fix is up for swat, a better fix lined up for review [15:28:40] I'm on mobile currently [15:28:49] the root cause is not entirely clear, and the code is hard to test [15:29:17] Krinkle: yes, that's the bug. appears to cause large swcale corruption on TWN [15:29:26] I believe a getMulti /batch was recently worked on in blob store [15:29:26] the multi-get methods had so far been unused [15:29:32] Was that the trigger? [15:29:35] so the hot fix simply makes them unused again [15:30:07] Krinkle: yes, blob store was apparently the first code that actually made use of multi-get. [15:30:08] I saw it fly by but not involved [15:30:13] Okay [15:30:20] Undoing that makes sense then [15:30:31] yea. [15:30:33] Can fix woc later [15:30:48] but WANObjectCache needs fixing. or the methods should be removed. [15:30:54] And would prefer not to disable caching out right as I have no idea what impact that would have here [15:31:08] this is just for bulk operations [15:31:15] Agreed but not today :) [15:31:16] maintenance scripts only [15:31:42] I thought everything went through the new code oath [15:31:45] Path * [15:31:47] Krinkle: no urgent action needed from you today. swat should take care of the UBN. I just wanted to make sure that WOC gets fixed [15:32:38] everything goes through the new code path for fetching, but not for caching. that happens one level up. the old code uses single-key getWithSetCallback as before [15:33:14] Krinkle: I'm having a lot of trouble wrapping my head wround the WOC code, or finding ways to test for this problem. [15:33:31] Sure yeah place in our inbox and we'll get to it. o suggest maybe cpt code it and we review to provide greater familiarity here gradually [15:33:50] i coded... *something* [15:33:57] And yeah questions like that as well in Gerrit :) [15:34:06] It has fairly extensive tests [15:34:20] But yeah not trivial [15:34:32] yes, but i don't understand them :) [15:34:32] i dropped it in the performance team's inbox [15:34:40] Cool [18:49:57] duesen: Hm.. so I'm a little confused. The current patch removes caching of the code path, I was expecting it to restore the previous non-batch behaviour though. [18:50:03] Or was there none? [18:50:27] A connection to the commit that introduced the use of it to BlobStore would help [18:52:00] Krinkle: the non-batch behavior is in getBlob. getBlobBatch is new. [18:52:12] You are right, a link to the original patch would have been good. [18:52:21] will add it in a comment [18:53:12] duesen: ok, will await that comment. I understand the batching is new, but what I see is not removal of batching but removal of caching. [18:54:42] oh hm, the patch didn't even go in? [18:54:46] I'm confused [18:58:52] Krinkle: I updated https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542328/ [18:59:01] I hope jenkins will pick it up and merge it now [18:59:42] Krinkle: basically, we can't easily go back to unbatched access because we were never using unbatched access. Translate was just doing its own batching, based on db fields that are going away Really Soon Now [19:00:05] But the problem isn't with the batching itself, it's with the multi-get cache code [19:01:32] duesen: yeah, I understand the problem isn't with batching itelf, but the bug is triggered by the combination of batching+caching. I assumed the way it implemented "before: per-blob fetch with cache; after: batching with caching" was by making getBlob use getBlobMulti [19:01:37] However, I see that is not the case [19:01:46] They both work independently and duplicate the caching logic [19:02:07] which explains why you were able to "remove caching" without regressing the primary way the class is used today [19:02:22] because that still has its own caching, it isn't a shortcut to the batch method [19:02:48] exactly [19:03:17] this was a point of criticism during code review, but turned out to be a wise choice :) [19:03:23] although I see now that getBlob() does internally use the pre-existing batch fetch method fetchBlobs [19:03:31] so in a way it does, which is more confusing imho. [19:03:45] anyway, all good :)