[10:12:04] Krinkle: I understand, but the problem is that this doesn't match what's happening in production. For exactly one RL module on one wiki in one skin (`enwiki:ResourceLoaderModule-dependencies:ext.wikimediaBadges|vector-2022|en`) we had 8 writes per second, non-stop. That means at least 8 reads per second for this specific RL module are reaching the MW appservers. And that's not counting other modules or other wikis, e.g. `ruwiki:ResourceLoaderModule-dependencies:ext.wikimediaBadges|vector|ru` (the same module, but on ruwiki) has an extra 2 per second.
[21:18:05] Amir1: In terms of reads that's not surprising. Plenty of things can cause that, e.g. a long tail of articles with a different set of modules in the stylesheet batch (with and without makeCollapsible, with and without TMH, etc.). My point is merely that it isn't linearly correlated with volume. The CDN cache and Varnish request coalescing handle that. There's also a bunch of query string garbage that various "security" scanners try to inject to bypass caches here and there.
[21:18:34] Again, similar to Thumbor, which is also known as The 404 Generator, since everything that's not garbage is well-cached. What ends up going to Thumbor is largely non-existent URLs.
[21:19:31] `krinkle@stat1011:~$ kafkacat -C -b kafka-jumbo1007.eqiad.wmnet:9092 -t webrequest_text | fgrep -v 'cache_status":"hit' | grep -E 'uri_host":"en\.(m.)?wikipedia.org' | grep 'uri_query":"[^"]*ext.wikimediaBadges' | head -n100 | tee webrequest_nonhit_enwiki_wikimediaBadges.log`
[21:19:50] This gives me less than 1 per second. And that's before we account for skin or language.
[21:20:02] Examples include:
[21:20:04] * `lang=enoG5n2XUX'%20OR%20409=(SELECT%20409%20FROM%20PG_SLEEP(15))`
[21:20:55] In terms of genuine-looking traffic, it's a handful of different article templates/categories and their different bundles. One that includes ext.3d styles, for example.
[21:21:01] Anyway, JustHannah and I found the bug today.
[21:21:13] It turns out there wasn't any lack of sorting or normalization or deduplication.
[21:21:25] Rather, the normalization call, which already happens and in the right place, doesn't handle multiple `../` segments correctly.
[21:22:28] So `RelPath::getRelativePath('/srv/mw/foo/bar/sub1/sub2/../../image.png', '/srv/mw') == 'foo/bar/sub1/../image.png'`
[21:22:47] Bartosz's patch works by accident in that, earlier in the stack, it calls it one more time, thus stripping one more pair.
[21:23:13] Whereas getExpandPath/joinPath handles any/all of them.
[22:10:11] That would make it a DDoS vector, since they all lock the same row, and with 1K req/s easily getting past the CDN (via the stuff you mentioned) you can lock up the whole database.
[22:10:30] Thanks for finding the root cause.
[22:36:16] Not exactly. We do have a memcache lock to dedupe before the connect/write attempt as well.
[22:36:33] So it shouldn't overlap or have any concurrency.
[22:37:12] Any concurrent attempt will no-op.
[22:37:23] (As in, it isn't sent to MySQL in the first place.)
[22:38:31] This was using the core database for many years before we moved it to the main stash. It's been optimised a lot. We could probably do with less now that it is "merely" the MainStash DB, but we still have all those guards in place.
[22:41:32] I'm guessing we didn't have ../../ in any CSS until a few years ago. Or at least not in a module that was loaded on almost all pages.
It's quite common to share images between CSS files and to have each logically grouped, e.g. modules/ext.foo/images/ and modules/ext.foo/components/{bar,quuz}.css.
[22:41:48] But two levels deep is fairly rare for stylesheets.
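
For context on the `../` handling discussed at [21:21:25]–[21:23:13] above, here is a minimal sketch (not the actual RelPath code; the function name is made up) of the difference between stripping a single `segment/../` pair and collapsing dot-segments in a loop, which handles any nesting depth:

```php
<?php
/**
 * Minimal sketch (not the real RelPath implementation): collapse any
 * number of "segment/../" pairs, so paths that CSS references via
 * "../../images/foo.png" normalize to the same string and deduplicate.
 */
function collapseDotSegments( string $path ): string {
	$out = [];
	foreach ( explode( '/', $path ) as $segment ) {
		if ( $segment === '..' && $out && end( $out ) !== '..' ) {
			// Step back over the previous real segment.
			array_pop( $out );
		} elseif ( $segment !== '.' ) {
			$out[] = $segment;
		}
	}
	return implode( '/', $out );
}

// A single-pass strip of one "segment/../" pair would leave
// "foo/bar/sub1/../image.png"; iterating handles both levels:
echo collapseDotSegments( 'foo/bar/sub1/sub2/../../image.png' ), "\n";
// => "foo/bar/image.png"
```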
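And a rough sketch of the memcache-based dedupe described at [22:36:16]–[22:37:23], using the plain PHP Memcached client rather than MediaWiki's actual BagOStuff wrapper; the key format and TTL are illustrative assumptions:

```php
<?php
/**
 * Hypothetical sketch of deduplicating dependency writes behind a
 * memcached lock. Key name, TTL, and function names are illustrative,
 * not the production values.
 */
function maybeWriteDependencies( Memcached $cache, string $moduleKey, callable $writeToDb ): void {
	$lockKey = "rl-deps-write:$moduleKey"; // hypothetical key format
	// add() only succeeds if the key does not exist yet, so concurrent
	// requests that lose the race simply no-op instead of hitting MySQL.
	if ( !$cache->add( $lockKey, 1, 30 ) ) {
		return;
	}
	$writeToDb();
}
```

Because `Memcached::add()` fails when the key already exists, concurrent requests for the same module/skin/language combination return early rather than piling writes onto the same row.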