[00:53:22] We received a less.php bug report about an undefined property. https://github.com/wikimedia/less.php/issues/132
[15:03:21] I'm seeing a 5000/sec increase in `Duplicate get(): "{key}" fetched {count} times` since 13:22Z. It's causing codfw log ingest lag.
[15:06:18] Closest patch I found is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192303 - CC: James_F
[15:06:33] cwhite: Huh.
[15:06:51] cwhite: Do you have a stack trace or heat map?
[15:07:00] Or just load?
[15:08:11] Log volume is my concern. https://logstash.wikimedia.org/goto/5a58f55fed4fd43ebad4cd00668ca24d
[15:08:28] Ack.
[15:08:56] Here's a better Logstash link: https://logstash.wikimedia.org/goto/6725efaaa0ae19206f5f145139108cfd
[15:10:46] It's from MWSession?
[15:11:23] Our newly-enabled code isn't even live on most of those wikis.
[15:11:31] It feels more likely that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192138 is responsible.
[15:13:30] But I don't know that code much - and mfossati isn't in here. Ping in -operations?
[15:17:59] Hmm... maybe there was a change to sessions? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1187779 CC xSavitar
[15:20:14] Thanks for having a look, James_F, I'll keep digging.
[15:20:55] Oh, yes, indeed, that looks much more likely.
[15:48:28] https://grafana.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&from=now-7d&to=now&timezone=utc&var-kClass=MWSession
[15:49:09] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-7d&to=now&timezone=utc&var-dc=000000026&var-site=codfw&var-prometheus=k8s&var-container_name=kask-production
[15:49:32] I was going to say, maybe the new wrapper class fails to turn off this warning for process cache hits.
[15:50:06] But... it looks genuine. Down at the kask level, there's a 4x traffic increase.
[15:50:25] Mostly 404s.
[15:52:06] Unfortunately I can't zoom out further because... Prometheus will time out.
[15:54:55] The miss rate seems to jump from around 40/sec to 200/sec on 9-29, then to 16,000/sec at 13:54
[15:55:02] today, that is
[15:59:40] cwhite: 9-29 could’ve been https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1183132 then
[15:59:50] (the group0 version of the patch you mentioned earlier)
[15:59:57] that sounds like a pretty good candidate to me tbh
[16:01:34] Yeah, so the new backend class either lacks process caching, or a different issue is causing it to look up known non-existent session IDs where previously it would not. The latter would be a regression of a known bug we fixed a few years ago.
[16:01:48] either that, or it's a pre-existing bug triggered by external traffic as a coincidence.
[16:02:28] MultiBackendSessionStore->get() calls into getActiveStore(), which does a get() against the BagOStuff. When getActiveStore() detects the anonSessionStore, it hands the BagOStuff back to MultiBackendSessionStore->get(), and another $store->get() is called.
[16:03:18] unless something is very broken there, the second get should come from the in-process cache.
[16:03:38] it did cause a big load increase on Cassandra, so quite possibly something is wrong.
[16:04:51] * xSavitar catches up on the conversations
[16:06:50] maybe CachedBagOStuff doesn't cache misses?
[16:08:40] the code seems correct at a glance
[16:09:01] In the "no user info" code path, we perform two store lookups (of the same store - one for anonymous and the other for authenticated sessions), so that would double (2x) the number of store lookups. Does that raise any concerns?
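As a reference point for the [16:06:50] question and the expectation at [16:03:18], here is a minimal sketch of a process-caching wrapper that also remembers misses, so a repeated get() of the same key on the same instance never reaches the backend twice. The class and interface names are hypothetical; this is not the actual CachedBagOStuff code.

```php
<?php
// Minimal, hypothetical sketch of a process-caching wrapper that caches
// misses as well as hits. Not the real CachedBagOStuff implementation.

interface KeyValueBackend {
	/** @return mixed Value, or false on miss */
	public function get( string $key );
}

class ProcessCachingStore {
	/** @var array<string,mixed> In-process cache; false marks a known miss */
	private array $procCache = [];

	public function __construct( private KeyValueBackend $backend ) {
	}

	public function get( string $key ) {
		if ( array_key_exists( $key, $this->procCache ) ) {
			// Hit or known miss: served from memory, no backend call.
			return $this->procCache[$key];
		}
		$value = $this->backend->get( $key ); // false on a backend miss
		$this->procCache[$key] = $value;
		return $value;
	}
}
```

If two such wrappers share one backend by reference, each keeps its own in-process cache, so the same key can still be fetched from the backend twice, which is what the duplicate-get warning reports.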
[16:09:34] It was a fallback logic for us to make sure we've looked up the key from "both stores"
[16:09:34] right, that's two different CachedBagOStuff instances
[16:09:50] but the backend behind them is the same, so it does report dupes
[16:11:23] I think most of this comes from the job queue, as that is the one common situation where the store has to guess whether the user is authenticated
[16:11:56] and in practice the answer is always going to be yes, we don't import unauthenticated sessions
[16:12:35] so probably we can do the authenticated store lookup before the unauthenticated one and avoid the extra traffic
[16:13:12] Maybe we revert for now, investigate more, and try again? The 404s seem to be increasing
[16:13:24] yeah let's do that
[16:13:50] +1
[16:14:00] I'm wondering if we can even be more strict to only look at the other store if we didn't find something from the previous lookup.
[16:14:05] Ack re reverting.
[16:14:36] tgr_, do we revert for group0 too? We didn't see this impact yesterday
[16:14:47] group0 is tiny
[16:15:03] let's leave it enabled there so it's easier to test fixes
[16:15:39] any volunteers for doing the deployment? I can do it but I should also leave soon ^^
[16:15:40] Ack
[16:15:56] xSavitar: do you want to do it? I can otherwise
[16:16:22] I've not been added to SpiderPig yet, I'll ask to be added so I can deploy from time to time.
[16:16:39] ok
[16:16:39] For now, you can go ahead, I'll standby and assist for testing...
[16:16:49] s/for/with
[16:17:12] tgr_ https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1192583
[16:17:23] xSavitar: you should be able to request permission at https://idm.wikimedia.org/permissions/ btw
[16:17:23] Do we need to use a window or just do it now?
[16:17:27] (i.e., not via a phab task)
[16:17:47] IMHO you could deploy now
[16:18:05] Check the schedule at https://wikitech.wikimedia.org/wiki/Deployments#!/deploycal/current
[16:18:47] If you're not within 45 min of another MW-related window, and it's working hours in Europe or the US, it's usually good to go.
[16:19:10] Lucas_WMDE, requested, thanks for the link
[16:19:11] yeah, the puppet window shouldn't interfere (and it's done anyway)
[16:19:31] (don't quote me on that as policy, there may be other factors we check that I'm forgetting to mention.)
[16:19:35] Krinkle, Ack
[16:38:04] miss rate is back to normal
[16:38:18] presumably Logstash rate too
[16:39:56] Logstash is looking much better now. Thank you!
[16:42:53] sorry for the trouble
[16:44:34] re-echoing what tgr_ said 🙏🏽
[16:46:40] <+xSavitar> I'm wondering if we can even be more strict to only look at the other store if we didn't find something from the previous lookup.
[16:46:51] I think that's a good change on principle
[16:47:06] wouldn't help here since we know these were all misses
[16:47:40] so actually I think we have to figure out a way to pass a SessionInfo to the job queue
[16:48:11] which is not hard, but currently each job reimplements session exporting/importing, so it will have to be done in a bunch of places
[16:49:22] that said, I don't really understand where these misses came from
[16:49:43] inflating misses by 2x would make sense, since there are now 2 stores
[16:50:04] but the increase was much larger than that
[16:50:38] for session store misses only, something like 100x
[16:50:51] anyway need to go afk for a bit
[17:18:53] can we pass the same CachedBagOStuff to both?
[17:19:00] i.e. do the wrapping first instead of after.
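A hedged sketch of the composition suggested at [17:18:53]: wrap the shared backend once and pass the same CachedBagOStuff instance to both stores so they share a single process cache. CachedBagOStuff is MediaWiki's real wrapper class; the store construction shown here is illustrative, not the actual service wiring.

```php
// Wrap the shared session backend once, so both logical stores share one
// in-process cache and a key looked up by either is cached for both.
$cachedBackend = new CachedBagOStuff( $sessionBackend );

// Hypothetical constructor, standing in for however the two stores are
// actually composed into the MultiBackend service:
$multiStore = new MultiBackendSessionStore(
	$cachedBackend, // anonymous-session store
	$cachedBackend  // authenticated-session store (same instance, by reference)
);
```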
[17:44:34] it would mostly defeat the point of this being a dry run for having two different stores
[17:46:59] but given the numbers, there must be some situation that was a sessionstore hit before, and is a sessionstore miss now, and that doesn't have anything to do with multiple stores or how they are handled
[17:47:19] we must be passing in the wrong session ID somewhere
[18:27:28] I don't know that it is a hit vs miss per se. GET 200 hasn't decreased. The excess is entirely 404s, which suggests either a new kind of 404 pathway, or an existing 404 request that is being multiplied.
[18:28:05] It is using the same RESTBag in both session stores, by reference, right?
[18:28:37] I was merely suggesting that rather than wrapping those in their own CachedBag, we can make CachedBag a responsibility one level up and wrap them before we compose the MultiBackend service.
[18:29:25] but yeah, if business logic tells you that the same key should never be requested from one and then the other (i.e. they're mutually exclusive), then I agree keeping the process caches duplicated/separate is fine/harmless.
[18:44:30] tgr_, Krinkle, https://logstash.wikimedia.org/goto/e3cccbba4588d08e9cf1f039aca751b6 lines up with https://grafana-rw.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&from=2025-09-30T12:50:14.250Z&to=2025-09-30T17:41:14.453Z&timezone=utc&var-kClass=MWSession&forceLogin=true
[18:45:04] It's clear looking at that Logstash that we're doing multiple lookups
[18:45:24] Now why we're getting cache misses, I need to find proof.
[18:46:46] Or say "more cache missed"
[18:46:51] *misses
[18:54:06] This group of errors: https://logstash.wikimedia.org/goto/98903bb91a70c7022e3c41ad33f37e93 seems to be trying to tell a story
[18:54:22] I can see several connection errors
[18:56:05] A random request from one of those: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2025.09.30?id=Egt6m5kB_vxcIv2vefO6 seems to go through the MultiBackendSessionStore class
[18:56:28] And the timing in https://logstash.wikimedia.org/goto/98903bb91a70c7022e3c41ad33f37e93 seems to tally with the deployment and revert
[18:57:32] What could have been blocking the connect?
[19:22:58] Hm! I may have figured out what the issue is, but I'll try to prove myself wrong tomorrow.
[19:27:08] What I'm seeing is coming from MultiBackendSessionStore->set(). When we have a new key->value pair we want to write to the backend, the set() call on the multi-backend would call getActiveStore() with the corresponding new session info object (and no user info), and that will do store lookups (in both stores), which will result in cache misses because they've not yet been written. It's doing all that to figure out the correct store to write to.
[19:27:09] Since it's doing that so many times, we have so many cache misses.
[19:28:53] When both stores are fetched and no data is found for a given key, it's considered an anonymous session (hence picking the anonymous backend).
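A simplified sketch of the set() path hypothesized above. The class below is a hypothetical stand-in for MultiBackendSessionStore (its real getActiveStore() takes a session info object, not a key); the only thing assumed about the stores is MediaWiki's BagOStuff-style get()/set(), where get() returns false on a miss.

```php
// Hypothetical, simplified stand-in for MultiBackendSessionStore: writing a
// brand-new session key first has to pick a backend, and picking a backend
// probes both stores for a key that by definition has not been written yet,
// i.e. two guaranteed misses per new session.
class HypotheticalMultiBackendStore {
	public function __construct(
		private BagOStuff $anonStore,
		private BagOStuff $authStore
	) {
	}

	public function set( string $key, $value, int $ttl = 0, int $flags = 0 ): void {
		// For a new key with no user info, this probe misses in both stores.
		$store = $this->getActiveStore( $key );
		$store->set( $key, $value, $ttl, $flags );
	}

	private function getActiveStore( string $key ): BagOStuff {
		if ( $this->authStore->get( $key ) !== false ) {
			return $this->authStore;
		}
		if ( $this->anonStore->get( $key ) !== false ) {
			return $this->anonStore;
		}
		// Nothing found in either store: treated as an anonymous session.
		return $this->anonStore;
	}
}
```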
[19:33:17] doing multiple lookups is expected, since we are looking up "both stores" (but in reality, the same store twice) when there's no user info
[19:33:34] what's not expected is that there are way too many of them
[19:34:34] what should have happened is that for some fraction of lookups, we are now doing two instead of one, so there's somewhere between a 0% and 100% increase in lookups
[19:34:46] but we had a 400%-ish increase
[19:36:15] yeah, it could be during set() as well, but the numbers still don't make sense - we have way fewer writes than reads
[19:37:13] maybe these are the WRITE_CACHE_ONLY set() calls?
[19:38:05] right, those happen when there's no session at all
[19:38:26] so the big increase makes sense
[19:40:15] Yes
[19:40:22] there are two code paths that lead to that, generateSessionId() and calling Session::save() when the session isn't persistent
[19:41:30] in the first case, it should be fine to just write to both backends
[19:42:40] in the second, not so much, but then I don't think SessionBackend::save() can happen with no user info
[19:43:06] so basically set() needs special casing for when there's a cache-only flag and the value is false
[19:43:35] it would be nicer to move generateSessionId() into SessionStore, but let's leave that for the future
[19:45:04] Yes, generateSessionId() is the real deal here.
[19:45:58] I can move it to SessionStore and proxy callers that reach it via SessionManager; it shouldn't be too hard, I hope.
[19:46:24] SessionManager already has SessionStore injected
[19:46:59] But if you want it to be a separate patch and not mix it with our current fix, that's fine.
[19:48:15] Thanks for the idea, Tgr.
[19:54:49] I would go for the easier fix, mid-rollout
[19:57:15] Ack
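The special casing suggested at [19:43:06] could look roughly like this, continuing the hypothetical class from the previous sketch. BagOStuff::WRITE_CACHE_ONLY is the real MediaWiki flag mentioned at [19:37:13]; the surrounding method is illustrative, and it assumes (per [19:40:22]) that the false-valued, cache-only write is the one coming from generateSessionId().

```php
// Continuing the hypothetical class above: if the write is cache-only and
// the value is false (the "no session data yet" placeholder discussed in the
// conversation), skip the backend probe entirely and seed both stores
// instead of trying to pick an "active" one.
public function set( string $key, $value, int $ttl = 0, int $flags = 0 ): void {
	if ( ( $flags & BagOStuff::WRITE_CACHE_ONLY ) && $value === false ) {
		$this->anonStore->set( $key, $value, $ttl, $flags );
		$this->authStore->set( $key, $value, $ttl, $flags );
		return;
	}
	$this->getActiveStore( $key )->set( $key, $value, $ttl, $flags );
}
```

Assuming both stores are CachedBagOStuff wrappers as in the discussion, a WRITE_CACHE_ONLY set should only touch the in-process layer, so seeding both is cheap and generates no extra sessionstore traffic.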