[02:03:04] brb, cutting half a *second* from mobile page load time for Persian Wikipedia :D
[02:03:23] data/chart: https://phabricator.wikimedia.org/T405429#11231816
[02:05:07] thx Amir1 for piloting that wiki first. It turned out to have a very similar baseline to Indonesian Wikipedia, so I can compare it to last year's data. We might see something similar on idwiki next week!
[07:36:29] Krinkle: any idea why RC/watchlist keep doing requests in the background even if "live updates" is off?
[08:20:17] I guess it's for the "view new changes since" notice
[10:01:31] How is it possible that the time to fetch decreases by 0.5s but the time to first byte only decreases by 0.25s?
[10:02:19] I suppose p75 for one is not the same group of users as p75 for the other, but it still seems weird
[10:03:48] nvm, it's explained in the post
[10:05:16] anyway, very impressive!
[10:57:26] number of redirects issued by our CDNs is slowly going down https://w.wiki/FVkZ
[10:57:40] So happy about the whole thing. W across the board
[16:44:07] tgr_: yeah, this also means that our navtiming metrics for 'dns' and 'tcp' will become more useful for mobile. So far they were only useful on desktop, because on mobile they were often 0 (too often to be explained by repeat visitors), since it took place before fetchStart. It is as if there were never any cold DNS or HTTPS connection state.
[16:44:35] that work now moves to the place in the sequence where you'd normally expect it: between fetch and response
[16:45:07] the actual redirect (sans dns+https) is about 200ms at p75. still a 20% cut on TTFB.
[16:54:10] can you post the post link again? sorry, I've lost it
[17:05:32] apergos, maybe you're referring to https://phabricator.wikimedia.org/T405429#11231816
[17:06:15] That's the latest link Krinkle posted about the performance boost.
[17:06:38] that and the task as a whole, thank you!
[17:06:46] Ack, yw!
[17:10:37] Krinkle: is there something like a TODO list somewhere for things we can clean up after the m. sunsetting is completed?
[17:10:47] because I would add “close https://phabricator.wikimedia.org/T252227 and remove the warning from https://wikitech.wikimedia.org/wiki/Provenance” to it :)
[17:13:06] In terms of JS hacks, very low impact. 99% of "m" mentions in on-wiki scripts are either 1) redundant things like people re-constructing location.href through various hacks, not realizing that $.ajax() and the like accept relative URLs like /w/api.php just fine; they think they "need" the mobile URL on mobile to avoid a cross-domain request, which is only a problem because they used the full URL in the first place, or 2) things that support the m-dot URL without requiring it, such as people scanning for external URLs and excluding if x.startsWith('https://en.wikipedia') or x.startsWith('https://en.m.wikipedia'), which are fine as-is.
[17:13:36] I've only found ~5 gadgets so far that use it to detect MobileFrontend for layout reasons and thus need updating.
[17:14:24] Although after we start redirecting m-dot to standard (maybe ramping up this week, mostly later though, trailing the main rollout) we can eventually remove a bunch of that as redundant, since there won't be page loads there anymore, only redirects without JS.
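For illustration only, a hypothetical gadget snippet showing the two patterns described above (the domain names and parameters are examples, not taken from the announcement page linked in the next message):

```javascript
// Pattern 1: redundant host reconstruction. A relative URL is same-origin on
// both the desktop and mobile domains, so the mobile-aware branch adds nothing.
var apiUrl = location.hostname.indexOf( '.m.' ) !== -1
    ? 'https://en.m.wikipedia.org/w/api.php'
    : 'https://en.wikipedia.org/w/api.php';
$.ajax( { url: apiUrl, data: { action: 'query', format: 'json' } } );

// Simpler and equivalent everywhere:
$.ajax( { url: '/w/api.php', data: { action: 'query', format: 'json' } } );

// Pattern 2: tolerating m-dot without requiring it. This keeps working
// unchanged once the m-dot domain only serves redirects.
function isExternal( href ) {
    return !href.startsWith( 'https://en.wikipedia.org' ) &&
        !href.startsWith( 'https://en.m.wikipedia.org' );
}
```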
[17:14:47] examples at https://www.mediawiki.org/wiki/Requests_for_comment/Mobile_domain_sunsetting/2025_Announcement#Example_JavaScript
[17:15:27] I hadn't seen T252227 before, that's interesting
[17:15:28] T252227: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227
[17:16:35] I'm guessing wprov is stripped in Varnish so early that it happens before the mobile redirect? Both are "pre-cache" and should, I think, easily be able to happen the other way around, but I guess nobody bothered to fix it
[17:20:52] lucaswerkmeister: Feel free to start a list somewhere I suppose :) Maybe gather ideas on the talk page?
[17:22:48] there’s a clean-up step in the task description of https://phabricator.wikimedia.org/T214998 but I’m hesitant to edit that and ping so many subscribers ^^
[17:22:55] I guess on-wiki could work pretty well: https://www.mediawiki.org/wiki/Requests_for_comment/Mobile_domain_sunsetting#Phase_5:_Cleanup
[17:23:14] maybe I’ll just add that link to the task description
[17:45:06] eh, on second thought, not sure editing an already-accepted RFC is the best idea either
[17:45:12] I’ll just set myself a calendar reminder in two weeks :P
[21:01:15] Krinkle, per the issue we faced yesterday with session cache misses, it's been resolved.
[21:01:34] We also noticed that the fix applied to group0 and you can see the drop here: https://grafana-rw.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&from=2025-10-01T19:01:50.933Z&to=2025-10-01T21:00:31.022Z&timezone=utc&var-kClass=MWSession&forceLogin=true
[21:02:00] get_hit_rate is back to almost 99%
[21:02:17] And the "misses" panel has dropped
[21:02:26] Everything looks good so far on group1
[21:04:01] Errors have also dropped here: https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=2025-10-01T18:59:43.083Z&to=2025-10-01T21:03:21.775Z&timezone=utc&var-dc=000000026&var-site=codfw&var-prometheus=k8s&var-container_name=kask-production
[21:04:09] Pretty sure those were coming from group0
[21:09:02] xSavitar: Looking at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1192884
[21:10:17] Ack!
[21:12:54] I see, so the WRITE_CACHE_ONLY feature was not supported yet by the wrapper.
[21:13:44] It's interesting that we need to "trick" the cache in this way. Ideally we'd have enough control over the flow that if we decide to generate a session ID we can also decide not to look it up.
[21:14:21] But a lot of session code is implicit/stateless, where each call is trying to independently do everything, i.e. how we also buffer writes and lazily save etc.
[21:14:58] I'd be interested in seeing the CachedBag wrapper disappear long-term, barring some unforeseen use case or gap in my understanding (quite possible of course).
[21:15:29] for the same reason: once we fetch it from the session store, we shouldn't have to trick ourselves into doing more fetches, and instead keep it at the level responsible.
[21:15:38] I suspect much of this was all to support the PHP session handler?
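Aside, a rough sketch of the "trick" under discussion: write a freshly generated session into the in-process cache only, so the later implicit lookup is a local hit rather than a pointless GET to the session store. This is illustrative JavaScript, not MediaWiki's actual PHP (CachedBagOStuff and the WRITE_CACHE_ONLY flag live in core); the class, method, and option names below are made up:

```javascript
// Illustrative write-through cache wrapper with a "write cache only" mode,
// mirroring the idea behind CachedBagOStuff + WRITE_CACHE_ONLY.
class CachedStore {
    constructor( backend ) {
        this.backend = backend;      // e.g. the Cassandra-backed session store
        this.procCache = new Map();  // in-process cache for this request
    }

    async get( key ) {
        if ( this.procCache.has( key ) ) {
            return this.procCache.get( key ); // no network round trip
        }
        const value = await this.backend.get( key );
        this.procCache.set( key, value );
        return value;
    }

    async set( key, value, { writeCacheOnly = false } = {} ) {
        this.procCache.set( key, value );
        if ( !writeCacheOnly ) {
            await this.backend.set( key, value );
        }
    }
}

// A brand-new session ID cannot exist in the backend yet, so seed the cache
// without persisting; any later implicit get() is then a guaranteed local hit.
// store.set( newSessionId, emptySessionData, { writeCacheOnly: true } );
```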
[21:16:05] but once that is gone, we can rely on code using our interfaces and so can perhaps "just" do these things explicitly and procedurally
[21:16:11] not specifically, SessionManager does lots of repeated lookups
[21:16:38] some of that was for long-gone reasons, especially PHP 5 being terrible with circular references
[21:17:06] maybe all of that, but would have to check
[21:17:12] aye, makes sense
[21:17:21] anyway, having a cache inside SessionStore seems proper to me
[21:17:56] whether that's CachedBagOStuff or SessionStore managing an in-process cache directly is a relatively minor detail
[21:20:16] conceptually I think with SessionManager now not destructing SessionBackends when they aren't referenced, the SessionStore cache is not actually needed, aside from some edge cases like the WRITE_CACHE_ONLY thing
[21:20:32] yeah, we might get away with fewer moving parts too
[21:20:49] but in practice, might need some fixing
[21:22:53] I think without CachedBag, we'd have to trust that we do our lookups correctly and own the session data array explicitly in our classes. That will bring a lot of confidence and clarity to the code, I think. But it also brings more responsibility of course, so we'll need a period where we have CachedBag but act like we don't have it, i.e. log lookups that happen after we already generated or retrieved the same session data, and audit it until it's done. Fun little migration.
[21:23:47] The traffic spike looked similar to me to before we implemented the "don't read newly generated session id" optimisation a few years ago
[21:23:53] back on Redis
[21:24:05] I didn't think it would be so literally the same thing.
[21:24:15] there was a big increase in sessionstore GETs on the 23rd, I wonder what that is
[21:24:18] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=2025-09-20T06:22:17.081Z&to=2025-10-01T21:12:55.232Z&timezone=utc&var-dc=000000026&var-site=codfw&var-prometheus=k8s&var-container_name=kask-production
[21:24:32] "set" spikes as well.
[21:24:38] I figured sign-up page scrapers?
[21:25:09] could be
[21:25:26] it's a very persistent one then
[21:25:53] numbers are a different order of magnitude, nvm.
[21:25:54] worth looking into, that's like a 6x increase
[21:25:54] 20K
[21:25:58] yeah
[21:26:20] wait, so that's the 404s, that's our bug
[21:26:52] I see, you mean the baseline increase
[21:26:55] yeah I see it now
[21:27:12] 1K > 3K constant
[21:27:42] maybe 0.5K > 3K even
[21:29:40] it lines up with the services switchover, but I don't see how that would be related
[21:30:07] hehe
[21:30:08] ok yeah
[21:30:17] switch dc=eqiad on top
[21:30:25] 3K going down
[21:30:29] up on the other side
[21:30:59] it's zero there because we depooled eqiad for a week; it'll be repooled today I think.
[21:31:07] oh right, the charts are for one DC only
[21:31:10] although most will stay switched, with codfw as primary
[21:31:47] yeah, it's a remnant of pre-Thanos Prometheus where we couldn't combine these easily
[21:32:08] we could make it use Thanos now and keep dc=[all|eqiad|codfw] for when you want to filter, but default to all
[21:35:31] re: the group1 deployment, there's still a fair amount of duplicate lookups: https://logstash.wikimedia.org/goto/993f9adf72662ff7e855146141ee34a2
[21:36:04] ~2000/min, not really a problem for either Cassandra or Logstash
[21:36:39] xSavitar's other patch will probably fix that
[21:38:34] not super urgent, but probably should be fixed before going to group2
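A minimal sketch of the audit step floated at 21:22:53 (log any lookup that happens after the same session data was already generated or retrieved). Again illustrative JavaScript with made-up names, not the real PHP implementation, and the logger is assumed to be something Logstash-bound:

```javascript
// Illustrative audit wrapper: flags lookups for keys whose data the caller
// should already own, so they can be removed before dropping the cache layer.
function withDuplicateLookupLogging( store, logger ) {
    const seen = new Set();
    return {
        async get( key ) {
            if ( seen.has( key ) ) {
                // Candidate for cleanup once callers hold the data explicitly.
                logger.warn( 'Duplicate session lookup', {
                    key: key,
                    stack: new Error().stack
                } );
            }
            seen.add( key );
            return store.get( key );
        },
        async set( key, value ) {
            seen.add( key );
            return store.set( key, value );
        }
    };
}
```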