[04:42:46] TimStarling: Is our strategy for caching text due for a re-think? Over the past two months, there has been a big spike in purges (T124418), enough to overload vhtcpd on the varnishes on a couple of occasions. My hunch is that it is attributable to a bug rather than organic change. At the same time, Wikidata is making content more interconnected than before, and purges are recursive, so it seems like a problem that we'll have to deal with eventually. [04:43:37] At what point is it better to simply set a short cache expiry, rather than purging half the known universe on every edit? [04:44:30] some choice [04:46:30] setting a short expiry is not an option though [04:46:37] unless you mean <5s or something [04:46:55] even a 5s expiry would be user-visible [04:48:43] One other reason I am thinking about this is that separate experiments by gwicke, me and Peter show that inlining the CSS we load at the top of pages makes a _huge_ difference to time-to-first-paint, enough to trump the benefit of caching for subsequent page-views. But editors expect common.css changes to apply very quickly. [04:49:03] And ESI won't be ready until 2011 [04:50:42] does this huge difference depend very sensitively on RTT? [04:51:18] It might; why do you ask? [04:51:36] I mean, does it still outweigh the benefits of caching if you are closer than geosynchronous orbit? [04:52:30] Yes. Keep in mind that the benefits of caching are dampened by the fact that different pages can load different CSS modules, so there is some variation in the CSS URL from one page to the next. [04:52:54] I can actually perceive it even on my laptop, on a fast connection. [04:53:12] see https://en.wikipedia.org/speed-tests/pages.html [04:54:25] if there are different CSS URLs on different pages, that is a bug, right?
[04:54:30] HTTP 2 Server Push might confer the same benefit without requiring a change to the caching model, but that might be far off; I think only Apache TomCat implements it currently. [04:54:46] the $group parameter is meant to be tuned to avoid that [04:55:58] How is that supposed to work? [04:57:27] in OutputPage::makeResourceLoaderLink [04:58:20] modules are sorted by group, and each group becomes a separate load.php call [04:59:27] I think that approach was only successful at moving user CSS / JS to a separate URL so that logged-in users still get the benefit of CDN-cached CSS [04:59:42] I could be wrong [04:59:51] maybe [05:00:15] mostly I am remembering this feature from RL design discussions, I haven't analysed its use in practice [05:00:48] but, at the design stage, that was the idea, that the groups would be tuned to avoid resending CSS/JS when you move from page to page [05:03:16] Module storage (caching modules in localStorage) mostly obviated that issue for module dependencies that are dynamic, with URLs generated in JS rather than embedded in the page HTML [05:04:28] but we're not caching modules in localStorage [05:04:54] please tell me we are not doing that [05:05:15] We are [05:05:56] https://meta.wikimedia.org/wiki/Research:Module_storage_performance [05:07:12] but didn't that break the user's browser forever with no way to fix it? [05:08:04] I thought I overheard someone talking about this at the all hands, about how localstorage is so badly broken, and I thought they were just reminiscing about the mistakes we made years ago [05:09:04] it broke localStorage caching on Firefox for users who visit multiple projects, because Firefox allows JavaScript to only see localStorage for the origin, but the quota is enforced on all storage for a top-level domain. [05:09:14] It's still broken for Firefox users. [05:09:20] is it at least disabled on firefox? [05:09:30] no, there are far more stringent quotas on module size [05:09:44] don't you use firefox? 
have you noticed any issues? [05:09:46] that makes no sense [05:10:06] which part? [05:10:17] Firefox's behavior, or the fact that we still have it enabled, but with stricter quotas? [05:10:38] you'd have to limit it to about 5KB [05:12:27] Why? The limit is 5MB. [05:12:47] with 2LD storage quota of 5MB and 900 sites, 6KB would tip it over the limit if you visited every site [05:13:37] do you operate an internet kiosk at JFK airport? [05:14:08] nobody visits every single site; most people visit a very small handful. [05:15:04] Angela had accounts on 600 wikis before CA was even invented [05:15:13] have you heard of https://meta.wikimedia.org/wiki/Small_Wiki_Monitoring_Team ? [05:16:06] there are people who really do visit hundreds of wikis per week [05:16:50] Well, module storage doesn't provide any benefit to them, but it doesn't break their ability to use the site [05:17:09] no, this was part of the original bug discussion, it does break their ability to use the site [05:17:27] because once localStorage is full, it breaks a whole lot of other random bits of JS [05:17:55] module storage can catch the exception and carry on, but then it has broken the entire 2LD for everything else that uses localStorage [05:19:14] Safari throws a quota exceeded exception on any attempt to set a value in localStorage in incognito mode [05:19:17] what is the stringent quota on module size and how is it enforced? [05:19:37] JavaScript code has to handle that, which is why we have a wrapper in core that handles the exception gracefully [05:21:05] we don't cache modules larger than 30 KB on FF. Modules tend to be either much smaller or much larger than that. 
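The "catch the exception and carry on" wrapper and the 30 KB per-module cap described above can be modelled roughly as follows. This is a minimal Python sketch, not the code in mediawiki.storage.js: `FakeStorage`, `QuotaExceededError`, `safe_set` and `MAX_MODULE_BYTES` are hypothetical names, and the toy store counts characters rather than whatever unit a real browser uses for its quota.

```python
class QuotaExceededError(Exception):
    """Hypothetical stand-in for the browser's quota-exceeded exception."""


class FakeStorage:
    """Toy key/value store with a size quota, standing in for localStorage."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.items = {}

    def used(self):
        # Count keys and values, one unit per character (an assumption).
        return sum(len(k) + len(v) for k, v in self.items.items())

    def set_item(self, key, value):
        projected = self.used() + len(key) + len(value)
        old = self.items.get(key)
        if old is not None:
            # Replacing a key frees the space of the old value first.
            projected -= len(key) + len(old)
        if projected > self.quota_bytes:
            raise QuotaExceededError(key)
        self.items[key] = value


# The 30 KB cap mentioned for Firefox; treating "KB" as 1024 is an assumption.
MAX_MODULE_BYTES = 30 * 1024


def safe_set(storage, key, value, max_bytes=None):
    """Try to store a value. Return True on success, False otherwise.

    Never raises: oversized values are skipped, quota errors are swallowed.
    """
    if max_bytes is not None and len(value) > max_bytes:
        return False
    try:
        storage.set_item(key, value)
        return True
    except QuotaExceededError:
        return False
```

A caller would treat a `False` return as "not cached" and carry on. Note that this only protects the caller itself: once the store is full, every other writer on the same quota hits the same failure.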
[05:22:53] since the quota is per-2LD and not globally, the relevant figure is not 900 wikis, it is about 300 since that is the number of wikipedias [05:23:10] which gives you about 17KB per wiki [05:26:19] If it is vital to categorically avoid this issue in all cases, which I don't believe [05:28:20] The impact it made in 2013 was to the tune of several hundred milliseconds. We've optimized a lot of things since then (made all JavaScript load asynchronously, added SPDY support). Let's suppose that the benefit today would be a small fraction of what it was in 2013, like 25 milliseconds. That's 25 milliseconds multiplied by ~15 billion page views a month. [05:28:42] yeah we're saving a million lives a month or something, sure [05:29:26] looks like it's 655KB on enwiki at the moment [05:29:31] enwiki alone [05:30:28] sqlite> select sum(length(value)) from webappsstore2 where scope like 'gro.aidepikiw.%'; [05:30:28] 5107754 [05:30:41] ? [05:31:23] scope is backwards, that is *.wikipedia.org, I am saying that since you asked, yes my localstorage is very close to broken [05:31:46] some 135KB short of the quota [05:33:51] we delete the key before replacing it whenever it needs to be updated, so if the new value exceeds the quota it just doesn't get set [05:36:33] we batch ResourceLoader requests, but in order to assure that you get the latest version, the version string in the URL is the mtime of the most recently changed module [05:37:11] in practice this means that a change to one module has a large area effect, because all the other modules that get loaded in the same URL have to be re-retrieved too [05:37:22] what is the wrapper called? 
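The "large area effect" of the batched version string can be illustrated with a small model. This is a hedged Python sketch only: the module names, the `modules` and `version` query parameter names, and the helper functions are assumptions for illustration, not ResourceLoader's actual URL format.

```python
def batch_version(modules, mtimes):
    """Version of a batch: the mtime of its most recently changed module."""
    return max(mtimes[name] for name in modules)


def batch_url(modules, mtimes):
    """Build a hypothetical load.php URL for one batch of modules."""
    names = "|".join(sorted(modules))
    return f"load.php?modules={names}&version={batch_version(modules, mtimes)}"


# Hypothetical modules and modification times.
mtimes = {"site.styles": 100, "skin.base": 200, "ext.gadget": 150}
batch = ["site.styles", "skin.base", "ext.gadget"]

url_before = batch_url(batch, mtimes)
mtimes["site.styles"] = 300  # touch a single module
url_after = batch_url(batch, mtimes)

# The URL changes, so the browser re-fetches all three modules,
# even though two of them are byte-for-byte unchanged.
print(url_before != url_after)
```

Under this model, the cache lifetime of a whole batch is bounded by its most frequently edited member.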
[05:37:55] resources/src/mediawiki/mediawiki.storage.js [05:39:35] we used to completely bust the browser's cache of resourceloader assets at least every week, but usually far more often than that, because in practice some on-by-default JavaScript code or another is being worked on at all times [05:42:43] Roan has a graph somewhere of bytes_out on the bits varnishes (which were a separate cluster then) during the rollout [05:42:53] well, in deployed extensions, there are 41 instances of "mw.storage" or "mediawiki.storage" and 258 instances of "localStorage" [05:45:21] exclude the ones that are wrapped in try / catch [05:45:44] the remaining total is the number of tasks that should be opened for extensions which don't work in incognito mode in Safari [05:46:10] ok [05:46:25] but the ones that *are* wrapped in try/catch, many of those will need tasks opened for them too [05:47:47] right. that is a problem, and a troubling one. [05:48:06] I asked Timo about a month ago if we should just disable module storage for that reason, because I figured I might be biased. He said no, because the gains outweigh the cost. I'm inclined to agree.
[05:49:11] well, you're probably gathering that I don't agree [05:50:07] I don't think it's worth causing actual user-visible breakage in exchange for <250ms of performance benefit [05:50:33] I haven't heard any complaints since the 30KB quota for FF went out [05:50:55] it's just chance [05:51:23] and it depends on other users of localStorage as well, so it may change in the future [05:51:49] it'll get ported to webworkers soonish [05:52:22] presumably you don't have logs of bugs caused by localStorage, you're relying on users to report it [05:52:47] we have logs [05:52:57] we emit an event when we fail to store [05:53:07] i haven't looked at the data recently, though [05:53:55] we accept chance in many other places, though, when the deck is stacked right and the payoffs are right [05:54:02] heh [05:56:01] failure to store must be happening very often [05:56:18] every time I visit a new wiki it will fail to store [05:57:50] the part which is down to chance is the fact that I have 135KB remaining [05:58:08] which is not enough to store a single extra module cache, but it is enough for most other things [05:58:18] yes, that's a factor too [05:59:15] so it partly depends on the remainder of the division, quota % wiki cache size [05:59:44] right. and if you were so unlucky so as to have the quota filled entirely by module storage, it would almost certainly be brief, because some module store will fail to update itself soon enough [06:00:05] dog is begging, bbiab. [06:01:38] the best argument against module storage is this: although it is certainly true that most other things using localstorage require only a little bit of space, that may well be because developers expect it to be almost entirely occupied by module storage. It is not the prerogative of any piece of code to monopolize a resource like that. I think that is a fair argument. 
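The quota figures quoted in this exchange can be checked with a few lines of arithmetic. The Python sketch below assumes the quota is exactly 5 MiB, that "KB" means 1024 bytes, and that the sqlite `sum(length(value))` figure is in the same unit as the quota; none of these are confirmed in the discussion.

```python
QUOTA = 5 * 1024 * 1024   # assumed 5 MiB quota per second-level domain
USED = 5107754            # sum(length(value)) reported for *.wikipedia.org


def per_wiki_budget(wiki_count, quota=QUOTA):
    """Units available per wiki if the quota were split evenly."""
    return quota // wiki_count


def leftover(cache_size, quota=QUOTA):
    """Space left once as many equal-sized caches as possible are stored."""
    return quota % cache_size


remaining = QUOTA - USED              # 135126, the "some 135KB short" figure
budget_all_sites = per_wiki_budget(900)   # under 6 KB per site
budget_wikipedias = per_wiki_budget(300)  # about 17 KB per wiki
enwiki_cache = 655 * 1024             # the 655KB figure quoted for enwiki
slack = leftover(enwiki_cache)        # what other code gets if all caches were this size
```

The last value shows why the outcome is "partly down to chance": the space left for everything else depends on how the cache sizes happen to divide into the quota, not on any deliberate reservation.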
[06:10:52] start render is better when the css is inlined even for the second view, somehow [06:11:09] inlined: http://www.webpagetest.org/result/160201_G3_6QZ/ , not inlined: http://www.webpagetest.org/result/160201_J4_6R5/ [06:11:19] that's on DSL [11:50:55] https://www.mediawiki.org/wiki/User:Legoktm/PHP_5.5 - steps needed to bump the version requirement and stuff afterwards - I'm a little stuck on step 1, I'm going to look more into that tomorrow [14:07:47] gwicke: Ping me, if you have a minute [15:39:28] anomie: good morning [15:39:36] bd808: I hope so [15:39:41] :) [15:40:44] do you have an idea of which of the debug session messages are useful and which are not? [15:40:57] You probably already saw I put in two more patches yesterday for other potential causes of T125283, but I think we still don't have any idea whether that's the real cause. [15:41:59] *nod* I merged one of them and cherry-picked it to .11 [15:42:12] Re logging: It's hard, because a lot of the stuff is only useful if you're actually trying to figure out why something broke. Otherwise no one cares what exactly happened on every single pageview. :/ [15:42:45] yeah. logging anything on every page view is problematic just from a log volume point of view [15:42:56] E_TOOPOPULAR [15:43:30] As for the messages themselves, we could probably use more session IDs and/or user names being logged when the messages are logged, so we have a better chance to match things up instead of just seeing "oops, something failed". [15:44:21] For example, "failed to create empty session: Session ID already exists". Ok, sure, but which session ID? [15:44:37] OTOH, maybe logging that one at the "error" level is too much anyway. [15:44:45] I don't know. [15:47:56] I was trying to figure out if Varnish caches and returns cookies or not yesterday. 
I could neither confirm nor deny that in the time I spent reading internet docs and our VCL rules [15:48:08] I should ask bblack I suppose [15:59:34] Apparently we have safety new VCL code that keeps things with Cookie headers attached out of cache -- https://github.com/wikimedia/operations-puppet/blob/d21c146236ac7e8b678d6cf6ec53826b17db7ba1/templates/varnish/text-backend.inc.vcl.erb#L93-L101 [15:59:43] *safety net [15:59:49] hey both, can I interrupt shortly to discuss options? [15:59:57] greg-g: yes please [16:00:10] wanna do a hangout? [16:00:41] maybe, I'd love to get another relenger involved, [16:00:42] greg-g: Please keep me updated, we plan to branch Wikibase today (again)… would like to know our options [16:00:51] hoo: yup [16:01:44] bd808: uno momento [16:05:32] bd808: Doesn't look too new. But good, although that does make the "someone got served cached Set-Cookie headers" theory unlikely, meaning we know even less about what might be behind the bug. [16:06:11] anomie: *nod* [16:17:40] bd808: anomie in a few minutes I'll share a hangout link for us, probably with dan and maybe someone else [16:17:47] SFers are still waking up :) [16:18:26] works for me. I'm open until the top of the hour [16:19:05] us too [16:24:11] bd808: anomie https://plus.google.com/hangouts/_/wikimedia.org/wtf-11?authuser=0 [16:25:21] haha [16:28:35] TimStarling: ori: The main underlying issue is that localStorage (and most other currently stable storage APIs in the browser) have no TTL and have no exposed quota. E.g. you visit a wiki once and you'll keep those modules forever, they don't expire or revalidate until you visit the same wiki again. Upon revisiting, we immediately purge all modules of which the version no longer matches the current version hash (regardless of whether you request that module on that page) - so things clear out pretty well on wikis actively visited.
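The purge-on-revisit behaviour described here can be sketched as a simple filter over the stored modules. This is a hedged Python model: the store layout (module name mapped to a version and source pair) and the function name `purge_stale` are assumptions for illustration, not the actual mw.loader store format.

```python
def purge_stale(store, current_versions):
    """Drop every cached module whose version is no longer current.

    `store` maps module name -> (version, source). Modules missing from
    `current_versions` are treated as stale as well. Returns the names
    that were removed.
    """
    stale = [
        name
        for name, (version, _source) in store.items()
        if current_versions.get(name) != version
    ]
    for name in stale:
        del store[name]
    return stale


# Hypothetical cached modules from an earlier visit.
store = {
    "a": ("v1", "/* a */"),
    "b": ("v2", "/* b */"),
    "gone": ("v1", "/* removed module */"),
}
removed = purge_stale(store, {"a": "v1", "b": "v3"})
```

The purge only runs when the wiki is visited again, which is the gap being pointed out: a store for a wiki that is never revisited keeps its entries indefinitely, because nothing expires them.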
[16:28:55] But we really need a way to store things in a more volatile and less destructive manner [16:29:08] E.g. localStorage.set(key, value, expiry, priority) or something like that [16:29:16] WHATWG is working on things to that end [16:29:30] https://wiki.whatwg.org/wiki/Storage [16:30:25] As for user impact based on occupied localStorage, that is very real in theory. Less so in practice, because we have an Engineering-wide ban on use of localStorage. It cannot be relied upon and is currently reserved for module storage. This was not intended of course, and cannot last. It's a retroactive, temporary measure. [16:31:31] For the medium term we should probably find a different storage API that is almost unused, since there are now lots of use cases for localStorage. Including the Performance goal of not putting key/value pairs that are irrelevant to the server in cookies, but using localStorage instead, which is currently blocked on mw.loader not using localStorage. [16:31:45] The ServiceWorker Cache (not WebWorker) is a prime candidate for this. [16:38:22] hoo: pong [16:38:48] there's also the browser cache API where the browser magically caches things for you if you just send the right headers! :> [16:39:43] gwicke: I have some questions regarding your plans for "page dependencies" [16:39:57] Do you have time for that? [16:42:18] hoo: sure [16:42:41] perhaps in #wikimedia-services, so that mobrovac & others can join in? [16:42:47] Yeah, sounds good [16:57:58] MatmaRex: We already use that heavily but it is conceptually not suitable for our needs and performs less well. [17:12:00] anomie: good morning - Guess not. [17:12:17] blerg [17:14:19] Like I said it was not my emotional favorite choice but from a logical standpoint it seems the right course of action [17:15:04] the "be ready to revert if..." timeline and plan was too full of conditionals [18:24:36] what is the right course of action?
[18:25:51] bd808: ^ [18:26:35] ori: backing SessionManager out of master to let .12 release tomorrow and then putting it back in after the branch cut [18:26:45] so we skip a week of prod use [18:26:56] and let the rest of the world get their code moving [18:30:49] I'm not sure putting it back in right away is the best decision, to be honest. In your e-mail, you wrote "if we back out now I'm worried that the bar to coming back will be set so high that we will never be able to achieve it." That seems baseless to me; no one is going to take away your +2. The only way I see SessionManager not happening is if you and your team burn out so badly that you never want to look at it again. [18:31:40] I think it's essential to find a way to put it behind a feature-flag so it can be enabled selectively on one or two wikis at first. [18:32:32] it's pretty hard to feature flag this [18:32:38] maybe more than pretty hard [18:33:03] once you enable it on loginwiki it's really enabled everywhere [18:33:15] and if you don't enable on loginwiki it's really enabled nowhere [18:33:31] officewiki is a good option. This sounds counterintuitive, because it is private and sensitive. But it is separate from loginwiki and all its users are trusted. [18:33:55] So things like session cross-over are less dangerous [18:34:19] all of the issues that have been scary are centralauth related [18:36:00] well, if it's hard to feature-flag, it's a defect, IMO. I understand that it's not a trivial matter of adding a $wgUseSessionAuth or something like that, but it goes with the territory. The all-or-nothing choice is a fundamentally bad situation to be in. [18:37:35] You should find a way to chop up the deployment into a sequence of changes that can be rolled out and tested (and backed out, if necessary) individually [18:46:47] If this blows up wmf13, you will forfeit a lot of the credibility and trust people have in your judgment, and you are going to burn out. [18:48:02] Don't risk it.
Software projects take longer than expected. It happens. [19:11:32] could beta.wmflabs.org be useful for testing this properly? or does it have to be a prod wiki? [19:30:25] Krenair: I think the question right now is what is "testing this properly". We've been on beta since about an hour after the .10 branch was cut and none of the scary issues that were reported in prod were found there. [19:31:02] many of the small and annoying issues found were due to "interesting" user agents [19:31:41] interesting in the sense that they had non-RFC compliant cookie handling or did funky things like bypass HTTPS in prod by only ever using POST [19:43:06] csteipp: would you have some time today to chat about SessionManager next steps? [19:55:31] bd808: Yes please! I'm free for the next hour, or after 2 [19:57:29] Now works for me [19:57:34] * bd808 puts down sandwich [19:57:51] Should we do a hangout? [19:58:50] csteipp: ^ [20:00:50] bd808: Yeah, we can do that... [20:30:02] anomie: hello! back from dinner. I am wondering if we should delay wmf.12 deploy by an extra day to give you time to review the no SessionManager patch [20:30:44] and potentially move it earlier in the day so you get more time ahead [20:30:54] hashar: I wrote it, other people need to review it. Then it probably could use some sort of testing in beta. [20:37:34] anomie: is that a patch that reverts all the session manager stuff which is pending on master? [20:37:39] I kind of lose track :D [20:37:58] hashar: I haven't submitted it just yet, but it will be in a few minutes. [20:38:09] and I guess we would need related revert patches on the various other extensions [20:45:05] It looks like we don't, actually. The only extensions that I know of that were changed are CentralAuth and OAuth, and both will just fall back to their backwards-compatibility code paths that haven't been removed yet.
[20:47:04] ahhh nice [20:47:15] that makes it easier [20:48:09] bd808, hashar: https://gerrit.wikimedia.org/r/267733 [20:49:30] anomie: thanks [20:49:40] bd808: Also, https://gerrit.wikimedia.org/r/#/c/267734/ [20:50:13] shiny! [20:50:23] and https://gerrit.wikimedia.org/r/#/c/267735/ [20:50:53] csteipp: anomie has a patch to make invalidating all sessions quickly possible -- https://gerrit.wikimedia.org/r/#/c/267734/ -- would be great to get review [20:51:36] bd808: FYI, it depends on the revert (and the unrevert depends on it) because they touch some of the same code in User.php. [20:51:48] on the CentralAuth patch https://gerrit.wikimedia.org/r/#/c/267735/ , you can flag it to be tested with the core patch by adding in the commit message: [20:51:56] Depends-On: I9d316a6bbb36278d138f39a89125ebb8cc71b28f [20:52:35] Zuul will trigger the job of CentralAuth as if $wgAuthenticationTokenVersion had been merged in core [20:54:16] Let's do that for the testing, actually, even though it should work fine without it (the global will just be null then (PHP doesn't seem to care if "global $foo" refers to an undefined), and so no behavior will change) [21:12:17] (tell me I'm dumb: could some of the issues SM is having that are hard to debug be due to the session invalidation not being complete yet? ie: some people with sessions created by old code-paths interacting with new codepaths?) [21:18:00] greg-g: That seems unlikely, particularly considering T125267, but I can't rule it out. [21:18:30] anomie: /me nods [22:09:32] anomie: I talked to csteipp and have a good idea of a metric that we should get into logging for before/after comparisons with SessionManager (unique ips associated with session in a short time window). I'll write up a task for it [22:11:44] bd808: I wonder how many people are behind multi-public-address proxies though. [22:12:30] Don't know until we start tracking.
We certainly shouldn't have any kind of alerting until we know what is "normal" [22:14:20] * anomie disappears for about three hours [22:19:44] bed crash. anomie poke me tomorrow morning whenever you log in :-} [22:32:22] ori: I hear your concerns. I'm not sure that I agree with parts of the analysis especially that all core architectural changes must be designed so that they can be feature flagged on and off. [22:32:53] not all core architectural changes [22:32:56] some :) [22:33:07] That being said I'd love to hear constructive suggestions on how this restoration patch might be split up -- https://gerrit.wikimedia.org/r/#/c/267737/ [22:33:35] ok, let me take a quick look [22:34:04] (the bot password parts could be split out into their own patch again for certain) [22:34:39] where should i go for a high-level overview of all the things that are changing? is the RfC still the best document for that? [22:35:16] I think Brad made another page about AuthManager at a high level. Let me find that [22:36:11] ori: https://www.mediawiki.org/wiki/User:Anomie/SessionManager_and_AuthManager [22:36:47] the sessionmanager info there isn't very detailed [22:38:15] Commentary on the implementation patch might fill in some gaps -- https://gerrit.wikimedia.org/r/#/c/243223/ [22:38:56] (it would be neat to make something that scraped gerrit to make nicely readable pages about reviews like that) [22:57:25] bd808: how can you deprecate PHP session handling ($_SESSION, session_id(), etc.) but have things that use it continue to work? Is it because PHPSessionHandler does the same things behind the scenes, so that (for the time being) using the SessionManager API is equivalent to calling the native PHP APIs? [23:00:45] I believe that is the implementation, yes. There is code to check for changes to $_SESSION on save(). I honestly don't know the code very well. Brad and Gergo have been doing all the work while I piddle with other things.
[23:20:47] ori: Krinkle: not sure how often you read your WMF mail, so here's a friendly note that you've got new WMF mail.