[07:58:53] This is a nice summary of performance alerts for a month: https://blog.mozilla.org/performance/2020/10/15/performance-sheriff-newsletter-september-2020/
[12:48:17] I got the tests up and running for checking long tasks on pages linked from the enwiki main page. The number in parentheses is the budget/alert limit. https://usercontent.irccloud-cdn.com/file/3PLpsDTl/Screen%20Shot%202020-11-05%20at%2012.40.10%20PM.png
[12:49:00] do you mean it will start from the current main page and go to whatever articles it currently links to?
[12:50:52] yep, max 200 pages, to try to get a better feel for pages other than the normal 3-4 that we test.
[12:53:08] "Browsing at the Speed of Right"
[12:53:17] that sounds familiar :D
[13:11:23] :)
[21:47:24] just wanted to follow up and - sort of - answer my own previous question about broken traces: it seems the problem disappeared after Scribunto was updated to include some fixes for situations where it could cause exponential template expansion, but I haven't ended up figuring out why the traces were not truncated/discarded correctly in the first place. Anyway, at least it seems to be gone now :D
[21:59:26] mszabo: ack, thanks
[22:00:07] perhaps something in PHP (either generally or in your config) is also truncating the "real" backtrace after a certain point?
[22:00:41] thanks, that could be something to consider too - although to my knowledge this trace is wholly maintained by excimer, so it'd be weird
[22:00:53] in any case, thanks a lot for the tool, it is very helpful and insightful :)
[22:01:30] mszabo: yeah, it'd have to be a native zend thing where maybe it has some kind of upper limit on the stack size.
[22:01:42] I'm not aware of that though, and certainly wouldn't expect it to be lower than 250
[22:02:12] mszabo: we're currently working on storing the files in Swift rather than on local disk.
[22:02:25] so as to have higher retention and make the service stateless
[22:02:34] how are you currently using/planning for it?
[22:03:56] I deployed it to k8s in a not-so-idiomatic way - we don't have ReadWriteMany PersistentVolumeClaims, so each component is a container in the same pod so they can operate on a shared volume. The volume is modestly sized, as a periodic script in one of the containers uploads everything to a GCS bucket every 2 hours, and retention is set to be short so that everything is reaped from the volume after 1 day.
[22:09:36] mszabo: cool, you might want to check out https://phabricator.wikimedia.org/T244776 and let us know if there's anything there that would make that easier in the future.
[22:10:05] perhaps we could include an adapter for GCS or something like that.
[22:10:16] thanks :) actually, the current situation is a good compromise I feel, because it doesn't require operations on the GCS bucket every time a new trace is logged
[22:10:28] for sure yeah, same for us.
[22:10:36] we serve the last 24h at least from disk
[22:10:56] then gzip/upload after a day
[22:11:07] yeah, if that stays the same that's awesome :D
[22:11:23] right now we have it set to 7 days, which is very long, but that's because our web portal doesn't yet look at Swift, so 7 days is because otherwise we won't be able to easily open them
[22:11:41] once our HTML viewer also includes Swift then we can shrink that down to 24h
[22:11:44] got it
[22:12:07] do you have some kind of script or viewer to list the files in GCS and link them for viewing?
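A minimal sketch, assuming the php-excimer API (ExcimerProfiler, ExcimerLog), of how a stack-depth cap and flush callback can be configured for the sampling profiler discussed above; the period, depth limit, and output path are illustrative, not the actual production values:

```php
<?php
// Illustrative sketch only: sample wall-clock time and cap how many stack
// frames are kept per sample, the kind of truncation discussed above.
$profiler = new ExcimerProfiler();
$profiler->setEventType( EXCIMER_REAL ); // wall-clock sampling (EXCIMER_CPU for CPU time)
$profiler->setPeriod( 60 );              // one sample per 60 seconds
$profiler->setMaxDepth( 250 );           // limit the captured stack depth per sample
$profiler->setFlushCallback( static function ( ExcimerLog $log ) {
	// Append samples in collapsed-stack format for flame graph generation.
	file_put_contents( '/srv/traces/excimer.log', $log->formatCollapsed(), FILE_APPEND );
}, 1 );
$profiler->start();
```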
[22:12:14] for now I'm just using the GCS UI to peruse the SVGs, but maybe I will try to create some better frontend to view them
[22:12:23] ah right
[22:12:24] cool
[22:12:46] the whole thing only went live in the backup DC last week and in the primary today, so it's still maturing :)
[22:13:10] so same-day graphs are not (easily) viewable right now for you
[22:13:22] you actually wait for the reaper to see it and GCS is the main entry point
[22:13:29] right?
[22:15:44] actually the script runs every 2h, so the current day's WIP graph is synced every 2 hours
[22:16:30] and retention is set to 1 day, so only the file from the previous day is kept on disk, apart from the one being updated for the present day
[22:18:47] nice
[22:20:01] it's already revealed that 12% of the time is spent in ExtensionRegistry::readFromQueue() :P
[22:20:37] mszabo: so the direction we have so far is for the swift uploader to reap these files and then run a swift query for files that exist, and then dump a standalone HTML file into the apache docroot with the full URL links to those files based on an HTML template. That way apache is standalone and doesn't e.g. try to query Swift on demand every time.
[22:20:44] mszabo: https://phabricator.wikimedia.org/T187154
[22:20:53] oh, that's a neat idea
[22:20:56] perhaps you'll have better luck
[22:22:40] yeah, I looked at T187154 but I think this is a separate problem, as we're hitting readFromQueue() a lot, which is the cache miss path. So probably our APCu config needs love (I tweaked it, which should have a noticeable impact), but it also seems that cached load queues can't be reused between wikis that load the same set of extensions, because ExtensionRegistry uses makeKey() rather than makeGlobalKey() to build its cache key
[22:22:40] T187154: Performance regression from Apcu/ExtensionRegistry::loadFromQueue on PHP7 - https://phabricator.wikimedia.org/T187154
[22:23:43] mszabo: hm.. interesting... yeah, I think it also varies by wiki in other ways based on config variables and hooks
[22:23:52] e.g. onRegistration callback
[22:24:06] but maybe we run that later?
[22:24:15] I hope not, but I can see why maybe for legacy reasons it is uncached
[22:24:56] could be... but it seems that one is run after the data is fetched (potentially from cache)
[22:25:00] mszabo: also, you might be able to use $wgExtensionInfoMTime
[22:25:40] I considered it, it just needs some work to e.g. have scap (or your docker builder) bump something somewhere
[22:25:54] yeah, good idea!
[22:26:02] then you won't need to stat 1000+ json files at runtime
[22:26:16] still better than reading/parsing 1000+ json files, but that's still a lot of stats for a cache key
[22:26:38] I imagine it's mostly in the fs cache but still
[22:26:56] yeah, and I don't know if/how there's overhead when containers enter the mix
[22:27:36] do let us know if you end up finding that wgExtensionInfoMTime can work at scale in a convenient way and if it helped
[22:28:33] https://phabricator.wikimedia.org/T233886 and https://phabricator.wikimedia.org/T228895 might also be of interest, if you're able to spare time on improving some of these. could pair up
[22:28:48] yeah, I'm thinking it could probably even be set to something arbitrary, given that k8s deployments destroy the world & recreate it, so extension.jsons should never mutate in such a context
[22:29:05] good point
[22:29:28] so you have a 1:1 in terms of php-apcu : docker container : php-fpm, right?
[22:30:01] Krinkle: any time this week for paratest btw?
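A minimal sketch of the $wgExtensionInfoMTime idea from the discussion above, assuming an immutable container image whose build time is exposed as an environment variable (IMAGE_BUILD_EPOCH is a hypothetical name, not an existing convention):

```php
<?php
// LocalSettings.php - hypothetical sketch. With immutable images the
// extension.json files never change at runtime, so the registry cache key can
// be pinned to the image build time and ExtensionRegistry no longer needs to
// stat() 1000+ json files when computing its cache key.
$buildEpoch = getenv( 'IMAGE_BUILD_EPOCH' );
$wgExtensionInfoMTime = $buildEpoch !== false ? (int)$buildEpoch : filemtime( __FILE__ );
```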
[22:30:16] yeah, the deployment will spin up a fresh php-fpm container with empty apcu
[22:30:20] AaronSchulz: sure, today/tomorrow looks good. what's the next step?
[22:43:25] krinkle: I remember looking at the MonologSpi parts myself, and maybe something that could be done would be to have LoggerFactory return a "stub" PSR logger implementation that lazy-initializes a real logger using LoggerFactory the first time one of its methods gets called
[22:44:00] so basically StubObject, but one that implements LoggerInterface :P
[22:45:31] the wmf-config around logging is insanely complex. I haven't double-checked recently, but it seems like we're creating a ton of objects, and doing so repeatedly for every channel, when in practice I feel like that doesn't need to be the case - but I don't know Monolog well enough at the moment to see if that's true or how to do it instead.
[22:45:57] it seems all each logger needs to know is the string of the channel name, a threshold integer and a pointer to whatever thing is used to transport stuff.
[22:49:55] Krinkle: just doing a setup walkthrough over google meet
[22:50:17] yeah, that's true as well - right now MonologSpi allows for very complex configuration, but it doesn't seem to be utilized by WMF at least, which is why those simple wrappers exist in wmf-config
[22:54:23] AaronSchulz: ah ok, sent an invite for Monday morning
[22:57:35] it looks like Monolog loggers take the channel name in the constructor, so it becomes hard to reuse existing instances for different channels - although withName() clones the object, so perhaps a few handler/formatter allocations could be saved that way
[23:01:24] mszabo: hm.. yeah, handlers/formatters should be instance-agnostic though, right?
[23:01:49] perhaps we are using them incorrectly or stuffing too much logic in the handler classes
[23:02:35] yeah, the handlers/formatters operate only on records, and the channel name is one of the values in the record array
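A minimal sketch of the lazy-initializing logger idea from the discussion above, assuming psr/log 1.x; the class name and factory wiring are hypothetical, not MediaWiki's actual LoggerFactory/MonologSpi code:

```php
<?php
use Psr\Log\AbstractLogger;
use Psr\Log\LoggerInterface;

// Hypothetical sketch: a PSR-3 logger that defers building the real
// (e.g. Monolog) logger until the first message is actually logged, so
// channels that never log anything skip handler/formatter construction.
class LazyLogger extends AbstractLogger {
	/** @var string */
	private $channel;
	/** @var callable */
	private $factory;
	/** @var LoggerInterface|null */
	private $real;

	public function __construct( $channel, callable $factory ) {
		$this->channel = $channel;
		$this->factory = $factory;
	}

	public function log( $level, $message, array $context = [] ) {
		if ( $this->real === null ) {
			$this->real = call_user_func( $this->factory, $this->channel );
		}
		$this->real->log( $level, $message, $context );
	}
}

// Usage sketch: a factory could hand these out instead of fully built loggers,
// e.g. wrapping an Spi implementation such as MonologSpi:
// $logger = new LazyLogger( 'exception', function ( $channel ) use ( $spi ) {
//     return $spi->getLogger( $channel );
// } );
```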