[00:14:28] Reedy: Do circular dependencies on Gerrit work these days?
[00:14:42] Seems to come in handy
[00:18:21] They *should* work.
[00:18:34] though there's a chance you could catch a bug :)
[00:23:45] paladox: Will try
[06:11:16] <_joe_> anomie: oh sigh
[06:11:21] <_joe_> and yes, you're right
[06:41:04] <_joe_> oh well, php can't read a json file created by php's json_encode
[06:41:07] <_joe_> wtf
[10:10:30] <_joe_> anomie: I found a few more brown paper bag fixes to make to both scripts
[10:10:46] <_joe_> thanks for debugging the json_encode issue :)
[12:25:31] <_joe_> I've had a report from coworkers that massmessage is not working for them
[12:25:46] <_joe_> how are messages from massmessage delivered? the jobqueue?
[12:33:32] _joe_: I'm not sure who to poke about this, so I'll tell you and ask you to pass it on if necessary. T221347 seems to have been something weird with the PHP7 opcache: When I looked at the bug (after seeing it mentioned on enwiki), I found that all instances of the "Kogo" exception in kibana were on mw1274 and using php 7.2. Having seen https://wikitech.wikimedia.org/wiki/Debugging_in_production#PHP7_Opcache recently, I tried running `php7adm /opcache-free` on mw1274 and the log entries about the exception immediately stopped. I have no idea how "Kogo" (rather than "Logo") might have gotten into the opcache in the first place.
[12:33:33] T221347: Fatal exception of type "ConfigException" - https://phabricator.wikimedia.org/T221347
[12:37:07] <_joe_> anomie: sigh that is horrible indeed.
[12:37:46] <_joe_> anomie: were all requests to that host failing, or just some? have you checked by any chance?
[12:39:26] _joe_: I don't know how to check that. :( Although it probably would only have affected MediaWiki index.php views that render the skin, while other entry points (e.g. api.php, load.php) shouldn't have hit that code path.
[12:40:58] <_joe_> if it hits index.php, chances are the server would've been depooled if we tested for php7 already :/
[12:41:16] I did try using maintenance/shell.php from the command line on mw1274 (with php7.2) to test the code path, and that didn't reproduce it.
[12:44:31] <_joe_> because opcache is only in php-fpm
[12:44:45] <_joe_> now, at what time did you verify this?
[12:46:04] <_joe_> I can go look at the stats we collect to see if something strikes me as odd around that time
[12:47:50] I don't have timestamps. I saw it somewhere around 11:50 UTC, and ran the php7adm command right around 12:20:39. The earliest log entry for the error is 09:41:10 UTC.
[12:50:23] <_joe_> great, thanks
[12:50:38] <_joe_> I'm now chasing a UBN! bug that might land on your lap
[13:12:46] I'm not sure, but do we think the UrlShortener extension could come in handy for codesearch links?
[13:13:14] Usually, CS links are long and putting them in commit messages is not so cool because of wrapping
[13:13:42] Maybe the shortener can support codesearch links? Maybe Ladsgroup can throw a word or two
[13:18:18] Sorry about the noise, posted here (https://www.mediawiki.org/wiki/Topic:Uy3ap16t24ocy6b0) which I think is the right place. Thanks
[13:28:12] <_joe_> anomie: could we create a simple endpoint for health-checking that would walk through most codepaths?
[13:28:34] <_joe_> I'm thinking of a way to detect and depool such occurrences
[13:29:30] _joe_: Probably not, there are way too many code paths. Although this particular one would be hit by basically any wiki page fetch that renders the skin.
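For context on the opcache flush discussed above: `php7adm /opcache-free` resets the opcache inside the php-fpm workers, which is why testing via CLI `maintenance/shell.php` couldn't reproduce the corruption. A minimal sketch of what an equivalent looks like using PHP's built-in opcache functions; this is an assumed approximation of the admin endpoint's behaviour, not the actual tool, and it only affects the SAPI it runs in.

```php
<?php
// Minimal sketch: inspect and reset PHP's opcache from within the affected
// SAPI (php-fpm). Assumed approximation of what `php7adm /opcache-free` does.

$status = opcache_get_status( false ); // false: skip per-script details
if ( $status !== false ) {
    printf(
        "opcache: %d scripts cached, %.1f MB used, hit rate %.2f%%\n",
        $status['opcache_statistics']['num_cached_scripts'],
        $status['memory_usage']['used_memory'] / 1048576,
        $status['opcache_statistics']['opcache_hit_rate']
    );
}

// Flush every cached script so the next request recompiles from disk.
if ( opcache_reset() ) {
    echo "opcache reset requested\n";
}
```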
[13:30:21] <_joe_> lemme rephrase: something that would include most files :)
[13:30:41] <_joe_> that should be enough to cause an opcache-related issue to surface
[14:41:14] Reedy: Would it be less disruptive to just change the string to Special:BlankPage again?
[14:41:25] Maybe
[14:41:41] Not sure if other things in the ecosystem depend on the new behaviour.
[14:41:50] I know the last two weeks' worth of job-related issues did.
[14:41:53] Pchelolo: Opinion?
[14:42:04] Just make it use Special:BlankPage again? Or revert the patch out in full?
[14:43:01] Reedy: Special:Blankpage should work in theory
[14:43:10] we've also used Special:BadTitle in some places before
[14:43:21] I wonder if the error rate on logstash is visible enough to test that on group0 wikis
[14:44:37] https://logstash.wikimedia.org/app/kibana#/dashboard/5587ec70-d421-11e7-a2bf-bb774cde766e?_g=(refreshInterval%3A(display%3AOff%2Cpause%3A!f%2Cvalue%3A0)%2Ctime%3A(from%3Anow-24h%2Cmode%3Aquick%2Cto%3Anow))
[14:44:41] The amount does look visible enough
[14:46:04] Let's try it?
[14:46:40] Reedy: ye, given that the target is 0 errors
[14:47:18] There does still seem to be some residual jobs from the previous deploy filtering through
[14:47:20] I see a commons one
[14:55:44] Reedy: I only see testcommons
[14:56:00] I was looking at the errors you just said to ignore :)
[14:56:04] legoktm: curious what you think of deprecation policy in context of class members. Is hard deprecation mandatory there (ergo, warnings, via __get?). See also https://phabricator.wikimedia.org/T71939
[14:58:38] lol
[14:58:42] Looks like I need to update the tests too
[14:58:42] 15:48:40 -'someCommand Special: 0=val1 requestId=0a2fad82b51a40fc67f17e7e'
[14:58:42] 15:48:40 +'someCommand Special:Blankpage 0=val1 requestId=0a2fad82b51a40fc67f17e7e'
[14:59:04] Fun.
[14:59:37] Reedy: tests/phpunit/includes/jobqueue/JobTest.php
[14:59:43] Yup, just doing it :P
[15:00:23] Reedy: we should go back to group0 due to the jobqueue issue.
[15:00:31] oh, I think that was done. cool.
[15:00:35] Yes, lol
[15:02:39] Reedy: Using Blankpage I think might be a problem. Other changes have been made explicitly expecting ''.
[15:02:44] Can we keep it for a few hours while we fix it?
[15:02:54] Sure
[15:02:58] Well, kinda
[15:03:04] It's not just the MW side that's broken
[15:03:19] It needs actual job queue side stuff changing too, it seems
[15:04:41] Yeah, prolly just a patch to the EventBusJobQueue subclass.
[15:04:49] to match core. but… will stop talking now
[15:06:16] So, we can't fix MW master/wmf.1 without a corresponding fix in EventBus, and we can't revert either?
[15:06:35] Apparently?
[15:09:09] * James_F sighs.
[15:13:36] Yes, the new signature has a bug. It will require a change in core and/or EventBus's subclass of core's class. not sure which yet.
[16:32:57] Krinkle: Do you feel comfortable with us landing your fix now? I'd really like to unblock the train, it's holding back the SDC feature release.
[16:33:16] James_F: no, not a risk of more unrecoverable loss of jobs.
[16:33:34] As soon as Aaron's up and he approves.
[16:33:47] and he's now up.
[16:34:00] at*
[16:34:22] OK.
[16:34:41] <_joe_> James_F: we've already had a 16-hour interval in which many jobs have failed
[16:34:50] <_joe_> I'd rather play this safe
[16:35:09] Fair.
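On the class-member deprecation question raised above (hard deprecation with warnings via __get): a minimal sketch of the pattern, with a hypothetical class and property name. MediaWiki core has its own helpers and conventions for this (e.g. wfDeprecated()); the bare-bones idea is simply to keep the old member working while warning loudly on access.

```php
<?php
// Hedged sketch of hard-deprecating a public class member via __get.
// Class and property names are hypothetical illustrations.

class Widget {
    /** @var string Replacement for the deprecated public $color member */
    private $colour = 'blue';

    public function __get( $name ) {
        if ( $name === 'color' ) {
            // Hard deprecation: warn loudly, but keep returning the value
            // for one release cycle so callers have time to migrate.
            trigger_error(
                'Widget::$color is deprecated, use Widget::getColour()',
                E_USER_DEPRECATED
            );
            return $this->colour;
        }
        trigger_error( "Undefined property: Widget::\$$name", E_USER_NOTICE );
        return null;
    }

    public function getColour(): string {
        return $this->colour;
    }
}

$w = new Widget();
echo $w->color; // still works, but emits a deprecation warning
```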
[19:36:11] _joe_:
[19:36:12] [{exception_id}] {exception_url} ErrorException from line 73 of /srv/mediawiki/php-1.34.0-wmf.1/maintenance/language/generateUcfirstOverrides.patched.php: PHP Notice: Undefined variable: datai
[19:36:14] from you?
[19:36:35] ignoring for now, maybe a bug in the core script.
[20:07:30] <_joe_> that's probably me modifying the script, yes :)
[21:12:38] James_F: last week paid off https://docs.google.com/document/d/1SESOADAH9phJTeLo4lqipAjYUMaLpGsQTAUqdgyZb4U/edit
[21:13:11] although gzip is great, and takes away the cake, making it far less of a win than I had hoped, but still good.
[21:13:20] it doesn't show the wins on the HTML size btw, just the startup module.
[21:15:22] Nice. Are we aiming for 100k? 90k? ;-)
[21:15:36] 28 K, compressed.
[21:15:53] Is that some magic frame size?
[21:17:57] not really, but it's how large the startup module was when we began. Which is how much bandwidth we previously consumed during the first roundtrip that competes with the HTML bandwidth.
[21:18:59] In terms of overall bandwidth and page load time, we've not added any cost, in fact it's lower (3 RTT down to 2 RTT overall; time to "start fetching page modules" from 2 RTT with 79K down to 1 RTT with 34K)
[21:19:05] Right.
[21:19:16] But lower is better.
[21:19:24] but the merging of mediawiki.js into startup.js did mean competing a bit more with early HTML transfers.
[21:19:39] and yeah, just feels good to be back where we were, and most importantly, because there's so much low-hanging fruit still.
[21:19:53] Obviously, if we moved to a SPA we'd ship 500B as base. ;-)
[21:19:54] It's quite strange that we transfer 20K just to state "here is what we /could/ load"
[21:20:05] Yeah.
[21:20:43] well, SPA means heavy JavaScript. I guess nowadays SPA is associated with PWA (e.g. service workers, heavy JS outside the main thread).
[21:20:50] We're already ahead of that in a way by having all JS async.
[21:21:06] so de facto, compared to that our JS cost is near zero (maybe the HTML script tag counts)
[21:21:47] the project for this spreadsheet is over from last quarter, right now I'm just using it as a proxy to measure progress on reducing the startup registry, which was actually a separate task.
[21:21:59] aka Wikibase/CX module registrations.
[21:22:42] Sure, but we ship a fair amount of content on every page load that is mostly static between pages.
[21:23:40] yeah, the HTML boilerplate for a skin isn't free, but overall it's surprisingly small really. It was more significant 10 years ago than today. Now it's comparable to 10 pixels in height of an infobox thumbnail.
[21:25:20] in terms of performance, we'd still need the browser to re-parse that every page view. so no render time win (that is, browsers are already super fast at processing that, and we already put above-the-fold html first in the output). overall bandwidth cost would be down a bit, especially for repeat/deep sessions. But then again, if the industry average is anything to go by, that service worker would likely be a 3 megabyte webpack bundle. And you'd need to view a lot of pages to make up for that.
[21:26:43] James_F: but, there's a lot of other great things we can do with SPA/PWA methodologies. E.g. central notice banners that render without jumps. and offline reading. and (my favourite) page view render times for logged-in users comparable to logged-out users.
[21:33:25] Now you're just dreaming. :-)
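For reference on the sizes quoted above (79K raw, ~28K compressed for the startup module), a rough way to reproduce that kind of measurement. The load.php URL and query parameters are an assumption based on a standard MediaWiki startup request, not the exact request used for the spreadsheet.

```php
<?php
// Sketch: compare raw vs gzipped size of the ResourceLoader startup module.
// URL is an assumed example of a standard startup request.
$url = 'https://en.wikipedia.org/w/load.php?modules=startup&only=scripts&lang=en&skin=vector';
$body = file_get_contents( $url );
if ( $body === false ) {
    fwrite( STDERR, "Fetch failed\n" );
    exit( 1 );
}
$raw = strlen( $body );
$gzipped = strlen( gzencode( $body, 9 ) );
printf( "startup module: %.1f KB raw, %.1f KB gzipped\n", $raw / 1024, $gzipped / 1024 );
```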
[21:35:22] (Also dynamic content insertion/removal from the DOM could reduce "actual" content data transfer massively, with e.g. https://www.youtube.com/watch?v=UtD41bn6kJ0&list=PLNYkxOF6rcIDjlCx1PcphPpmf43aKOAdF&index=18&t=998s )
[21:40:30] Yeah, transferring articles in chunks or more generally reducing the level on which we manage a piece of content would help a lot.
[21:41:25] also in context of UX for editing, history review, attribution, user reputation, etc.
[21:41:44] and section editing kind of does some of that, but I think we can do a lot more.
[21:42:07] especially if we want to one day allow layouts to be a thing separate from document flow.
[21:43:18] for things that are either on top, bottom or on the side; moving them out of wikitext would work (and in some ways, already works today, with hacks). But it doesn't really work for neatly interleaving different kinds of content while still being associated. Anyway, there's more than one way to solve that.
[21:51:09] Yup.
[21:52:43] However, for prose content (Wikipedias, Wikivoyages, etc.) we chunk content pretty heavily at the "section" level. See MF's local work on this; doing it in core shouldn't be too hard. Big issues are page-wide DOM effects from inline/unscoped styles. If we prevent those (balanced templates, no inline styles via CSP) it'll Just Work™. Maybe.
[22:22:57] James_F: hey regarding kubernetesization of services, yay! :) will things like mediawiki job queue runners live in k*s land in near term also or is that something for later? just curious about the possibilities for scaling usage for stuff like the video scalers where usage spikes up and down depending on whether somebody's got a big batch of uploads
[22:23:30] (there's also changes i can make down the line to make the work chunks smaller, which can dynamically scale better)
[22:33:34] brion: Quick answer is "we don't know".
[22:33:36] AaronSchulz: are you around? Have some questions about ChronologyProtector
[22:33:48] :)
[22:33:53] James_F: no rush then :D
[22:34:51] brion: Migrating the MW/parser "Parsoid" service/config/extensions/skins monolith into containers poses a bunch of big challenges, not least of which is our current resource-tethering of specialised jobs to specialised machines (videoscalers is the most notable but not the only example).
[22:35:21] So, we'll see. Right now we're just focused on RESTful services.
[22:36:33] Automagical scaling up and down resources for different parts of the cluster to meet demand (e.g. more API servers over"night" as Google hits us or whatever, or more videoscalers during a big batch upload) would be a great outcome, yes.
[22:38:19] yeps. for the futures!
[22:41:43] Rebuilding a massive docker image with 5 million lines of code and ~1GB of i18n CDBs and auto-deploying that to production on every merge commit is a bit scary to consider, however. :-)
[22:52:50] SMalyshev: I'm stuck in CR land atm
[23:04:10] Woah, I take it back. Not 1GB of CDBs, 1.6GB plus 1.9GB of build files (which we're also scapping to each host, WTF?).
[23:10:09] brion: would be great for video scalers indeed. note that for most jobs, however, we're not CPU bound (I tried disproving this a few times and was disappointed). I proposed a few times to allocate a spare server to job runners but was generally recommended against, given they're rarely at peak CPU even now. It's mostly limited by DB throughput for master writes and graceful waits for replication.
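A hedged illustration of the "graceful waits for replication" that bounds job throughput, as mentioned above: a write-heavy job that works in batches and waits for replicas to catch up after each one. The job class, batch source, and per-item write are hypothetical; the `waitForReplication()` call on MediaWiki's load-balancer factory is the real mechanism (circa MediaWiki 1.34), though the surrounding code is only a sketch.

```php
<?php
// Sketch of a write-heavy MediaWiki job that is limited by replication lag
// rather than CPU: write a batch on the primary DB, then wait for replicas.
// ExampleBatchJob and doWriteForItem() are hypothetical names.

use MediaWiki\MediaWikiServices;

class ExampleBatchJob extends Job {
    private const BATCH_SIZE = 100;

    public function run() {
        $lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
        $items = $this->params['items'] ?? [];

        foreach ( array_chunk( $items, self::BATCH_SIZE ) as $batch ) {
            foreach ( $batch as $item ) {
                $this->doWriteForItem( $item ); // hypothetical primary-DB write
            }
            // The "graceful wait": block until replicas have caught up,
            // so a burst of jobs cannot outrun replication.
            $lbFactory->waitForReplication();
        }
        return true;
    }

    private function doWriteForItem( $item ) {
        // ... primary database write for one item ...
    }
}
```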
[23:10:46] being able to allocate our 90% idle use on app servers to job queue does seem like a dream, but in practice wouldn't work in general, except for certain job types. transcode seems like a perfect fit indeed.
[23:10:50] *nod* a lot of jobs have a lot of waiting on the db, i'd imagine
[23:11:21] we could by default just have prod be a transcoder that sometimes yields to people reading and editing this thing called wikipedia.
[23:11:40] lol
[23:15:00] Sanity check that php-{}/cache/l10n (prod i18n) should be scapped to production servers but php-{}/cache/l10n/upstream (deployment server i18n build files) shouldn't? I'll go file a task. That's 2GB to each host every full scap we could save.
[23:15:50] hopefully by the time we get to packaging wmf-mw-deployment, we'll be on .php files for l10n.
[23:15:58] Krinkle: Good joke.
[23:16:02] which despite what it sounds like is an improvement.
[23:16:13] We've talked about that for >7 years now.
[23:16:47] I don't think it's been quite that long (I mean LCStore php array, not the original Messages*.php), but yes, it's been a while.
[23:16:56] * James_F nods.
[23:17:02] most of that was blocked on HHVM thinking it's okay to not have garbage collection for compiled code.
[23:17:07] :D
[23:17:37] which was fixed 1.5y ago, and enabled in prod 2-3 months ago.
[23:18:11] the next step is scap being a bit more flexible about how it builds l10n files (basically, let mw-config and the maintenance script control it, instead of it).
[23:18:22] then we can try it again for a test wiki.
[23:18:36] tyler started on the scap work I believe
[23:18:49] https://phabricator.wikimedia.org/T99740
[23:19:22] Right.
[23:19:24] or mukunda. not 100% sure.
[23:19:29] can't find the commit.
[23:19:30] anyway
[23:20:13] we might end up having .php l10ncache in 1.34 stable before wmf prod (as stock default).
[23:20:18] which would be weird perhaps, but still cool.
[23:20:41] right now the stock default is a mysql table...
[23:25:24] Filed as T221428.
[23:25:24] T221428: Scap should only sync built CDB files to production appserver hosts, not the build files as well - https://phabricator.wikimedia.org/T221428
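For reference on the LCStore discussion above (CDB files vs. PHP-array files vs. the stock MySQL-table default): the localisation cache backend is selected via `$wgLocalisationCacheConf` in LocalSettings.php. A hedged sketch of the relevant settings follows; the option values reflect my understanding of the 1.33/1.34-era configuration and may differ between versions, so treat this as an illustration rather than authoritative documentation.

```php
<?php
// LocalSettings.php sketch: choosing the localisation cache (LCStore) backend.
// Values as understood for MediaWiki ~1.33/1.34; verify against your version.

// Stock default at the time: l10n strings cached in a database table.
$wgLocalisationCacheConf['store'] = 'db';

// What WMF production used: CDB files under the cache directory (cache/l10n).
// $wgLocalisationCacheConf['store'] = 'files';
// $wgCacheDirectory = "$IP/cache";

// The ".php files for l10n" being discussed: static PHP arrays, which the
// opcache can keep compiled in memory (viable once HHVM's lack of garbage
// collection for compiled code was no longer a blocker).
// $wgLocalisationCacheConf['store'] = 'array';

// Rebuild the cache after switching backends:
//   php maintenance/rebuildLocalisationCache.php --force
```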