[11:22:22] Reedy: Hi. Regarding https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/512545, any ideas for a sub-dir name? I’m out of ideas here. Thanks!
[11:22:40] But I’ve addressed the issue related to the .phpcs.xml file. Thanks for that!
[12:27:59] xSavitar: JsonSchema or similar?
[12:28:17] Considering it came out of "JSON Schema Validation Library" in one file ;)
[12:28:27] Just seemed... sensible to try and keep them "together" ish
[12:42:07] Thanks Reedy. I used “JsonSchemaValidation”, then moved “libs” to “Libs” and our sub-dir is inside it.
[14:05:11] anyone got time to review a simple config patch? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/516055
[14:05:24] the Matrix trial is blocked on this
[14:11:09] thx Reedy!
[14:11:43] tgr: Have you checked what actually uses that docroot?
[14:11:57] As that .well-known might be exposed on a few more domains than expected
[14:13:14] donate, vote, private sites... wikimania, chapters...
[14:13:19] tgr: Where are you wanting it to appear?
[14:13:23] https://github.com/wikimedia/puppet/blob/b51c793b0c586a0ee88444874ba831512d26f9a7/modules/mediawiki/manifests/web/prod_sites.pp
[14:13:39] uh, no, I assumed that's just the main wikimedia.org
[14:14:17] as in https://www.wikimedia.org/ ?
[14:14:18] if that's not the case, deleting the redirect to the standard docroot is probably a bad idea
[14:14:21] yeah
[14:14:46] or at least the standard root has more stuff
[14:15:12] https://github.com/wikimedia/puppet/blob/b51c793b0c586a0ee88444874ba831512d26f9a7/modules/mediawiki/templates/apache/sites/included/www.wikimedia.org.conf.erb
[14:15:19] https://github.com/wikimedia/puppet/blob/b51c793b0c586a0ee88444874ba831512d26f9a7/modules/mediawiki/templates/apache/sites/included/www.wikimedia.org.conf.erb#L2
[14:15:28] I think you need to put it in docroot/wwwportal
[14:16:00] hm, ideally it should be plain wikimedia.org, not www.wikimedia.org
[14:16:04] is that an alias?
[14:16:24] it 301s to Location: https://www.wikimedia.org/
[14:18:16] <_joe_> uh yeah that's not what you want
[14:18:33] <_joe_> that patch I mean
[14:19:20] <_joe_> tgr: does the ticket have a precise description of the desired behaviour?
[14:23:46] hm, RFC 8615 says nothing about whether a .well-known URI can redirect to a different domain
[14:24:20] _joe_: https://phabricator.wikimedia.org/T223835#5230126 is the accurate description
[14:32:23] I guess I can just put it under wwwportal and see if Synapse accepts that
[14:33:05] <_joe_> yeah
[14:33:14] <_joe_> that's the easiest way to find out
[14:33:31] <_joe_> in theory wikimedia.org/.well-known will redirect to the right place
[14:34:15] it will, it's just unclear whether you can redirect a .well-known lookup
[14:34:51] the RFC says "dereference the URI", so I guess that's a sort-of yes
[14:35:21] IIRC LE (not the same, I know) does follow redirects when fetching .well-known/acme-challenge
[15:20:24] Reedy: updated the patch, could you re-review?
[15:50:42] wwwportal is used by wikipedia etc. as well, right?
[18:01:51] Krinkle: I think so. The modular.im server will not accept Matrix IDs other than wikimedia.org though, so no harm done. And I don't think part of the docroot could be limited to wikimedia.org without adding a bunch of complexity.
[18:08:27] tgr: ugh, do we not have the symlinks anymore for the www portals?
[18:08:29] Ah, that's too bad.
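(Editor's note: a minimal sketch of how the "can a .well-known lookup follow a redirect?" question above could be checked by hand. It assumes the file in question is the Matrix server delegation document at /.well-known/matrix/server; that path, the "m.server" key, and the hostname are illustrative assumptions, not taken from the actual patch.)

```php
<?php
// Sketch: verify that a .well-known lookup on the bare domain still resolves
// after the 301 to www. The path and the "m.server" key are assumptions based
// on the Matrix server delegation format, not on the patch under review.
$url = 'https://wikimedia.org/.well-known/matrix/server';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true, // "dereference the URI" per RFC 8615, i.e. follow the 301
    CURLOPT_MAXREDIRS      => 5,
]);
$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$final  = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

$data = json_decode((string)$body, true);
if ($status === 200 && isset($data['m.server'])) {
    echo "OK: {$final} delegates to {$data['m.server']}\n";
} else {
    echo "FAIL: HTTP {$status} at {$final}\n";
}
```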
[18:11:40] Krinkle: this is my first time looking at the docroot code, maybe I missed something
[18:11:54] I'm just about to deploy that patch, should I hold it back?
[18:12:36] Docroots are a bigger ongoing problem being cleaned up
[18:12:59] I think this version of the patch is definitely less "damaging" (very loosely used) than the old version, which was turning a symlink back into an actual dir etc :)
[18:13:06] tgr: seems fine to me. It will mean it gets exposed to other domains, though.
[18:13:14] but assuming that doesn't allow any abuse, fine by me.
[18:50:35] TimStarling: I'm noticing that from excimer we're getting consistently more samples per day than from HHVM for load.php. Given relatively low deployment so far, this took me by surprise. Not yet sure what to make of it.
[18:50:47] One theory is that it's performing slower overall
[18:52:05] <_joe_> Krinkle: not what I observed when inspecting the apache log, but we can dig deeper
[18:52:05] the interval/probability of sampling happening is roughly equal, right, or did we intentionally increase it?
[18:52:54] <_joe_> (also Tim is in a meeting which you were invited to :P)
[18:53:28] ugh, my phone died, lost the notif. See it now.
[18:53:30] oops
[18:53:45] <_joe_> :D I assumed you were overworked
[18:54:09] well, wmf.10 hasn't been light, but I could've made half an hour of time if I'd realised.
[18:54:11] oh well
[18:58:47] <_joe_> Krinkle: from a random appserver
[18:58:54] <_joe_> fgrep load.php other_vhosts_access.log | fgrep proxy:fcgi://127. | awk 'BEGIN {count=0; sum=0} { count +=1 ; sum +=$2 } END {print sum/count/1000}'
[18:58:56] <_joe_> 54.6295
[18:59:15] <_joe_> (that is hhvm, also please Kernighan forgive me for that savage use of awk)
[18:59:23] <_joe_> fgrep load.php other_vhosts_access.log | fgrep -v proxy:fcgi://127. | awk 'BEGIN {count=0; sum=0} { count +=1 ; sum +=$2 } END {print sum/count/1000}'
[18:59:25] <_joe_> 50.8395
[18:59:27] <_joe_> this is php7
[19:00:01] <_joe_> so load.php is slightly faster on php7
[19:00:14] <_joe_> I think that difference can be explained by php7 having only "real" users
[19:00:28] <_joe_> as in people who use user agents with cookie support
[19:00:44] <_joe_> I don't think most scrapers use load.php a lot
[19:04:05] _joe_: hm.. I'd be interested in those matching 'modules=startup&' and then e.g. the slowest 100, and compare that.
[19:04:36] * Krinkle looks for a mw server to pick
[19:04:42] <_joe_> Krinkle: may I ask you to open a ticket? although I think you can read those files as well
[19:04:52] yeah, checking now
[19:04:53] thx
[19:04:58] <_joe_> it's a bit sad we still don't have better instrumentation there
[19:05:34] <_joe_> but I have to choose which battles to die for. My team lacks personnel :/
[19:05:52] any of these should have some php7 traffic, right?
[19:05:53] https://github.com/wikimedia/puppet/blob/22b0c95bf1aac03e5c468543bd966a97b48184a4/conftool-data/node/eqiad.yaml#L5
[19:05:56] (not including api or jobs)
[19:06:09] <_joe_> yep
[19:06:26] <_joe_> if you want more data, use mw1333
[19:06:31] <_joe_> or another of the most recent ones
[19:07:11] <_joe_> keep in mind I do expect php to be slightly faster than hhvm with load.php, that was the result I found back when I was benchmarking
[19:07:28] <_joe_> but real-world data might have a few outliers we didn't consider
[19:08:07] yeah, I'm particularly thinking about when it's populating caches after a deploy
[19:08:18] <_joe_> (also, use the 1% slowest requests, not the 100 slowest, that's like the 0.001 percentile)
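(Editor's note: a sketch of the tail comparison being discussed, in PHP for convenience. It assumes the same log layout as the awk one-liners above, namely that the second field of other_vhosts_access.log is the request duration in microseconds and that the presence of `proxy:fcgi://127.` marks an HHVM-served request; the log path is illustrative.)

```php
<?php
// Sketch: p99 duration of load.php startup-module requests, split HHVM vs PHP7.
// Field layout and backend marker are assumptions taken from the awk commands above.
$log     = '/var/log/apache2/other_vhosts_access.log'; // illustrative path
$buckets = ['hhvm' => [], 'php7' => []];

$fh = fopen($log, 'r');
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'load.php') === false || strpos($line, 'modules=startup&') === false) {
        continue; // only startup-module requests
    }
    $fields  = preg_split('/\s+/', trim($line));
    $ms      = (int)$fields[1] / 1000; // microseconds -> milliseconds
    $backend = strpos($line, 'proxy:fcgi://127.') !== false ? 'hhvm' : 'php7';
    $buckets[$backend][] = $ms;
}
fclose($fh);

foreach ($buckets as $backend => $times) {
    if (!$times) {
        continue;
    }
    sort($times);
    $p99 = $times[(int)floor(0.99 * (count($times) - 1))];
    printf("%s: n=%d p99=%.1fms max=%.1fms\n", $backend, count($times), $p99, end($times));
}
```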
[19:08:50] _joe_: I'll reply further on "the" phab task, but RE: php 7 and apc perf etc., sorry for the wording there, my intention was to clarify my current comfort level and the status quo as I understand it. It was by no means a description of something we planned or think is reasonable.
[19:08:54] <_joe_> that should be equally miserable on both hhvm and php I guess, but one never knows at that level of detail
[19:09:07] <_joe_> heh ok
[19:09:42] <_joe_> anyways I think for now (that is, until we do something better than running a cron to check if a restart is needed) the restarts will happen less often
[19:09:50] <_joe_> than each 4-5 days
[19:09:57] A quick fix for auto scaling, if we don't have time to do it right, would be to start doing a hundred-request warm up between (re)start and pooling. or so
[19:10:35] which could also dual-purpose as health validation, e.g. they must all come back 200 OK.
[19:12:20] <_joe_> so about that, I had an idea
[19:12:38] <_joe_> it's possible to pass php-fpm a preload script IIRC
[19:12:55] <_joe_> that should allow to load all files in the opcache *before* we start answering requests
[19:13:27] yeah. it could also potentially do a number of faux requests
[19:13:31] <_joe_> but I thought of that as "later, when we get to 2nd-level optimization"
[19:15:53] e.g. top 20 wikis (for l10n, wmf-config, and extension-specific), each: 2-3 load.php urls, MainPage view/edit/history, Special:RecentChanges. That should cover a decent amount.
[19:19:22] <_joe_> no, the issue with opcache preloading is it becomes static, pretty much like HHVM's repoauth mode AIUI
[19:21:44] _joe_: right, that would be a narrow subset only then.
[19:21:47] I was thinking about apc warmup
[19:22:09] <_joe_> yeah that could be done as well as part of the startup process
[19:22:15] <_joe_> probably in scripting
[19:23:12] _joe_: hhvm has support for it as well, ori wrote it into puppet at some point if I recall correctly, including a CGI tool called "furl"
[19:23:21] it was enabled at some point and then it wasn't.
[19:23:24] don't recall the details
[19:23:48] <_joe_> hhvm supposedly has support for warmup scripts, but that never worked very well
[19:24:47] <_joe_> ori then tried to use upstart to fix that, but not with great effect. On hhvm, compiling the code to bytecode is way more expensive at first, also because the jit still has to optimize things
[19:25:31] tl;dr? why are you considering a warm-up script for php7?
[19:29:43] ori: tl;dr; the apc overall size limit was going to be significantly smaller (no longer INF until OOM), but has been raised since. Next problem was opcache being very unstable (it corrupts itself whenever it tries to clear itself, so we no longer clear it on deploy but use a short ttl instead, at which point it will discover the change lazily; but it still corrupts/clears itself when opcache reaches a certain size, and this happens at unpredictable times).
[19:29:43] So we routinely restart, which means resetting APC implicitly as well, and as such less lifetime/efficiency for APC.
[19:29:54] also it means opcache being empty happens more often (e.g. right after restart)
[19:29:59] so a warmup would help.
[19:30:07] to avoid live traffic hitting it cold
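(Editor's note: a minimal sketch of the "faux requests" warmup idea described above: replay a fixed set of requests against the local backend between restart and pooling, and treat anything that is not a 200 as a failed health check. The wiki list, paths, and Host-header routing are illustrative assumptions, not an agreed-upon implementation.)

```php
<?php
// Sketch of a pre-pool warmup: hit a fixed set of URLs on the local backend so
// APC/opcache are no longer cold when live traffic arrives. Wikis and paths are
// illustrative ("top 20 wikis" in practice).
$wikis = ['en.wikipedia.org', 'de.wikipedia.org', 'commons.wikimedia.org'];
$paths = [
    '/w/load.php?modules=startup&only=scripts',
    '/w/load.php?modules=site.styles&only=styles',
    '/wiki/Main_Page',
    '/wiki/Special:RecentChanges',
];

$failures = 0;
foreach ($wikis as $wiki) {
    foreach ($paths as $path) {
        $ch = curl_init("http://127.0.0.1{$path}");
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER     => ["Host: {$wiki}"], // route to the right wiki on the local backend
            CURLOPT_TIMEOUT        => 30,
        ]);
        curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        if ($status !== 200) {
            $failures++;
            fwrite(STDERR, "warmup: {$wiki}{$path} returned {$status}\n");
        }
    }
}

// Dual-purpose as health validation: only repool the server if everything came back 200 OK.
exit($failures === 0 ? 0 : 1);
```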
[19:30:28] _joe_: numbers from php7/hhvm for load.php startup module - https://gist.github.com/Krinkle/9c52bc0432214b18b1bdbb8477bf048f
[19:30:32] both tail and head look better on php7
[19:30:43] (and the average we looked at already was better as well, 50 vs 55 ms)
[19:31:14] <_joe_> wow the tail is crazy
[19:31:18] <_joe_> but to be fair
[19:31:31] <_joe_> we need one appserver all-on-php before we make a fair comparison
[19:31:42] <_joe_> php-fpm is operating at lower concurrency than hhvm
[19:31:59] yeah, and also unlikely to hit all the same edge cases of super fast/slow.
[19:32:04] lower probability
[19:32:22] <_joe_> so yeah, on monday we can switch one
[19:32:30] the jump from <1 to >100ms on php7 is suspect as well.
[19:32:34] is there an upstream php bug for the opcache corruption?
[19:32:40] yes
[19:32:45] <_joe_> ori: how many hours do you have?
[19:32:55] <_joe_> there's a whole literature on the topic on bugs.php.net
[19:33:00] https://phabricator.wikimedia.org/T221347 / and now at https://phabricator.wikimedia.org/T224491
[19:33:07] has refs to php.net bug(s)
[19:33:28] https://phabricator.wikimedia.org/T224491#5218130
[19:34:05] <_joe_> so one thing I wanted to try was to drain traffic from a server, reset opcache, repool
[19:34:25] <_joe_> that should reduce the possibilities of the bug happening, but it's still playing a game of chance
[19:35:11] <_joe_> Krinkle: oh you should probably only grep for successful fetches
[19:35:31] <_joe_> maybe the super-fast request is a 404
[19:35:36] Yeah, didn't filter by status code.
[19:35:40] well, 304 or 500 more likely
[19:35:46] <_joe_> or that, sure
[19:35:50] I don't think my regex could match a 404
[19:36:55] <_joe_> ori: the only way to debug this would be to try to reproduce the bug on a test system with opcache memory protection activated, so that any problem would cause a segfault AIUI
[19:37:06] <_joe_> thus giving us a core dump
[19:37:32] "We do have this problem too on production, during heavy loading we got errors about sql queries that get mystery changed to "SELECT * FRPM" (notice the P), or links in our pages with www changed to vww." – php.net
[19:37:37] welp, that's not at all terrifying
[19:37:47] <_joe_> but I've seen so many of these bugs open, resolved, reopened, re-resolved that I think opcache_reset is just broken
[19:38:03] ori: yeah, I had the same reaction. Note that DELETE and SELECT have the same number of characters
[19:38:19] <_joe_> i was just naive in assuming it would work i guess
[19:38:39] <_joe_> i should know better, 10 years in my career
[19:41:19] <_joe_> ori: the worst part is - to the best of my debugging abilities - the only way to detect this is by looking at the error log.
opcache_info seems perfectly ok at a general look
[19:41:31] <_joe_> when the corruption happens
[19:43:09] _joe_: the varnish fragmentation we have between hhvm and php7, assuming that still exists, how widely is that applied? all app server requests?
[19:43:38] thinking if it influences what we get on the backend wrt sampling of load.php, given it has a 99.9.. cache hit ratio in Varnish
[19:44:42] <_joe_> all requests are split between php7 and hhvm at the varnish layer for appservers
[19:45:39] <_joe_> so yeah, load.php has different caches, but at the current traffic split I'd be surprised if the cache hit ratio was bad enough to influence the results more than the users' sampling being limited to browsers with js and cookies
[19:46:56] <_joe_> which are bound to require load.php more often than other requests, and will also include a higher mix of logged in users vs logged out
[19:47:11] _joe_: hm.. so we sample after the first view, which means they do have warmed caches from then on, which means there is a reduced request rate overall (that is, after enabling php7 in your session, you won't make load.php requests with it unless they are for something different and/or something expires).
[19:47:24] load.php isn't split by cookies.
[19:48:02] from the client perspective that is; for us, we'd split it, but the client won't even ask us.
[19:48:22] what percentage of web requests are served by php7?
[19:48:23] that's fine and should only influence overall rate, not perf distribution.
[19:48:40] afaik 20% + 1 whole app server. Don't know if that's up to date.
[19:48:58] 20% being, 20% crypto random in JS on the first view in a session.
[19:49:05] for subsequent views.
[19:50:22] <_joe_> ori: 20% of requests from clients which execute js and store cookies, + 1 api appserver (because on api that fraction is important)
[19:50:59] <_joe_> ori: that means around 10% of the total requests, see https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&orgId=1
[19:52:09] if you're seeing silent code/data corruption I'd consider dialing that back down to 0 temporarily :(
[19:52:52] <_joe_> we're not, we stopped using opcache_reset()
[19:53:02] <_joe_> or we would've
[19:53:38] <_joe_> right now we keep the opcache free space in check, and (soon) we'll have a cron to do that for us (meh, I know)
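(Editor's note: for reference, a sketch of what a free-space check like the one mentioned above might look like. The thresholds and the restart decision are assumptions, not the actual cron; it would also have to run inside the php-fpm pool, e.g. via a local admin endpoint, since a CLI invocation sees its own separate opcache.)

```php
<?php
// Sketch: decide whether php-fpm needs a depool/restart/repool based on opcache
// pressure, rather than ever calling opcache_reset(). Thresholds are illustrative.
$status = opcache_get_status(false);
if ($status === false) {
    exit(0); // opcache disabled, nothing to do
}

$mem   = $status['memory_usage'];
$keys  = $status['opcache_statistics'];
$total = $mem['free_memory'] + $mem['used_memory'] + $mem['wasted_memory'];

$freeRatio = $mem['free_memory'] / $total;
$keyRatio  = $keys['num_cached_keys'] / $keys['max_cached_keys'];
$wastedPct = $mem['wasted_memory'] / $total * 100;

// Restart before opcache decides to clear itself at an unpredictable moment.
$needsRestart = $freeRatio < 0.10 || $keyRatio > 0.90 || $wastedPct > 10;

echo json_encode([
    'free_ratio'    => round($freeRatio, 3),
    'key_ratio'     => round($keyRatio, 3),
    'wasted_pct'    => round($wastedPct, 1),
    'needs_restart' => $needsRestart,
], JSON_PRETTY_PRINT), "\n";
exit($needsRestart ? 1 : 0);
```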
[19:53:55] <_joe_> I still think we need a better solution before moving to higher percentages
[19:54:03] <_joe_> and well, we need to deploy wmerrors
[19:54:09] I had the same feeling, and overall we've had a few preventive dial-backs whenever something came up in related investigations. But I've not found new reports in Logstash of non-silent corruption since we stopped using resets post-deploy /and/ started doing the rolling restarts instead.
[19:55:01] If there were non-zero loud errors, I'd expect there to be more silent errors, but so far no more loud errors. Which at least for now made me comfortable to remain at this level.
[19:55:30] at the same time, given we don't dial it further, it's also not entirely clear whether there is still more to learn from it at this percentage, or whether we've learned all we can.
[19:57:50] <_joe_> I think we're just waiting for wmerrors and an agreement on what's acceptable around deployments to move the needle more aggressively
[19:58:43] <_joe_> I'm pretty confident we won't find new bugs now, but running with some load allows me to learn a lot about the general behaviour of some components of the application server
[19:59:51] <_joe_> I'm waiting for the CPT offsite to be over, then I hope wmerrors will be ready and I can have a sit down with release engineering to understand how to proceed re: deployments
[20:00:47] <_joe_> the db config in etcd will remove my biggest concern around deployments, and that's coming in ~1 month or so
[20:01:32] <_joe_> My take is that if we have a reasonable timeframe in which to do something to make our deployments sustainable, we can unblock the migration and we (SRE) can manage the situation in the meantime
[20:02:16] <_joe_> this migration has been so far less bumpy than the HHVM one, thankfully, esp in terms of overall stability of the application platform, and that was expected.
[20:02:38] <_joe_> What I didn't anticipate is some fundamental function like resetting the opcache being utterly broken since its introduction
[20:06:59] you could try forcing a lot of opcode cache churn on a test machine to see if you can trigger the error
[20:08:59] <_joe_> yeah, that's what I plan to do when I have some time to help debug the issue, but I have to plan and work on the assumption that it's not going to be very easy to pin down and fix
[20:09:28] for example, have a php script that defines two global variables: a random string and the (pre-computed) checksum of that string
[20:09:46] <_joe_> my plan was to replay traffic from a real appserver, and run opcache_reset until it segfaults
[20:09:47] update that script in a loop from a test harness
[20:11:34] and meanwhile hammer a php endpoint that loads that script, computes a checksum, and compares it to the precomputed one
[20:11:36] <_joe_> I suspect this to be a race condition, so it's hard for a simplified test setup to cause it to happen
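(Editor's note: a sketch of the harness described above. One writer keeps rewriting a small include with a fresh random string and its pre-computed checksum to force opcache churn, while an endpoint under load includes that file and verifies the checksum, so a corrupted compile shows up as a mismatch. File paths and timings are illustrative assumptions.)

```php
<?php
// writer.php - regenerate the include in a loop to force opcache churn.
$target = __DIR__ . '/payload.php'; // illustrative path
while (true) {
    $string = bin2hex(random_bytes(64));
    $sum    = sha1($string);
    $code   = "<?php\n\$GLOBALS['payload'] = '{$string}';\n\$GLOBALS['payload_sum'] = '{$sum}';\n";
    // Write atomically so the endpoint never includes a half-written file.
    file_put_contents($target . '.tmp', $code);
    rename($target . '.tmp', $target);
    usleep(50000); // ~20 rewrites per second
}
```

```php
<?php
// check.php - hammer this endpoint (e.g. with ab or wrk) while writer.php runs.
// A stale-but-consistent compile still passes; a mismatch points at a mangled
// compile, which is the failure mode being hunted.
require __DIR__ . '/payload.php';

$expected = $GLOBALS['payload_sum'];
$actual   = sha1($GLOBALS['payload']);

if ($actual !== $expected) {
    http_response_code(500);
    error_log("opcache checksum mismatch: expected {$expected}, got {$actual}");
    echo "MISMATCH\n";
} else {
    echo "ok\n";
}
```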
[20:11:38] ori: the other evil is that as of mid-2018, after every scap deploy, there are 2-3 minutes of most live http reqs timing out on HHVM.
[20:12:06] <_joe_> Krinkle: that's because we started to actually enforce timeouts
[20:12:16] <_joe_> in mid-2018, true story
[20:12:27] <_joe_> because they were broken for a long time on hhvm
[20:12:40] <_joe_> so we went with the 3 minutes timeout that apache had
[20:13:07] _joe_: my point isn't that they time out, it's that after a deploy since mid-2018, HHVM is also broken.
[20:13:07] <_joe_> then the bug was fixed upstream and I forgot to enable timeouts
[20:13:13] And that itself is not caused by having timeouts
[20:13:26] <_joe_> no, I think HHVM was broken before as well
[20:13:28] syncing code to mwdebug used to not result in the server being unusable over HHVM.
[20:13:39] <_joe_> oh that's indeed true
[20:13:50] whereas it is now. The only way to test code is to wait 3 minutes or use PHP 7.
[20:13:53] well, the fix for HHVM is "easy", get rid of it
[20:14:07] <_joe_> so, that's one of the things we'll gain by finishing this transition
[20:14:19] <_joe_> I hoped we'd be done in 10 days from now, it will be a bit longer
[20:14:26] <_joe_> I hope 1, max 2 months longer
[20:14:33] The php7 migration is taking longer than expected, otherwise we would've/should've probably looked into what happened to HHVM last year.
[20:14:48] maybe one of the minor upgrades broke its translation cache update process, making it so slow initially
[20:17:06] <_joe_> I don't think I can make any hypothesis about that. The mwdebug servers are heavily underpowered too
[20:18:41] yeah, but we're getting the same fatals from prod after each deploy.
[20:19:15] but yeah, there's some amount of grey area where it could've been a pre-existing issue.
[20:19:48] <_joe_> anyways, php-fpm seems to have overall more boring behaviour
[20:19:56] <_joe_> and boring is good in this context
[20:21:15] <_joe_> I very much want to be done with hhvm at this point. I still think it made sense at the time, and overall it's a much more refined software than php-fpm, but it's clearly tailored to FB's use
[20:24:29] <_joe_> ok, time to go to sleep for me, see you tomorrow!
[20:24:48] oh yeah, it's late for you. good night, nice chatting with you
[20:27:46] Hey guys, I am working on a campaigns fallback feature for CentralNotice and I wonder whether the fallback setting state (which is a global on/off) should be listed in the logs being sent to EventLogging? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/517931/2/resources/subscribing/ext.centralNotice.display.state.js#291
[23:06:41] addshore: what do you mean re: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/443528/
[23:06:43] the link is binned?