[04:23:42] <_joe_> Krinkle: still around?
[14:27:50] _joe_: morning :)
[14:28:08] <_joe_> hey!
[14:28:29] <_joe_> so, I have some good news, and you probably already read the meh ones
[14:28:56] I'm just seeing an IRC ping for a postmerge job completing from a commit I deployed yesterday
[14:29:09] at https://gerrit.wikimedia.org/r/#/c/513495/
[14:29:10] oh well
[14:29:13] I guess it was backed up
[14:29:18] _joe_: tell me more :)
[14:29:50] <_joe_> so on one side, I've disabled opcache resets
[14:30:07] <_joe_> and reduced the revalidation frequency to 10 seconds
[14:30:16] <_joe_> which is meh
[14:30:45] <_joe_> on the other, the all-php7 api server is performing comparably, if not slightly better, than the ones running mostly on hhvm
[14:31:02] <_joe_> I did some api.log scavenging
[14:31:03] _joe_: you mean api.log latencies?
[14:31:05] cool
[14:31:30] <_joe_> with some horror of pipes and grep and awk and perl, all on one line, but still
[14:31:40] what's your impression mostly? shorter tails? lower median?
[14:31:49] <_joe_> that I didn't measure
[14:31:58] average?
[14:32:19] <_joe_> yes I'm looking at averages for now
[14:32:23] k
[14:32:41] <_joe_> it's interesting that I see actions on which the mean is pretty distant
[14:32:41] RE: T224491#5225784 - I didn't notice the timeouts in the php7 filter, thx for catching that
[14:32:42] T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491
[14:32:55] <_joe_> heh this morning the log was sparse
[14:32:58] <_joe_> so I noticed them
[14:33:20] <_joe_> I looked at all the errors since I've disabled the opcache resets
[14:33:35] <_joe_> and I didn't see any new occurrences
[14:33:37] so with opcache resets disabled, does that mean php7 will never do those resets, or do we still get the automatic resets when reaching capacity etc., like some php5-era online reports suggested?
[14:34:03] <_joe_> you get resets if you ever reach the limits of opcache
[14:34:28] <_joe_> but we're going to make opcache large enough and restart the php-fpm server regularly, probably
[14:34:56] <_joe_> and in the near future
[14:35:05] <_joe_> the plan is to make scap restart php-fpm
[14:35:25] <_joe_> so deploy code => depool/restart/repool in batches
[14:35:39] <_joe_> and remove opcache revalidation
[14:37:06] <_joe_> I have all the code to do so ready in spicerack/cookbooks, so I guess we can adapt that to scap somehow.
[14:37:24] <_joe_> that's what I hope to work on next week
[14:42:38] _joe_: right, does a php-fpm restart implicitly mean starting with a fresh opcache?
[14:42:42] (what about apc?)
[14:42:55] <_joe_> both, yes
[14:43:21] <_joe_> but they both get filled up within a few seconds and serve with their usual cache hit ratio
[14:43:41] <_joe_> so I wouldn't be too worried about that aspect
[14:43:49] <_joe_> unless we do a release every minute, that is
[14:44:03] <_joe_> we would probably need to change our ways around SWAT, I think
[14:44:16] <_joe_> so that we do only one big cache invalidation for all of it
[14:45:07] <_joe_> the alternative is to rewrite scap, of course. That could allow us to keep APC running and just swap the path of the opcache, for instance
[14:45:39] the cache hit ratio is not a useful metric from a perf or ux perspective, though. we have a long tail of stuff for localisation (# wikis * # lang codes), and RL minification (# wikis * # lang codes * # skins * # modules) etc.
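To illustrate that long-tail point: the number of distinct cache entries multiplies across those dimensions, so a high overall hit ratio can coexist with a tail that takes far longer than a few seconds to re-warm. All counts in the sketch below are invented for illustration, not real measurements.

```python
# Back-of-the-envelope sketch of the long-tail cache key space described
# above. All counts are invented placeholders; only the structure (keys
# multiplying across dimensions) reflects the point being made.
wikis = 900          # assumed number of wikis
lang_codes = 300     # assumed number of language codes
skins = 4            # assumed number of skins
modules = 50         # assumed number of commonly requested RL module bundles

l10n_entries = wikis * lang_codes
minify_entries = wikis * lang_codes * skins * modules

print(f"localisation cache entries:    ~{l10n_entries:,}")
print(f"RL minification cache entries: ~{minify_entries:,}")
# Even if the most popular keys are re-cached within seconds of a restart
# (the "hit ratio" view), the tail keeps being recomputed long afterwards,
# which is the perf/UX concern about resetting APC and opcache frequently.
```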
[14:46:05] pure speculation at this point, but we've shifted everything over the past few years with the assumption that things like apc can last a week or so at least.
[14:46:33] anything above a week is rare and churns regularly for other reasons, including restarts indeed.
[14:48:02] <_joe_> yeah first of all you're a bit optimistic if you think hhvm lasts a week on average.
[14:48:30] <_joe_> but how's the 95th percentile of RL minification going now?
[14:48:32] I don't know about average, but I would hope that more than 30 weeks a year, uptime is more than 5 days.
[14:48:58] <_joe_> it segfaults with a certain pleasure, just like php5 used to do
[14:50:31] <_joe_> and we restart the api servers every 3 days for memleaks
[14:50:58] <_joe_> like, on a random appserver I just had to log onto
[14:51:16] <_joe_> hhvm has been up for 19 minutes
[14:51:28] ok, well, as long as any intentional changes to apc lifetime are documented on the task, I suppose we can weigh it. right now I don't have time for php7 stuff though, will continue 1.5 weeks from now (today is UBN recovery and finishing annual reviews, next week is the offsite)
[14:53:49] <_joe_> it's ok, next week I think we'll have to work on other things and my offsite is the week after that
[14:54:13] <_joe_> other things == adding to scap the ability to do a rolling restart
[14:54:15] <_joe_> btw
[14:54:21] <_joe_> in a future where we use containers
[14:54:42] _joe_: btw, the mw1348 issue, did you look at that further? Maybe it was a coincidence that it started flapping so soon after going 100% on traffic for that server
[14:54:50] I had to restart it a few minutes after you left yesterday
[14:54:57] https://phabricator.wikimedia.org/T224491#5224470
[14:55:02] <_joe_> such things will happen again and again. Relying on the persistence of a local in-memory cache for performance is not exactly a best practice
[14:55:11] <_joe_> no, it's not a coincidence I think
[14:55:32] <_joe_> at higher traffic, race conditions are more likely to show up, sadly
[14:55:46] right, but there weren't any syncs or opcache clears afaik,
[14:55:46] <_joe_> that's what convinced me we can't try to rely on opcache reloads
[14:55:49] so it scared me a bit
[14:56:11] but maybe it was from another deployment, I don't remember now
[14:56:14] <_joe_> no, there was a sync, I did check the logs for the opcache free
[14:56:17] OK
[14:56:19] cool
[14:56:26] <_joe_> and there was one before the one you reported
[14:56:33] so it'll continue on 100% with the opcache changes from today?
[14:56:34] <_joe_> so it was definitely caused by a cache reset
[14:56:48] <_joe_> yes, the server seems healthy
[14:56:56] okay, and that's the only unsampled server for now?
[14:57:04] <_joe_> yes
[14:57:09] ok.
[14:57:40] <_joe_> if things are calm with the revalidation, we might move forward with moving over more api servers
[14:58:08] <_joe_> for the appservers, we prefer to stick to the a/b test for now, but that might change soon.
[14:58:31] OK. I can't monitor/support for the next 1.5 weeks though, so just keep that in mind :)
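The depool/restart/repool batching described above, which _joe_ plans to add to scap, might look roughly like the following sketch. This is a minimal illustration only, assuming placeholder host names, batch size, drain delay, service unit name, and depool-host/repool-host commands; it is not the actual scap or spicerack code.

```python
"""Illustrative sketch of a rolling depool/restart/repool, as discussed above
for scap/spicerack. Host names, batch size, drain delay, the php-fpm unit
name, and the depool-host/repool-host commands are all assumptions."""
import subprocess
import time

SERVERS = ["mw1.example", "mw2.example", "mw3.example", "mw4.example"]  # placeholders
BATCH_SIZE = 2       # restart this many servers at a time
DRAIN_SECONDS = 30   # assumed time to let in-flight requests finish


def run(cmd):
    """Run a command and fail loudly if it errors."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)


def restart_batch(batch):
    """Depool, restart php-fpm (fresh opcache and APC), then repool a batch."""
    for host in batch:
        run(["ssh", host, "depool-host"])   # placeholder for the real depool mechanism
    time.sleep(DRAIN_SECONDS)               # let the load balancer drain traffic
    for host in batch:
        run(["ssh", host, "sudo", "systemctl", "restart", "php-fpm"])
        run(["ssh", host, "repool-host"])   # placeholder for the real repool mechanism


def rolling_restart(servers, batch_size):
    """Deploy code first, then depool/restart/repool in batches."""
    for i in range(0, len(servers), batch_size):
        restart_batch(servers[i:i + batch_size])


if __name__ == "__main__":
    rolling_restart(SERVERS, BATCH_SIZE)
```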
[14:58:37] <_joe_> Again, only if the instabilities we saw when adding more traffic don't show up again
[14:58:42] <_joe_> FWIW, I'
[14:58:56] <_joe_> m looking at logstash daily, but I was just looking at fatalmonitor
[14:59:35] <_joe_> I did some reading today, and one day when I'm less tired I might propose some ideas for a future way of doing our deployments
[15:52:46] _joe_: Depool deployments sound nice, but it'll take an age (and will mean giving depool authority to deployers, which we currently don't have).
[15:53:13] <_joe_> not depool authority
[15:53:21] <_joe_> what do you mean it'll take an age?
[15:54:23] <_joe_> to implement or to run?
[16:08:15] we had it built into scap once already. That was an ugly hack, but now I think etcd would make it easier to put back in.
[17:07:14] _joe_: Sorry, to run.
[17:44:31] depool+drain alone takes longer than the current full run of scap, though.
[17:44:57] so it'll definitely take longer; whether that's an issue is another matter.
[17:45:28] _joe_: regarding https://phabricator.wikimedia.org/T223952 - is there something mw/perf related you'd like me to look into around there? I have it on my list, but unsure what to do.
[17:45:43] (Increased instability in MediaWiki backends, PyBal)
[17:46:11] <_joe_> no, I think we got the culprit, it was that query mainly
[17:46:51] <_joe_> and it was solved by going to wmf.7
[17:47:13] _joe_: did it exceed the timeout somehow? Or is it that an increased proportion of real traffic was these slow requests? (I thought PyBal was based mainly on a fixed monitoring request)
[17:47:49] <_joe_> the monitoring calls /w/api.php
[17:48:17] <_joe_> and for some reason, due to db slowness probably, those took more than 5s
[17:48:24] interesting.
[17:48:33] ok, so a cascading effect into simple reqs
[17:48:41] <_joe_> so db is slow => connections pile up => response to a stupid req takes 5s
[17:48:57] <_joe_> also the query was on enwiki and we only test enwiki from pybal
[17:49:03] ack, cool, not something we see very often afaik, but good to know that can happen.
[17:49:40] makes me wonder if it makes sense to further separate db interaction, e.g. a db user with a connection limit for web, api, and job separately.
[17:50:08] at least so that it reduces cascading failures. API timeouts are still bad of course, but it keeps things isolated in theory.
[17:51:24] right now we have only three afaik. 1: app servers (web+api+job) maint, 2: schema changes, 3: dumps.
[17:51:51] <_joe_> I would let a dba comment, but it's possible that wouldn't help much
[17:52:18] <_joe_> I have no idea what's the state of the art for resource accounting and rate-limiting on the database side
[17:52:51] <_joe_> truth is, such a request should somehow be recognized as expensive and be sent to specialized hosts, probably?
[17:53:13] <_joe_> api + appserver nowadays are fundamental to the correct functionality of the site
[17:53:32] <_joe_> so you can't just sacrifice one for the other tbh
[17:54:04] <_joe_> it would make sense to have the api calls from "humans" go to the same cluster as the appserver ones
[17:54:30] hm.. maybe, I think 99% of users will never rely on an api.php request.. so api servers could have their plug pulled and users could still read and edit mostly unaffected.
[17:54:35] <_joe_> with that I mean "api requests that are part of serving the content to the users of our sites"
[17:54:36] That seems valuable.
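The PyBal-style check discussed above is conceptually just a fixed HTTP request with a hard 5-second budget, which is why a slow database that makes connections pile up can take otherwise healthy servers out of rotation. A minimal sketch of that idea follows, assuming a placeholder host name and the /w/api.php endpoint mentioned above; this is not PyBal's actual implementation.

```python
"""Conceptual sketch of a fixed health-check request with a 5-second budget,
in the spirit of the monitoring call discussed above. The host name is a
placeholder and this is not PyBal's real code."""
import urllib.error
import urllib.request


def probe(host, timeout=5.0):
    """Return True if /w/api.php answers with HTTP 200 within `timeout` seconds."""
    url = f"http://{host}/w/api.php?action=help&format=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        # When the database is slow and connections pile up, even this trivial
        # request can exceed the budget, so the server gets marked unhealthy
        # and the DB problem cascades into the load balancer layer.
        return False


if __name__ == "__main__":
    print("healthy" if probe("mw-appserver.example") else "unhealthy or slow")
```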
[17:54:52] <_joe_> editing not so much anymore :)
[17:54:56] but it's possible we can't do that with just separate users.
[17:55:01] <_joe_> VE and parsoid depend a lot on api.php
[17:55:42] sure, but that's not the default action=edit form on enwiki.
[17:55:46] not yet anyway.
[17:55:52] <_joe_> ok gotta go, I have dinner plans! see you next week or afterwards :)
[17:58:10] but even without editing, reading and logging in are still important too. So if api servers are unavailable 1% of the time, people probably won't complain. But if app servers are unavailable 1% of the time, people would see it and be more annoyed. But anyway, that's assuming we can actually reasonably isolate them at the DB side, and indeed, I don't know if that's even feasible/useful.