[01:42:53] 10Wikimedia-Apache-configuration, 10Pywikibot-core, 10Documentation, 10Pywikibot-Documentation: Pywikibot documentation showing broken directory listing - https://phabricator.wikimedia.org/T132136#4074753 (10zhuyifei1999) @Dzahn could you look into this? Most other redirects do work properly, but not pywik... [09:19:03] <_joe_> ema, vgutierrez what number can I quote as a cache-text hit ratio? [09:25:52] I'd say https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1&var-cluster=text&var-site=All [09:26:05] *but* let's wait for ema [09:26:27] * vgutierrez knows nothing about varnish [09:45:33] <_joe_> yeah I reported "more than 90%" [09:45:40] <_joe_> which was good for my argument [09:45:44] <_joe_> (against ESI) [10:15:31] it depends what kind of "hitrate" you mean :) [10:15:46] hitrate in hits/all-reqs, or hits/cacheable-reqs [10:16:06] well there's more than those two distinctions, too [10:16:34] I guess the other dimension is whether you toss out the varnish-internal responses (redirects, errs, etc) [10:16:52] hits/all-possible-applayer-reqs vs hits/all-possible-incoming-reqs [10:18:20] (also the numbers just bumped upwards recently with a change, so past 24h might be better than 7d) [10:18:50] the worst number, hits/all-incoming-reqs, is about 77.1% [10:20:05] hits/all-possible-applayer-reqs is 84.3% [10:20:39] hits/cacheable-reqs is 95.0% [10:22:34] <_joe_> uhm [10:23:46] <_joe_> bblack: apart from 1 - it's a solution in search of a problem 2 - it's legacy unmaintained shit in all FLOSS caching proxies and 3 - it's an absolute nightmare for debugging and cache invalidation [10:23:53] <_joe_> do you see other arguments against ESI? [10:24:05] bblack: if an appserver is too slow answering (grafana/prometheus), cache_misc would return a 503 for those slow requests, right? [10:24:46] <_joe_> also 4 - everyone realized quite some time ago this was a bad idea, are we really that much smarter than facebook and google?, but I'm not sure that would be well received [10:25:00] _joe_: the primary argument against ESI right now is the pragmatic one. while attempting the ATS transition, we can't take on something like ESI implementation integration, so shove it off to the next FY [10:25:20] <_joe_> oh this is for the long-term vision anyways [10:25:26] (next meaning FY19-20) [10:25:35] (at best!) [10:25:39] <_joe_> ATS support for ESI is not even official, right? [10:25:59] ATS supports a different subset than varnish does anyways [10:26:13] I'm not sure I could call it unofficial, more like experimental or something [10:26:32] <_joe_> well I would argue that anything other than "esi:include" is a mistake anyways [10:26:50] well, I wouldn't put it quite that strongly, but overall, yes. [10:27:00] <_joe_> as soon as you start using variables and flow structures there, it's a rabbithole [10:27:01] and of course, esi:include-only may as well be the old-school simpler SSI [10:27:55] while it does make thinking/debugging more complicated, there are rational ways to think about how it interacts with caching and invalidation [10:27:57] <_joe_> at my past job, I had to go through caching of several layers of ESI fragments to find out what broke a page for a percentage of users [10:28:15] https://docs.trafficserver.apache.org/en/5.3.x/reference/plugins/esi.en.html [10:28:16] <_joe_> because they bet heavily on ESI years before [10:28:32] <_joe_> and they just kept adding complexity there, because that's what happens [10:28:52] I'm not saying I *like* ESI.
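For reference, the three hitrate figures quoted above share the same numerator (cache hits) and differ only in what gets counted in the denominator. A rough sketch of the arithmetic, with made-up counter names (not the real varnishstat/Prometheus metrics) and volumes chosen to land near the quoted 77.1%/84.3%/95.0%:

    # Rough sketch of the three hitrate definitions quoted above. Counter names
    # and request volumes are made up for illustration, not real metrics.
    def hitrates(hits, internal_responses, uncacheable_passes, total_incoming):
        applayer_possible = total_incoming - internal_responses  # drop varnish-internal redirects/errors
        cacheable = applayer_possible - uncacheable_passes       # drop known-uncacheable (pass) traffic
        return {
            "hits/all-incoming-reqs": hits / total_incoming,
            "hits/all-possible-applayer-reqs": hits / applayer_possible,
            "hits/cacheable-reqs": hits / cacheable,
        }

    print(hitrates(hits=771, internal_responses=85, uncacheable_passes=103, total_incoming=1000))
    # -> roughly 0.771 / 0.843 / 0.950, matching the numbers quoted above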
I tend to agree it will get abused and make our lives miserable. [10:29:07] <_joe_> akamai tried hard to convince them to move away from ESI, showing us all their internal research on why ESI hurts performance instead of boosting it [10:29:12] but it is possible to use it sanely, especially if we're very restrictive about the use-cases and the features we allow. [10:29:26] <_joe_> ok, I'll bite [10:29:35] <_joe_> how is ESI better, then, than client-side composition? [10:29:42] it's a solution in search of a problem [10:29:45] but [10:30:33] well to put it politely, some of the problems it's searching for are non-technical in nature [10:31:22] client-side composition done sanely, with a legacy fallback to server-side composition (what we have today) makes more sense, yes. [10:31:35] <_joe_> that was my whole point [10:31:43] but there are caveats with client-side composition just like ESI [10:32:29] it can also give you runaway complexity and feature-abuse, and it's harder for opsy folks to even have insight into what's happening there once it's in the control of a giant JS or serviceworker mechanism that's difficult to follow [10:33:08] <_joe_> well you have the js debug console, OTOH [10:33:25] <_joe_> which makes debugging quite easier, if not by an opsen directly [10:34:05] the other thing is, explicitly rejecting ESI might push some towards a more-disturbing path [10:34:06] <_joe_> esi is obscure, unless varnish/ats offer a very good insight to how each page fragment is composited [10:34:23] <_joe_> which I doubt [10:34:36] which is not only serviceworkers, but complex compositors on the server-side as well (for the fallback part), and needing to bypass the edge, etc... [10:35:01] <_joe_> anyways, I think this is an interesting discussion to have collegially [10:35:12] at which point you end up with something where it's very easy to fall naturally into un-scaleable/manageable/protectable solutions and things start falling apart on random feature deploys because the delineations aren't so clear, etc. [10:35:13] <_joe_> and this is the kind of discussion the TechCom should start [10:35:55] <_joe_> that's the one think I and Daniel agree on :P [10:35:59] <_joe_> *thing [10:36:41] the way I see it, basic legacy 100% server-side composition (what we have today) sucks, but it's easy to understand. [10:37:22] it's a Good Thing to want to move past that, but no matter which direction you take, it's fraught with perils and bad temptations if you're not very clear and distinct about keeping the facilities limited and clearly-delineated. [10:38:36] depending on the social and technical context, it's not unreasonable to end up saying a limited use of ESI is preferable to the outcomes we'd get if we go the other way with client-side js composition + legacy fallback. [10:38:50] it's not unreasonable to say the opposite, either. [10:39:03] <_joe_> eheh [10:39:07] but either requires some rigor and discipline, and either adds complexity. [10:39:41] and a good design for how you'll break down your outputs would work with either one, so there's no reason not to start working on that part, and defer the ESI-vs-js call for later.
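To make the "esi:include-only may as well be old-school SSI" point above concrete: a compositor restricted to bare includes is just "fetch fragment, splice it in". A minimal, purely illustrative Python sketch; in reality the expansion would be done by Varnish/ATS at the edge, and the fragment URL, regex handling, and error behaviour here are assumptions:

    # Minimal sketch of expanding bare <esi:include src="..."/> tags, i.e. the
    # SSI-like subset discussed above. Illustrative only: real ESI processing
    # happens in Varnish/ATS, and error handling/escaping are elided.
    import re
    import urllib.request

    ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')

    def fetch_fragment(url):
        # Hypothetical fragment fetch; a real compositor would hit the backend
        # (or another cache layer) with sane timeouts and per-fragment caching.
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.read().decode("utf-8")

    def compose(page_html):
        """Replace each esi:include tag with the body of its referenced fragment."""
        return ESI_INCLUDE.sub(lambda m: fetch_fragment(m.group(1)), page_html)

    # compose('<html>... <esi:include src="https://example.org/fragments/sidebar"/> ...</html>')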
[10:41:09] I think I'd see a premature push for declaring ESI the winner ahead of doing the basic decomposition work as a potential bad smell (wanting to know what features to abuse up-front and wrapping the design around them) [10:43:00] anyways [10:43:22] rewinding a bit, probably the hitrate stats I gave (which are all in the context of all incoming reqs) provide poor context for this decision [10:43:47] because they fail to make the critical distinctions of what's on the client side of those reqs [10:44:20] a certain percentage of our overall incoming reqs bucket is actually internal callbacks-to-self that shouldn't be going through the public interface but is. [10:44:38] another percentage is bot/code/otherwise-automated traffic that also has nothing to do with composing for human UAs [10:45:08] we'd really want to first break it down to just the reqs from external human-facing UAs, and then further break that down by UAs capable of client-side composition vs not. [10:45:19] to have some good input numbers for guidance, I think. [10:45:41] (e.g. especially a lot of the currently "uncacheable" traffic could be API calls not from human UAs) [10:46:39] <_joe_> the cacheability of the API is a topic that I think would be more interesting for "scalability" [10:46:52] I think it's reasonable/good at this point to pursue, in the abstract sense, doing the groundwork of decomposition on the MW side in a general fashion (and still ultimately re-composing server side in the legacy way, from the external pov, for now) [10:47:00] <_joe_> the main reason why I see us wanting to do composition is to reduce the "editor penalty" [10:47:24] <_joe_> and I'm not sure how server-side that could work [10:47:31] and it's also reasonable for traffic to say "we're not even going to contemplate ESI, or say what feature subset we'd be willing to support, or anything, till at least post-FY1819, because we have other things to solve first" [10:47:41] <_joe_> yes [10:48:14] server-side it won't help with anything really, externally (e.g. editor penalty) [10:48:19] <_joe_> but that comes after understanding why we want compositing, and what direction we want to go to for solving those problems [10:48:26] <_joe_> bblack: right [10:48:40] it's just laying the groundwork for phase 2, where you start turning on client-side comp and/or ESI and/or server-side. [10:49:04] <_joe_> bblack: I fail to see how ESI would help either [10:49:08] first you have to have things to compose, so first you need to compose the legacy stuff internally, even if you're still ultimately compositing it server-side [10:49:22] err that came out wrong [10:49:26] <_joe_> the reason why we don't cache content for logged in users is mainly they can customize the appearance of the site [10:49:31] first you have to have things to compose, so first you need to *de*-compose the legacy stuff internally, even if you're still ultimately compositing it server-side [10:49:41] <_joe_> I agree [10:50:13] once you get to that phase, then you're in a place to debate where composition takes and how [10:50:21] *takes place [10:51:01] and you can't ignore non-composing UAs (which are not just "legacy". in some cases it's very new phones which are significant sources of traffic and won't support your desired type of client-side composition) [10:51:38] reasonable people might argue (wrongly or not) that it should all be ESI because it's a universal solution.
[10:51:53] or argue for client-side, but ESI-composed as a fallback for clients that can't compose [10:52:06] or argue for client-side, with server-composed as the fallback (no ESI) [10:52:32] <_joe_> which would be my choice, tbh [10:52:38] it would be mine, too [10:53:34] but a very intentional and limited-feature/scope usage of ESI just as legacy fallback for the client-composition isn't necessarily awful either, if done carefully. [10:53:41] <_joe_> also, if your server-side composer is using serviceworkers in node as well, it's a tiny layer and you could have it run on the edges too. [10:53:59] <_joe_> (this is where you shoot me) [10:54:06] I was about to say [10:55:02] either way, I think the smart thing is to defer any decision on ESI, with a bit of a negative push on it. [10:55:19] <_joe_> ok [10:55:23] do the decomposition groundwork first in general terms that will make sense with any limited-feature solution, ask again in FY19-20 :) [10:55:51] <_joe_> if we want to have a discussion on the general direction before then, we can do it ofc [10:56:02] yeah [10:56:35] <_joe_> like the api-returning-200-on-error-or-not-found issue :P [11:05:30] right, I need to keep up my end of that still [11:47:43] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4075420 (10BBlack) >>! In T190410#4074369, @Anomie wrote: >>>! In T190410#4073920, @BBlack wrote: >> What exactly is the client going to do diffe... [12:34:16] there's cron spam from /usr/local/bin/hhvm-needs-restart on the app servers trying to connect to 10.64.33.10/lvs1010, known issue? [12:34:20] e.g. /usr/lib/ruby/2.1.0/net/http.rb:879:in `initialize': Connection refused - connect(2) for "10.64.33.10" port 9090 (Errno::ECONNREFUSED) [12:34:56] heh [12:35:05] <_joe_> moritzm: that's not a known issue and it's pretty serious [12:35:23] I need to gather some context [12:35:30] but lvs1010 has never been a production lvs :P [12:35:39] (and it's dead, jim) [12:35:43] <_joe_> yeah, we have to find out how it ended there [12:35:56] uh.. :9090 on an lvs instance is pybal's metrics port, right? [12:36:14] well.. pybal instrumentation http interface [12:36:17] <_joe_> so, basically, safely depooling servers is done via a script mobrovac wrote [12:36:24] <_joe_> called pooler-loop.rb [12:36:26] <_joe_> it's in puppet [12:36:39] <_joe_> it checks with pybal that the server has effectively been depooled [12:36:46] <_joe_> as in, not serving traffic [12:36:47] yes, and I shut off pybal on lvs1010 recently (lvs1007 too) [12:37:21] <_joe_> ok, so I need to find why pooler-loop would use the lvs1010 ip too [12:37:23] I think probably different people are making different assumptions about the lvs data in puppet :) [12:37:26] <_joe_> it's in puppet [12:37:33] <_joe_> bblack: probably [12:37:35] it's in some lists, but the lists aren't there for that kind of consumption [12:37:49] (there is no such list for that kind of consumption, actually) [12:38:13] <_joe_> bblack: ok, what's your sanctioned way to get "low-traffic LVS IPs in my datacenter"? [12:38:23] (in the sense that there will always be cases where there are servers in those lists which are not-alive while bringing up new things or decomming old things) [12:38:38] there isn't a sanctioned way, nobody's ever asked for one :) [12:38:52] <_joe_> let's create one then?
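What pooler_loop.rb does against pybal, as described above, boils down to polling each LVS host's instrumentation HTTP interface on :9090 and waiting for the server to stop showing up as pooled. A rough Python sketch of that idea; the /pools/<service> path, the "enabled/up/pooled" line format, and the host list are assumptions for illustration (the real logic lives in modules/conftool/files/pooler_loop.rb):

    # Rough sketch of a depool check against pybal's instrumentation interface
    # on port 9090. URL path, response format and host list are assumptions for
    # illustration; the real logic lives in pooler_loop.rb.
    import time
    import urllib.request

    LVS_HOSTS = ["lvs1003.example.wmnet", "lvs1006.example.wmnet"]  # placeholder list

    def still_pooled(lvs_host, service, server, timeout=2):
        """True if pybal on lvs_host still reports `server` as pooled; None if
        that pybal is unreachable (down LVS / stopped pybal), which should not
        be fatal -- that's exactly the resilience discussed below."""
        url = "http://%s:9090/pools/%s" % (lvs_host, service)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                for line in resp.read().decode("utf-8").splitlines():
                    if line.startswith(server):
                        return "pooled" in line and "not pooled" not in line
                return False
        except OSError:  # connection refused, timeout, DNS failure, ...
            return None

    def wait_for_depool(service, server, interval=5, max_wait=120):
        """Poll every LVS until none of them still reports the server as pooled."""
        deadline = time.time() + max_wait
        while time.time() < deadline:
            if not any(still_pooled(h, service, server) for h in LVS_HOSTS):
                return True
            time.sleep(interval)
        return False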
[12:39:02] yes, but that doesn't sound like a friday fix for the failing script [12:39:06] <_joe_> the script is resilient enough to keep working I think [12:39:15] <_joe_> let me test that first [12:39:16] I can start the pybals back up since I'm clearly not going to finish the decom today [12:39:54] <_joe_> bblack: one sec, let me see if this has real consequences or not [12:40:25] (or also, I could break up the big decom commit, and just get them out of those lists for now without doing the rest yet) [12:40:41] <_joe_> bblack: EEK [12:40:49] <_joe_> yeah we need to turn it back on [12:40:53] <_joe_> and I need to fix this shit [12:41:07] <_joe_> the script actually breaks [12:41:09] ok [12:41:19] can I go the other way and just pull them from the lists? [12:41:24] (for the short term solution I mean) [12:41:25] <_joe_> so let's first assess the current damage, then I'll fix that damn script [12:41:31] <_joe_> bblack: either way is ok [12:42:31] <_joe_> https://config-master.wikimedia.org/pybal/eqiad/apaches EEK [12:42:33] <_joe_> ok [12:42:43] where's pooler-loop? [12:43:09] modules/conftool/files/pooler_loop.rb [12:43:16] <_joe_> ./modules/conftool/files/pooler_loop.rb [12:43:36] <_joe_> ok I'm starting pybal on lvs1010 for now [12:43:45] <_joe_> and fixing the depooled servers [12:44:23] <_joe_> ouch it's just removed [12:44:40] yeah just wait a sec :P [12:45:37] <_joe_> modules/conftool/manifests/scripts/service.pp <== this is where lvs::configuration gets consumed [12:46:19] <_joe_> so it's enough to comment out lvs1010 from that list to fix the issue temporarily [12:46:45] from which list, that was my question [12:47:05] anyways, lvs1010 pybal restarted for now [12:47:45] what else uses pooler_loop stuff indirectly, aside from hhvm-needs-restart and such? [12:48:03] I'm concerned since this is off in modules/conftool [12:49:03] <_joe_> pooler-loop is used by scap for parsoid, for example [12:49:24] <_joe_> to do proper rolling restarts as parsoid fails to respond properly to queries for some seconds while being restarted [12:49:40] <_joe_> so they do depool => warmup => repool during deploys [12:50:31] yeah ok [12:50:47] https://gerrit.wikimedia.org/r/#/c/415044/1/modules/lvs/manifests/configuration.pp [12:51:07] ^ here in the decom commit, you can see lvs1010 (and such) removed from two different places, do you know which it's actually pulling from? [12:51:41] <_joe_> the former [12:51:47] <_joe_> lvs_class_hosts [12:52:20] anyways, even before we decided to kill those, they were in the list to allow them to be configured on the other side, not for others to consume from [12:52:57] it was quite scary even before I stopped pybal, as they weren't considered in production service. 
we often downed pybal or tried experimental versions that failed, or played with fixing their ethernet interfaces and had them go offline, etc [12:53:12] those lists are for the configuration of lvs itself, not for consuming lvs as a service from elsewhere [12:53:17] <_joe_> yeah the obvious fix is to make pooler-loop tolerate down servers [12:53:34] well even then, it's still a problem [12:53:38] <_joe_> because you also can have pybal down on another server [12:53:48] <_joe_> and yes, we then need to have such a list available somewhere [12:53:53] if Things are going to have dependencies on polling live production lvses, some other list must be invented for that use [12:54:12] <_joe_> this thing has been in place for ~2 years IIRC, btw [12:54:15] lol [12:54:40] <_joe_> I'm pretty sure you were involved in a discussion I had with marko at the time, but you probably forgot :P [12:54:47] btw this is a common problem-pattern in general [12:55:20] that DRY drives you to have one hostlist of foos somewhere, but it's consumed in different ways that matter when there are indeterminate states (bringing up new hosts, decomming old ones, test/canary-hosts, etc) [12:55:38] which is how we ended up with multiple hostlists for caches at one point too [12:55:39] <_joe_> so the problem is also [12:55:46] <_joe_> if you don't have one true list [12:55:53] <_joe_> when you insert a new host [12:55:59] <_joe_> you have to remember all those lists [12:56:08] yup! [12:56:18] the "right" answer structurally is one list, with attributes [12:56:24] but puppet doesn't make that easy [12:56:26] <_joe_> so the correct solution imho is to have one common data-structure with the right attributes [12:56:29] <_joe_> ahah [12:56:32] <_joe_> yeah [12:57:51] <_joe_> so interestingly [12:58:05] <_joe_> we have a begin/rescue clause in pooler-loop [12:58:13] <_joe_> to protect against unavailable pybals [12:58:27] <_joe_> and guess what, it doesn't catch all the errors [12:58:30] <_joe_> just timeouts [13:00:56] merged the puppet-side removal [13:01:10] so they should puppetize away their lvs1010 refs, but I'll leave pybal running for the weekend, too [13:01:22] hey there [13:01:25] busy day today IRL [13:01:29] what have I missed? :) [13:01:54] a discussion of ESI [13:02:23] <_joe_> bblack: thanks for the assistance [13:02:29] and the fact that de/re-pools done via scap for deployments have been relying for years on being able to live-poll results from all the hosts in $lvs_class_hosts [13:02:36] which includes non-production cases like lvs1010 :) [13:03:16] nice [13:03:31] (so every time we've experimented on pybal on those hosts, it's probably caused temporary havoc) [13:03:54] patched-up for now [13:04:28] seems eerily quiet a couple EU mornings in a row now for esams-text? [13:04:39] did we accidentally fix something and never realize which thing fixed it? [13:06:03] the AE:gzip thing went in ~25h ago [13:06:39] the last of the smaller be<->be conns spikes was earlier the same morning, but nothing much since [13:09:18] <_joe_> https://gerrit.wikimedia.org/r/421527 this should fix things [13:09:45] yesterday morning was quiet too in esams-text-land [13:10:07] and that was before the AE:gzip fix [13:11:08] 10Traffic, 10Operations, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10Performance-Team (Radar): PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4075622 (10BBlack) 05Open>03Resolved a:03BBlack The above took care of it from th...
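On the "one list, with attributes" point above: the shape of the fix is a single source of truth where each LVS host carries its lifecycle state, and consumers filter on the attributes they care about instead of reading a raw hostlist. A minimal sketch, with invented hostnames and attribute names; in practice this would live in hiera/puppet, not Python:

    # Minimal sketch of "one common data-structure with the right attributes".
    # Hostnames, sites and attribute names are invented for illustration.
    LVS_HOSTS = [
        {"name": "lvs1003", "site": "eqiad", "class": "low-traffic", "state": "production"},
        {"name": "lvs1006", "site": "eqiad", "class": "low-traffic", "state": "production"},
        {"name": "lvs1010", "site": "eqiad", "class": "low-traffic", "state": "decommissioning"},
        {"name": "lvs2003", "site": "codfw", "class": "low-traffic", "state": "production"},
    ]

    def lvs_hosts_for(site, klass, states=("production",)):
        """The consumer view: only hosts in the requested lifecycle states."""
        return [h["name"] for h in LVS_HOSTS
                if h["site"] == site and h["class"] == klass and h["state"] in states]

    # A pooler_loop-style consumer would ask only for production hosts:
    print(lvs_hosts_for("eqiad", "low-traffic"))  # excludes lvs1010
    # while the LVS/pybal configuration side could also include hosts in flux:
    print(lvs_hosts_for("eqiad", "low-traffic", states=("production", "decommissioning")))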
[13:11:43] from the broader 5xx perspective, yeah [13:12:14] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521694166471&to=1521708726132 [13:12:29] there was some minor be<->be spiking before the AE:gzip thing though, but it didn't rise to the level of failed-fetches [13:12:44] (and accompanying mbox-lag ramps) [13:13:17] there's also the 1799 fixes you applied on the 21st at 4:40 to keep in mind [13:13:33] oh right, hmmm [13:13:55] in any case, that was ~7h before the AE:gzip fix [13:14:08] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-24h&to=now [13:14:24] but past ~24h since that, things look very unspiky all across the failed-fetches dashboard [13:15:48] I am concerned about the cause discussed in the API-returns-200 ticket too [13:15:54] not for the URL mention there, but for other cases [13:16:12] if we had such a case happening, it would match very well with the spikes we see [13:16:23] (well, have seen a couple days ago) [13:16:32] such a case would be something like: [13:17:19] 1) There's a /w/api.php?foo url-object that's fairly hot traffic (or a whole category of them that are fairly-hot) [13:17:39] 2) It's normally a cacheable output, which is great because without caching the traffic would be pretty overwhelming [13:18:14] 3) Once in a while, maybe in the EU mornings in particular, it suffers some internal failure like a DB error/timeout, and returns an uncacheable "200 OK" + error message. [13:18:53] 4) Because this isn't a 5xx, our hit-for-pass optimizations create a 10 minute hit-for-pass object and now the floodgates open even though the error was perhaps singular or at least briefly transient. [13:19:40] hfm would help w/ 4) [13:19:48] right [13:20:02] well :) [13:20:29] so, especially as elucidated in the deeper discussions around 1799-like issues.... hfm isn't perfect either, and hfp is more-efficient than hfm in cases where hfp makes sense. [13:21:16] in the case of a non-error response with uncacheable response headers, it's reasonable to assume that URL is always uncacheable, in which case our best bet is hfp to avoid coalesce penalties [13:21:40] whereas in the case of a 5xx, you'd argue for either a singular-pass-just-this-request or a short hfm. [13:22:10] yes [13:22:43] but this case is neither. the api has decided a transient error should return 200+uncacheable. and if the original was cacheable and hot enough that we rely on that caching, we're hosed. [13:23:51] the workaround might be to treat requests to /w/api.php which return the header MediaWiki-API-Error (or whatever it was) like virtual-5xx's for this purpose [13:24:33] although from the language of the discussion I'm not sure if that's right either. It may be the case that MediaWiki-API-Error is also set for some cases that we wouldn't consider 5xx-like, I donno. [13:24:53] what are the drawbacks of hfm, again? Excluding the fact that you can't do conditional requests [13:25:19] (which might not be the end of the world in this error-but-not-really-because-200 case) [13:26:41] according to the docs, the only reason to use hfp over hfm is making conditionals work [13:27:17] and according to reality? :) [13:27:25] but I think in the bugs, I've seen hints that hfm may have undesirable internal properties and/or contribute to 1799-like cases and/or have some other perf issue where it creates lots of ephemeral objects, or something. 
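To make the distinction being weighed above concrete, here is the decision logic sketched as plain Python rather than VCL (the real change would be a handful of lines in vcl_backend_response in the wikimedia-common template linked a bit later in the discussion). The function name, return-value wording and TTL hints are illustrative assumptions, not the actual patch:

    # Illustrative decision logic only -- the real change would be a few lines
    # of VCL in vcl_backend_response, not Python. The header name is as
    # discussed above; the TTL hints in the return values are assumptions.
    def backend_response_policy(status, uncacheable, headers):
        """Classify a backend response along the lines discussed above:
        cacheable -> cache; genuinely uncacheable 200 -> hit-for-pass (keeps
        conditionals working, avoids coalescing stalls); transient-looking
        failure (real 5xx, or 200 + MediaWiki-API-Error) -> short hit-for-miss,
        so one brief error can't open the floodgates for ~10 minutes."""
        if not uncacheable:
            return "cache"
        if status >= 500 or (status == 200 and "MediaWiki-API-Error" in headers):
            return "hit-for-miss, short ttl"   # treat as a virtual 5xx
        return "hit-for-pass, ~10 minutes"     # assume the URL really is uncacheable

    print(backend_response_policy(200, True, {"MediaWiki-API-Error": "internal_api_error_DBQueryError"}))
    print(backend_response_policy(200, True, {}))   # normal uncacheable output -> hfp
    print(backend_response_policy(200, False, {}))  # cacheable -> cache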
[13:27:35] but I don't have a concrete answer [13:29:32] so we could give a go to a vcl patch that turns 200+uncacheable+mw-api-error into hfm [13:30:33] right [13:30:43] even if MediaWiki-API-Error is set for cases that aren't really 5xx-like, if they're uncacheable for just a short time it seems like a good idea [13:31:06] maybe prep a patch for that, and wait to see if any EU morning problems actually seem to recur, vs already-fixed. [13:31:29] sounds good [13:31:30] if we see one earlier in the morning, we'd likely have time to apply the patch and then see if it makes a diff the rest of the morning. [13:32:26] or, well... [13:32:48] also, I don't know how pervasive this pattern is. mw devs seem to be pretty strong on thinking this approach is correct. [13:33:14] it's perhaps possible this behavior (of 200+uncacheable for transient error) affects things outside of api.php too is what I mean. [13:33:31] mmh [13:33:43] maybe leaving out the api.php part of the condition, and just trigger on 200+uncacheable+mw-api-error only [13:33:50] or something [13:33:58] I donno, it's complicated :) [13:34:25] <_joe_> can I get a pair of eyes on https://gerrit.wikimedia.org/r/#/c/421527/ ? [13:34:26] or a more fool-proof patch that would catch lots of related cases, would be to just s/hfp/hfm/ for all such cases and see how it goes. [13:34:29] <_joe_> I'm pretty sure it's ok [13:35:25] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb#455 [13:35:42] ^ by "all such cases" I mean go ahead and try s/hfp/hfm/ universally there [13:36:56] varnishncsa -g request -q 'BerespStatus eq 200 and BerespHeader:MediaWiki-API-Error and ReqURL !~ "/w/api.php"' [13:37:03] nothing so far on cp1067 ^ [13:37:47] _joe_: it's probably not a regression, but beyond that (a) there are probably other errnos and (b) maybe the whole failure logic there needs re-thinking? there seems to be an odd mix there... of ignoring some errors and not others? [13:38:23] I mean, it ignores 404 too [13:38:41] maybe it should rescue all exceptions to return true if it's going to rescue connrefused [13:39:13] <_joe_> well, now that I think about it [13:39:31] ema: I don't even know if I was remembering that header name right for discussion purposes, lemme see [13:39:38] <_joe_> we should actually rescue any exception that's not a pybal error outside of that function [13:39:48] <_joe_> instead of trying to catch specific issues there [13:40:36] bblack: I do see responses with MediaWiki-API-Error by removing the !~ "/w/api.php" condition [13:40:58] ok [13:41:30] hey look what I found [13:41:30] https://phabricator.wikimedia.org/T180712 [13:41:37] yeah so as I was typing earlier, I was going to say "just 200+uncacheable+mw-api-error", but it's also possible this thought-pattern persists elsewhere, but without the header or a different header name [13:41:42] e.g. under load.php or other scripts [13:43:03] trying a straight up s/hfp/hfm/ for the whole block in question is an easy enough experiment to catch many related problems, and also test whether hfm is an acceptable substitute that doesn't create new problems for the cases where we know that hfp would be more-efficient in some sense. [13:43:17] but even the lack-of-conditionals part is worrying.
I know load.php relies on 304s [13:43:39] some others, especially persistently-uncacheables, may rely on 304 to scale well too [13:44:17] (perhaps another option would be to treat it like keep: if response has ETag or INM use hfp, otherwise hfm?) [13:44:43] s/INM/LM/ ? [13:50:02] why have I not merged the v4-v5 VCL cleanup yet? [13:54:54] it needed a manual rebase, let's see what pcc thinks [14:02:20] yeah s/INM/LM/ [14:03:29] in other news, SG seems to be holding up fine [14:03:45] <_joe_> SG being eqsin? [14:03:46] I'm kind of on the fence now about whether to turn it back off for the weekend, or just leave it going [14:03:59] SG being the clients we're sending to eqsin, but either way :) [14:04:06] <_joe_> oh ok [14:04:30] in any case, leaving it going for the US daytime at least [14:04:56] bbl! [14:06:42] bblack: what time yesterday did it go live? [14:08:08] <_joe_> according to https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-aggregate-client-status-code?orgId=1&from=now-2d&to=now&var-site=eqsin&var-cache_type=varnish-text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5 [14:08:20] <_joe_> if you just select GET [14:08:33] <_joe_> around 21 UTC yesterday [14:10:33] Ah, good call [14:10:52] <_joe_> also I guess it's going to be in SAL [14:11:39] <_joe_> ahem. apparently not [14:12:33] <_joe_> marlier: I assumed you're familiar with the acronym, as it's one of the most used around here, but in case you were wondering, SAL == https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:56] Yep, that one I knew. But good to check :-) [14:14:15] <_joe_> we're not that bad with acronyms here, I still remember my first meetings at $JOB~1 [14:14:28] <_joe_> were very very bad. [14:47:51] bblack: https://gerrit.wikimedia.org/r/#/c/421542/ [14:51:23] I'd say eqsin is working out quite nicely, for users in SG: https://grafana.wikimedia.org/dashboard/db/performance-singapore-caching-center?orgId=1&from=1521730800000&to=1521817200000 [14:52:04] Eyeballing it, median times look like they're 30-40% faster than previous. [14:56:00] cool [14:58:26] ~500 req/s [14:58:45] https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-caching?refresh=15m&orgId=1&from=now-6h&to=now&var-cluster=All&var-site=eqsin [14:59:09] 10Traffic, 10Operations, 10Pybal: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4075954 (10Jgreen) [15:02:57] <_joe_> marlier: response start is down almost 60% [15:03:07] <_joe_> pretty impressive [15:03:33] <_joe_> and first paint is down 45% [15:03:51] Yeah, it's great! [15:04:11] <_joe_> well, it's the best-case-scenario anyways (Singapore) I guess; it's still great :) [15:04:15] Of course, this is just from a subset of users who are located within ~5 miles of the actual data center :-) [15:04:25] But, yeah, great result [15:05:13] I'll be curious to see what things look like from Japan and Taiwan, in particular. (Those are our two highest-traffic countries in the group that are going to be switching to eqsin, and both happen to be relatively far away in terms of physical distance. 
) [15:11:54] poor New Caledonians, "DOM complete" up to 13s [15:12:10] although they might simply be less interested in Wikipedia looking at https://en.wikipedia.org/wiki/Geography_of_New_Caledonia#/media/File:Noum%C3%A9a_Ile_des_Pins_Upi.JPG [15:14:17] note that only Singapore is currently routed to eqsin https://github.com/wikimedia/operations-dns/commit/0cb5336797fa58df8b4fe75ba27e6307ffaceffd [15:15:07] DOM complete going up for new caledonians seems unrelated [15:15:53] marlier: is it a coincidence that the SG perf increase is clearly visible at midnight UTC? [15:16:03] sure, sure. I was just surprised to see the difference, but it's probably a satellite link anyway [15:17:11] moritzm: it's fiber, but they're far from everywhere, only 1 uplink ISP, and not many users, so sampling probably not that great [15:17:14] XioNoX: yeah, it's a coincidence [15:17:31] Traffic started going there at 2100 utc [15:17:48] Took a couple hours for caches to fill [15:18:28] And the graph I linked is showing a 1-hour rolling average, so spikes are smoothed a bit [15:18:33] marlier: tcp time, is the tcp handshake? [15:18:49] Yes, including SSL [15:19:53] ok! [15:21:08] Yeah, overall it's a really nice improvement. [15:26:38] and we can do more! [15:41:13] bblack: is there anything we can do for clients in Singapore that resolve to a different DC? [15:50:18] XioNoX: Do you have examples? I think that should only happen if the MaxMind database shows a given IP as being somewhere else... [15:54:10] yeah, it's most likely a maxmind issue [15:54:14] https://www.irccloud.com/pastebin/xIqQUJC0/ [15:54:36] all those IPs are RIPE Atlas probes located in Sinapore [15:54:44] +g [16:06:58] 10netops, 10Operations, 10cloud-services-team: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4076177 (10ayounsi) a:03RobH For ACLs, please be more specific, ideally mentioning a source/destination IP(s) and port(s). Taking a random labvirt* host as... [16:18:32] 10netops, 10Operations, 10cloud-services-team: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4076204 (10RobH) so the symptoms of this were us trying to PXE boot labvirt1021 (10.64.20.40) and labvirt1022 (10.64.20.41). During the PXE boot, it gets the... [16:18:45] 10netops, 10Operations, 10cloud-services-team: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4076207 (10RobH) a:05RobH>03faidon [16:22:29] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4076230 (10Fjalapeno) @kchapman we are interested in picking this up in Reading Infrastructure, but haven't been able to get to it. We would still like to... [16:24:35] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4076236 (10Fjalapeno) For context: We have lots of client code with work arounds for getting the right sizes of images. So much duplication and bugs. We wa... 
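On the "clients in Singapore that resolve to a different DC" question above: the GeoDNS decision is just a MaxMind country lookup (and for DNS it is applied to the query source, usually the resolver, not the end user), matched against the SG-only routing from the dns commit linked above. A hedged sketch of that lookup with the geoip2 library; the database path and the single-country set are assumptions:

    # Sketch of the geo lookup behind the routing decision. The mmdb path and
    # the "only SG goes to eqsin" set mirror the dns change linked above but
    # are assumptions here; requires the geoip2 package and a MaxMind country DB.
    import geoip2.database

    EQSIN_COUNTRIES = {"SG"}  # per the linked commit, only Singapore so far

    def routed_site(ip, mmdb_path="/usr/share/GeoIP/GeoIP2-Country.mmdb"):
        with geoip2.database.Reader(mmdb_path) as reader:
            country = reader.country(ip).country.iso_code
        return ("eqsin" if country in EQSIN_COUNTRIES else "another DC"), country

    # A RIPE Atlas probe that is physically in SG can still come back as another
    # country here if MaxMind has stale data for its prefix, which is enough to
    # send it elsewhere.
    # print(routed_site("203.0.113.7"))  # placeholder IP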
[17:05:22] 10Traffic, 10Operations, 10ops-codfw: cp2006: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10ema) [17:05:37] 10Traffic, 10Operations, 10ops-codfw: cp2006: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076383 (10ema) p:05Triage>03Normal [17:09:21] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4076401 (10ayounsi) [17:15:13] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4076425 (10RobH) Just changd on asw switch stack: robh@asw-a-eqiad# show | compare [edit interfaces] + ge-2/0/0 { + description db1001; + disable; + } + g... [17:21:57] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4076475 (10ayounsi) [17:23:08] 10netops, 10Operations, 10ops-codfw, 10ops-eqiad: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044272 (10ayounsi) Removed them from the table from description. [17:48:11] 10Traffic, 10Operations, 10ops-codfw: cp2006, cp2010: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076593 (10ema) [18:21:18] re: various SG commentary above: [18:22:02] 1) yeah traffic turned on around 21:00. There's a multi-hour lag before perf graphs are impacted, and there's some slew; I wonder if "1h moving average" actually does its moving average over a greater window than 1h in practice? [18:22:35] 2) With such a small client population, caches can't achieve high hitrates, so the perf impact isn't as great as it might otherwise be, but we're getting most of it. [18:23:44] 3) GeoDNS routing isn't ever all that precise. MaxMind's data isn't perfect, and then we're going off maxmind lookups of the DNS query source IPs, not actual user IPs, when it comes to routing. So yeah, it's imprecise [18:24:17] it was easy to find examples of users from other parts of the region that were picked up by turning on SG. I saw a few in logs from IN, ID, HK, AU, etc. [18:24:46] and on the other hand, it's likely some IPs actually-in-SG aren't picked up by the DNS routing for SG, either, due to whatever DNS and/or maxmind anomalies. [18:24:58] the more we spread through the region, the less this stuff matters [18:30:03] caches are slowly getting better over time, but this is about as good as they'll get at this low traffic level, as we're nearing the 24h point in the cycle [18:30:55] overall dc-local true-hitrate is 88.3% past couple of hours in eqsin, whereas the comparable number in esams right now is 96.8% [18:31:22] such a small client-request-rate means more of the traffic is unique, basically. [18:32:25] (err, esams local 96.1%, not that it changes things much) [18:35:38] bblack: I followed up with Telia about the packet loss, we're right in their congestion window they said was solved. Should we roll back SG->eqsin now or keep it longer? (Maybe we want to see how it behaves with a little packet loss) [18:36:02] is it showing loss again? [18:36:27] with as little backhaul traffic as we're moving right now, it may not matter a whole lot [18:36:40] yes https://smokeping.wikimedia.org/smokeping.cgi?target=eqsin.Core.cr1-eqsin [18:37:10] oh uh [18:37:15] that doesn't show for the bast5001 graph though [18:37:31] maybe the ping to cr1-eqsin is ending up using the transits backup tunnel?
latency looks different too [18:37:54] bblack: I also added the communities to have Telia not advertise our prefix to its US peers, but it seems like they are still advertising it to Telstra in the US, they are investigating [18:38:07] nah, path goes through the Telia link [18:38:33] still it's odd the bast5 ping doesn't show the same loss as the cr1 [18:40:13] yeah, and librenms doesn't seem to be able to graph some stuff from cr1, for example CPU, while it was doing it fine last time I checked: https://librenms.wikimedia.org/device/device=159/ [18:40:49] hmmm [18:40:52] bblack: are we seeing any impact on the CP servers? [18:41:03] I guess that's what matters most [18:41:18] they're so under-loaded on traffic it would be hard to tell anything heh [18:41:27] (and no good comparison of client perf since this is first run) [18:41:49] 97/97 packets, 0% loss, min/avg/ewma/max = 191.814/191.890/191.875/193.496 ms [18:42:09] ^ but that's a ping I have running that's testing the practical flow that matters, a cp5->cp2 ping (same as cache misses) [18:42:19] looks fine right now [18:42:31] cool [18:42:43] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4076790 (10Tgr) The problem here is that TechCom is using Phabricator in a different way from the rest of the movement. The normal way is that you create... [18:42:56] running librenms manually finally pulled the CPU, and we can see some white areas https://librenms.wikimedia.org/device/device=159/ [18:44:16] white areas meaning missing cpu stats [18:44:26] yeah, odd [18:45:03] what's up with the event logs? expected? [18:45:22] all the "memory pool added", and then ~6h ago "processor removed", etc? [18:45:37] I guess that's librenms definitions added/removed, not actual router things [18:46:22] some minor bgp stuff in there too, an unconfigured peer trying to connect, etc [18:49:47] bblack: the added/removed is a consequence of not being able to pull the device via SNMP properly [18:50:14] right [18:50:16] the BGP stuff is usually stuff that needs future peers to either fix their side or set up their sessions [18:53:52] 10netops, 10Operations, 10WMF-NDA: Avoid US RTT for eqsin traffic - https://phabricator.wikimedia.org/T190559#4076857 (10ayounsi) p:05Triage>03Normal [18:55:34] 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4076882 (10demon) Bump. Can we get this resolved for the remaining nodes? [19:21:07] 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4076945 (10bd808) a:05demon>03None The cleanup command @chasemp used was `rm -fR /usr/local/lib/mediawiki-config &&... [20:07:46] 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4077149 (10bd808) >>! In T187850#4077131, @Marostegui wrote: > @bd808 I don't think we use mediawiki-config for anything...
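Since the check that matters above is the path cache misses actually take (cp5 -> cp2), an ICMP ping is only a proxy for it; below is a small, purely illustrative TCP connect-time probe in the same spirit. The hostname and port are placeholders, and this is not how the production checks are implemented:

    # Crude TCP connect-time probe over the kind of path backend fetches take
    # (cp5 -> cp2). Host and port are placeholders; the real monitoring here is
    # ping/smokeping as described above.
    import socket
    import time

    def tcp_probe(host, port, count=10, timeout=2.0):
        times, failures = [], 0
        for _ in range(count):
            start = time.monotonic()
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    times.append((time.monotonic() - start) * 1000.0)
            except OSError:
                failures += 1
            time.sleep(0.5)
        loss = 100.0 * failures / count
        if times:
            print("%d/%d ok, %.1f%% loss, min/avg/max = %.1f/%.1f/%.1f ms"
                  % (len(times), count, loss, min(times), sum(times) / len(times), max(times)))
        else:
            print("all %d probes failed" % count)

    # tcp_probe("cp2023.codfw.wmnet", 3128)  # placeholder backend host/port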
[20:17:44] 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4077173 (10Marostegui) Ah right, if it will get the directory back but just without the module, I'm sure that won't brea...