[02:43:11] <wikibugs>	 10Traffic, 06Operations: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2724429 (10BBlack)
[07:10:01] <wikibugs>	 10Traffic, 06Operations: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2724568 (10Joe) For the record, yesterday's problem is different from the one we had before; it's still a memleak but of a different nature.  if no ticket is open for that, I'll open one this morning.
[07:41:09] <wikibugs>	 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2724580 (10elukey) I tried to compare a Miley Cyrus link logging correctly a 400 (and Timestamp:Resp) with a "ba...
[10:32:26] <ema>	 I might be affected by puppet alzheimer
[10:33:14] <ema>	 for some reason I thought we refactored access to the varnish_version4 hiera attribute by looking it up in base.pp or some other place instead of doing hiera('varnish_version4', false) all over the place
[10:34:02] <ema>	 it could be that I've just dreamt it and now is time to make the dream come true 
[10:37:51] <volans>	 rotfl
[10:43:27] <elukey>	 it is your subconscious that is affected by too much puppet coding
[10:43:33] <elukey>	 :D
[12:28:31] <wikibugs>	 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10faidon)
[12:32:46] <wikibugs>	 10netops, 06Operations, 10ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724981 (10faidon)
[12:57:33] <wikibugs>	 10Wikimedia-Apache-configuration, 06Operations: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2725031 (10elukey) p:05Triage>03Low
[13:21:53] <ema>	 oooh, it wasn't a dream! We did reuse some varnish4-related variable across our varnish puppet module, but that was $varnish4_python_suffix, not $varnish_version4
[13:22:27] <bblack>	 it'll all be over soon enough anyways :)
[13:26:46] <ema>	 bblack: Existential Tuesday? Or were you thinking of ATS? :)
[13:27:48] <bblack>	 I was thinking of ripping out all the varnish_version4 when we're done with text soon :)
[13:28:01] <ema>	 oh right
[13:32:55] <wikibugs>	 10Wikimedia-Apache-configuration, 06Operations: Font list resource doesn't have a "Content-type: text/plain;charset=utf-8" header - https://phabricator.wikimedia.org/T146421#2725118 (10elukey) The `conf` file does not have an extension and I think that Apache does not have any instruction about the MIME type t...
[13:36:47] <ema>	 elukey: can you please review https://gerrit.wikimedia.org/r/#/c/295207/?
[13:37:24] <ema>	 directly diffing the two versions of the file is probably easier than looking at the patchset on gerrit though: diff -u modules/varnish/files/varnishrls modules/varnish/files/varnishrls4
[13:42:24] <elukey>	 sure!
[13:42:52] <elukey>	 yeah I was reviewing the whole thing and it is not straightforward
[13:42:58] <elukey>	 doing the diff
[13:44:03] <elukey>	 my eyes are thanking you ema
[13:47:37] <ema>	 and ema is thanking your eyes
[13:49:29] <ema>	 meanwhile I've been trying to come up with some VTC tests for text v4 and after a bit of digging I found that this is needed to make the tests run:
[13:49:32] <ema>	 varnish v1 -arg "-p cc_command='exec cc -fpic -shared -Wl,-x -L/usr/local/lib/ -o %o %s -lmaxminddb' -p vcc_allow_inline_c=true -p vcc_err_unref=false" -vcl+backend {
[13:50:05] <ema>	 which only works on v4 of course given thant v3 didn't support -p vcc_allow_inline_c
[13:51:04] <ema>	 starting with v5 the varnishtest command allows specifying -p stuff from the CLI making it easy given our current setup to write version-agnostic tests (kinda), but with v4 that's not possible 
[13:53:04] <ema>	 what I mean is that we could pass different -p options through varnishtest-runner (which is a template) whereas our tests are currently static files and can't be tailored to different -p options
[13:53:35] <bblack>	 have you played with v5 much yet?
[13:53:45] <ema>	 not at all
[13:53:47] <bblack>	 ok
[13:54:01] <ema>	 it shouldn't be much of a jump in the unknown though
[13:54:10] <bblack>	 probably next quarter, we should start trying to do it and assuming it's easy to upgrade to
[13:54:23] <bblack>	 if it turns out to not be easy, we can stop and make bigger goals for it
[13:55:17] <ema>	 yeah
[13:56:28] <ema>	 for now I was thinking of writing v4 tests leaving the v3-compatible varnish args commented, not really optimal but probably acceptable
[13:56:41] <ema>	 any thoughts on this?
[13:57:59] <mark>	 whatdoyamean, the text v4 goal is almost done right? ;p
[14:04:48] <ema>	 the vcl compiles, as we all know if it compiles it works so yeah, we're basically done :P
[14:07:56] <bblack>	 :)
[14:08:35] <bblack>	 http://s2.quickmeme.com/img/7d/7da81be3684396211d3554746bef9aeba41d059107a22db41562402aae7d7e42.jpg
[14:08:51] <ema>	 :D
[14:15:02] <elukey>	 ema: the code review looks ok, but I don't remember the "i" and "I" varnishlog.varnishlog's parameters
[14:15:51] <elukey>	 ah tag list
[14:15:52] <ema>	 elukey: IIRC they've the meaning described by varnishlog(1)
[14:16:00] <elukey>	 yes yes you are right
[14:16:35] <elukey>	 LGTM
[14:16:50] <ema>	 cool, thank you!
[14:17:00] <elukey>	 checked also the parameters in VSL, no typos etc..
[15:20:09] <wikibugs>	 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725483 (10BBlack) @faidon - re: eqiad recdns IPv4 - I've uploaded DNS and puppet patches to switch that IP (by turning on the new IP first in parallel with the old...
[15:24:35] <wikibugs>	 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725529 (10BBlack) Oh, one other minor thing: the eqiad recdns IPv6 is already in the correct subnet for where it's at (matches with changing the IPv4 as the patche...
[16:10:46] <wikibugs>	 10Traffic, 10netops, 06Operations, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2725773 (10faidon) The network hardware I can configure in one swoop, so don't worry about that. Not sure if the PDUs/iDRACs/iLOs have any DNS configured whatsoever...
[17:14:18] <wikibugs>	 10netops, 06Operations, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2726109 (10Dzahn) >>! In T147320#2688855, @akosiaris wrote: > Setting to stalled for say.. 2 weeks ?  The 2 weeks are over today, exactly 14 days afer that comment.
[17:15:16] <wikibugs>	 10netops, 06Operations, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (10Dzahn) @joe ok with you if i go ahead and now merge https://gerrit.wikimedia.org/r/#/c/315891/ and actually shutdown palladium?
[18:44:07] <wikibugs>	 10Traffic, 06Operations: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726535 (10BBlack) I've found one counterpoint recently, making a mathematically-backed-up claim that we don't have to worry about AES-128 batch attacks so much in the specific case of G...
[19:11:09] <wikibugs>	 10Traffic, 06Operations: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726639 (10BBlack) Circling back to the broader plans and ideas here:  === FF/NSS ChaPoly/AES pref hacks ===   Firefox/NSS don't seem to be making any quick progress on ChaPoly prioritiz...
[19:30:07] <wikibugs>	 10Traffic, 06Operations: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2726702 (10BBlack) Also relevant to the above is Mozilla's current recommendations at https://wiki.mozilla.org/Security/Server_Side_TLS .  In a nutshell:  * Their Modern-only config's hi...
[20:05:12] <gehel>	 bblack: Hello! Is there a way to have LVS send requests from the same client to the same server?
[20:06:52] <gehel>	 bblack: Stas is having a look at implementing a way for a user to kill its current query, but of course that works only if the request is sent to the correct server.
[20:27:06] <gehel>	 another question, do you have any idea when T134404 (active / active clusters between 2 DC) might be ready? That has some influence on the planning for WDQS...
[20:27:06] <stashbot>	 T134404: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404
[21:06:01] <bblack>	 gehel: many services aren't active/active capable, I don't think it should have a big impact on wdqs timeline whether they're ready or not
[21:06:21] <bblack>	 gehel: what's the "influence on planning" there?
[21:07:14] <gehel>	 bblack: we are not that comfortable to do data reload at this point, as it means running on a single server for a few days, so we might want to ad another server to eqiad
[21:08:04] <gehel>	 bblack: I'm not entirely sure if / how we can route traffic to codfw while we work on eqiad...
[21:09:56] <gehel>	 bblack: my limited understanding let me think that if I configure varnish to use wdqs.svc.codfw.wmnet, we will have traffic going varnish codfw -> varnish eqiad -> wdqs codfw. It shoudl work, but is not optimal.
[21:11:01] <bblack>	 the routing part isn't the application's problem to begin with
[21:11:15] <bblack>	 but even so, we don't plan to do that if/when we can help it
[21:11:53] <bblack>	 the application's point of view is just whether or not they can support simultaneous normal request flow into both wdqs.svc.eqiad.wmnet and wdqs.svc.codfw.wmnet at the same time
[21:11:58] <gehel>	 we don't plan to be able to fall back on codfw cluster if eqiad cluster is problematic?
[21:12:47] <gehel>	 in term of application itself, it does not change anything. In term of the number of servers we want in each cluster, it does have some impact.
[21:12:59] <bblack>	 I think we're mixing up a lot of distinct issues here
[21:13:25] <gehel>	 yes, most probably
[21:13:30] <bblack>	 from the applications point of view, it needs enough machines to be fully resilient in the face of failure, and support its full load, in a single DC.
[21:13:46] <bblack>	 then it needs that same count of machines in the other DC.
[21:14:15] <bblack>	 the dual DCs are not to support things like single machine failures.  it's to support datacenter-level redundancy for everything.
[21:14:41] <bblack>	 we don't plan to flip traffic between the two application DCs because of the loss of a disk or a motherboard.
[21:14:52] <gehel>	 ok, then it is clear that we need to increase the size of wdqs clusters in both eqiad and codfw.
[21:14:55] <bblack>	 we may flip between the two application DCs for reasons unrelated to hardware failure then
[21:15:12] <bblack>	 (or because we've lost a whole DC)
[21:15:29] <gehel>	 I was misunderstanding the issue as we do switch traffic between DC for elasticsearch for some cluster wide maintenance
[21:16:01] <bblack>	 it's just an abuse of the capability, we really shouldn't be doing that.
[21:16:23] <bblack>	 a single DC cluster should be resilient on its own, even accounting for "maintenance"
[21:16:50] <gehel>	 in the case of elasticsearch, that start to be expensive :)
[21:16:57] <bblack>	 it's nice if we can sometimes provide that ability to cope with unique situations, but we shouldn't design for it
[21:17:12] <bblack>	 in the case of elasticsearch, maybe the application-layer design is bad :)
[21:17:22] <gehel>	 ok, for wdqs the situation is much simpler
[21:17:23] <bblack>	 (as it will be for many services, no doubt)
[21:17:53] <gehel>	 there are some upgrades where elasticsearch does need downtime, not much we can do about it...
[21:18:10] <bblack>	 sure we can: we can design software that doesn't require downtime to do upgrades
[21:18:34] <gehel>	 yes, except that we are not the one designing that software
[21:18:54] <bblack>	 that's neither here nor there
[21:19:06] <bblack>	 we're the ones designing the service, using existing software or improving on it
[21:19:34] <bblack>	 in any case, I doubt elasticsearch is the only service in this boat
[21:19:48] <gehel>	 ok, I see your point.
[21:19:57] <bblack>	 it's just important to understand that it's a design flaw.  it's what we're accepting because we don't have a better path yet.
[21:20:24] <bblack>	 and it means sometimes your design requires unavoidable downtime.  the software isn't designed for 24/7 service in a reasonable way.
[21:21:04] <bblack>	 if we can use our x-dc capabilities to avoid some of that downtime, I guess that's a good thing, but we're not trying to design around that.  it's just abusing one capability to make up for the problem.
[21:21:49] <bblack>	 if codfw is swalled up by a giant earthquake and we can't bring it back online for 3 months, you may still have maintenance to do.
[21:22:06] <gehel>	 and now you have me thinking about how we could design cirrus to allow no downtime upgrade...
[21:22:22] <bblack>	 well this all goes back to general principles
[21:22:35] <bblack>	 make everything you can stateless, it's good for scaling and resiliency of the stateless bits
[21:22:53] <bblack>	 the hard part is dealing with the stateful part, which is where things like mysql or cassandra or memcached live.
[21:23:15] <bblack>	 there's usually solutions for making them resilient too, but at least contain the problem to some standard state-engine that has solutions for the problem.
[21:23:30] <gehel>	 or elasticsearch in this case (well state can be recreated, but it is expensive)
[21:24:46] <bblack>	 I don't know that much about elasticsearch
[21:25:00] <bblack>	 it sounds on the surface like they've mixed up all the layers here and not really thought hard about this problem :P
[21:25:05] <ori>	 declaring this requirement up front would go a long way
[21:25:22] <ori>	 (this isn't a criticism at all, I just think a lot of what was said above was eloquent so I'm thinking about it.)
[21:25:44] <bblack>	 well it's been an "obvious" requirement since the days of only 1x application-layer DC
[21:26:04] <bblack>	 the addition of codfw to the mix doesn't relax any requirements around our single-DC capabilities.
[21:26:04] <gehel>	 basically elasticsearch is a database optimized for full text search, it just happen that in our case (and in most cases) it is used as a denormalization of a primary data store
[21:26:49] <gehel>	 the fact that it does not relax any requirement is non obvious (at least to me)
[21:27:06] <ori>	 it _should_ be obvious
[21:27:46] <gehel>	 at least now I know :P
[21:27:57] <bblack>	 the purpose of adding codfw is to give us datacenter-level resiliency, it's a separate concept from other layers of resiliency we have today.
[21:28:23] <bblack>	 in the future we might shift all load to codfw and take out eqiad for a while to move it to a new cage, or replace all of its network hardware.
[21:28:47] <bblack>	 or we might decide to relocate eqiad back to florida, and in the process we take out "eqiad" for a month or two before "florida" comes back online
[21:29:05] <bblack>	 or, obviously, one of the DCs could suffer an unpredicted disaster.  or lose all network connectivity.
[21:30:14] <bblack>	 the application should still meet all of its load an duptime needs within one DC.  the extra DC is a completely different topic, to deal with a completely different kind of issue.
[21:30:29] <gehel>	 back to the other question, do we have a way to have server affinity with LVS ?
[21:33:29] <bblack>	 the short answer is no
[21:33:44] <gehel>	 the short answer is good enough...
[21:34:20] <gehel>	 (not that I'm not interested in the long answer)
[21:34:30] <bblack>	 the slightly-less-short answer is that if the application's design wants that, you're trying to skimp out on hard parts of the problem by pawning off the work on another layer :)
[21:35:26] <bblack>	 LVS can do client IP affinity into the next layer, which is the caches.
[21:36:05] <bblack>	 the caches can do true client IP affinity into the underlying application, if we don't use LVS and have varnish know about the backends directly, which we'd prefer not to do (we only have one such case left, and we'd like to get rid of it)
[21:36:41] <bblack>	 in both cases, it's a best-effort as an optimization for the normal case.  you have to expect it doesn't always work right.  so if you're designing to fail when client affinity fails, your design fails.
[21:37:37] <gehel>	 that leads to my next question (if I have not stolen too much of your time already): why do we prefer LVS instead of having cache route to the different backends?
[21:37:41] <bblack>	 and in the varnish->applayer case, doing that means we're not using LVS abstraction at that layer, and losing the important benefits there (e.g. how etcd pooling works, etc)
[21:37:55] <volans>	 gehel: to kill their own query I'll do something else, happy to chat if you want to
[21:38:15] <gehel>	 ok, you answer my question
[21:38:29] <gehel>	 volans: always happy to chat!
[21:38:45] <bblack>	 gehel: what I said above (we lose the benefits of pybal+conftool+etcd), but also, dealing with the actual set of backends (instead of an LVS abstraction) directly in varnish means more-complicated bullshit for VCL to deal with, and we hope to reduce that complexity rather than increase it
[21:39:25] <gehel>	 I think that we are trying to work around blazegraph limitations here, but I don't understand blazegraph enough yet...
[21:39:50] <gehel>	 volans: have some time tomorrow to chat? It is getting very close to bed time here...
[21:39:50] <bblack>	 right now all that complexity is still there, but only ~2-3 cases on LVS use it.  If we can eliminate them, we can take a machete to broad swaths of puppet->ruby->vcl complexity
[21:40:03] <bblack>	 err, ~2-3 cases on cache_misc still use it
[21:40:03] <volans>	 gehel: sure
[21:40:44] <bblack>	 those cases being what we talked about recently: logstash and wdqs moving to LVS service endpoints and such (we're about done there I think)
[21:40:50] <SMalyshev>	 volans: you what you do to kill the query?
[21:40:58] <bblack>	 the one that will be hardest/last is stream.wikimedia.org
[21:41:45] <volans>	 SMalyshev: I need first some additional info to be sure I understand the problem correctly
[21:41:59] <gehel>	 bblack: actually, if you have some time to review https://gerrit.wikimedia.org/r/#/c/315675/ (and related) I can merge them and we can remove kibana from our list!
[21:42:07] <volans>	 but it's quite late here in europe, so probably better to chat tomorrow as gehel suggested :)
[21:42:16] <wikibugs>	 10netops, 06Operations, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2727215 (10akosiaris) FWIW, I am fine with that. But probably send one last heads up to ops@.
[21:42:20] <SMalyshev>	 volans: well, the problem is very simple. We have an http request that launches a query. we want to be able to cancel that query
[21:42:39] <SMalyshev>	 volans: tomorrow is fine too :)
[21:42:43] <gehel>	 yeah, my brain is starting to not follow the conversation straight.I'm shutting down...
[21:42:52] <bblack>	 well that sounds simple if your service is one peice of code on one box :)
[21:43:06] <gehel>	 bblack: Thanks a lot for your time. Enlightening as usual! 
[21:43:06] <volans>	 yes, I understand that, but I want to know where is this query running
[21:43:11] <bblack>	 and/or if the launched query is synchronous to the client heh
[21:43:24] <volans>	 and how :)
[21:43:33] <SMalyshev>	 bblack: so the affinity issue in *this* case I don't think is related to caches. We need affinity for one simple thing - being able to cancel a query we just sent
[21:43:41] <bblack>	 if it's async and it's a service deployed with a bunch of relatively-statless random scalable frontends.... it's not simple
[21:44:03] <gehel>	 the query is launched from the end user's browser, it is synchronous
[21:44:04] <volans>	 I would look into decoupling options here
[21:44:05] <SMalyshev>	 bblack: cancel is not (and can't be) cached. and if it fails occasionally also not a huge deal as long as it works in most cases
[21:44:47] <bblack>	 if it's syncronous, then the client can terminate the query by hitting stop or whatever.
[21:44:51] <gehel>	 I'm not even sure it make sense to cancel the queries on the blazegraph side. We have time limits on execution, and the client dropping the HTTP request should be enough, no?
[21:44:55] <SMalyshev>	 bblack: frontends are kind of stateless... they are stateless but there's a bit of state - namely, you have to tell it which query to cancel, and you have to tell it to the right frontend
[21:45:18] <SMalyshev>	 bblack: nope, closing http pipe does not guarantee query is stopped
[21:45:41] <bblack>	 but we do expect that the client normally keeps the pipe open, and the results are "lost" if the pipe is closed?
[21:45:48] <SMalyshev>	 bblack: it might stop the query if it's already writing into the pipe and pipe gets closed, but otherwise the query might continue afaik
[21:46:18] <bblack>	 if so, I'd say add read-side close detection to this pipe...
[21:46:26] <gehel>	 actually canceling the query on blazegraph is a micro optimization, no?
[21:46:27] <bblack>	 and trigger the local termination of the query
[21:46:31] <SMalyshev>	 bblack: yes, if pipe is closed then the results would be lost. though given there's varnish in the middle I'm not even sure it would work
[21:46:52] <SMalyshev>	 bblack: I mean, you can close pipe to varnish, but when it will close backend connection? I have no idea
[21:46:59] <bblack>	 good point
[21:47:30] <SMalyshev>	 I don't think varnish guarantees anything for this scenario, does it?
[21:47:43] <bblack>	 well, it could, but we don't in general.  assuming the backend is HTTP/1.1 and all
[21:47:59] <bblack>	 varnish will at least attempt to reuse connections for requests from many clients
[21:48:25] <SMalyshev>	 bblack: the backend is jetty, so I guess it does what jetty does... probably supports http/1.1 but not sure how exactly
[21:48:45] <SMalyshev>	 ah no, wrong, the backend is nginx, and then jetty
[21:48:50] <bblack>	 right
[21:48:50] <gehel>	 sorry, I'm really stopping here. I'll read the log tomorrow...
[21:48:50] <volans>	 lol, blazegraph website gives me NET::ERR_CERT_AUTHORITY_INVALID
[21:48:56] <bblack>	 byte gehel :)
[21:49:00] <volans>	 bye gehel 
[21:49:02] <gehel>	 bye
[21:49:13] <SMalyshev>	 gehel: bye :)
[21:49:21] <SMalyshev>	 volans: hmm works for me... 
[21:49:22] <bblack>	 I also have to run soon.
[21:49:51] <bblack>	 options for getting the http connection to matter for termination: websockets (ewww), or we could use pipe-mode in varnish to map it out 1:1 (eww)
[21:50:23] <bblack>	 otherwise wherever the real shared state lives, you need to track the existence of live long-running queries, so that another appserver can find out whether they're running and/or signal to another appserver to stop what it's doing?
[21:50:30] <SMalyshev>	 bblack: I don't really want to do any special tricks there... it should work with regular browser over regular http
[21:50:39] <SMalyshev>	 at least the queries
[21:50:48] <bblack>	 or you just accept that they're un-cancelable and you don't care, because they're readonly and their execution is bounded anyways?
[21:50:57] <volans>	 SMalyshev: but how you would like the user to trigger the kill of the query?
[21:51:08] <SMalyshev>	 bblack: right now they are un-cancellable, yes
[21:51:15] <bblack>	 why does the user care about canceling anyways? they probably don't care about our resources much heh
[21:51:41] <SMalyshev>	 bblack: if you wanted to run 1 sec query and it takes 30 sec instead, yu may want to cancel
[21:51:55] <SMalyshev>	 bblack: actually, we also prefer you to cancel... 
[21:52:09] <SMalyshev>	 saves us to run the whole whopper of a query that nobody needs
[21:52:11] <volans>	 don't rely on user's behaviours ;)
[21:52:27] <SMalyshev>	 we don't *rely* on it, we *encourage* it :)
[21:52:45] <SMalyshev>	 it's in their interest, especially if we get longer timeouts
[21:52:52] <SMalyshev>	 and it's in ours too
[21:53:04] <volans>	 has the DB some sort of explain that can easily tell if a query will run for long times?
[21:53:05] <SMalyshev>	 usually huge queries are mistakes
[21:53:31] <SMalyshev>	 volans: well, yes and no. it does have explain, but the way it is implemented it runs the query anyway :)
[21:53:49] <SMalyshev>	 so no, not in advance
[21:54:14] <volans>	 okok
[21:54:34] <SMalyshev>	 there are cardinality estimates etc. but they are often wrong when query is actually executed, esp. for more complex ones
[21:55:57] <volans>	 and depending how accurate are the tables statistics I guess, as usual with most DBs
[21:56:51] <bblack>	 it has reached time for me to run for a while, but I'll catch up when I get back
[21:57:05] <volans>	 so for more specific suggestions I would have to look into it, I don't know the details on how it handles connections/queries, etc.
[21:58:51] <volans>	 but the general ones are the same of bblack's: pipe monitoring, decoupled, or just don't care about cancelling and set short timeouts by default and if a user want a longer query can change it in the interface
[22:01:12] <SMalyshev>	 volans: well, we can't do short timeouts because some queries do need longer time
[22:01:27] <SMalyshev>	 people are complaining about 30 sec even now
[22:01:46] <SMalyshev>	 it's that some queries are legitimately slow and some are slow by mistake
[22:02:09] <SMalyshev>	 and ideally we would want users to be able to cancel the latter ones, so they won't drain the resources needlessly
[22:02:37] <volans>	 I was suggesting that the user can choose it (within a range) and by default the UI uses a short one so the user that wants to run a more complex query can choose a longer one
[22:02:45] <SMalyshev>	 we *can* survive without it, but it's not an ideal position, we'd be better off if we could have canel
[22:02:48] <SMalyshev>	 *cancel
[22:02:50] <volans>	 but of course doesn't protect you from the mistaken query
[22:03:39] <volans>	 well if you don't have much control of the pipe because too many layers/hops, you might consider the decoupled option
[22:03:48] <SMalyshev>	 volans: well, that's a possibility but it makes UI more complex and I suspect pepple would just set it to "longest possible" because why not
[22:04:08] <SMalyshev>	 volans: what's decoupled option?
[22:04:16] <volans>	 I would not expect the same user to hit cancel though ;)
[22:04:32] <volans>	 probably opening a new tab and running another query instead :D
[22:04:58] <SMalyshev>	 volans: that's what people do now. but the old bad query still keeps running until it expires
[22:05:26] <volans>	 ofc
[22:05:29] <SMalyshev>	 and people actually *ask* for cancel button, so they are ready to behave f we let them :)
[22:05:40] <volans>	 ok
[22:06:12] <volans>	 the decoupled depends on what info do you need to be able to cancel the query so that the cancel button send this info to a service that is able to kill the query
[22:06:51] <volans>	 consider that I have zero knowledge of blazegraph
[22:07:48] <SMalyshev>	 volans: well, I need to know 2 things: 1. query ID (which I can generate myself when issuing query if I want) and 2. on which backend this query runs
[22:08:04] <SMalyshev>	 then I can give this ID to that backend and say "cancel this"
[22:08:29] <SMalyshev>	 so the problem is getting to the right backend
[22:10:23] <volans>	 which of ot's API are we using?
[22:11:58] <SMalyshev>	 volans: sorry, didn't understand that question
[22:12:21] <volans>	 from their website I can see a REST API, a JAVA client, other client libraries
[22:12:30] <SMalyshev>	 volans: ah, ok. The REST API
[22:15:12] <volans>	 from a quick look I can't see a API for async query that you can submit and poll for the results
[22:15:33] <volans>	 so probably you have to set the queryId (UUID) to be able to cancel the query 
[22:16:19] <volans>	 at that point you could or have a service that checks on all backends if that query esists and is owned by the right user and cancel it
[22:17:17] <volans>	 or at the moment the query is dispatched to one backend save that info on some shared state service
[22:19:45] <SMalyshev>	 volans: well, that's the problem... since I don't know which backend the query runs on I need to either go to all of them or build a whole new service just to track which query runs where
[22:20:16] <SMalyshev>	 which does not fit into current structure really - there's nowhere to plug such service even
[22:20:26] <SMalyshev>	 let alone the question of extra moving part
[22:20:47] <SMalyshev>	 if I could have short-time affinity it would all work fine
[22:21:35] <volans>	 the check on everyone could be as simple as having something running on each backend that poll/subscribe to a queue and cancel local queries only
[22:22:54] <ebernhardson>	 perhaps a redis pub/sub channel could work, we have redis servers you can talk to already to broadcast cancels. but those have regular connection issues in mediawiki so maybe not the best idea...
[22:23:21] <ostriches>	 ew no more redis connections plz
[22:23:28] <ostriches>	 those timeouts are the bane of my existence
[22:23:35] <ostriches>	 (one of the banes)
[22:24:15] <SMalyshev>	 volans: well, this is probably possible... but can we send a request to all backends instead of just one?
[22:25:09] <volans>	 SMalyshev: that's with a pub/sub model, otherwise they will have to poll some shared state for the cancels
[22:25:20] <SMalyshev>	 actually since cancelling unexisting query is not a problem, if I had a way to have something send it to all backends it'd work...
[22:25:34] <volans>	 I would check the user too
[22:25:53] <volans>	 unless you can ensure the queryID is unique across the cluster
[22:25:53] <SMalyshev>	 volans: there's no "the user"... no logins
[22:26:03] <volans>	 ah
[22:26:15] <SMalyshev>	 and query ID is unique
[22:26:21] <SMalyshev>	 it's a random uuid
[22:27:04] <SMalyshev>	 so sending it to all hosts is fine, but having to maintain a whole new service just for that sucks
[22:28:04] <SMalyshev>	 esp. given that this service has also to know where other backends are, are they alive, etc.
[22:30:38] <volans>	 I guess the cluster members are in puppet and you don't need to know if they are alive if doing the cancel handling the failures
[22:31:53] <volans>	 are those standalone blazegraph hosts or is a cluster configuration?
[22:33:31] <volans>	 SMalyshev: anyway, quite late here, I'll have to go soon
[22:41:24] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer, 07Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727296 (10Paladox)
[22:42:41] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer, 07Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727296 (10Paladox)
[22:48:16] * volans off for today, ttyl
[23:12:25] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer, 07Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727371 (10Aklapper) It only shows errors in the developer console.
[23:13:05] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer, 07Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727373 (10Paladox) yes.
[23:13:36] <wikibugs>	 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Internet-Explorer, 07Upstream: Visting https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm on Internet Explorer shows errors - https://phabricator.wikimedia.org/T148595#2727374 (10Paladox)
[23:14:17] <wikibugs>	 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown, 07Browser-Support-Internet-Explorer, 07Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595#2727375 (10Aklapper) p:...
[23:18:50] <wikibugs>	 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown, 07Browser-Support-Internet-Explorer, 07Upstream: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595#2727381 (10Paladox)