[14:15:24] https://people.mpi-sws.org/~gummadi/king/king.pdf
[14:15:54] the idea is fascinating, but is it really that easy to find NSs allowing recursive queries from random clients?
[14:32:02] this used to be the case almost universally up until a few years ago
[14:32:18] then they started getting used for cache poisoning and DDoS attacks
[14:34:02] oh I've just noticed the article is from 2002 :)
[14:35:07] lol, also nowadays the resolver could be pretty far from the hosts ;)
[14:35:34] (disclaimer: I've just read a few lines, I might have misunderstood the whole thing :D )
[14:52:43] moritzm: I'm done with lvs reboots, it's 4.4.2-3+wmf8 everywhere
[14:54:35] great, thanks
[15:17:59] 07HTTPS, 10Traffic, 10DBA, 06Operations, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#3025390 (10jcrespo)
[15:36:11] 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3025438 (10Anomie) >>! In T119366#3023410, @Tgr wrote: > `#time` and co. are used on many pages and usually they do not require cache invalidation. For exa...
[17:08:03] 10netops, 06Analytics-Kanban, 06Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3025831 (10Ottomata) > term udplog { + 1 > Remove IPs the term analytics-publicIP-v4: +1 > Review the IPs in term ssh Don't know anything about this, but also not sure why we have speci...
[17:32:43] ema: ping
[17:33:06] ema: cp1067?
[17:34:02] depooling it anyways for now
[17:36:33] bblack: pong, I'm half-afk
[17:36:46] trouble with cp1067?
[17:36:59] yeah and I see you already logged in there earlier
[17:37:07] it's been giving some 503 spikes, the backend process
[17:37:23] also notable its weekly restart cron is due in ~5h
[17:37:35] maybe it just got too fragmented and didn't make it to a full week
[17:37:47] yeah perhaps it reached its limit earlier than usual
[17:37:57] what else was going on there earlier today?
[17:38:24] you have a session open from back around when the second small spike hit, I figure you must've already looked a bit
[17:38:44] yeah I was taking a look at a 503 spike and didn't find anything interesting
[17:38:56] ah well, it has kept recurring and getting a little larger
[17:39:03] varnishlog output in ~ema/503.log
[17:39:05] I depooled the box for now
[17:39:28] ok
[17:39:32] we could suppress the cron, restart early, repool
[17:39:38] or just repool it and leave it until it reaches the restart, or whatever
[17:39:48] seems silly to restart it twice 5h apart
[17:40:45] I'd vote to leave it depooled and repool after the cron restart
[17:42:26] ok
[17:42:33] I'll do it later, I'll still be here :)
[17:42:49] cool, thanks :)
[17:44:09] 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3026040 (10Tgr) >>! In T119366#3025438, @Anomie wrote: > Well, ideally it would limit cache expiry to "however much time is left until the comparison chang...
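One way to quantify the "didn't make it to a full week" theory above is to watch varnishd's expiry mailbox lag directly. A minimal sketch, assuming Varnish 4.x's `varnishstat -j` JSON output and its `MAIN.exp_mailed` / `MAIN.exp_received` counters (their difference is the lag between objects handed to the expiry thread and objects it has actually processed):

```python
import json
import subprocess


def mailbox_lag(stats: dict) -> int:
    """Expiry mailbox lag: objects mailed to the expiry thread
    but not yet picked up by it. Grows when the expiry thread
    can't keep up, e.g. on a fragmented/overloaded backend."""
    return (stats["MAIN.exp_mailed"]["value"]
            - stats["MAIN.exp_received"]["value"])


def current_lag() -> int:
    # -1: dump counters once and exit; -j: JSON output
    out = subprocess.check_output(["varnishstat", "-1", "-j"])
    return mailbox_lag(json.loads(out))
```

This is presumably the same arithmetic behind the `b_exl` column in `dstat --varnishstat`; a persistent value in the hundreds of thousands (like the ~1500k seen here) would be a reasonable alerting threshold.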
[17:54:32] bblack: might be interesting to look at the b_exl column in dstat --varnishstat
[17:54:44] and compare it to other cp1* backends
[17:55:18] that's the expiry mailbox lag which IIRC we thought was part of the picture
[17:56:03] and BTW that reminds me that we should plot it :)
[17:56:23] it's now ~1500k
[18:11:50] on cp1067 1500k I guess you mean
[18:13:20] yep
[18:19:31] yeah it might be interesting to make some grafana dash of those
[18:19:47] it might also be interesting to alert when values over a certain threshold persist
[18:20:54] it's dropped a bit now. it was 50k when I just started dstat, now like 20k
[18:21:14] it's interesting that with very little traffic (depooled), it doesn't very quickly drop to zero
[18:21:30] that must be indicative of some logic bug in how the situation is handled by varnishd
[18:25:09] I still kind of wonder about the text stats shift on Feb 8 or so
[18:25:35] 2017-02-08 12:45 UTC, -ish
[18:25:56] we get a big jump in frontend internal responses in varnish-caching, and a big jump in POST requests in varnish-aggregate-client-status
[18:26:24] they seem to correlate, so a whole bunch of new POST requests that are being handled directly from the FEs
[18:26:28] maybe to beacon endpoints?
[19:59:52] bblack: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&from=1487083518132&to=1487099474209&var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops
[20:05:03] on other machines there's also occasionally some lag, but definitely not that much
[20:07:28] the first 503 spike of today was at 15:44, first jump in expiry mailbox lag at 15:53
[20:20:47] cool :)
[21:47:04] bblack: hey, do you have a minute to talk about https://phabricator.wikimedia.org/T112151 ?
[21:58:52] SMalyshev: sure
[21:59:24] bblack: so I want to enable POST queries for WDQS... but I'm not sure how to best do it
[22:00:24] the main driver for it for now is query size - federation can produce really large queries and our setup has a limit of about 7K
[22:00:56] so many tools know that and use POST for long queries, but we don't serve POST now
[22:01:21] why don't we?
[22:01:32] I mean, I hate it, but whatever, it is what it is
[22:01:37] I made a hacky patch here: https://gerrit.wikimedia.org/r/#/c/243883/ to route POSTs to GETs, but we may also try and just allow POSTs
[22:01:49] I don't think we're specifically blocking POST
[22:02:06] bblack: the nginx setup on the wdqs side blocks POSTs
[22:02:23] because POSTs can be used for writing, and we don't want to allow those
[22:02:28] ok
[22:02:44] so that's the tricky part - we have to allow POSTs but not all POSTs
[22:02:54] doesn't wdqs have some explicit control to disable writes without disabling all POST?
[22:03:51] currently it doesn't... I'm not sure if we can do it on the blazegraph side easily... since we do need updates to work, just internal ones, not ones coming from the internet
[22:04:33] and I'm not sure if jetty can be set up properly (I don't know too much about jetty security...)
[22:05:07] so, blazegraph needs the ability to restrict writes based on either the actual client IP, or some trusted header
[22:05:50] well, trusted header maybe... since we can strip this in nginx I guess
[22:06:02] we have trusted headers like that already
[22:06:11] e.g. X-Client-IP, which is sent to your nginx from varnish
[22:06:23] but that only applies to cache_misc requests (including outside world)
[22:06:53] if internal stuff is directly hitting wdqs.svc.eqiad.wmnet, then you'd have to get that client IP from the local nginx on wdqs
[22:07:14] internal stuff is working with localhost
[22:07:33] ok
[22:07:40] so that's pretty easy to restrict to then
[22:08:03] so it's ok. The problem is that for the blazegraph server it's hard to know if it's a legit POST, so it probably needs to be done in nginx I guess
[22:08:16] "legit post"?
[22:08:36] basically a) a POST coming from the updater or b) a POST that is a safe query
[22:09:09] so even the localhost updater traffic comes through nginx?
[22:09:14] the problem here is that we're getting into a blacklist/whitelist problem... I don't want to enumerate all modification queries
[22:09:30] bblack: no, updater traffic goes directly to blazegraph
[22:09:45] I see
[22:10:12] so, have nginx set a header "X-From-Nginx: 1" or whatever, unconditionally
[22:10:24] and have blazegraph disallow writes if that header is set
[22:12:32] that's kind of a problem - I'm not sure jetty has a way to do it
[22:13:04] the problem is I'm not even sure jetty knows what a "write" is, and blazegraph running on top of jetty does know, but it doesn't know anything about headers
[22:13:23] since it's a java servlet it's all segregated
[22:13:37] and most of it isn't even our code, so it's hard to modify
[22:15:12] it seems no fun to have to go the route of some preshared secret or simple validation, but if the metadata is mutually exclusive maybe that's the only play
[22:16:43] yeah sounds tricky
[22:16:48] yup
[22:16:54] https://wiki.blazegraph.com/wiki/index.php/REST_API
[22:17:46] basically in the POST API the only difference is if there's a query= param then it's a read, but if it's delete= or update= then it's a write
[22:18:31] that's why I wanted to use lua - in lua I can parse args and ensure they are safe
[22:20:17] *if* I redirected POST to GET in nginx, it would be safe since only queries can be done via GET, and java seems to have pretty generous limits on URLs AFAIK...
[22:25:30] bblack: do you think it's a viable solution?
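The POST-to-GET rewrite being discussed can be sketched to show exactly where the size limit bites: the form body has to be percent-encoded into the query string, which inflates it further. A hedged sketch (the endpoint URL is made up, and the 8192-byte cap is a stand-in for the server's URL length limit):

```python
from urllib.parse import urlencode


def post_to_get(base_url: str, form: dict, max_url: int = 8192) -> str:
    """Re-encode a form-encoded SPARQL POST as a GET URL,
    refusing requests whose encoded URL would exceed the
    server's URL length limit."""
    url = base_url + "?" + urlencode(form)
    if len(url) > max_url:
        raise ValueError("query too large for GET: %d bytes" % len(url))
    return url
```

This is the re-encoding concern raised just below: since encoding only grows the payload, a POST body near the existing ~7K limit can easily exceed an ~8K URL cap, so the rewrite buys very little headroom.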
[22:29:34] java has a URL limit of about 8k though, so maybe it's not a good solution
[23:01:05] SMalyshev: I think redirecting POST to GET would be tricky, you'd have to re-encode things too
[23:01:28] but yeah, whatever you can make work!
[23:01:38] yeah since GET is limited to 8k I think it's not the best idea...
[23:02:19] I found though that blazegraph may have a mime type that means "always query", but it applies it inconsistently. If I fix that, we may be able to just set the mime type and it'd be ok
[23:03:14] ok
[23:03:25] I'll ask the blazegraph folks if they agree with that patch, if so it should be easy then
[23:08:23] sounds good :)
[23:13:57] hmm looks not so simple, as that MIME type reads from the body and not from a query param, may require recoding still :( will dig more
[23:18:17] bblack: do you know if nginx can check whether POST query data has a certain parameter?
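On that last question: stock nginx config has no primitive for parsing POST bodies, which is why the Lua route keeps coming up in this discussion. Wherever the check ends up living, the read-vs-write rule described above (a `query=` param is a read; `update=` or `delete=` is a write) reduces to inspecting the form-encoded body. A minimal default-deny sketch of that rule; the parameter names are the ones mentioned in the log, everything else about the policy is an assumption:

```python
from urllib.parse import parse_qs

# Form parameters that mutate data via the POST endpoint,
# per the discussion above (extend if the API has more).
WRITE_PARAMS = {"update", "delete"}


def is_read_only(body: str) -> bool:
    """True only for a POST body carrying a query= (a read) and
    no known write parameter; anything else is rejected."""
    params = parse_qs(body)
    if WRITE_PARAMS & params.keys():
        return False
    return "query" in params
```

Default-deny matters here: an unrecognized body (empty, malformed, or using some other parameter) is treated as unsafe rather than enumerated in a blacklist, which sidesteps the blacklist/whitelist worry raised earlier.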