[14:15:24] https://people.mpi-sws.org/~gummadi/king/king.pdf
[14:15:54] the idea is fascinating, but is it really that easy to find NSs allowing recursive queries from random clients?
[14:32:02] this used to be the case almost universally up until a few years ago
[14:32:18] then they started getting used for cache poisoning and DDoS attacks
[14:34:02] oh I've just noticed the article is from 2002 :)
[14:35:07] lol, also nowadays the resolver could be pretty far from the hosts ;)
[14:35:34] (disclaimer: I've just read a few lines, I might have misunderstood the whole thing :D )
[14:52:43] moritzm: I'm done with lvs reboots, it's 4.4.2-3+wmf8 everywhere
[14:54:35] great, thanks
[15:17:59] 07HTTPS, 10Traffic, 10DBA, 06Operations, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#3025390 (10jcrespo)
[15:36:11] 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3025438 (10Anomie) >>! In T119366#3023410, @Tgr wrote: > `#time` and co. are used on many pages and usually they do not require cache invalidation. For exa...
[17:08:03] 10netops, 06Analytics-Kanban, 06Operations: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3025831 (10Ottomata) > term udplog { + 1 > Remove IPs the term analytics-publicIP-v4: +1 > Review the IPs in term ssh Don't know anything about this, but also not sure why we have speci...
[17:32:43] ema: ping
[17:33:06] ema: cp1067?
[17:34:02] depooling it anyways for now
[17:36:33] bblack: pong, I'm half-afk
[17:36:46] trouble with cp1067?
[17:36:59] yeah and I see you already logged in there earlier
[17:37:07] it's been giving some 503 spikes, the backend process
[17:37:23] also notable its weekly restart cron is due in ~5h
[17:37:35] maybe it just got too fragmented and didn't make it to a full week
[17:37:47] yeah perhaps it reached its limit earlier than usual
[17:37:57] what else was going on there earlier today?
[17:38:24] you have a session open from back around when the second small spike hit, I figure you must've already looked a bit
[17:38:44] yeah I was taking a look at a 503 spike and didn't find anything interesting
[17:38:56] ah well, it has kept recurring and getting a little larger
[17:39:03] varnishlog output in ~ema/503.log
[17:39:05] I depooled the box for now
[17:39:28] ok
[17:39:32] we could suppress the cron, restart early, repool
[17:39:38] or just repool it and leave it until it reaches the restart, or whatever
[17:39:48] seems silly to restart it twice 5h apart
[17:40:45] I'd vote to leave it depooled and repool after the cron restart
[17:42:26] ok
[17:42:33] I'll do it later, I'll still be here :)
[17:42:49] cool, thanks :)
[17:44:09] 10Traffic, 06Operations, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3026040 (10Tgr) >>! In T119366#3025438, @Anomie wrote: > Well, ideally it would limit cache expiry to "however much time is left until the comparison chang...
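One way to quantify the "didn't make it to a full week" theory above is to watch varnishd's expiry mailbox lag directly. A minimal sketch, assuming Varnish 4.x's `varnishstat -j` JSON output and its `MAIN.exp_mailed` / `MAIN.exp_received` counters (their difference is the lag between objects handed to the expiry thread and objects it has actually processed):

```python
import json
import subprocess


def mailbox_lag(stats: dict) -> int:
    """Expiry mailbox lag: objects mailed to the expiry thread
    but not yet picked up by it. Grows when the expiry thread
    can't keep up, e.g. on a fragmented/overloaded backend."""
    return (stats["MAIN.exp_mailed"]["value"]
            - stats["MAIN.exp_received"]["value"])


def current_lag() -> int:
    # -1: dump counters once and exit; -j: JSON output
    out = subprocess.check_output(["varnishstat", "-1", "-j"])
    return mailbox_lag(json.loads(out))
```

This is presumably the same arithmetic behind the `b_exl` column in `dstat --varnishstat`; a persistent value in the hundreds of thousands (like the ~1500k seen here) would be a reasonable alerting threshold.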
[17:54:32] bblack: might be interesting to look at the b_exl column in dstat --varnishstat
[17:54:44] and compare it to other cp1* backends
[17:55:18] that's the expiry mailbox lag which IIRC we thought was part of the picture
[17:56:03] and BTW that reminds me that we should plot it :)
[17:56:23] it's now ~1500k
[18:11:50] on cp1067 1500k I guess you mean
[18:13:20] yep
[18:19:31] yeah it might be interesting to make some grafana dash of those
[18:19:47] it might also be interesting to alert when values over a certain threshold persist
[18:20:54] it's dropped a bit now. it was 50k when I just started dstat, now like 20k
[18:21:14] it's interesting that with very little traffic (depooled), it doesn't very quickly drop to zero
[18:21:30] that must be indicative of some logic bug in how the situation is handled by varnishd
[18:25:09] I still kind of wonder about the text stats shift on Feb 8 or so
[18:25:35] 2017-02-08 12:45 UTC, -ish
[18:25:56] we get a big jump in frontend internal responses in varnish-caching, and a big jump in POST requests in varnish-aggregate-client-status
[18:26:24] they seem to correlate, so a whole bunch of new POST requests that are being handled directly from the FEs
[18:26:28] maybe to beacon endpoints?
[19:59:52] bblack: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&from=1487083518132&to=1487099474209&var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops
[20:05:03] on other machines there's also occasionally some lag, but definitely not that much
[20:07:28] the first 503 spike of today was at 15:44, first jump in expiry mailbox lag at 15:53
[20:20:47] cool :)
[21:47:04] bblack: hey, do you have a minute to talk about https://phabricator.wikimedia.org/T112151 ?
[21:58:52] SMalyshev: sure
[21:59:24] bblack: so I want to enable POST queries for WDQS... but I'm not sure how to best do it
[22:00:24] the main driver for it for now is query size - federation can produce really large queries and our setup has a limit of about 7K
[22:00:56] so many tools know that and use POST for long queries, but we don't serve POST now
[22:01:21] why don't we?
[22:01:32] I mean, I hate it, but whatever, it is what it is
[22:01:37] I made a hacky patch here: https://gerrit.wikimedia.org/r/#/c/243883/ to route POSTs to GETs, but we may also try and just allow POSTs
[22:01:49] I don't think we're specifically blocking POST
[22:02:06] bblack: the nginx setup on the wdqs side blocks POSTs
[22:02:23] because POSTs can be used for writing, and we don't want to allow those
[22:02:28] ok
[22:02:44] so that's the tricky part - we have to allow POSTs but not all POSTs
[22:02:54] doesn't wdqs have some explicit control to disable writes without disabling all POST?
[22:03:51] currently it doesn't... I'm not sure if we can do it on the blazegraph side easily... since we do need updates to work, just internal ones, not ones coming from the internet
[22:04:33] and I'm not sure if jetty can be set up properly (I don't know too much about jetty security...)
[22:05:07] so, blazegraph needs the ability to restrict writes based on either the actual client IP, or some trusted header
[22:05:50] well, trusted header maybe... since we can strip this in nginx I guess
[22:06:02] we have trusted headers like that already
[22:06:11] e.g. X-Client-IP, which is sent to your nginx from varnish
[22:06:23] but that only applies to cache_misc requests (including outside world)
[22:06:53] if internal stuff is directly hitting wdqs.svc.eqiad.wmnet, then you'd have to get that client IP from the local nginx on wdqs
[22:07:14] internal stuff is working with localhost
[22:07:33] ok
[22:07:40] so that's pretty easy to restrict to then
[22:08:03] so it's ok. The problem is that for the blazegraph server it's hard to know if it's a legit POST, so it probably needs to be done in nginx I guess
[22:08:16] "legit post"?
[22:08:36] basically a) a POST coming from the updater or b) a POST that is a safe query
[22:09:09] so even the localhost updater traffic comes through nginx?
[22:09:14] the problem here is that we're getting into a blacklist/whitelist problem... I don't want to enumerate all modification queries
[22:09:30] bblack: no, updater traffic goes directly to blazegraph
[22:09:45] I see
[22:10:12] so, have nginx set a header "X-From-Nginx: 1" or whatever, unconditionally
[22:10:24] and have blazegraph disallow writes if that header is set
[22:12:32] that's kind of a problem - I'm not sure jetty has a way to do it
[22:13:04] the problem is I'm not even sure jetty knows what a "write" is, and blazegraph running on top of jetty does know, but it doesn't know anything about headers
[22:13:23] since it's a java servlet it's all segregated
[22:13:37] and most of it isn't even our code, so it's hard to modify
[22:15:12] it seems no fun to have to go the route of some preshared secret or simple validation, but if the metadata is mutually exclusive maybe that's the only play
[22:16:43] yeah sounds tricky
[22:16:48] yup
[22:16:54] https://wiki.blazegraph.com/wiki/index.php/REST_API
[22:17:46] basically in the POST API the only difference is if there's a query= param then it's a read, but if it's delete= or update= then it's a write
[22:18:31] that's why I wanted to use lua - in lua I can parse args and ensure they are safe
[22:20:17] *if* I redirected POST to GET in nginx, it would be safe since only queries can be done via GET, and java seems to have pretty generous limits on URLs AFAIK...
[22:25:30] bblack: do you think it's a viable solution?
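The POST-to-GET rewrite being discussed can be sketched to show exactly where the size limit bites: the form body has to be percent-encoded into the query string, which inflates it further. A hedged sketch (the endpoint URL is made up, and the 8192-byte cap is a stand-in for the server's URL length limit):

```python
from urllib.parse import urlencode


def post_to_get(base_url: str, form: dict, max_url: int = 8192) -> str:
    """Re-encode a form-encoded SPARQL POST as a GET URL,
    refusing requests whose encoded URL would exceed the
    server's URL length limit."""
    url = base_url + "?" + urlencode(form)
    if len(url) > max_url:
        raise ValueError("query too large for GET: %d bytes" % len(url))
    return url
```

This is the re-encoding concern raised just below: since encoding only grows the payload, a POST body near the existing ~7K limit can easily exceed an ~8K URL cap, so the rewrite buys very little headroom.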
[22:29:34] java has a URL limit of about 8k though, so maybe it's not a good solution
[23:01:05] SMalyshev: I think redirecting POST to GET would be tricky, you'd have to re-encode things too
[23:01:28] but yeah, whatever you can make work!
[23:01:38] yeah since GET is limited to 8k I think it's not the best idea...
[23:02:19] I found though that blazegraph may have a mime type that means "always query", but it applies it inconsistently. If I fix that, we may be able to just set the mime type and it'd be ok
[23:03:14] ok
[23:03:25] I'll ask the blazegraph folks if they agree with that patch, if so it should be easy then
[23:08:23] sounds good :)
[23:13:57] hmm looks not so simple, as that MIME type reads from the body and not from a query param, may require recoding still :( will dig more
[23:18:17] bblack: do you know if nginx can check whether POST query data has a certain parameter?
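On that last question: stock nginx config has no primitive for parsing POST bodies, which is why the Lua route keeps coming up in this discussion. Wherever the check ends up living, the read-vs-write rule described above (a `query=` param is a read; `update=` or `delete=` is a write) reduces to inspecting the form-encoded body. A minimal default-deny sketch of that rule; the parameter names are the ones mentioned in the log, everything else about the policy is an assumption:

```python
from urllib.parse import parse_qs

# Form parameters that mutate data via the POST endpoint,
# per the discussion above (extend if the API has more).
WRITE_PARAMS = {"update", "delete"}


def is_read_only(body: str) -> bool:
    """True only for a POST body carrying a query= (a read) and
    no known write parameter; anything else is rejected."""
    params = parse_qs(body)
    if WRITE_PARAMS & params.keys():
        return False
    return "query" in params
```

Default-deny matters here: an unrecognized body (empty, malformed, or using some other parameter) is treated as unsafe rather than enumerated in a blacklist, which sidesteps the blacklist/whitelist worry raised earlier.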