[05:47:12] netops, Operations, fundraising-tech-ops, ops-eqiad: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2483556 (faidon) I configured pfw-eqiad port ge-2/0/11 to be in the fundraising VLAN. You might want to open a new #ops-eqiad tas...
[08:07:23] ema: bblack: Hi, I'm finishing my data analysis part and going to write my thesis now. As probably >50% of the thesis is Wikimedia's caching structure, I wanted to ask if one or maybe both of you could read through my thesis (once I have a "finished" version) before I make it final, to make sure I made no mistakes and so on. So proofreading is what I'm asking for. Hopefully I will have a version ready by the end of this week or the start of next week.
[08:08:18] Snorri: I'm happy to proofread your work as long as it's written in English :)
[08:10:25] ema: Of course it's English :) I think it would be harder to translate everything into German than to just write it in English.
[09:44:35] Traffic, Varnish, Operations: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2491474 (ema) Wikitech page added: https://wikitech.wikimedia.org/wiki/XKey
[11:28:24] netops, Operations: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2491626 (faidon) In the meantime, I allocated ports and IPv4/IPv6 subnets for the link and setup DNS with 0a1872b. BFD & IGP is still pending and will follow only after we have confirmed that the link w...
[11:40:52] netops, Operations: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2491636 (elukey) Update after a long time (my bad): we have been experimenting with Cassandra bulk loading without really succeeding, even using cassandra 2.2.6. W...
[12:31:56] ema: did you package all of varnish-modules when you did it for xkey? almost all of them look really useful, we should maybe consider them a baseline for v4 VCL in general.
[12:32:32] vmod_var in particular would clear up so much confusion and unnecessary header output and strange hacks about copying data through various available headers.
[12:33:53] and vsthrottle looks like a better vmod_tbf, and vmod_cookie would simplify some of our VCL too, we're already using vmod_header.
[12:36:26] bblack: morning! I was wondering if we have any good dashboard (maybe with gdnsd metrics if any?) about the three name servers
[12:36:49] I didn't find much in grafana/ganglia today
[12:37:12] (I was curious after eeden's alert that happened this weekend)
[12:38:09] bblack: yes, I've uploaded the new upstream to debian and backported it for jessie-wikimedia
[12:38:20] bblack: so we do have all the modules available
[12:39:22] ema: awesome. although I guess we can't touch shared VCL until text converts.
[12:39:32] since the newer mods won't be there in v3
[12:39:40] right
[12:41:49] elukey: nothing in grafana/graphite. there are custom stats emitted by gdnsd, and we do track some of them in ganglia
[12:42:13] a diamond collector should be fairly trivial to write
[12:42:17] we should probably move those to graphite, would be nice
[12:42:32] much easier than the ganglia plugin I'm guessing
[12:44:11] hey
[12:44:11] https://github.com/varnish/varnish-modules/blob/master/docs/vmod_tcp.rst
[12:44:15] func_get_estimated_rtt
[12:44:21] that looks kinda awesome
[12:44:30] it would be if varnish supported TLS
[12:44:30] we've talked about doing something like that before
[12:44:35] yeah :(
[12:45:15] hmm
[12:45:19] nginx seems to have the equivalent too
[12:45:20] elukey: if you look at eeden (or the others) in ganglia, there's a gdnsd metrics group in there
[12:45:23] $tcpinfo_rtt, $tcpinfo_rttvar, $tcpinfo_snd_cwnd, $tcpinfo_rcv_space
[12:45:43] yeah we could pass it down in a variable
[12:46:40] if nothing else, we can pass it off to X-Analytics and maybe it can be used to improve our insights into stats
[12:47:12] bblack: ah ok I was looking for a cluster aggregate view in the main page, my bad. Seeing the metrics now :)
[12:47:29] there are other cool things you could do, but questionable in practice. e.g. pretend we detected a mobile UA if rtt estimate is >500ms (but still allow cookie override back to desktop view, just like mobile)
[12:47:48] elukey: there is one for Authoritative DNS under "Views"
[12:49:24] Traffic, Operations: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2491791 (elukey)
[12:49:32] also, I didn't get a page on eeden. what was the alert over the weekend?
[12:50:25] bblack: I've received an email from @asm.ca.com with
[12:50:26] ALERT! DNS: Nameserver error on 91.198.174.239: Cannot connect to the name server.
[12:50:35] Message: Nameserver error on 91.198.174.239: Cannot connect to the name server.
[12:50:38] interesting that about 1/3 of our reqs now carry edns-client-subnet. I think the ratio was lower long ago.
[12:51:01] I think asm.ca.com == status.wikimedia.org external monitoring
[12:51:17] totally ignorant about these emails
[12:51:26] https://status.wikimedia.org/155942/DNS
[12:51:46] could've been network stuff between us and their monitoring, too
[12:52:57] new dashboard discovered today, didn't know it :)
[12:52:59] I'm really excited about vmod_var and vmod_cookie. those have the potential to remove so much hacky bullshit from our VCL.
[12:55:13] bblack: from https://grafana.wikimedia.org/dashboard/db/server-board I can see a "hole" in metrics last 30 mins, that kinda matches with the date/time on my email
[12:55:27] but I am not sure about the meaning
[12:55:58] Jul 23rd starting from ~21 UTC
[12:56:17] bblack: vmod_var would allow us to remove all the X-Whatever hacks to copy stuff around right?
[12:56:42] Traffic, Operations: Push gdnsd metrics to graphite and create a grafana dashboard - https://phabricator.wikimedia.org/T141258#2491811 (elukey) p:Triage>Low
[12:57:07] ema: right. internal variables can really be variables instead of abused header values. and we necessarily leak some of those in various directions (client, appserver, other caches) when we don't have to, too.
[12:58:20] ema: in some cases it probably un-tangles some of the actual code complexity, too. because sometimes we have to put things in strange places they don't belong because "vcl_foo can only access req.http and vcl_bar can only access beresp.http", etc... a lot of those quirks would go away. I think with vmod_var it's just frontend vars and backend vars.
[12:59:28] (my guess would be the ones set on the frontend side are still visible in the backend req, too)
[13:00:01] one can set global variables apparently
[13:00:16] yeah, it's interesting, but also scary
[13:00:21] very
[13:00:31] > Global variables have a lifespan that extends across requests and VCLs, for as long as the vmod is loaded
[13:00:49] right, which means they poof on a true restart without any saved state.
[13:01:00] it's hard to imagine where/how we'd use such a thing without persistent state for it.
[13:01:16] and obviously they must cost more perf-wise, they'd have to involve heavier locking.
[13:02:19] still, might be an interesting mechanism to implement certain scary switches to deal with problematic scenarios
[13:03:12] like @traffic_shutdown in hieradata today, could instead be set via an HTTP request from localhost
[13:03:43] hooking up conftool -> global_set() could be interesting in general
[13:04:20] maybe use it for protective modes under DDoS too (e.g. quickly set a var across a cluster that blocks all POST traffic, or simply drops all cache miss/pass traffic).
[13:04:37] (either of which is better than letting the site melt for all users, if that's what it comes down to)
[13:05:44] maybe we could even return graced objects in that case?
[13:05:53] in v4, yes
[13:06:10] nice
[13:06:39] looking at v4 and xkey and future purging in general, we probably should be using softpurge for article content
[13:06:54] but maybe not for functional content (like static asset purges)
[13:19:57] bblack, Snorri: https://phabricator.wikimedia.org/P3567 (age-reset.vtc)
[13:20:09] https://phabricator.wikimedia.org/P3568 (grep 'Age: ')
[13:20:52] unless I'm missing something, after the first cache eviction from the frontend the object always gets fetched from the varnish backend
[13:22:34] ema: from the backend's backend?
[13:23:08] bblack: the real server serves the page only once (first full miss)
[13:23:38] we have two varnishes in the test, v1 and v2. v1 with frontend vcl and v2 with backend vcl
[13:25:03] after the object expires on v1 (default_ttl=4), we see v2 in the leftmost column of varnishlog for all requests
[13:26:05] bblack: so, we do not hit the appserver but we hit v2 (the one using backend vcl)
[13:26:20] and we don't re-cache it for a further 4s?
[13:26:56] that's the part I don't understand: if we *would* cache it, then v1 should serve the subsequent requests without fetching from v2
[13:27:24] (see https://phabricator.wikimedia.org/P3568)
[13:27:32] I'd expect 1 request to the "applayer" at the start. 3 total requests to v2 out of 10 total to the frontend. ish.
[13:28:12] it's 5 total requests
[13:28:14] your paste shows a lack of Age reset
[13:29:33] it may matter that the vtc 'server' acts more like our appservers do in the real world (with setting Cache-Control: s-maxage=10,must-revalidate,max-age=0)
[13:36:43] technically, we've only observed the actual Age: reset in v3. it could be that that part of the problem is "fixed" in v4 already. but the underlying issue (which may be more about misunderstanding standards on our part, leading to bad ideas about variable TTL tuning on the layers) may still be present as indicated by your v2 reqs...
[13:37:40] in any case, we should almost certainly back off from our current pattern of setting FE TTLs lower than BE TTLs. I'd just like to be sure we understand exactly why first, before we remove the real-world ability to see it in action.
[13:38:36] yup. I've running the test against v4, for the record
[13:38:45] s/I've/I'm/
[14:11:19] Traffic, Operations: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2491990 (BBlack) What's our evaluation plan here? Do we want to stall on proper IPv6 in our VCL geoip lookup service first and do comparisons on that data? Or do some kin...
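A toy model of the frontend/backend TTL test discussed above (13:19 to 13:38), given as a minimal Python sketch rather than the actual age-reset.vtc. It assumes one request per second, a frontend TTL of 4s (the default_ttl=4 mentioned for v1), and a 10s backend TTL borrowed from the s-maxage=10 example. The function name `simulate` and the `inherit_age` flag are hypothetical, invented only for illustration. The two modes compare a frontend that treats every fetched object as brand new against one that measures an object's age from the moment the backend cache fetched it. Whether the second mode is the real mechanism in Varnish is not established in this log.

```python
def simulate(n_requests, fe_ttl, be_ttl, inherit_age):
    """Return (requests reaching v2, requests reaching the app server).

    Toy model: requests arrive at t = 0, 1, ..., n_requests - 1 seconds.
    inherit_age=False: the frontend (v1) treats each fetched object as new.
    inherit_age=True: v1 measures age from when v2 fetched the object
    (an assumption made for illustration, not confirmed Varnish behaviour).
    """
    fe_origin = None   # time from which v1 measures the object's age
    be_origin = None   # time v2 fetched the object from the app server
    be_requests = 0
    app_requests = 0

    for t in range(n_requests):
        # Frontend hit: the object is still younger than the frontend TTL.
        if fe_origin is not None and t - fe_origin < fe_ttl:
            continue

        # Frontend miss: the request goes to the backend cache (v2).
        be_requests += 1
        if be_origin is None or t - be_origin >= be_ttl:
            # Backend miss as well: fetch from the app server.
            app_requests += 1
            be_origin = t

        # Store the object on the frontend with the chosen age origin.
        fe_origin = be_origin if inherit_age else t

    return be_requests, app_requests


if __name__ == "__main__":
    print("age reset per layer:", simulate(10, 4, 10, inherit_age=False))
    print("age inherited:      ", simulate(10, 4, 10, inherit_age=True))
```

With the age reset at each layer, the model gives 3 requests to v2 out of 10, in line with the "3 total requests to v2 out of 10" expectation. With inherited age, every request after the first frontend expiry falls through to v2 (7 of 10 here), which resembles the pattern reported for v1. The exact counts depend on request timing, so they are not expected to match the 5 requests seen in the paste.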
[14:25:50] Traffic, Operations: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492061 (faidon) I'm honestly not worried all that much about tunnels anymore. In my experience, they're very rare nowadays and especially in this cross-country fashion (Googl...
[14:27:54] Traffic, Operations: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492065 (BBlack) For the VCL stuff, what I meant is that for IPv6 user traffic, we could compare the runtime lookup we do for Set-Cookie on the IPv6 address to the one done via...
[14:30:39] Traffic, Operations: No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org - https://phabricator.wikimedia.org/T81605#2492069 (BBlack) Moving forward and checking perf metrics after is an option, too. But unless the change is quite dramatic it will be hard to see it. Rolling forward and back...
[14:32:45] bblack: why would the non-.org nameservers be involved?
[14:32:59] ns0/1/2 are just under wikimedia.org
[14:33:09] and other TLDs should not include glues for that one
[14:33:22] and even if they do, conformant recursors should ignore them, right? :)
[14:33:59] hmmm right
[14:34:14] too early in the morning for thinking about delegation glue I guess :)
[14:34:45] oh, ok
[14:35:25] you had me overthinking this
[14:35:45] it made more sense to me that I wasn't thinking of something than that you were wrong about DNS :)
[14:36:41] you can include unnecessary glue, but yes conforming resolvers should ignore it, and really I wouldn't expect TLD servers to even try to include it.
[14:37:14] yeah
[14:37:34] on the other other hand, when working with registrars directly they often ask for glue addrs on nameservers in their forms and I don't know that they really filter for that, or try to submit it to the other TLD or something.
[14:38:01] I guess I've never done this myself, directly with a registrar (e.g. registered a .com with nameservers in .org)
[14:38:26] hopefully even if they ask on the form, they ignore the addresses since the nameservers are out of zone and do nothing with them.
[14:38:55] netops, Operations, fundraising-tech-ops, ops-eqiad: put pfw1- ge-2/0/11 in the 'fundraising' vlan for new host frqueue1001 - https://phabricator.wikimedia.org/T140991#2492084 (Jgreen) That's great news re. additional available interfaces, I'll create a new #ops-eqiad to do the cable swap and su...
[14:39:26] (or maybe they just verify the supplied data matches what is in the org registrar for the nameservers)
[14:39:33] no idea!
[14:40:52] yeah, regardless it's probably not a factor here
[14:41:15] if the perf change isn't dramatic, do we really care?
[14:42:33] I mean, yes, ideally we should squeeze out every last millisecond, but maybe our time is not best spent here?
[14:49:20] yeah it's a tough call how much to look at each perf-affecting thing. Sometimes as an org, we get mired in needing too much analytics validation of perf impact when it's minor or the idea being tested is fairly obviously-good.
[14:50:30] in this case, I'm inclined to care a bit, because the results of poor V6 geoiplookup could be awful. picking the wrong DC adds a lot of unnecessary latency.
[14:50:46] but like you said, if it's awful enough to care, it's probably awful enough to be able to see it on a graph through the noise.
[14:51:26] joining the V6 world with our AuthDNS is a good thing regardless in terms of driving adoption and such.
[14:51:30] yeah
[14:51:35] yeah that was the point
[14:51:38] it's the right thing to do, much like taking the small hit on the HTTP/2 transition.
[14:51:53] it's also easy to procrastinate this decision, which is what I kinda did before
[14:52:00] me too!
[14:52:12] since it probably gets better as time passes
[14:52:22] the quality of the GeoIPv6 data gets better, I mean
[14:52:27] hopefully :)
[14:52:54] and most users that would be stuck on v6-only networks are probably using ISP (well, Mobile Carrier) DNS servers that have outbound-V4, too, so it's not like it's a huge problem.
[14:53:34] but still, we should move forward with standards and blah blah. not having V6 nameservers in 2016 seems like we're behind the times at this point.
[14:55:45] if we want to do the "what do other big sites do?" game: amazon and facebook have v6 glue, google and twitter don't
[14:55:54] oh they do now
[14:56:05] when I was last checking this none of the top IPv6-enabled sites did
[14:56:28] msn.com and yahoo.com have AAAA glue too
[14:56:51] interesting
[14:58:25] google's probably the one that does the most geographic routing magic of them all, though. so it's a little worrying that they don't.
[15:01:29] microsoft's stuff is kind of split: microsoft.com, msn.com, and live.com have it, but bing.com doesn't
[15:03:15] microsoft's authdns AAAA had a cool idea too, use :53 for the last octet, easy to remember :)
[15:03:44] ns3.msft.net. 172800 IN AAAA 2620:0:34::53 etc...
[15:04:11] Hi, who can I ask a question about the varnish-traffic dashboard on grafana? (I'm wondering about the "by server" graph: are these only the upload caches, and where are the text caches?) https://grafana.wikimedia.org/dashboard/db/varnish-traffic
[15:04:56] e.g., why is cp3031 not in the list?
[15:11:13] dberger: I think it doesn't know how to filter on cluster, only by-dc. Also, these are ethernet-interface level stats in that graph, they're not necessarily useful to interpret as other things (like public traffic levels)
[15:12:13] dberger: the "cp3* reception" graph and such are limited to the top-10, which is why they tend to show upload cluster
[15:16:23] dberger: I didn't create that one, but I'm editing it up a bit now to be more-useful...
[15:16:45] (it still won't split on e.g. text vs upload, though)
[15:20:30] dberger: saved a new version: it no longer limits to 10 nodes per graph, and the tooltip was fixed to be non-cumulative.
[15:46:02] HTTPS, Traffic, Operations: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266#2492282 (BBlack)
[16:06:03] bblack: many thanks for the info!
[16:06:34] bblack: and thanks for editing, though I have to say: it seems to be broken in my Firefox right now
[16:06:59] maybe force reload or close and re-open? it may get confused when the UI changes out from under on edit? Not sure
[16:09:03] bblack: I've tried that, but the graphs at the bottom of the page remain white (axis labels, but no lines and server labels)
[16:12:49] bblack: sorry, third re-open did the job. My bad. I can find all Varnish instances now, great!
[16:16:53] dberger: awesome :)
[16:23:35] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492507 (BBlack) Noting from last meeting about this: We've **tentatively** said we'll try to make this (implementing a robust A/B test infrastructure at the Varnish level) an...
[16:25:05] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492511 (Nuria) Second @BBlack. We will make this a shared goal among traffic and analytics team
[16:30:10] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310174 (ori) What's the rationale for prioritizing it?
[16:34:04] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2492540 (BBlack) It's a seasonal issue that's come up every few months for the past couple of years. Every time we need to run an A/B test, we go back through the same conver...
[23:08:19] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2494260 (Nuria) @BBlack: I volunteer to write a design doc with use cases / high level design ideas and issues by the end of this quarter so we can use it to scope the work we...