[14:07:09] ema: so for the rest of the mobile->text thing: ulsfo is already on etcd, so we can do that today the same was as codfw
[14:07:15] s/was/way/
[14:07:34] we can be a little bit less careful on timing too, as codfw has primed the eqiad backend caches, and ulsfo->eqiad isn't horrible latency
[14:07:57] but I'd still wait 15+ minutes after adding the first text, then just go by 5 min intervals.
[14:08:56] and then _joe_'s last schedule I heard was esams today and eqiad tomorrow for the rest of the pybal+etcd migration. So we can basically follow after him and keep using the same method.
[14:09:55] esams needs to be fairly slowly done, again because of latency and traffic diffs. Ramp in weight on initial one, wait 30+ minutes after initial one reaches full weight, etc...
[14:10:12] eqiad can be done fairly quickly, as misses will fetch from populated backends with near-zero latency.
[14:11:03] esams has the additional complexity that the text cluster currently has uneven weighting baked in by design, because it has two different generations of hardware in it.
[14:11:45] so I guess start the ramp there with the lower-weighted ones, and preserve text's weightings in the end state
[14:24:48] bblack: alright!
[15:15:00] bblack: so, just to be sure. https://gerrit.wikimedia.org/r/#/c/266230/ should be merged first, then I can start the conftool dance. Right?
[15:16:39] ema: yup!
[15:18:44] puppet-merged
[15:19:25] as expected, puppet-merge added all nodes with pooled=no
[15:22:04] sounds good :) don't forget !log for migration start/end too
[15:22:58] bblack: I won't! :)
[15:24:13] but thanks for the reminder, already this morning I forgot to !log my brawl with one of the kafkas
[15:27:02] :)
[15:27:29] bblack, ema: whenever you are not flipping caches all around the world can we chat about https://phabricator.wikimedia.org/T107749 ?
[15:30:13] bblack: should I ramp in weight on the first ulsfo machine (cp4008) or is that unnecessary?
[15:30:37] ema: might as well just to be safe
[15:31:00] alright
[15:31:53] elukey: yeah that topic deserves revisiting for sure
[15:32:49] elukey: it's kinda tied in with this too though: https://phabricator.wikimedia.org/T96848
[15:33:16] WOW
[15:33:19] TL;DR - nginx broke SPDY/3.1 when they introduced H2 in a recent mainline version, and we're kinda stuck because of that, because we want both protocols, not either/or
[15:33:47] how much does the http2 module cost for nginx :P ?
[15:33:53] it's free
[15:34:08] the problem is the patch that introduces it, also guts SPDY/3.1 support
[15:34:17] and we want both for a transitional period
[15:34:36] If I recall correctly cloud flare re-added spdy to their nginx with a patch
[15:34:49] not sure if they maintain it somewhere
[15:35:30] yeah, they basically already did our "option 1" in that ticket for us
[15:36:01] but I haven't gone to look whether they released it or not yet, they said early 2016 I think
[15:36:19] if that option doesn't work out for some reason, though, it's another nail in the "let's move to something other than nginx" coffin
[15:37:23] bumping weight of pooled nodes in ulsfo to 10 unless there are objections
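[editor's note: a minimal sketch of the conftool pool-and-ramp dance described above, assuming the confctl CLI syntax of the era; the service=varnish-fe tag and exact flags are from memory and may not be exact, while the host, cluster and weight steps come straight from the log]

    # pool the first node at its default (low) weight, then ramp it up in steps,
    # waiting and watching traffic between steps as discussed above
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/pooled=yes cp4008.ulsfo.wmnet
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/weight=5 cp4008.ulsfo.wmnet
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/weight=10 cp4008.ulsfo.wmnet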
[15:38:32] elukey: on the varnish tuning comment you made in the ticket: it's entirely possible we can fix that somehow, I don't know yet.
[15:39:14] the bottom line is that the way we have things configured today, because each worker thread can only have one (non-persistent!) connection at a time, this greatly limits total parallelism into varnish, and so nginx is effectively queueing and de-parallelizing the traffic for us.
[15:39:27] which isn't great either :/
[15:40:06] but without the max_conns param, turning on persistence opens the floodgates in the other direction: nginx will actively maximize parallelism and probably overrun thread/conn limits in varnish
[15:40:24] err max_idle_conns I mean
[15:40:40] no, I had it right the first time :)
[15:41:56] have you checked recent numbers of SPDY-but-not-H2 UAs?
[15:42:09] iOS was the biggest one, but that got fixed, right?
[15:42:45] well safari in general was the biggest one
[15:42:51] yeah we can re-check
[15:43:33] (did they patch old iOS, or just fix with iOS 9?)
[15:43:47] anyways, the stats will tell
[15:44:08] just fix with iOS 9 I *think*
[15:44:16] but there was a huge uptake of iOS 9 IIRC from the TLS stats
[15:44:32] cp4008.ulsfo.wmnet: pooled changed no => yes
[15:44:32] cp4008.ulsfo.wmnet: weight changed 1 => 1
[15:44:52] paravoid: I think there may have been an IE version that did spdy-but-not-h2 too, I'm not sure
[15:49:28] bblack: would it be possible to hammer a traffic machine with some synthetic traffic and different settings for Varnish?
[15:49:48] but yeah I guess that we should also decide what to do with spdy
[15:49:50] cp4008.ulsfo.wmnet: weight changed 1 => 5
[15:49:52] elukey: probably with e.g. ab (apachebench)
[15:50:17] elukey: but keep in mind that cp1008 actually does hit the rest of the infrastructure like real live requests. only its nginx and varnish-fe are isolated.
[15:50:24] its stats go into prod stats too
[15:51:17] (so if you force a bunch of cache misses, they will hit other cp10xx text varnish-be, and mw appservers)
[15:52:16] ahhh so I can cause some damage, got it
[15:52:19] elukey: re: connection counts, keep in mind the newer-gen cache machines can have ~32-48 nginx worker procs.
[15:52:35] so the idle connection count applies there too
[15:52:55] (e.g. at "4" like my old patch, we're looking at a floor of 128 or 192 conns -> varnish
[15:52:58] )
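[editor's note: a minimal sketch of the kind of ab run being suggested, assuming cp1008 is reachable at the hostname shown (that hostname and the URL path are guesses, not taken from the log); bblack's warning above applies, since misses from cp1008 hit the real text backends and appservers, and with 32-48 nginx workers even a keepalive of 4 already means a 128-192 idle-connection floor into varnish]

    # 10k requests, 64 concurrent, with keepalive, against one cache host;
    # the Host header selects the site, the URL path is only illustrative
    ab -n 10000 -c 64 -k \
       -H "Host: en.wikipedia.org" \
       "https://cp1008.wikimedia.org/wiki/Special:BlankPage"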
[15:53:15] cp4008.ulsfo.wmnet: weight changed 5 => 10
[15:54:09] cp4008 is now a fully functioning member of dc=ulsfo,cluster=cache_mobile
[15:54:30] I'll wait ~10 minutes and then start adding the others with weight=10
[15:54:54] bblack: if you agree, that is. :)
[16:00:50] bblack: got it, I'll try to check the nginx http2 patch first to see if the spdy removal is entangled with http2. It doesn't seem so at first sight but I would be worried about preferring spdy to http2
[16:01:07] I mean, causing nginx to choose spdy over http2
[16:01:34] the other main problem is that there will be no bug fixes for spdy
[16:01:41] so really tricky :(
[16:03:12] Apache Traffic Server seems to support TLS, H2 and SPDY, but big change
[16:04:27] ATS, when working as a forward SPDY proxy, was exploding at ~100M/s for some reason that I never really had the time to debug
[16:04:40] ema: yeah ok :)
[16:05:27] elukey: we already know about the nginx mainline patch: it mostly *replaced* the spdy code with h2 code (as there is very little difference)
[16:05:43] as opposed to factoring it so the two could share the common parts of the code
[16:06:09] re: ATS, that's something I've wanted to explore for a while, but it's way down the priority list right now
[16:06:12] ah ok, so a bit of a mess
[16:06:35] we haven't tested it at all here, but I'm hopeful that a well-designed ATS setup could be better for us than varnish. there's a long road to figuring out if that's true...
[16:06:38] cp4009.ulsfo.wmnet: pooled changed no => yes
[16:06:39] cp4009.ulsfo.wmnet: weight changed 1 => 10
[16:06:59] definitely post-varnish4-transition at this point, in any case
[16:07:56] oh yes for sure, not sure how we could replace all the nginx and varnish infrastructure with ATS before Varnish 4 :)
[16:08:17] (one of the larger selling points for ATS, btw, is that they're not doing the semi-open-source freemium model like varnish and nginx are these days, because they're ASF, which in general means they're more-aligned with us philosophically on all things related...)
[16:08:34] right
[16:10:01] yep very good point
[16:11:46] cp4010.ulsfo.wmnet: pooled changed no => yes
[16:11:46] cp4010.ulsfo.wmnet: weight changed 1 => 10
[16:13:09] related to the spdy/h2 thing and the above about open-source, it's also possible in the nearer term we could explore s/nginx/apache/ just for TLS termination.
[16:13:51] I think a modern apache can do everything we get from our patched nginx probably (including dual ECDSA/RSA certs, OCSP, SPDY+H2, etc), but it would probably have to be a custom package based on sid, maybe with additional patches that already exist out there in the world.
[16:14:02] plus testing and tuning that's a viable replacement perf-wise
[16:14:58] bblack: I believe that httpd has all the features we need but sadly mpm_event will not work with TLS, and we'll need to fall back to the worker mpm..
[16:15:22] well the nginx setup we use now is effectively like forked worker_mpm
[16:15:31] well, kinda
[16:15:38] it's like forked works with event inside them, I guess
[16:15:42] s/works/workers/
[16:16:25] cp4016.ulsfo.wmnet: pooled changed no => yes
[16:16:25] cp4016.ulsfo.wmnet: weight changed 1 => 10
[16:17:22] anyways, who knows, there may be some way to make apache performant enough though :)
[16:21:10] H2 is also heavily developed in httpd, good point in favor. Well we could run some tests and see, worst that can happen is that ATS will be the next candidate
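[editor's note: one cheap way to "run some tests and see" which protocols a given TLS terminator actually offers; at the time SPDY/3.1 was typically negotiated via NPN and H2 via ALPN, hence the two probes. The target hostname is only an example, and the -alpn option assumes a reasonably recent OpenSSL (1.0.2+)]

    # list the protocols the server advertises via NPN (the SPDY-era negotiation)
    openssl s_client -connect en.wikipedia.org:443 -nextprotoneg '' </dev/null 2>/dev/null | grep -i protocols
    # check whether h2 (or spdy/3.1) gets selected via ALPN
    openssl s_client -connect en.wikipedia.org:443 -alpn 'h2,spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i alpn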
[16:21:51] cp4017.ulsfo.wmnet: pooled changed no => yes
[16:21:51] cp4017.ulsfo.wmnet: weight changed 1 => 10
[16:26:40] cp4018.ulsfo.wmnet: pooled changed no => yes
[16:26:40] cp4018.ulsfo.wmnet: weight changed 1 => 10
[16:26:49] and we're done adding nodes
[16:29:43] ema: busy day :)
[16:30:30] not getting bored at all!
[16:31:00] will start depooling mobile nodes soon
[16:33:58] cp4011.ulsfo.wmnet: pooled changed yes => no
[16:34:10] (ETA 20 minutes)
[16:38:11] cp4012.ulsfo.wmnet: pooled changed yes => no
[16:43:10] cp4019.ulsfo.wmnet: pooled changed yes => no
[16:43:38] hey folks
[16:43:49] is there a task documenting the traffic-related codfw-rollout steps already?
[16:44:00] presumably upgrading codfw to tier 1?
[16:44:09] not really, no
[16:44:52] the plan would diverge strongly depending on how we solve the encryption problem though
[16:45:01] do you have any thoughts on the matter already or should I file a more generic task for now?
[16:45:39] well, assuming x-dc applayer crypto out the back edge of varnish is a solved problem....
[16:45:50] it goes something like:
[16:47:04] 1) remove users from codfw in geodns, 2) reconfigure tier-1 varnish-be to know how to reach applayer at either tier-1 site, controlled by some etcd flag or puppet flag or whatever, which is set for "applayer in eqiad" initially.
[16:47:32] 3) reconfigure codfw varnishes to consider themselves tier-1
[16:47:48] 4) bring users back onto codfw (which doesn't now backend to eqiad caches, it backends to eqiad applayer in practice)
[16:48:05] 5) switch ulsfo from backending to eqiad-tier-1 to backending to codfw-tier-1
[16:48:12] cp4020.ulsfo.wmnet: pooled changed yes => no
[16:48:16] or something like that
[16:48:18] and we're done with ulsfo
[16:48:21] it doesn't have to be very specific *right* now, but we need to track this work (and at a priority) for our quarterly goal
[16:48:26] https://gerrit.wikimedia.org/r/266253
[16:48:45] paravoid: in the net, the bottom line is it's like a week of figuring out complicated puppet/template bits, and a week of deployment schedule.
[16:48:46] I'd be inclined to file an even more generic task than that, but if you want to file a more specific one, that's fine too
[16:49:00] but the blocker in front of it is "fix varnish outbound TLS", which is much harder.
[16:49:41] I thought it was a blocker too, mark was swearing that we said it wasn't :)
[16:50:03] that was in person, when I was telling elukey about all that :)
[16:50:05] I think we said we could re-interpret that as unnecessary, kinda
[16:50:19] but the plan is a little different in that scenario?
[16:50:19] so let's write these down and add appropriate blocking tasks I'd say
[16:50:45] oh mark's right actually
[16:50:57] we'd like to solve varnish outbound first and do it like above, but we don't technically have to.
[16:51:30] it's just, the failover is a little uglier then.
[16:51:41] and it's failover, not active-active t1 at the varnish layer
[16:52:57] uhm, ok
[16:53:03] task(s) please? :)
[16:53:16] it's ugly
[16:53:18] yeah
[16:53:18] I'm slightly confused now
[16:53:49] it's very confusing :)
[16:53:59] it would be better if we could just fix TLS first
[16:54:14] anything else makes it much uglier
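[editor's note: purely a sketch of step 2 of the plan above, for the etcd-flag variant; the chat deliberately leaves the mechanism open ("some etcd flag or puppet flag or whatever"), so the key name and layout here are invented for illustration and are not the real conftool/etcd schema]

    # set the flag that the varnish-be config templating would read
    etcdctl set /hypothetical/traffic/text/applayer-site eqiad
    # later, to move the applayer target during a switchover:
    etcdctl set /hypothetical/traffic/text/applayer-site codfw
    etcdctl get /hypothetical/traffic/text/applayer-site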
[17:01:49] 7HTTPS, 6Analytics-Kanban, 6operations, 5Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1962106 (10Nuria) 5Open>3Resolved
[17:50:17] 10netops, 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1962276 (10Dzahn)
[18:19:39] perf numbers look good -- https://performance.wikimedia.org/#!/week
[18:20:05] i don't think we've had any perf fixes roll out so maybe it's whatever it is you guys have been working on?
[18:20:59] monthly view useful too
[18:21:26] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1962449 (10Dzahn) >>! In T118468#1958689, @JanZerebecki wrote: > I guess, to transfer WMF legal needs to be willing to host all WLM...
[18:22:51] ori: possibly, hard to say
[18:23:12] I mean, maybe, from expanded potential cache size and more frontends for mobile? shouldn't have had much effect on text
[18:39:45] ori: another thing that could be related is network links. I know we've turned several up recently, I don't know how likely they were to cause any net decreases in user latency.
[18:42:04] 7HTTPS, 6Research-and-Data, 10The-Wikipedia-Library, 10Wikimedia-General-or-Unknown, and 3 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#1962546 (10DarTar) Started a preliminary discussion with Ops on the timeline of a...
[19:25:51] bblack: so esams tomorrow as soon as the pybal etcd rollout is over?
[19:34:24] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1962960 (10faidon) p:5Triage>3Normal
[19:35:14] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1936607 (10faidon) I removed the static routes from cr1/cr2-eqiad for the 6to4 and Teredo routes. Nitrogen shouldn't be used anymore and can be decommissioned (I adjusted the task description).
[19:36:45] ema: yeah
[19:37:14] or I may start it later tonight. we're clear from _joe_ to switch esams, just have to make the switch and confirm LVS ok
[19:37:41] (the etcd switch)
[19:39:27] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963001 (10Dzahn) a:3Dzahn cool, thanks Faidon. i'll take the decom part
[19:41:49] bblack: sounds good!
[19:43:29] I call it a day then, see you tomorrow
[19:44:30] ema: cya tomorrow :)
[19:50:11] <_joe_> bblack: did you switch esams by any chance?
[19:52:28] _joe_: no I didn't get to it, but I plan to sometime soon-ish
[19:52:38] _joe_: you're welcome to as well if you wanna do it faster than me! :)
[20:03:00] <_joe_> bblack: I've prepared the changes, but I'm pretty tired tbh
[20:05:17] _joe_: np, go rest :)
[21:09:02] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1963478 (10Mattflaschen) 3NEW
[21:27:44] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963555 (10Dzahn) used this as an example to go through decom process with @papaul he made https://gerrit.wikimedia.org/r/#/c/266310/ and https://gerrit.wikimedia.org/r/#/c/266311/
[21:52:59] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963619 (10Dzahn) server has been shutdown, removed from puppet, DHCP revoked puppet cert and salt-key, disable notifications and removed from Icinga/stored configs
[21:58:09] bblack: somewhat unrelated, but do you know what happened to alex's patch to enable websockets for misc-varnish?
[21:58:54] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963654 (10Dzahn) @papaul is going to follow-up with a DNS change and subtask for onsite-tech
[21:59:09] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963655 (10Dzahn) a:5Dzahn>3Papaul
[22:01:45] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul) 3NEW a:3Dzahn
[22:02:18] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul)
[22:02:39] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963677 (10Dzahn)
[22:09:02] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963710 (10JanZerebecki) https://grafana-admin.wikimedia.org/dashboard/db/tmp-t124418
[22:12:52] YuviPanda: yeah I donno, I vaguely remember that, for phabricator I guess?
[22:13:08] It was for etherpad I think
[22:13:12] and phab wanted it too
[22:18:20] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963752 (10ori) >>! In T124418#1963710, @JanZerebecki wrote: > https://grafana-admin.wikimedia.org/dashboard/d...
[23:06:06] bblack: For your consideration: https://gerrit.wikimedia.org/r/#/c/266332/ and https://gerrit.wikimedia.org/r/#/c/266414/ ; these patches make the testwiki/mw1017 changes I proposed on the engineering list (https://lists.wikimedia.org/pipermail/engineering/2016-January/000017.html)
[23:13:02] ori: the differing backend names thing is odd, I'd have thought we'd have noticed earlier
[23:13:06] since they run the same VCL now heh
[23:13:39] yeah, i meant to congratulate you on that -- i know it has been in the works for a while
[23:14:17] oh I guess the commit predates that original, it's just the commitmsg
[23:16:10] ori: ok looking at DNS relatedly:
[23:16:10] templates/wikidata.org:test 600 IN DYNA geoip!text-addrs
[23:16:10] templates/wikidata.org:test.m 600 IN DYNA geoip!mobile-addrs
[23:16:10] templates/wikipedia.org:test 600 IN DYNA geoip!text-addrs
[23:16:13] templates/wikipedia.org:test.m 600 IN DYNA geoip!mobile-addrs
[23:16:32] are users of test.wd.o ok with it too?
[23:16:45] I mean, I assume we'd pull all those hostnames at that point
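[editor's note: the zone lines above are gdnsd DYNA records answered by the geoip plugin; a quick way to see what those names currently return, before or after pulling them, is plain dig. The choice of ns0.wikimedia.org as an authoritative server to query is from memory, and answers will vary by resolver location since the records are geoip-based]

    dig +short test.wikipedia.org
    dig +short test.m.wikipedia.org
    # or ask an authoritative server directly:
    dig +short test.m.wikidata.org @ns0.wikimedia.org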