[14:07:09] ema: so for the rest of the mobile->text thing: ulsfo is already on etcd, so we can do that today the same was as codfw
[14:07:15] s/was/way/
[14:07:34] we can be a little bit less careful on timing too, as codfw has primed the eqiad backend caches, and ulsfo->eqiad isn't horrible latency
[14:07:57] but I'd still wait 15+ minutes after adding the first text, then just go by 5 min intervals.
[14:08:56] and then _joe_'s last schedule I heard was esams today and eqiad tomorrow for the rest of the pybal+etcd migration. So we can basically follow after him and keep using the same method.
[14:09:55] esams needs to be fairly slowly done, again because of latency and traffic diffs. Ramp in weight on initial one, wait 30+ minutes after initial one reaches full weight, etc...
[14:10:12] eqiad can be done fairly quickly, as misses will fetch from populated backends with near-zero latency.
[14:11:03] esams has the additional complexity that the text cluster currently has uneven weighting baked in by design, because it has two different generations of hardware in it.
[14:11:45] so I guess start the ramp there with the lower-weighted ones, and preserve text's weightings in the end state
[14:24:48] bblack: alright!
[15:15:00] bblack: so, just to be sure. https://gerrit.wikimedia.org/r/#/c/266230/ should be merged first, then I can start the conftool dance. Right?
[15:16:39] ema: yup!
[15:18:44] puppet-merged
[15:19:25] as expected, puppet-merge added all nodes with pooled=no
[15:22:04] sounds good :) don't forget !log for migration start/end too
[15:22:58] bblack: I won't! :)
[15:24:13] but thanks for the reminder, already this morning I forgot to !log my brawl with one of the kafkas
[15:27:02] :)
[15:27:29] bblack, ema: whenever you are not flipping caches all around the world can we chat about https://phabricator.wikimedia.org/T107749 ?
[15:30:13] bblack: should I ramp in weight on the first ulsfo machine (cp4008) or is that unnecessary?
[15:30:37] ema: might as well just to be safe
[15:31:00] alright
[15:31:53] elukey: yeah that topic deserves revisiting for sure
[15:32:49] elukey: it's kinda tied in with this too though: https://phabricator.wikimedia.org/T96848
[15:33:16] WOW
[15:33:19] TL;DR - nginx broke SPDY/3.1 when they introduced H2 in a recent mainline version, and we're kinda stuck because of that, because we want both protocols, not either/or
[15:33:47] how much does the http2 module cost for nginx :P ?
[15:33:53] it's free
[15:34:08] the problem is the patch that introduces it, also guts SPDY/3.1 support
[15:34:17] and we want both for a transitional period
[15:34:36] If I recall correctly cloud flare re-added spdy to their nginx with a patch
[15:34:49] not sure if they maintain it somewhere
[15:35:30] yeah, they basically already did our "option 1" in that ticket for us
[15:36:01] but I haven't gone to look whether they released it or not yet, they said early 2016 I think
[15:36:19] if that option doesn't work out for some reason, though, it's another nail in the "let's move to something other than nginx" coffin
[15:37:23] bumping weight of pooled nodes in ulsfo to 10 unless there are objections
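[editor's note: a minimal sketch of the conftool pool-and-ramp dance described above, assuming the confctl CLI syntax of the era; the service=varnish-fe tag and exact flags are from memory and may not be exact, while the host, cluster and weight steps come straight from the log]

    # pool the first node at its default (low) weight, then ramp it up in steps,
    # waiting and watching traffic between steps as discussed above
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/pooled=yes cp4008.ulsfo.wmnet
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/weight=5 cp4008.ulsfo.wmnet
    confctl --tags dc=ulsfo,cluster=cache_mobile,service=varnish-fe \
        --action set/weight=10 cp4008.ulsfo.wmnet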
[15:38:32] elukey: on the varnish tuning comment you made in the ticket: it's entirely possible we can fix that somehow, I don't know yet.
[15:39:14] the bottom line is that the way we have things configured today, because each worker thread can only have one (non-persistent!) connection at a time, this greatly limits total parallelism into varnish, and so nginx is effectively queueing and de-parallelizing the traffic for us.
[15:39:27] which isn't great either :/
[15:40:06] but without the max_conns param, turning on persistence opens the floodgates in the other direction: nginx will actively maximize parallelism and probably overrun thread/conn limits in varnish
[15:40:24] err max_idle_conns I mean
[15:40:40] no, I had it right the first time :)
[15:41:56] have you checked recent numbers of SPDY-but-not-H2 UAs?
[15:42:09] iOS was the biggest one, but that got fixed, right?
[15:42:45] well safari in general was the biggest one
[15:42:51] yeah we can re-check
[15:43:33] (did they patch old iOS, or just fix with iOS 9?)
[15:43:47] anyways, the stats will tell
[15:44:08] just fix with iOS 9 I *think*
[15:44:16] but there was a huge uptake of iOS 9 IIRC from the TLS stats
[15:44:32] cp4008.ulsfo.wmnet: pooled changed no => yes
[15:44:32] cp4008.ulsfo.wmnet: weight changed 1 => 1
[15:44:52] paravoid: I think there may have been an IE version that did spdy-but-not-h2 too, I'm not sure
[15:49:28] bblack: would it be possible to hammer a traffic machine with some synthetic traffic and different settings for Varnish?
[15:49:48] but yeah I guess that we should also decide what to do with spdy
[15:49:50] cp4008.ulsfo.wmnet: weight changed 1 => 5
[15:49:52] elukey: probably with e.g. ab (apachebench)
[15:50:17] elukey: but keep in mind that cp1008 actually does hit the rest of the infrastructure like real live requests. only its nginx and varnish-fe are isolated.
[15:50:24] its stats go into prod stats too
[15:51:17] (so if you force a bunch of cache misses, they will hit other cp10xx text varnish-be, and mw appservers)
[15:52:16] ahhh so I can cause some damage, got it
[15:52:19] elukey: re: connection counts, keep in mind the newer-gen cache machines can have ~32-48 nginx worker procs.
[15:52:35] so the idle connection count applies there too
[15:52:55] (e.g. at "4" like my old patch, we're looking at a floor of 128 or 192 conns -> varnish
[15:52:58] )
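[editor's note: a minimal sketch of the kind of ab run being suggested, assuming cp1008 is reachable at the hostname shown (that hostname and the URL path are guesses, not taken from the log); bblack's warning above applies, since misses from cp1008 hit the real text backends and appservers, and with 32-48 nginx workers even a keepalive of 4 already means a 128-192 idle-connection floor into varnish]

    # 10k requests, 64 concurrent, with keepalive, against one cache host;
    # the Host header selects the site, the URL path is only illustrative
    ab -n 10000 -c 64 -k \
       -H "Host: en.wikipedia.org" \
       "https://cp1008.wikimedia.org/wiki/Special:BlankPage"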
[15:53:15] cp4008.ulsfo.wmnet: weight changed 5 => 10
[15:54:09] cp4008 is now a fully functioning member of dc=ulsfo,cluster=cache_mobile
[15:54:30] I'll wait ~10 minutes and then start adding the others with weight=10
[15:54:54] bblack: if you agree, that is. :)
[16:00:50] bblack: got it, I'll try to check the nginx http2 patch first to see if the spdy removal is entangled with http2. It doesn't seem so at first sight but I would be worried about preferring spdy to http2
[16:01:07] I mean, causing nginx to choose spdy over http2
[16:01:34] the other main problem is that there will be no bug fixes for spdy
[16:01:41] so really tricky :(
[16:03:12] Apache Traffic Server seems to support TLS, H2 and SPDY, but big change
[16:04:27] ATS, when working as a forward SPDY proxy, was exploding at ~100M/s for some reason that I never really had the time to debug
[16:04:40] ema: yeah ok :)
[16:05:27] elukey: we already know about the nginx mainline patch: it mostly *replaced* the spdy code with h2 code (as there is very little difference)
[16:05:43] as opposed to factoring it so the two could share the common parts of the code
[16:06:09] re: ATS, that's something I've wanted to explore for a while, but it's way down the priority list right now
[16:06:12] ah ok, so a bit of a mess
[16:06:35] we haven't tested it at all here, but I'm hopeful that a well-designed ATS setup could be better for us than varnish. there's a long road to figuring out if that's true...
[16:06:38] cp4009.ulsfo.wmnet: pooled changed no => yes
[16:06:39] cp4009.ulsfo.wmnet: weight changed 1 => 10
[16:06:59] definitely post-varnish4-transition at this point, in any case
[16:07:56] oh yes for sure, not sure how we could replace all the nginx and varnish infrastructure with ATS before Varnish 4 :)
[16:08:17] (one of the larger selling points for ATS, btw, is that they're not doing the semi-open-source freemium model like varnish and nginx are these days, because they're ASF, which in general means they're more-aligned with us philosophically on all things related...)
[16:08:34] right
[16:10:01] yep very good point
[16:11:46] cp4010.ulsfo.wmnet: pooled changed no => yes
[16:11:46] cp4010.ulsfo.wmnet: weight changed 1 => 10
[16:13:09] related to the spdy/h2 thing and the above about open-source, it's also possible in the nearer term we could explore s/nginx/apache/ just for TLS termination.
[16:13:51] I think a modern apache can do everything we get from our patched nginx probably (including dual ECDSA/RSA certs, OCSP, SPDY+H2, etc), but it would probably have to be a custom package based on sid, maybe with additional patches that already exist out there in the world.
[16:14:02] plus testing and tuning that's a viable replacement perf-wise
[16:14:58] bblack: I believe that httpd has all the features we need but sadly mpm_event will not work with TLS, and we'll need to fall back to the worker mpm..
[16:15:22] well the nginx setup we use now is effectively like forked worker_mpm
[16:15:31] well, kinda
[16:15:38] it's like forked works with event inside them, I guess
[16:15:42] s/works/workers/
[16:16:25] cp4016.ulsfo.wmnet: pooled changed no => yes
[16:16:25] cp4016.ulsfo.wmnet: weight changed 1 => 10
[16:17:22] anyways, who knows, there may be some way to make apache performant enough though :)
[16:21:10] H2 is also heavily developed in httpd, good point in favor. Well we could run some tests and see, worst that can happen is that ATS will be the next candidate
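[editor's note: one cheap way to "run some tests and see" which protocols a given TLS terminator actually offers; at the time SPDY/3.1 was typically negotiated via NPN and H2 via ALPN, hence the two probes. The target hostname is only an example, and the -alpn option assumes a reasonably recent OpenSSL (1.0.2+)]

    # list the protocols the server advertises via NPN (the SPDY-era negotiation)
    openssl s_client -connect en.wikipedia.org:443 -nextprotoneg '' </dev/null 2>/dev/null | grep -i protocols
    # check whether h2 (or spdy/3.1) gets selected via ALPN
    openssl s_client -connect en.wikipedia.org:443 -alpn 'h2,spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i alpn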
[16:21:51] cp4017.ulsfo.wmnet: pooled changed no => yes
[16:21:51] cp4017.ulsfo.wmnet: weight changed 1 => 10
[16:26:40] cp4018.ulsfo.wmnet: pooled changed no => yes
[16:26:40] cp4018.ulsfo.wmnet: weight changed 1 => 10
[16:26:49] and we're done adding nodes
[16:29:43] ema: busy day :)
[16:30:30] not getting bored at all!
[16:31:00] will start depooling mobile nodes soon
[16:33:58] cp4011.ulsfo.wmnet: pooled changed yes => no
[16:34:10] (ETA 20 minutes)
[16:38:11] cp4012.ulsfo.wmnet: pooled changed yes => no
[16:43:10] cp4019.ulsfo.wmnet: pooled changed yes => no
[16:43:38] hey folks
[16:43:49] is there a task documenting the traffic-related codfw-rollout steps already?
[16:44:00] presumably upgrading codfw to tier 1?
[16:44:09] not really, no
[16:44:52] the plan would diverge strongly depending on how we solve the encryption problem though
[16:45:01] do you have any thoughts on the matter already or should I file a more generic task for now?
[16:45:39] well, assuming x-dc applayer crypto out the back edge of varnish is a solved problem....
[16:45:50] it goes something like:
[16:47:04] 1) remove users from codfw in geodns, 2) reconfigure tier-1 varnish-be to know how to reach applayer at either tier-1 site, controlled by some etcd flag or puppet flag or whatever, which is set for "applayer in eqiad" initially.
[16:47:32] 3) reconfigure codfw varnishes to consider themselves tier-1
[16:47:48] 4) bring users back onto codfw (which doesn't now backend to eqiad caches, it backends to eqiad applayer in practice)
[16:48:05] 5) switch ulsfo from backending to eqiad-tier-1 to backending to codfw-tier-1
[16:48:12] cp4020.ulsfo.wmnet: pooled changed yes => no
[16:48:16] or something like that
[16:48:18] and we're done with ulsfo
[16:48:21] it doesn't have to be very specific *right* now, but we need to track this work (and at a priority) for our quarterly goal
[16:48:26] https://gerrit.wikimedia.org/r/266253
[16:48:45] paravoid: in the net, the bottom line is it's like a week of figuring out complicated puppet/template bits, and a week of deployment schedule.
[16:48:46] I'd be inclined to file an even more generic task than that, but if you want to file a more specific one, that's fine too
[16:49:00] but the blocker in front of it is "fix varnish outbound TLS", which is much harder.
[16:49:41] I thought it was a blocker too, mark was swearing that we said it wasn't :)
[16:50:03] that was in person, when I was telling elukey about all that :)
[16:50:05] I think we said we could re-interpret that as unnecessary, kinda
[16:50:19] but the plan is a little different in that scenario?
[16:50:19] so let's write these down and add appropriate blocking tasks I'd say
[16:50:45] oh mark's right actually
[16:50:57] we'd like to solve varnish outbound first and do it like above, but we don't technically have to.
[16:51:30] it's just, the failover is a little uglier then.
[16:51:41] and it's failover, not active-active t1 at the varnish layer
[16:52:57] uhm, ok
[16:53:03] task(s) please? :)
[16:53:16] it's ugly
[16:53:18] yeah
[16:53:18] I'm slightly confused now
[16:53:49] it's very confusing :)
[16:53:59] it would be better if we could just fix TLS first
[16:54:14] anything else makes it much uglier
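[editor's note: purely a sketch of step 2 of the plan above, for the etcd-flag variant; the chat deliberately leaves the mechanism open ("some etcd flag or puppet flag or whatever"), so the key name and layout here are invented for illustration and are not the real conftool/etcd schema]

    # set the flag that the varnish-be config templating would read
    etcdctl set /hypothetical/traffic/text/applayer-site eqiad
    # later, to move the applayer target during a switchover:
    etcdctl set /hypothetical/traffic/text/applayer-site codfw
    etcdctl get /hypothetical/traffic/text/applayer-site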
[17:01:49] 7HTTPS, 6Analytics-Kanban, 6operations, 5Patch-For-Review: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1962106 (10Nuria) 5Open>3Resolved
[17:50:17] 10netops, 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1962276 (10Dzahn)
[18:19:39] perf numbers look good -- https://performance.wikimedia.org/#!/week
[18:20:05] i don't think we've had any perf fixes roll out so maybe it's whatever it is you guys have been working on?
[18:20:59] monthly view useful too
[18:21:26] 10Wikimedia-DNS, 7domains, 10Wiki-Loves-Monuments-General, 6operations, 5Patch-For-Review: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1962449 (10Dzahn) >>! In T118468#1958689, @JanZerebecki wrote: > I guess, to transfer WMF legal needs to be willing to host all WLM...
[18:22:51] ori: possibly, hard to say
[18:23:12] I mean, maybe, from expanded potential cache size and more frontends for mobile? shouldn't have had much effect on text
[18:39:45] ori: another thing that could be related is network links. I know we've turned several up recently, I don't know how likely they were to cause any net decreases in user latency.
[18:42:04] 7HTTPS, 6Research-and-Data, 10The-Wikipedia-Library, 10Wikimedia-General-or-Unknown, and 3 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#1962546 (10DarTar) Started a preliminary discussion with Ops on the timeline of a...
[19:25:51] bblack: so esams tomorrow as soon as the pybal etcd rollout is over?
[19:34:24] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1962960 (10faidon) p:5Triage>3Normal
[19:35:14] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1936607 (10faidon) I removed the static routes from cr1/cr2-eqiad for the 6to4 and Teredo routes. Nitrogen shouldn't be used anymore and can be decommissioned (I adjusted the task description).
[19:36:45] ema: yeah
[19:37:14] or I may start it later tonight. we're clear from _joe_ to switch esams, just have to make the switch and confirm LVS ok
[19:37:41] (the etcd switch)
[19:39:27] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963001 (10Dzahn) a:3Dzahn cool, thanks Faidon. i'll take the decom part
[19:41:49] bblack: sounds good!
[19:43:29] I call it a day then, see you tomorrow
[19:44:30] ema: cya tomorrow :)
[19:50:11] <_joe_> bblack: did you switch esams by any chance?
[19:52:28] _joe_: no I didn't get to it, but I plan to sometime soon-ish
[19:52:38] _joe_: you're welcome to as well if you wanna do it faster than me! :)
[20:03:00] <_joe_> bblack: I've prepared the changes, but I'm pretty tired tbh
[20:05:17] _joe_: np, go rest :)
[21:09:02] 7Varnish, 10MediaWiki-Vagrant: Varnish failed to provision - https://phabricator.wikimedia.org/T124711#1963478 (10Mattflaschen) 3NEW
[21:27:44] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963555 (10Dzahn) used this as an example to go through decom process with @papaul he made https://gerrit.wikimedia.org/r/#/c/266310/ and https://gerrit.wikimedia.org/r/#/c/266311/
[21:52:59] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963619 (10Dzahn) server has been shutdown, removed from puppet, DHCP revoked puppet cert and salt-key, disable notifications and removed from Icinga/stored configs
[21:58:09] bblack: somewhat unrelated, but do you know what happened to alex's patch to enable websockets for misc-varnish?
[21:58:54] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963654 (10Dzahn) @papaul is going to follow-up with a DNS change and subtask for onsite-tech
[21:59:09] 10netops, 6operations: Decommission nitrogen (IPv6 relay) - https://phabricator.wikimedia.org/T123732#1963655 (10Dzahn) a:5Dzahn>3Papaul
[22:01:45] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul) 3NEW a:3Dzahn
[22:02:18] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (10Papaul)
[22:02:39] 10netops, 6operations: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963677 (10Dzahn)
[22:09:02] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963710 (10JanZerebecki) https://grafana-admin.wikimedia.org/dashboard/db/tmp-t124418
[22:12:52] YuviPanda: yeah I donno, I vaguely remember that, for phabricator I guess?
[22:13:08] It was for etherpad I think
[22:13:12] and phab wanted it too
[22:18:20] 10Traffic, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1963752 (10ori) >>! In T124418#1963710, @JanZerebecki wrote: > https://grafana-admin.wikimedia.org/dashboard/d...
[23:06:06] bblack: For your consideration: https://gerrit.wikimedia.org/r/#/c/266332/ and https://gerrit.wikimedia.org/r/#/c/266414/ ; these patches make the testwiki/mw1017 changes I proposed on the engineering list (https://lists.wikimedia.org/pipermail/engineering/2016-January/000017.html)
[23:13:02] ori: the differing backend names thing is odd, I'd have thought we'd have noticed earlier
[23:13:06] since they run the same VCL now heh
[23:13:39] yeah, i meant to congratulate you on that -- i know it has been in the works for a while
[23:14:17] oh I guess the commit predates that original, it's just the commitmsg
[23:16:10] ori: ok looking at DNS relatedly:
[23:16:10] templates/wikidata.org:test 600 IN DYNA geoip!text-addrs
[23:16:10] templates/wikidata.org:test.m 600 IN DYNA geoip!mobile-addrs
[23:16:10] templates/wikipedia.org:test 600 IN DYNA geoip!text-addrs
[23:16:13] templates/wikipedia.org:test.m 600 IN DYNA geoip!mobile-addrs
[23:16:32] are users of test.wd.o ok with it too?
[23:16:45] I mean, I assume we'd pull all those hostnames at that point
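[editor's note: the zone lines above are gdnsd DYNA records answered by the geoip plugin; a quick way to see what those names currently return, before or after pulling them, is plain dig. The choice of ns0.wikimedia.org as an authoritative server to query is from memory, and answers will vary by resolver location since the records are geoip-based]

    dig +short test.wikipedia.org
    dig +short test.m.wikipedia.org
    # or ask an authoritative server directly:
    dig +short test.m.wikidata.org @ns0.wikimedia.org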