[03:52:16] 10Traffic, 6Operations, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2034107 (10ori) p:5Unbreak!>3Normal
[07:35:14] 10Traffic, 6Operations, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2034405 (10MoritzMuehlenhoff) Proposed patch by upstream at https://trac.nginx.org/nginx/ticket/901#comment:4 (but not yet merged into nginx Mercurial)
[08:09:33] ema: http://info.varnish-software.com/blog/pinterest-speaker-sf-summit
[08:09:47] "Pinterest is rolling out a migration from Varnish 3 to 4. In this session, Pinterest’s Jenifer Zinner will share the lessons learned during the migration "
[08:09:52] :D
[09:45:41] elukey: cool! I'd like to also share my lessons learned during the migration
[09:46:18] even though the event security would probably kick me out after a while
[09:48:24] Five reasons to love Varnish this Valentine’s Day
[09:48:37] (go tell my gf)
[09:51:52] ahhahah
[10:22:04] oh we reverted the ttl_fixed_1be change
[11:20:52] 10Traffic, 6Operations, 5Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2034832 (10ema)
[13:28:21] bblack: thanks for the VCL review, I'm addressing your comments
[13:28:31] BWT what happened to the ttl_fixed_1be patch?
[13:28:48] I've seen it got reverted
[13:29:24] s/BWT/BTW/ :)
[13:51:14] it got reverted, then the revert was reverted, then the reverted revert was reverted.
[13:51:46] because I was looking at cache_upload a lot yesterday, and I came to understand that my thinking was off and the change probably wasn't valid.
[13:52:04] we can't assume that hacking beresp.ttl has any transitive effect on the next cache up the layers, only on the local storage of the object.
[16:06:52] so
[16:07:01] shall we sync up on codfw-rollout before the meeting in an hour?
[16:11:39] we can!
[16:12:31] paravoid: the biggest question in my mind right now is that I'm not sure there's any meaningful way to switch cache tier-1 around without being in sync with the switch of MW/services
[16:13:01] maybe part of the meta-problem with discussing this is that cache tier-1 is an overloaded term now
[16:13:24] well, tier-1 is
[16:13:58] anyways....
[16:14:12] how do you mean?
[16:14:19] x-dc latency is not completely ignorable. we're still talking +35ms or so
[16:14:44] if I assumed that mediawiki was stuck in eqiad and not ready to switch (or alternatively, that we're testing these two things independently)
[16:15:18] then the only meaning for "switch cache tier1 but not applayer tier1 to codfw" is to re-arrange the cache traffic to end up in codfw backends, and have codfw backend caches talk to eqiad appservers x-dc
[16:15:31] which adds +35ms latency on top of whatever other effects
[16:16:13] I don't know that that's a state we'd ever choose to be in with two available tier-1 DCs in the general case, right?
[16:18:35] so rewinding a bit, there are two senses of what "cache tier-1" means: where the cache PoPs send their backend traffic, and some notion of which applayer DC that cache routes traffic to
[16:18:56] well, adds +35ms for cache misses
[16:19:15] yeah, I just don't know how well we cope with that in the general case yet. I guess it's something to find out!
[16:19:34] what do you mean cope?
[16:19:50] perf team won't be happy but other than that I don't see a huge effect
[16:20:14] well if it was s/20ms/35ms/ I'd be completely comfortable saying it's just a perf effect
[16:20:26] but s/0ms/35ms/ can change a lot of things
[16:20:48] there can be multi-request patterns
[16:21:06] like user->cache->applayer->cache->applayer->cache->applayer
[16:21:12] well it's never 0ms :)
[16:21:17] but you get what I mean
[16:21:19] always coming from a client, somewhere far away
[16:21:31] local-network latency is small, and there could be in-built assumptions around that for timing/failing
[16:21:42] our internal apps may also make requests back through the front of the caches
[16:22:13] suddenly something that looped through some subrequests at near-0ms now has real latency to contend with that it's never seen before
[16:22:14] not so many anymore, right?
[16:22:19] I don't know! :)
[16:22:36] well worst case, something internal will break and we'll have to fix it :P
[16:22:49] but I tend to assume the worst. there are probably paths like user->caches->restbase->cxserver->caches->MW-API
[16:22:58] or something of that nature
[16:23:05] and that MW-API hit is on the public hostname to boot
[16:23:09] because seriously, if we have internal apps that go via caches and are latency-sensitive... fuck them :)
[16:23:37] ok anyways, ignoring that problem
[16:24:13] the idealized future scenario we'd been discussing slowly moving towards in the previous iteration of the multi-dc meeting, which is a further-out goal than what we're doing this quarter...
[16:25:07] the idea at the cache layer there was that eqiad+codfw are both considered "tier1" caches. we have some config (static default? switchable? etcd/confd?) that can change which tier2s use which tier1
[16:25:29] so e.g. ulsfo backends to codfw, esams backends to eqiad, by default, with some ability to fail that over if we lose a primary DC
[16:25:37] and eqiad/codfw caches don't talk to each other
[16:26:06] the tier1 eqiad/codfw backends by default send all traffic to local mediawiki (which is fine being active/active for readonly/anon traffic)
[16:27:01] but a write request (POST?) goes to the notional "primary" of the two DCs where writes are happening, and sets a sticky cookie for that user's requests to keep going there for readonly too for the next N seconds
[16:27:20] right
[16:27:22] (which is x-dc from e.g. codfw backend cache to eqiad appservers)
[16:27:48] and the N seconds is some limit we've placed on "normal replication should never be lagged more than N"
[16:28:05] yup, I remember all that
[16:28:52] what we're aiming for here is less than all that, but still we want to be at least moving in the right direction towards those kinds of ideas, instead of going one way and then rewinding and going a different way in how we factor all of this out
[16:29:29] one of the key shared points here is that we have to change up this notion of site_tier=="one" into a couple of separate things
[16:30:06] if possible, sure
[16:30:09] in the ideal long-term view, being tier-one no longer necessarily means all applayer requests are local
[16:31:16] e.g. we have some VCL that's conditional on tier==one in the sense of "we're directly contacting the applayer if we're tier==one, as opposed to another cache"
[16:32:30] so if that's not really true in this short-term scenario, then we're not really switching to having dual tier-1 caches defined
[16:33:15] instead we're trying to define mechanisms, separately for "switch caches between tier-1 and tier-2 status", "switch which tier-1 a tier-2 talks to" (which is part of the longer-term thing anyways), and "switch which applayer DC a given tier-1 talks to"
[16:33:44] uhm, ok
[16:33:44] the first and third are very different from the long-term plan
[16:34:45] I'm assuming we're still basically throwing out the x-dc PII crypto for this test switchover period, of course
[16:36:59] if we had the 3x switching capabilities listed above working, then our procedure for testing independently of applayer is essentially: "shut users out of eqiad to preserve caches", "turn off writes, or all traffic since we're not confident about POST", "switch codfw to tier-1 status (talking to eqiad applayer)", "switch ulsfo/esams to talk to codfw as their tier1", "turn back on writes/traffic"
[16:37:19] and then what we've done is made codfw the tier-1 backend cache, but not moved the applayer's location
[16:37:43] and then in sync with mediawiki/services switchover, we still have to coordinate the switch of codfw from "talk to eqiad applayer" to "talk to codfw applayer"
[16:38:07] or if you want to do that part first, we do that to eqiad (switch it to talking to codfw applayer), and then do all of the above afterwards/async
[16:38:26] so wait
[16:38:41] (always assuming that we don't care about x-dc crypto)
[16:38:43] why can't we do
[16:39:00] 1) depool eqiad frontends from gdnsd (we know that works)
[16:39:29] 2) turn ulsfo/esams backends to codfw (should work fine, for ulsfo it might even make sense to do it permanently)
[16:39:47] 3) turn codfw's backends from eqiad varnish-be to appserver.svc
[16:39:57] and done?
[16:40:12] 1) is already step 1 above, so sure
[16:40:44] 2) is probably possible, but needs some validation that the VCL can handle it (that our layers are transitive in the right ways) - may not be hard to figure out, may not require any new work?
[16:41:31] 3) I think you're right. I think I keep getting stuck on the readonly/halt of traffic because I'm still looking at the longer-term view with writes to 2x DCs and such...
[16:42:18] so with your 3-step plan, we could switch effectively to codfw-tier-1-caches, but still be talking to appservers.svc.eqiad
[16:42:36] yeah
[16:42:44] independently (before or after that), we'd have another switch that changes whichever-is-tier1 between appservers.svc.eqiad and appservers.svc.codfw
[16:42:55] yes, I was about to write exactly this :)
[16:42:57] and that part I think we still can't do without stopping everything
[16:43:08] stopping in what sense?
[16:43:29] we'll block writes for the duration of the switchover, but we'll do that in mediawiki-config
[16:43:34] well I assume mediawiki is not active-active multi-dc capable, not even after this EOQ goal
[16:43:36] as we'll also need to do database master switchovers etc. etc.
[16:43:49] but the cache layer won't need to really care about that
[16:44:06] what does "block writes" even mean though? is the response to attempted writes going to fuck up the caches?
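For concreteness, a minimal VCL 4.0 sketch of what steps 2) and 3) above boil down to at the cache-backend level. The backend names, hosts, ports and the X-Illustrative-Tier1 condition are invented placeholders; in reality the selection is generated by the puppet/confd templating discussed below, not hand-written like this:

```vcl
vcl 4.0;

# Hypothetical backends; the hosts and ports here are placeholders.
backend codfw_cache_be { .host = "cache-be.codfw.example"; .port = "3128"; }
backend eqiad_applayer { .host = "appservers.svc.eqiad.wmnet"; .port = "80"; }

sub vcl_backend_fetch {
    # Step 2: a tier-2 site (ulsfo/esams) re-points its parent cache from
    # the eqiad varnish backends to the codfw ones.
    set bereq.backend = codfw_cache_be;

    # Step 3: on the codfw backends themselves, "tier-1" means skipping the
    # eqiad varnish-be hop and fetching from the (still-eqiad) applayer
    # directly. The header check is purely illustrative.
    if (bereq.http.X-Illustrative-Tier1) {
        set bereq.backend = eqiad_applayer;
    }
}
```

Either way the switch is just a change in which backend a given cache layer fetches from, which is why step 2) is mostly a matter of validating that the layered VCL copes with it rather than building new mechanisms.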
[16:44:29] in theory no -- and it didn't last time
[16:44:33] ok
[16:44:37] but it's something we can discuss in 15' :)
[16:44:52] with MW people, I'm clearly not authoritative
[16:45:24] back on point-2 above - it's something I've been looking at with other refactor work. I think we can validate that it works out ok, and maybe make a few small changes, but it's definitely something that needs investigating
[16:45:52] ok
[16:45:52] (the whole reason I looked at it before was the possibility of a minipop contacting a tier-2 cache backend)
[16:45:57] nod
[16:46:04] note that ulsfo now is much closer to codfw than to eqiad
[16:46:15] yeah
[16:46:16] so it might even make sense to do it permanently
[16:46:31] well it definitely does when we someday have active/active MW for at least reads
[16:46:56] in the interim, probably does too, but it's a longer chain of dependencies for ulsfo
[16:47:06] in terms of failure scenario or whatever
[16:47:07] yeah
[16:47:31] but we can at least do it in advance of the switchover date possibly (assuming the puppet foo won't take a month to figure out)
[16:47:50] it shouldn't
[16:48:03] in general, does this sound like a sane plan?
[16:48:18] I'm not sure how I feel about ignoring the x-dc crypto for one :)
[16:48:43] yes, I think so. (3) also requires some puppet and/or confd templating stuff too
[16:49:09] if we don't ignore x-dc crypto, everything changes if we can't have codfw->eqiad-app or vice-versa
[16:49:16] nod
[16:49:23] and it becomes synchronous with MW switchover on some level
[16:49:38] but I'm not sure how I feel about coupling mw/restbase/varnish/etc. all together on a single day
[16:49:44] yeah
[16:50:03] alternatively, I can try to hack around x-dc crypto with tls outbound proxies
[16:50:12] meh
[16:50:15] but it's ugly and temporary, would rather avoid it
[16:50:39] and it's not risk-free either
[16:50:44] nope :)
[16:51:06] anyways, ulsfo depool at same time as meeting start, I want to sync up with ema on that right quick
[16:51:14] ok
[16:51:18] thanks :)
[16:51:33] let's sync up at the meeting too, esp. wrt 18:44 < bblack> what does "block writes" even mean though? is the response to attempted writes going to fuck up the caches?
[16:52:20] yeah
[17:02:52] 10Traffic, 6Operations, 10ops-codfw, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2035651 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLwLycR-0X0Il_jxsQQ} [2016-02-17T17:02:48Z] depo...
[17:34:05] 10Traffic, 10Analytics, 6Operations: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2035781 (10Milimetric)
[17:35:24] 10Traffic, 10Analytics, 6Operations: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2035791 (10Milimetric) p:5Triage>3High
[18:01:50] speaking of logging servers
[18:02:03] should we also procure/setup an oxygen codfw equivalent?
[18:02:33] not sure how much point there is though, as we just consume from kafka which will remain eqiad-only for now
[18:03:25] 10Traffic, 6Operations, 5Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2035896 (10BBlack) So, I've figured out some of the things that were confusing me yesterday. To recap that: 1) I now question and need to investigate whether our TTL caps are really e...
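On the TTL front (T124954 above, plus the ttl_fixed_1be revert discussed this morning), a minimal VCL 4.0 sketch of the general shape of a beresp.ttl cap; the 1d value and the placeholder backend are made up, but it illustrates the 13:52 point that such a cap only affects local storage:

```vcl
vcl 4.0;

backend placeholder { .host = "127.0.0.1"; .port = "80"; }  # only so the sketch stands alone

sub vcl_backend_response {
    # Hypothetical cap: never store an object locally for more than a day.
    if (beresp.ttl > 1d) {
        set beresp.ttl = 1d;
    }
    # beresp.ttl only governs how long *this* varnishd keeps the object.
    # Any cache in front of us computes its own TTL from the response
    # headers (Cache-Control / Age / Expires), which are untouched here,
    # so the cap has no transitive effect on the next cache up the layers.
}
```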
[18:49:29] 10Traffic, 6Operations, 10ops-codfw, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036040 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 10:49:13 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[18:58:19] 10Traffic, 6Operations, 10ops-codfw, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036091 (10emailbot) **`Brandon Black`** replied via email on `Wed, 17 Feb 2016 18:58:12 +0000` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failu...
[19:02:09] 10Traffic, 6Operations, 10ops-codfw, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036111 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 11:02:02 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[19:34:22] 10Traffic, 10Deployment-Systems, 6Operations, 6Performance-Team, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036304 (10Krinkle)
[20:25:44] 10Traffic, 10Deployment-Systems, 6Operations, 6Performance-Team, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036441 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmfl...
[20:26:32] 7Varnish: hitting 502 errors on en.wikipedia.org - https://phabricator.wikimedia.org/T127227#2036442 (10Catrope)
[20:28:00] 7Varnish: hitting 502 errors on en.wikipedia.org - https://phabricator.wikimedia.org/T127227#2036448 (10ori) @fbstj, is the problem still occurring? We had to do some maintenance work on the Nginx layer and it caused a brief spike of 502 errors. It should be OK now -- could you try?
[20:30:20] 10Traffic, 10Deployment-Systems, 6Operations, 6Performance-Team, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036457 (10Krinkle)
[21:11:34] 10Traffic, 6Operations, 10ops-codfw, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036604 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 13:11:25 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[21:15:14] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036617 (10Southparkfan)
[21:24:30] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036666 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 13:24:03 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[21:42:40] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036739 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 13:42:32 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[22:15:36] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036810 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:15:08 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[22:16:34] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036812 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:16:07 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[22:39:56] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036863 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 14:39:47 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[22:44:04] 10Traffic, 6Operations, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2036867 (10BBlack) I'm enabling this for upload now as well, as I've been testing one esams cache with live hacks for a while now and not seen any issues. Will try to keep...
[22:49:02] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036875 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:48:32 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[22:51:40] 10Traffic, 10MediaWiki-Interface, 6Operations, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036883 (10ori) >>! In T124356#2028810, @BB...
[23:00:38] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036903 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 15:00:09 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[23:03:31] 10Traffic, 6Operations, 6Phabricator, 7Blocked-on-Operations: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2036928 (10MBinder_WMF) The recent Phab upgrade chatter has had my teams ask me to check on this. I think it may have gotten swallowed by the...
[23:03:40] 10Traffic, 10MediaWiki-Interface, 6Operations, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036930 (10Stashbot) {nav icon=file, name=M...
[23:05:54] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036938 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 15:05:45 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[23:20:31] 10Traffic, 10MediaWiki-Interface, 6Operations, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036991 (10BBlack) I've executed bans for d...
[23:44:02] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2037112 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 15:41:26 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure...
[23:50:18] 10Traffic, 6Operations, 10ops-ulsfo, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2037173 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 15:49:33 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
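As a footnote on the longer-term multi-DC idea from the 16:27 part of the sync-up (writes go to the notional primary DC and briefly pin the user there), a rough VCL 4.0 sketch of that sticky-cookie mechanism; the cookie name, the 10-second Max-Age and the backend hostnames are illustrative guesses, with the real N being whatever replication-lag bound gets agreed and the real routing driven by etcd/confd state:

```vcl
vcl 4.0;

# Hypothetical applayer backends as seen from a codfw tier-1 cache.
backend applayer_primary { .host = "appservers.svc.eqiad.wmnet"; .port = "80"; }
backend applayer_local { .host = "appservers.svc.codfw.wmnet"; .port = "80"; }

sub vcl_recv {
    # Writes, and anyone who wrote within the last N seconds, go x-dc to
    # the primary DC; everything else stays on the local (read-only) applayer.
    if (req.method == "POST" || req.http.Cookie ~ "UseDC=primary") {
        set req.backend_hint = applayer_primary;
    } else {
        set req.backend_hint = applayer_local;
    }
}

sub vcl_deliver {
    # After a write, pin this client to the primary for N seconds, where N
    # is an upper bound on normal replication lag (10 here is made up).
    if (req.method == "POST") {
        set resp.http.Set-Cookie = "UseDC=primary; Max-Age=10; Path=/";
    }
}
```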