[00:37:56] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2995351 (Dzahn) Resolved>Open reopening - I have one more request please. Can we please change install1001 to install1002 and install2001 to install2002? These are both V...
[00:39:09] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2995356 (Dzahn) I realize this might be on your last day before you are away for a while, please feel free to put it up for grabs and I'll ask others.
[00:39:25] netops, Operations: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#2995358 (Dzahn)
[12:57:30] bblack: we've got timeout_idle set to 5s (default) on our varnishes, which means that all pybal idle connections to the frontends are closed and re-created every 5s. Any objection to raising the timeout to 60s? It's currently 60s on nginx and 200s on apache
[12:58:41] > Idle timeout for client connections. A connection is considered idle, until we have received the full request headers.
[13:02:16] that seems like a sane line of thinking, but, give me a bit to wake up more and look at it more
[13:02:22] there might be an obscure good reason
[13:03:08] sure :)
[13:06:07] how are you feeling?
[13:07:50] better, thanks!
[13:13:08] Traffic, Operations, Wikimedia-Incident: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924351 (BBlack) Recording this while I remember it: # The VSLP director code panics if there are no backends defined for a dir...
[13:17:08] Traffic, Analytics, Operations: Add global last-access cookie for top domain (*.wikipedia.org) - https://phabricator.wikimedia.org/T138027#2996314 (BBlack)
[13:17:32] ema: adding missing traffic tag + cc:you on ^
[13:17:54] whenever you get free time, I've committed us to that one this quarter, and probably better you than me :)
[13:18:29] exact same functionality as existing WMF-Last-Access, but a new separate cookie name, and stripping the language hostname off so it applies at the 2LD level.
[13:18:53] sounds good!
[13:21:28] on the timeout thing... so nginx can handle a bunch of idle clients pretty efficiently with its event model
[13:21:53] varnish allocates a thread per client connection, that's probably why it was at 5s, since it was the predominant public-facing port pre-HTTPS.
[13:22:19] to minimize impact if someone tries a slowstart sort of attack and opens a bunch of idle empty conns or trickles data into them
[13:23:14] it's still public-facing (mostly for redirects), and it still takes traffic from nginx too though, so DoS-ing port 80 would still DoS the https conns effectively too if we reached some varnish thread limit.
[13:23:53] right
[13:24:01] and nginx isn't a pure TCP proxy. if someone opens a bunch of idle conns to nginx with its 60s timeout, it's not going to open matching ones to varnish until it's got a real request to forward
[13:24:19] so, all that is to say, there is a reason to keep a low timeout there, but...
[13:24:37] it's a defensive measure, and not an incredibly good one. even at 5s, it's just harder to do it, it's not stopping it.
[13:25:07] how about having nginx listening on 80 for the redirects and varnish internal only?
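The churn ema describes presumably comes from the health connections themselves being idle: pybal keeps a TCP connection open to each frontend without sending request headers, so with timeout_idle=5 varnish closes it after about 5 seconds and pybal immediately dials back in. Below is a minimal, hypothetical sketch of that behaviour using Twisted (which PyBal is built on); the host name and class names are illustrative and this is not PyBal's actual monitor code.

    # Hypothetical sketch only -- not PyBal's monitor code. Assumes Twisted.
    # "cp-frontend.example" stands in for a varnish frontend with timeout_idle=5.
    from twisted.internet import reactor
    from twisted.internet.protocol import Protocol, ReconnectingClientFactory

    class IdleProbe(Protocol):
        """Hold a TCP connection open without ever sending request headers."""

        def connectionMade(self):
            print("connected to %s" % (self.transport.getPeer(),))

        def connectionLost(self, reason):
            # With timeout_idle=5 the frontend closes us after ~5s of silence,
            # and the reconnecting factory below immediately dials back in.
            print("closed: %s" % (reason.getErrorMessage(),))

    class IdleProbeFactory(ReconnectingClientFactory):
        protocol = IdleProbe
        maxDelay = 1  # reconnect promptly, as a health monitor would

        def buildProtocol(self, addr):
            self.resetDelay()
            return ReconnectingClientFactory.buildProtocol(self, addr)

    if __name__ == "__main__":
        reactor.connectTCP("cp-frontend.example", 80, IdleProbeFactory())
        reactor.run()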
[13:25:21] right, under that model it wouldn't matter
[13:25:41] we're still blocked on switching to that model by two things:
[13:26:04] 1) stream.wikimedia.org's lack of HTTPS-redirect enforcement for legacy rcstream (due to be fixed by mid-year by getting rid of rcstream, basically)
[13:26:18] 2) Having a solution for all the non-canonical domains which we don't redirect because they don't yet have LE certs, etc
[13:27:00] well, those things are blocking the simple switch, with the pre-provisioned "nginx port 80 redirect-only" simple version
[13:27:25] I guess we could re-order things if we moved the "redirect-or-forward" type of logic that's in varnish today up to nginx port 80, with the lists of domain exceptions, etc.
[13:28:46] but back to the immediate question: I guess we could raise the 5s for now too. the attack seems less-likely, and the defense isn't perfect anyways.
[13:29:10] does it make a big diff for pybal?
[13:30:05] sub https_recv_redirect {
[13:30:19] ^ is the logic that would have to move up to nginx, to switch to nginx-port-80 early
[13:31:12] (for the cases that don't hit the 751 tls redirect or 403 clauses there, it would have to use the same proxying to the backend that the nginx https stanza does)
[13:31:37] and it would have to template in somehow to keep modules/tlsproxy generic for other uses
[13:31:59] so pybal does its work just fine with the 5s timeout, I was just wondering why the reconnects were happening only with cp hosts and not with MWs
[13:32:39] and doing the reconnects every 5s seemed a bit too much :)
[13:38:13] it would be nice (but seems low on the prio list compared to other pybal issues) to have pybal have its own timeout that can be set shorter than the destination and overlap
[13:38:43] e.g. if the destination is known to be configured with 15s idle timeout, pybal closes its connections at 13s, and always opens a new one just before closing the old one to maintain surveillance
[14:38:23] uhuh, tests green for the first time in two years! \o/ https://github.com/wikimedia/PyBal/commits/1.13
[14:38:44] \o/
[14:38:54] that was just a missing dependency in requirements.txt, but still :)
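A rough illustration of the "overlap" idea bblack floats at 13:38, again as a hypothetical Twisted sketch rather than anything PyBal implements: the monitor recycles its idle connection a couple of seconds before the destination's known idle timeout would fire, and it opens the replacement before closing the old one so coverage never lapses. The timeout values and host name are illustrative.

    # Hypothetical sketch of the proposed overlapping reconnect -- not PyBal code.
    # Assumes Twisted; DEST_TIMEOUT, MARGIN and the host below are illustrative.
    from twisted.internet import reactor
    from twisted.internet.protocol import ClientCreator, Protocol

    DEST_TIMEOUT = 15  # destination's configured idle timeout, in seconds
    MARGIN = 2         # recycle this many seconds before the destination would

    class IdleProbe(Protocol):
        """An idle connection kept open purely to watch whether the peer stays up."""

        def connectionLost(self, reason):
            # A real monitor would distinguish closes it initiated from closes
            # by the peer; the sketch just logs every close.
            print("connection closed: %s" % (reason.getErrorMessage(),))

    def rotate(old_proto, host, port):
        """Open a fresh idle connection, then close the old one once the new one is up."""
        d = ClientCreator(reactor, IdleProbe).connectTCP(host, port)

        def swap(new_proto):
            if old_proto is not None:
                old_proto.transport.loseConnection()  # overlap: the new probe is already open
            # schedule the next rotation just before the destination's idle timeout
            reactor.callLater(DEST_TIMEOUT - MARGIN, rotate, new_proto, host, port)

        d.addCallback(swap)
        return d

    if __name__ == "__main__":
        rotate(None, "cp-frontend.example", 80)
        reactor.run()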