[06:45:36] 10Traffic, 10Android-app-feature-Compilations, 10Operations, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3480500 (10Nemo_bis) [06:56:53] 10Traffic, 10Operations, 10Mobile, 10Need-volunteer, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#2904582 (10Nemo_bis) Can you guarantee to support all the URLs with parameters which wo... [08:09:07] 10Traffic, 10Commons, 10Operations, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3480638 (10ema) 05Open>03Resolved a:03ema We do have occasional backend fetch failures. Closing, as this looks like a transient error. [12:40:56] 10Traffic, 10Commons, 10Operations, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3481084 (10Jeff_G) >>! In T171421#3469152, @fgiunchedi wrote: > @Aklapper _usually_ traffic since this indicates varnish failure to fetch and most lik... [12:56:21] 10Traffic, 10Commons, 10Operations, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3481110 (10ema) >>! In T171421#3481084, @Jeff_G wrote: >>>! In T171421#3469152, @fgiunchedi wrote: >> @Aklapper _usually_ traffic since this indicates... [14:51:05] 10netops, 10Analytics, 10Operations, 10User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3481411 (10elukey) [14:59:25] [lvs-users] UDP packet loss when real server removed from farm [14:59:31] http://archive.linuxvirtualserver.org/html/lvs-users/2017-07/msg00002.html [14:59:38] sounds familiar! [15:04:15] _joe_: https://github.com/facebook/gnlpy/pull/23, if they like it then calling add_service with ops=True will do the trick [15:04:46] <_joe_> ema: \o/ [15:05:15] ema nice [15:05:40] ema: also while reading about the UDP issues, there's lots of hints that we're probably not operating optimally for TCP either (at least for the cache fe cases) [15:05:57] re: using weight=0 rather than hard depool, and playing with the quiescent_template sysctl and related [15:06:51] (what's awful about the 2-3 sysctls involved is they're global to all services, but still, changing them might be a net win for the TCPs) [15:06:53] yeah, is there any drawback in depooling with weight=0? 
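The sysctls being discussed here all live under /proc/sys/net/ipv4/vs/ (expire_nodest_conn, expire_quiescent_template and, later in the conversation, conn_reuse_mode), and as noted above they are global to every IPVS service on the host. A minimal Python sketch of reading and writing them, for illustration only; the helper names are invented, and real changes would normally be managed via puppet/sysctl rather than an ad-hoc script like this:

    import os

    VS_SYSCTL_DIR = "/proc/sys/net/ipv4/vs"   # standard IPVS sysctl location

    def get_vs_sysctl(name):
        """Read one IPVS sysctl, e.g. 'expire_nodest_conn', as an int."""
        with open(os.path.join(VS_SYSCTL_DIR, name)) as f:
            return int(f.read().strip())

    def set_vs_sysctl(name, value):
        """Write one IPVS sysctl; remember these are global to all IPVS services."""
        with open(os.path.join(VS_SYSCTL_DIR, name), "w") as f:
            f.write("%d\n" % value)

    if __name__ == "__main__":
        for knob in ("expire_nodest_conn", "expire_quiescent_template", "conn_reuse_mode"):
            print(knob, get_vs_sysctl(knob))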
[15:07:35] ema: I think, depending on those sysctls and the scheduler, it's possible weight=0 still sends new conns their way, too (based on old connection templates) [15:08:59] but with the right combination of settings, it should be possible to reach a world where: (a) the only way we need to depool is via weight=0 for all cases (we'd only delete a realserver if we're actually decomming it or whatever) and (b) that always means in the TCP case "let existing connections keep going, but don't schedule new ones there" (obviously if/when the server dies/reboots, those conn [15:09:05] ections will break and re-connect elsewhere, but that's unavoidable) [15:11:59] one thing to note is I don't think we're using "persistent" connections anyways [15:13:11] yeah [15:13:25] so confirming the current live setup, none of our services seem to use "persist" (-p) [15:13:38] and our sysctls are expire_nodest_conn=0 + expire_quiescent_template=0 [15:14:42] the whole point of "persist" is that once a scheduling decision is made for a client (e.g. by wrr), a template is kept for a timeout to keep sending that same client IP to the same destination server. [15:15:15] and the reason we don't use it is that we're using the "sh" scheduler, which gets much of the same benefit by hashing the client IPs to the destination servers and not using stateful persistence templates [15:15:46] http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.persistent_connection.html has all the old/crazy docs on everything related [15:16:54] side note: Mar 2014 This HOWTO is no longer being maintained. [15:16:56] :( [15:17:51] still a good primer though [15:18:17] definitely [15:25:33] so, assuming we want to continue avoiding "persist" (I think we do) [15:25:42] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481495 (10Jgreen) [15:25:49] expire_quiescent_template has no effect, as the "template" it refers to is the persistence one [15:26:37] I'm still not sure about expire_nodest_conn w/o persist, and weight=0-vs-remove, still reading [15:27:10] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481515 (10Jgreen) [15:27:14] 10netops, 10Operations, 10ops-eqiad: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3481514 (10Jgreen) [15:27:19] there's quite a lot of interesting stuff scattered through various mailing list posts (eg: http://archive.linuxvirtualserver.org/html/lvs-devel/2013-08/msg00071.html) [15:27:46] > OPS connections are accounted in InActConn for a very short period, they live up to 1 jiffie, eg. 10ms. Also, WRR should be reliable for OPS while other schedulers (eg. *LC) are not suitable. [15:30:49] yeah because wlc relies on feedback from the counts of connections in the table, which are now only living up to 1 jiffie or whatever :) [15:31:08] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481522 (10ayounsi) We have plenty of ports on the new switches to accommodate that. My suggestion is that we do it after the migration to the new infra (and... [15:31:50] effectively, we wouldn't really need the connection hash table, right?
Not even for TCP services as we're using sh [15:34:34] no, we still need the table of existing (well, and recently-existing) connections [15:35:02] or else every sh state change (e.g. add/remove server, change weights anywhere) will break the majority of connections immediately [15:35:17] ha, of course! [15:35:28] having the table of live connections lets them keep routing to where they were before, assuming it's still a valid destination, and only new conns follow the new weighting, etc [15:36:44] and of course there's all the final timeout stuff (FIN_WAITs and TIME_WAITs and CLOSE_WAITs), so ipvs has to remember a connection state a bit after it has ended, since it's only seeing one side of the conversation and can't be sure that the socket (as in ip:port+ip:port) is truly dead and a new matching combination is a unique new connection. [15:36:55] which is I think why it tracks in Inactive Conns [15:38:40] I'm still not entirely sure of the exact behavioral results of the 4 possibilities of: expire_nodest_conn=0|1 + depool-via-remove vs depool-via-weight0, the docs are confusing. [15:38:53] (all assuming no persist, of course) [15:39:33] it seems like regardless of expire_nodest_conn, *new* TCP connections should immediately go elsewhere on remove or weight=0 [15:39:48] expire_nodest_conn is about what happens to packets on existing connections in those cases [15:40:26] is that what "destination server not available" means? [15:40:31] https://www.kernel.org/doc/Documentation/networking/ipvs-sysctl.txt [15:40:31] my best guess/interpretation is that with weight=0 depooling, the sysctl doesn't matter and connections will attempt to keep talking to the zero'd backend, until they possibly timeout/rst [15:41:24] but if you actually delete a backend server, with expire_nodest_conn=0 it simply drops the packets on the floor (until the client gives up or the backend is restored to the table), and expire_nodest_conn=1 will immediately RST the active connections [15:41:44] yeah that's the core question, what that doc means by "destination server not available", but I think they mean "actually removed from the table" [15:42:32] but it's possible that definition also includes weight=0... (I'd hope not though) [15:43:00] if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) { [15:43:22] if (sysctl_expire_nodest_conn(ipvs)) { [15:43:22] /* try to expire the connection immediately */ [15:43:23] ip_vs_conn_expire_now(cp); [15:43:23] } [15:43:31] also I said "[ipvs] will immediately RST", but really that's: "then the client program will be notified that the connection is closed [15:43:34] " [15:43:50] 10netops, 10Operations, 10fundraising-tech-ops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481550 (10RobH) >>! In T171962#3481522, @ayounsi wrote: > We have plenty of ports on the new switches to accommodate that. My suggestion is that we do it aft... [15:43:53] I'm guessing ipvs doesn't actually RST, just schedules to a different backend, which will RST for lack of knowledge of the conn [15:45:00] and the IP_VS_DEST_F_AVAILABLE flag is removed by __ip_vs_unlink_dest [15:45:28] yeah I was just reading the same [15:45:32] so it's deletes, not weight=0 [15:45:48] looks like [15:46:53] ok, so, it sounds like: [15:47:30] 1. weight=0 -> let existing connections keep going as best they can, till they timeout/rst/whatever (but no new conns) [15:48:07] 1.
removal + expire_nodest_conn=0 -> packets from existing connections silently dropped on the floor (client may eventually timeout, or recover if the destination is re-added shortly after and retransmits fix it) [15:48:12] lol [15:48:44] 3. removal + expire_nodest_conn=1 -> packets from existing connections get re-scheduled (to a new server that doesn't know the connection and RSTs it, for faster failure -> reconnect) [15:49:55] conn_reuse_mode is interesting too (in kernel sysctl docs, but not in the older ipvs docs) [15:51:13] so the "port reuse" case is because of the ambiguity, right [15:51:44] under normal conditions, it's possible the same clientip:clientport will connect to service:443 freshly, shortly after a previous similar connection [15:52:18] which is the whole thing TIME_WAIT states and such are meant to avoid... the confusion between whether that's a new connection that happens to have the same ip:port on both sides, or a leftover packet from the previous connection. [15:52:30] BTW there's another usage of sysctl_expire_nodest_conn in the code (ip_vs_in), in case conn_reuse_mode is set [15:52:41] yeah ^ [15:53:09] so especially in the DR case those kinds of things are confusing, because we only see one side of the TCP conversation [15:54:38] it sounds like we want bit2 turned on there [15:55:58] nice [15:56:06] bit 2: it is bit 1 plus, for TCP connections, when connections [15:56:06] are in FIN_WAIT state, as this is the last state seen by load [15:56:06] balancer in Direct Routing mode. This bit helps on adding new [15:56:07] real servers to a very busy cluster. [15:58:15] right [15:58:38] that aside, I'm trying to figure out what we should be doing with expire_nodest_conn and methods of depooling in various cases [15:58:41] my current thinking is: [15:58:51] 1) We should enable expire_nodest_conn=1 [15:59:24] 2) In cases where we have the option of a graceful (planned) depooling, we should do weight=0 for a time period* before eventual removal [15:59:50] 3) In cases where the server just dies (e.g. pybal monitor-fail), and stays dead, we really want to just remove it as quickly as possible** [15:59:57] but: [16:00:53] * ideally, we'd poll for all reasonable connections to finish up before moving to removal from ipvs and then downing the service, but you'd have to have a timeout on that anyways and give up on super-long ones to get your job done... it's simpler to just build in a reasonable delay, e.g. weight=0->sleep:60->remove. [16:01:51] ** except that maybe the monitor could do a quick-flap (either a real quick-flap of the service, or a quick flap of monitoring itself borked, in which case a short weight=0 period would've been better) [16:03:07] we know pybal's monitoring is flappy-by-design (in the name of reacting very quickly to failure) [16:03:29] so our options for handling the ** there most-correctly are twofold: [16:04:04] 1) We build some anti-flap into pybal monitoring, and have it remove when it's pretty sure it's dead (ick?, and removes the benefit of fast pybal reactions for new conns) [16:04:46] 2) We leave pybal being hypersensitive like it is, but have it do weight=0 on initial fail, followed by removal after the failure has persisted past a reasonable flap window (e.g. 2-3x monitor-fail in a row?) [16:05:06] What we have today is: [16:05:57] I like (2) :) [16:05:59] expire_nodest_conn=0, and all of our monitor depools are via removal, and at least for the cache clusters we do all depools via removal normally as well. 
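The weight=0 -> sleep -> remove sequence sketched above for planned depooling can be illustrated with a few lines of Python driving ipvsadm directly (the -e/-d/-t/-r/-w options are standard ipvsadm ones; the addresses are placeholders, the 60-second drain window follows the example in the chat, and in practice this would go through pybal/etcd rather than shelling out like this):

    import subprocess
    import time

    def ipvsadm(*args):
        """Run ipvsadm with the given arguments (requires root)."""
        subprocess.check_call(["ipvsadm"] + list(args))

    def graceful_depool(vip_port, realserver, drain_seconds=60):
        # 1) weight=0: stop scheduling new connections, let existing ones run
        ipvsadm("-e", "-t", vip_port, "-r", realserver, "-w", "0")
        # 2) give established connections a window to finish up
        time.sleep(drain_seconds)
        # 3) remove the realserver; with expire_nodest_conn=1 any stragglers
        #    are expired/rescheduled instead of silently dropped on the floor
        ipvsadm("-d", "-t", vip_port, "-r", realserver)

    # hypothetical addresses, for illustration only:
    # graceful_depool("192.0.2.10:443", "10.0.0.11:443")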
So in all of these cases the packets get dropped on the floor and either the client eventually times out, or recovers when the service quickly recovers and re-adds [16:06:35] 10Traffic, 10Operations, 10Mobile, 10Need-volunteer, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#3481621 (10Jdlrobson) >>! In T154227#3480505, @Nemo_bis wrote: > Can you guarantee to s... [16:06:41] (I think, anyways) [16:07:07] also, I think at one point in the past pybal didn't handle the weight=0 case. I think we already fixed that and some MW depooling stuff already uses it via etcd? but I'm not 100% sure. [16:09:43] given our existing "depool via removal" commonly for both monitor-fail + maint, I'm on the fence as to whether enabling expire_nodest_conn=1 today without fixing the rest is a good idea or not [16:09:47] there's tradeoffs [16:10:19] maybe we should fix up depooling (at least maint depooling) to use a weight=0 period first [16:11:38] if we just turn on the expire sysctl first, in cases where a maint or monitor depool is very quick at re-adding, connections might be pointlessly-broken (but quickly recovering with fresh reconns) [16:12:15] but it's not like they would've quickly-readded correctly in any of the common maint cases anyways, right? because we're either rebooting, or we're restarting nginx, and the TCP conn will still break when that happens. [16:13:01] so maybe that part's not much of a tradeoff (may as well turn on the sysctl now, and fix up the weight=0 stuff as we go) [16:13:59] the pybal monitor-flap case is more-interesting though. if we think pybal commonly flaps services and real unpredicted failures (machine/daemon dies or network dies, whatever) caught by monitoring are rare, enabling expire_nodest_conn is just going to result in more pointless RST of live connections that might've otherwise stayed afloat? [16:14:16] or if we think pybal isn't too flappy and we want better response to real failures, turning it on now is a good idea [16:15:31] either way we can/should go ahead and fixup conn_reuse=2 (or is it 3? the docs seem to say bit2 implies bit1, does that mean it's better set as =2 or =3, or is there any diff?) [16:17:20] seems like reading current code, =2 and =3 are equivalent [16:18:14] I don't think we support weight=0 in pybal [16:18:24] # Include weight if specified [16:18:24] if server.weight: [16:18:25] cmd += ' -w %d' % server.weight [16:18:33] so, we don't :) [16:18:33] heh ok [16:18:57] so maybe we need to bump priority on fixing that, in light of all the above [16:19:18] old ticket: https://phabricator.wikimedia.org/T86650 [16:19:23] yep [16:19:43] 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#3481655 (10ema) p:05Low>03High [16:19:47] I've set the priority to PYBAL :P [16:20:24] lol [16:20:43] that ticket has some interesting nits in it too. we're past the kernel concerns, but do we have a new enough ipvsadm for weight=0? :) [16:20:55] I think we need 1.28 for it lol [16:21:02] oh come on [16:21:28] lol [16:22:06] well we know our setup isn't too awful today [16:22:12] let's take the less-risky approach: [16:22:36] 1) Get ipvsadm to 1.28 (stretch reinstalls?)
+ add simple patch to allow weight=0 in current pybal [16:23:06] 2) Fix up our standard tools/docs to use the weight=0->sleep->remove method for planned maint [16:23:07] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10RobH) [16:23:26] 3) Fix up pybal if we can, to do weight=0 on first fail, then remove when failure persists [16:23:37] 4) Turn on expire_nodest_conn=1 [16:25:58] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (10RobH) [16:29:41] is priority PYBAL meant to be heard like priority KAAAAHHHHHNNNNNN ? [16:29:44] https://www.youtube.com/watch?v=wRnSnfiUI54 [16:30:37] haha yes [16:31:49] we could also package 1.29 and add a patch to get rid of that memory allocation problem which caused so many sleepless nights already [16:31:58] and send it upstream :) [16:32:00] <_joe_> lol [16:32:23] <_joe_> ema: we do support weight in 2.0-dev [16:32:35] <_joe_> we should really try to move that to master [16:32:44] <_joe_> and think about releasing :P [16:33:11] <_joe_> bblack: pybal 2.0 has a "drained" status [16:33:16] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481733 (10RobH) [16:33:36] <_joe_> we can set the ipvs FSM to, or request. [16:35:00] I'm currently dubious about releasing a big new rewrite to fix a bug though [16:35:17] I suspect it isn't hard to just update current pybal to allow 0 to be a legal weight value [16:35:36] (and really, is there a value in having an explicit "drained" state as distinct from pooled+weight=0?) [16:35:53] since weight is an integer we pick up from etcd or whatever [16:36:12] other related bits: http://archive.linuxvirtualserver.org/html/lvs-devel/2013-06/msg00055.html [16:37:05] we have the kernel part of that, but dog only knows if ipvsadm 1.2[89] have support for those "scheduler flags" [16:38:04] <_joe_> pooled + weight 0 is either an etcd state [16:38:28] <_joe_> or something you transition through when you have an outage or you're safely depooling a server [16:38:37] <_joe_> even if the weight in etcd is !=0 [16:39:10] I guess what I see it as is: [16:39:43] 1) planned maint: the maintenance tool uses etcd to set weight=0 (->pybal->ipvs), then delay, then remove (->pybal->ipvs) [16:40:07] <_joe_> that should work [16:40:08] 2) monitor-fail within pybal: set internal weight to zero (overriding etcd data) immediately, remove when failure persists [16:41:01] but either way, allowing "weight" to be an integer which has zero as a legal value seems like the right model, vs either disallowing metadata weight=0 and reserving it to a special etcd+pybal state of "drained", or having two different paths to set weight=0 (explicitly via weight=0, or implicitly via "drained") [16:41:05] those flags are in 1.27 https://git.kernel.org/pub/scm/utils/kernel/ipvsadm/ipvsadm.git/commit/?id=3141694cb0d873adc8ef9b09ff71d510957ef640 [16:42:19] ok awesome [16:42:36] I'm still puzzling over my last link. I think I understood it once but don't currently [16:42:49] " [16:42:49]  [16:42:49] By default the SH scheduler rejects connections that are hashed onto a [16:42:53] realserver of weight 0" [16:42:55] ? [16:43:06] does it mean "rejects packets of existing connections" there or "rejects new connections"?
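The pybal fragment quoted earlier shows why weight=0 currently cannot work: 'if server.weight:' is false for a weight of 0, so the -w argument is never emitted. A sketch of the kind of small change T86650 asks for; the attribute names are taken from the quoted fragment and the surrounding scaffolding is invented, so the real patch may well look different:

    class Server(object):
        """Stand-in for pybal's realserver object (attribute names assumed)."""
        def __init__(self, ip, weight=None):
            self.ip = ip
            self.weight = weight

    def server_args(server):
        """Build the ipvsadm realserver argument fragment for one server."""
        cmd = ' -r %s' % server.ip
        # Include weight if specified -- but treat 0 as a legal, specified
        # value (the current "if server.weight:" skips it because 0 is falsy)
        if server.weight is not None:
            cmd += ' -w %d' % server.weight
        return cmd

    print(server_args(Server("10.0.0.11", weight=0)))   # ' -r 10.0.0.11 -w 0'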
[16:43:57] <_joe_> heh [16:44:11] <_joe_> I guess it's either testing or looking at the code :P [16:44:18] yeah [16:44:44] really, the scheduler shouldn't be involved in any existing connection anyways, only in new ones (the connection table handles existing) [16:45:18] but I thought, given my understanding of sh before, that if you added a weight=0, when sh reconfigured for that it would have no slots, and therefore no new connection would have to it anyways, making that statement about rejecting them non-sensical [16:45:31] s/have to it/hash to it/ [16:46:20] reading... [16:47:15] but we should understand this first I think, maybe it breaks all our thinking about weight=0 because sh is "special" anyways :P [16:51:04] 10netops, 10Operations, 10fundraising-tech-ops: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3481776 (10ayounsi) [16:52:04] I'm off, see you! [16:52:43] cya! [16:53:11] yeah so ip_vs_sh's hash table (of 256 buckets that servers are slotted into), always has every server, that's why they need that hack... [16:53:32] we've known about the 256 limit for a while, it becomes effectively a limit on the total weight of a pool, which is why we keep weight values low [16:53:55] but it's probably been a long time since that's been discussed, who knows if it's documented [16:55:02] but basically every server in the list gets at least one hash slot [16:55:13] if weight=1, you get one slot. if weight=0, you also get one slot :P [16:55:59] and the default behavior of ipvs is if a new connection hashes to the singular slot of a weight=0 server, it will reject the connection, unless we set this new flag to make it pick another server [16:56:20] (which seems kinda ridiculous, I wonder why they didn't just have weight=0 mean "zero slots in hash table" in the first place...) [16:59:24] maybe I'll have to stare at it harder to understand the why eventually [17:00:15] but it seems crazy, we'd never want to set sched=sh+weight=0 without setting the flag for fallback, it would make things Bad [17:02:23] it also fails (rejects conn) if a server is in the sh table but has varnished (dest==NULL), which I'm guessing might be a race case in between "remove backend server from ipvs", and "sh updates its hash table shortly afterwards"? [17:02:37] s/varnished/vanished/ heh [17:03:38] this all harks back to what we've said years ago: sh is "ok", but it would be nice to write a replacement that behaved more like what we expect, with a chash that respects weighting properly (including weight=0) and never fails to choose a backend [19:06:04] 10Traffic, 10Operations: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3482223 (10RobH) a:05RobH>03None [19:06:35] 10Traffic, 10Operations: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10RobH) These two systems are installed and calling into puppet, ready for service implementation. Assigning to @ayounsi but not sure if this should be him or @bblack.
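To make the contrast concrete: ip_vs_sh keeps a fixed 256-bucket table in which even a weight=0 realserver still occupies at least one slot, hence the reject-unless-fallback-flag behaviour discussed above. Below is a toy Python model of the behaviour being wished for in the last comment, weight-proportional buckets where weight=0 really does mean zero slots; this is purely an illustration, not the kernel algorithm and not a proper consistent hash:

    import hashlib

    def build_table(servers, buckets=256):
        """servers: {name: integer weight}; a weight of 0 gets no buckets at all."""
        active = [(name, w) for name, w in sorted(servers.items()) if w > 0]
        total = sum(w for _, w in active)
        table = []
        for name, w in active:
            table.extend([name] * max(1, buckets * w // total))
        return table[:buckets]

    def pick_server(table, client_ip):
        """Hash the client IP onto a bucket (a stand-in for sh's source hashing)."""
        h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        return table[h % len(table)] if table else None

    # hypothetical server names/weights for illustration
    table = build_table({"cp4021": 2, "cp4022": 2, "cp4023": 0})
    print(pick_server(table, "198.51.100.7"))   # never returns the weight=0 server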
[19:09:08] 10Traffic, 10Operations, 10ops-ulsfo: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3482230 (10RobH) 05Open>03stalled [19:09:11] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3482231 (10RobH) [19:19:23] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3482263 (10GWicke) [19:20:33] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10GWicke) [19:26:07] 10Traffic, 10ArchCom-RfC, 10Operations, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3482294 (10GWicke) [19:26:11] 10Traffic, 10Operations: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3482295 (10ayounsi) a:03BBlack [19:31:15] @bblack, I looked a bit closer into implementation options for concurrency limiting at the Varnish layer; I now think it would be fairly easy in Varnish, but am a bit less sure about app-level load balancing in Nginx [20:06:05] well, fairly easy at the cost of some kind of thread-shared state between the threads handling the concurrent connections (varnish has a thread-per-concurrent-request sort of model) [20:32:49] it could be an extension of the vsthrottle module, which uses state shared across threads, mutex protected [20:33:29] modification would be to add a way to return a token to the bucket when a request is done [21:42:28] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3482880 (10GWicke) >>! In T133178#3428811, @Krinkle wrote: > I'd recommend the latter, but not indefinitely. We'd deprecate REST on `wikimedia...
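What is being described at the end here -- take a token from shared, mutex-protected state when a request starts and return it when the request completes -- is essentially a per-key counting semaphore rather than vsthrottle's existing rate bucket. A rough Python model of that shared state, purely to illustrate the idea; the real extension would live in the vmod's C code, and the class and method names below are invented:

    import threading
    from collections import defaultdict

    class ConcurrencyLimiter(object):
        """Per-key concurrency limiter: acquire a token at request start,
        release it when the request is done (the vsthrottle extension idea,
        modelled in Python rather than vmod C)."""

        def __init__(self, limit):
            self.limit = limit
            self.lock = threading.Lock()          # shared across threads, mutex protected
            self.in_flight = defaultdict(int)     # tokens currently held per key

        def acquire(self, key):
            """Return True if the request may proceed, False if over the limit."""
            with self.lock:
                if self.in_flight[key] >= self.limit:
                    return False
                self.in_flight[key] += 1
                return True

        def release(self, key):
            """Return the token to the bucket when the request is done."""
            with self.lock:
                if self.in_flight[key] > 0:
                    self.in_flight[key] -= 1

    limiter = ConcurrencyLimiter(limit=10)
    if limiter.acquire("client-198.51.100.7"):
        try:
            pass  # handle the request
        finally:
            limiter.release("client-198.51.100.7")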