[10:52:27] there's some ongoing work to add Maglev consistent hashing to ipvs: http://archive.linuxvirtualserver.org/html/lvs-devel/2017-12/threads.html
[10:53:46] the main difference with Karger/Rendezvous AFAIU is that it focuses more on even balancing among backend nodes, at the expense of resilience to backend pool changes
[10:54:57] see section 3.4 of https://static.googleusercontent.com/media/research.google.com/ko//pubs/archive/44824.pdf
[10:56:49] I've added a graph plotting 'client requests' to the varnish instance breakdown dashboard https://grafana.wikimedia.org/dashboard/db/varnish-traffic-instance-breakdown
[10:57:54] it does look like balancing with the sh scheduler is not particularly even
[10:59:19] e.g. now on text/esams we've got ~7.3K requests hitting cp3033 and ~5.6K on cp3030
[11:00:55] interestingly, on other DCs the distribution seems much more even
[11:02:01] <_joe_> ema: so in the case of tls termination, maglev would be a good pick, from what you say
[11:03:48] _joe_: I guess so. We do not depool the tls terminators too often, and even if client affinity is disrupted for a bit it's not the end of the world
[11:04:19] the drawback would be in terms of performance (TLS session reuse, TFO)
[11:05:09] on the other hand, it really does look like sh gets the job done pretty well everywhere except for esams
[11:14:01] ema: looks like the divergence is proportional to request rate, looking back e.g. 2d
[11:15:00] also the bottom graph might need some recording rules to make it faster to load
[11:18:51] godog: yes, request rate seems to have something to do with that
[11:19:09] re: bottom graph, I'm not sure we actually want it, perhaps not in the current form?
[11:20:20] there are probably too many backends in that graph to draw useful conclusions from it
[14:15:10] yeah maglev is pretty neat, there are lots of ideas to steal there
[14:15:59] roughly speaking, maglev is something like our pybal+ipvs on the LVSes. The critical differences seem to be:
[14:16:41] 1) Obviously, the chashing and other pretty features of their balancing/tracking stuff
[14:17:44] 2) They're replacing our use of ipvs's Direct Routing with GRE routing, which is a little less tricky, but adds overhead (I guess they must have jumbo frames on the inside so they don't have to re-packetize when tunneling)
[14:18:06] I have a tab open with the PDF and no time to read it, so your TL;DR is very much appreciated ;)
[14:19:32] well I think 1+2 is about it, but 1 covers a lot of ground :)
[14:20:47] s/DR/GRE/ is basically "instead of leaving the original destination address in the IP dest header and editing the ethernet destination mac to match the target host, just wrap the packet in a GRE header sent to the target host's host IP, leaving the original packet inside"
[14:21:10] you avoid the complexity of true DR, and exchange it for the overhead/complexity of GRE->realserver
[14:21:11] FB does the same with LVS-Tun I think
[14:21:21] either way, the response traffic is direct
[14:21:56] they claim their software stack is efficient enough to saturate 10G nics at wire speed with tiny packets, too :) but not quite yet efficient enough to do so for 40G nics
[14:23:04] they also bypass the kernel, interestingly
[14:23:19] I guess 3) is: because of their fancy state-stuff and proper chashing, they do ECMP into their LVSes/Maglevs from the routers, for active/active/active/active/....
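Since the Maglev paper keeps coming up, here is a minimal Python sketch of the table-population scheme described in section 3.4 of the linked PDF: each backend walks its own pseudo-random permutation of the table slots, and slots are handed out round-robin so every backend ends up owning roughly the same number of entries. The hash function, table size, and backend names below are illustrative only; this is not what Google or the ipvs patches actually ship, and it ignores weights and connection tracking.

```python
import hashlib

def _h(name: str, seed: str) -> int:
    # Stand-in hash; the paper does not mandate a specific function.
    return int(hashlib.md5(f"{seed}:{name}".encode()).hexdigest(), 16)

def maglev_table(backends, m=65537):
    """Build a Maglev lookup table of prime size m (cf. section 3.4 of the paper)."""
    n = len(backends)
    # Each backend's preference order over slots is (offset + j*skip) mod m.
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    nxt = [0] * n        # how far each backend has walked its permutation
    table = [-1] * m     # slot -> backend index
    filled = 0
    while True:
        for i in range(n):
            # Claim the backend's next preferred slot that is still empty.
            c = (offsets[i] + nxt[i] * skips[i]) % m
            while table[c] >= 0:
                nxt[i] += 1
                c = (offsets[i] + nxt[i] * skips[i]) % m
            table[c] = i
            nxt[i] += 1
            filled += 1
            if filled == m:
                return table

# Hypothetical usage: hash the flow's 5-tuple and index into the table.
table = maglev_table(["cp3030", "cp3033", "cp3040"])
backend_idx = table[_h("10.0.0.1,40000,10.2.2.1,443,tcp", "flow") % len(table)]
```

Because every backend takes turns claiming slots, the load split comes out near-perfectly even; the trade-off mentioned above is that adding or removing a backend rebuilds the table and moves somewhat more entries than Karger-style consistent hashing would.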
[14:26:16] anyways, maglev isn't open source, but the whitepaper gives good ideas to look into
[14:26:45] and it's cool to see that their unique chashing algorithm has ipvs patches.
[14:27:07] the reason for 2) is that with DR all machines need to be in the same broadcast domain, and that is not an option for them https://landing.google.com/sre/book/chapters/load-balancing-frontend.html
[14:27:53] in brief, compared to what we'd think of as normal chashing, theirs is able to spread load more-completely-evenly, because it's re-weighting the hashtables on the fly depending on the load moving through.
[14:28:16] so it's not quite so consistent, in the name of more consistency in backend load levels
[14:29:17] and then in practice, to make the inconsistency matter less, they're relying on the basic ECMP hashing of L3 from routers->maglevs, and the fact that while the chash data for new connections may vary due to load, within a maglev they're also state-tracking existing connections to avoid disrupting them as the table changes.
[14:29:50] (ipvs does this too, right? it only uses the scheduler for SYN, not for established connections)
[14:32:34] on a not-quite-related topic, we should poke around a bit at https://caddyserver.com/ as a possible/situational nginx replacement
[14:32:43] I think so, yeah, that's why it's "stateful"
[14:33:13] it seems to have a lot of nice properties and it's written in Go
[14:33:51] but I don't know if (a) the HTTPS feature-set is already sufficient for our needs or (b) it actually scales like nginx does wrt OS-level threads and queues and all that low-level stuff.
[14:34:25] it has built-in LE support though (as in, it properly manages cert fetching and renewal on its own automagically)
[14:38:51] back on the IPVS-DR subject: we have the same limitation about broadcast domains. The way we fix it is we attach the LVS servers to every broadcast domain (which is why they have 4x ethernet ports each in the core DCs, one per row).
[14:39:21] which works fine for us for the foreseeable future I think, but obviously that doesn't scale well if the core DCs get much bigger and/or have many more broadcast domains.
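The "stateful" behaviour discussed above (the scheduler/chash table is only consulted for new connections, while established flows stick to whatever backend they were first assigned) can be shown with a toy model. This is a sketch of the general idea only, not ipvs or Maglev internals; the class and field names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class StatefulBalancer:
    table: list                                  # e.g. output of the maglev_table() sketch above
    conns: dict = field(default_factory=dict)    # flow 5-tuple -> backend index

    def pick(self, flow, flow_hash, is_syn):
        # Established flows keep their backend even if the hash table has
        # since been rebuilt (backend added/removed, weights adjusted, ...).
        if not is_syn and flow in self.conns:
            return self.conns[flow]
        # New connections consult the (consistent-)hash table and are recorded.
        backend = self.table[flow_hash % len(self.table)]
        self.conns[flow] = backend
        return backend
```

That combination is why the ECMP/table inconsistency matters less in practice: flows that stay on the same balancer hit their connection-table entry regardless of table changes, and flows that land on a different balancer usually get the same answer from the consistent hash.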
[14:39:41] (and/or a lot more LVS servers chewing up switch ports per-broadcast-domain)
[14:40:14] I think it will be a long time before that's a practical issue for us, so meh
[14:43:40] qq about a DNS rename if you have time :)
[14:43:42] https://gerrit.wikimedia.org/r/#/c/397539/1
[14:44:08] we are in the process of renaming notebook1002 to kafka1023 (since kafka1018 is down and not recoverable)
[14:44:23] also a good occasion to test wmf-auto-reimage
[14:44:39] riccardo and I came up with a procedure in https://phabricator.wikimedia.org/T181518#3827811
[14:44:56] but we were wondering what's best to set PTR-wise
[14:47:01] replacing notebook1002 there with kafka1023 should work as intended I think
[14:47:38] both in the production zone and mgmt that is
[14:48:14] godog: yeah, the only caveat is that the reimage script first needs to resolve the mgmt interface of the old name and later on the new one
[14:48:35] so for the duration of the reimage we need A records for both mgmt names
[14:49:37] but for the PTRs I'm not 100% sure, I think it can work with only 1 PTR already pointing to the new one, I don't think remote IPMI checks it
[14:50:12] volans: ah, got it
[14:50:32] yeah I doubt anything looks at the PTRs, but I guess it can't hurt to duplicate there temporarily
[14:50:39] but yeah renaming is a bit of an odd ball
[14:50:43] have you looked at the ipsec issues on all of this as well?
[14:51:34] the procedure I've suggested is to start the reimage with notebook1002 shut down
[14:51:40] kafka1023 will be set with the spare system role to start; I thought we'd postpone all the config issues to a subsequent step
[14:52:07] ok
[14:52:31] we just need to be sure it gets added to the ipsec stuff before we start forwarding traffic to it as a broker from the caches
[14:52:56] yep got it
[14:54:17] AFAIU it will be a spare system for the reimage process and only afterwards will it be transitioned to the kafka role
[14:54:27] elukey: correct me if I'm wrong
[14:54:38] volans: correct
[14:55:28] we still have a manual procedure to create partitions etc. for the kafka data disks, so I need some manual steps before even thinking about adding kafka1023 to the cluster
[15:04:26] 10Traffic, 10Operations, 10ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#3828227 (10fgiunchedi) More details on the Prometheus part: [] Allow bast4002 as an additional Prometheus host (https://gerrit.wikimedia.org/r/#/c/393943/) [] rsync Prometheus data from bast4001 to bast40...
[16:05:18] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3828368 (10BBlack) It's a pain any direction we slice this, and I'm not fond of adding new canonical domains outside the known set for indivi...
[16:43:59] bblack: the hfp/hfm documentation patch can be abandoned, right? https://gerrit.wikimedia.org/r/#/c/386895/
[16:46:54] ema: yes
[16:48:17] done!
[16:51:46] 10Wikimedia-Apache-configuration, 10Operations, 10Wikimedia-Language-setup, 10Puppet, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3828534 (10Dzahn) >>! In T169450#3827175, @MarcoAurelio wrote: > @Dzahn (not sure if you handle this stuff): Sorry, i don't.
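Referring back to the notebook1002 -> kafka1023 rename discussed above: a quick, purely hypothetical way to double-check the DNS side during the overlap window is to resolve both mgmt names forward and see what the PTR currently says. The *.mgmt.eqiad.wmnet names and the idea of checking PTRs at all are assumptions for illustration; nothing here is implied about how wmf-auto-reimage or remote IPMI actually behave.

```python
import socket

# Hypothetical sanity check for the rename window: both mgmt A records should
# resolve, and the PTR may already point at the new name (remote IPMI most
# likely never looks at it). The .mgmt.eqiad.wmnet suffix is assumed here.
for name in ("notebook1002.mgmt.eqiad.wmnet", "kafka1023.mgmt.eqiad.wmnet"):
    try:
        addr = socket.gethostbyname(name)
        print(f"{name} -> {addr}")
        ptr, _, _ = socket.gethostbyaddr(addr)
        print(f"  PTR {addr} -> {ptr}")
    except socket.gaierror as exc:
        print(f"{name}: forward lookup failed ({exc})")
    except OSError as exc:
        print(f"  PTR lookup for {name} failed ({exc})")
```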
[21:33:56] 10Wikimedia-Apache-configuration, 10Operations, 10Wikimedia-Language-setup, 10Puppet, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3829414 (10EddieGP) @Joe As you've already commented here, could you help with deployment of https://gerrit.wikimedia.org/r/#/c/39...
[21:45:57] 10Traffic, 10Commons, 10Multimedia, 10Operations, and 3 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3829487 (10Bawolff) So yes, this sounds sane to me (With the caveat, I haven't looked at the multimedia code in a while). Some comments:...