[08:41:41] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3977824 (10Vgutierrez) 05Open>03Resolved
[08:41:46] aand it's done :D
[09:16:45] <_joe_> vgutierrez: are you properly onboarded now? :P
[09:19:44] yup, now I can seriously break stuff around
[09:19:45] :P
[09:21:39] <_joe_> cool
[09:21:46] <_joe_> start by patching pybal
[09:21:55] <_joe_> although, nowadays, that's almost safe
[09:21:59] <_joe_> we have unit tests
[09:22:26] <_joe_> in the good ole days, when men were men and pybal did eval() on stuff downloaded via http (not S)
[09:22:33] <_joe_> then it was a challenge
[09:22:42] <_joe_> :P
[09:26:02] you're kidding, right? xD
[09:30:00] <_joe_> nope
[09:34:46] <_joe_> well the http address was internal, but still :P
[09:35:57] you'll enjoy checking the threat model of my previous company
[09:36:47] everything running in AWS and TLS termination at the edge, so everything behind that was plain text
[09:36:53] <_joe_> cool
[09:40:06] <_joe_> vgutierrez: well we're trying to slowly move away from that ourselves
[09:40:14] <_joe_> although we own and manage our datacenter
[09:40:35] <_joe_> but well I guess by now you know
[09:40:44] <_joe_> varnish is the biggest villain here
[09:43:49] yup.. that was an interesting topic for discussion in my previous company as well
[09:44:02] they have serious varnish lovers there (vg.no for instance)
[09:44:33] and lack of TLS termination was/is painful for varnish IMHO
[10:05:38] bblack: I like the memory increase theory re: varnish http false positives, they're affecting text hosts though (so v4)
[10:13:19] <_joe_> did we add vgutierrez to security@, btw?
[10:14:09] I did it myself O:)
[10:14:36] <_joe_> heh ok
[10:14:39] <_joe_> just checking
[10:45:42] ok so here's the timeline of events during the varnish-backend-restart of cp4030 yesterday that resulted in icinga spam: https://phabricator.wikimedia.org/P6710
[10:46:11] 02:19 minutes between "Stopping Varnish" and "Started Varnish"
[10:47:08] and 01:59 minutes between the first soft icinga critical and the hard one
[10:48:30] looking at past backend restarts on cp4030, they all lasted between 0:02:12 and 0:02:19, with the exception of one quicker restart which completed in 50s
[10:51:19] interestingly, on cp3040 the last 4 restarts have been much faster, they completed in 62, 33, 45, and 36 secs
[10:53:49] 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3978036 (10Marostegui) Please change this to db2093 as we have decided to rename that host from tendril2001 to db2093 (T186123#3975533) Thanks and sorry for the changes!
[10:57:43] they're both dell servers with intel SSDs, so we can't blame that for the performance difference
[10:59:31] are the filesystems involved roughly the same age too?
[11:00:44] cp4030 has more cached objects at restart time compared to cp3040 though (>30M vs <25M)
[11:04:03] godog: not even close, cp3040's created Sat Mar 14 21:19:17 2015 vs. cp4030's created Thu Oct 19 18:16:00 2017
[11:04:52] but yeah I think the difference in number of stored objects explains the diff
[11:06:01] yeah looks like it
[11:09:23] heh, definitely, the fast restart on cp4030 (50s, Feb 1st) happened when the backend had 6M objects stored in it vs. the usual 30ish
[11:12:15] mkfs would be faster!
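A rough reconstruction of why a ~2:19 restart trips a hard critical, assuming stock Icinga escalation semantics and the "3 retries, 60s interval" defaults mentioned further down; exact offsets depend on where the first failed check lands relative to the stop:

```
t+0:00  varnish backend stops; next check fails -> SOFT CRITICAL (1/3)
t+1:00  retry fails -> SOFT CRITICAL (2/3)
t+2:00  retry fails -> HARD CRITICAL (3/3), notifications fire
t+2:19  varnish back up; the recovery arrives just after the spam
```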
[11:12:22] I'm only half serious btw
[11:13:21] as opposed to rm ; sync
[11:15:17] we also sleep 10s (ugh) between rm;sync and varnish start as a workaround for T149881, which is perhaps not an issue anymore now that we've switched from -spersistent to -sfile
[11:15:20] T149881: varnish-be not restarting correctly because of disk space issues - https://phabricator.wikimedia.org/T149881
[11:15:45] yeah, for sure it is going to be proportional to fs size as opposed to space allocated
[11:20:39] at any rate, with the default monitoring::service options (3 retries, 60s check interval) we can get a hard critical after ~2 minutes - see https://phabricator.wikimedia.org/P6710
[11:21:05] depending on check timings of course
[11:21:21] so yeah my proposal of bumping check interval (or retries!) still stands :)
[11:58:03] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3978245 (10elukey) After the last round of patches nfacctd/pmacct are sending events to Kafka using three topic partitions rath...
[15:09:40] https://blog.nviso.be/2018/02/15/going-beyond-wireshark-experiments-in-visualising-network-traffic/ --> something like that to show how our services work from a network point of view would be awesome
[15:11:20] ema: basically what we really want is synchronous-rm, which "rm; sync; sleep" does not actually do, so the whole thing isn't ideal to begin with
[15:12:31] if we really wanted to make it synchronous and avoid excessive delay, one way would be to rm; umount; mount.
[15:12:34] but that seems pretty scary
[15:12:56] or alternatively and maybe-faster, "umount; mkfs; mount"
[15:13:37] I see godog already observed this and I didn't read far enough back :)
[15:14:30] the gist of the problem(s) we're really trying to solve here is this:
[15:15:08] (1) filesystem storage gets fragmented over time, so rm (or mkfs) and re-creating the file when we get the chance to during a restart resets the clock on that.
[15:16:15] (2) if we just "stop; rm; start", the "rm" is guaranteed to be visible in the basic directory-metadata sense, but free-space accounting (at least for this fs on this kernel) is asynchronous, and so if we do "start" too quickly, which re-creates large pre-allocated files filling the whole FS, the creation will fail for lack of space, just because filesystem-level free-space async accounting hasn't completed.
[15:20:06] and then "stop; rm; sync; sleep 10; start" is a hacky workaround that seems to consistently work.
[15:20:34] apparently in this case though, the "rm" or "sync" is actually-stalling for quite a while (although sadly I think that still gives us no guarantee in the general case)
[15:21:09] maybe we should time a quick non-zeroing mkfs?
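A minimal sketch of what a "synchronous enough" alternative to the fixed sleep could look like: instead of sleeping 10s after the rm, retry the preallocation itself until the async free-space accounting catches up. The path and size are hypothetical, and this is only an illustration of the race described above, not the actual restart tooling:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/srv/sda3/varnish.main1";   /* hypothetical */
    const off_t size = 700LL * 1024 * 1024 * 1024;  /* ~700G prealloc */

    unlink(path); /* the directory entry disappears immediately... */

    for (int attempt = 1; attempt <= 30; attempt++) {
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        /* ...but free-space accounting is async, so a full-size
         * preallocation can transiently fail with ENOSPC. */
        int rv = posix_fallocate(fd, 0, size);
        if (rv == 0) {
            printf("preallocated on attempt %d, safe to start varnishd\n",
                   attempt);
            close(fd);
            return 0;
        }
        close(fd);
        unlink(path);
        if (rv != ENOSPC) { fprintf(stderr, "fallocate: %d\n", rv); return 1; }
        sleep(1); /* give the fs time to finish releasing the old blocks */
    }
    fprintf(stderr, "gave up waiting for free space\n");
    return 1;
}
```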
[15:22:07] what we currently use in the installer late_command for these filesystems is: mke2fs -F -F -t ext4 -T huge -m 0 -L sda3-varnish /dev/sda3
[15:23:27] (-T huge basically just implies "inode_ratio = 65536")
[15:31:56] heh, mkfs takes a full minute per fs, mostly because it spends time doing an SSD discard on all the blocks of the partition
[15:34:17] I wonder if discard is actually-useful in this case
[15:35:19] I was thinking about the implications of going with nodiscard
[15:35:49] for a normal filesystem that might stay mostly-empty afterwards it makes sense
[15:36:03] in our case, we're going to allocate the whole thing and churn all the blocks in a relatively short window anyways
[15:36:37] also the 64K inode ratio could be higher
[15:37:23] we only create at most 3 files on any of these, even though they're a bit over 700G
[15:37:39] but maybe insane values aren't well-tested, meh
[15:41:47] someone could say that having a FS for 3 files is overkill
[15:42:17] but I don't know if varnish is able to handle raw devices
[15:44:05] actually... http://lists.varnish-cache.org/pipermail/varnish-dev/2013-April/007545.html
[15:44:12] forget it :)
[15:44:13] right
[15:44:38] varnish is designed around the os's vm abstraction anyways
[15:44:49] it basically mmaps the file and treats it like memory heh, IIRC
[15:45:29] mkfs with -E nodiscard is blazingly-fast in our case
[15:46:05] I'm just trying to remember now the discard details to know if that's a good idea
[15:46:20] but really, even if it's a bad idea, we're already effectively doing the same thing with our cycles of "rm"
[15:46:36] because we don't mount the fs with the "discard" option, our weekly rm on the mounted filesystem also doesn't discard all that space, either.
[15:49:01] so, I think it's fair to say that "mkfs -E nodiscard" is no worse than our current rm.
[15:50:41] and on an unused system, the umount->mkfs->mount cycle is pretty fast, on the order of <1s
[15:50:59] I suspect umount may stall a bit on a heavily-loaded system though, as it may try to do some final state sync
[15:52:15] too bad umount doesn't have some kind of --unclean option (detach this from the filesystem namespace ASAP, but do not worry about syncing a clean filesystem for later remounting, because we're going to wipe it anyways)
[15:52:24] "lazy" sounds close, but isn't
[15:53:08] theoretically speaking, not using TRIM/discard should impact write performance on SSD disks
[15:53:20] well, yes :)
[15:53:26] but as you just said, we're not using the discard mount option
[15:53:39] so we're already suffering that
[15:53:42] but also, theoretically, filling the whole device with bytes and writing to them constantly also impacts write performance
[15:53:52] which is what we do anyways
[15:53:58] yep, we are SSD killers
[15:54:33] which is why we buy the fancy high-end ones. they have a bunch of extra internal blocks not exposed to the consumer, just for the GC stuff to have working room.
[15:54:48] (and fancier/smarter handling of such stuff in general)
[15:55:22] so I guess that doing or not doing 1 discard at mkfs time isn't going to make a huge difference
[15:56:27] right
[15:56:44] we've been using the Intel "server-grade" SSDs, which is currently the S3710
[15:57:04] because we know we can allocate the whole thing and write to all the bytes constantly for years and not see a failure so far.
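Worked numbers for the inode-ratio aside above (approximate): at inode_ratio = 65536, a filesystem of a bit over 700G gets about 750e9 / 65536 ≈ 11.4M inodes, and at ext4's default 256 bytes per inode that's roughly 2.9GB of inode tables reserved for a filesystem that will only ever hold 3 files; hence the temptation to raise the ratio, tempered by the "insane values aren't well-tested" worry.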
[15:57:15] vgutierrez: our graphite machines also like to chew on SSDs very much :) last year the ssd broke right before fosdem, fun times
[15:57:27] https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-s3710-series.html
[15:58:10] the "lifetime writes" on the 800GB model we're using lately is 16.9PB
[15:58:46] so that gives us about 22151 complete overwrites of all the storage
[15:59:30] to put that in scope, it should allow a 5-year lifetime if we're overwriting the whole device's worth of storage approximately once every 2 hours.
[15:59:39] (which is crazy, we don't write quite that hard I don't think, on average)
[16:01:53] that would be 800 Mbit written per second to the SSD (100 MB/s if we talk in storage units)
[16:02:38] whereas the rated write perf is ~4.6 times that
[16:02:43] but still, we're not maxing out write i/o rate :)
[16:04:35] taking a random sample from esams upload cluster at this time of day, which should approximate worst-case
[16:04:55] our average write rates seem to be closer to ~5MB/sec
[16:06:54] but in the higher-level sense, the true dataset behind the upload caches is > the storage size of the caches, and their efficiency is at least minorly limited by storage size
[16:07:03] and our TTLs are ~24h
[16:07:19] yup... I was reading https://phabricator.wikimedia.org/T144187 this morning
[16:07:23] so we kind of roughly expect we churn all the storage bytes in <24h for all of our usual observations about cache_upload to make sense
[16:07:30] I'm still amazed by the work there BTW
[16:07:42] yeah that ticket is nuts :)
[16:07:55] it should be printed and hung on a wall or something
[16:08:08] most of the hard work in there was volunteered :)
[16:08:42] (context: username Danielsberger in that ticket is not an employee of ours)
[16:08:50] yey.. I was stalking him
[16:08:59] (in a good way ofc)
[16:09:39] there's still a lot of TODO on the topic of that task
[16:10:04] the biggest bit being to close the loop on easily re-generating the tuning numbers based on our actual traffic over the last N months or whatever
[16:10:19] so that we can re-adapt as things change, instead of using fixed tuning values from his test dataset of our traffic long ago
[16:11:03] Daniel released the tool he used, so we could ask Analytics to provide a dataset every X months
[16:11:12] yeah
[16:11:16] but dunno if that's feasible on their side
[16:11:30] the pieces are all there to make it happen, we just haven't done the legwork to get everything hooked up and use various persons' time, etc
[16:12:12] but reading https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/README
[16:12:22] looks like it should be easy
[16:13:03] the best part about the exp(-size/c) solution there is that it's stateless, other than this whole "occasionally re-analyze long-term data offline for tuning"
[16:13:31] <_joe_> bblack: you should write an adaptive algorithm that adapts varnish tuning numbers at runtime reacting to the real-time stream of requests, processed via flink
[16:13:32] whereas our best guess at the start of the ticket was to write a stateful bloom-filter solution and engineer around overlapping replacement bloom sets over time, etc
[16:14:14] realtime tuning would defeat the purpose of this, though
[16:14:19] <_joe_> yes
[16:14:23] <_joe_> I'm mostly joking
[16:14:33] yeah, but it's a fair point of debate/confusion :)
[16:15:12] we want the tuning adapted to our long-term averages, and thus the effect of the algorithm is to help defeat the impact of realtime aberrations (like when some bot tries to scan every file quickly, etc)
[16:15:32] <_joe_> oh yes, I'm just saying that smart systems that are able to react to the current trends intelligently are nice; of course systems where a solution can be approximated well enough with any mathematical function is preferable
[16:15:51] <_joe_> s/is/are/
[16:16:14] whereas if we could somehow re-calculate this stuff in-the-moment, it would "adapt" to accommodate the traffic patterns we dislike and don't want to help
[16:17:11] <_joe_> well, or you could adapt deciding you're not caching classes of objects based on recognized behavioural patterns
[16:17:26] well that's always the trick, how do you do that statelessly? :)
[16:17:27] <_joe_> or caching for a very short amount of time, whatever.
[16:18:11] <_joe_> yeah, nope!
[16:18:30] it's sort of like when you first look at the "large rate of 404s" problem, e.g. if someone's just scanning through the URL-space ignorantly trying every https://upload.wikimedia.org/[A-Za-z0-9]{1,64}
[16:19:12] the only obvious trivial answer to avoiding caching all the 404s is to... [do something that once you think about it hard enough, ends up caching all the 404s in a slightly different way]
[16:19:43] <_joe_> eheh
[16:20:02] the original bloom-filter idea was an improvement on that, in that it can statistically solve the problem fairly well, while taking far less space than doing the actual caching
[16:20:13] but the stateless exp() filter is even better :)
[16:20:19] <_joe_> indeed
[16:20:57] <_joe_> as I said before, I love mathematical models that approximate reality sufficiently well
[16:21:22] <_joe_> that's basically the description of my first career :P
[16:21:57] oh I have an interesting problem to troll you with then
[16:22:10] <_joe_> at 17:22 PM on friday?
[16:22:18] <_joe_> if I fall for it, I'm really hopeless
[16:22:20] isn't that the Trolling Hour?
[16:22:31] hmm beer o'clock actually
[16:22:32] so, the basic context here is our GeoDNS routing
[16:22:43] <_joe_> maybe volans|off (who's also off, but what about we put some cumin in the discussion)
[16:22:49] <_joe_> ohhh that's interesting
[16:23:03] now, GeoDNS routing in general is fuzzy and imprecise, because caches and lack of edns_client_subnet, so precise solutions aren't super-important
[16:23:35] <_joe_> you want to optimize the distribution of geodns areas based on minimizing latencies and still balancing traffic evenly?
[16:23:45] but currently, as an approximation of "what's the latency between user@coordinates X,Y on the globe, and datacenter@coordinates A,B on the globe"
[16:23:50] * ema grabs some popcorn and follows the nerdsnipe building up
[16:23:50] (nevermind even balancing, for now)
[16:23:57] how mich backlog should I read? :)
[16:24:02] <_joe_> ahahahha
[16:24:04] *mich
[16:24:05] just a few minutes
[16:24:15] aaargh mobile's keyboards
[16:24:21] so the current approximation of network-latency-distance is geographic-distance
[16:24:45] that is, it does the trig to get a great-circle-distance approximation between the two coordinates on the planet, and calls that the latency.
[16:25:24] that actually kinda-sorta-works-ok for a lot of cases, roughly enough, especially with a small and widely-spaced set of datacenters to choose from.
[16:25:46] <_joe_> didn't we use the ripe atlas data too in the past?
[16:26:00] that's latency measurement based
[16:26:10] this is straight from the geoip db
[16:26:10] [this isn't about what we do here operationally, it's about the "automatic" mode of gdnsd that we don't yet use here, but would like to use]
[16:26:23] <_joe_> ohhh I see
[16:27:04] the auto-mode means you just feed gdnsd the coordinates of your datacenters and let it pick, without manually defining a country/state map like we do today, based on our thinking and ripe data, etc
[16:27:23] the diff between our manual tuning and the auto-mode, even today, isn't very big
[16:27:45] but we'd like it to work better, and that boils down to approximating network latency between two points on the globe better than "geographic distance"
[16:28:22] so anyways
[16:28:40] there's lots of "obvious" ways to do it manually using lots of real data, but they're difficult to engineer.
[16:28:59] looking at approximation solutions that can be calculated quickly, though:
[16:29:00] <_joe_> what do you want to obtain?
[16:29:09] https://www.submarinecablemap.com/
[16:29:12] <_joe_> everyone is mapped to the smallest-latency datacenter?
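Before the GeoDNS thread continues below, a minimal sketch of the stateless exp(-size/c) admission idea from T144187 discussed a few lines up. The tuning constant here is invented for illustration; the real value comes from the offline traffic analysis:

```c
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stateless size-based cache admission: admit with probability
 * exp(-size/c). Small objects are nearly always admitted; a large
 * object only gets in if it's requested often enough to win a roll. */
static bool admit(size_t size_bytes)
{
    const double c = 256.0 * 1024.0;  /* hypothetical tuning constant */
    const double p = exp(-(double)size_bytes / c);
    return (rand() / (double)RAND_MAX) < p;
}
```

With c = 256KB, a 16KB thumbnail gets p ≈ 0.94 per request while a 1MB original gets p ≈ 0.02, so a one-shot scan of the URL space mostly fails to fill the cache with large one-hit wonders, with zero per-object state; that's the property the bloom-filter design would have bought statefully.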
[16:29:39] (based on coordinates, yes)
[16:30:10] if a user is at 43,53, and we have a set of 10 datacenters at these other 10 sets of geographic coordinates, which is latency-closest, to some reasonable approximation
[16:30:13] 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3979092 (10ayounsi) No worries, port description renamed!
[16:30:14] <_joe_> so I don't know how detailed the ripe atlas map is, but you could in fact use it as your metric if we have enough datapoints around the globe.
[16:30:34] yes, that's the exact-ish solution. there's even more-exact ways to engineer it.
[16:30:59] such as sampling 1/1000 of js-capable user-agents and having them measure latency to all DCs and report it back to us to build better maps, etc
[16:31:39] but here, for this troll, I'm talking about mathematical approximations that are just "better than using raw geographic distance to approximate latency"
[16:31:48] that is what cedexis does
[16:31:57] <_joe_> so you could do dist_total = dist(P_user, P_known_latency) * c + known_latency
[16:31:58] (abusing JS capable UAs)
[16:32:08] so we have as interesting data: https://www.submarinecablemap.com/
[16:32:31] this gives you the basic shape of the global internet, which clearly does not have cabling running in every direction between every two points
[16:32:52] <_joe_> ok so you have these data points
[16:32:58] i have in the past thought about using linux's tcp metrics per-connection
[16:33:01] one of the most-notable discrepancies from as-the-plane-flies approximations is that cables don't cross the poles
[16:33:05] linux maintains RTT per tcp connection
[16:33:14] so if you collect that and aggregate across network prefixes...
[16:33:16] on the server side
[16:33:27] <_joe_> you could find the nearest entry point to the user in the cable network
[16:33:34] [yes, but that suffers a bias in that we only get data for the endpoints they're already mapped to]
[16:33:42] (indeed)
[16:33:52] <_joe_> and calculate the path along the edge of that graph to your nearest datacenter
[16:33:55] anyways, hang on while I run down these existing thoughts
[16:34:02] <_joe_> ok sorry
[16:34:03] <_joe_> go on
[16:34:05] <_joe_> :)
[16:34:35] so the data on submarinecablemap is actually available as regularly-updated XML, so we can process that data. It has the coordinates of the cable-landing sites and the latency of each link.
[16:34:44] (well, path length of each link, anyways, close enough)
[16:34:50] <_joe_> so it's a graph
[16:34:52] yes
[16:34:56] <_joe_> ok
[16:35:11] the problem is, you'd need to make the call on when to ride the submarine cables and when not to
[16:35:47] so if you're given two random coordinates, you basically first have to determine if they're within the same "continent" so you know whether to use the sub cables to reach between them or not, which is also not trivial without math on the approximate shapes of all the continents.
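For concreteness, the "trig" in question is the haversine great-circle formula; here is a self-contained version returning km, with a made-up city pair. gdnsd's actual function (pasted later in the log) drops the final asin/radius steps, since only relative comparisons between datacenters matter:

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define EARTH_RADIUS_KM 6371.0
#define DEG2RAD(d) ((d) * M_PI / 180.0)

/* Haversine: great-circle distance between two lat/lon points in km. */
static double great_circle_km(double lat1, double lon1,
                              double lat2, double lon2)
{
    const double phi1 = DEG2RAD(lat1), phi2 = DEG2RAD(lat2);
    const double dphi = DEG2RAD(lat2 - lat1), dlam = DEG2RAD(lon2 - lon1);
    const double a = sin(dphi / 2) * sin(dphi / 2)
                   + cos(phi1) * cos(phi2) * sin(dlam / 2) * sin(dlam / 2);
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(a));
}

int main(void)
{
    /* roughly Amsterdam (esams) to Ashburn (eqiad): ~6200 km */
    printf("%.0f km\n", great_circle_km(52.37, 4.90, 39.04, -77.49));
    return 0;
}
```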
[16:35:52] but in real life that depends on the user's provider peerings
[16:35:59] otherwise you end up taking submarine paths between, say, SFO and IAD
[16:36:26] <_joe_> bblack: you can try something simpler, that will have some failing edge cases
[16:36:28] volans|off: (yes, but this is just a "better approximation than geography", at least network engineers are trying to make best use of the cables that are available)
[16:37:07] so I kinda went down that path, it's almost as ugly to engineer as using real latency data
[16:37:23] <_joe_> bblack: we should only count user-to-edge, right?
[16:37:34] when what I'm looking for is a fast approximation. as in, a daemon can slurp up some input data and recalculate the global network mapping in single-digit seconds or less.
[16:38:10] joe: yes, but users are everywhere, including nearby our own edges
[16:38:30] <_joe_> bblack: so an algorithm could be something like
[16:39:07] * volans|off has some ideas but needs to think more and is unable to express them on a mobile keyboard
[16:39:12] :)
[16:40:06] so if we go with "very fuzzy approximations that are just a little better than raw geographic distance"
[16:40:41] if you squint at the global cable map, the most relevant observation is that most cables crossing oceans do so close-ish to the equator, and never go over the poles.
[16:40:44] <_joe_> if dist(user, cache_0) < alpha * (dist(user, cable) + min(set(dist(cable, cache_i)))) => map to cache_0
[16:41:00] <_joe_> else => map to cache nearest
[16:41:05] <_joe_> but yes, it's a ton of data
[16:41:13] I'm wondering if we could take advantage of some KML with continent/country borders
[16:41:16] and really, if you average their latitude, there's a fuzzy "internet equator" that's a little north of the real equator.
[16:41:42] if you look a little bit less-fuzzily, there's a sort of sine-wave shape to the overall cable map
[16:41:57] it sweeps northwards over the americas and europe, and downwards through asia
[16:42:34] and with a fair amount of fuzz, the distance between two points that are close enough (same-continent-ish) is a number you can come up with too
[16:42:39] <_joe_> so the easiest way to do this, imho, is to create a mapping avg_latency_to_cache_n from patches of the globe, and use that. This can be computed offline instead of having the running daemon do the calculations
[16:42:43] resulting in an algorithm sort of like this:
[16:43:54] where C is some value that means they're fairly close and kinda same-continent-scale: if (geodist < C) { use direct distance } else { draw a path to the closest point on the cablemap sinewave from each end, and ride the sinewave between those points }
[16:44:16] did I mention that postgres has great GIS support? :-P
[16:44:29] ^ this is something that can be done fast, with no fancy data or database, just coordinate inputs
[16:44:38] * volans|off couldn't resist mentioning postgres
[16:44:42] <_joe_> that could work, if you properly adapt C to be modulated across the globe
[16:45:20] I suspect a value for C can be found where the results are, in practice, indistinguishable (for reasonable DC-layouts)
[16:45:27] <_joe_> I was thinking of something slightly more complex
[16:45:45] <_joe_> but this could work as well
[16:46:04] <_joe_> bblack: uhm, I think europe and asia are a counterexample
[16:46:11] yeah I was leaning towards a more detailed solution
[16:46:42] yeah I think I poorly defined how approximate and fast we'd ideally like to be :)
[16:46:47] in particular taking into account some big blocks of land (kinda continents but custom)
[16:46:50] <_joe_> volans|off: even if you just modulate C based on a simple function, you might get good results and do something easy
[16:47:10] static, in the code, no db
[16:47:14] <_joe_> the easiest thing to do is to define continents as boxes
[16:47:15] part of the algorithm
[16:47:18] <_joe_> :)
[16:47:32] yeah but custom boxes with better shapes
[16:48:06] true
[16:48:06] <_joe_> this is indeed an interesting problem
[16:48:21] <_joe_> let's assume the continents are boxes :)
[16:48:34] the graph of submarinecablemaps is reasonable data to use for the calculation too, if we get over the "when do you ride the cables" problem, and maybe continent-boxes is enough for that.
[16:49:16] <_joe_> bblack: once I assign you to a continent, then your available datacenters for direct connection are the ones in the same continent (roughly)
[16:49:29] not necessarily
[16:49:37] <_joe_> and you have a set group of cables
[16:49:39] one across a cable might be closer
[16:49:52] than one in the same continent
[16:49:54] <_joe_> volans|off: for direct connection?
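A toy version of the graph idea being circled here: landing sites as nodes, cable runs as km edges (the kind of data in the telegeography JSON quoted below), plus a synthetic overland edge between two same-continent landings, with all-pairs shortest paths via Floyd-Warshall. All nodes and distances are made up:

```c
#include <stdio.h>

#define N   4
#define INF 1e18

int main(void)
{
    /* 0 = user-side landing, 1 = mid-route landing,
     * 2, 3 = two landings on the DC's continent */
    double d[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = (i == j) ? 0.0 : INF;

    /* hypothetical symmetric edges, km */
    d[0][1] = d[1][0] = 6000; /* submarine cable */
    d[1][2] = d[2][1] = 1500; /* submarine cable */
    d[0][3] = d[3][0] = 9500; /* longer direct cable */
    d[2][3] = d[3][2] = 800;  /* synthetic overland link, same continent */

    /* Floyd-Warshall: all-pairs shortest paths, O(N^3) */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];

    /* concatenating cables wins: 6000+1500+800 = 8300 < 9500 */
    printf("node 0 -> node 3: %.0f km\n", d[0][3]);
    return 0;
}
```

The O(N^3) cost is also why computing shortest paths on the fly is worrying, as comes up just below; you'd only want to run it when rebuilding the map.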
[16:50:14] <_joe_> I am saying you can calculate the direct connection to the dc in your continent, if any
[16:50:25] ah yeah, and compare
[16:50:26] <_joe_> and then the remote distance to others via the submarine cables
[16:50:27] sure
[16:50:37] {"id":"africa-1","name":"Africa-1","cable_id":1878,"landing_points":[{"landing_point_id":13489,"id":"antsiranana-madagascar","name":"Antsiranana, Madagascar","latlon":"-12.275253,49.292170","url":"#/landing-point/antsiranana-madagascar"},{"landing_point_id":4361,"id":"jeddah-saudi-arabia","name":"Jeddah, Saudi Arabia","latlon":"21.481542,39.182873","url":"#/landing-point/jeddah-saudi-arabia"},{"landing_point_id":3259,"id":"karachi-pakistan","name":"Karachi, Pakistan","latlon":"24.889683,67.028539","url":"#/landing-point/karachi-pakistan"},{"landing_point_id":5896,"id":"mombasa-kenya","name":"Mombasa, Kenya","latlon":"-4.053016,39.672839","url":"#/landing-point/mombasa-kenya"},{"landing_point_id":5942,"id":"mtunzini-south-africa","name":"Mtunzini, South Africa","latlon":"-28.950440,31.757853","url":"#/landing-point/mtunzini-south-africa"},{"landing_point_id":5950,"id":"port-sudan-sudan","name":"Port Sudan, Sudan","latlon":"19.615552,37.219691","url":"#/landing-point/port-sudan-sudan"},{"landing_point_id":9400,"id":"terre-rouge-mauritius","name":"Terre Rouge, Mauritius","latlon":"-20.077813,57.510086","url":"#/landing-point/terre-rouge-mauritius"},{"landing_point_id":9486,"id":"zafarana-egypt","name":"Zafarana, Egypt","latlon":"29.116658,32.649893","url":"#/landing-point/zafarana-egypt"}],"length":"12,000 km","rfs":"2019","owners":"PCCW, Saudi Telecom, MTN Group, Telecom Egypt, Telkom South Africa","url":null,"notes":null}
[16:50:56] <_joe_> it's not going to be too hard if you pre-generate a table of distances
[16:50:59] heh whoops! sorry
[16:51:01] I meant to paste the link: https://github.com/telegeography/www.submarinecablemap.com/blob/master/public/api/v1/cable/africa-1.json
[16:51:04] but that's the data in that file
[16:51:14] we get coordinates of landing-sites, and a km distance for the cable run itself
[16:51:22] <_joe_> since it's a graph, the shortest path might be expensive to calculate on-the-fly
[16:51:39] <_joe_> bblack: yeah the main issue is how to concatenate cables
[16:51:53] <_joe_> say you're in sumatra, and we still don't have singapore
[16:51:57] so, one way that cables connect to each other is *at* shared landing sites, for graph purposes
[16:52:14] <_joe_> your easiest connect is probably via japan and then the US
[16:52:28] but the other way is that you can approximately assume two cables landing on the same continent have a direct overland connection that's probably approximately the overland geo distance.
[16:52:36] <_joe_> but if the graph is not properly connected and you have to cross japan, I dunno :)
[16:53:05] so we have to construct a matrix of artificial overland cables connecting the landing-sites of a given continent
[16:53:26] <_joe_> yes
[16:53:38] <_joe_> ripe data looks more and more promising by the minute
[16:54:27] IMHO if we use boxes we need to approx the cables and simplify them
[16:54:42] that means that we cannot consume that map directly
[16:55:04] in terms of approximate computation complexity allowed, the constraint is something like this:
[16:55:21] when there's a state-change (e.g. ops marks eqsin as offline for geodns, causing a remapping of things)
[16:55:34] basically map the whole globe into <30~50 boxes with interconnections
[16:56:07] a C daemon has to re-scan an entire GeoIP database of , and re-calculate what the optimal datacenter is for each, and get done in single-digit seconds or less.
[16:57:12] we can do that for geographic distance, as each network coordinate calculation is a matter of like 3 trig functions and a few multiplies and adds
[16:57:32] but it's easy for other solutions to escape reasonable fast re-calculation bounds
[16:59:15] the geographic distance calculation is currently this small function:
[16:59:17] bblack: but this would re-generate a new static mapping for gdnsd?
[16:59:19] static double geodist(double lat, double lon, double dc_lat, double dc_lon, double cos_dc_lat)
[16:59:22] { const double sin_half_dlat = sin((dc_lat - lat) * 0.5); const double sin_half_dlon = sin((dc_lon - lon) * 0.5); return sin_half_dlat * sin_half_dlat + cos(lat) * cos_dc_lat * sin_half_dlon * sin_half_dlon;
[16:59:26] }
[16:59:29] which pasted horribly heh
[16:59:32] <_joe_> ahahha
[16:59:42] <_joe_> link to github maybe?
[16:59:42] <_joe_> :P
[16:59:44] it's the haversine method, with some bits left off that we don't need
[17:00:07] https://github.com/gdnsd/gdnsd/blob/master/libgdmaps/dclists.c#L169
[17:00:37] <_joe_> is that calculated for *every* request?
[17:00:39] no that's the old one
[17:00:42] https://github.com/blblack/gdnsd/blob/3.x-prototype5/libgdmaps/dclists.c#L177
[17:00:51] ^ that's the one I was trying to paste, it's a little more optimized
[17:00:54] <_joe_> or just a map networks => datacenter you count once and use statically?
[17:01:24] right, on a state change, we run that calculation for every user-network-coordinate on the globe against our list of datacenter coordinates, generating a static output map
[17:01:35] actual runtime dns queries just use the fast static lookup output from that
[17:01:41] <_joe_> ok
[17:02:02] and state changes can take a while. it's ok if it takes single-digit seconds to react
[17:02:19] it's probably not so great if reasonable server hardware takes 3 minutes to recalculate the globe, though :P
[17:03:13] <_joe_> so, I would argue that the simplest way to do it is to pre-calculate the distance of each network to each dc at startup, then assign to the nearest available one
[17:03:23] right now, the math part linked isn't even at the top of the profiling list
[17:03:38] we spend more time just on iterating the geoip data and converting it into different structures
[17:04:03] <_joe_> heh
[17:04:04] but a lot of better calculation schemes that involve, say, real latency data or databases or graphs of cable maps, might make the calculation itself dominant
[17:04:25] <_joe_> yes, hence my proposal to detach it from the actual remapping
[17:04:36] well
[17:04:43] so the current code in releases does that
[17:04:57] it does the mapping once at startup, and comes up with a sorted list of DCs for each client-location
[17:05:14] then state changes just move down the list to the best-available choice, without remapping everything
[17:05:35] each user point should calculate the distance to all DCs and save them as a list ordered by distance
[17:05:51] yes, that's what the current-release code does
[17:05:54] exactly, when a DC is depooled you pick the next
[17:05:59] <_joe_> ok, what's the issue with that scheme?
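The function from that paste, reassembled for readability (same code as the 3.x-prototype5 dclists.c link above). It returns a value that is monotonic in great-circle distance: the asin/sqrt/Earth-radius steps of full haversine are dropped since only relative comparisons between datacenters matter, and cos_dc_lat is precomputed once per DC:

```c
static double geodist(double lat, double lon,
                      double dc_lat, double dc_lon, double cos_dc_lat)
{
    const double sin_half_dlat = sin((dc_lat - lat) * 0.5);
    const double sin_half_dlon = sin((dc_lon - lon) * 0.5);
    return sin_half_dlat * sin_half_dlat
         + cos(lat) * cos_dc_lat * sin_half_dlon * sin_half_dlon;
}
```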
[17:06:04] but for various other reasons, the new code doesn't want to do that :)
[17:06:13] <_joe_> ook
[17:06:20] the reasons are two-fold in practice:
[17:06:39] have talk to upastream already? :-P
[17:06:59] *have you talk to upstream
[17:07:28] 1) Part of the optimization problem here is also calculating maximal edns-client-subnet scopes. That is, if two adjacent network numbers like 192.0.0.0/24 and 192.0.1.0/24 are in different countries, but our map ends up sending them to the same place, we desire to aggregate that as 192.0.0.0/23
[17:07:51] <_joe_> right
[17:07:55] if we don't pre-cache the whole sorted list of possibilities, and just look at the best option based on the current up/down states, we get more aggregations.
[17:08:04] sure
[17:08:39] <_joe_> so live requests have less latency going through the list of mappings, it makes sense
[17:08:56] well not just that, the lookup is fast enough regardless of "table" size
[17:09:25] it's that the remote caches get to cache our responses for larger sets of clients, reducing their cache misses on behalf of DNS clients, and reducing load on our DNS servers, etc
[17:09:48] (in the case of edns-client-subnet)
[17:09:53] <_joe_> right
[17:09:54] anyways, reason the second:
[17:10:48] 2) It's desirable to move from a model of a DC just having "up/down" state to having a "weight" state that's runtime-mutable by admins as well. When the admins leave a DC online but cut its weight in half, this also changes the whole mapping in a way we can't precache based on startup-time config.
[17:11:23] <_joe_> uhm, this is solvable, but the first reason stands
[17:11:50] the weight parameter in theory allows for taking more than just latency into account, but also loading. in the static sense you might have a DC's weight lower because it has fewer servers (but that's not the case for us currently, until minipops)
[17:11:51] <_joe_> I mean multiplying a distance by an integer weight is fast and could be done upon reload
[17:11:54] for the first, we could solve it imho by just making the data structure a bit more complex
[17:12:09] but also the weight parameter allows us to ramp in caches
[17:12:17] * volans|off grabbing a laptop
[17:12:21] <_joe_> ahahahah
[17:12:24] <_joe_> WIN!
[17:12:31] e.g. take esams offline for a day for maintenance, and then bring it back gradually starting at say 0.05x weight to refill caches smoothly.
[17:12:51] which basically slowly increases the radius of clients around it that are mapped back to esams
[17:12:51] <_joe_> yes, so each change of weight grants a rebalance as well
[17:13:08] and that rebalance is going to change the dc-list ordering for various client networks
[17:13:37] at which point why bother pre-caching the list if we're recalculating it anyways. the calculation just becomes "what's the best choice right now, until the next state change?", and we do it at runtime on changes.
[17:14:51] bblack: will it be a recalculation or just some hashing? I guess you also want to limit moving clients around
[17:14:54] <_joe_> so, in theory, you could do as follows: precalculate the latencies of each subnet to each dc, then at runtime just calculate the weighted latency, and aggregate the subnets that end up in the same dc based on said weights
[17:15:09] so going from 1% to 2% you want to add 1% but mostly keep that 1% that was there still there
[17:15:24] yeah it's not hashing
[17:15:36] <_joe_> because well, the latency between one point on the globe and a dc is a static measure
[17:15:36] it's the closest-1% then closest-2% on the map
[17:15:49] (based on whatever "closest" means, the distance functions we're discussing)
[17:15:56] <_joe_> so recalculating it at every rebalance seems wasteful
[17:15:59] ok
[17:16:03] my idea to keep aggregation with pre-calculation is something like this:
[17:16:08] 192.0.0.0/23 -> [esams, {192.0.0.0/24 -> [eqiad, codfw, ulsfo, eqsin], 192.0.1.0/24 -> [codfw, eqiad, ulsfo, eqsin]}]
[17:16:37] true
[17:16:46] but that still assumes static weighting
[17:16:54] when the weights change, the whole list can change for every network
[17:17:01] <_joe_> volans|off: what about: you cache all latencies - networkA > {esams:200ms, ...}
[17:17:08] we don't have latencies
[17:17:24] <_joe_> bblack: let's call it a 'distance'
[17:17:25] we have "distances" in some arbitrary comparable unit
[17:17:50] <_joe_> you can surely calculate it at startup as a static value for each subnet, dc pair
[17:17:53] bblack: is the closest-X% mandatory?
[17:18:00] I know it's optimal
[17:18:01] :D
[17:18:04] yes
[17:18:04] <_joe_> then at runtime you pick the weighted minimum distance
[17:18:13] <_joe_> and aggregate that
[17:18:25] because it's not only used for ramp-in, it's also used for "this datacenter has like 2 tiny servers, so we're only using it for this nearby population and not others"
[17:18:39] <_joe_> in this scenario, your calculation is just an integer multiplication and N integer comparisons
[17:18:44] <_joe_> per subnet
[17:18:52] also, if you can re-calculate it all in single-digit seconds, I think it's acceptable to recalculate it every time a weight changes
[17:19:01] right
[17:19:18] _joe_: yeah I think I get what you're saying, let me repeat it to be sure, differently
[17:19:35] <_joe_> ok :)
[17:20:34] * volans|off hands over to bblack and _joe_ the award for being able to nerd-snipe volans|off into this on a friday afternoon that he was off...
[17:20:39] so you're saying I start out with mapping everything based on unweighted "distance" and creating a client-net-loc=>[x, y, z, a, b] sorted dclist, which may be a complex operation and take a longer time at startup. And then runtime weight adjustments, being just a multiplier, should allow me to re-order the dc-preference-list attached to each network with just simple multiplications.
[17:21:16] as long as I attach the resulting distance calculation data to the list, not just the sorted list itself
[17:21:28] that makes sense to me
[17:21:57] so static slow map at startup + runtime quick map recalculation on weight changes
[17:22:02] <_joe_> yes
[17:22:06] <_joe_> exactly
[17:22:10] so it would be client-net-loc=>[x@500, y@623, z@900, a@1100]. these are the complex distance calculations making a list with distances.
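A minimal sketch of the split being agreed on here: the expensive trig results are computed once per network, and a weight change then costs only one divide-and-compare per DC per network. The names are illustrative, and treating the effective distance as raw_dist/weight is just one plausible convention, not necessarily gdnsd's:

```c
#include <stdio.h>

#define NUM_DCS 4

typedef struct {
    double raw_dist[NUM_DCS]; /* expensive haversine results, computed once */
} net_entry;

/* Best DC for one network under current runtime weights:
 * weight 0 means depooled, higher weight attracts more traffic. */
static int best_dc(const net_entry *n, const double weight[NUM_DCS])
{
    int best = -1;
    double best_eff = 0.0;
    for (int i = 0; i < NUM_DCS; i++) {
        if (weight[i] <= 0.0)
            continue; /* down/depooled */
        const double eff = n->raw_dist[i] / weight[i]; /* cheap re-rank */
        if (best < 0 || eff < best_eff) {
            best = i;
            best_eff = eff;
        }
    }
    return best;
}

int main(void)
{
    /* the x@500, y@623, z@900, a@1100 example from above */
    net_entry n = { { 500, 623, 900, 1100 } };
    const double w_normal[NUM_DCS] = { 1.0, 1.0, 1.0, 1.0 };
    const double w_rampin[NUM_DCS] = { 0.05, 1.0, 1.0, 1.0 }; /* x refilling */
    printf("normal: dc %d, ramp-in: dc %d\n",
           best_dc(&n, w_normal), best_dc(&n, w_rampin));
    return 0;
}
```

Re-running this over the ~1.1M aggregated nets mentioned below is a few million floating-point ops, comfortably inside the single-digit-seconds budget.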
[17:22:26] and then I can runtime re-sort based on weight multipliers without re-running the whole base calculation
[17:22:29] yes
[17:22:30] <_joe_> yes
[17:22:33] and aggregate
[17:22:37] for the edns stuff
[17:22:49] and yes, it's still combinable with volans' idea about hierarchical dc preference-listing
[17:23:06] although the changes for the weight multipliers in that scenario get slightly trickier
[17:23:30] (we'll have to change subtree structure in places as a result of the new weights as we multiply them out)
[17:23:40] <_joe_> yeah, I don't think you even need that, the calculation in this scenario is significantly less complex than what you have now
[17:23:42] but that's also doable in the single-digit-seconds sort of recalc
[17:24:09] <_joe_> a sin(a) floating point op is going to be slower than 5 integer operations for sure
[17:24:19] and then the tree-ops
[17:24:21] <_joe_> well, depends on the processor
[17:25:33] 192.0.0.0/23 -> [esams@100, {192.0.0.0/24 -> [eqiad@200, codfw@300, ulsfo@400, eqsin@500], 192.0.1.0/24 -> [codfw@150, eqiad, ulsfo, eqsin]}]
[17:25:39] I left off some of the static distances
[17:26:04] <_joe_> not even sure you need anything more than the simple
[17:26:05] but if a state-change multiplier changes the "esams" distance for example
[17:26:18] it could mean we need to de-aggregate these two in the tree
[17:26:21] bblack: my structure has just one problem... if the /23 cannot be aggregated for the closest one but could be on the second closest, I mean you can just repeat it
[17:26:25] but might be trickier
[17:26:26] <_joe_> client-net-loc=>[x@500, y@623, z@900, a@1100] -> net_loc_aggregation => DC
[17:26:40] true, it could be two separate trees
[17:26:55] how?
[17:26:57] one that doesn't aggregate, and the recalculated one that does
[17:27:00] <_joe_> the first one is static, the second one, you calculate on reload
[17:27:08] right
[17:27:11] <_joe_> from the first one
[17:27:23] but the first one is without aggregation
[17:27:30] <_joe_> yes
[17:27:33] so list the 192.0.0.0/24 for example
[17:27:34] ok
[17:27:43] <_joe_> who cares? aggregation is already done on-the-fly
[17:28:01] <_joe_> I'm trying to think of a solution that's as performant or better than what bblack has today
[17:28:11] no this works
[17:28:11] <_joe_> while allowing better distance measures
[17:28:19] there's some data size explosion, but within reasonable bounds :)
[17:28:23] agree, but at this point the final mapping could already take aggregation into account
[17:28:36] volans|off: the aggregation changes when a DC's weight changes
[17:28:49] <_joe_> bblack: yes, I have no idea of the size of the dataset in geoip nowadays
[17:28:50] yes ofc
[17:28:51] so that tree has to be recalculated regardless
[17:29:01] <_joe_> used to be pretty small, but who knows anymore?
[17:29:09] so today with our manual mapping:
[17:29:13] plugin_geoip: map 'generic-map' runtime db updated. nets: 1123420 dclists: 6
[17:29:29] baham is using 477MB of RAM... ;)
[17:29:30] <_joe_> 1 million subnets
[17:29:42] all included, OS, everything
[17:29:46] that means out of , we ended up picking 6x distinct ordered dclists, and broke up all the geoip networks into ~1m aggregates
[17:29:57] <_joe_> yeah it's in the "runs on m1.tiny" category :P
[17:30:04] lol
[17:30:24] the "nets" count would be smaller if we were picking a single DC instead of a dclist, of course
[17:30:50] but we can combine that idea with all the above. the recalculated aggregation should of course map to single-dc, not a list
[17:30:57] <_joe_> I doubt we could end up using 100 MB for this
[17:31:13] <_joe_> well, in php, maybe :P
[17:31:27] gdnsd process has 80M of RES :D
[17:31:50] so yeah, I'm not worried about the additional memory needed for this
[17:32:03] <_joe_> actually, if you stored all the data in a byte-packed data structure, you could get away with using 64 bytes per record even in php
[17:32:05] maxmind's whole city-level database is ~128MB on-disk
[17:32:23] <_joe_> IIRC a zval is 64 bytes
[17:32:36] how dare you speak php in here! :)
[17:32:40] <_joe_> ahahahaha
[17:32:45] <_joe_> I wrote php today
[17:32:51] <_joe_> my first php unit test!
[17:33:12] * bblack offers condolences
[17:33:21] <_joe_> bblack: when I think of memory usage, I always make the mental map to php for a worst-case scenario
[17:33:40] <_joe_> O(PHP) (worst)
[17:34:00] php runs the world though
[17:34:00] rotfl
[17:34:35] <_joe_> I still remember the inane amount of memory any xml tree would eat up in phpland
[17:34:42] (or anywhere, really)
[17:34:58] XML is a crime against humanity :P
[17:35:04] <_joe_> well, multiply that by the efficiency of php
[17:35:13] it's the X.500 of file formats
[17:35:35] oh noo you remind me of when I got $developer asking me to raise the max memory of php cli to 10GB... to process some XML of at most 500MB on disk
[17:36:01] <_joe_> bblack: I wrote an insane thing that would take arbitrary formats, reduce them to a standard XML, and create outputs from those using XSLT stylesheets at $JOB~2
[17:36:20] <_joe_> volans|off: the good ole days of $JOB~2
[17:36:21] <_joe_> :P
[17:36:53] <_joe_> ok, back to the original problem: how do we want to measure those distances?
[17:36:54] anyways
[17:37:04] eheheh yeah, and I wrote a pseudo-xml-to-xml parser, to be able to ingest things that tidy was refusing even to open
[17:37:24] well, you've convinced me that with the right structure, the distance calculation can be offline and not happen every time a DC goes down or changes weight
[17:37:27] <_joe_> I would say we can use RIPE data
[17:37:28] sorry for the detour down memory lane
[17:37:32] which is the insight I really needed here
[17:37:49] now we don't have to think so hard about making the distance calculation such a fast approximation :)
[17:38:00] <_joe_> eheh indeed
[17:38:14] fully agree
[17:40:20] if it gets too complicated, it can be separated out as an offline step (basically a pre-processor that combines the geoip database with the static configuration of the DC coordinates + static weights)
[17:40:44] (and only gets re-run on geoip config changes and/or when the GeoIP database is updated)
[17:41:01] <_joe_> yes
[17:41:04] geoip config changes hopefully being rare, given the dynamic runtime ability to up/down/weight-change
[17:41:11] <_joe_> can you update that on the fly?
[17:41:12] yes, it can be a "utils" tool outside of gdnsd, could be backed by a GIS DB fwiw
[17:41:16] <_joe_> the geoip database?
[17:41:17] _joe_: yes
[17:41:28] <_joe_> so that would take longer I guess
[17:41:37] well
[17:42:11] worst case the model changes from "gdnsd configuration references the geoip db file and reloads it on changes" to "gdnsd configuration references and reloads 'foo', which you generate with some offline tool when the geoip db changes"
[17:42:20] you can re-generate the slow mapping outside and then tell gdnsd to reload it and re-calculate the fast mapping in memory
[17:42:34] same thing :)
[17:43:43] moving the actual distance calculation to an external tool where the "API" is a file format makes it much easier to swap out algorithms, too
[17:44:03] indeed
[17:45:12] <_joe_> ok, I am going off now for good :P
[17:45:14] anyways, go drink and stuff :)
[17:45:20] <_joe_> yes :P
[17:45:27] just note I successfully sniped you out of approximately 80 minutes of it
[17:46:11] always a pleasure to have this kind of chat ;)
[17:47:53] <_joe_> yeah but I justified myself with the double-snipe I did on volans|off, who's not even supposed to be working today
[17:49:24] :)
[18:02:48] 10netops, 10Analytics-Kanban, 10Operations, 10monitoring, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3979339 (10Nuria) Are we planning to use tranquility to move the data into druid or rather just kafka -> camus -> hive?
[21:48:59] 10Traffic, 10Operations, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog, and 2 others: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#3979897 (10Dbrant) 05Open>03Invalid