[10:20:36] hurm. the developer satisfaction survey has non-skippable questions for things which don't apply to me. 'yay'.
[10:50:17] kormat: I'd encourage you to ping greg-g about it - if it's too late for this survey it can always be feedback for the next one
[10:50:32] paravoid: ack. i've pinged him on the slack thread about it
[10:50:37] ah cool
[10:55:22] <_joe_> I think it's hard to have a one-size-fits-all survey; OTOH I can see how a sizeable part of it won't apply to most SREs
[10:55:43] <_joe_> (basically anyone who doesn't deploy services and/or has a MediaWiki dev env)
[10:57:26] greg and I had a thread on how to polish the "production" bits to create more useful feedback for SRE
[10:57:55] <_joe_> I see an alert for the nel data being stale on centrallog1001, and indeed we're missing NEL data. Is anyone aware if chris was doing something about it?
[10:57:57] and some of the changes you see in the relevant sections came out of that
[10:58:27] in the more-sre-specific sections, I mean
[10:59:02] <_joe_> data stopped being collected on march 1st https://grafana.wikimedia.org/d/43OLwO2Mk/cdanis-hacked-up-nel-stats?orgId=1&refresh=15m&from=now-30d&to=now
[10:59:10] <_joe_> the alert is just 2 days old though
[11:00:18] <_joe_> hnowlan: I see restbase 2019 and 2020 both have the cassandra certs with a critical alert (they expire in 29 days)
[11:00:27] _joe_: I know (vaguely) that he's been working on an integration with eventgate so that we can have geoip/asn information
[11:00:37] but I don't know exactly where that is
[11:00:54] <_joe_> yeah I'll ask him later :)
[11:05:34] _joe_: yep thanks for the heads-up! will be renewing them with apergos tomorrow
[11:05:51] indeed we will
[11:05:53] <_joe_> can you then ack the alerts in icinga?
[11:06:12] <_joe_> so that no one else bothers you with them in the first place :D
[11:06:59] aye, my bad
[11:58:20] I am missing some absolute numbers on the app server latency graphs, to differentiate "a few extra slow requests" vs "all requests are a bit slower"
[12:42:33] <_joe_> jynus: are you? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 the panels titled "% of responses" have a breakdown by absolute number too
[13:14:09] oh, I missed that, thanks, because of the title
[14:55:11] yeah indeed, does look like the NEL prometheus exporter broke, I'll take a look -- all data is still in logstash though
[15:07:49] kormat: replied on slack, but copy/pasting here for others:
[15:07:53] yes it is hard to get these kinds of surveys exactly right. For that specific question, however, I'm sure you do some development locally, yes? 😉 Thus, answering the first two (how satisfied, and how important) is still valid, in my view. The follow-ups are focused primarily on the MW/services side of things but selecting "Other" there is valid and useful for the SRE use case.
[15:09:13] (not stated on slack) by local dev for you I mean your puppet workflows etc. That's "local dev" as well in my opinion.
[15:10:08] greg-g: the 'local dev' for puppet is... send it to gerrit, and have pcc run it
[15:10:41] because there's basically nothing else you can do
[15:10:47] you don't have tests that can run locally? no linting/etc?
[15:11:01] there might be, but nothing i'm familiar with
[15:11:11] <_joe_> yes, there are
[15:11:17] I mean, that's what I do when I make puppet patches, but I'm just updating admin/data.yaml ;)
[15:11:22] <_joe_> you can run the whole CI (including unit tests) from your machine
[15:11:35] <_joe_> but the compiler is still an important aid
[15:11:42] <_joe_> esp if you don't write tests
[15:12:25] the compiler is the only way to see what will actually happen short of pushing to production
[15:13:11] kormat: that's starting to sound like an anti-unit test argument ;)
[15:13:49] greg-g: the puppet results depend a lot on what's in puppetdb, which isn't accessible on your local machine
[15:14:15] you can test small isolated stuff with unit tests, sure. but not "what will happen on these 3 different machines"
[15:14:34] agreed, but as was mentioned, there are tests that can be run locally, and an env that supports that, so that is how you can answer that question. No local dev env is a "perfect representation of what will happen in production" so that's not a fair comparison.
[15:15:48] ok. i might as well answer that i'm very dissatisfied, and that it's not important to my work. ¯\_(ツ)_/¯
[15:16:10] why dissatisfied? you just argued you wouldn't use it?
[15:16:38] my point is, listen to _joe_, he said there is one, maybe it's not something you use but that wouldn't mean you're dissatisfied with it.
[15:16:41] greg-g: having to push stuff to gerrit and then kick off a pcc run to figure out if stuff is going to do what you want or not sucks. but there's no feasible alternative
[15:16:54] because of the nature of puppet
[15:17:23] * greg-g goes to finish his morning routine
[15:17:39] answer how you see fit! I don't want to bias results :)
[15:17:40] <_joe_> kormat: we could make a "compiler env" that we could run locally, but that would require some work
[15:18:15] _joe_: and it would be Heavy, i'm assuming. given you'd need to run a puppet db etc, right?
[15:19:07] <_joe_> kormat: not really very heavy, but yes a couple containers
[15:20:20] so about like mw ;)
[15:22:32] <_joe_> greg-g: a couple containers for mw? not for a prod-like dev env :P
[15:22:40] <_joe_> I count at least 5
[15:22:55] <_joe_> db, etcd, memcached, httpd, php-fpm
[15:23:11] _joe_: I couldn't remember the exact number, and it wasn't needed for the general tone of my joke :P
[15:23:33] but yeah :)
[15:24:41] greg-g: next section: "satisfaction with beta cluster". i've never used beta. i'm not even 100% sure what it is
[15:24:58] <_joe_> lucky you
[15:25:11] greg-g: want me to just inject random noise into your survey results? :P
[15:25:15] <_joe_> marostegui: I think we should fix that
[15:25:46] 3s are in-between satisfied and dissatisfied, so go with that, is my recommendation
[15:26:11] greg-g: alright
[17:35:29] hey all, any SREs around to help us with beta? in -operations atm
[18:15:30] 07:10:08 greg-g: the 'local dev' for puppet is... send it to gerrit, and have pcc run it <-- I left comments explaining that I was dissatisfied with this workflow :)
[18:16:15] legoktm: 👍 :)
[18:17:46] I also commented about not having a good way to test puppet changes outside of production (can't really test them in beta, I did the self-hosted puppetmaster route for that inside of beta and it was a nightmare)
[18:18:37] https://wikitech.wikimedia.org/wiki/Puppet/Pontoon aims to solve some of these issues, I believe
[18:21:12] how does one do this for an instance in beta?
[18:21:26] I dunno, I believe beta has some of its own custom puppetmaster support
[18:21:38] in the Pontoon project?
[18:22:21] because beta itself has its own puppetmaster, which is why I went self-hosted for a long time, so I could test there without potentially breaking things for other instances
[18:33:36] that was the thing I saw recently and couldn't remember (pontoon)!
[21:16:00] I'm curious if someone could describe in a nutshell what a Linux bonding driver (such as balance-tlb for "adaptive transmit load balancing") could be used for, and how it relates to LVS?
[21:22:18] well, bonding in general is teaming two or more interfaces together between a host and a single network
[21:23:02] e.g. if you only have 10Gbps ethernet cards and ports, but you wanted to support 30Gbps of traffic between HostA and NetworkB, you might choose to bond 3x 10Gbps ports together.
[21:23:54] there are a bunch of standards and/or strategies for exactly how you use the 3 ports together: whether and how they fail over if one link fails, whether the other end (e.g. the switch) is even aware of the bonding or not, etc.
[21:24:39] https://www.kernel.org/doc/Documentation/networking/bonding.txt <- is the canonical reference for the basic upstream bonding support in Linux
[21:26:07] https://en.wikipedia.org/wiki/Link_aggregation is a more general overview
[21:26:26] what you're asking about is probably the "balance-tlb" there, which lets the host decide which of those 3 ports to send a given packet over based on either the bandwidth saturation of the links (with various packet sizes in play, a per-packet RR doesn't necessarily utilize the BW fairly), or by hashing the traffic (e.g. on L3/4 tuples, to steer a flow/connection persistently to one interface or the
[21:26:31] other)
[21:27:06] how it relates to LVS probably depends on the scenario
[21:27:40] but since in e.g. our current LVS setup, LVS is only doing one side of the conversation, transmit-only strategies might be more useful/usable than they are for normal bi-directional traffic.
[21:30:54] (to detail that compressed sentence a little more: for our LVS, the user->WMF side of e.g. a TCP connection flows through LVS to reach cpNNNN, but cpNNNN sends response packets directly without going back through LVS. So while the LVS machine does receive packets sent to it by our border routers, it only transmits in the inwards-facing logical direction towards realservers)
[21:42:04] * Krinkle reads
[21:45:17] bblack: Thanks, so while bonding is described as "load balancing" it is not usually associated with the kind of load balancing one might do with LVS, Varnish/ATS, Nginx etc, but is rather lower-level: more or less blindly deciding which path traffic flows through on its way to its destination, rather than determining the destination itself.
[21:45:49] well, in that sense I think the distinction is in the layers of the OSI model
[21:46:01] Or is this the point where you say that at scale one could cleverly use/abuse bonding to no longer need LVS, but that this is either a terrible idea, or something we hope to do one day but haven't.
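(A rough illustration of the transmit hashing described at 21:26: pick one of the bonded links by hashing a packet's L3/4 tuple, so each flow stays pinned to one physical interface. This is a minimal Python sketch with invented interface names; it does not reflect the kernel's actual xmit_hash_policy math.)

```python
import zlib
from ipaddress import ip_address

# invented 3x10G bond, per the HostA/NetworkB example above
BOND_SLAVES = ["eth0", "eth1", "eth2"]

def pick_slave(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Hash the L3/4 tuple onto one bonded link, so every packet of a
    given flow keeps using the same physical interface."""
    key = f"{int(ip_address(src_ip))}|{int(ip_address(dst_ip))}|{src_port}|{dst_port}"
    return BOND_SLAVES[zlib.crc32(key.encode()) % len(BOND_SLAVES)]

# Same flow -> same link every time; a different flow may land elsewhere.
print(pick_slave("198.51.100.7", "208.80.154.224", 51234, 443))
print(pick_slave("198.51.100.8", "208.80.154.224", 40000, 443))
```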
[21:46:11] no, they're different :)
[21:46:28] bonding/LAG like we're talking about above is loadbalancing at OSI Layer 2, whereas LVS is more like layers 3+4, and nginx/haproxy/etc are Layer 7
[21:46:31] link aggregation has somewhere between little and no idea about separate TCP connections
[21:47:02] right, I knew LVS was not at L7, but that at least functionally for us it is behaving much like haproxy would, but more performant
[21:47:21] (but confusingly, with the correct set of parameters, you can make Linux's L2 bonding pay attention to L4 information to decide how it balances at L2)
[21:47:55] LVS functionally for us is very much stuck at L3+4, no higher.
[21:48:19] General cergen / TLS question: I'm updating an existing certificate and one of the early steps in https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate is to do a `puppet cert clean wdqs.discovery.wmnet` - but wouldn't this revoke the cert and thus cause issues for the services that are relying on the cert until the new cert is published?
[21:48:26] (and, more importantly, it's only L3/4-balancing one side of a TCP connection and not the other. it's like a half-duplex L4 balancer)
[21:48:39] Hm.. right, I guess it works for us because all cp* hosts handle the same traffic, so we don't need to inspect any host header or some such.
[21:48:51] haproxy can be a [bidirectional] L4 balancer, or an L7 one, depending how it's configured
[21:49:16] LVS can be bidirectional, too, we just don't use it that way
[21:49:40] so there isn't a way we could e.g. use L2 bonding on LVS hosts to distribute traffic directly to cp* hosts without needing to enter LVS/L4, and still be able to have cp* hosts return direct?
[21:49:49] bidirectional-LVS + bidirectional-haproxy-L4 are very similar, but haproxy has more bells and whistles, while LVS does more work in the kernel instead of userspace, too.
[21:49:57] (I'm asking mainly to fill in the blanks, not because I think it's a good idea)
[21:50:44] so, the only way L2 bonding helps there, is if you're trying to use more than one interface worth of traffic
[21:50:59] otherwise, it's simpler to just use a single interface for the LVS->CP traffic flow
[21:51:47] or in our case, we have 4 interfaces for the 4 rows, so... I guess we could L4-hash flows onto the 4 interfaces, but we'd need the destination macaddress rewrites from LVS as well
[21:51:55] oh, I see. So L2 is too low by itself to actually be able to distribute packets between hosts. I guess if one did that, it would break stuff, since maybe that would mean different parts of a request end up on different hosts.
[21:52:10] so it's only useful if you want to manage multiple paths to the same cp hosts via different links or some such?
[21:52:17] so right, rewinding to make that part make more sense:
[21:53:11] Krinkle: or making multiple physical links between a host and a switch look like one logical interface
[21:53:14] text-lb.eqiad has one IP address: 208.80.154.224 . This address is configured on all 8x of the cp-text nodes in eqiad, on their loopback interface (basically so the host itself knows that it's a legitimate destination, if such traffic magically appeared somehow)
[21:53:31] the rest of the network fundamentally doesn't know that that IP lives on those 8 hosts
[21:54:17] the way you normally find that out, for normal single-host traffic, is with the ARP protocol, which e.g. lets appserver01 ask the network: what is the L2 (ethernet mac) address for the host that claims 10.1.2.3 on this network?
[21:54:35] ARP doesn't work for balancing that same IP's traffic over many hosts with different mac addresses
[21:55:23] similarly, the LVS box in question (say it's lvs1013.eqiad.wmnet) also has a logical loopback definition for 208.80.154.224, but it also doesn't have it on a public interface and thus doesn't answer ARP for it.
[21:55:28] sorry, just a curious bystander, so basically the cp-text nodes, instead of sending an arp request, just know by default that the MAC address belongs to text-lb?
[21:56:07] for that matter, for all of this to work: 208.80.154.224 isn't actually within the netmask of any of our actual vlans, so there's no network on which an arp reply for it would even make sense
[21:56:45] ok, I think I'm getting it. The L2 logic helps solve a different problem. Would it be fair to say that L2 bonding could be used to improve/customise how traffic flows, but not to change where it ends up? For that you need information from a higher level, either by sitting at that higher level (like LVS) or by re-implementing the layers in between, but that would be pointless since the kernel has that logic already and does so as efficiently
[21:56:45] as we know how, so there'd be nothing we're improving or cutting out, or at least not significant enough that we think this makes sense to do there in a monolithic ad-hoc way.
[21:56:51] so if any host, whether it's an internal appserver, or public traffic, tries to send a packet (say a TCP SYN packet to start a connection) to 208.80.154.224, it will end up at one of our hardware juniper routers to see where it should go.....
[21:57:22] our LVS servers use the BGP routing protocol to advertise to the juniper routers that they're a destination for 208.80.154.224.
[21:57:32] so the juniper will now forward that packet to lvs1013
[21:57:59] lvs1013 has configuration that tells it that cp1075, cp1077, etc, etc (8 total machines) should handle the traffic for 208.80.154.224:443
[21:58:12] it learns the mac address (L2 ethernet) of them all
[21:58:51] it takes the original packet that was destined for 208.80.154.224, and forwards it to (say) cp1075 on the L2 network that lvs1013 and cp1075 share (the vlan for that eqiad row)
[21:59:10] with cp1075's macaddr as the L2 destination
[21:59:39] thus the packet arrives at cp1075, and then the host doesn't immediately throw it out even though that's not cp1075's actual IP address, because it does have that address configured internally on the loopback.
[22:00:02] and then we have regular L7 software on that host bound to listen on 208.80.154.224:443 which receives it in the normal way
[22:00:47] and responds in the normal way, which goes back out cp1075's interface with no special handling, and thus straight on to the juniper router for the return path (assuming it was a public IP on the other end of the connection, or a cross-row IP. It could go directly if it were another host in the same row)
[22:01:43] so there's a lot of layer-crossing magic involved, and that's why LVS needs 4 physical interfaces attached to all the rows.
[22:02:03] because it has to be directly on the L2 network of the backend service hosts it's forwarding traffic to, so that the "rewrite the mac address and send" part works.
[22:03:34] so, because of all that, it doesn't make sense to think of bonding the 4 interfaces on the LVS, because they're going to different places.
[22:03:57] although maybe you could, if ipvs also understood that level of indirection (that 2/8 hosts were on interface A, etc, etc).
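(To make the packet walk above concrete, here is a minimal sketch of the direct-routing idea: choose a pooled cp host for a source IP and rewrite only the L2 destination, leaving the IP packet untouched, which is why the service IP has to live on each cp host's loopback. Hostnames and MAC addresses are invented; the real logic lives in the kernel's ipvs.)

```python
import zlib

SERVICE_IP = "208.80.154.224"  # text-lb.eqiad, also on each cp host's loopback

# realserver -> its learned L2 (MAC) address on the shared row vlan (invented values)
POOLED = {
    "cp1075": "00:11:22:33:44:01",
    "cp1077": "00:11:22:33:44:02",
    # ... 8 in total in reality
}

def forward(src_ip: str, frame: dict) -> dict:
    """Pick a pooled cp host for this source IP and rewrite only the L2
    destination; the IP header inside still says dst=208.80.154.224, which
    the cp host accepts because that address sits on its loopback."""
    backends = sorted(POOLED)
    chosen = backends[zlib.crc32(src_ip.encode()) % len(backends)]
    return dict(frame, dst_mac=POOLED[chosen])  # IP/port left untouched

# The response later leaves the cp host directly, never back through LVS.
frame = {"dst_ip": SERVICE_IP, "dst_port": 443, "dst_mac": "mac-of-lvs1013"}
print(forward("198.51.100.7", frame))
```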
[22:04:20] or if bonding understood it without ipvs, or something
[22:05:35] the primary problem we have with LVS scalability, just looking at the practical stuff at this moment in time, is that as part of doing its routing magic, it wants to maintain TCP connection state tables for every connection passing through it.
[22:06:26] we use the direct return path to avoid that, right?
[22:07:01] no, direct return just avoids sending the (much larger) response packets through LVS, which would be a bandwidth bottleneck
[22:07:07] ah ok
[22:07:15] the request side (user "GET" part of HTTP traffic) is much smaller, our responses are much bigger
[22:07:35] so 1x 10Gbps interface easily handles all of the inbound side of text@eqiad, but that wouldn't be true for the return traffic.
[22:07:49] but:
[22:08:11] ipvs is a fairly generic toolkit. it can be configured in a lot of different ways, including some that are fully bidirectional, etc.
[22:08:19] yeah, I suppose we're not bottlenecked between juniper and cp* for incoming bandwidth, we just need lvs hosts to be able to handle processing the traffic (cpu bound?), which might also explain not needing L2 bonding tricks or additional network links between them to manage by hand
[22:08:31] right
[22:08:51] whoever made these design decisions a long time ago decided that ipvs's internal implementation should actually track TCP connection states. In some modes, ipvs's logic demands it.
[22:08:51] the TCP conn state on LVS is basically just a waste then, right?
[22:08:54] it's written but never used?
[22:09:19] so that it can, for example, make a random round-robin decision to send a TCP SYN to BackendA, and then record that in the state table so that the rest of the packets for that connection go to the same BackendA
[22:09:56] but the way we use it, we'd rather have it be stateless, and just make an in-the-moment decision for each packet: hash(srcip)->ChooseBackend->ForwardOnePacket.
[22:10:05] but, it's not stateless
[22:10:50] right, if we can randomise only on src and not be fully random globally, that would allow not needing that state
[22:11:00] and we do that right? we keep H2/TLS conns that way on the cp hosts
[22:11:27] so we already hash by srcip for frontends, not strictly round-robin I guess.
[22:11:42] functionally we at least need a single TCP connection's packets to all end up at one cp host
[22:12:01] we prefer that all connections from a given source IP land at one cp host, because that makes ratelimiting at L7 easier
[22:12:23] a consistent hash can do that reasonably well on a per-packet basis
[22:12:31] it's deeply tricky to be truly stateless in the face of events like realservers not being healthy 100% of the time
[22:12:39] (and minimize the disturbance when we depool a node)
[22:12:42] feel free to stop any time, but now I'm wondering if this TCP state management cost is also similarly an issue on juniper routers, or whether it either scales better there, or whether it "delegates" this problem down to lvs/cp by allowing responders to know everything they need to do so directly (thus stateless, no NAT?)
[22:13:10] right, so we give up RR per request in favour of RR per srcip.
[22:13:13] the junipers are per-packet routers, they don't pay attention to tcp state management like this, which is why they scale ok when the flood comes, but then LVS doesn't
[22:13:17] Krinkle: LVS's purpose is to hide that from Juniper / anything outside of LVS
[22:13:54] so we also have the option of configuring juniper to do the hash(srcip) part.
we've set up some other services to work that way experimentally, including internal recdns.
[22:13:57] right, and that's why we have more LVS than Juniper, we fan it out to have the cpu/mem capacity we need for that state mgmt
[22:14:23] (notice /etc/resolv.conf on most of our machines has just 10.3.0.1 - which goes to 2-3 different recdns hosts in each DC, using a juniper hash(srcip) scheme which doesn't involve LVS at all)
[22:14:53] hm.. so does that mean my request can have packets go to different lvs hosts, and only "reassemble" on the cp* host? I didn't think about it that way before, but I guess that can happen? Or did I miss something just now.
[22:15:01] Krinkle: well, no -- we have multiple LVSen, but only a single one is 'active' at any given time
[22:15:07] the problem with *that* approach is that it's up to the destination host to depool itself or crash, there's no 3rd party monitoring to depool it
[22:15:14] (for a given datacenter and flavor of internal/external LVS)
[22:16:09] which, I guess we could fix by having an external observer monitor each service host, and then command it to stop itself, but this quickly regresses into all kinds of edge cases and partial failure modes
[22:16:11] oh ok, so Juniper->LVS isn't increasing capacity within e.g. text-lb, only at the wider level for the DC from stuff that isn't text-lb
[22:16:31] well
[22:16:42] do we expect eg. within the next 10 years to not be able to get away with a single text-lb LVS for all incoming traffic?
[22:16:52] I would say the reason the LVS layer exists behind juniper is to manage the health-state and admin-pooled state of the 8x cp nodes behind text-lb, mostly.
[22:17:13] right, pybal comes into that. we need to depool quickly etc.
[22:17:17] but there are multiple reasons
[22:17:29] and we already have problems with the single text-lb, when we're under attack.
[22:17:53] (because its state tracking is too expensive)
[22:18:03] but we're getting into topics that are maybe best not for here
[22:19:07] if it's "safe" to have different packets of the same srcip/req go to different LVS before reaching the same cp host, then scaling LVS would require traffic to divide itself before reaching it. That would be an L2 problem, I think? Anycast comes to mind, but I don't know if I'm way off by thinking that.
[22:19:42] no, it's still an L3 problem unfortunately
[22:20:05] or maybe Juniper could just be told to RR that, but then you also get the pooling state again.
[22:20:13] we've talked about tacking active/active onto our current LVS
[22:20:18] the same way we do for recdns
[22:20:30] but then we run into the same problems around health state and coordination
[22:20:32] anyway, I'm well into understanding the part I was confused about. happy to move on any time, but also will never cease to be curious :)
[22:20:40] it's just moving problems up and down layers, they still exist one way or another :)
[22:21:17] but yeah, in theory we could have two different LVSes use the same MED to advertise the text-lb IP to juniper, and it will hash traffic into them.
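(A toy consistent-hash ring, tying back to the 22:12 points about hashing by source IP and minimizing disturbance on depool: only clients that hashed to the removed node move, everyone else keeps their cp host. Illustration only, not what ipvs/pybal actually implement; the hostname list is an invented subset.)

```python
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(backends, vnodes=64):
    """Place several virtual points per backend on a hash ring."""
    return sorted((_h(f"{b}#{i}"), b) for b in backends for i in range(vnodes))

def lookup(ring, src_ip: str) -> str:
    """Map a source IP to the first ring point at or after its hash."""
    point = _h(src_ip)
    for h, backend in ring:
        if h >= point:
            return backend
    return ring[0][1]  # wrap around the ring

backends = [f"cp{n}" for n in (1075, 1077, 1079, 1081)]  # invented subset
ring_all = build_ring(backends)
ring_depooled = build_ring(backends[:-1])  # depool cp1081

clients = [f"198.51.100.{i}" for i in range(200)]
moved = sum(lookup(ring_all, c) != lookup(ring_depooled, c) for c in clients)
print(f"{moved}/{len(clients)} clients changed backend after depooling 1 of {len(backends)}")
```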
[22:21:18] oh, so with recdns you can have the end-user already aware of the different LVS hosts (maybe not literally, but at least have multiple IPs for it, and we'd assign them as we see fit)
[22:21:49] also, juniper isn't smart enough to route ICMP, that's another way it and LVS differ here
[22:23:17] with recdns, we built a setup that balances traffic from the junipers->[several service hosts] without an LVS layer involved
[22:23:40] but, it has significant caveats that make it not directly and safely applicable to a case like text-lb :)
[22:23:57] MED = Multi-Exit Discriminator?
[22:24:06] it was an experiment, maybe a shot across the bow of the current setup to explore the rationales for it and see how badly we needed some of them, or not
[22:24:09] yes
[22:24:14] Krinkle: yeah, basically a BGP-ism to indicate a tiebreaker
[22:25:48] the primary concerns with the lvs-less setup are the lack of ICMP routing (less an issue for DNS than it is for HTTPS), and the lack of independent monitoring (a host depools itself, including implicitly if it shuts down or the service daemon shuts down, but nothing outside of that host is monitoring its health and depooling it independently)
[22:26:16] right... this reminds me of our ping loadshedding service
[22:26:26] that gets funneled from text-lb LVS?
[22:26:46] or from juniper already?
[22:26:58] it's working out great for recdns, but DNS is just much simpler in protocol terms, and much more reliable at the software layer, too.
[22:27:02] and less churny
[22:27:18] I think ping-offload is a direct config on the juniper routers
[22:28:22] (not even BGP I think, just literally a static route on the junipers)
[22:28:46] https://wikitech.wikimedia.org/wiki/Ping_offload
[22:29:04] yeah
[22:29:32] Right, I vaguely recall the concerns raised at the time. offloading it within LVS would be too late, where the damage is mostly or completely already done in terms of load
[22:29:44] on the lvs host that is
[22:29:55] yeah, with ping offload it's not even truly a "load" problem
[22:30:26] it's that the lvs boxes have icmp ratelimits that are sane for TCP-connection-related ICMPs, but get overwhelmed by all the nonsense ping traffic
[22:30:49] and so sometimes we drop the more-useful connection-related ICMPs we cared about, to make room in the ratelimiter for silly pings
[22:31:10] ah yeah, this is all covered at the top of the wikitech page too
[22:31:21] - Linux has internal ICMP rate limiters that can cause the kernel to drop valuable ICMP packets. By offloading ICMP echo, we make sure the "important" ICMP (eg PMTU discovery) doesn't get dropped.
[22:33:37] Thanks :)
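(A toy model of the shared rate-limit problem quoted from the Ping offload page above: when echo floods and "important" ICMP such as PMTU discovery draw from the same budget, the flood starves the packets that matter. The numbers and the token bucket here are invented for illustration; the kernel's actual ICMP limiter works differently in detail.)

```python
class TokenBucket:
    """Simple shared rate limiter: everything competes for the same tokens."""
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate, self.burst, self.tokens = rate_per_sec, burst, burst

    def tick(self, dt: float) -> None:
        self.tokens = min(self.burst, self.tokens + self.rate * dt)

    def allow(self) -> bool:
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, burst=100)
dropped_important = 0
for second in range(10):
    bucket.tick(1.0)
    for _ in range(5000):        # flood of echo requests eats the budget
        bucket.allow()
    for _ in range(20):          # the PMTU/unreachable messages we care about
        if not bucket.allow():
            dropped_important += 1

print(f"important ICMP dropped: {dropped_important}")
# Answering echo elsewhere (before this limiter) leaves the budget for the rest.
```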