[08:03:24] 10Traffic, 10Operations, 10Patch-For-Review, 10User-Elukey: logster should not resolve statsd's IP every time it sends a metric - https://phabricator.wikimedia.org/T171318#3473823 (10elukey) After a chat with Moritz and Ema we decided to pick the current jessie version and apply the patch on top of it. In... [08:08:53] 10Traffic, 10Operations, 10Patch-For-Review, 10User-Elukey: logster should not resolve statsd's IP every time it sends a metric - https://phabricator.wikimedia.org/T171318#3460781 (10MoritzMuehlenhoff) Ideally we upgrade to 1.x in Debian, the version currently in the archive is from 2014 and hasn't been to... [08:27:57] 10Traffic, 10Operations, 10Pybal, 10monitoring: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3473875 (10ema) [08:28:04] 10Traffic, 10Operations, 10Pybal, 10monitoring: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3473888 (10ema) p:05Triage>03Normal [09:07:20] 10Traffic, 10Operations, 10Patch-For-Review, 10User-Elukey: logster should not resolve statsd's IP every time it sends a metric - https://phabricator.wikimedia.org/T171318#3473940 (10elukey) New package uploaded to jessie-wikimedia and rolled out to role cache::misc/upload/text/canary. [09:21:23] 10Traffic, 10Operations, 10Patch-For-Review, 10User-Elukey: logster should not resolve statsd's IP every time it sends a metric - https://phabricator.wikimedia.org/T171318#3473953 (10elukey) 05Open>03Resolved a:03elukey Impact to maerlant and acamar's traffic: {F8854082} {F8854085} In eqiad hydron... [09:21:26] 10netops, 10Operations: "MySQL server has gone away" from librenms logs - https://phabricator.wikimedia.org/T171714#3473956 (10fgiunchedi) [09:22:12] elukey: should I upgrade logster on the beta cluster varnishes ? :-} [09:22:28] I am not even sure it is setup/running properly there though [09:23:33] upgrading [09:26:13] +1! [10:25:32] 10netops, 10Operations, 10monitoring: "MySQL server has gone away" from librenms logs - https://phabricator.wikimedia.org/T171714#3474171 (10faidon) [10:35:29] 10netops, 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3474186 (10fgiunchedi) 05Open>03Resolved This is the resolved, note that the port in https://gerrit.wikimedia.org/r/367875 is required since `$c... 
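For illustration of the logster change tracked in T171318 above, a minimal sketch (not the actual patch) of the general idea: resolve the statsd address once at startup and reuse it for every metric send, instead of doing a DNS lookup per metric. The hostname, port and metric format below are placeholders, not taken from the log or from logster itself.

    import socket

    # Placeholder statsd endpoint (hypothetical, for illustration only).
    STATSD_HOST = "statsd.example.org"
    STATSD_PORT = 8125

    # Resolve once and cache the sockaddr; every send_metric() call reuses it,
    # so no further DNS lookups (and no recdns dependency) per metric sent.
    family, socktype, proto, _, _statsd_addr = socket.getaddrinfo(
        STATSD_HOST, STATSD_PORT, 0, socket.SOCK_DGRAM)[0]
    _sock = socket.socket(family, socktype, proto)

    def send_metric(name, value):
        """Send a single statsd counter without re-resolving the hostname."""
        _sock.sendto(("%s:%d|c" % (name, value)).encode("ascii"), _statsd_addr)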
[13:14:12] 10Traffic, 10Diamond, 10Operations, 10monitoring, 10Prometheus-metrics-monitoring: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3474646 (10ema) [13:25:20] still pretty puzzled about yesterday's reboot of hydrogen [13:25:26] we've now got this weird situation: [13:25:30] https://grafana.wikimedia.org/dashboard/db/recdns?orgId=1 [13:27:54] the story so far is summed up here https://etherpad.wikimedia.org/p/2017-07-25-hydrogen-reboot [13:30:07] among other things, possibly unrelated but maybe otherwise interesting, it looks like in eqiad we're using different queuing disciplines for packets sent from the LVS to the two recdns [13:30:59] hydrogen: 208.80.154.50 ether 78:2b:cb:09:0c:21 C eth0 [13:31:10] chromium: 208.80.154.157 ether 78:2b:cb:08:aa:48 C eth1.1002 [13:31:24] 2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000 [13:31:31] 13: eth1.1002@eth1: mtu 1500 qdisc noqueue state UP group default qlen 1000 [13:31:48] look at eth1 for the queueing, not eth1.1002@eth1 [13:32:00] the latter is a virtual interface for vlan stuff [13:32:07] oh, then it's the same (mq) [13:32:44] is there some backstory to this I'm missing, or did it just randomly reboot? [13:33:00] we've rebooted it to pick up the kernel update [13:33:15] was it using eth1 before? [13:33:40] wait: yesterday we rebooted hydrogen, not lvs1002 [13:33:53] what you see up there is from lvs1002's arp table [13:33:58] oh, ok, right [13:34:28] ok so let's step back a bit, because some of this is explicable and some isn't, I think [13:34:59] for the hydrogen reboot, did anyone do (either through puppet, or manually with it disabled?) resolv.conf fixups on the eqiad LVSes ahead of it? [13:35:59] nope [13:36:00] the LVSes don't use the LVS'd recdns service. therefore when a recdns goes down, they suffer the glibc failover/timeout behavior, which in turn can be expected to cause various fallout for monitoring there, etc... [13:36:22] with the current pybal we do pretty much 0 dns requests on the LVS though [13:36:36] and I don't think hydrogen is the first one in resolv.conf anyways [13:36:40] ditto for hydrogen and chromium themselves [13:37:27] yeah it seems chromium is first [13:37:30] but still [13:37:31] yup [13:38:29] we can look at the initial problem (problems during the reboot period) in two ways: [13:38:56] (1) Assume pybal doesn't rely on DNS at all, and try to figure out why some kind of DNS disruption affected it, or [13:39:25] (2) Assume pybal does rely on DNS somehow, and assume that hydrogen being in resolv.conf while rebooting is enough to cause issues [13:39:39] (perhaps pybal parses resolv.conf on its own, and has a different strategy than glibc for failover?) [13:40:38] but in the past (before recent pybal+DNS changes?) we've generally hacked resolv.conf for the LVSes around a hydrogen/chromium reboot (removing the to-be-rebooted node) [13:40:39] which LVS are we thinking of at the moment? lvs1002 (with the recdns) or lvs1003 (with mw api and other services that failed)? [13:40:49] all of them [13:41:01] they're all configured like that, to not rely on LVS'd recdns for LVS [13:41:22] (at least, not local LVS'd recdns.
they do have a 3rd nameserver which is the other core DC's LVS'd recdns) [13:41:59] yes, I don't think it's pybal that failed though, but rather the mw hosts [13:42:15] mainly because we haven't seen any issues on lvs1001 [13:42:37] but also because hhvm/mysql graphs showed erratic behavior during the timeframe [13:43:23] yeah but, if pybal+DNS was the only issue, we'd expect pybal to have depooled MW, and we'd expect erratic hhvm/mysql graphs from the fallout of it depooling up to half the servers, etc [13:43:40] I'm sure you can read through my groggy typos above heh [13:43:56] true [13:44:06] do we have any direct evidence that mw servers themselves had some kind of DNS-related faults/errors? [13:44:37] elukey: do we? ^ [13:44:48] (because I wouldn't expect them to, as they use the LVS'd one. At best, they might have a DNS issue contacting the LVS'd DNS IP, in which case we're back to asking something (perhaps a different something) about pybal/LVS level issues) [13:45:16] * elukey reads [13:45:24] another data point: issues seem to have started *before* the reboot, right after depool [13:45:57] when hydrogen was still up and capable of serving traffic, that is [13:46:32] nope I didn't find any direct evidence, my suspicions were only due to the fact that prometheus apache/hhvm metrics showed a general slowdown of api/appservers (the first one more affected) [13:47:03] plus apache response time metrics on a couple of api hosts showed a neat increase at the time of the depool/reboot of hydrogen [13:47:09] which as bblack was saying could be due to the fact that pybal depooled many mw hosts [13:49:47] but then on prometheus metrics we shouldn't see all the api servers slowing down right? Only a subset of them [13:49:48] it could be that we're depooling it wrong, and it drops a bunch of in-flight requests? [13:50:15] recdns is probably a case where we really need a weight=0 -style depool, rather than removing it from the table [13:50:25] I think [13:50:58] regardless, the whole recdns-via-LVS thing is a flawed design IMHO [13:51:36] but getting past it to something better probably means doing much of the same work as anycast recdns, so we may as well solve it with that. [13:52:41] another kind of related issue with recdns reboots happened with acamar maintenance: https://phabricator.wikimedia.org/T171048 [13:52:56] we found the root cause but not the why (or sort of) [13:53:09] which is all similar to the last one I reported, too [13:54:52] now I can't find that ticket, the one where scb* services blipped on a similar recdns issue [13:55:04] yep I remember [13:55:15] but see also: [13:55:17] https://phabricator.wikimedia.org/T154759 [13:55:26] https://phabricator.wikimedia.org/T83662 [13:55:42] I don't really have it up to date in my head how much any of this may have changed with latest pybal fixes [13:57:32] https://phabricator.wikimedia.org/T103921 [13:57:42] I've just found this that might be of interest http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.UDP.html [13:57:54] see UDP timeouts (DNS) [13:58:01] ^ that's where we changed from "hacked resolv.conf to use individual recdns instead of LVS'd recdns" for just the recdns LVSes themselves, to all LVSes [13:58:11] (that phab ticket, -> https://gerrit.wikimedia.org/r/#/c/221303/2 ) [13:59:59] the bottom line is, we know that our current recdns setup is in general fragile. I think in all recent memory, we've never had a daemon or host restart on a recdns box happen without incident/fallout.
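As a concrete sketch of the resolv.conf pattern being described for the LVSes (the two local recdns boxes directly, plus the other core DC's LVS'd recdns as a third entry), it would look roughly like the following. The third IP is a placeholder, and the options line only illustrates the short-timeout/limited-retry knobs glibc exposes, not what is actually deployed:

    # eqiad LVS host /etc/resolv.conf (sketch)
    # chromium and hydrogen directly, not the LVS'd recdns IP
    nameserver 208.80.154.157
    nameserver 208.80.154.50
    # placeholder for the opposite-DC LVS'd recdns
    nameserver 192.0.2.53
    options timeout:1 attempts:2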
[14:00:37] (and that pybal, at least the old stable pybal, was part of the problem by being overly sensitive, perhaps indirectly exacerbated by glibc's failover strategy) [14:01:34] but the fallout tends to be greatly reduced if we pull the to-be-restarted one out of resolv.conf everywhere it's there manually (the local recdns boxes themselves + LVSes) [14:08:34] so I think I've survived the terrible English and format of that document (LVS-HOWTO.UDP.html) and they mention this: [14:08:38] > As UDP doesn't really have any state all LVS can do to identify such "connections" is to set up affinity based on the source and destination ip and port tuples. If my memory serves me correctly DNS quite often originates from port 53, and so if you are getting lots of requests from the same DNS server then this affinity heuristic breaks down. [14:12:23] one additional data point to all of this: I have done 5-8 reboots of recdns boxes during the last two years, but this is the first time it happened, and since the software stack on the recdns is fairly static (just minimal Debian stable package updates), it might be related to recent changes in pybal [14:13:17] ema: looking at their solutions there... some of the sysctl stuff sounds promising for the UDP case, but it's a global effect, and would change the behavior of our TCP balancing in perhaps unpredictable ways, so it's kind of a non-starter just to solve this issue [14:14:02] ema: also, some of those concerns listed there are because they're doing bi-directional LVS of UDP, and we're just doing the input side (LVS-DR), which is simpler and raises fewer issues with all of this [14:14:10] ema: but the one-packet-scheduler thing, that might help [14:14:43] our kernels + ipvsadm are new enough for it, but we might need some pybal code added to enable OPS for just UDP+recdns [14:18:18] (and that might be a viable short-term improvement?) [14:18:45] anyways, we still should try to just get away from LVS+recdns in general, it seems to be at the root of many sides of related issues [14:19:54] the quick and easy way would of course be to reconfigure all our resolv.conf to use direct nameservers, but then we're back to having (much simpler, but definitely more-widespread) issues with glibc timeout->failover across the fleet if a recdns goes down [14:20:26] but from the sound of it, it's possible that LVS depool->failover may be worse than glibc with a 1s timeout [14:21:36] that's the part I didn't expect to be problematic from yesterday's outage, an ipvs depool even for recdns should be "clean" from glibc's pov [14:22:08] though hydrogen was disabled not weight 0 (?) [14:22:11] I think it may not be, and various services handle it better or worse [14:22:52] even with a complete removal from the pool, it sounds like ipvs could continue scheduling known existing clients to the dead server for quite a while [14:23:18] (because UDP for IPVS-DR isn't very well implemented, basically) [14:23:28] right, that's my understanding as well [14:24:21] *nod* and the misrouting would have been enough for things to move to codfw, as we've seen [14:24:49] so, the one-packet-scheduler (ipvsadm -o support via pybal) should remove at least some concerns, for the clients that are using the LVS'd recdns IP [14:24:59] now, this is me being naive, but wouldn't it be enough from the LVS point of view to just really send the UDP packet to one of the random backends? If only one is there because the other has been removed, send it to the last one standing?
[14:25:23] ema: yes, but it doesn't :) [14:26:05] it thinks in terms of "connections", and for UDP (especially with no backtraffic) it has no idea what a connection is. So it balances the first packet for a given client to serverX, and sets up some state-table entry to keep doing that for that client for the next 5 minutes or whatever. and then depooling doesn't clean that table. [14:26:29] (is my rough understanding from your ipvs docs link) [14:26:52] but the one-packet-scheduler should make a new decision for every input packet, eliminating that problem [14:27:16] and now we have a traffic imbalance because god knows what internal state it has regarding chromium/hydrogen? [14:27:41] something like that, probably [14:27:54] UDP dns-rec-lb.eqiad.wikimedia.o wrr [14:27:54] -> hydrogen.wikimedia.org:domai Route 10 0 1380286 [14:27:57] -> chromium.wikimedia.org:domai Route 10 0 1914839 [14:27:59] perhaps those state-table entries keep getting refreshed as traffic flows [14:28:10] last column on the right is InActConn [14:28:20] or just the hashing-for-balancing is randomly-awful and it turned out worse this time than last time, I donno [14:29:03] but OPS will switch to a mode where every inbound UDP DNS packet uses the scheduler to make a fresh decision (so it would never choose a removed server) [14:30:39] in this scenario how do we explain the failure after the depool but before the reboot? even choosing a depooled server should have worked fine, what am I missing? [14:33:02] honestly I have no idea [14:33:28] we probably have multiple problems in play, though [14:33:58] confirmed that we're keeping track of "UDP connections": sudo ipvsadm -L --connection [14:34:12] (be careful with that command btw, it really bogs down the LVS server) [14:34:25] (and unfortunately can't be limited to a single service's connections either) [14:34:36] ok [14:35:32] you can read the data slightly-more-efficiently from cat /proc/net/ip_vs_conn , but I seem to remember with both methods there was some risk of bogging things down, maybe I'm remembering wrong, I donno [14:35:49] good that the man page doesn't mention that :) [14:37:25] well the bogging down part, I think they don't really consider how things scale for debugging commands when you have north of 500K entries in the table :) [14:38:22] well and that rough estimate of 500K is the tcp for cache_upload [14:38:34] then there's the apparent ~3.2M entries for UDP recdns :P [14:38:52] :) [14:39:44] (I guess with best practice on the client side being randomized UDP src ports, every single request gets new persistent entries? I don't know, that doesn't jive with causing issues with table entry re-use though) [14:41:23] should we restart pybal on 1002 to fix whatever weird situation dns-rec-lb is currently in? [14:41:39] well you'd probably have to wipe out the ipvs connection states too [14:41:45] (pybal restart won't generally do that) [14:42:02] and that may make it a bit more of a user-impacting restart for upload traffic than a quick pybal restart of the primary [14:42:41] as in, on lvs1002, disable puppet, stop pybal, delete all ipvs entries with [14:42:46] ipvsadm -C [14:42:49] right [14:43:07] then maybe wait a while for things to settle down from the move to lvs1005 before you start back up and break connections again [14:44:31] is the imbalance causing a real problem though?
[14:44:42] not that I'm aware of, no [14:45:35] maybe wait for the OPS thing, if we end up doing that as a mitigation for now [14:45:41] ok [14:46:06] I was just wondering if it could be the symptom of some more important problem, but +1 for postponing [14:46:20] I think pybal would just need a patch to recognize that "scheduler: ops" is different from e.g. "scheduler: sh", in that instead of "-s ", that one needs execution as "-o" [14:46:54] and then we change that setting in the hieradata just for dns_rec_udp [14:47:18] and we might have to do a pybal stop -> delete just dns_rec_udp entries from ipvs tables anyways to get the scheduler change to take effect [14:49:32] or does pybal already normally delete->readd existing services on startup? I forget [14:50:30] anyways, I would think we'd have far fewer problems with a combination of OPS scheduling, plus being careful about resolv.conf for the servers not using LVS [14:51:08] bblack: it does: -D -t 10.2.1.1:80 ; -A -t 10.2.1.1:80 -s wrr [14:51:18] which I think is just the LVSes themselves actually [14:51:32] the recdns boxes seem to use standard LVS-based resolv.conf. That may be its own problem too, somehow :) [14:51:52] oh [14:52:03] perhaps recdns should have nameservers_override set to two entries: the opposite local box, then remote LVS-based. [14:52:11] so it doesn't try to use itself during startup before it's ready? [14:52:33] that seems reasonable [14:53:10] so maybe we puppetize that (add nameservers_override for the recdns boxes themselves) [14:53:30] push through a simple pybal patch for OPS and turn it on for dns_rec_udp [14:54:10] and then document for now that in case of planned reboot (or daemon restart), or unplanned sudden outage, we need to push a puppet change to remove the to-be-dead (or dead) server from the nameservers_override of the LVSes in site.pp [14:54:35] I would think with all of that in play, at least for planned outages, the impact should be pretty minimal, unless we're missing another layer of bug somewhere [14:55:09] s/from the nameservers_override of the LVSes in site.pp/from the nameservers_override of the LVSes + recdnses in site.pp/ [14:56:07] puppetization and puppet-based removal of those manual nameservers_override is not ideal, but all of this is "temporary" anyways, until we move to something non-LVS and get out of this mess [14:57:37] we do still need to specify the scheduler though, apparently? On pybal-test2001 I've tried `ipvsadm -A -u 207.175.44.110:80 -o` and that defaults to wlc [14:57:47] assuming a world in which anycast recdns is too far off to be part of the medium-term plan (likely), our next-best bet might be to: (1) go back to non-LVS IPs in resolv.conf for all machines (e.g. in eqiad, resolv.conf would have 3x IPs for chromium, hydrogen, and 1 of the opposite-DC ones just in case?) [14:58:58] + (2) make those IPs we're using virtuals that can be moved around? possibly moved around easily. So when doing maintenance, we could move the "chromium" official recdns IP to hydrogen first. Either manually, or via puppetization of extra IPs, or better long-term design if we were staying here would be to use something like heartbeat IP failover? [14:59:39] ema: yeah I guess that makes sense, and is why it's a separate flag. the real "-s" scheduler is still making the decision per-packet. So I guess we need a new argument in the config in hieradata? [15:00:15] bblack: either that or we default to ops for udp I guess [15:00:25] right [15:00:31] recdns is our only case, right?
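A rough shell sketch of what the one-packet-scheduler variant looks like at the ipvsadm level, using the pybal-test2001 test address from above; the port-53 service and the realserver IPs are made up for illustration, and in production pybal would be issuing the equivalent commands itself:

    # define the UDP service with a real scheduler (wrr) plus one-packet scheduling
    ipvsadm -A -u 207.175.44.110:53 -s wrr --ops
    # add realservers in direct-routing (gatewaying) mode, as with our LVS-DR setup
    ipvsadm -a -u 207.175.44.110:53 -r 192.0.2.1:53 -g -w 10
    ipvsadm -a -u 207.175.44.110:53 -r 192.0.2.2:53 -g -w 10

With --ops set, ipvs makes a fresh scheduling decision for every incoming UDP packet instead of creating the per-client state-table entries discussed earlier, so a depooled realserver should stop receiving traffic immediately.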
[15:00:49] or wait, we have some logstash thing in lvs1003 right? [15:01:06] yeah that's tcp/udp and sh IIRC [15:01:15] mmh [15:01:36] I wonder how it works out there? I have no idea what their UDP protocols look like in terms of multiple correlated input packets or whatever [15:01:56] it may be they suffer the same or worse problems, but haven't noticed for lack of outage of the related backends? [15:02:12] and it could also be the case that OPS makes things worse for them, I donno [15:02:16] logstash went belly up too FTR [15:03:03] ema: so let's add a new config I guess, so we can at least not risk breaking things for logstash's UDP services [15:03:20] agreed [15:07:36] the anycast recdns design looks something like this in my head: [15:08:25] 1) Pick a private anycast subnet: 10.0.0.0/24 sounds great. Pick a couple of IPs (for future plans about local caches backending to it as discussed yesterday): 10.0.0.53 + 10.0.0.52 [15:10:14] 2) recdns boxes: add the IPs to the loopback the normal way (like LVS realserver). Set up some stuff we haven't dug into yet to ensure that each recdns machine listens on that IP and advertises it to both local routers via BGP, only when the daemon is up and working (this is kinda like a pybal variation without ipvs, but with monitoring + bgp?) [15:10:52] 3) Do some config on the routers to know about the anycast subnet and the expected advertisements and defaults in case of no advertisers anywhere globally, etc (kinda like the router side of our LVS setup) [15:11:20] 4) Switch resolv.conf on all clients (except the recdns boxes themselves, I guess, but definitely including all the LVS boxes) to use the anycast IP as the singular recdns IP [15:11:45] (5 - decom LVS recdns) [15:13:04] that doesn't solve all glibc-level problems, nor does it give us machine-local caches, but IMHO it's superior to what we have today and fixes a lot of things [15:14:12] the glibc-level problem that remains with the single-anycasted-IP situation is that we might still see random UDP loss (especially might happen during some routing transition as some recdns box falls out and back into BGP routing during a downtime, possibly?), and glibc will still deal with that loss by timing out for a full second before retry. [15:16:48] for the hard bit (part 2 above), if we assume pybal is our starting point and we wanted to modify pybal to suit this purpose, the pybal code-level feature work would be something like: [15:17:31] 1) Add some global config that puts pybal in "local" mode (as opposed to IPVS mode). This turns off all the ipvs-related code, but it can still monitor a service and talk BGP. [15:17:59] 2) Obviously, add some appropriate local monitors so that it can track that local service is working and the daemon is up. [15:19:06] 3) Give it some per-service config (which might be useful in some IPVS/LVS cases anyways) where it can withdraw/add routes from BGP based on monitored state. e.g. when depool_threshold is reached, in addition to not depooling further, it also pulls that service IP's route from BGP. [15:20:26] so then we have a model where pybal can monitor the local recdns being alive and working, and only advertises BGP for the service when it's alive, and doesn't try to mess with IPVS [15:21:22] you could go either way on whether it's easier to slice up pybal with new config like the above, or to break it up first so we can re-use the components (e.g. 
break up current pybal into a BGP talker, a monitor-daemon, and an IPVS-controller, which can be used independently or talk to each other) [15:22:09] or perhaps there's other off-the-shelf stuff for what we need with the recdns-anycast scenario without ipvs, I donno [15:23:24] e.g. you could do a really simple version of this, perhaps with "bird" as the routing daemon on the recdns boxes that advertises to the routers, and some simple post-up pre-down commands hooked into systemd that execute "birdc ..." to add/remove the route based on whether the daemon is alive. [15:23:43] (the monitoring wouldn't be quite as thorough, but still) [15:38:17] we separately still have the glibc/nss problem to deal with in any scenario (to varying degrees), and with either a sufficiently-advanced glibc/nss module, or the 1% chance that a nonlocal socket is enough to grab anycast traffic locally without an interface, we could also add machine-local caches without making failure cases worse. [15:44:24] https://github.molgen.mpg.de/git-mirror/glibc/tree/master/resolv/nss_dns <- the standard glibc nss_dns code, was peeking at that to see how hard it might be to write a better one. Looks ugly :) [15:47:15] another option would be to just patch nss_dns and/or glibc (ewww if it came to that) to allow for much shorter timeouts [15:47:26] that would at least get us past one angle of the problem [15:48:51] if it's patchable within nss, we could basically copy nss_dns to nss_dns_fto (fast timeout) as our "own" nss module with a very small patch, and still track upstream bugfixes, and not have to rebuild glibc locally. [15:50:58] heh, no wonder I couldn't find references to the timeout easily [15:51:14] there is no internal "timeout". it's just translated into a maximum number of retransmits :P [15:54:04] lol [15:54:21] err no, I donno, it's confusing. maybe "retrans" is maximum retransmit timeout value [15:57:07] yeah [15:57:29] ema: FTR I'm not sure logstash wasn't a coincidence, I briefly looked at its logs earlier and it seems the logstash process restarted on bad input [16:00:00] so this is the calculation of the seconds-of-timeout for each sent UDP request in glibc: [16:00:03] int seconds = (statp->retrans << ns); [16:00:06] if (ns > 0) [16:00:08] seconds /= statp->nscount; [16:00:11] if (seconds <= 0) seconds = 1; [16:00:23] where "ns" is the index of the nameserver from the set in resolv.conf, and statp->nscount is the total count of them [16:00:34] so assuming 3 servers and the default timeout of 5, you'd get [16:00:39] ask first server, wait 5 seconds [16:00:55] ask second server, wait (5 << 1)/3 seconds [16:01:04] ask third server, wait (5 << 2)/3 seconds [16:01:11] I think, which is kinda strange [16:01:48] so the second timeout is 3.3333 seconds, and the third 6.6666 seconds?
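To make the arithmetic above concrete, a small Python sketch of the same per-nameserver timeout computation, mirroring the glibc snippet quoted above (retrans is the resolv.conf timeout, default 5). Note that the quoted C code works on an int, so for 3 servers the values truncate to 5s, 3s and 6s rather than the fractional 3.33/6.66 guessed at above:

    def glibc_udp_timeout(retrans, nscount, ns):
        """Seconds glibc waits on the nameserver at index ns (0-based),
        per the res_send.c logic quoted above."""
        seconds = retrans << ns        # 5, 10, 20, ... for retrans=5
        if ns > 0:
            seconds //= nscount        # integer division, as in the C code
        if seconds <= 0:
            seconds = 1
        return seconds

    # 3 nameservers, default timeout of 5:
    print([glibc_udp_timeout(5, 3, ns) for ns in range(3)])   # -> [5, 3, 6]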
[16:01:59] I don't even begin to get what's going on there [16:02:37] * volans trying hard to not get nerd-sniped into this [16:03:19] lol [16:04:38] in any case, resolv.conf "timeout" value and the handling of it for sending UDP queries is all bundled up in resolv/res_init.c and resolv/res_send.c, so the libresolv part of glibc, not nss_dns [16:04:56] so there's no easy way to change the timeout behavior by copying and modifying the nss_dns module separately [16:05:36] you can obviously write a new nss_foo and just not use glibc's libresolv stuff and implement the config->behavior however you like of course [16:05:46] but that's a new scratch NSS module, it's non-trivial [16:16:03] are there existing bug reports in glibc Bugzilla to allow timeouts < 1? [16:16:24] perhaps the simplest path, if we want to write a new NSS module that was full-featured and supported all the bells and whistles one could need, would be to use a library like https://c-ares.haxx.se/ underneath the NSS module [16:16:41] glibc upstream used to be pretty abrasive in the past, but during the last ~ 5 years they've become open to suggestions etc. [16:16:57] we'd ultimately still be synchronous from the perspective of gethostbyname(), but it would handle the low-level details for us and allow custom timeout behaviors and such [16:17:42] moritzm: https://bugzilla.redhat.com/show_bug.cgi?id=1365420 [16:19:14] apparently the musl libc replacement queries servers in parallel (but still with otherwise bland timeout handling) rather than serially, that's something [16:22:06] you could do an awful hack where you install the musl C library on the hosts as a dependency, and have an nss_dns_musl linked into glibc's NSS, which just translates e.g. gethostbyname() calls into ones that call into musl's gethostbyname() :) [16:22:39] (and get the parallelism, which would at least let you specify both local servers directly in all resolv.conf and survive their restarts/crashes) [16:22:54] it seems so hacky to go down that road, though [16:39:16] how is it OK that everything going wrong with ipvsadm is a "Memory allocation problem"? [16:40:44] :) [16:41:08] ipvsadm and ipvs don't seem all that well-maintained in general [16:41:40] lots of sites rely on it, but I guess everyone who uses it is geeky enough to work through its flaws so nobody bothers to clean them up? [16:49:02] 10netops, 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3475301 (10fgiunchedi) Turns out this is more data than I expected (just slowly increasing by now) ``` $ du -hcs /var/lib/carbon/whisper/librenms/... [16:51:09] ok so I've spent most of my time trying to convince tox to run properly on my machine (it wasn't, because of a stale .pyc file eeeeeuugh) [16:51:19] this should work: https://gerrit.wikimedia.org/r/#/c/367903/ [16:56:49] oh and yesterday I found this, which seems interesting: "The Age Penalty and its Effect on Cache Performance" https://pdfs.semanticscholar.org/3e8b/ae8af3faec60668b035727145f6e12da5fcf.pdf [16:58:51] +1 [17:00:54] re: the earlier discussion about logstash udp, I think sh is required because of udp chunking so it shouldn't be affected by OPS [17:03:31] ema: on a first skim it's hard to grok it all, I'll have to re-read it in depth later.
At first glance at the top you want to say "duh, yes, as objects expire miss rates go up :P", but I think at least in that first skim they eventually get around to saying, basically "You don't want to reach age expiry throughout your cache hierarchy all at once and refresh all the way back to origin, so you should deploy some strategy of early-refresh at various layers" [17:03:53] which is kinda what the "grace-within-TTL" concepts + the added randomization there are trying to do, too [17:04:30] (and again, there's the caveat in all cases that the origin could be reliably counting down to the one real content rollover, in which case "early refresh" tends to just add pointless extra requests) [17:05:26] in other words: if the origin really does update a given URL exactly once a week, and always shows a max-age=1w + age:, until it suddenly rolls over to age:0 a week later, there's no effective way to avoid missing through the stack once a week, and early refresh doesn't help. [17:11:52] ema: I'm gonna update the general "improve DNS" ticket with a list of the short-term work at least, just a sec [17:13:41] ok [17:17:00] 10Traffic, 10Operations: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442#3475432 (10BBlack) So to recap a small part of IRC discussion today in the wake of issues with rebooting hydrogen, I think our short-term improvement plan looks like this: 1) Implement OPS (one-p... [17:23:53] I'll patch up the recdns resolv.conf thing now (or after my next meeting if it takes a while) [17:33:31] and the QOTD is: shouldn't we use ipresolve for nameservers_override instead of hardcoding the names? [17:33:43] s/the names/the IPs/ [17:34:38] uh [17:54:14] ema: sorry I got distracted, but I assume you're trolling me above :) [17:54:53] maybe not. it could work, in theory, but... it seems like asking for trouble too :) [17:55:14] (halp, dns server config borked because cannot use dns to puppetize dns server config) [17:57:30] gwicke: I'll be a min or three late, other problems ran overtime, grabbing coffee, etc [18:03:36] bblack: okay, we are on the h-o [18:28:14] 10Traffic, 10Operations, 10Mobile, 10Need-volunteer, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#3475647 (10Dzahn) [18:38:18] ^ i added traffic tag because it's about changing varnish config [18:38:57] a regex in text-frontend.vcl
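Going back to the grace-within-TTL / early-refresh discussion above, a minimal sketch of the kind of randomized early-refresh decision being described. This is purely illustrative of the concept, not the actual VCL/cache logic, and the 10% window is an arbitrary assumption:

    import random

    def refresh_early(age, ttl, early_window_fraction=0.1):
        """As an object approaches its TTL, start treating it as stale with
        increasing probability, so layered caches don't all expire (and miss
        back to the origin) at the same instant."""
        window = ttl * early_window_fraction
        if age < ttl - window:
            return False               # still comfortably fresh, serve as a hit
        # probability ramps linearly from 0 to 1 across the last part of the TTL
        return random.random() < (age - (ttl - window)) / window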