[07:17:28] <_joe_> icinga looks like a christmas tree, only we created it three days before the pope sets up his
[07:17:42] <_joe_> there are several alerts that are 2 days+ old
[07:19:44] <_joe_> I think the cr-{codfw,eqiad,eqord} alerts are related to the two telia maintenances ongoing right now
[08:28:03] there is one BFD alarm that is interesting
[08:28:29] it is 5d old, and it seems related to cr1-eqiad <-> cr3-knams
[08:30:49] from the cr1-eqiad side, 'show bfd session' shows a 'Down' for an ipv6 link-local address, that is related to xe-4/2/2.13 (transport to cr3-knams)
[08:31:06] from the cr3-knams side, I see the corresponding session up
[08:32:17] maybe a simple "clear bfd session $address" on knams side could fix?
[08:32:43] Cc: paravoid, akosiaris, mark (sorry for the broad ping)
[08:33:17] (or even on the eqiad side)
[08:37:45] _joe_: rescheduled some checks for some mw* (showing up as Unknown), the cpu microcode check runs once a day and those were recently reimaged hosts
[08:37:58] the tree is now a little better it seems :D
[08:38:00] <_joe_> elukey: why are you telling me?
[08:38:08] <_joe_> I'm not the one doing the reimages
[08:38:38] <_joe_> meaning, tell effie, mutante and rlazarus they should check for that and reschedule
[08:39:11] _joe_: it was only conversation since you cared about the alarms, nothing more
[08:39:47] I'll tell it to myself in the chan next time :)
[08:40:00] anyway, I am also taking care of stat1007 atm
[08:40:16] big home dirs
[08:40:41] those checks are usually resolved with a reboot, but the reimaged hosts should have already been rebooted by the reimage script
[08:40:45] unless it failed for some reason
[08:40:50] and the reboot didn't happen
[08:40:53] just FYI
[08:43:03] <_joe_> elukey: sorry it came out wrong :) I wanted just to point it to them :P
[08:43:25] <_joe_> volans: it's just rescheduling the check that is needed
[08:44:26] I don't get why
[08:44:34] we can also adjust the retry interval for the microcode checks so that they retry after an hour, the original thought was that if they fail due to broken hw (like on puppetmaster1001 after the Buster upgrade), they'll fail persistently, but in the light of reimage noise maybe that needs to be reconsidered
[08:44:36] the reimage downtimes the checks
[08:44:47] so it should fire after the reboot and everything is good
[08:45:22] I'd like to know when they fire timeline wise compared to the reimage process
[08:45:57] <_joe_> I think just asking people "check if it misfired, in case reschedule" is enough
[08:47:30] volans: isn't that the reimage race we sometimes have? the host record gets removed and recreated during reimage, puppet runs on icinga1001, the check is made, but the host isn't fully installed yet and thus fails, next check is pending for 24 hrs later?
[08:47:50] no when all runs smoothly
[08:48:10] the reimage issues a delayed downtime on icinga, forcing a puppet run before starting the first puppet run
[08:48:19] if all tech would always run smoothly, we'd all have different jobs :-)
[08:48:23] lol
[08:48:43] I mean as long as the downtime cookbook that the reimage starts doesn't fail (sometimes it can if too many run in parallel)
[08:48:49] for other reasons
[08:49:18] it should be covered.
[08:49:32] ack
[08:49:32] do those checks perform retries?
[08:49:38] yes, after 24 hrs
[08:49:46] no I mean the retry_interval on failure
[08:49:54] or do they alarm at the first critical
[08:50:12] *retries and retry_interval
[08:50:47] I need to look that up when I'm done merging the ffmpeg pinning, I think they alert on first failure as it's an all or nothing thing
[08:51:13] ack, if it's on first with 0 retries then there might be another race
[08:51:19] small one, but still
[08:52:39] the Icinga check itself has proven to be really useful, there were a number of hosts which for whatever reason missed the final reboot after initial installation (and as microcode is only loaded in initrd had it installed, but not loaded)
[08:58:56] elukey: I have no idea about that clear bfd command. never ran it before
[09:04:01] akosiaris: it should be something like clearing a bgp session, but I am just speculating since I am super ignorant
[09:12:03] elukey: those are silly alerts after reimaging, I failed to tell reuven about them, so it is on me
[09:12:06] will fix
[09:12:28] effie: you are a horrible person, I keep telling you that :D
[09:12:46] jokes aside, nevermind, it is really only some icinga spam
[09:12:53] you will force me to eat lots of pizza just to get over that
[09:12:59] ahahahhaha
[09:13:35] please not with pineapples
[09:20:42] lol
[09:39:35] with no ham or pepperoni. then have all the pineapples you like
[11:39:02] jynus: the bacula_exporter on backup1001 can't be scraped by prometheus, known or expected ?
[11:40:37] can't be as in... port issue, or latency. Because performance goes badly during high activity like now (migrating lots of backups to codfw)
[11:41:03] it is returning 500s
[11:41:32] I can followup in a task too if that's easier
[11:41:51] probably for the same reason
[11:42:39] I have thousands of jobs ongoing
[11:44:04] no need for a task, it is a known issue
[11:44:22] ok! thanks, yeah I see it has been ongoing since yesterday at ~17
[11:44:30] yeah
[11:44:43] it is when I configured the migrated jobs
[11:44:54] migrating
[12:07:18] effie: whoa interesting, where can I read more about those checks?
[12:09:07] rlazarus: it is actually silly, they are checks that run 1-2 times a day
[12:09:17] like the cpu mitigation check
[12:09:47] so all we do is simply reschedule the next run of this check to now+5min
[12:10:52] rlazarus: the other warning that might pop up in the icinga dashboard
[12:11:03] is about a host not being in the DSH list
[12:11:14] which means that we forgot to pool the host back :p
[12:11:30] ha, got it
[12:12:13] sorry I forgot about it yesterday
[12:13:57] I'm going to add this to https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reimage after I get some coffee, I'll run it by you
[12:16:59] rlazarus: [nit] the first is generic, the DSH is only MW-specific
[12:17:05] just FYI ;)
[12:20:37] ack, thanks
[12:42:02] godog: I think this should fix the 500s https://gerrit.wikimedia.org/r/c/operations/puppet/+/554840
[12:46:01] jynus: amazing! thanks for the quick fix
[12:46:26] sorry it happened, I just assumed it was load
[12:46:39] and certainly the state now is not a normal state
[12:47:04] but I had some trouble with the different data formats on json, bacula, mysql and prometheus
[12:47:43] some accept null, some accept nan, some accept '0000-00-00 00:00:00' and I have to create exceptions for those
[12:48:23] hehehe I like the last one
[12:59:09] yeah, there is this thing where 0000-00-00 is an invalid date
[12:59:31] but 00:00:00 and 23:59:60 are valid hours
[13:05:23] effie: https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&diff=1847248&oldid=1847043 fyi
[14:58:23] there is still a bird-related issue between eqiad and eqord, afaics the Telia maintenance on that link is over
[14:59:00] err BFD related
[15:04:26] show interface optics seems in line with other links
[15:16:31] IIUC the impact is that BGP routes in eqiad are not announced anymore to eqord (so no reachability of LVSes etc.. from eqord to eqiad)
[15:17:51] did we do some custom config to disable a link during maint or something?
[15:18:13] or disable the advert anyways, to avoid flaps?
[15:18:24] there's still that 1x BFD alert in icinga as well, I assume same thing
[15:20:48] not that I know
[15:21:16] the other BFD alert seems to be with cr3-knams, I was wondering about it as well this morning
[15:22:04] they show up as admin down, maybe waiting for an explicit "clear" command or similar from us?
[15:22:30] donno, maybe mark would know re: cr3-knams state
[15:22:33] like too many flaps, will declare this link down and wait for human input
[15:22:38] mark: ?
[15:23:16] if he's here, something in the back of my head says he may not be today
[15:24:25] in the junos docs there is a "clear bfd session etc.." command
[15:24:44] I suppose similar to clear bgp neighbor?
[15:25:34] yeah
[15:25:47] I just don't know anything about the context
[15:25:57] maybe there's a reason someone didn't clear it yet :)
[15:26:25] ah yes for knams it makes sense, for eqiad/eqord they popped up this morning after maintenance on the Telia link
[15:26:34] that should be over now
[15:26:37] right
[15:37:29] so lol of the day
[15:38:10] after doing all this resiliency-adjusting for bgp adverts for recdns, and realizing "well still, if anyone pushes a recdns config change, they need to be careful to spool it out with cumin -b 1 so they don't take down a site pair at the same time..."
[15:38:31] I then thought to look at the cron puppet agent timings to see how close they are
[15:38:54] keep in mind there's only 5 pairs of servers, one per site.
[15:38:55] 2, 2, 3? :D
[15:39:01] but our "random" cron splaying has:
[15:39:09] the authdns I remember were awful
[15:39:12] 2/5 pairs that are 1-minute apart (ulsfo and esams pairs)
[15:39:29] and 1 site (eqiad, you know, the unimportant one) has dns1001 + dns1002 with identical timing :P
[15:39:38] that's great!
[15:39:41] it's almost like it's actively trying to be awful
[15:40:09] I've always thought we should make the splay more cluster-aware at the cost of being less evenly distributed globally
[15:40:30] ideally cluster-local and cluster-remotes aware
[15:40:41] err wait, the two pairs with 1 min separation are eqsin and ulsfo, not esams, but whatever
[15:41:05] so I wrote a puppet function to do that, but it only handles certain cases well and isn't meant for "twice an hour" kinds of timings
[15:41:28] modules/wmflib/lib/puppet/parser/functions/cron_splay.rb
[15:41:47] it only supports down to "hourly"
[15:42:05] I guess I could try to hack it up for semihourly
[15:42:30] somehow I thought we were using that one for the puppet crontab too
[15:42:50] it was originally built for the weekly/semiweekly case, to manage auto-restart of failing varnish backends
[15:42:58] I don't think it saw much adoption elsewhere
[15:43:20] I used it :D
[15:43:24] but yeah
[15:43:37] both on debmonitor and cumin puppet code
[15:43:42] if it had a semihourly setting that worked, we could set the "seed" param from $cluster too
[15:43:47] for the puppet agent case
[15:44:55] should do something anyways
[15:45:12] right now the risk is someone pushes a minor recursor.conf change and lets cron push it
[15:45:37] and dns1001+dns1002 execute it simultaneously, which correctly withdraws both their routes for the pdns_recursor restart
[15:46:00] but there's a default fallback route to dns1002 statically in the router
[15:46:09] basically we don't ever want them to actually stop underlying service at the same time
[15:46:41] either that or we could remove the default fallback route, so that it would send traffic over to codfw recursors when dns100[12] stop service simultaneously.
[15:46:56] that's kind of a tricky call regardless of the cron fix
[15:52:49] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554898/
[15:52:59] blind attempt, probably even the ruby syntax is wrong or whatever :P
[15:55:30] oh look I made a code style violation, how unpredictable
[16:03:37] yeah I donno, it's still a pretty half-baked idea
[16:03:49] cron_splay also requires an explicit list of all the cluster nodes
[16:03:58] and will readjust them all when nodes are added or removed, too
[16:04:49] there's no generic way off in modules::base I think to parameterize it out as cron_splay($array_of_nodes_in_cluster_foo, 'semihourly', 'agent_for_cluster_foo')
[16:05:01] but even if there were, it would be fairly unstable, which is sometimes annoying too
[16:05:42] (you add one node and they all move around a bit, which causes some to effectively skip an interval's run as it's applied, because the cron time jumped backwards over the present minute value)
[16:07:55] I think, back when I first looked at this problem and set up cron_splay, I had noticed that fqdn_rand was kind of awful to begin with, even at its documented job
[16:08:12] and then went all the way down the rabbithole and ended up at cron_splay to use for some immediate needs
[16:08:16] bblack: i'm here
[16:08:20] cr3-knams is in production
[16:08:36] it should work pretty much exactly as cr2-knams before it did, same connections
[16:08:39] so there's a bfd link down?
[16:08:47] but we could maybe go back and just do a simpler fix to fqdn_rand itself that makes it more-random, so e.g. dns1001+dns1002 don't somehow accidentally end up on the same time (which is just showing weakness in the randomness)
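For illustration, a minimal sketch of the cluster-aware, semihourly splay idea being discussed here (not the production cron_splay.rb; the node list, seed string and hostnames are made up): sort the cluster's node list, derive a per-cluster phase from a seed string, and space the members evenly across the half hour so that two hosts in the same cluster can never land on the same minute.

    #!/usr/bin/env ruby
    # Sketch of a cluster-aware semihourly splay: members of one cluster never
    # share a cron minute, at the cost of a less even global distribution.
    # Hypothetical node list and seed; NOT the production cron_splay() function.
    # Note: adding/removing a node shifts everyone, the instability mentioned above.
    require 'digest/md5'

    def semihourly_splay(nodes, seed, fqdn)
      sorted = nodes.sort                            # stable ordering of the cluster
      idx = sorted.index(fqdn) or raise "#{fqdn} not in node list"
      phase = Digest::MD5.hexdigest(seed).hex % 30   # per-cluster offset
      step = 30.0 / sorted.length
      first = (phase + (idx * step)).floor % 30
      [first, first + 30]                            # the two cron minutes in the hour
    end

    nodes = %w[dns1001.example.net dns1002.example.net]   # placeholder FQDNs
    nodes.each do |h|
      puts "#{h}: minutes #{semihourly_splay(nodes, 'recdns-agent', h).join(',')}"
    end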
[16:09:04] mark: yeah there's a bfd down, wasn't sure enough to just clear it
[16:09:09] o
[16:09:12] i'll take a look
[16:09:19] i'm off on fridays, not thursdays
[16:09:33] but that does mean thursdays tend to be even busier ;)
[16:09:37] the BFD down alert is on cr1-eqiad
[16:10:25] either that or I misunderstood elukey earlier, but I think he meant that cr1-eqiad BFD alert was about a link to cr3-knams
[16:10:59] fe80::7a4f:9b00:d4e:8004 Down xe-4/2/2.13 0.000 2.000 3
[16:11:07] ipv4 is up, weird
[16:11:19] same thing for eqiad - eqord mark
[16:11:25] after maintenance this morning
[16:11:35] there are 3 alerts, and I was wondering why
[16:11:45] seems ipv6 only yes, forgot to mention
[16:11:45] also GTT?
[16:12:18] on eqord I can see
[16:12:19] 208.80.154.208 Down xe-0/1/5.0 0.000 2.000 3
[16:12:33] xe-0/1/5 up up Transport: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[16:12:36] that is Telia
[16:13:00] ah sorry that is ipv4, my bad
[16:13:10] well if it's down, still a problem ;)
[16:13:35] I was wondering if bfd needs to be cleared manually after some flaps
[16:13:41] when in admin down like now
[16:13:59] admin down?
[16:14:02] it normally should clear itself
[16:14:25] so show bfd session extensive shows me
[16:14:26] Remote state AdminDown, version 1
[16:14:30] on both sides
[16:14:36] (eqord eqiad I mean)
[16:15:01] the related link seems up though
[16:16:21] there is a clear bfd session X command that is tempting to test, but I avoid anything on the routers that I am not 110% sure about for obvious reasons :)
[16:16:48] let's try it
[16:16:49] go ahead :)
[16:17:19] that specific address
[16:17:21] should be
[16:17:26] clear bfd session 208.80.154.208
[16:17:28] right?
[16:17:34] yes
[16:19:16] curious
[16:19:21] so BFD rides the OSPF protocol
[16:19:23] but ospf is up
[16:19:25] elukey@cr2-eqord> clear bfd session 208.80.154.208
[16:19:25] ^
[16:19:25] syntax error, expecting .
[16:19:34] didn't expect that :D
[16:19:41] clear bfd session address 208.80.154.208
[16:19:55] bingo
[16:20:03] now it is up
[16:20:04] now it's up
[16:20:04] weird
[16:20:54] mark: ok if I try the same for eqiad - knams?
[16:20:59] yes :)
[16:21:17] maybe we can make a runbook out of this
[16:21:24] first diagnose if the link is up, ospf on that link is up
[16:21:32] and if bfd on that ospf session is not up, let's clear it
[16:21:39] yes definitely
[16:23:36] interestingly didn't work for eqiad knams
[16:23:47] check if underlying ospf is up
[16:23:55] it's ipv6 so that would be ospf3
[16:24:17] show ospf3 neighbor
[16:24:29] 91.198.174.246 xe-4/2/2.13 Full 128 31
[16:24:29] Neighbor-address fe80::7a4f:9b00:d4e:8004
[16:24:44] yeah looks up
[16:24:45] weird
[16:25:48] oh hi, just saw all this
[16:26:01] arzhel noticed a similar problem where he couldn't figure out why a BFD was down, and a clear fixed it
[16:26:08] yeah
[16:26:09] not sure if it was for the same link(s) though
[16:26:10] so that was one
[16:26:13] but now a clear doesn't fix it
[16:28:02] it's up now though luca?
[16:28:08] ok very weird, it looks up from knams side
[16:28:11] ahahah
[16:29:03] it is not up only on cr1-eqiad
[16:29:08] but up on cr3-knams
[16:29:17] weird
[16:29:22] i am going into a meeting now
[16:29:24] keep going! :)
[16:29:37] ok rebooting router.. :P
[16:30:47] mmm I wonder if I just have to clear knams side
[16:31:35] worked!
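Collected from the session above, the diagnose-then-clear sequence looks roughly like this in Junos operational mode (the trailing # notes are annotations, not part of the commands):

    show bfd session                        # find the Down session and its interface
    show bfd session extensive              # "Remote state AdminDown" = it won't recover by itself
    show ospf3 neighbor                     # confirm OSPF itself is up (ospf3 for the IPv6 case)
    clear bfd session address <address>     # the "address" keyword is required, a bare IP is a syntax error

Run the clear on whichever side still shows the session Down; as in the knams case above, it may only come back after clearing on the far side.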
[16:31:51] no up on both sides
[16:31:55] *now
[16:32:57] weird
[16:34:04] the "admin state down" looks to me as "I am not going to do more, please clear me if you want me to attempt again"
[16:47:57] mark - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status
[16:48:08] added some basic info, hope it is clear/correct
[17:08:20] back on the fqdn_rand topic
[17:08:39] I actually dug through fqdn_rand's implementation (as it would be on our puppetmaster, same versions, etc)
[17:08:53] I can repro the behavior with a short ruby script and get the same values for dns1001 and dns1002
[17:09:24] but I can't actually find a glaring fault in what any of the underlying bits (fqdn_rand itself or ruby's Random class, etc) are doing
[17:09:37] it appears to be just really bad luck, and of course only 30 values to choose from
[17:11:44] https://phabricator.wikimedia.org/P9832
[17:12:15] ^ that's the short repro of what our puppetmaster/puppet/ruby ends up really doing for that case, giving the same output value "29" that production sees in the crontab, for both hostnames
[17:12:36] it's very suspicious that the fqdn's are off by one shifted bit and the results come out the same, but it really is just bad luck apparently
[17:13:01] (the seeds are quite different, etc.. if you debug-print some of the intermediate steps)
[17:13:47] so I mean, we could "fix" it by changing the seed, but it will probably create as many dumb cases as it fixes
[17:18:12] but really it's 1399 servers + 30 possible cron minute values -> birthday paradox explosion
[17:21:21] heheh the speed at which we used to be spammed by puppet failure alerts!
[17:25:36] any other scheme I can think of which has no knowledge of other node names, isn't really much better than a glorified re-seed
[17:25:50] cron_splay just does better because it has a list of cluster peers to work with
[17:26:04] but, yeah, not really a great way to do it for the general case
[17:26:28] in general I think it's pretty hard to solve this problem perfectly without some sort of centralization
[17:27:18] or bumping the interval
[17:27:24] but we've come to rely on it I think
[17:27:41] who knows how many things depend on the expectation that agent runs will fixup something or start something in at most 30 minutes
[17:28:40] but if you dramatically changed that (say the random runs were like twice a day), the odds of random collision would drop off accordingly.
[17:28:52] and the load on the puppetmasters to boot
[17:29:07] but I can see all kinds of arguments against long automatic agent run times
[17:29:43] (e.g. breakage at a distance. I merge change X and think it's fine, or think I've tested all the nodes it affects, but 6 hours later after I'm long gone, some supposedly-unrelated node finally runs its agent and breaks from my change)
[17:32:40] I was hoping I would find something dumb inside puppet or ruby
[17:33:03] e.g. the seed was being passed in as a bigint and then truncated to a 32-bit int somewhere inside and damaging things or whatever.
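For reference, a standalone approximation of the lookup described above. This is not the P9832 paste, and the exact digest/seed recipe differs between Puppet versions, so treat the fqdn_rand part as an assumption; the collision arithmetic at the end holds regardless of the hash.

    #!/usr/bin/env ruby
    # Rough approximation of what fqdn_rand(30) boils down to: hash the FQDN
    # (plus any caller-supplied seed) and use it to seed Ruby's Random.
    # Illustrative only -- the real Puppet implementation details vary by version.
    require 'digest/md5'

    def fqdn_rand_ish(fqdn, max, seed = '')
      n = Digest::MD5.hexdigest([fqdn, max, seed].join(':')).hex
      Random.new(n).rand(max)
    end

    %w[dns1001.example.net dns1002.example.net].each do |h|   # placeholder FQDNs
      puts "#{h} -> minute #{fqdn_rand_ish(h, 30)}"
    end

    # Whatever the hash, with only 30 slots any given pair of hosts collides
    # with probability 1/30, and across ~1400 hosts collisions are guaranteed:
    n = 1399
    puts "expected colliding pairs: %.0f" % (n * (n - 1) / 2.0 / 30)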
[17:33:11] but I looked all over, it all works pretty sane
[18:40:25] hi all i have written an irssi plugin which uses the shorten url api to automatically grab a short url if the url length is above a specific limit (currently 54) and i can get it to send the result to the channel (example to follow, hopefully) i wondered what people thought about this a) as just me using it, could be an abuse of the service not sure b) having it enabled to send the result to this
[18:40:29] testing a plugin please ignore https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/
[18:40:31] channel c) adding the functionality to some bot ...
[18:40:34] ... and adding the bot in all channels
[18:40:36] shorturl: https://w.wiki/DLn
[18:41:36] second test a plugin please ignore https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/foobar
[18:41:36] shorturl: https://w.wiki/DLo
[18:43:04] <_joe_> jbond42: I'm not sure we should use the url shortener for our urls, yep
[18:43:26] <_joe_> I've seen others doing it, but I've never looked at the ToS
[18:44:19] yes thats the main thing i was unsure of
[18:45:04] <_joe_> https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener#Security uhm
[18:45:58] <_joe_> This feature has been designed to provide to the Wikimedia editors a secure way to shorten and share links.
[18:46:00] <_joe_> This feature has been designed to provide to the Wikimedia editors a secure way to shorten and share links.
[18:46:08] <_joe_> err sorry for the double paste
[18:46:27] <_joe_> it seems we're abusing it a bit, at least in spirit
[18:47:38] yes i definitely can't class myself as an editor (have unloaded it for now)
[18:55:18] I don't know, we make edits on wikitech, meta, and mediawikiwiki
[18:55:21] jbond42: Have you never edited MediaWiki.org?
[18:55:42] oh i have one edit on there thats true and a few on wikitech
[18:55:43] Wikitech is more parlous as a justification.
[18:55:52] But one edit makes you an editor, so welcome aboard. ;-)
[18:56:02] :) thanks
[19:05:48] <_joe_> cdanis: sure, we're editors, but what I gather is
[19:06:45] <_joe_> it's fair use if we shorten links we then use in wikitech. Not 100% sure it's ok to convert irc links
[19:07:14] <_joe_> if jbond42 didn't ask, I wouldn't have thought about this tbh
[19:08:27] to clarify my main use case i am trying to shorten all the grafana/logstash etc links
[19:11:35] it seems like it may be an abuse going from the part _joe_ highlighted but i also think that it is a benefit to the foundation. As an aside to this does anyone know if the short urls expire and if you can set the expiry time. for most the links shared here an expiry of 24 hours with an option to make permanent would probably suffice (and yes i know feature creep)
[19:13:57] No, there's no expiry
[19:14:17] ack, thanks Reedy
[19:14:58] Noting it won't work for wikitech urls though
[19:15:12] Oh it will
[19:15:16] Ignore me, I thought it was a different domain
[19:15:27] *.wikimedia.org is whitelisted so ofc it works
[19:16:11] yes thats why logstash, grafana etc also work which may be unintentional?
[19:16:45] I guess it was "easier" to do *.wikimedia.org than whitelisting every wiki etc
[19:16:53] I suspect it's fairly intentional
[19:17:58] can we ask for slightly longer urls?
[19:18:10] for this use case it seems a waste to use the 3 or 4 chars ones
[19:21:26] volans: They're sequential.
[19:21:37] So wait a year and they won't be 3 characters long. :-)
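For context, the API call behind such a plugin is roughly the following; the action and field names are taken from the UrlShortener extension's api.php module, but verify them against the live API before depending on them, and note that only whitelisted domains (e.g. *.wikimedia.org, as mentioned above) are accepted.

    #!/usr/bin/env ruby
    # Minimal sketch of requesting a short URL from the Wikimedia URL shortener,
    # roughly what the irssi plugin above does for long URLs.
    require 'net/http'
    require 'json'
    require 'uri'

    def shorten(url)
      api = URI('https://meta.wikimedia.org/w/api.php')
      res = Net::HTTP.post_form(api, 'action' => 'shortenurl',
                                     'format' => 'json',
                                     'url'    => url)
      JSON.parse(res.body).dig('shortenurl', 'shorturl')
    end

    puts shorten('https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/')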
[19:21:40] that's what I thought :(
[19:22:16] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=authdns1001&service=Confd+template+for+%2Fvar%2Flib%2Fgdnsd%2Fdiscovery-kibana.state
[19:22:42] ^ anyone have a clue about this? There is just one stale template on the authdns servers, with an ema ack about esams repool back on Oct 22, claiming a warning for 4 days
[19:23:04] it's just weird and inexplicable at a glance
[19:23:08] bblack: we used to have those during switchdcs
[19:23:12] and the cookbook clears them out
[19:23:17] what makes them "stale"?
[19:23:23] let me see if there was a comment on the exact why
[19:23:35] I need to reconstruct that thought
[19:23:48] I just don't even get what it's alerting on. The file has the same contents on all 12x authdns servers, but is only warning of staleness on 2 of them
[19:24:06] (the other 10 being fresh installs since after whatever this was)
[19:24:23] bblack: check for /var/run/confd-template/.discovery-{{{records}}}.state*.err
[19:24:36] that's what the cookbook removes
[19:24:45] with the comment: Removing stale confd files generated in phase 5
[19:25:21] that I think is outdated as the last phase5 was invert redis replication, LOL
[19:26:08] so the check could use improving, basically, in a few ways :)
[19:26:09] bblack: ok I think I have it
[19:26:19] see
[19:26:20] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py#25
[19:26:41] if a service is a/p and you pool the passive
[19:26:59] it's a noop but generates an error, then when you depool one it actually applies the change
[19:27:06] if I'm reading it correctly
[19:27:13] so if something similar happened for $reasons
[19:27:15] it might explain it
[19:27:19] but when that happens, the error output gets left behind
[19:27:27] until it's manually cleared
[19:27:33] correct, AFAIK
[19:27:50] so yeah (a) the check should call out that error path it's noting, not the path of the output file which is fine
[19:27:52] don't know if it was by design or as a by-product
[19:28:20] (b) it's probably by-design because a/p are never supposed to go suddenly-a/a, I'm curious why we put them in an a/a state via cookbook and clean up after it ...
[19:28:40] anyways, can fix!
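In other words, the manual fix for the stale-template warning is just removing the leftover .err files that the switchdc cookbook normally cleans up; something along these lines (path pattern as quoted above):

    #!/usr/bin/env ruby
    # Remove leftover confd error markers for the discovery state templates;
    # these are what keep the "Confd template" icinga check in WARNING.
    Dir.glob('/var/run/confd-template/.discovery-*.state*.err').each do |f|
      puts "removing #{f}"
      File.delete(f)
    end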
[19:29:13] just wanted to make sure I understood before I go reimaging the box, why it had a persistent icinga warning already
[19:29:40] I think to avoid any downtime
[19:29:53] we're in RO mode at that time
[19:30:05] so either DC is ok and is better than downtime
[19:30:35] design should probably be changed there somewhere
[19:30:37] and if you ask me for an atomic switch of the two, I don't have the answer, need to check the code
[19:30:55] *ask me why we don't do
[19:31:00] or we decide it's just ok and stop erroring on it in the template
[19:31:24] either way the ones that we're calling a/p at the dns layer, will always return one result (if both are active, it's just the first one) to the globe
[19:31:53] whereas if we just used the a/a style for them, during that interim period they'd start splitting traffic to both destinations
[19:32:14] that in this specific case would also be fine btw I think
[19:32:50] but might not be true in other smaller failover scenarios
[19:33:04] anyways, I need to start my reimage before one of these ssh sessions times out or whatever, then I'll loop back and look at the confd template mess maybe
[19:44:14] so I was poking more at systemd-resolved
[19:45:00] mostly it seems like we could use it best by (a) leaving our existing resolv.conf alone -> (b) start running the systemd-resolved daemon by default -> (c) add nss-resolve to nsswitch between hosts and dns as in https://www.freedesktop.org/software/systemd/man/nss-resolve.html
[19:45:21] the part that's still bugging me a bit, is the non-traditional dns bits
[19:45:50] https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html
[19:46:07] well this section more-specifically: https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html#Protocols%20and%20Routing
[19:46:28] all the crap in there of the nature of:
[19:46:32] "Single-label names are routed to all local interfaces capable of IP multicasting, using the LLMNR protocol. [...]"
[19:47:00] there's a real possibility that some unqualified lookups that work "ok" today with resolv.conf search setting, now become hilarious multicast storms or whatever
[19:47:10] and I don't see easy config to disable all that other shit and just do plain DNS
[19:48:07] maybe if you parse all the words in the related documentation, eventually you find out that all those strange things are opt-in at the level of network interface config, I donno
[19:48:15] sounds great for a laptop though! :P
[19:49:58] regardless, it's probably tractable
[19:50:18] (if nothing else, we can go try to kill the few weird cases like prometheus that are using single-label lookups in important places)
[19:51:03] and then we turn down the TTLs on all internal names to something small enough that we don't care about purging, but still reasonable for operational corner cases / blips
[19:51:57] if we were capping just the per-host caches I'd still be saying 5s, but maybe since we're capping the central ones too it might be safer to first gradually work them down to say 15s and then live with that for a month or too before getting too crazy.
[19:52:09] s/too/two/
[19:52:24] we can't use the TTL-turndown hack on public names so much, though
[19:52:39] but most internal stuff, and certainly .svc. and discovery stuff, is all internal names
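For the record, step (c) of the plan sketched at 19:45 amounts to a one-line nsswitch.conf change of roughly this shape; the exact ordering and action string to use is the one documented in the nss-resolve man page linked above, so treat this as an assumption rather than a tested config:

    # /etc/nsswitch.conf -- hosts line with nss-resolve ahead of plain dns;
    # [!UNAVAIL=return] makes lookups fall through to dns only when
    # systemd-resolved isn't running.
    hosts: files resolve [!UNAVAIL=return] dns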
[19:55:05] there's some edge cases with the internal/external split though.
[19:55:15] chiefly, reverse dns for our unrouted ipv6 spaces
[19:55:50] but I think even for the public IPs, revdns lookups aren't high volume or in any critical service path
[19:56:02] so it's probably ok to just turn them down for those unrouted spaces, too
[19:56:26] (most of which will come from netbox soon anyways)
[20:00:10] yes i also got a bit worried seeing the LLMNR stuff in there, especially as a lot of things here rely on the search list
[20:00:59] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/authdns/files/authdns-check-active-passive
[20:01:10] volans: this is the check that breaks those discovery statefiles ^
[20:01:32] it's run as a confd post-check after it templates the output, and it errors out if it sees an A/P with 2xUP
[20:01:48] so it's quite intentional, on some level
[20:02:12] it just seems "wrong" to me that we have a check like this calling it an illegal state, yet it's a state we use as part of cookbook'd processes
[20:02:36] (and further, that it doesn't self-resolve in icinga terms because failing that check causes a persistent .err file that is being watched-for)
[20:25:31] agree that is not ideal
[20:26:05] * volans have to go
[21:14:35] <_joe_> volans: on monday, transportation strike!
[21:28:36] i have the same problem with confd templates :) thanks for pointing me here
[21:29:14] reads backlog to see how it was fixed :)
[21:30:43] check for /var/run/confd-template/.discovery-{{{records}}}.state*.err
[21:30:47] i think that's the relevant piece
[21:31:13] thanks!
[21:31:40] yea, it's .git-ssh for me. but there are a bunch of .err files
[21:34:46] deleting them fixed it right away. thank you. saved my day
[21:34:54] and i think i ran into it before :)
[21:37:24] now if only the service itself would work.. but that seems another issue
[22:17:20] <_joe_> volans: not my fault this time!
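As a closing note on the authdns-check-active-passive behaviour discussed at 20:01, the logic amounts to something like the sketch below. This is a paraphrase, not the actual script from operations/puppet, and the "dc => STATE" line format is assumed purely for illustration.

    #!/usr/bin/env ruby
    # Paraphrase of the confd post-check described above: for an active/passive
    # discovery record, more than one datacenter marked UP is treated as an error,
    # the template's .err file sticks around, and the icinga check goes WARNING.
    # The "dc => UP/DOWN" state-file format here is an assumption.
    statefile = ARGV.fetch(0)
    up = File.readlines(statefile).grep(/=>\s*UP\b/).length

    if up > 1
      warn "#{statefile}: #{up} datacenters UP for an active/passive record"
      exit 1
    end
    exit 0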