[04:18:32] I will be restarting s2 and s8 (wikidata) masters in 40 minutes [04:45:59] around [07:25:32] Hello, I have prepared an update of the Doxygen debian package and could use a final review / build :] https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/589416/AutoMerge..6 [07:26:40] it is not urgent but would be nice to have [10:43:35] new etherpad style? [10:47:02] 10:14 < akosiaris> this time around etherpad upgrade has gone fine. we also have a new skin! [10:47:05] yes, too many channels [10:52:28] yeah, sent out an email as well about it [11:03:17] I have a case of https://xkcd.com/1172/ with the new etherpad skin, is there a way to switch back to the old one? :-) [11:06:48] https://addons.mozilla.org/en-US/firefox/addon/styl-us/ ? [11:09:27] apparently there is an option to enable some kind of "skin builder", perhaps it is disabled in our installation? [11:47:22] arturo: I wouldn't waste time tbh. I would advise just pretending it's something new and not the etherpad you know. Upstream is per their admission limited on humanpower right now so you'd essentially be on your own [11:48:51] I'm interested in the wide-screen editor. Apparently they advertise the feature on their website, so I guess it's not something new to develop but something to enable somewhere? [11:50:08] could be, but note that this would be global, not per user as we don't really have the concept of users nor prefs in our installation [11:51:52] we could say something like "skinVariants": "full-width-editor foo bar baz" etc in settings.json, but since we don't have the capability to do a "User A: I want it this way, User B: that way", let's not open that can of worms [11:52:19] I'd vote for a larger "page" if there was a poll for it fwiw [11:53:43] if being the key word here. If someone feels like spending the time creating that poll, gathering input, processing it and coming up with the recommendation, can they also take etherpad ownership from me while at it? [11:57:25] I vote for volans being the new etherpad owner [11:57:28] :-D [11:57:36] seconded! [11:57:37] :P [11:57:47] we are in the majority here :-P [11:59:07] lol, not today jynus... [11:59:24] lol at class name "migrateDirtyDBtoRealDB" [12:00:35] ahaha [12:30:04] Hello, I have prepared an update of the Doxygen debian package and could use a final review / build :] https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/589416/AutoMerge..6 [12:30:20] It is not urgent but I could use someone to eventually commit to it ;] [12:58:46] akosiaris: I don't think you need a poll. After all, the migration from the old skin to the new one didn't require one, right?
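(For reference, the settings.json tweak floated above might look roughly like the sketch below. This is a hedged illustration based on the upstream Etherpad colibris skin and its documented skinVariants option, not our actual configuration; the variant names besides full-width-editor are illustrative. As noted in the discussion, any such change would apply instance-wide, since the installation has no per-user preferences.)

```json
{
  "skinName": "colibris",
  "skinVariants": "super-light-toolbar super-light-editor full-width-editor"
}
```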
[13:04:10] dbctl, kormat: it clearly said "enter y" [13:04:58] <_joe_> and it's punishing you for being verbose [13:05:05] 'k' [13:05:19] kormat: patches welcome ;) [13:05:32] <_joe_> I think cdanis must have read my "adversarial UI" notes [13:05:40] <_joe_> cdanis: lol [13:05:45] I don't think I wrote that part of it originally [13:06:11] <_joe_> yeah it looks like what I had half-assedly written when first building it I guess [13:06:12] of course it will be very difficult to tell without looking at the original long-running patchset in gerrit ;) [13:06:42] kormat: I'll work on this some today, for now I suggest you sooth yourself with omfgcats.com [13:07:03] s/sooth/&e/ [13:07:46] <_joe_> heads up: I'm disabling puppet on all appservers in codfw to convert them to envoy [13:08:53] <_joe_> kormat: the first version of confctl would ask you to confirm any vaguely dangerous action by typing "Yes, I am sure of what I am doing." [13:08:56] <_joe_> verbatim [13:10:10] _joe_: a past tool I used had a bunch of dangerous might-nuke-your-data options hidden behind an undocumented `-X` (expert) flag, and at least one of them made you type back that you had asked a particular developer whether what you were about to do was okay [13:10:13] <_joe_> that's what I call "adversarial UI" :P [13:10:19] haha [13:10:35] <_joe_> cdanis: sounds appropriate :) [13:10:42] I had to invoke that one a few times 😬 [13:11:54] <_joe_> kormat: the rest of the story is someone complained about it, so I added a switch https://gerrit.wikimedia.org/r/#/c/operations/software/conftool/+/272704/ [13:13:12] ahhaha. beautiful [13:26:59] arturo: there was no migration from the old skin to the new skin. There was just an upgrade of the version of the software. Which was done for security and maintainability reasons and not for the skin itself. The skin was just bundled with those, at the discretion of upstream. And more or less that's the level of support I can offer for etherpad. Anything more would require providing a different level of support I cannot offer, as I don't have the time (or will) to delve into deviations from the software as provided by upstream. That includes user preferences such as skins but also things like installing etherpad plugins or allowing the creation of "users", which I've said no to in the past. [13:27:26] I would go even further and point out that I am a total SPOF for this service and with limited time, and as such I am fully unable to have "product" discussions about it, e.g. functionalities, looks, level of support, etc. [13:28:17] tbh, I am at times wondering whether I should be abandoning it and just filing a code stewardship request so that someone else adopts it. And if no one shows up, just undeploy it and stop offering it [13:31:07] * kormat wonders if the dbas would volunteer to take it over just to kill it ;) [13:31:23] kormat: you're a DBA 🤔 [13:31:36] i very specifically ain't. :) [13:31:38] pwned [13:33:30] akosiaris: agreed. Thanks for maintaining it BTW :-) [13:45:38] is there anywhere i can easily put a small shell script such that it can be fetched by an internal server (for testing purposes?) [13:45:47] (over http) [13:46:42] ah. people.wm.o works. [13:47:36] kormat: I'm not sure if that's more or less evil than an approach I've taken in the past [13:47:50] knowing you, less evil :) [13:48:02] kormat: are you in the curl | sudo bash trend too?
:-P [13:50:50] haha [13:55:03] <_joe_> cdanis: less [13:55:07] what [13:55:11] <_joe_> but I prefer your method ofc [13:55:14] mine is auditable and preserves everything in logs! [13:55:19] <_joe_> ahahahah [13:55:22] i'm insulted [13:56:01] another approach could be to put it into puppet/modules/admin/files/home/kormat/ [13:56:40] oh, over http. nevermind [13:56:55] ls [13:57:19] kormat: bin cookbook_testing notes.txt repl run tendril tmp [13:58:23] :) [14:33:07] herron,shdubsh o/ when you have a moment, can you tell me what you think about T252773 ? [14:33:07] T252773: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 [14:35:05] * elukey afk for ~20 mins [14:35:53] elukey: doesn't look particularly concerning. if the jessie burrow exporter package works on buster, that part ends up really simple [14:38:23] in a few minutes I'll be disabling puppet on all physical hosts so I can do a quick canary of https://gerrit.wikimedia.org/r/c/operations/puppet/+/549683 on a handful of hosts before letting a wider deployment happen [14:58:58] _joe_: you might be pleased to know that the systemd::timer fix we did together worked, and the `OnUnitInactiveSec` timer self-started on all those hosts [14:59:10] <_joe_> great [14:59:26] I think that's a feature we should use more, btw, instead of just OnCalendar [14:59:30] <_joe_> so we can switch a few of our timers to be "every 2 hours" [14:59:36] <_joe_> yes [14:59:50] <_joe_> for most stuff we really want "every N units of time" [14:59:52] maybe worth filing a task or emailing ops@ [15:08:36] shdubsh: yes I think it shouldn't be problematic.. is something that you guys want to take care or do you prefer my team handling it? [15:11:21] elukey: either way is ok. let us know when the new host is built. happy to help [15:12:30] shdubsh: I think that we should decide the ownership of the hosts, it was Analytics originally but now it may be something that observability can own? [15:13:09] I am not concerned about the upcoming maintenance but in general who should be the main poc etc.. [15:14:52] elukey: maybe. this is the first I've heard of it and the other folks on the team ought to weigh in. [15:17:25] cdanis, _joe_: oh perfect, I'm most of the way through the maintenance hosts scripts but I still have the "every three minutes" wikidata one, or whatever it is [15:17:35] shdubsh: of course, let me know when you guys have discussed it :) [15:17:39] that is perfect for this yeah [15:17:53] if I understand it correctly it's more fractally scary than just the "every three minutes" thing but I can at least make that part incrementally better [15:18:06] I think OnUnitInactiveSec is even better for crons that run on every host, where you don't care about synchronization [15:18:12] (and in fact would like to avoid) [15:18:50] <_joe_> rzl: well that script won't work as you want it [15:19:04] <_joe_> because systemd timers don't overlap [15:19:15] <_joe_> and those crons are designed to overlap execution [15:19:17] ... [15:19:26] <_joe_> didn't you see the comment somewhere? 
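(Stepping back to the OnCalendar vs OnUnitInactiveSec point above: a minimal hand-written .timer unit for the "every N units of time" pattern might look like the sketch below. This is a generic illustration of the systemd directives being discussed, not what the puppet systemd::timer define actually renders; the unit name and intervals are placeholders.)

```ini
# example.timer -- illustrative only
[Unit]
Description=Run example.service every 2 hours, relative to its last completion

[Timer]
# Ensures a first activation after boot; subsequent runs are scheduled by the
# relative trigger below.
OnBootSec=5min
# Re-run 2 hours after the service last finished, instead of a fixed
# OnCalendar wall-clock schedule; runs never overlap, and different hosts
# naturally drift apart instead of all firing at the same moment.
OnUnitInactiveSec=2h

[Install]
WantedBy=timers.target
```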
[15:19:39] no I did, I just hadn't fully internalized HOW fractally scary it was [15:19:42] <_joe_> they're achieving parallelism by running a script every three minutes [15:19:46] https://media.giphy.com/media/jUwpNzg9IcyrK/giphy.gif [15:19:50] maybe I'll just make it a few different identical systemd timers then [15:19:52] <_joe_> and then killing it after 6 or so [15:20:11] <_joe_> I've been pestering WMDE about it since forever [15:20:37] <_joe_> I'll leave addshore and Amir1 the honour to properly introduce you to the logic though [15:21:07] * Amir1 appears [15:21:20] 👻 [15:21:39] Why have I been summoned? :D [15:22:51] _joe_: is it about the dispatching? [15:23:00] kormat: :P [15:23:04] <_joe_> yes Amir1 [15:23:18] yeah, that can of worms [15:23:25] <_joe_> so, rzl was innocently thinking it was a short-lived script that runs every 3 minutes [15:23:45] I specifically said it was worse than that [15:23:57] <_joe_> s/thinking/hoping/ [15:23:59] <_joe_> :P [15:24:00] you can accuse me of a lot of things here but innocence is a bridge too far [15:24:11] lol [15:24:35] I hope we can get to it since we got rid of wb_terms and put it to rest (for good) [15:25:10] <_joe_> rzl: but basically, I think their goal is to run N instances of that script in parallel [15:25:26] the most critical part of the connection between wikidata and all other wikis is a cronjob that is being run every three minutes [15:25:26] <_joe_> why it's done like that, I have no idea, and I refused to improve it myself [15:26:11] _joe_: yup, technically it's being run forever but it has a timeout; combined with the timeout, you get the N parallel scripts [15:26:37] I think it's every three minutes with the timeout of 12 minutes = 4 parallel instances [15:28:09] <_joe_> Amir1: is there a reason why we run it with a timeout besides obtaining the parallelism? [15:28:19] <_joe_> it's leaking memory or anything? [15:29:17] It might be leaking memory, I wasn't around when it was made like this, it's a pretty old script [15:29:32] but it has to do it for LOTS of wikis [15:29:57] so it latches onto a wiki that is lagging behind, injects jobs, takes another wiki, and does it forever [15:30:54] I have a naive higher-level question, isn't this sort of thing what the jobqueue is for? [15:31:43] (for anyone following at home, we're talking about https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#13) [15:32:21] cdanis: yes but the problem is the flow of edits is pretty large and it requires lots of deduplication (and back then, there wasn't deduplication in the job queue) [15:32:43] IIRC, it used the jobqueue and brought the whole thing down so they moved to this... hack [15:33:38] because there might be hundreds of edits waiting to be dispatched to enwiki for example, it needs to be batched and deduplicated [15:33:52] it's a many to many relation [15:35:10] (it needs to inject jobs for different pages, an update on the item of Barack Obama might affect and require jobs being inserted on the article about him and the articles of his family members as well) [15:36:29] <_joe_> Amir1: I think this goes way deeper.
We should have some type of dependency tracking service, even extremely simple, that would help with this [15:36:54] <_joe_> in its absence, i think the current model is acceptable [15:36:59] I've seen such systems before, at larger scale, although they relied on lots of infrastructure :) [15:37:04] <_joe_> running as parallel scripts [15:37:35] <_joe_> but having some sort of coordinator spawn 4 workers ain't hard :P [15:38:09] <_joe_> basically we're simulating a service with crons and timeouts [15:39:07] I think it needs its own host, it has been affecting or affected by the stress on mwmaint several times and definitely a systemd service would work for now I assume [15:40:45] Such legacy [15:40:58] We might run into memory leaks :D [15:41:06] Yes the jobqueue is for this [15:41:11] addshore: o/ I'm not here, don't tell my boss [15:41:32] XD [15:41:43] (I'm on "vacation") [15:41:55] In terms of "burning things" it's quite a way down the priority list now [15:42:27] yeah, the only reason I'm looking at it right now is it's one of the last maintenance scripts that's left as a cronjob rather than a mediawiki::periodic_job [15:42:34] Is there evidence of memory leaks? We can easily make then run minuitely, and thus run for shorter periods of time! [15:42:35] (which are systemd timers and DC-switchover awayre) [15:42:39] *aware [15:42:50] How long can jobs run for? [15:43:31] AIUI the important distinction about mediawiki::periodic_job / systemd timers is that maximum one instance of a given name is running at one time [15:44:02] right -- although it wouldn't be heard to run dispatcher1, dispatcher2, etc [15:44:08] *hard [15:44:33] so that's just an implementation detail, it's not a real limitation [15:44:34] Indeed [15:44:51] it means if they have problems we'll get N alerts instead of 1, but currently we get 0, so [15:46:18] Hmm, alerts as in dispatching is broken? [15:46:37] We do already get some alerts when things are not right there :) [15:47:16] You can probably nerd snipe me into wrapping the logic that is in the maint script into a job [15:47:21] oh, okay -- yeah, one of the differences between systemd timers and cronjobs is that systemd timers generate icinga alerts when they exit nonzero status, for example [15:47:25] rzl: I'm going to point you to the `create_resources()` Puppet builtin and then https://media.giphy.com/media/jUwpNzg9IcyrK/giphy.gif again [15:47:45] cdanis: I made my willpower save, sorry [15:48:19] Just thought I catch up on the context we would want to do this to make datacenter switchover easier? [15:48:22] addshore: that would save me the trouble of thinking about this and I'd love it [15:48:25] As the main driving factor? [15:48:25] yeah, I was about to say [15:48:44] one thing the job should do is run in both DCs simultaneously, check etcd for which one is active, and do nothing in the other one [15:48:56] (which the maintenance script wrapper does, for those) [15:49:18] the idea is that we'll be able to skip the puppet run on the maintenance hosts when we switch DCs [15:49:23] Hmm, run in both DC's? What is triggering this job? (I think I missed this context) [15:49:39] sec, let me find you a link [15:49:40] We can probably opportunistically schedule them post edit [15:50:05] / try to schedule them and have them dedupe etc? 
Hmm [15:51:13] so, in the Old World, the active DC's maintenance host had a cronjob for each script, and the passive DC had no cronjobs at all [15:51:40] in the New World, both active and passive maintenance hosts run all the maintenance scripts on the same schedule, but each of them is run inside this wrapper: [15:51:40] https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.sh [15:52:33] so, if this isn't going to be a cronjob/timer anymore, and that sounds great to me, it should still be running identically in the active and passive DCs [15:53:01] with the only difference being that, each time it discovers some work and begins to do it, it first checks to see if it's the passive DC, and if so does nothing [15:54:02] If the job was schedule post edit though I guess I wouldn't need to worry about that though, as it'll just be scheduled where the edit took place [15:54:43] sure -- as long as the queue is short enough that it'll get caught up pretty quickly after we go read-only [15:54:52] say, within a minute or two? [15:55:15] or, if we don't have to wait for that to happen before we go read-write in the other DC [15:55:32] Well, if editing stops (read only), the queue to dispatch isn't getting any longer :) [15:56:01] right yeah, I'm thinking about how long it takes to catch up on the queue after that happens -- maybe that only takes seconds, I have no idea :D [15:56:16] One thing I'll have to look at is how the locks are managed, (currently in redis) how happy will they be cross dc [15:56:38] Well, the queue currently always have a minuite or 2 lag [15:56:49] That's mainly due to the process of dispatching itself [15:57:24] We could have a totally different job with a different dispatching mechanism and bring that lag time down [15:57:33] nod [15:57:43] But basically the minute or two delays so far is acceptable, hence we haven't touched it in years [15:57:50] yeah, makes sense to me [15:58:01] I might take a quick look this evening [15:58:47] and to be clear, none of this is dire :) I'd love to cut that puppet run out of the DC switch process, but it's not causing any problems [15:58:56] it just means we'd be able to have a shorter RO period [16:00:15] if the new dispatcher turns out to be a bigger project that we won't do right away, I'll probably want to at least convert the current one over to a set of parallel periodic_jobs rather than a cronjob -- but I'll wait and see what you think first [17:36:28] kinda bikesheddy - but looking for input on the many ways to proceed: [17:37:39] I have a custom check_dns_query which is currently used by icinga for some normal (non-nrpe) checks. So it's defined modules/nagios_common in the usual way and rolled up with the list of such checks in modules/nagios_common/manifests/commands.pp [17:38:34] I'd like to also re-use the check command (just /usr/lib/nagios/plugins/check_dns_query , not the icinga config part) so that some local software on cluster hosts can execute it directly (not for NRPE, but for other host-internal monitoring) [17:38:59] we don't seem to have examples of this kind of use-case [17:39:35] it gets kinda tricky when thinking about the namespaces and ownership and cross-module use, etc [17:39:56] take it out of puppet and deploy it as a deb package? 
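(Going back to the mw-cli-wrapper discussion earlier in this exchange: the sketch below is a purely hypothetical illustration of the "run identically in both DCs, no-op in the passive one" pattern described there. The real wrapper is the linked mw-cli-wrapper.sh in puppet; get_active_mediawiki_dc() and MY_DC are invented stand-ins for however the active datacenter is looked up (e.g. via etcd/conftool) and for the local site name.)

```python
# Hypothetical sketch only -- not the actual production wrapper.
import subprocess
import sys


def get_active_mediawiki_dc() -> str:
    """Placeholder: would query etcd/conftool for the active MediaWiki DC."""
    raise NotImplementedError


MY_DC = "codfw"  # placeholder: would come from site configuration


def main(cmd):
    # Both datacenters schedule the job on the same timer; only the active
    # one actually does any work, so a DC switchover needs no puppet change.
    if get_active_mediawiki_dc() != MY_DC:
        return 0  # passive DC: exit quietly so the timer stays green
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```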
[17:39:57] * volans hides [17:40:01] :P [17:40:51] jokes apart I'm not sure if we even support the case of the same script deployed both as for NRPE checks and local icinga runs [17:41:04] yeah [17:41:21] I could add a parameter to the icinga side of it to source it from elsewhere [17:41:54] the bottom problem is that puppet modules should not depend on each other (profiles and roles apart) [17:42:20] right [17:42:45] I'm thinking if we deploy it as an NRPE check both on the target hosts and icinga host [17:42:52] just for the deploy [17:43:07] and then run it from icinga directly and from the hosts too directly [17:43:48] I can also just not use the normal icinga::check_command stuff, and make a seperate dns::checK_dns_query class that deploys it and optionally deploys the icinga-side config file or something, and then bring it into the icinga hosts via their profile rather than nagios_common? [17:43:53] I donno [17:44:20] it's an easy technical problem, it's just a hard puppet-standards-vs-DRY issue [17:44:26] yeah [17:45:14] maybe I should just sidestep this whole thing, and write a separate check for the internal use-case anyways [17:45:32] I was thinking it'd be nice to have the same check_dns_query for both for consistency of healthcheck vs monitoring, etc [17:45:48] but maybe there's also an argument that being tested by two independent implementations is more fool-proof anyways [17:46:28] bblack: out of curiosity, what kind of checks? [17:46:28] (besides the icinga one is in Perl, and I could do the other in python or something) [17:46:40] bblack: https://phabricator.wikimedia.org/T241965 btw, looks like you could close it (soon) [17:46:50] because the dns module library of spicerack is one of the module that we plan to take out into a wmflib package [17:46:55] it's just a check for results of a dns query, like the built-in check_dns. the custom one just has more-advanced options. [17:47:12] and that has already some helpful stuff for those kind of checks [17:47:25] https://doc.wikimedia.org/spicerack/master/api/spicerack.dns.html [17:48:35] yeah it could be extended to cover these cases, perhaps [17:48:42] and wrapped as a test [17:49:02] sure, extend at will [17:49:26] for the dnsdisc ones we have a special module but because that takes care of the confctl side too of those records [17:49:56] right [17:50:13] in this case, comparing https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/nagios_common/files/check_commands/check_dns_query#40 [17:50:42] the main thing is checking that it's an AA response, and option to things like query type, source address, etc [17:52:01] the main missing thing is the authoritative [17:52:03] bit [17:54:06] yeah the source address, I added in a not-yet-merged patch. I haven't even tested yet if it accomplishes what I intend anyways. [17:54:32] what do you want to check? [17:55:09] the trick is that I want to check that the local daemon is responding to a query, using the actual anycast service IP (that it configures on its loopback and in the daemon listener). [17:55:32] but technically, the check could be fooled if the daemon was dead and the loopback was undefined, because it would just query a different anycast instance over the network via the routers [17:55:52] so I wanted a flag to set the source address of the query to loopback, so that it would fail properly in that case. [17:56:21] so you want to check the address of the server in the response? 
[17:56:39] no, I want to set the source address of the requesting test [17:56:49] I want to query from a client IP of 127.0.0.1 [17:57:03] to ensure it doesn't get routed out to Elsewhere and get a success that's actually remote [17:57:21] ok [17:57:43] but I need to check if that strategy even works. for all I know linux would still rewrite the source and route it or something dumb like that. [17:57:51] and right, my point was moot, I guess that with anycast you get the anycast IP in the response [17:58:10] as the response's source address, yes [17:59:01] (everything with DNS checks is confusing to talk about, because meta) [17:59:06] yeah [17:59:32] the point of the check is that this is the local healthcheck on an anycast DNS server, which (if it fails) causes bird to stop advertising it to the routers. [17:59:55] yeah I imagined [18:01:16] I'm wondering if we might need a dedicated ip rule to enforce that from some local specific IP in the 127 subnet [18:01:37] yeah I donno, I'll find out in testing in a little bit [18:01:46] I'm hoping just setting the source to 127.0.0.1 is enough to fail [18:03:46] ack, lmk if you need anything from my side for it. Happy to help [18:06:08] ok [18:16:55] yeah, testing confirms the source address thing works (set query source to 127.0.0.1), tested positive+negative cases [18:17:25] it's kind of an oddball option and a fairly custom test anyways, it might be simpler just to use the python dns module directly since nothing else will use this [18:19:14] (we also want other things that are more testy than functional, like separately and explicitly querying udp-then-tcp, ignoring truncation, etc [18:19:17] ) [18:19:58] up to you, I see your points and kinda agree [18:20:04] too bad to re-invent the whell every time :) [18:20:52] assuming dnspython supports setting the source addr, still digging into it [18:21:07] those docs are the worse to navigte [18:21:17] yeah I may as well just read the source :P [18:21:34] look for resolver.Resolver [18:21:38] that's what I use [18:21:41] in spicerack [18:22:00] I'll probably have to use the raw query interface [18:22:26] *source*, a ``text`` or ``None``. If not ``None``, bind to this IP address when making queries. [18:22:35] in query() [18:23:02] the problem with a 'resolver' is usually that it aims to do its best to use the protocol correctly and resiliently, and at best you have to set lots of non-standard things to make it be dumbed. [18:23:06] *dumber [18:23:13] also, feel free to get inspiration from: [18:23:13] https://doc.wikimedia.org/spicerack/master/_modules/spicerack/dns.html#Dns [18:23:15] (like don't retry, don't fall back from UDP to TCP, etc) [18:23:27] as the basic API is not that trivial [18:24:08] whereas a tester wants the opposite approach - I want to do one precise thing and test for a specific response condition and nothing else. it's try-hard-to-succeed vs try-hard-to-fail. [18:25:35] yeah, query() has also tcp=False [18:25:53] it has specific methods for them as well [18:26:30] what a rabbithole! :) [18:26:52] but it will be a better check for the existing anycast recdns too (the ticket X linked earlier), so it's kinda worth it [18:27:23] but yeah, a short patch to the existing nagios/perl check_dns_query would've sufficed, if there was an easy/standard way to get it into two modules :P [18:27:34] lol [18:28:01] can I just deploy the File resource as an optional modules::base::foo and pull that in from both (is base or standard an exception?) 
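(Putting the dnspython pieces above together, a health check along the lines being discussed might look like the sketch below: one UDP query to the anycast service IP, sent from source 127.0.0.1 so that a dead local daemon produces a failure instead of being silently answered by a remote anycast instance, with the AA flag checked on the response. This is a sketch only, not the check that was actually written; the service IP and query name are placeholders. The same pattern extends to an explicit TCP probe via dns.query.tcp() if wanted.)

```python
#!/usr/bin/env python3
"""Sketch of an anycast-local DNS health check (placeholders, not production)."""
import sys

import dns.flags
import dns.message
import dns.query
import dns.rcode

ANYCAST_IP = "192.0.2.53"   # placeholder anycast service address
QNAME = "example.org."      # placeholder record to query
TIMEOUT = 1.0


def main() -> int:
    query = dns.message.make_query(QNAME, "A")
    try:
        # One UDP query, no retries, no fallback: we want a sharp pass/fail,
        # and the 127.0.0.1 source keeps it from escaping to another instance.
        response = dns.query.udp(query, ANYCAST_IP, timeout=TIMEOUT, source="127.0.0.1")
    except Exception as exc:
        print(f"CRITICAL: query failed: {exc}")
        return 2
    if response.rcode() != dns.rcode.NOERROR:
        print(f"CRITICAL: rcode {dns.rcode.to_text(response.rcode())}")
        return 2
    if not response.flags & dns.flags.AA:
        print("CRITICAL: response is not authoritative")
        return 2
    print("OK: authoritative answer received")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```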
[18:29:03] on a positive note, all this did lead me to finding a one-character bug in the existing check_dns_query code, which causes a certain CRITICAL condition to abort the check script the wrong way (which I guess would still fail) [18:30:37] trying to catch up on the backlog, bblack, you can also change the TTL to make sure it doesn't route to the next anycast host [18:30:51] yeah true [18:31:02] careful, define TTL™ :-P [18:31:05] srcaddr works ok though, I tried it out with all kinds of related scenarios on my laptop [18:33:47] XioNoX: i learned the other day that 'TTL security' is something that comes up in BGP sessions sometimes [18:34:11] heh, you can't start a commit message with the word Bugfix? [18:34:58] cdanis: yup :) [18:36:25] cdanis: some TTL fun https://phabricator.wikimedia.org/T241965 [18:36:37] cdanis: er, https://phabricator.wikimedia.org/T209989#5025258 [18:37:51] yikes [19:28:18] well, I thought I found a clever-ish way, with: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597329/ -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597330/ [19:29:17] but in the second patch, I get the issue that dnsbox indirectly ends up doing a resource-like inclusion of nagios_common::check_dns_query twice. [19:29:38] would using `include` fix that? [19:29:39] I used the resource-like form there because otherwise our style-checks complain if a profile does an include-like on the same class [19:29:43] ah [19:29:52] so, yeah [19:30:11] maybe... maybe it's worth bypassing the style guide in this case [19:30:14] I guess, I could make another profile class which holds the resource-like inclusion in the profile namespace and then include that from both [19:33:53] I try not to bypass if I can help it, I think I'm in the minority in general on how I view these things :) [19:35:07] by that I mean: my pov tends to be one where any enforced rule should be strict, and rules shouldn't be introduced unless they're pretty unimpeachable (at least in the sense of not causing issues) [19:35:33] I think this is a minority view in general, maybe not even just within our SRE [19:35:56] I think that's a great aspiration, and also, not really compatible with writing Puppet ;) [19:36:12] I could save myself a lot of trouble if I could just warp my brain to think differently :) [19:36:16] (also I suspect I have a more pragmatic bent than most) [19:36:49] I think some of it's also brain damage from spending years long ago in the perl world [19:37:41] which has left probably a few indelible impressions, but the one that matters most here is the view that all code is artistic expression [19:38:00] which is, I think, the basis for wanting to avoid putting rules on it unless they're really strong and important ones [19:38:14] I also try to stick to the puppet style guide pretty closely, fwiw [19:38:36] there are reasons for a number of those rules that have to do with maintainability [19:39:01] it's ok to tell 10 painters on a collaborative mural project "None of you are allowed to swing a sledgehammer at the wall as part of your 'painting'" [19:39:14] but there's also a point at which you're straightjacketed from a creative standpoint [19:39:22] very true [19:40:49] also, this is kind of a silly example to stand on in this particular case, because puppet manifests are the antithesis of art to begin with :) [19:41:13] there are things I could do to have less code around and fewer roles, that makes me a bit twitchy but once again in the interests of maintainability and ease of finding
parameter values I'll live with it [19:42:10] puppet doesn't really provide for a clean fits-all-cases solution anyways [19:42:16] we just do the best we can [19:42:19] right [19:43:00] perl's near-ish to one end of a spectrum with its many ways to do things: lots of freedom of expression, and thus also lots of room to make ugly, terrible things. [19:43:22] puppet's much closer to the other end of the spectrum: constrain the expressiveness in hopes of fewer ugly terrible things? [19:43:35] I mean, it's loads better than it used to be [19:43:49] but there's still plenty of room for ugly horrible [19:44:42] yeah I think I've proven that many times over :) [19:44:57] anyone who has not proven that many times over has not used it very much ;-D [19:45:26] which is what makes me question such constraints in the abstract, I think. They don't really accomplish their aims, at least sometimes when the problem space is hard. [19:46:14] well but how much of the time do they fail? [19:46:24] if they get the job done 95% of the time say [19:46:28] that's pretty darn good [19:47:31] I feel like the dbas might have some comments on all this, iirc the mariadb roles were pretty painful to get into compliance [19:49:07] puppet: when you want to experience ugly, terrible things that are also merely glue code [19:49:27] lolol [19:49:38] apparently hiding it in another profile class works :) [19:50:32] creativity wins, it's just of the "creatively bending the rules" kind :P [20:00:36] er, "congrats"? :-D
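(For anyone reading along: the sketch below is a hypothetical illustration of the "hide it in another profile class" pattern that resolved the thread above. The real change lives in the linked gerrit patches; the profile class name here is invented, and only nagios_common::check_dns_query comes from the discussion.)

```puppet
# Hypothetical sketch only: exactly one profile class holds the resource-like
# declaration, and every consumer does a plain `include` of that profile, so
# nothing declares the nagios_common class twice and the include-vs-resource
# style check stays satisfied.
class profile::dns::check_dns_query_plugin {
    class { 'nagios_common::check_dns_query':
        # parameters, if any, set in exactly this one place
    }
}

# consumers (e.g. the dnsbox and icinga profiles) then just do:
#   include profile::dns::check_dns_query_plugin
```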