[00:56:30] <wikibugs>	 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808)
[06:20:25] <wikibugs>	 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10Vgutierrez) Currently tools.wmflabs.org is violating [[ https://tools.ietf.org/html/rfc6797#section-7.2 | RFC 6797 section 7.2 ]] by sen...
[06:27:35] <wikibugs>	 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10Vgutierrez) It's also violating [[ https://tools.ietf.org/html/rfc6797#section-7.1 | RFC 6797 section 7.1 ]] by sending the HSTS header...
[09:22:55] <wikibugs>	 10Traffic, 10Operations, 10monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (10ema) >>! In T212312#4866085, @CDanis wrote:  > Anyway I'm making all the 'slow prometheus query' tasks sub-tasks of the prometheus 2.x upgrade T187987 as th...
[10:18:04] <wikibugs>	 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10aborrero) I assume this is something in our nginx proxy, right? @Vgutierrez Could you please help us review the config?
[10:31:54] <wikibugs>	 10HTTPS, 10Traffic, 10Operations, 10Toolforge, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10Vgutierrez) >>! In T102367#5057208, @aborrero wrote: > I assume this is something in our nginx proxy, right? @Vgutierrez Could you pleas...
[11:45:28] <mutante>	 hi, do we need to do something about "high mailbox lag" alerts
[11:45:51] <mutante>	 cp3036 only
[11:53:42] <jynus>	 mutante obviously not the right person, but evaluate if there is any higher error than usual and depool if so
[11:53:49] <jynus>	 *rate
[12:04:00] <mutante>	 jynus: ACK, thx. done with https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X?_g=h@97fe121&_a=h@7e5f62f   affected host 3036 not showing up there. doing nothing
[12:11:27] <jynus>	 mutante: I am not 100% sure issues would show on that log, I used https://grafana.wikimedia.org/d/000000464/prometheus-varnish-aggregate-client-status-code?orgId=1&var-site=esams&var-cache_type=varnish-text&var-status_type=5&from=1553559064940&to=1553602264940 in the past
[12:12:59] <mutante>	 it seemed to show host names , other than cp3036
[12:13:11] <mutante>	 ok, cool. that also does not show raised errors, bbiaw
[12:34:20] <elukey>	 so from https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=upload&var-server=All&var-layer=backend it doesn't seem a great situation
[12:36:16] <elukey>	 ema: --^ (if you are around later on I guess we'll need to restart varnish on cp3036)
[12:58:42] <vgutierrez>	 it looks like it... depool + varnish-backend restart + repool
[13:14:29] <vgutierrez>	 Krenair: https://gerrit.wikimedia.org/r/c/operations/puppet/+/498920 --> this could mess with anything in labs?
[13:53:24] <ema>	 mutante et al: yes varnish-be on cp3036 is a bit under the weather since this morning. No 503 at all so far though, and the service is due to be cron-restarted at 17:23 today. In case of 503s I'll restart earlier, otherwise just wait and occasionally stare at grafana 
[13:54:17] <mutante>	 ema: good, sounds like what we did then :)
[13:54:20] <mutante>	 thanks
[13:57:51] <Krenair>	 vgutierrez, nope, lgtm
[13:58:19] <vgutierrez>	 ack thx
[13:59:05] <Krenair>	 there's two acme-chief certs in use in deployment-prep but they're both using the directory paths IIRC
[13:59:08] <Krenair>	 mx and unified
[14:01:30] <Krenair>	 yep
[14:04:46] <Krenair>	 vgutierrez, did you see my commit about putting a unified cert in acme-chief to use?
[14:04:57] <vgutierrez>	 hmm the tlsproxy patch?
[14:04:59] <Krenair>	 yeah
[14:05:06] <vgutierrez>	 yeah, it's on my TODO list, sorry :)
[14:05:10] <Krenair>	 no worries
[14:05:13] <Krenair>	 just saying
[14:05:26] <Krenair>	 it looks like between the two of us we have a lot of the pieces done
[14:05:51] <vgutierrez>	 initial list of SNIs for the non-canonical redirect: https://gerrit.wikimedia.org/r/c/operations/puppet/+/499201
[14:06:27] <Krenair>	 I'm a bit worried about what effect that one may have on our account limits
[14:06:33] <vgutierrez>	 me too
[14:06:43] <vgutierrez>	 I'm going to review them
[14:07:01] <vgutierrez>	 and after the review... maybe I split that into 4 commits to be merged in different days
[14:07:12] <Krenair>	 yeah
[14:07:44] <vgutierrez>	 also certs so big (~40 SNIs) maybe require us to refactor how the challenges are passed to the dns-01 challenge sync script
[14:08:16] <Krenair>	 we could have a larger number of smaller certs
[14:08:37] <Krenair>	 but yeah we should probably have tests about handling 100 SNI certs
[14:09:27] <vgutierrez>	 I've took 40 as a soft limit cause it's the current number of SNIs in our global unified cert in production (39)
[14:12:23] <vgutierrez>	 I'm going to wait till the puppet storm of alerts get passed before merging the first clean certs commit
[14:13:23] <vgutierrez>	 and be extra cautious with it.. aka disabling puppet everywhere and let it run somewhere more or less innocuous like netmon2001
[14:15:47] <bblack>	 I don't think it's supposed to trip account limits, but yeah I donno ( to do all those certs together )
[14:16:21] <bblack>	 vgutierrez: one nitpick, it'd probably be wise to have wikipedia.com as the CN on the first (default non-SNI) cert in the set, since statistically it's far more popular than all the rest of the long tail.
[14:16:57] <vgutierrez>	 ack
[14:18:07] <vgutierrez>	 I guess that I'll move every wikipedia.com related SNI to the first cert then
[14:19:32] <bblack>	 oh yeah, I see 3 but I guess that's redundant?
[14:19:52] <bblack>	 There's a *.wikipedia.com, wikipedia.com, and www.wikipedia.com (which is unecessary since it's covered by the wildcard)
[14:20:33] <bblack>	 when we get to implementing auto-generated redirects we may likely have cases where "www" is treated different than the rest of the wildcard (for langs), but for cert purposes there's no reason to expose it separately I don't think.
[14:21:21] <vgutierrez>	 yeah... www.wikipedia.com survived the trimming... I'll get rid of it :)
[14:22:17] <bblack>	 I think only "www.wikibooks.de" is non-redundant, from the list of www's at the bottom of the last cert
[14:23:14] <bblack>	 (but probably we want it in the same cert as "wikibooks.de", so that later the generated redirect rules land in the same cert vhost or whatever the equivalent is)
[14:23:45] <vgutierrez>	 indeed
[14:24:05] <bblack>	 one thing I haven't quite figured out yet every time I go back and think on the subject (but doesn't need solving yet I don't think)
[14:24:20] <bblack>	 is how we'll handle the ongoing trickle of adding new domains legal registers for $reasons
[14:24:52] <bblack>	 if we had a starting state for the live service where say the final cert in the set only had 10 SANs so far and room for 30 more.
[14:25:09] <vgutierrez>	 Krenair: I missed the "unified" cert name on your CR when submitting mine
[14:25:21] <bblack>	 and our workflow was whenever legal registers a new one we just tack it into the last cert and cause it to be re-issued with the new SAN added.
[14:25:52] <bblack>	 you can see corner cases cropping up where they might register one new domain per day for several days in a row, and cause some ratelimit to blow up on renewing the initial existing ones constantly.
[14:26:21] <bblack>	 I think we can solve that in workflow rather than technical ways, I'm just not exactly sure what it will look like.
[14:26:47] <Krenair>	 vgutierrez, sorry, not sure I follow?
[14:26:53] <bblack>	 e.g. maybe we only take a list of newly-registered things from them once a week or once a month, so there's not a trickle of updates like that (if the delays is acceptable, which it should be?)
[14:26:58] <vgutierrez>	 bblack: BTW, your suggested staging time for new certs was 7 days, right?
[14:27:23] <vgutierrez>	 Krenair: I named it global-unified, and it's called 'unified' everywhere else, easy fix on my side :)
[14:27:46] <Krenair>	 okay
[14:27:48] <Krenair>	 well either way
[14:27:58] <bblack>	 vgutierrez: for truly-new certs that never existed before, the correct staging time is probably zero (there's no point delaying to "fix" a problem.  without the cert there's no access at all anyways, no existing cert).
[14:28:19] <vgutierrez>	 bblack: right.. on cert renewal I meant :)
[14:28:44] <bblack>	 I think 5 days is what some statistics pointed at before, but there's a never-ending long tail, so 7 days seems reasonable to catch more of that long tail.
[14:28:56] <bblack>	 assuming there's not other constraints that make us want to not extend it further.
[14:30:01] <bblack>	 whatever we decide is our acceptable time-skew window (5 days or 7 days), we want the renewal flow to try to ensure that limit in both directions if at all possible (obv, not always possible if LE is down/unreachable or some other error for extended periods)
[14:30:16] <bblack>	 so if 7 can work for everything else in both directions, then yeah sounds great
[14:31:00] <bblack>	 the "both directions" thing meaning: we should aim to stage the new cert for 7 days from issue-date -> cert-switch-date, and also ensure the cert-switch-date happens 7+ days before the old cert actually expires.
[14:32:46] <bblack>	 (since the clock skew of a client could be in either direction.  e.g. having a 7 day staging that finally switches with 2 hours left on the old cert means we still caused a bunch of clock errors by not switching sooner)
[14:33:55] <vgutierrez>	 yeah... 7 days should be doable... it will be triggered when the current cert still has 30 days till it expires... so it will be switched when it still has 23 days of life
[14:34:02] <bblack>	 right
[14:34:44] <bblack>	 accounting for the possibility that there could be errors and renewal failures and blah blah and sometimes things are not ideal, if I were making a universal algorithm for picking the cert-switch-time for any renewal scenario, it would probably look like this:
[14:34:58] <bblack>	 if old_cert already expired: deploy immediately
[14:35:21] <bblack>	 else if old_cert expires in <7d, deploy at the midpoint time between new_cert.issue and old_cert.expires
[14:35:33] <bblack>	 else deploy at new_cert.issue + 7d.
[14:36:06] <bblack>	 s/already expired/already expired or doesn't exist (new cert)/
[14:36:52] <bblack>	 hmm that still has errors.  Redo:
[14:37:03] <bblack>	 if old_cert already expired or never existed: deploy immediately
[14:37:15] <bblack>	 else if old_cert expires in <14d, deploy at the midpoint time between new_cert.issue and old_cert.expires
[14:37:19] <bblack>	 else deploy at new_cert.issue + 7d.
[14:38:05] <bblack>	 once there's less than 14d in the middle, you're not getting your full 7d in some direction or other, so may as well split the difference to maximize the window on both, basically.
[14:49:15] <vgutierrez>	 bblack: BTW, I pushed this morning the CRs for our CAA records: https://gerrit.wikimedia.org/r/c/operations/dns/+/499154 https://gerrit.wikimedia.org/r/c/operations/dns/+/499155 https://gerrit.wikimedia.org/r/c/operations/dns/+/499156/1
[14:50:07] <vgutierrez>	 AFAIK the only one that's a little bit too wide it's the one for w.wiki, cause grants LE permissions to issue wildcard certificates and it's not strictly needed for our current unified cert
[14:50:23] <vgutierrez>	 but it made sense to keep things uniform
[14:59:15] <bblack>	 vgutierrez: reviewed
[14:59:56] <bblack>	 vgutierrez: also I didn't look yet, but... all those domains you put in the Non-canonical SAN lists, do we actually have (parked or otherwise) DNS for them at all, or were some missing?
[15:01:17] <bblack>	 (I'd argue if they're missing from our DNS, just skip them for this initial "issue the cert" thing this Q.  We can go back and audit/sanitize all related things later next Q)
[15:03:28] <bblack>	 oh, yeah, in general you probably want to look for them all in DNS on the non-canonical CAA thing
[15:03:50] <bblack>	 because I see at least one example (wikimedia.ee) where it's a separate zonefile in our DNS (not a parking symlink), but it is in redirects.dat and your NC san list.
[15:08:45] <godog>	 hi -- if this is a good time I had a couple of questions re: the infra foundations metrics goal, namely to stop sending statsd from pops, which in turn means deprecating a bunch of daemons that send statsd metrics. Do you see any problem/blocker with that? And would you be available to assist with e.g. code review?
[15:11:11] <vgutierrez>	 bblack: I've found two of them that are missing, wikibook.org and wikiversity.info... I mentioned them in the CR that updates the redirects.dat file
[15:14:28] <bblack>	 vgutierrez: what CR?
[15:15:03] <bblack>	 oh I see
[15:15:25] <vgutierrez>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/292785
[15:15:27] <vgutierrez>	 that one sorry
[15:15:30] <bblack>	 ok yeah, we probably should open a separate CR to kill those from redirects.dat in the short term.
[15:16:31] <bblack>	 but otherwise I think the remaining thing to clean up wrt the new non-canonicals cert and CAA records: is find the cases like wikimedia.ee where it's in your NC-SAN set but it's not using the parking symlink, and propagate the LE CAA to those zonefiles too
[15:27:49] <ema>	 godog: I see no problem with that, happy to help
[15:28:16] <vgutierrez>	 bblack: I've done that already with my script
[15:33:32] <godog>	 ema: ack, thanks!
[15:35:30] <vgutierrez>	 bblack: that's how I found that those two were missing
[15:36:53] <wikibugs>	 10Traffic, 10Operations, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki)
[15:48:58] <bblack>	 vgutierrez: I think we're cross-talking or something, so to be annoyingly super-specific:
[15:49:19] <bblack>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/499201/2/hieradata/role/common/acme_chief.yaml#170 sets up a SAN through an LE cert for wikimedia.ee
[15:49:39] <bblack>	 operations/dns/templates/wikimedia.ee is not a symlink to the "parking" file, it is its own unique zonefile
[15:50:08] <bblack>	 https://gerrit.wikimedia.org/r/c/operations/dns/+/499156 claims to set up CAA to allow the earlier commit to issue the non-canonical cert, but only updates the parking file, not the wikimedia.ee zonefile.
[15:50:54] <bblack>	 (I don't know if there are other such cases, but that was the one I noticed)
[15:56:42] <vgutierrez>	 hmmm
[15:59:24] <vgutierrez>	 you're right.. I checked the symlinks but I obviously missed something
[15:59:28] * vgutierrez rechecking
[16:08:23] <vgutierrez>	 so.. yeah.. we're missing more than two :)
[16:08:44] <vgutierrez>	 I'll open that CR to get rid of the domains that we don't own
[16:22:06] <volans>	 bblack: from the incident response working group I've a task to make sure we document how to deploy a DNS change without gerrit. IIRC you did that a while ago and I was wondering if you had that written somewhere or in your mind :)
[16:22:53] <bblack>	 volans: I think there's some notional ideas that "oh it should just be a matter of doing X in the general case, because we designed for that ages ago"
[16:23:48] <bblack>	 volans: but while staring at all related things during all the recent work on DNS CI stuff, I've come to think there's some bad assumptions built into that thinking, or at least some minor annoying bugs we haven't ever bothered to look at, which is why I'm hesitant to publish a simple howto as if it's all well-understood and working right....
[16:24:59] <bblack>	 volans: I think at this point we probably need an actual task open for someone to investigate, verify, and document how to handle the various possible failure scenarios ("without gerrit" being one of those scenarios, but there might be other important ones)
[16:25:47] <bblack>	 (e.g. "without working DNS for our own hostnames, because our authdns is broken, which is why we're trying to push the emergency change in the first place")
[16:25:56] <volans>	 sure sure
[16:26:10] <volans>	 anything I can do to help on this?
[16:28:18] <bblack>	 make a task, try to think through what the appropriate set of scenarios is we should be able to handle (without overlap, because if there's two scenarios A and B, but A implies B has gone wrong too, we only need a solution for A that works for both)
[16:28:25] <jynus>	 volans, I sent an email about this a while ago
[16:29:19] <volans>	 jynus: I'm well aware
[16:29:27] <bblack>	 volans: the key bit of evidence that makes me question our best-current-practices about how to update/sync git on the authservers without gerrit, is that under normal operating scenarios with the normal deploy flow, we see:
[16:29:37] <jynus>	 volans: regarding gerrit, the official response I got was "will not document, anyone but DBAs will handle it when it breaks"
[16:30:21] <bblack>	 bblack@authdns2001:/srv/authdns/git$ git status
[16:30:21] <bblack>	 On branch master
[16:30:22] <bblack>	 Your branch is ahead of 'origin/master' by 549 commits.
[16:30:47] <bblack>	 ... which makes me think something's not entirely kosher with how our authdns-git-pull works, when compared with human expectations when needing to operate on that checkout manually...
[16:31:06] <bblack>	 we'd expect it to say it's currnetly in sync with master at last fetch?
[16:31:39] <bblack>	 it clearly "works", just not in a way that makes life easy for the human in a manual emergency
[16:33:11] <volans>	 yeah clearly
[16:34:09] <bblack>	 our old expectation was that without gerrit, we'd be able to make a local commit there on one server
[16:34:25] <bblack>	 and do "authdns-local-update $the_one_server" on other servers to sync the same data
[16:34:28] <bblack>	 or something like that
[16:34:48] <bblack>	 (and then who knows how we undo the carnage when we have gerrit back, I guess reset --hard on their checkouts)
[16:35:39] <jynus>	 I think it is ok to have bugs, as long as they are known/documented (better than unknown)
[16:35:43] <bblack>	 so I suspect authdns-git-pull needs some love, and that maybe we should be deploying authservers' hostnames to each others' /etc/hosts files so that things work sanely/easily when DNS itself is borked, at a minimum
[16:35:54] <volans>	 yeah I'm staring at that file
[16:35:57] <bblack>	 and then maybe we're in shape to document scenarios
[16:36:11] <jynus>	 as in "this is what happened, it is a known issue"
[16:36:24] <jynus>	 "will fix at a later time"
[16:37:05] <jynus>	 I don't think it is reasonable to have every possible scenario in perfect condition
[16:37:08] <bblack>	 I think during that last gerrit outage emergency, I may have just made simple edits directly in /etc/gdnsd/zones/ on the 3x servers and done a manual gdnsdctl reload-zones, or something like that, because it was simpler than wondering about all of these other things
[16:37:54] <bblack>	 that's the ultimate fallback for all scenarios, assuming we can ssh into all the authdnses (or cumin them with a sed command to do the edit)
[16:39:16] <bblack>	 jynus: yeah all scenarios are impossible to predict anyways, but there are clearly "common" ones we should be prepared for and know well.
[16:39:40] <bblack>	 the biggest ones are going to be lack of various services, for various kinds of updates
[16:39:42] <jynus>	 bblack: I think you don't understand what we "need"
[16:39:53] <jynus>	 I don't think a nice script is needed
[16:40:00] <bblack>	 no, just a procedure
[16:40:02] <jynus>	 as much a better understanding
[16:40:20] <jynus>	 of what update-dns does, then let a human take an informed decision
[16:40:36] <bblack>	 but the existing scripts and tooling impacts the viability of the procedures
[16:40:40] <jynus>	 yes, code can be look up
[16:40:49] <bblack>	 not easily, and that's part of the problem :)
[16:40:53] <jynus>	 exactly
[16:41:09] <jynus>	 so it is more of a "here some human-friendly details"
[16:41:21] <jynus>	 "you can do this, but never do this or things will break"
[16:41:25] <bblack>	 it's not a very human-friendly problem space :)
[16:41:31] <jynus>	 I knowq
[16:41:50] <jynus>	 but human SPOF is scarier than technical SPOF
[16:41:55] <jynus>	 :-D
[16:42:00] <bblack>	 right now even for me, in the moment of an emergency, my best understanding is "There are these scripts driven by authdns-update which do $MAGIC under normal conditions, and if conditions are not normal, all bets are off"
[16:42:17] <jynus>	 wait, I thought you coded those?
[16:42:22] <bblack>	 my best quick understanding, I mean.  There's no time to sit and think and stare.
[16:42:35] <bblack>	 yes, but they're not simple when you look at what-depends-on-what-that-might-be-broken
[16:42:41] <jynus>	 I see
[16:42:55] <bblack>	 (I wrote some peices of it, not all.  there was existing stuff before I started working on it whose history goes way back)
[16:42:59] <jynus>	 so I think the whole point is to make an effort to do some work in advance
[16:43:22] <jynus>	 at least the one that is most reasonable to happen
[16:43:40] <bblack>	 right
[16:43:51] <jynus>	 even if the end result is- don't do it
[16:44:06] <bblack>	 you can start with (as we did above): "Let's document how things work so we can handle common scenario X easily"
[16:44:16] <jynus>	 yep
[16:44:56] <bblack>	 I've just jumped one level deeper and said: "I'm pretty sure if we try to do that right now, we'll either create buggy documentation, or our documentation effort will just highlight that we also need to fix the software, becuase the answers aren't simple enough due to deficiencies in the software"
[16:45:05] <jynus>	 for example, even a "it is impossible to update dns without a script" or "impossible at the moment" is useful
[16:45:21] <jynus>	 because it informs decisions on other parts of the infra
[16:45:42] <bblack>	 the only reliable thing we can say in an emergency right now I think, if you want an answer with no deeper investigation or tooling changes, is:
[16:46:17] <jynus>	 oh, I want a deeper investigation, or at least acknolefge a deeper investigation may be necessary
[16:46:38] <jynus>	 I don't need it to happen _now_, there is more urgent issues
[16:46:46] <bblack>	 If anything is broken that will affect the normal "merge to gerrit and authdns-update" flow (DNS problems, network problems, gerrit problems, etc), the only reliable recourse is to ssh to all authservers manually, edit the zone data directly in /etc/gdnsd/zones to fix the problem, and execute "gdnsdctl reload-zones".
[16:47:22] <bblack>	 (and obviously, log and delcare that nobody should touch the DNS gerrit or authdns-update, until things are made sane again)
[16:47:29] <jynus>	 can I copy paste that with a big disclaimer and an "try to ask bblack" first XD ?
[16:48:03] <bblack>	 sure
[16:48:04] <jynus>	 because that would already be useful
[16:48:42] <jynus>	 my criticism about processes like this is that we try to get them correct and perfect on a first try
[16:48:59] <bblack>	 but I don't think that answer is what it should be.  because we already have plans to eventually have far more than 3 authservers, making any manual "ssh to all the authservers and edit files manually" get increasingly painful.  And not all emergency edits can be easily and reliable done with e.g. cumin and a sed command.
[16:49:23] <bblack>	 so it's just not a great answer, and the tooling needs some work
[16:49:27] <jynus>	 bblack: you are still writing very useful information ! :-)
[16:49:53] <jynus>	 I would add that to the wiki as is, and will do- let me do it and will ask for your ok
[16:50:09] <bblack>	 what I'd like to have as a desirable outcome state, would be:
[16:51:13] <bblack>	 So long as the Network is working (meaning IP reachability is unbroken between authdnses), if other things like DNS and/or gerrit are broken, you can ssh into any single authserver, make a new emergency commit in /srv/authdns/git, and run "authdns-update --local-master" (or something, that flag doesn't exist) to sync that commit to them all.
[16:52:02] <bblack>	 and then document how to recover from that state afterwards as well: make a matching commit through gerrit and reset things in the checkouts, etc....
[16:54:03] <bblack>	 the current tooling/layouts was always intended to support such a workflow, I just think there are issues with it that need software fixes in puppetization and/or dns-scripts
[16:54:38] <volans>	 yep, I'll try to summarize that into a task
[16:57:23] <volans>	 bblack: multatuli is the test server right?
[16:58:44] <bblack>	 volans: no, it's the live ns2
[16:59:10] <volans>	 ah sorry, do we have a test one?
[16:59:30] <bblack>	 cp1008 and cp1099 currently have the role authdns::testns , which may be kinda what you're looking for
[16:59:56] <bblack>	 it gives them most of the puppetization of a real authdns server, but doesn't give them the list of the real authservers to sync to, or sync them from others.
[17:00:06] <bblack>	 (or the real service IPs on lo, etc)
[17:00:19] <volans>	 cp1099 will do, Your branch is ahead of 'origin/master' by 89 commits.
[17:00:38] <bblack>	 you can do "authdns-local-update" there in place of "authdns-update"
[17:00:56] <bblack>	 to bring it up to speed since someone last manually did so, which will still leave master looking X commits behind origin/master
[17:01:52] <jynus>	 I've written https://wikitech.wikimedia.org/wiki/DNS#Update_DNS_if_gerrit_or_DNS_are_down_%28on_an_emergency_only%29
[17:02:25] <volans>	 so, my current theory is that we call authdns-git-pull with $REMOTE that is the gerrit url of the repo and when we do git fetch $REMOTE it does fetch and set the FETCH_HEAD (used later) but it doesn't set the remote master branch HEAD to it
[17:02:38] <jynus>	 I know it is hard to document bad procedures, don't think I don't understand
[17:03:13] <volans>	 ofc there might be $reasons that it was done that way so not something to blindly "fix"
[17:03:29] <jynus>	 but in my case, I didn't even know where the dns config lives - and this gives me a start point
[17:04:00] <bblack>	 volans: yeah also, what "authdns-update" does is the fetch from gerrit on the host you execute on, then ssh's to other authdnses and does a git fetch over ssh from the host you run authdns-update on to the others
[17:04:38] <bblack>	 so if you start out running "authdns-update" on ns0, ns0's authdns-git-pull is from gerrit, but ns1's authdns-git-pull is actually from ns0 as the master.
[17:04:40] <volans>	 yes, I knew/saw that but given the approach is the same (fetch from URL, not fetching a branch)
[17:04:52] <volans>	 maybe the fix is to just set $BRANCH=master by default, needs some testing ofc
[17:05:38] <bblack>	 even if you "fix" that, the rest of the workflow says if you always run authdns-update on ns0, ns1 will never be in contact with the gerrit origin to know the current position of the true origin/master at all.
[17:06:35] <volans>	 indeed
[17:06:40] <volans>	 that's the other part to fix
[17:06:50] <bblack>	 yeah, it's tricky
[17:07:23] <bblack>	 if you just had them all fetch gerrit's origin also (separately), and then not do so when given some flag that we're operating gerrit-free in an emergency, maybe.
[17:08:01] <bblack>	 the reason the non-emergency workflow doesn't have them all fetch gerrit independently is to avoid a race where two DNS commits are merged in gerrit independently with close timing, and the first person runs authdns-update and the nsX come out with different HEADs.
[17:08:25] <volans>	 that would open to race conditions of FETCH_HEAD moving, although we do: git merge --ff-only $NEW
[17:08:43] <bblack>	 yeah but still, we don't want a case where ns0 is on revX and ns1 is on revX+1
[17:08:58] <volans>	 we wouldn't
[17:09:00] <bblack>	 (that persists, anyways)
[17:09:20] <volans>	 at most their FETCH_HEAD will be different (their perception of where the origin repo is at)
[17:09:23] <bblack>	 anyways there's still always a race if there's a lack of manual coordination
[17:09:44] <bblack>	 even if we put a trivial lock on authdns-update, two commiters half a minute apart could start running authdns-update from different nses heh
[17:10:14] <volans>	 true, and going forward I already see a lock mechanism when we start doing automated updates from netbox as source of truth for example
[17:10:25] <volans>	 s~when~if/when~
[17:10:47] <volans>	 to avoid races between humans and automation
[17:10:49] <jynus>	 bblack: please have a look at my link, I will delete it if you are not happy or doesn't have large enough disclaimers
[17:12:30] <bblack>	 jynus: step (1) doesn't need to be there, that's the condition for the proposed better future workflow.  the one you're documenting is universal.
[17:14:09] <jynus>	 sorry, I put this under resolvers, it should be under recursors?
[17:15:20] <jynus>	 bblack: https://wikitech.wikimedia.org/wiki/DNS#Update_DNS_if_gerrit_or_DNS_are_down_%28on_an_emergency_only%29 changed
[17:21:40] <jynus>	 can I get at least a "don't delete it", even if it is not perfect 0:-D
[17:29:50] <wikibugs>	 10Traffic, 10Discovery-Search, 10Elasticsearch, 10Operations, 10Patch-For-Review: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe)
[17:51:15] <bblack>	 jynus: don't delete it :)
[17:52:08] <bblack>	 jynus: the whole page needs some cleanup about the "resolvers" vs "recursive".... I don't know that there's any "right" place for it now, but the place it's at now works
[17:52:51] <bblack>	 resolvers and recursive, in this context, mean the same thing anyways (our powerdns recursors, dns1001 and so-on), and "authoritative" is the other thing we're talking about here (which is currently mixed into those sections?)
[17:53:54] <bblack>	 I can't quite sort it all out in a quick pass, but just looking at the TOC at the top of: https://wikitech.wikimedia.org/wiki/DNS
[17:54:15] <bblack>	 everything in section 2 "recursors" really belongs in section 1 "authoritative nameservers"
[17:54:26] <bblack>	 section 3 seems to be about recursive resolvers
[17:54:45] <bblack>	 I think
[17:55:37] <bblack>	 yeah
[17:55:43] <bblack>	 I'll edit the sections up a little
[17:59:52] <bblack>	 done
[18:00:10] <bblack>	 I didn't fix a lot of the smaller issues, but the sections and basic wording at the top of the article and the sections makes more sense
[18:01:20] <bblack>	 and one more edit coming to move your new HOTO up to the now-correct section heh
[18:02:23] <bblack>	 done
[18:03:27] <jynus>	 thanks, bblack
[18:04:11] <jynus>	 again, you would be surprised on how little I knew about this, as we have very little time for learning
[18:05:28] <jynus>	 I also understand the damage that bad documentation can be
[19:06:21] <vgutierrez>	 bblack: getting rid of some deprecated redirects.dat entries: https://gerrit.wikimedia.org/r/c/operations/puppet/+/499239
[19:07:21] <vgutierrez>	 also.. I've updated the CR that grants LE permission to issue the non-canonical certs: https://gerrit.wikimedia.org/r/c/operations/dns/+/499156
[19:07:35] <vgutierrez>	 as you pointed out, some DNS zones were missing there :)
[19:08:35] <vgutierrez>	 and I've updated the CR issuing the certificate accordingly: https://gerrit.wikimedia.org/r/c/operations/puppet/+/499201/3
[19:09:07] <vgutierrez>	 one tricky case is pywikibot.org and pywikipedia.org, we don't control those domains, but its A record point to our text-lb
[19:09:29] <vgutierrez>	 so we cannot get certs for those domains using dns-01
[19:11:21] <Krenair>	 vgutierrez, can we just tell them that to continue working they need to change nameservers to wikimedia prod and get a zone set up there?
[19:12:01] <vgutierrez>	 so... I didn't remove those redirects from the redirects.dat file
[19:12:30] <vgutierrez>	 so right now those domains shouldn't be affected
[19:12:34] <Krenair>	 ok
[19:13:27] <Krenair>	 I guess we won't be doing HTTP->HTTPS redirects for those
[19:13:33] <Krenair>	 not back to the same domain anyway
[19:13:47] <Krenair>	 and https://pywikipedia.org is already broken
[19:13:47] <Krenair>	 so
[19:13:50] <Krenair>	 no problem
[19:42:13] <wikibugs>	 10Traffic, 10MediaWiki-ResourceLoader, 10Operations, 10Performance-Team, 10Performance-Team-notice: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (10Krinkle) 05Open→03Resolved  #### Startup request rate pattern  `name=...