[09:28:17] <_joe_> want to see a nice rabbithole of a request flow? https://phabricator.wikimedia.org/T249535#6035576
[09:29:16] <_joe_> client -> edge -> mediawiki api -> envoy -> mw parsoid -> 301 to another public url -> edge -> mw appserver for a parsoid url -> 404
[09:58:56] lovely
[10:01:11] mysql
[10:01:16] wrong window, sorry
[15:45:18] robh: ping me when you're awake and working? I have a question about Doneva's recent email
[15:45:26] im here
[15:46:09] great!
[15:46:48] So I think either I asked for the opposite of what I wanted, or Doneva misread… I was trying to get her to change the resolvers from e.g. cloud-ns0.wikimedia.org to e.g. ns0.openstack.eqiad1.wikimediacloud.org
[15:47:03] but then the whois pastes she included show cloud-ns0 as the resolver still...
[15:47:24] Can you confirm that I asked for what I thought I asked for, and that she did the opposite? (Or tell me I'm confused)
[15:47:25] im reading this now and honestly was ignoring it otherwise ;D
[15:47:34] i just figured you were asking for legit stuff to domains we own heh
[15:47:38] but indeed, i see what you mean
[15:47:38] as you should've :)
[15:47:51] the 'new' whois lists the old nameservers
[15:47:54] I'm happy to follow up once I have external confirmation that I'm not hallucinating
[15:47:57] you are correct, you wanna reply and gently correct?
[15:48:01] you are correct, she did it wrong
[15:48:03] yeah, I will, thanks
[15:50:07] o/
[16:11:29] andrewbogott: email makes sense to me =]
[16:11:39] great, thanks
[16:11:42] so hopefully they'll fix it quickly for ya
[16:12:03] also i wonder if maybe we shouldn't find someone more on point to approve these changes, but meh, i dont mind having out-of-scope power from the old days ;D
[16:12:10] bwahahahaha
[16:12:14] ahem.
[16:12:32] I have no idea who that person would be.
[16:12:57] maybe traffic cuz domains? really not sure.
[16:13:21] I like the superlocking mind you, and I dont mind being the person to sign off on things, but its only me cuz ive been here so long ;D
[16:40:08] robh: old hats are the hardest to get rid of. :)
[16:41:06] super-dns-admin@wikimedia.org :p
[16:41:16] that's kind of what it is
[16:43:08] mutante: it cannot be an alias
[16:43:14] it has to be a legally documented individual ;D
[16:43:27] meta-dns
[16:43:28] right now its me, someone in legal, and i think maybe mark
[16:43:42] and they know to contact me cuz mark is busy being director, heh
[16:44:04] robh: gotcha, so it's like mandatory bus factor.. hrmm
[16:44:20] yep, its a good one though as it prevents our domains being transferred
[16:44:24] or nameservers changed
[16:44:38] I am just amused its still one of my hats
[17:05:07] andrewbogott: i just got email confirmation back from the automated system that nameservers were changed correctly for those 3 domains
[17:12:06] herron: o/ any objection if I merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/586460/ ?
[17:13:50] elukey: hey, sure. fwiw it will install phatality, which wouldn't do harm, but also sounds like it isn't needed. shouldn't be a blocker though if this is needed sooner than later
[17:14:25] herron: ah ok, super ignorant about it, is there a way to exclude it via hiera or should it be reworked?
[17:14:55] ah no, it is buried in the class
[17:15:14] maybe we could add class { '::kibana::phatality': } conditionally?
[17:15:16] we could add some params and a hiera lookup to enable/disable it
[17:15:17] yea
[17:15:29] super, following up in the code change, thanks!
[17:15:37] kk sounds good!
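For anyone who wants to poke at a redirect rabbithole like the 09:29 flow above from the outside, here is a minimal client-side sketch that follows a chain one hop at a time and prints each response. It is not tied to the actual MediaWiki/ATS/envoy internals; the URL in the `__main__` block is a placeholder and the only dependency is the `requests` library.

```python
import requests
from urllib.parse import urljoin


def trace_redirects(url, max_hops=10):
    """Follow a redirect chain one hop at a time, printing each hop."""
    for hop in range(max_hops):
        # allow_redirects=False so every intermediate response is visible
        resp = requests.get(url, allow_redirects=False, timeout=10)
        print(f"hop {hop}: {resp.status_code} {url}")
        location = resp.headers.get("Location")
        if location is None:
            return resp  # terminal response (e.g. the final 404)
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    # Hypothetical URL; substitute the Parsoid/REST path from the task.
    trace_redirects("https://en.wikipedia.org/api/rest_v1/")
```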
[17:38:47] robh, Doneva's email is asking about glue records, but that's just the IP addresses I gave her already, right? I'll just respond with "yes, do that" unless I'm missing something.
[17:42:32] seems ok to me yeah, just reply yes
[17:42:40] she just cannot assume anything and has to confirm any record changes
[17:46:49] thx
[17:47:01] neither can I apparently
[18:40:40] I'm having a hard time convincing myself we shouldn't have paged on this: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1586213400000&to=1586217000000
[18:40:52] (we did not, and I think we should have, on one of these signals alone)
[18:42:46] this was last night's incident?
[18:42:50] yes
[18:44:37] when looking at the 7 day graph there's another spike on the 5th at around 10:00, not sure what that is
[18:44:52] a high rate (> several hundred rps) of ats-be-perceived 50X is possibly worth paging on
[18:45:11] the ats backend errors spike however stands out
[18:45:26] this brings up a thing i was thinking about the other day, like has anyone implemented HOTSAX-type algorithms for time series so that we can be alerted on anomalies like that
[18:47:37] from a quick glance the last time we had >500 rps of ats-be-perceived 50x was 2020-02-11 around 21:00 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200211-caching-proxies
[18:48:13] yeah sorry, I should have been clear: the two spikes in the 7 days were of varnish and ats availability drops
[18:48:24] a lesser time (~150 rps) was 2020-03-25 around 11:30 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200325-codfw-network
[18:49:18] and the final recent time (8k rps!) was 2020-02-04 16:10 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200204-app_server_latency
[18:50:03] anyway, for last night's incident, we didn't ever actually page, and I think that's wrong
[18:50:06] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1580232305973&to=1586285326413 looking at this to see spikes in that
[18:50:22] yeah I think paging should happen in the future
[18:53:27] i've added a note to actionables
[20:27:02] <_joe_> cdanis: ok, so, any threshold that only catches recorded incidents is probably good enough, but also probably mostly redundant?
[20:29:40] <_joe_> but I agree we should have some alerts on edge-cache availability
[20:33:00] <_joe_> cdanis: I think this https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1586211191414&to=1586219915298&fullscreen&panelId=20 is a better signal
[20:44:31] _joe_: the point is, we didn't get paged on anything last night; people who were around happened to notice human reports
[20:45:10] but yes, the appserver error rate is also a fine signal
[20:48:25] _joe_: do you know offhand if there's a task to write some recording rules for the queries used on the appservers red dashboard?
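As a rough way to prototype the paging threshold discussed above (ats-be-perceived 50X above "several hundred rps") before turning it into a real alert, something like the following could be run against the Prometheus HTTP API. The Prometheus URL and the `trafficserver_backend_responses_total` metric/labels are illustrative assumptions; the actual names in the production Prometheus would need to be checked.

```python
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint
# Hypothetical PromQL: per-second rate of backend 5xx responses; the real
# ats-be metric and label names may differ and should be verified.
QUERY = 'sum(rate(trafficserver_backend_responses_total{status=~"5.."}[5m]))'
THRESHOLD_RPS = 500  # "several hundred rps", per the discussion above


def current_5xx_rate():
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # An instant query wrapped in sum() returns at most one series.
    return float(results[0]["value"][1]) if results else 0.0


if __name__ == "__main__":
    rate = current_5xx_rate()
    print(f"backend 5xx rate: {rate:.1f} rps")
    if rate > THRESHOLD_RPS:
        print("would page: 5xx rate above threshold")
```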
[21:01:46] _joe_: I'm now skeptical of this as a good signal: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1580828885378&to=1580842618251&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=20
[21:01:58] <_joe_> cdanis: no, I never got around to that
[21:02:06] ok
[21:02:30] <_joe_> cdanis: I mean if the error rate goes over 5%
[21:02:53] the time window to which I linked includes https://wikitech.wikimedia.org/wiki/Incident_documentation/20200204-app_server_latency
[21:03:31] <_joe_> yes, that was a different kind of problem, indeed
[21:03:33] whereas, compare https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1580828885378&to=1580842618251&fullscreen&panelId=14
[21:03:37] <_joe_> we didn't spit out as many errors
[21:04:39] <_joe_> but yes, we would probably need to page on either, with pretty generous limits
[21:05:13] yah
[21:12:46] <_joe_> cdanis: I just realized we can probably get rid of that horror by using envoy's telemetry
[21:12:58] you mean, the entire mtail portion?
[21:13:06] <_joe_> where by "that horror" I mean the abysmal mtail duct tape, yes
[21:13:10] sounds great to me
[21:19:04] <_joe_> cdanis: see for instance https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-destination=local_port_80
[21:20:11] pretty good
[21:25:08] <_joe_> there is no distinction based on http method AFAICT though
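A sketch of the "use envoy's telemetry instead of the mtail duct tape" idea: scrape response-code counters straight from an Envoy admin endpoint. The admin address here is a placeholder, and while Envoy does serve Prometheus-format stats at /stats/prometheus, the counter and label names assumed below (envoy_cluster_upstream_rq_xx, envoy_response_code_class, envoy_cluster_name) should be verified against the running build and config. As noted at 21:25, these counters carry no HTTP-method breakdown.

```python
import re
import requests

ADMIN = "http://127.0.0.1:9901"  # hypothetical Envoy admin address


def upstream_5xx_counts():
    """Scrape Envoy's Prometheus-format stats and sum 5xx counters per cluster.

    Assumes the envoy_cluster_upstream_rq_xx counter with the
    envoy_response_code_class label; names may differ per build/config.
    """
    text = requests.get(f"{ADMIN}/stats/prometheus", timeout=10).text
    pattern = re.compile(
        r'^envoy_cluster_upstream_rq_xx\{([^}]*)\}\s+([0-9.eE+-]+)$'
    )
    counts = {}
    for line in text.splitlines():
        m = pattern.match(line)
        if not m:
            continue
        labels, value = m.group(1), float(m.group(2))
        if 'envoy_response_code_class="5"' not in labels:
            continue
        cluster = re.search(r'envoy_cluster_name="([^"]*)"', labels)
        name = cluster.group(1) if cluster else "unknown"
        counts[name] = counts.get(name, 0.0) + value
    return counts


if __name__ == "__main__":
    for cluster, total in sorted(upstream_5xx_counts().items()):
        print(f"{cluster}: {total:.0f} upstream 5xx responses")
```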