[09:28:17] <_joe_> want to see a nice rabbithole of a request flow? https://phabricator.wikimedia.org/T249535#6035576
[09:29:16] <_joe_> client -> edge -> mediawiki api -> envoy -> mw parsoid -> 301 to another public url -> edge -> mw appserver for a parsoid url -> 404
[09:58:56] lovely
[10:01:11] mysql
[10:01:16] wrong window, sorry
[15:45:18] robh: ping me when you're awake and working? I have a question about Doneva's recent email
[15:45:26] im here
[15:46:09] great!
[15:46:48] So I think either I asked for the opposite of what I wanted, or Doneva misread… I was trying to get her to change the resolvers from e.g. cloud-ns0.wikimedia.org to e.g. ns0.openstack.eqiad1.wikimediacloud.org
[15:47:03] but then the whois pastes she included show cloud-ns0 as the resolver still...
[15:47:24] Can you confirm that I asked for what I thought I asked for, and that she did the opposite? (Or tell me I'm confused)
[15:47:25] im reading this now and honestly was ignoring it otherwise ;D
[15:47:34] i just figured you were asking for legit stuff to domains we own heh
[15:47:38] but indeed, i see what you mean
[15:47:38] as you should've :)
[15:47:51] the 'new' whois lists the old nameservers
[15:47:54] I'm happy to follow up once I have external confirmation that I'm not hallucinating
[15:47:57] you are correct, you wanna reply and gently correct?
[15:48:01] you are correct, she did it wrong
[15:48:03] yeah, I will, thanks
[15:50:07] o/
[16:11:29] andrewbogott: email makes sense to me =]
[16:11:39] great, thanks
[16:11:42] so hopefully they'll fix it quickly for ya
[16:12:03] also i wonder if maybe we shouldn't find someone more on point to approve these changes, but meh, i dont mind having out-of-scope power from the old days ;D
[16:12:10] bwahahahaha
[16:12:14] ahem.
[16:12:32] I have no idea who that person would be.
[16:12:57] maybe traffic cuz domains? really not sure.
[16:13:21] I like the superlocking mind you, and I dont mind being the person to sign off on things, but its only me cuz ive been here so long ;D
[16:40:08] robh: old hats are the hardest to get rid of. :)
[16:41:06] super-dns-admin@wikimedia.org :p
[16:41:16] that's kind of what it is
[16:43:08] mutante: it cannot be an alias
[16:43:14] it has to be a legally documented individual ;D
[16:43:27] meta-dns
[16:43:28] right now its me, someone in legal, and i think maybe mark
[16:43:42] and they know to contact me cuz mark is busy being director, heh
[16:44:04] robh: gotcha, so it's like mandatory bus factor.. hrmm
[16:44:20] yep, its a good one though as it prevents our domains being transferred
[16:44:24] or nameservers changed
[16:44:38] I am just amused its still one of my hats
[17:05:07] andrewbogott: i just got email confirmation back from the automated system that nameservers were changed correctly for those 3 domains
[17:12:06] herron: o/ any objection if I merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/586460/ ?
[17:13:50] elukey: hey, sure. fwiw it will install phatality, which wouldn't do harm, but also sounds like it isn't needed. shouldn't be a blocker though if this is needed sooner than later
[17:14:25] herron: ah ok, super ignorant about it, is there a way to exclude it via hiera or should it be reworked?
[17:14:55] ah no, it is buried in the class
[17:15:14] maybe we could add class { '::kibana::phatality': } conditionally?
[17:15:16] we could add some params and a hiera lookup to enable/disable it
[17:15:17] yea
[17:15:29] super, following up in the code change, thanks!
[17:15:37] kk sounds good!
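For anyone who wants to poke at a redirect rabbithole like the 09:29 flow above from the outside, here is a minimal client-side sketch that follows a chain one hop at a time and prints each response. It is not tied to the actual MediaWiki/ATS/envoy internals; the URL in the `__main__` block is a placeholder and the only dependency is the `requests` library.

```python
import requests
from urllib.parse import urljoin


def trace_redirects(url, max_hops=10):
    """Follow a redirect chain one hop at a time, printing each hop."""
    for hop in range(max_hops):
        # allow_redirects=False so every intermediate response is visible
        resp = requests.get(url, allow_redirects=False, timeout=10)
        print(f"hop {hop}: {resp.status_code} {url}")
        location = resp.headers.get("Location")
        if location is None:
            return resp  # terminal response (e.g. the final 404)
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    # Hypothetical URL; substitute the Parsoid/REST path from the task.
    trace_redirects("https://en.wikipedia.org/api/rest_v1/")
```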
[17:38:47] robh, Doneva's email is asking about glue records, but that's just the IP addresses I gave her already, right? I'll just respond with "yes, do that" unless I'm missing something.
[17:42:32] seems ok to me yeah, just reply yes
[17:42:40] she just cannot assume anything and has to confirm any record changes
[17:46:49] thx
[17:47:01] neither can I apparently
[18:40:40] I'm having a hard time convincing myself we shouldn't have paged on this: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1586213400000&to=1586217000000
[18:40:52] (we did not, and I think we should have, on one of these signals alone)
[18:42:46] this was last night's incident?
[18:42:50] yes
[18:44:37] when looking at the 7 day graph there's another spike on the 5th at around 10:00, not sure what that is
[18:44:52] a high rate (> several hundred rps) of ats-be-perceived 50X is possibly worth paging on
[18:45:11] the ats backend errors spike however stands out
[18:45:26] this brings up a thing i was thinking about the other day, like has anyone implemented HOTSAX-type algorithms for time series so that we can be alerted on anomalies like that
[18:47:37] from a quick glance the last time we had >500 rps of ats-be-perceived 50x was 2020-02-11 around 21:00 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200211-caching-proxies
[18:48:13] yeah sorry, I should have been clear: the two spikes in the 7 days were of varnish and ats availability drops
[18:48:24] a lesser time (~150 rps) was 2020-03-25 around 11:30 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200325-codfw-network
[18:49:18] and the final recent time (8k rps!) was 2020-02-04 16:10 https://wikitech.wikimedia.org/wiki/Incident_documentation/20200204-app_server_latency
[18:50:03] anyway, for last night's incident, we didn't ever actually page, and I think that's wrong
[18:50:06] https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1580232305973&to=1586285326413 looking at this to see spikes in that
[18:50:22] yeah I think paging should happen in the future
[18:53:27] i've added a note to actionables
[20:27:02] <_joe_> cdanis: ok, so, any threshold that only catches recorded incidents is probably good enough, but also probably mostly redundant?
[20:29:40] <_joe_> but I agree we should have some alerts on edge-cache availability
[20:33:00] <_joe_> cdanis: I think this https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1586211191414&to=1586219915298&fullscreen&panelId=20 is a better signal
[20:44:31] _joe_: the point is, we didn't get paged on anything last night; people who were around happened to notice human reports
[20:45:10] but yes, the appserver error rate is also a fine signal
[20:48:25] _joe_: do you know offhand if there's a task to write some recording rules for the queries used on the appservers red dashboard?
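As a rough way to prototype the paging threshold discussed above (ats-be-perceived 50X above "several hundred rps") before turning it into a real alert, something like the following could be run against the Prometheus HTTP API. The Prometheus URL and the `trafficserver_backend_responses_total` metric/labels are illustrative assumptions; the actual names in the production Prometheus would need to be checked.

```python
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint
# Hypothetical PromQL: per-second rate of backend 5xx responses; the real
# ats-be metric and label names may differ and should be verified.
QUERY = 'sum(rate(trafficserver_backend_responses_total{status=~"5.."}[5m]))'
THRESHOLD_RPS = 500  # "several hundred rps", per the discussion above


def current_5xx_rate():
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # An instant query wrapped in sum() returns at most one series.
    return float(results[0]["value"][1]) if results else 0.0


if __name__ == "__main__":
    rate = current_5xx_rate()
    print(f"backend 5xx rate: {rate:.1f} rps")
    if rate > THRESHOLD_RPS:
        print("would page: 5xx rate above threshold")
```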
[21:01:46] _joe_: I'm now skeptical of this as a good signal: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1580828885378&to=1580842618251&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=20
[21:01:58] <_joe_> cdanis: no, I never got around to that
[21:02:06] ok
[21:02:30] <_joe_> cdanis: I mean if the error rate goes over 5%
[21:02:53] the time window to which I linked includes https://wikitech.wikimedia.org/wiki/Incident_documentation/20200204-app_server_latency
[21:03:31] <_joe_> yes, that was a different kind of problem, indeed
[21:03:33] whereas, compare https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1580828885378&to=1580842618251&fullscreen&panelId=14
[21:03:37] <_joe_> we didn't spit out as many errors
[21:04:39] <_joe_> but yes, we would probably need to page on either, with pretty generous limits
[21:05:13] yah
[21:12:46] <_joe_> cdanis: I just realized we can probably get rid of that horror by using envoy's telemetry
[21:12:58] you mean, the entire mtail portion?
[21:13:06] <_joe_> where by "that horror" I mean the abysmal mtail duct tape, yes
[21:13:10] sounds great to me
[21:19:04] <_joe_> cdanis: see for instance https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=api_appserver&var-destination=local_port_80
[21:20:11] pretty good
[21:25:08] <_joe_> there is no distinction based on http method AFAICT though
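A sketch of the "use envoy's telemetry instead of the mtail duct tape" idea: scrape response-code counters straight from an Envoy admin endpoint. The admin address here is a placeholder, and while Envoy does serve Prometheus-format stats at /stats/prometheus, the counter and label names assumed below (envoy_cluster_upstream_rq_xx, envoy_response_code_class, envoy_cluster_name) should be verified against the running build and config. As noted at 21:25, these counters carry no HTTP-method breakdown.

```python
import re
import requests

ADMIN = "http://127.0.0.1:9901"  # hypothetical Envoy admin address


def upstream_5xx_counts():
    """Scrape Envoy's Prometheus-format stats and sum 5xx counters per cluster.

    Assumes the envoy_cluster_upstream_rq_xx counter with the
    envoy_response_code_class label; names may differ per build/config.
    """
    text = requests.get(f"{ADMIN}/stats/prometheus", timeout=10).text
    pattern = re.compile(
        r'^envoy_cluster_upstream_rq_xx\{([^}]*)\}\s+([0-9.eE+-]+)$'
    )
    counts = {}
    for line in text.splitlines():
        m = pattern.match(line)
        if not m:
            continue
        labels, value = m.group(1), float(m.group(2))
        if 'envoy_response_code_class="5"' not in labels:
            continue
        cluster = re.search(r'envoy_cluster_name="([^"]*)"', labels)
        name = cluster.group(1) if cluster else "unknown"
        counts[name] = counts.get(name, 0.0) + value
    return counts


if __name__ == "__main__":
    for cluster, total in sorted(upstream_5xx_counts().items()):
        print(f"{cluster}: {total:.0f} upstream 5xx responses")
```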