[12:32:12] bblack: morning! What is this about? https://phabricator.wikimedia.org/T142410
[13:03:21] ema: analytics has been looking at the varnishkafka cache_status field to make inferences. And that comes basically straight from varnish's own stats on the frontend (which aren't really as accurate as X-Cache about various dispositions, and don't account for deeper layers).
[13:03:41] ema: so, if they're going to build stats that use cache_status, we need to make it actually reflect reality.
[13:18:21] ema: also related: https://phabricator.wikimedia.org/T128132
[13:19:23] (translating some of that, because it took me a few minutes to grasp what's going on: at some point in the past, we created an anonymized historical dataset of our request traffic publicly available, and various third parties used the dataset to help work on caching algorithms. They're trying to generate/analyze another such dataset now)
[13:59:58] bblack: there's a spike of 500s apparently https://grafana-admin.wikimedia.org/dashboard/db/varnish-http-errors
[14:00:34] bblack: also quite a few icinga criticals on cp* hosts (Connection refused or timed out)
[14:00:56] network?
[14:01:00] possibly
[14:01:03] I see pvoid messing with interfaces
[14:01:16] yeah there's a critical on cr2-eqiad router ifaces too
[14:02:18] bblack: the errors started before para.void's log though
[14:03:08] of course, paravoid was fixing, not causing :)
[14:03:12] he was investigating and responding
[14:03:40] yes, I was fixing, not causing
[14:03:53] paravoid: <3
[14:03:55] set the interface as down, mailed our vendor (noc@ is cc'ed)
[14:06:16] kinda sad that a single link flapping can cause mayhem
[14:06:30] first hint in varnish stats is 14:39, but intermittent and it didn't rocket up higher until ~13:55, then over with at 14:01
[14:06:30] OSPF fast reroute could have helped, on my TODO for ages :/
[14:06:44] Aug 9 13:48:19 re0.cr2-eqiad mib2d[1715]: SNMP_TRAP_LINK_DOWN: ifIndex 549, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-5/2/3
[14:06:52] heh all kinds of errors in the above, let's scratch that out
[14:06:52] faidon@re0.cr2-eqiad> show log messages| match SNMP_TRAP_LINK_DOWN | match xe-5/2/3 | match "Aug 9" | count
[14:06:55] Count: 126 lines
[14:07:05] first hint in varnish stats is 13:49, but intermittent and it didn't rocket up higher until ~13:55, then over with at 14:01
[14:07:37] 0 2016-08-09 13:59:41 UTC by faidon via cli commit synchronize
[14:07:38] those stats are 1-minute aggregates, so lines up with the above
[14:08:00] yeah, I saw weird stuff happening in traceroutes too
[14:16:30] Traffic, Operations: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2536531 (BBlack) I didn't think to link into this bug in the commitmsg (oops!) but I patched our chapoly logic further in: https://gerrit.wikimedia.org/r/#/c/303700/ . This was based on analyzing...
[14:35:07] oh wow
[14:35:09] nice work bblack
[15:32:44] Traffic, Analytics, Operations: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2536733 (Nuria) @elukey : If I understand things right this changeset will emit a new header that we need to publish via varnishkafka. The header value should replace whatever...
[15:37:41] Traffic, Analytics, Operations: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2533998 (BBlack) @Nuria + @elukey - the second patch just merged above takes care of the varnishkafka part. So this should gradually go live over the next ~30 minutes.
[15:50:14] netops, Operations, ops-codfw: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2536841 (Papaul) a:Papaul>RobH @RobH This is complete let me know if you have any questions
[17:08:12] netops, Operations, ops-codfw: audit network ports in a4-codfw - https://phabricator.wikimedia.org/T140935#2537105 (RobH) Open>Resolved ports ge-4/0/0 through ge-4/0/11 were all labeled wrong on the switch port description. also removed the description on ge-4/0/39. >>! In T140935#2482666,...
[17:27:59] elukey, bblack: we need a follow-up to the RB doc change, to keep the previous canonical doc URL working: https://gerrit.wikimedia.org/r/#/c/303825/2
[18:07:29] gwicke: ok
[18:11:51] gwicke: slightly-amended so /api/rest_v1 still works (no trailing slash)
[18:13:57] gwicke: oh, now I see that was discussed in PS1 too heh. Either way, it doesn't break anything and seems better this way.
[18:15:30] yeah, it should not make a difference
[18:15:37] it's a dead code path
[18:16:41] yeah the trailing-slash thing is tricky. technically it has no meaning. /foo/ and /foo are just distinct sets of characters making up unique URL paths.
[18:17:18] but it's kinda baked into server/browser history about filesystem traversal, etc, too, that /foo/ could be /foo, and that /foo might commonly redirect to /foo/, etc
[18:17:35] (and some users' and coders' brains)
[18:18:24] if the "real" one never had a trailing slash it wouldn't be worth worrying about, but it seems like if the real canonical one is /foo/, supporting /foo is wise.
[18:18:29] if it's easy to set up a redirect in Varnish, then that would be nice
[18:18:52] I'm not even sure we send /api/rest_v1 to RB at all
[18:18:54] alternatively, we could pass it through & set up a redirect in RB
[18:19:01] no, we don't
[18:19:04] which is why this is dead code
[18:20:40] :)
[18:20:58] I guess that goes back to the whole api namespacing thing, too.
[18:21:01] kinda
[18:21:24] I could go edit all the relevant VCL and make the trailing slash omittable, but... meh?
[18:22:05] next step would be some kind of "did you mean?" suggester
[18:22:16] levenshtein distance, or something
[18:26:10] :P
[20:13:47] netops, Discovery, Elasticsearch, Operations, Discovery-Search-Sprint: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2537985 (BBlack) Open>Resolved My comment above was under the false assumption that relforge100[1...
[20:14:05] netops, Discovery, Elasticsearch, Operations, Discovery-Search-Sprint: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2537993 (Gehel) Thanks to @BBlack, routers are now configured. I tested from `cirrus-browser-bot` and con...