[05:43:20] Traffic, Operations: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (Vgutierrez) Implement logging of SSL Elliptic Curve used: https://github.com/apache/trafficserver/pull/5724 has been already merged into master. The API...
[09:35:01] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[09:43:00] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi)
[10:23:26] $ curl -s http://localhost:3904/metrics | grep ^ats | head -n1
[10:23:26] ats_backend_requests_seconds_bucket{backend="swift.discovery.wmnet",le="0.05",method="GET",prog="atsbackend.mtail"} 358
[10:23:32] \o/
[10:28:36] nice
[10:31:01] sweet
[10:31:42] godog: now I guess we've got to do something with those values :)
[10:31:49] I'll open a CR shortly
[10:32:40] heheh indeed, including porting dashboards
[10:33:20] that should, in theory, just a matter of s/varnish/ats/ in metric names I think?
[10:33:41] s/should/should be/
[10:37:16] in theory that's correct yet
[10:37:19] yes
[10:39:11] and in theory, theory and practice are the same thing right?
[10:41:31] that's what theory likes to think!
[10:41:41] in practice they aren't
[10:41:53] exactly!
[10:42:18] for instance: in theory git fetch against gerrit should take less than 32 minutes
[10:46:17] LOL
[10:47:40] Traffic, Operations, observability, Patch-For-Review, User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (fgiunchedi)
[10:50:26] also jenkins check failed twice in a row without a valid reason, now it's passing
[10:51:38] godog: going for lunch now, please take a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/523898/ when you have a sec!
[10:53:12] will do
[11:07:22] bblack: let's feed ncredir some wikipedia non-canonical domains? https://gerrit.wikimedia.org/r/c/operations/dns/+/523902
[11:07:39] bblack: those are already in the non-canonical TLS certs set and configured in the redirection rules
[11:08:12] LOL.. damn timezones, it's ~5 AM in texas right now
[11:08:18] :_)
[11:08:29] (6 AM)
[13:07:47] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi)
[13:26:53] vgutierrez: \o/
[13:26:59] I woke up late today!
[13:27:25] I have one thing to add, re: ncredir, putting up a patch for yours to depend on shortly:
[13:28:19] ok
[13:28:24] https://gerrit.wikimedia.org/r/#/c/operations/dns/+/523924/
[13:28:33] (I cherried yours onto it and reuploaded already)
[13:29:09] thx
[13:29:21] I'll merge them later
[13:29:33] ok!
[13:31:50] btw what do you think about enabling HSTS on ncredir?
[13:31:59] not right away of course
[13:38:53] eventually yeah
[13:39:05] and preload too
[13:40:10] submitting and verifying preload is annoying when we get to the "hundreds of junk domains" case, but we can probably make a simple script or something.
[13:40:40] maybe even aim to eventually automate it with something that e.g. runs once a day or once a week and just rips through the whole list, queries preload status, and enables any that aren't enabled.
[13:41:09] (using the https://hstspreload.org/ interface)
[13:43:57] I guess re-running the list seems wasteful in the long run given there will almost never be anything to do. but otherwise it seems easy to miss some new ones.
[13:44:29] (and most other ways to cull the excess work there involve tracking state on our end somewhere)
[13:46:17] we could maybe get the best of both worlds by first downloading Chromium source repo's list to cull the ones that have been in long enough to show up there.
[13:46:59] (fetch their json file from git, remove from our big list all domains that are already in chromium upstream json, then use the hstspreload.org API to check->set the remainder)
[13:48:00] the repo is kinda big though, and fetching that one file over https is ~87MB currently. I guess it could be worse.
[13:48:22] https://chromium.googlesource.com/chromium/src/+/master/net/http/transport_security_state_static.json is the URI of the bulk json for chromium
[13:48:39] (no point clicking that in a browser, it's a huge json file)
[13:49:58] ugh that's some awful html-ized view of it
[13:51:02] the real file in the repo is only ~8.6MB
[13:51:33] just have to find a convenient way to download the latest raw copy without pulling the whole chromium repo. Some kind of shallow clone of a subpath is possible with git?
[13:53:55] yeah maybe something like https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository/52269934#52269934
[13:56:32] oh duh, use github mirror's raw file URIs
[13:56:37] https://raw.githubusercontent.com/chromium/chromium/master/net/http/transport_security_state_static.json
[14:41:51] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by otto@cumin1001 for hosts: `cloudvirtan[1001-...
[15:04:33] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata) Alright, nodes are role spare::system and decommed/downtimed in icinga.
[15:04:41] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata)
[15:06:39] netops, Analytics, Analytics-Kanban, Operations, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Ottomata) @cmjohnson back atcha :)
[15:19:31] Traffic, Operations, CommRel-Specialists-Support (Jul-Sep-2019), Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Elitre) @Pruem ^^^ :)
[15:42:10] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (ayounsi) Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possible to change the CNAMEs instead?
[15:43:14] Traffic, Operations, ops-codfw, Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (Papaul) we will be replacing lvs2006 with lvs2010
[15:43:34] Traffic, Operations, ops-codfw, Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (Papaul) p:High→Lowest
[15:48:54] Traffic, Operations, Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (BBlack) Open→Resolved a:Vgutierrez >>! In T203194#5308402, @MoritzMuehlenhoff wrote: > @Vgutierrez The firmware update on the NICs fixed this for good, right? Can we clos...
[16:01:23] netops, Operations, Patch-For-Review, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi) >>! In T228275#5341475, @ayounsi wrote: > Network devices are set to use the CNAMEs syslog.codfw.wmnet and syslog.eqiad.wmnet is it possibl...
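(editor's sketch, re: the HSTS preload culling idea discussed between 13:46 and 13:56) A rough Python illustration of the "remove from our big list everything already in the Chromium JSON" step, not anything that was actually written in the channel. It assumes the raw file has a top-level "entries" array whose items carry "name" and "mode" fields, and that it contains full-line `//` comments that need stripping before parsing; the function names are hypothetical. The follow-up check->set against hstspreload.org is deliberately left out, since its API was not described in the discussion.

```python
import json
import urllib.request

# Raw-file URL quoted in the discussion (GitHub mirror of Chromium).
CHROMIUM_LIST_URL = (
    "https://raw.githubusercontent.com/chromium/chromium/master/"
    "net/http/transport_security_state_static.json"
)


def fetch_chromium_list(url=CHROMIUM_LIST_URL):
    """Download the raw preload list as text (several MB)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def preloaded_names(raw_text):
    """Return the set of domains listed with HSTS enforcement.

    Assumption: comments only appear as whole lines starting with '//',
    and enforced entries are marked with "mode": "force-https".
    """
    kept = [
        line for line in raw_text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    data = json.loads("\n".join(kept))
    return {
        entry["name"].lower()
        for entry in data.get("entries", [])
        if entry.get("mode") == "force-https"
    }


def domains_to_check(our_domains, raw_text):
    """Domains from our list that are not yet in the upstream list."""
    ours = {d.lower() for d in our_domains}
    return sorted(ours - preloaded_names(raw_text))
```

Something like `domains_to_check(our_list, fetch_chromium_list())` would then yield only the remainder that still needs a status query, which keeps a daily or weekly run cheap without tracking state locally.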
[16:49:22] Traffic, CX-cxserver, Citoid, Operations, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (Pchelolo)
[16:53:54] $ curl -v https://en-wp.com/wiki/Special:Random -o /dev/null 2>&1 |fgrep -i location:
[16:53:55] < location: https://en.wikipedia.org/wiki/Special:Random
[16:53:56] wonderful
[16:54:06] it just works
[17:00:42] :)
[17:10:39] nice to see how telegram is able to show a proper preview of https://es.wikipedia.com
[17:11:04] that wouldn't work before due to the awful TLS errors
[17:11:37] bblack: hmm I'm wondering.. it makes sense to keep the staging time of a week for the non-canonical redirect domains or that's too conservative?
[17:30:25] staging time is on every renewal right?
[17:31:30] vgutierrez: I'd say keep it. The kinds of people that pass around links to non-canonical domains and/or follow them are probably the same ones with their clocks set horribly-wrong-enough to break things without it :)
[17:32:09] yeah, on every renewal
[17:32:20] it's skipped for new certs of course
[17:32:31] vgutierrez: this is awesome
[17:32:44] bah.. just nginx doing its magic :)
[17:33:27] but yeah.. on http traffic we are improving VS the old setup, we are saving a few 301s and redirecting to https at the first chance
[17:34:05] it's going to be even better when we have ganeti clusters deployed on every edge DC
[19:01:58] Traffic, Operations, Phabricator, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (JAufrecht)
[19:15:38] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)