[01:37:05] bblack: do you know if this still matters at all nowadays?
[01:37:07] modules/mtail/files/test/varnish_test.py
[01:37:21] there is the phab server host name in there and we just switched it
[01:37:40] but also jenkins-bot hates me just changing it
[01:37:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/552604
[02:13:58] mutante: yeah it doesn't really matter to change it to match prod, it's just an artificial test
[02:14:23] bblack: ok, thanks for confirming. i'll ignore that for right now then
[02:14:28] but do something with it later
[02:14:31] but modules/mtail/files/test/logs/varnishbackend.test has the matching bit
[02:14:37] ah! ack
[02:14:40] if you change one, you have to change the other
[02:14:45] gotcha
[02:15:54] amending
[03:42:11] 10Traffic, 10Operations: Make DNS operations resilient against predictable failures - https://phabricator.wikimedia.org/T239711 (10colewhite) p:05Triage→03Normal
[05:15:30] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn)
[05:16:06] 10Traffic, 10Operations, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) https://ticket.wikimedia.org (OTRS) has been switched to use https://ticket.discovery.wmnet (envoy on mendelevium).
[08:24:59] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) As expected the SSL cert rollback undoes all of the TLS handshake regression: {F31456216, size=full} https://grafana.wiki...
[08:26:19] 10Traffic, 10Operations, 10Performance-Team, 10Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles)
[13:25:26] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns5001.wikimedia.org', 'dns3001.wikimedia.org', 'dns4001.wikimedia....
[14:11:39] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] ` Of which those **FAILED**: ` ['dns3001.wikimedia.org'] `
[15:05:45] 10Traffic, 10Operations, 10Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['ganeti3003.esams.wmnet'] ` The log can be found in `/var/log/wmf-aut...
[15:28:35] 10Traffic, 10Operations: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3003.esams.wmnet'] `
[15:30:50] 10Traffic, 10Operations: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10BBlack) 05Open→03Resolved a:03BBlack Our `ns2` service address is now re-routed to `dns3001`, and `ganeti3003` is reimaged back to `spare::system`.
[15:50:16] I was wondering, do we (or even can we) send other header info with navtiming metrics?
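[Editor's note: the question above kicks off a discussion, continued just below, about splitting navigation-timing metrics by cache status via the X-Cache header. A minimal Python sketch of what parsing that header could look like; the header layout ("<host> <status>[/<hits>]" entries, comma-separated) and the way a single status label is derived are assumptions for illustration, not anything decided in this conversation.]

```python
import re
from collections import Counter

# Assumed X-Cache entry format: "<host> <status>[/<hits>]", comma-separated,
# one entry per cache layer. This is a guess for illustration only.
ENTRY_RE = re.compile(r'(?P<host>\S+)\s+(?P<status>hit|miss|pass|int)(?:/(?P<hits>\d+))?')

def parse_x_cache(value):
    """Split an X-Cache header value into (host, status, hit-count) tuples."""
    return [(m.group('host'), m.group('status'), int(m.group('hits') or 0))
            for m in ENTRY_RE.finditer(value)]

def overall_status(value):
    """Collapse per-layer entries into one label for bucketing navtiming samples;
    'hit' wins if any layer reports one (an arbitrary choice for the sketch)."""
    statuses = [s for _, s, _ in parse_x_cache(value)]
    if 'hit' in statuses:
        return 'hit'
    if 'pass' in statuses:
        return 'pass'
    return 'miss' if statuses else 'unknown'

if __name__ == '__main__':
    samples = ['cp1083 miss, cp3056 hit/19, cp3050 hit/6',
               'cp1089 pass, cp3062 pass']
    print(Counter(overall_status(s) for s in samples))
    # -> Counter({'hit': 1, 'pass': 1})
```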
[15:50:32] it'd be nice to be able to split them on other factors, like cache hit
[15:50:38] yeah
[15:50:41] it would be nice :)
[15:50:47] or cache host!
[15:50:59] I don't know how much control JS has over the situation there, since it's a browser thing
[15:51:47] parsing X-Cache would give us plenty of info
[15:52:04] at the end of the day, the 15% regression for the EU is "interesting", and it's a good driver to go find some low-hanging fruit
[15:52:24] but we fundamentally changed a lot of things (the software at two layers, the lack of second-level backend caching, the use of TLS to the applayer, etc)
[15:52:54] it would be shocking if there weren't a perf diff in some direction or other, and the drivers for the change are more important than any reasonable perf loss, basically.
[15:53:37] I wonder if eqsin saw a similar issue? It has even higher latency to the applayer than esams does.
[15:53:53] yeah eqsin did also see a similar regression
[15:54:15] from the ats-be switch or the ats-tls switch? or were they too close together?
[15:54:16] or rather, "Asia"
[15:54:45] asia had the certificate impact as well, but the timing would've been more separated
[15:54:53] since the ats conversions there happened much earlier
[15:56:05] right
[15:56:07] see https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1&from=now-35d&to=now&var-metric=responseStart&var-location=Asia&var-prop=p75&fullscreen&panelId=3
[15:56:23] it seems to align well with ats-be conversions
[15:57:05] Oct 28 was the first reimage, Nov 4 the last
[15:58:04] my suspect of the day is lack of HFP in ATS
[16:00:19] by using "read-while-writer" and "open_write_fail_action" (coalescing in ATS-world) we essentially have sequential access to the origins for lots of stuff which is just hfp in varnish-land
[16:00:35] planning on monitoring that with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554503/
[16:01:36] but a glance at the logs seems to show that we have a significant amount of >100ms Cache{Read,Write}Time for uncacheable stuff
[16:02:46] so now the plan is to plot in prometheus ttfb from the varnish-fe<->ats-be perspective, as well as cache read/write time, and then disable coalescing altogether on ats-be for a few hours and see what happens
[16:03:14] I thought we had a switch we could flip for ats-be to basically stop all coalescing?
[16:03:40] oh nevermind, now that I finished reading your output, I see you already plan to flip that switch :)
[16:04:17] I remember us talking about it a while back, mentally I had moved past it and figured it was already disabled, and I was worried that now we're running out of low-hanging fruit to chase.
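[Editor's note: on the 16:01 observation above about >100ms Cache{Read,Write}Time for uncacheable responses, a rough sketch of the kind of one-off log filtering that could surface such lines. The "CacheReadTime:"/"CacheWriteTime:" key:value layout and the lowercase "uncacheable" marker are assumptions about the ats-be log format, not the real field names.]

```python
import re
import sys

# Assumed log layout: whitespace-separated "Key:value" pairs per line, with
# CacheReadTime / CacheWriteTime in milliseconds and some textual marker for
# uncacheable responses. All of this is hypothetical, not the prod format.
FIELD_RE = re.compile(r'(CacheReadTime|CacheWriteTime):(\d+)')
THRESHOLD_MS = 100

def slow_cache_ops(lines, only_uncacheable=True):
    """Yield (line, field, ms) for cache read/write times over the threshold."""
    for line in lines:
        if only_uncacheable and 'uncacheable' not in line.lower():
            continue
        for field, ms in FIELD_RE.findall(line):
            ms = int(ms)
            if ms > THRESHOLD_MS:
                yield line.rstrip(), field, ms

if __name__ == '__main__':
    # e.g. pipe a chunk of the ats-be log through this on stdin
    for line, field, ms in slow_cache_ops(sys.stdin):
        print(f'{field}={ms}ms  {line}')
```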
[16:04:34] so yeah, good news to me :)
[16:04:43] :)
[16:05:30] I thought I did disable coalescing on cp3050 by setting cache.max_open_read_retries=-1 and cache.max_open_write_retries=1, but that's only part of the coalescing implementation
[16:05:39] the other is read-while-writer
[16:06:19] essentially as far as I understand the former is used to coalesce requests while an origin server request is in flight but the headers haven't been received from the origin yet
[16:06:41] the latter is used while the response body is being fetched by ats
[16:07:42] anyways, let's add the metrics to atsbackend.mtail and see
[16:48:14] bblack: I haven't reverted the digicert-2019a change yet BTW
[16:49:54] yeah I pushed up a revert, but didn't rebase/merge/etc yet
[16:49:59] I'm lost in meetings + dns stuff
[16:50:08] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554535/
[16:50:21] apparently I didn't even set a bug#
[16:52:45] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns1001.wikimedia.org', 'dns2001.wikimedia.org'] ` The log can be fo...
[17:23:20] 10Traffic, 10netops, 10Operations, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns1001.wikimedia.org', 'dns2001.wikimedia.org'] ` and were **ALL** successful.
[22:29:35] 10Wikimedia-Apache-configuration, 10Commons, 10SDC General, 10WikibaseMediaInfo, and 4 others: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 (10Multichill) >>! In T222321#5710031, @EBernhardson wrote: > In summary, it seems we need to merge the patch[1] for the /entity/ e...
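[Editor's note: circling back to the 16:02–16:07 plan above (plot varnish-fe<->ats-be TTFB plus cache read/write time, via atsbackend.mtail). mtail programs are written in mtail's own DSL; the Python sketch below only illustrates the same log-line-to-Prometheus-metric mapping using the prometheus_client library. The metric and log field names are assumptions, not the ones actually added to atsbackend.mtail.]

```python
import re
import sys

from prometheus_client import Histogram, start_http_server

# Hypothetical metric names for the sketch.
TTFB_SECONDS = Histogram('ats_backend_ttfb_seconds',
                         'Time to first byte from ats-be, as seen by varnish-fe')
CACHE_RW_SECONDS = Histogram('ats_backend_cache_rw_seconds',
                             'Cache read/write time on ats-be', ['op'])

# Assumed log layout: "TTFB:<ms> ... CacheReadTime:<ms> ... CacheWriteTime:<ms>".
LINE_RE = re.compile(
    r'TTFB:(?P<ttfb>\d+).*CacheReadTime:(?P<crt>\d+).*CacheWriteTime:(?P<cwt>\d+)')

def observe(line):
    m = LINE_RE.search(line)
    if not m:
        return
    # Log values assumed to be milliseconds; convert to seconds for Prometheus.
    TTFB_SECONDS.observe(int(m.group('ttfb')) / 1000.0)
    CACHE_RW_SECONDS.labels(op='read').observe(int(m.group('crt')) / 1000.0)
    CACHE_RW_SECONDS.labels(op='write').observe(int(m.group('cwt')) / 1000.0)

if __name__ == '__main__':
    start_http_server(9101)      # expose /metrics for Prometheus to scrape
    for line in sys.stdin:       # e.g. tail -F the ats-be log into this script
        observe(line)
```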