[05:37:45] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ema)
[05:37:51] Traffic, Operations: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ema) p:Triage→Normal
[06:29:22] alright, ensure_max_age middleware deployed to ms-fe1005, I'm starting to see the first responses with Cache-Control on cp1078
[10:48:38] Traffic, Operations, serviceops, PHP 7.2 support, and 2 others: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (Joe) Open→Resolved
[12:23:08] Traffic, DC-Ops, Operations, observability, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177 (jbond) >>! In T183177#4191082, @fgiunchedi wrote: > The correctable errors check has been deployed and it is yielding some results already. Myself and @herron...
[12:27:03] cp3038 has varnish mbox trouble, reimaging to ATS
[12:28:18] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3038.esams.wmnet'] ` The log can be found in `...
[12:28:42] that should fix it :)
[13:02:54] Traffic, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (fgiunchedi) >>! In T196066#5170415, @Ottomata wrote: > I think there are a few more branches: > > - prod...
[13:10:02] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3038.esams.wmnet'] ` and were **ALL** successful.
[13:32:55] Traffic, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (Ottomata) > I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id,...
[14:22:37] CPU usage is interesting, it looks significantly better after the reimage to ATS
[14:22:41] https://grafana.wikimedia.org/d/000000610/ats-instance-drilldown?orgId=1&var-site=esams%20prometheus%2Fops&var-instance=cp3038&from=now-6h&to=now
[14:26:39] yeah. Almost every type of cpu% is down a bit, but the huge dropoff is in iowait
[14:27:11] (which makes sense!)
[14:31:23] ah I've just realised that last month cp3037 died (T222041), so we have 11/12 hosts in upload@esams
[14:31:23] T222041: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T222041
[14:34:32] (which is great, we'll complete the transition to ATS a little faster!)
[14:37:44] this is fun to watch too, the rise of esams: https://grafana.wikimedia.org/d/000000569/ats-cache-operations?orgId=1&from=now-3h&to=now
[14:39:17] and it's one ats server only, not bad
[15:20:46] ema: cp3036 is lag-alerting :)
[15:21:45] bblack: are you suggesting we should upgrade it to ATS? :)
[15:25:40] maybe!
[15:30:11] bblack: I've just found something fishy, a 503 from swift got cached on cp3038's ATS
[15:30:23] that's the reason for https://grafana.wikimedia.org/d/wI0nURqiz/ats-cluster-view?panelId=1&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&from=1557499929722&to=1557501809153
[15:30:41] 15:21 is when I've purged it
[15:31:28] my understanding is that ATS doesn't cache 503 (proxy.config.http.negative_caching_list for us is only 404 and 414), unless they have Cache-Control/Expires
[15:32:30] I don't think 503 responses from thumbor set CC though (I hope!)
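(Editorial sketch, not part of the channel traffic.) The expectation stated at 15:31:28 can be written down as a small Python model. It encodes only what that message says: an error response is expected to be cached if its status is in `proxy.config.http.negative_caching_list` (404 and 414 here) or if it carries Cache-Control or Expires. The function name `expected_cacheable` and its parameters are hypothetical, and this is a toy model of the stated expectation, not a description of ATS's actual decision logic.

```python
# Toy model of the expectation described in the log at 15:31:28.
# This is NOT ATS's real caching logic; names below are hypothetical.

# Status codes quoted in the log for proxy.config.http.negative_caching_list.
NEGATIVE_CACHING_LIST = frozenset({404, 414})


def expected_cacheable(status, headers, negative_list=NEGATIVE_CACHING_LIST):
    """Return True if, per the stated expectation, an error response is cached.

    Simplifications (assumptions of this sketch): only 4xx/5xx responses are
    modelled, and only the presence of Cache-Control/Expires is checked,
    not their values (so e.g. "no-store" is not handled).
    """
    if status < 400:
        raise ValueError("this sketch only models error responses")
    # HTTP header names are case-insensitive.
    names = {name.lower() for name in headers}
    has_freshness = "cache-control" in names or "expires" in names
    return status in negative_list or has_freshness


if __name__ == "__main__":
    # A bare 503, as in the incident discussed in the log.
    print(expected_cacheable(503, {}))
    # A 503 that carries Cache-Control.
    print(expected_cacheable(503, {"Cache-Control": "max-age=60"}))
    # A 404, which is in the negative caching list.
    print(expected_cacheable(404, {}))
```

Under this model a bare 503 is not expected to be cached, which is why the cached 503 on cp3038 looked fishy.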
[15:34:18] to be safe we could maybe explicitly unset CC in lua for status > 499
[15:40:05] (the errors in the graph are all caused by one silly client asking for the same non-existing thumb again and again, so no "real" traffic has been harmed so to say. But still)
[15:57:22] can we repro the 503 directly against swift and observe it's sending CC headers?
[15:58:16] bblack: that specific one was transient, we can try and catch another one though of course
[16:00:38] uh
[16:01:01] it seems ATS caches 503 regardless of CC, just reproduced on my workstation
[16:01:08] CONFIG proxy.config.http.negative_caching_enabled INT 1
[16:01:16] CONFIG proxy.config.http.negative_caching_list STRING 404 414
[16:01:25] and I got a 503 cached
[16:02:28] so yeah that's a bug if I read https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#negative-response-caching correctly
[16:03:15] I'll disable negative caching over the weekend and investigate on Monday
[16:04:14] s/503/503 without CC or Expires/ above
[16:06:41] the docs read funny to me; negative_caching_enabled defaults to 0, but then the prose reads "The following negative responses are cached by Traffic Server by default:"
[16:07:10] I'm guessing that's supposed to mean "these are the response codes that we consider constitute a negative response"
[16:07:14] or something?
[16:08:52] I think they mean that those are cached "by default" if you don't override negative_caching_list?
[16:08:55] confusing
[16:13:19] ok
[16:28:02] done, this was surprising!
[16:29:32] off for the weekend, ring me up if needed!
[16:30:00] bye bblack, cdanis and whoever may be lurking
[16:30:12] have a good weekend :)
[18:22:14] Traffic, Operations, Performance-Team (Radar): Support brotli compression - https://phabricator.wikimedia.org/T137979 (ori) >>! In T137979#4118215, @BBlack wrote: > Re-reading above: probably the better blend of options would be to swap gzip for brotli in Varnish one-for-one (without the whole storin...
[18:59:54] "As it has only happened once looks like a one time occurrence." <- Thank you Juniper support
[19:01:42] "It was a memory bit-flip from a random cosmic ray, don't worry it'll never happen again"
[19:02:21] (said Sun support to me I don't know how many times about crashes in their software we reported, many of which were eventually fixed in future bug patches)
[21:55:31] netops, Operations: esams/knams: advertise 185.15.58.0/23 instead of 185.15.56.0/22 - https://phabricator.wikimedia.org/T207753 (ayounsi)