[05:57:43] 10Traffic, 10InternetArchiveBot, 10Operations: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Tgr)
[05:59:21] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Engineering: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Tgr) Tagging Platform Engineering to get feedback about the optimal way of getting the page source.
[12:10:52] 10Traffic, 10Operations, 10Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10dr0ptp4kt) 05Open→03Resolved I was able to reproduce the new behavior observed by @ckoerner on a number o...
[12:44:37] 10Traffic, 10Operations, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Initial results of the 6.0.0 experiment on cp3054 are encouraging: for the past 12 hours [[ https://grafana.wikimedia.org/d/Lp_BDKJMz/em...
[14:15:09] ema: [yes it's async day, feel free to wait to read or respond till Monday] - going on the assumption of your 6.0.0 results holding and thus narrowing the window of changes a bunch... there was a change in 6.0.1 that looks like a good candidate
[14:15:15] https://github.com/nigoroll/varnish-cache/commit/a46dd95e967ac70754dd2eccfe4e72fc5f8b2590
[14:15:30] ^ this happens to add a new 10s timeout when certain socket errors happen on the backend-facing side
[14:15:43] and then there were those 10s spikes in that graph the other day...
[14:16:35] maybe we get occasional ECONNREFUSED from applayers, and it kicks in these hold timers and causes a mess, which also results in a net p75 RUM regression indirectly
[14:18:05] it's kind of a long shot, but it seems to hint in the right direction anyway
[14:18:51] err, I guess ECONNREFUSED would be a new 0.25s timeout rather than 10s. still possible
[14:19:08] EADDRNOTAVAIL would be 10s though, which could happen with a TIME_WAIT pileup or similar
[14:20:28] in any case, it's an easy theory to test. if this were the sole cause, then setting the two new parameters to 0.0 would "fix" it on 6.0.7.
[14:24:00] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Engineering: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10jbond) p:05Triage→03Medium
[14:24:54] hmmm, but the 6.1 docs say those params are new in 6.1. The 6.0.1 changes.rst mentions the bug being fixed, though, as if they were added there. Confusing...
[14:26:26] they do seem to exist in 6.0.1+
[14:28:47] going out on a further limb: if 6.0.0 does seem to work, and these holddown timers prove to be the cause, and setting them to zero returns us to prior performance, we should probably still investigate why we're getting a bunch of socket errors and whether they could be avoided in the first place, and then put the holddown timers back at some reasonable value :)
[15:06:42] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Engineering: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Thursday, as in yesterday? I'm not aware of anything that should have been running to create that massive level of requests.
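A minimal sketch of the holddown-timer test proposed at 14:20:28 above, assuming the frontend varnishd instance is named `frontend` (as in the varnishstat invocation later in the log); whether this would be applied live via varnishadm or through puppet-managed startup flags is an assumption:

```
# Zero the two holddown timers introduced alongside the 6.0.1 backend error
# handling change, at runtime on the frontend instance (no restart needed):
varnishadm -n frontend param.set backend_local_error_holddown 0.000
varnishadm -n frontend param.set backend_remote_error_holddown 0.000

# Or persistently via varnishd startup parameters:
#   varnishd ... -p backend_local_error_holddown=0.000 \
#                -p backend_remote_error_holddown=0.000
```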
[15:07:17] 10Traffic, 10InternetArchiveBot, 10Operations, 10Platform Engineering: IAbot sending a huge volume of action=raw requests - https://phabricator.wikimedia.org/T269914 (10Cyberpower678) Especially to Wikidata.
[15:16:23] bblack: re: 10s, we've got proxy.config.http.connect_attempts_timeout set to 10 in ats-be, which might be explain the graphs
[15:16:29] https://docs.trafficserver.apache.org/en/8.1.x/admin-guide/files/records.config.en.html#proxy.config.http.connect_attempts_timeout
[15:17:21] we did see those spikes on hit-local, so in theory you'd think it's unrelated, but there's the caveat of 304s from origins being considered hit-local too in case of hit-stale
[15:17:58] to confirm, we should perhaps set that to 9 or 11
[15:18:54] "might be explain" is not English but you get my point
[15:21:33] and yes, we do have backend_local_error_holddown / backend_remote_error_holddown on 6.0.7
[15:27:16] there's also a statistic to keep track of connections held down, and it's 0 on the cp30xx nodes I checked
[15:27:22] see: `varnishstat -1 -n frontend | grep helddown`
[15:30:00] so maybe that's not it, or maybe the stats are unreliable; in any case the theory sounds plausible :)
[16:31:03] yeah the stats are probably reliable though, oh well
[17:34:02] 10Traffic, 10Operations, 10Performance-Team: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Gilles)
[21:45:42] 10Wikimedia-Apache-configuration, 10Continuous-Integration-Infrastructure, 10Operations, 10Patch-For-Review: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164 (10hashar) `DirectorySlash` redirecting to http instead of canonical https is #upstream Apach...
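A sketch of the ATS-side confirmation step discussed at 15:17:58, using records.config syntax for ATS 8.x; picking 11 rather than 9, and where this setting lives in the cp hosts' puppetized ats-be config, are assumptions:

```
# Move the backend connect timeout off the suspicious 10s value so that any
# 10s spikes attributable to it would shift visibly in the latency graphs.
CONFIG proxy.config.http.connect_attempts_timeout INT 11
```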