[08:41:06] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2440815 (elukey) Finally we found a repro thanks to https://bz.apache.org/bug...
[08:46:23] Hi everyone, I have a few questions with regard to https://phabricator.wikimedia.org/T128132 (which is about compiling a request data set to simulate the performance of cache policies on WMF traffic).
[08:47:06] The Analytics team could compile such a dataset from Hive (Analytics/Data/Webrequest). However, one concern is the size when creating a data set of all of WMF's request traffic (see the discussion on the phabricator item).
[08:49:17] So, I thought we might be able to just focus on the traffic served by a few caches, which brings me to questions about request routing.
[08:53:35] If I want to simulate the performance of the first two tiers of Varnish caches, i.e., Front(mem) and Local(disk), how could we limit the request data to get exactly those requests routed to those caches?
[08:55:38] The hive data (according to https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest) gives us the x_cache field. So, I think we should be able to query based on a subset of Front(mem) and Local(disk) caches. E.g., look only at cp3045 or so.
[08:58:52] But then, I understand that request routing between memory and disk caches is different. So, when I simulate this two-level hierarchy, requests to cp3045 might go to a whole set of disk caches (not just one). So, in order to simulate the impact of the caching policy on the memory cache, we'd actually need to look at several memory caches and disk caches. Am I making sense?
[09:41:19] dberger: hi!
A single HTTP request hits exactly one memory cache and possibly multiple disk caches
[09:43:35] if you want to work on the first two layers only, Front(mem) and Local(disk) using your terminology, you could limit to a single PoP
[09:52:00] Traffic, Operations, Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#1531608 (ema) a:ema
[10:10:30] Traffic, Operations, Wikipedia-iOS-App-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2440936 (MoritzMuehlenhoff) p:Triage>Normal
[10:12:16] Traffic, Analytics, Operations, Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2440940 (MoritzMuehlenhoff) p:Triage>Normal
[10:14:54] Traffic, Discovery, Operations, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2440942 (MoritzMuehlenhoff) p:Triage>Normal
[10:17:58] HTTPS, Traffic, Operations, Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2440958 (ema) p:Triage>Normal
[10:18:59] HTTPS, Traffic, Operations, Wikimedia-Shop: Canonical URL in Store points to HTTP address, should be HTTPS - https://phabricator.wikimedia.org/T131131#2440959 (ema) p:Triage>Normal
[10:19:54] Traffic, DNS, Operations, Patch-For-Review: Set SPF (... -all) for toolserver.org - https://phabricator.wikimedia.org/T131930#2440962 (ema) p:Triage>Normal
[10:20:28] Traffic, Operations: Set up LVS connection sync - https://phabricator.wikimedia.org/T136944#2440963 (ema) p:Triage>Normal
[10:21:02] Traffic, Operations, Patch-For-Review: Make upload.wikimedia.org cookieless - https://phabricator.wikimedia.org/T137609#2440964 (ema) p:Triage>Normal
[10:42:44] Thank you, ema. I agree that simulating a whole PoP would be very interesting.
However, people seemed to be scared of writing out a dataset with 100k req/sec (see the phabricator item's comments) - and since we only have a few PoPs, focusing on a PoP might not reduce that rate by a big enough factor.
[10:44:31] I think part of the problem is that I'd like to prevent aggressive sampling of the data (like 1:1000), because a high sampling rate introduces bias into cache simulations (1:10 is probably ok-ish, but anything above 1:100 sampling leads to unrealistic cache performance). Whereas limiting the number of caches would still be realistic (except that the request routing is playing against us).
[10:45:54] So, given the different request routing (different hashing for memory and disk caches), I take it that there's no smaller subset within a PoP that is self-contained?
[10:48:02] Which PoP has the smallest request rate on its Varnish instances?
[10:59:28] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2441042 (elukey) The two patches needed are the following (even simpler than...
[11:18:27] Ok, I found the request rate per DC numbers on grafana. Ignore my last question.
[11:18:56] ema: are your Varnishcon slides available somewhere?
[11:22:00] Here's another question: are all bans/purges executed on the Varnish servers visible in the hive data (i.e., by combining the fields cache_status, uri_query, http_method)?
[11:37:03] dberger: re slides - https://www.mediawiki.org/wiki/Presentations#2016
[11:37:29] also if you want to see ema in all his greatness https://www.infoq.com/fr/presentations/varnishcon-emanuele-rocca-scaling-wikipedia
[11:41:49] Thank you, elukey.
[11:46:59] dberger: also about T128132 - I am in the analytics team, and my manager (Nuria) told me that we are going to work on it probably next quarter.
We are really interested to help you but we have been swamped with other tasks
[11:47:00] T128132: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132
[11:56:32] elukey: thanks for letting me know about the schedule! It would have been great to get the request data set earlier, because these simulations could contribute to a few current engineering questions (e.g., T124954, T96853, T135384), but I understand that you're quite busy. My goal (with these questions here) is to get a well-defined description for T128132 so that it's easier for you to work on it.
[11:56:32] T128132: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132
[11:56:32] T96853: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853
[11:56:32] T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384
[11:56:33] T124954: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954
[12:00:53] yep :)
[13:20:28] dberger: all purges should be in the hive data (http_method=PURGE)
[13:48:07] don't we filter them in varnishkafka?
[13:48:14] * elukey is confused
[13:54:58] elukey: uh, good point! No purge data then
[14:53:40] I've uploaded varnish 4.1.3-1 to debian unstable and based 4.1.3-1wm1 on it: https://packages.qa.debian.org/v/varnish/news/20160708T133842Z.html https://gerrit.wikimedia.org/r/#/c/297992/
[14:53:47] next week we can upgrade!
[14:59:36] upgrade all the things!
[15:02:29] Traffic, Varnish, Operations: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2441604 (ema) The XKey vmod is part of a collection of VMODs called [[https://github.com/varnish/varnish-modules/blob/master/src/vmod_xkey.c|varnish-modules]], already packaged in Debian. I've just [[http://a...
[15:03:01] Traffic, Analytics-Kanban, Operations, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2441605 (elukey) Websocket upgrades filtered out from Varnishkafka. The last step is to wait for https://gerrit.wikimedia.org/r/#/c/...
[15:17:39] ema: but we can find page edits (for text caches) and uploads (for the upload caches) in the hive data, right? Do you think this is sufficient to reproduce the changes to the cache state (i.e., lost hits in simulations after a purge)?
[15:19:38] dberger: I'd suggest to join #wikimedia-analytics to ask these questions, we will be happy to help :)
[15:26:55] dberger: perhaps! Note that, for instance, editing a page invalidates both the desktop and mobile version. It might be hard to find out what caused the invalidation from the hive data. But yes, as elukey said, you might want to ask these questions on #wikimedia-analytics
[16:00:17] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2441866 (elukey) Deployed the patched version of httpd but the errors are sti...
[17:21:55] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2442162 (elukey) I copied an example of 304 in mw1061:/home/elukey/error_log_...
[21:28:08] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2442995 (elukey) I found a repro (If-Modified needs to be modified accordingl...
[22:09:04] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2443063 (elukey) Interesting fact: I tried the same curl request on mw2244 (t...
[22:23:35] Wikimedia-Apache-configuration, Operations, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2443135 (hashar) So you at least have a proof `AH01070: Error parsing script...
[23:11:49] netops, Datasets-General-or-Unknown, Operations, TestMe: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#2443295 (Danny_B) Still an issue?
[23:44:43] Traffic, Operations, Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2443362 (MaxSem)
[23:44:47] HTTPS, Traffic, Operations, MW-1.27-release-notes, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2443361 (MaxSem)
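Editor's note on the cache-simulation thread in this log (one memory cache per request, misses possibly spread over several disk caches, and the worry that sampling biases results): a toy simulation makes the routing and sampling questions concrete. What follows is only a minimal Python sketch and not WMF's actual routing. Choosing the frontend by hashing a client identifier and the backend by hashing the URL are assumptions, as are the LRU policy, the cache sizes, the synthetic Zipf-like trace, and every function name.

```python
import hashlib
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that counts objects, not bytes (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, admit the key and evict the oldest."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self.items[key] = True
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
        return False


def pick(key, n):
    """Map a string key to one of n caches (hypothetical hashing scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def simulate(requests, n_front=4, n_back=4, front_size=50, back_size=500):
    """Replay (client, url) pairs through a two-tier hierarchy of LRU caches."""
    fronts = [LRUCache(front_size) for _ in range(n_front)]
    backs = [LRUCache(back_size) for _ in range(n_back)]
    backs_seen = [set() for _ in range(n_front)]
    total = front_hits = back_hits = 0
    for client, url in requests:
        total += 1
        f = pick(client, n_front)  # assumption: frontend chosen per client
        if fronts[f].access(url):
            front_hits += 1
            continue
        b = pick(url, n_back)      # assumption: backend chosen per URL
        backs_seen[f].add(b)
        if backs[b].access(url):
            back_hits += 1
    front_misses = total - front_hits
    return {
        "front_hit_ratio": front_hits / total if total else 0.0,
        "back_hit_ratio": back_hits / front_misses if front_misses else 0.0,
        # Number of distinct backends each frontend forwarded misses to.
        "backends_per_frontend": [len(s) for s in backs_seen],
    }


def synthetic_trace(n_requests=20000, n_clients=200, n_urls=5000, seed=0):
    """Build a made-up trace with Zipf-like URL popularity (not real traffic)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_urls)]
    urls = rng.choices(range(n_urls), weights=weights, k=n_requests)
    return [("client%d" % rng.randrange(n_clients), "/wiki/page%d" % u)
            for u in urls]


def subsample(requests, one_in, seed=0):
    """Keep each request independently with probability 1/one_in."""
    rng = random.Random(seed)
    return [r for r in requests if rng.randrange(one_in) == 0]


if __name__ == "__main__":
    trace = synthetic_trace()
    for label, one_in in (("full", 1), ("1:10", 10), ("1:100", 100)):
        print(label, simulate(subsample(trace, one_in)))
```

Running the script prints the hit ratios for the full trace and for 1:10 and 1:100 samples replayed against the same cache sizes, so the effect of sampling on simulated hit ratios can be inspected directly. The backends_per_frontend entry shows how many distinct disk caches each memory cache forwarded misses to under the assumed hashing.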