[13:40:19] FTR, what I'm running for the cache kernel reboots...
[13:41:26] on neodymium in root's homedir there's a file "caches_to_reboot" - it has a list of all the caches still running the older kernel, which I spent some time mangling/interleaving to try to maximize the average spacing between hitting two machines from the same cluster+dc
[13:41:59] and I'm running a script there (also temp in root's homedir) called "reboot_caches", which looks like:
[13:42:02] for hname in `cat caches_to_reboot`; do
    short=`echo $hname|cut -d. -f1`
    echo === $hname \($short\) ===
    set -x
    salt -v -t 90 'neon.wikimedia.org' cmd.run "icinga-downtime -h $short -d 600 -r automated-cache-kernel-reboot"
    sleep 15
    salt -v -t 10 $hname cmd.run 'touch /var/lib/traffic-pool/pool-once; echo reboot |at now + 1 minute'
    sleep 400
    set +x
[13:42:07] done
[13:43:15] so basically it's serial, about 420-ish second spacing overall, and it downtimes each machine in icinga for 10 minutes before asking for a self-repooling reboot (via at, so the command doesn't hang or anything)
[13:43:46] that's all running in a root screen session too, in case I get disconnected
[13:44:36] and then over on palladium as a regular user, I have this running to keep an eye on the total global list of depooled cp* services in etcd (in case they stack up because some machine never comes back from reboot, etc)
[13:44:40] while [ 1 ]; do echo ==================; date; curl -s 'https://conf1002.eqiad.wmnet:2379/v2/keys/conftool/v1/pools?recursive=true&sorted=true'|json_pp |grep -B 1 '/cache_.*wmnet'|grep -A 1 'pooled.*no'; sleep 37; done
[13:45:33] should take about 13h if all goes well, but I'll probably stop the job if I have to step away for a while. worst case do half of it today and resume the rest monday.
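A minimal sketch of one way to interleave a reboot list so that hosts from the same cluster+dc are spread apart. The log does not show how caches_to_reboot was actually ordered, so the round-robin approach, the function name `interleave_by_group`, the `(hostname, cluster, dc)` input shape, and the sample hostnames are all assumptions for illustration. Round-robin is a simple heuristic, not an optimal spacing when groups differ a lot in size.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_group(hosts):
    """Round-robin hosts across (cluster, dc) groups.

    `hosts` is a list of (hostname, cluster, dc) tuples. This input shape is
    hypothetical; the log does not say how groups were derived.
    """
    groups = defaultdict(list)
    for hostname, cluster, dc in hosts:
        groups[(cluster, dc)].append(hostname)

    ordered = []
    # Take one host from each group per round, so consecutive hosts from the
    # same group are separated by hosts from the other groups.
    for one_round in zip_longest(*groups.values()):
        ordered.extend(h for h in one_round if h is not None)
    return ordered


# Hypothetical sample data, not real hostnames from the log.
sample = [
    ("a1.example", "text", "dc1"),
    ("a2.example", "text", "dc1"),
    ("b1.example", "upload", "dc1"),
    ("c1.example", "text", "dc2"),
    ("c2.example", "text", "dc2"),
]

for name in interleave_by_group(sample):
    print(name)
```

Writing the result one hostname per line gives a file in the form the reboot loop reads with `cat`.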
(and trim the file past the last one done before starting up again obviously)
[14:00:05] bblack: nice
[14:01:01] heh "key" : "/conftool/v1/pools/ulsfo/cache_bits/varnish-be/cp4001.ulsfo.wmnet"
[14:01:12] apparently the old cache_bits keys still exist, I guess from the service-delete bug
[14:01:16] we killed that a while back :)
[14:05:05] bblack: I don't seem to be able to find any occurrence of ipcast in our VCL code. Is that possible?
[14:07:28] ema: yeah
[14:07:44] ema: we were using ipcast + tbf vmods for ratelimiter
[14:08:01] but in Dec we ran into a memleak/crash problem and tbf ratelimiter VCL looked suspect, so we pulled it out
[14:08:13] so yeah the current VCL doesn't use it
[14:08:26] I have one pending patch that uses ipcast for something else, though
[14:08:36] (and eventually tbf will go back in, once we fix it)
[14:09:04] OK, I've packaged tbf and std.ip() should do the same as ipcast.ip() anyways
[14:09:13] right
[14:09:21] I just wanted to look for an example of how we're currently using ipcast but couldn't find any :)
[14:09:59] basically at least under varnish3, every time you reload VCL for a running daemon, it leaks a bunch of memory related to the old version of the compiled VCL that's not needed anymore
[14:10:37] and even though vmod_tbf has a destructor we were calling on fini, it was causing the amount of memory leaked per VCL reload to skyrocket (proportional to how huge the ratelimiter table of client IPs would get)
[14:10:52] and we think that was causing crashes for us once it eventually got out of control
[14:11:23] that and also, at least the way we were using it before, tbf sucked at startup time
[14:11:49] essentially its limit logic equated to "at daemon start, everyone's already at the limit", instead of assuming empty buckets in the algorithmic sense
[14:12:16] so varnish would restart from those crashes and start excessively ratelimiting clients for a few minutes afterwards just to increase the pain from the incident heh
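A minimal sketch of the startup behaviour described above, written as a generic per-client token bucket rather than vmod_tbf's actual code. The class names, parameters, and the "bucket holds available tokens" convention are assumptions for illustration. The point it shows is that a client seen for the first time (for example right after a daemon restart) gets a full allowance, so a fresh start errs on the lax side instead of treating everyone as already at the limit.

```python
import time


class TokenBucket:
    """Token bucket that starts with a full allowance of `burst` tokens."""

    def __init__(self, rate, burst, now=None):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.tokens = float(burst)   # fresh bucket: full allowance available
        self.last = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Refill for the elapsed time, then spend `cost` tokens if possible."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Hypothetical per-client table keyed by client IP string."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, client_ip, now=None):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            # Unknown client (e.g. after a restart): start with a full bucket.
            bucket = TokenBucket(self.rate, self.burst, now=now)
            self.buckets[client_ip] = bucket
        return bucket.allow(now=now)


limiter = PerClientLimiter(rate=1, burst=3)
# The first three requests from a new client pass; the fourth is limited.
print([limiter.allow("192.0.2.1", now=0.0) for _ in range(4)])
```

The `now` argument is only there so the behaviour can be checked with fixed timestamps; real callers would omit it and let the monotonic clock be used.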
[14:12:39] s/empty/full/ depending which way around you tend to mentally describe a token bucket filter
[14:12:45] right
[14:13:38] so basically we want to fix the module code and/or our parameter input to make sure we err on the lax side with ratelimits on fresh start, and also figure out how to make sure it deallocates the ip db memory on VCL reload
[14:13:55] (might be nice to figure out if, in general, varnish4 still has the problem that old VCL shared objects continue to consume memory, too)
[14:16:30] I was thinking of creating a ticket with the list of things to check while testing on pinkunicorn
[14:16:36] this sounds like a good candidate
[14:16:41] :)
[14:19:29] ema: this is the patch I have pending that might start using ipcast again (not for ratelimiter): https://gerrit.wikimedia.org/r/#/c/266486/1/modules/varnish/templates/vcl/wikimedia.vcl.erb
[14:21:11] bblack: OK, so checking whether std.ip succeeds with a proper IP and returns 127.0.0.1 in case of parse error should be enough
[14:21:25] yeah
[14:25:37] bblack: cool, then it looks like all VMODs are working fine at a very basic level
[14:25:40] https://phabricator.wikimedia.org/P2573
[14:25:47] https://phabricator.wikimedia.org/P2574
[14:25:51] https://phabricator.wikimedia.org/P2575
[14:26:17] :)
[17:04:12] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002590 (BBlack) Took another log of all traffic today, for ~1 hour. Excluding our own healthcheck/monitoring r...
[17:22:18] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002627 (GWicke) While breaking parsoid-prod.wmflab.org's favicon is a heavy price to pay, all those sacrifices...
[17:25:44] Traffic, ContentTranslation-Deployments, ContentTranslation-cxserver, Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2002633 (BBlack) Sometime between now and the 22nd, I'll try to get a full capture for a period of multiple days...
[18:06:53] Varnish: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2002857 (Krinkle) NEW
[18:07:00] Varnish, Performance-Team: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2002864 (Krinkle)
[18:07:47] Varnish, Performance-Team: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2002857 (Krinkle)
[18:37:46] Traffic, Mobile-Apps, RESTBase, operations: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#1366601 (bearND) I would like to see this implemented. T89177 is not public though.
[18:43:26] Traffic, Mobile-Apps, RESTBase, operations: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003122 (BBlack) @BearND - you'd like to see what implemented, that isn't already?
[18:50:00] Varnish, Performance-Team: Understand and improve streaming behaviour from Varnish - https://phabricator.wikimedia.org/T126015#2003200 (BBlack) So, in the varnish world, do_stream is all about the backend fetch. In cases where there is no backend fetch, there is no do_stream issue. And if our hitrate is...
[19:25:08] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2003345 (Krinkle)
[19:29:24] Traffic, Mobile-Apps, RESTBase, operations: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003358 (bearND) @BBlack I'd like to see the W0 header (X-CS) added to API responses for regular domains, in addition to the m-dot doma...
[19:31:20] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2003363 (Krinkle)
[19:39:25] Traffic, Mobile-Apps, RESTBase, operations: Enable RESTBase for mobile sites, or support zero headers in text varnishes - https://phabricator.wikimedia.org/T102524#2003398 (BBlack) I'll make a separate task about that. In theory it's trivial to enable, but it probably needs some coordination with...
[19:47:07] Traffic, MobileFrontend, Zero, operations: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003432 (BBlack) NEW
[20:31:27] Traffic, operations, ops-esams: cp3032 is dead - https://phabricator.wikimedia.org/T126062#2003595 (BBlack) NEW
[20:38:38] Traffic, operations, Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003621 (ori) NEW
[20:39:34] Traffic, operations, Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003635 (BBlack) See also: T124954
[20:41:56] Traffic, operations, Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003639 (GWicke) Put differently, what would the performance impact be if we reduced `s-maxage` to a) 2 weeks, b) 1 week, c) days?
[20:53:32] Traffic, operations, Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003690 (ori) Script to get cached HTML ages: ```lang=python # -*- coding: utf-8 -*- """ cache_age ~~~~~~~~~ Retrieve random pages from random Wikimedia projects and scrape thei...
[20:54:16] HTTPS, Research-and-Data, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2003691 (Sadads) @Johan Your team also mentioned that it might be a good idea to...
[21:01:50] Traffic, MobileFrontend, Zero, operations, and 3 others: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003728 (bearND)
[21:03:17] Traffic, MobileFrontend, Zero, operations, and 3 others: Enable X-CS headers for non-mobile domains - https://phabricator.wikimedia.org/T126053#2003432 (bearND)
[21:05:24] Traffic, operations, Performance: Estimate effective cache time for text - https://phabricator.wikimedia.org/T126063#2003745 (BBlack) My point in the other ticket is it's not really about the percentage of pages which are cached longer than X, it's about the percentage of requests. In the extreme poss...
[21:34:36] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003887 (ori) > (which implies that a lot of cache hits there don't send Age:, so I probably need to find better ways to look at this) Should Apache or MediaWiki add a header with the current timestamp?...
[21:43:24] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003929 (BBlack) Yeah. Ideally, in *nix epoch seconds, because it's much easier to do math on that. It would also be handy for emergency invalidations of all objects mediawiki emitted from timestamp X to...
[22:00:07] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2003974 (GWicke) @bblack, do we know why the `Age` header would be missing? And does it matter / would it change the overall result?
[22:22:58] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004045 (BBlack) @gwicke - I think I had a bad filter or I misinterpreted results, one or the other. In another run I just did (also just 10 minutes on 1x cache_text, it seems like virtually all cache hi...
[22:29:10] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004072 (BBlack) The stats like the above (if taken on a broader and longer scale) still don't really simulate what would happen if we capped the TTL lower. It's not the case that we'd see a 0.7% increas...
[22:33:20] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004096 (GWicke) Yeah, long time survivors should likely be hot, and forcing them to be refreshed after 1-2 weeks shouldn't significantly alter the hit rate for those objects (or overall). To me it looks...
[22:55:54] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004198 (BBlack) @Gwicke - rather than lowering s-maxage in app code, IMHO we should lower it in the VCL cap we have here: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/te...
[22:58:25] Traffic, operations: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2004204 (BBlack) In any case, before this goes from idea to action, I need to get finer-grained stats over a broader set of requests, and even then we'll probably want to slowly drop in stages Just In Cas...
[23:28:18] domains, operations: traffic stats for typo domains - https://phabricator.wikimedia.org/T124237#2004361 (Dzahn) I merged https://gerrit.wikimedia.org/r/244092 and deactivated the wikimediacommons.* domains --> parking
[23:34:53] Traffic, Deployment-Systems, Performance-Team, operations, Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2004391 (Krinkle)
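On the T124954 / T126063 discussion above about measuring by requests rather than by pages: a minimal sketch of the per-request statistic, assuming one `Age` value in seconds has been collected per cache hit. The function name and the sample ages are made up for illustration and are not data from the tickets. As the thread itself cautions, this share does not simulate what a lower TTL cap would do to the hit rate, since objects refreshed at the cap would be served as hits again afterwards.

```python
DAY = 86400  # seconds


def share_of_hits_older_than(ages_seconds, cap_seconds):
    """Fraction of sampled cache hits whose Age exceeds `cap_seconds`.

    Each element of `ages_seconds` is the Age of one hit *request*, so a hot
    object that is hit many times is counted many times.
    """
    if not ages_seconds:
        return 0.0
    older = sum(1 for age in ages_seconds if age > cap_seconds)
    return older / len(ages_seconds)


# Made-up sample of per-request Age values, purely illustrative.
sample_ages = [5, 30, 60, 120, 600, 3600, 2 * DAY, 10 * DAY, 20 * DAY, 40 * DAY]

for label, cap in (("1 week", 7 * DAY), ("2 weeks", 14 * DAY)):
    share = share_of_hits_older_than(sample_ages, cap)
    print(f"hits older than {label}: {share:.1%}")
```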