[05:31:08] 10Traffic, 10Operations, 10ops-esams: cp3048 crashed - https://phabricator.wikimedia.org/T180424#3757577 (10BBlack)
[08:44:10] 10Traffic, 10Operations, 10ops-esams: cp3048 crashed - https://phabricator.wikimedia.org/T180424#3757731 (10Peachey88)
[09:32:58] 10Traffic, 10Operations, 10Performance-Team (Radar): Upgrade cache_upload to Varnish 5 - https://phabricator.wikimedia.org/T180433#3757836 (10ema)
[09:41:15] 10Traffic, 10Operations: Uncacheable content handling: hfp vs hfm - https://phabricator.wikimedia.org/T180434#3757861 (10ema)
[09:41:18] 10Traffic, 10Operations: Uncacheable content handling: hfp vs hfm - https://phabricator.wikimedia.org/T180434#3757876 (10ema) p:05Triage>03Normal
[09:50:49] the new hfp syntax in v5 requires a ttl for the hfp object, while in most cases we're currently just setting beresp.uncacheable=true without specifying a ttl...
[09:52:17] I've prepped https://gerrit.wikimedia.org/r/#/c/391171/ and used default_ttl for those cases, but I'm now wondering what's the hfp ttl in v4 in those cases (perhaps beresp.ttl)?
[10:54:31] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3758143 (10Nemo_bis) I assume the same would apply to the "UseDC" cookie?
[11:13:21] 10Traffic, 10Operations: Puppet / LVS: confusion in service vs IP name - https://phabricator.wikimedia.org/T180257#3758164 (10ema) p:05Triage>03Normal
[12:25:42] ema: yeah it's beresp.ttl, can we just forward that through where it makes sense?
[12:26:03] bblack: yup!
[12:27:06] bblack: there's been a weird behavior reported by amir re:cp3007. varnish-be was failing fetches with 'no backend connections', I couldn't immediately figure the problem out so I've restarted the service
[12:27:27] varnishlogs on cp3007:~ema/503.log
[12:27:45] bblack: does that ring a bell?
[12:28:10] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1510649535123&to=1510662481510&var-datasource=esams%20prometheus%2Fops&var-cache_type=misc&var-server=All
[12:47:06] fun problem of the day! Slow varnish-be request logs end up in logstash nicely. varnish-fe ones only look good if varnishncsa's pid is <= 9999 :)
[12:47:18] https://gerrit.wikimedia.org/r/#/c/391199/
[12:49:26] ema: re: cache_misc, the biggest change there lately is the phab notification sockets. it's possible there are way more users of phab than we thought, and they all keep websocket pipes open through our varnish stack all the way back to phab1001.eqiad.wmnet:22280 ...
[12:51:23] maybe we have the max_conns for phab:22280 itself set up ok, but within that limit, esams<->eqiad are hitting a max_conns for inter-cache conns?
[12:51:34] (due to the new websockets stuff)
[12:53:35] that seems plausible, yes
[12:54:03] per-backend connections cp3007-cp1061 flattened roughly when the issue started
[12:54:25] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?panelId=6&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cache_type=misc&var-server=All&from=1510637802447&to=1510663955020
[12:55:02] 'max_connections' => 100,
[12:55:06] ^ cache_misc inter-cache :)
[12:55:06] heh
[12:55:28] I suspect that legacy value is from back when cache_misc was only doing a few things that were more grafana-like
[12:56:43] probably should go unlimited like we've done for text there (upload too?)
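The cap under discussion is the .max_connections attribute on the puppet-generated inter-cache backend definitions. A minimal hand-written VCL sketch, assuming an esams varnish-be talking to an eqiad one; the backend name, host and port are illustrative stand-ins for the templated output, not the actual config:

    # Sketch only: the real definitions are generated by puppet from the
    # cache-backend options; name/host/port here are illustrative.
    backend be_cp1061_eqiad_wmnet {
        .host = "cp1061.eqiad.wmnet";
        .port = "3128";
        .max_connections = 100;  # the legacy cache_misc inter-cache cap
    }

Leaving .max_connections out of a backend definition removes the cap entirely, which is the "go unlimited" option mentioned above.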
[12:57:04] also I really wonder about our historical 100K limit for fe->be
[12:57:30] since be only has one port, and fe will only use one source address, only ~40K is probably possible before it can't allocate any more sockets anyways....
[12:57:51] (whatever the ephemeral port limit is)
[12:58:23] ~61K I guess
[12:58:42] * ema also has the feeling that these limits have caused more harm than good so far :)
[12:59:05] under a spiky/artificial/attacky load, the fe's really can forward a crapload of conns to the be. at first glance, it seems to make sense to throttle at some sanity-limit there, but:
[12:59:58] 1) If the max_conns is 100K, we're going to run out of sockets first, and that probably fails in more-subtle/awful ways than hitting a lower max_conns value first before we run out of sockets.
[13:00:43] 2) If things are going to fail anyways, does it even make sense to have the artificial limit in place? Either way, things will fail.
[13:01:34] I guess there's still a pro-limits argument for fe->be though: it might limit the impact of a single bad client IP (buggy, naively-attacky), and let the rest of the frontends handle traffic ok through the backends still.
[13:02:33] maybe if graphs show the avg/max fe->be conns are well under the 61K socket limit (I'd hope so), we set a more-realistic value there. Maybe 50K?
[13:03:42] misc be<->be: https://gerrit.wikimedia.org/r/#/c/391204/
[13:03:57] yeah, let's see re:fe<->be
[13:07:34] we probably want to ask prometheus something like: `max(varnish_backend_conn{layer="frontend"})` (or avg)
[13:21:58] alright it's slow but it seems to work
[13:22:01] https://grafana.wikimedia.org/dashboard/db/varnish-backend-connections?orgId=1&from=now-24h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=misc
[13:22:30] see how it flattened to 100 there around 10:00
[13:30:09] yeah
[13:31:06] all the usual maximums are tiny relative to the numbers/arguments being made above
[13:31:19] but also, the whole ephemeral socket thing applies for be->be as well
[13:31:47] it probably is better to have varnish cap itself and say "no backend connections" than to run out of ephemeral sockets and see what falls apart and how at that point
[13:32:23] so, maybe set max_conns for all clusters' fe->be and be->be to 50K?
[13:33:06] or, the counter-argument would be to set them all to zero (unlimited)
[13:33:44] on the grounds that Varnish's accounting of connection sockets seems to include (at least sometimes) closed sockets in various time_wait-like states, and we have some kernel tunables that make those recycle a bit faster
[13:34:16] so it's entirely possible >61K varnish-level sockets "works", up to some unknown/variable limit before the established+unavoidably-timewaited ones reach ~61K.
[13:36:50] (but then, what's the behavior when we run out of ephemerals for that connection pool? maybe varnish source can tell us, I think it would fail during connect(), maybe)
[14:37:39] bblack: could you do a quick sanity check of https://gerrit.wikimedia.org/r/#/c/391199/ when you have a sec? Works fine in labs
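For context on the hfp/hfm question from the morning (and the 391171 change reviewed below), a minimal VCL sketch of the v4-vs-v5 difference; the trigger condition and the 601s duration are made-up placeholders, and the actual change carries the TTL through from beresp.ttl / default_ttl rather than hard-coding one:

    sub vcl_backend_response {
        # Placeholder condition; the real VCL has per-cluster logic here.
        if (beresp.http.Set-Cookie || beresp.http.Cache-Control ~ "private") {
            # Varnish 4: uncacheable=true creates a hit-for-pass object
            # that lives for whatever beresp.ttl is at this point.
            set beresp.uncacheable = true;
            return (deliver);

            # Varnish 5: uncacheable=true alone creates hit-for-miss (hfm)
            # instead; a hit-for-pass (hfp) object needs an explicit
            # duration, e.g.:
            #   return (pass(601s));
        }
    }

In v5 the hfm variant lets the object become cacheable again on a later fetch, while hfp keeps passing for the whole duration, which is the distinction T180434 is about.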
[14:45:06] ema: done
[14:46:17] bblack: thanks, merged :)
[15:53:30] alright https://gerrit.wikimedia.org/r/#/c/391171/ is ready to be reviewed: pcc looks good, tested on misc (v5) and upload (v4) test hosts and nothing burned
[16:21:21] ema: some review comments added
[16:21:33] thanks
[16:21:37] bblack: the 'no backend connection' issue is affecting cp3007 again, it looks like changing the yaml file didn't do much
[16:21:55] maybe we're overriding it in modules/profile/manifests/cache/misc.pp?
[16:21:57] changing the yaml file?
[16:22:14] yup https://gerrit.wikimedia.org/r/#/c/391204/
[16:22:33] oh yeah, revert that heh
[16:22:45] doh, that's app_def
[16:22:48] reverting
[16:22:54] that's only for the application-layer backends, not inter-cache, and it changed the default for a bunch of them :P
[16:23:10] yes, edit it in profile
[16:23:35] $be_cache_be_opts
[16:24:07] and that's why you need code reviews!
[16:25:14] lol
[16:26:30] bblack: how about bumping cache_misc varnish-be max_connections (intercache this time) to 50k, following the reasoning above?
[16:27:15] yeah, or all of the max_conns in $[fb]e_cache_be_opts in all 3 profiles really
[16:27:36] either way sane/normal values are way way under 50K for all clusters' fe->be
[16:28:00] it's just a question of how things fail under a massive attack we probably can't avoid falling over to anyways :P
[16:30:11] honestly unless we look at how varnish fails when it runs out of ephemerals, I don't know if there's any reason to prefer 0 or 50K as a value
[16:30:30] either way it's going to die at that point, though.
[16:31:35] bblack: https://gerrit.wikimedia.org/r/391233
[16:33:10] ema: https://gerrit.wikimedia.org/r/#/c/391216/ (I still need to add some VTC to this, but I have to stop for now and do Meetings soon. I think the code is sane, I've tested the core of it in a little CLI program manually)
[16:34:03] ema: There are two things to sort out: (1) stupid code bugs causing crash or deviation from the intent and (2) whether the intent is sane (which will probably require poking some MW people later)
[16:35:04] this is an outstanding issue from nearly 2y ago (well, that's when I last worked on it anyways). In general it's a cool thing to get right, because there are cases where we end up serving uncacheable responses and shouldn't have to, over this encoding bullshit
[16:35:31] e.g. you can replace an "a" with a %61 in a random WP title and fetch it anon, and it's pass instead of hit/miss :P
[16:35:36] (and many other such cases...)
[16:36:20] the reason those cases do pass is because MW returns uncacheable headers, because that was easier than having N potential stale copies out there not getting purged because they're not cached under the canonical URL encoding.
[16:36:52] so maybe caches get a bit of a hitrate bump from it, not sure by how much.
[16:37:51] the important reason I'm looking back at this problem after a long hiatus, though, is I suspect the real answer to the WP0+Upload.wm.o abuse issues is to apply the same MediaWiki path-normalization to all upload.wikimedia.org incoming requests as well (it currently only applies on cache_text).
[16:38:36] in the upload case non-canonical forms aren't causing a cache-pass, they're just cached separately and thus not purged, and I think some People have figured this out as a way to avoid delete->purge until caches naturally clear out.
[16:39:11] smart!
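To illustrate the encoding problem (not the 391216 implementation): a couple of hard-coded regsuball decodes in VCL show why /wiki/K%61rl and /wiki/Karl should end up as the same cache object and receive the same purges; the example title and the enumerated escapes are made up, and the real rewrite re-encodes the whole path into one canonical form rather than listing cases like this:

    sub vcl_recv {
        # Illustrative fragment only: decode two unreserved characters so
        # that e.g. /wiki/K%61rl and /wiki/Karl map to the same object.
        # The actual normalization canonicalizes every percent-escape in
        # the path, which this does not.
        set req.url = regsuball(req.url, "%61", "a");
        set req.url = regsuball(req.url, "%41", "A");
    }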
[16:39:54] when there's porn there is a way I guess
[16:39:59] :)
[16:40:37] even applying the path normalization we have today over on cache_upload would help, but because our current normalization isn't perfect, they'd probably quickly figure out how to evade it again
[16:41:04] the new rewrite forces the issue: there is only one possible canonical and correct encoding of the path
[19:15:28] 10Traffic, 10netops, 10Operations, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (10chasemp) We are dealing with a bit of a resource crunch (T161118, T171473, T178937, etc) and need to rebalance and do some rounds of...
[19:19:48] 10Traffic, 10netops, 10Operations, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (10bd808) +1 once labvirt1015 is online
[19:41:23] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760561 (10RobH) Ok, did the following: * pulled cpu1 entirely because I didn't want to waste thermal compound swapping it to cpu 2. * put suspected cpu 2 into cpu 1 socket * installed os, got cpu error duri...
[19:44:41] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760567 (10RobH) a:05RobH>03BBlack @BBlack: Assigning this back to you, please reimage or place this system back into service as you see fit. The CPU error hasn't shown back up during the OS install si...
[20:07:30] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760577 (10BBlack) For now I'm puppetizing it back into the cluster (and ipsec lists), but not repooling yet...
[23:07:04] 10Wikimedia-Apache-configuration, 10Discovery, 10Operations, 10Mobile: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#3761037 (10Krinkle) Note that as of writing (November 2017) the "m.wikipedia.org" and "zero.wikipedia.org" landing domains **do** act...
[23:07:10] 10Traffic, 10Wikimedia-Apache-configuration, 10DNS, 10Operations: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421#3761041 (10Krinkle)
[23:07:12] 10Wikimedia-Apache-configuration, 10Discovery, 10Operations, 10Mobile: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#3761039 (10Krinkle) 05stalled>03Open p:05Normal>03High
[23:07:30] 10Wikimedia-Apache-configuration, 10Discovery, 10Operations, 10Mobile: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#730151 (10Krinkle)
[23:10:16] 10Wikimedia-Apache-configuration, 10Discovery, 10Operations, 10Reading-Infrastructure-Team-Backlog, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3761048 (10Krinkle)
[23:32:44] 10Wikimedia-Apache-configuration, 10Discovery, 10Operations, 10Zero, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3761093 (10Mholloway)