[00:25:09] _joe_: hmm, unless i misunderstood vgutierrez implied otherwise in PS4 of https://gerrit.wikimedia.org/r/512925/ "I couldn't find a related commit allowing the load balancers to reach the configured ports in cloudelastic[1001-1004].wm.o"
[00:25:39] would certainly be simpler, as they are already configured to accept from the expected ranges
[07:53:00] Traffic, Operations, Performance-Team, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (ema) >>! In T225998#5264757, @Gilles wrote: > loadEventEnd seems to have regressed around the time the change was deployed I'm gonn...
[08:29:08] here's the pie mentioned yesterday: https://logstash.wikimedia.org/goto/37eb25926fdfdb4c0c8ebb136956a10a
[08:29:37] we've got a low but fairly constant amount of 'straight insufficient bytes' errors
[08:29:58] 61% of all fetch errors in the past 12 hours
[08:33:16] note that they don't mean user facing errors (we retry 503 errors once at the frontend layer), but it would of course be good to find out more
[08:34:54] that specific error usually happens when the origin server sends less data than advertised (eg: CL=42 but body bytes sent=30)
[08:36:14] another scenario which we've encountered in the past is wrong CL with gzip
[08:36:49] for instance when the origin server sets CL to the uncompressed response size, and then proceed to send a compressed body
[10:07:44] lots of those 'straight insufficient bytes' FetchErrors are "Resource temporarily unavailable", perhaps due to threads being limited? I've tried bumping thread_pool_max on cp3030, let's see
[10:10:05] varnish-be threads *are* being limited, and that does not seem good: https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?panelId=53&fullscreen&orgId=1&from=now-3h&to=now&var-server=cp3030&var-datasource=esams%20prometheus%2Fops
[10:23:15] Traffic, netops, Operations, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (Marostegui)
[13:25:53] Traffic, Operations, Phabricator, Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (BBlack)
[13:37:46] ebernhardson, _joe_ backend server will see the balanced traffic with their actual source IPs of course, but pybal monitoring will reach the backend servers with the lvs IP.. and that traffic needs to be allowed
[13:38:07] otherwise... pybal would think that the backend servers are down
[13:38:26] <_joe_> oh I misunderstood what the issue was, heh
[13:38:28] <_joe_> sure
[13:38:51] so as far as I know the lvs instances need to be able to reach the service port on the backend servers
[13:39:04] not for balancing itself but for monitoring purposes
[13:39:10] sorry for not being extra verbose on that :)
[13:48:39] Traffic, Operations, Phabricator, Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (Krinkle) If re-using `techblog.wikimedia.org`, please take care not to break existing urls. The root path would be fine to change as it...
[13:54:21] Traffic, Operations, Phabricator, Release-Engineering-Team (Kanban): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (BBlack) Implementing a blanket redirect to the legacy blog URI for `^/20(0[7-9]|1[0-8])/` should be feasible in VCL or Lua at the edge....
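For illustration only, the path pattern quoted in the 13:54:21 message can be exercised outside VCL or Lua. The sketch below uses Python purely to show which paths `^/20(0[7-9]|1[0-8])/` selects (year prefixes 2007 through 2018); the target host `legacy-blog.example.org` and the function name are hypothetical, not taken from the task.

```python
import re
from typing import Optional

# Pattern quoted in the log: year-prefixed paths from /2007/ to /2018/.
LEGACY_BLOG_PATH = re.compile(r"^/20(0[7-9]|1[0-8])/")

# Hypothetical placeholder; the log does not name the legacy blog URI.
LEGACY_BLOG_BASE = "https://legacy-blog.example.org"


def legacy_redirect_target(path: str) -> Optional[str]:
    """Return the redirect URL for a legacy blog path, or None if the path
    should be served normally."""
    if LEGACY_BLOG_PATH.match(path):
        return LEGACY_BLOG_BASE + path
    return None
```

Under these assumptions `/2015/03/some-post/` would be redirected, while `/`, `/2006/x` and `/2019/01/x` would not.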
[14:12:03] Traffic, Operations, Performance-Team, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles)
[14:12:27] Traffic, Operations, Performance-Team, Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (Gilles) Stalled→Open
[15:01:17] netops, Operations, Wikimedia-Logstash, User-herron: Migrate network device syslogs to Kafka logging pipeline - https://phabricator.wikimedia.org/T224128 (ayounsi) Open→Resolved All done here!
[16:34:16] netops, Operations: Outbound BGP graceful shutdown - https://phabricator.wikimedia.org/T211728 (ayounsi) Interesting! This would be useful before doing maintenance on a whole router. I opened an issue upstream asking for a per AS option, see https://github.com/mwiget/bgp_graceful_shutdown/issues/1 My su...
[21:12:16] Traffic, Operations, Performance-Team, Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Aklapper) Stalled→Open p:Triage→High
[23:55:39] Traffic, Operations, Performance-Team, Performance: Sometimes some pages load slowly on de.wp in Europe (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (PM3) The "some" can be removed from the caption. I am experiencing this problem since Tuesday (dew...
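As a rough sketch of the two 'straight insufficient bytes' scenarios described at 08:34:54 to 08:36:49 (a body shorter than the advertised CL, and a CL set to the uncompressed size while a gzip body is sent), the check below compares an advertised Content-Length against the bytes actually received. This is not how Varnish implements the check; the function name and return strings are made up for illustration.

```python
import gzip
import zlib
from typing import Optional


def diagnose_length_mismatch(content_length: int, body: bytes,
                             content_encoding: str = "") -> Optional[str]:
    """Return a short description of a Content-Length mismatch,
    or None when the advertised and received sizes agree."""
    received = len(body)
    if received == content_length:
        return None
    if content_encoding == "gzip":
        # Scenario 2: CL reflects the uncompressed size, body is compressed.
        try:
            if len(gzip.decompress(body)) == content_length:
                return ("Content-Length matches the uncompressed size, "
                        "but the body is gzip-compressed")
        except (OSError, EOFError, zlib.error):
            pass  # Not valid gzip; fall through to the generic report.
    if received < content_length:
        # Scenario 1: e.g. CL=42 but only 30 body bytes were sent.
        return f"short body: advertised {content_length}, received {received}"
    return f"long body: advertised {content_length}, received {received}"
```

With the example from the log, `diagnose_length_mismatch(42, b"x" * 30)` reports a short body of 30 bytes against an advertised 42.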