[09:20:00] Amir1: That should be fairly straightforward the listener already exists in envoy hieradata/common/profile/services_proxy/envoy.yaml L240/245
[09:23:10] Amir1: I think mediawiki has all the listeners defined by default
[09:23:28] let me check
[09:25:33] Amir1: yeah, it's listening on localhost:6101 for eqiad and localhost:6201 for codfw
[09:25:51] just need to make sure network policies are ok
[09:27:09] Amir1: yup they're already added, apparently used by ThumbnailRenderer already
[09:27:16] hit me up if you need more info
[09:28:01] Ah you've figured it out and merged your patch already :D
[09:35:38] And reverted
[09:36:36] Ah yes. Well Amir1 when you get in, I can help you look into the errors if needed
[09:37:49] I wish gerrit automatically copied over the Bug: footer when doing a revert
[11:21:44] claime: morning, yeah, roughly 5% of uploads failed because of this :(((
[11:22:08] I have no idea how to even debug this
[11:22:56] Amir1: wanna sync talk about it?
[11:23:20] yeah, give me ten min
[11:24:22] Sure np, I'll send you a meet link in dm
[13:55:21] claime: Amir1: can I help with debugging?
[14:24:12] cdanis: I always appreciate help! claime got to this https://gerrit.wikimedia.org/r/c/operations/puppet/+/1269420/
[14:26:24] ahhh
[14:26:26] ack
[14:43:58] Great netsplit
[14:44:08] cdanis: sure, what do you have in mind?
[14:59:33] claime: I think trying a higher `timeout` (as well as stream_idle_timeout) is a good idea, for one
[15:00:00] cdanis: So for stream_idle_timeout, envoy defaults to 5 minutes
[15:00:11] so the mesh has a higher one than swift which sounds ok?
[15:00:46] But we can align the route's idle_timeout to swift's stream_idle_timeout, and bump the general timeout yeah
[15:01:15] It's a bit muddy to debug because the errors I can see from the mesh on the mediawiki side are 503 UC
[15:04:37] cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1269420 I've bumped the route timeout to 90s
[15:05:05] +1!
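(Editor's note: the timeout settings discussed above can be sketched in upstream Envoy terms roughly as follows. This is a minimal illustration, not the hieradata schema used by the patch. The virtual host name, cluster name, prefix match, and the idle_timeout value are placeholders assumed for the example; only the 90s route timeout and the 5 minute stream_idle_timeout default come from the discussion.)

```yaml
# Illustrative fragment of an Envoy HttpConnectionManager config.
# All names below are hypothetical; this is not the Wikimedia hieradata format.
stream_idle_timeout: 300s            # Envoy's default (5 minutes), written out explicitly
route_config:
  virtual_hosts:
    - name: swift_example            # hypothetical virtual host name
      domains: ["*"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: swift_example   # hypothetical cluster name
            timeout: 90s             # route timeout (Envoy default is 15s), bumped as in the discussion
            idle_timeout: 300s       # placeholder value; per-route override of stream_idle_timeout
```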
[15:07:09] Ok so that will require redeploying mw-on-k8s to take effect, I wonder if we should do that at the same time we try and redeploy Amir1's patch, because there's nothing else using that listener in mediawiki right now
[15:07:25] (redeploying mw-on-k8s post puppet merge + run on deployment server)
[15:12:17] I'm slightly curious as to how a more restrictive idle timeout might result in 503 UC (not an SI or something) but this seems like a solid thing to try
[15:13:38] swfrench-wmf: Same. Does a stream reset trigger a 503 UC though?
[15:14:17] because that's what the envoy doc says can happen if an upstream response header has been received but the timeout happens afterwards
[15:14:34] claime: we were live-hacking on mw-experimental yesterday, before deploying -- could do the same here
[15:14:44] cdanis: sure
[15:15:01] I can merge it and let you do that again, I won't have the time to live hack today
[15:15:30] I don't want to yolo too much with a 5% error rate on Special:Upload (IIRC?)
[15:15:44] oh, interesting - yeah, if that case is explicitly called out in the docs, then I'll believe it :)
[15:16:40] cdanis: Yeah understandable, merging the listener and redeploying mw-on-k8s is 0 risk as that listener is unused
[15:17:13] And then you can start hacking on mw-experimental as you like
[15:37:09] Change deployed
[15:37:43] sorry, just got back from the meeting. The problem is that I can't really test the issues. We have no choice but to push the mw config change and see if they start to show up again
[15:38:10] unless you know a way to reliably reproduce the upload issue
[15:40:52] (reminder that mw-experimental will need a manual helmfile apply if you'd like to test there)
[15:42:48] oh yeah, I can do that
[15:44:11] done
[15:45:11] thank you
[15:59:36] Amir1: have you tried a ginormous file?
and also maybe https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/throttling/index.html or https://developer.chrome.com/docs/devtools/settings/throttling#network-throttling
[16:17:33] cdanis: it'd be still not 100% that it wouldn't work
[16:17:52] so it is likely that my upload would go through and then we still see the problem
[16:18:14] I'm just gonna retry it again
[16:18:39] we had 300 errors in total of one hour. It's not much
[16:19:28] ok sounds good to me :)
[17:06:11] I'm confused about the transcode issue