[07:57:23] 10netops, 10DBA, 06Operations, 10Wikidata, 07Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2824480 (10Marostegui) Holding connections on the master: if there are 5-10 jobs running it shouldn't be a big deal, as I assume only 10 c...
[09:25:19] 10netops, 10DBA, 06Operations, 10Wikidata, 07Performance: DispatchChanges: Avoid long-lasting connections to the master DB - https://phabricator.wikimedia.org/T151681#2826465 (10jcrespo) @Manuel, @Daniel Actually it is a problem, because masters have a limit of CPU# or 32 active threads on the pool of co...
[10:13:03] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2826529 (10Joe) a:05Joe>03None
[10:14:39] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2824625 (10Joe) Please do not assign tickets to me directly: others who might have more time to fix this would not look into it, assuming I am already on it, which is not the case at the moment.
[10:14:54] <_joe_> ema: can you take a look at ^^ ?
[10:16:52] <_joe_> the bug report is a bit sparse, but it seems like a real issue
[10:19:24] _joe_: sure
[10:20:34] <_joe_> could be related to the API issues we had during the weekend, though
[10:25:09] mmh, these errors (502) are coming from nginx though, while the 503s in T146451 were coming from varnish
[10:25:10] T146451: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451
[10:31:20] <_joe_> yeah, nothing to do with the old ticket
[10:43:53] I'm starting to suspect proxy_request_buffering might be involved
[10:44:52] according to the bug report the issue started ~2 weeks ago, which is pretty much when we turned request buffering back on
[10:50:58] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2826652 (10Paladox) I believe this was fixed when godog restarted the API servers, because there were a lot of errors being logged in #wikimedia-operations
[12:46:40] 10Traffic, 06Operations, 13Patch-For-Review: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2826978 (10ema) 05Open>03Resolved a:03ema
[13:57:05] 10Traffic, 06Operations, 13Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823504 (10ema) @elukey and I worked on this a bit on Friday. Specifically, we tried bumping `vsl_space`: ``` vsl_space · Units: by...
[15:55:08] 10netops, 06Operations, 10ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2827362 (10RobH) Since both this and the recently failed power supply on cp4008 are out of warranty, the current plan is to steal the other power supply from cp4008 to replace the bad one in lvs4002.
[16:27:57] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2827426 (10elukey) @Paladox: would you mind writing down the URLs that you are trying to access? I know that you already provided the data, but having the actual links would help me and other people not super used to write API...
[16:29:23] 10Traffic, 06Operations: several 502 Bad Gateway - https://phabricator.wikimedia.org/T151686#2827429 (10Paladox) @elukey Hi, it should all be fixed now. The link I tried was en.wikipedia.org; adding something to the watchlist and then removing it caused problems, but it should all be fixed now by godog restar...
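(Editor's note: for context on the proxy_request_buffering suspicion above, here is a minimal sketch of the nginx directive being discussed; the values, server name, and backend address are illustrative assumptions, not the actual WMF TLS-terminator configuration, which is not shown in this log.)

```
# Hypothetical nginx TLS-terminator snippet illustrating the toggle discussed above.
server {
    listen 443 ssl http2;
    server_name en.wikipedia.org;

    location / {
        # When "on" (the nginx default), the entire client request body is read
        # before nginx starts sending the request to the backend. Turning this
        # back on is the change suspected of coinciding with the 502 reports.
        proxy_request_buffering on;

        proxy_pass http://127.0.0.1:80;  # local Varnish frontend (address is an assumption)
    }
}
```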
[16:41:49] yeah, it'd be nice to get cleaner repro conditions on the 502s
[16:41:59] I wonder if it's very large POST data?
[16:43:41] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2827486 (10GWicke) @gilles, the client can communicate the exact format(s) it prefers using either the URL, or via the Accept header. For the vast major...
[16:45:31] 10Traffic, 06Multimedia, 06Operations, 13Patch-For-Review, and 2 others: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2827488 (10BBlack) Can't be a regression of the specific TLS bug we had here.
[17:13:20] bblack: according to paladox's comment on the ticket, adding something to the watchlist and then removing it was causing the issue; I guess those are not particularly big POSTs
[17:21:25] yeah, shouldn't be
[17:21:31] hopefully it's not proxy_request_buffering :)
[17:28:17] (if it is, we may have just traded one problem for another with the same requests, and we may just need to ramp up the http2 initial buffer significantly)
[17:41:40] ema: possibly relevant nginx bugfix from ~1h ago: http://hg.nginx.org/nginx/rev/52bd8cc17f34
[17:43:11] maybe, anyways
[17:43:34] our client_body_buffer_size and our http2_body_preread_size are both 64k though
[17:44:02] (so, in theory, we shouldn't hit that case mentioned in the commit, but still)
[17:44:27] client_body_buffer_size is how much client POST data we'll buffer in memory before deciding to spool to disk (which is a shmfs in our case)
[17:49:55] in general, we probably could/should raise both parameters, and ensure the http2 one is smaller than the body_buffer_size one
[17:50:35] especially in the upload case, I guess we're mostly saved by chunked uploads being common for larger files
[17:51:24] might be nice to get the mem buffer much larger there, and maybe stop using shmfs and use the real disk instead of trying to grow shmfs even bigger
[17:55:04] bblack: oh, that patch looks interesting :) if the bug can actually be reproduced with watchlist changes it's unrelated, though
[17:58:02] yeah
[17:58:11] but IIRC you were mentioning 502 spikes at some point in the past, somehow related to nginx buffering
[17:58:35] well, the ones in the distant past were from experimenting with keepalives for nginx->varnish
[17:58:44] oh, keepalives
[17:59:12] I think there are state-leak problems in general there. if MW or Varnish causes certain kinds of errors, normally they'd be limited to the request that caused the error somehow
[17:59:23] but with keepalives, they can fuck up the state of a connection which still gets reused for another request
[17:59:33] I think the 502s for keepalive had to do with that
[18:00:10] probably cases similar to what we've seen before with bad gzip outputs, or connection-close after sending a certain amount of the response (but not all), etc
[18:01:18] basically I think that unless you assume the whole stack is bug-free in HTTP protocol terms, using keepalives for the nginx->varnish hop is risky since they interpret the standards a bit differently at times, and thus it will always leak error states across user request boundaries
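(Editor's note: to make the buffer discussion above concrete, this is a hedged sketch of what raising the two parameters and moving body spooling off shmfs could look like. Only the 64k figure comes from the log; the larger values and the temp path are purely illustrative.)

```
# Illustrative nginx http{} snippet only; the actual production values and
# paths are not in this log.
http {
    # Per the log, both of these are currently 64k. The discussion above
    # suggests raising both, while keeping the HTTP/2 preread buffer smaller
    # than the body buffer.
    client_body_buffer_size  512k;  # in-memory buffer before the body is spooled to a temp file
    http2_body_preread_size  256k;  # per-request HTTP/2 body preread buffer

    # Spool oversized bodies to real disk instead of a shmfs mount
    # (hypothetical path).
    client_body_temp_path /srv/nginx/client_body_temp 1 2;
}
```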
[18:01:56] so I more or less decided we shouldn't pursue keepalives there, I think, and instead look towards future stack changes and/or using unix domain sockets
[18:02:21] (well, and also recently I did that thing with the 8x listening ports for the frontend to reduce timewait issues and such; that buys a lot of time on the keepalives-related issues)
[18:03:07] right
[18:03:54] I guess we could gut the keepalives_per_worker complexity, but haven't
[18:05:32] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2827829 (10BBlack) Should have resolved/rejected this back in T107749#2662491 - at this point it's just a collector of semi-related commits, but I don't think we plan to...
[18:05:37] 10Traffic, 06Operations, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2827831 (10BBlack)
[18:05:40] 10Traffic, 06Operations, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2827830 (10BBlack) 05stalled>03declined
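(Editor's note: for reference, the now-declined nginx->varnish keepalive approach discussed above would look roughly like the standard nginx upstream-keepalive recipe sketched below. This is a generic illustration, not the configuration from T107749, and the upstream address/port are assumptions.)

```
# Generic sketch of HTTP/1.1 keepalive from nginx to a local Varnish frontend,
# i.e. the approach declined in T107749. Address and port are assumptions.
upstream varnish_frontend {
    server 127.0.0.1:80;
    keepalive 32;            # idle keepalive connections cached per worker process
}

server {
    listen 443 ssl http2;

    location / {
        proxy_pass http://varnish_frontend;
        proxy_http_version 1.1;          # keepalive to the upstream requires HTTP/1.1
        proxy_set_header Connection "";  # drop the default "Connection: close" header
    }
}
```

The risk called out in the log is that, once connections are reused this way, a protocol-level error on one request (bad gzip output, an early connection close, etc.) can leave the connection in a bad state that leaks into an unrelated user's request, which is why the ticket ended up declined in favour of the extra listening ports and possible future use of unix domain sockets.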