[18:48:41] Traffic, Multimedia, Operations, User-Josve05a, User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736960 (BBlack)
[18:48:57] ^ more rumblings
[18:49:29] I still don't suspect "bad record mac" is any kind of root cause with a TLS implementation bug on our side (although it's remotely possible!)
[18:49:53] I think it's just common to see that if the connection is reset on TLS, or similarly if anything mucks with the traffic.
[18:50:09] But we're still left with a question of why we have so many reports like this in various circumstances.
[18:50:17] Traffic, Multimedia, Operations, User-Josve05a, User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736966 (Josve05a) >>! In T148917#2736958, @BBlack wrote: > So, one of the possible fact...
[18:50:20] it's clearly not widespread in the sense of affecting a large percentage of traffic
[18:50:30] but it's popping up here and there too much to ignore
[18:52:11] Sherry was the other case (direct contact, not through phab), having much larger issues reaching any of our wikis with Safari 10, but it worked fine in FF. I still think her case could be some improbable corner-case fallout of the GlobalSign issue.
[18:52:21] Traffic, Multimedia, Operations, User-Josve05a, User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736968 (Paladox) >>! In T148917#2736958, @BBlack wrote: > So, one of the possible facto...
[18:54:11] paladox's report in the ticket linked here claims it's affecting (with slightly different visible results) all of his UAs on Win7: Chrome, IE, Edge.
[18:54:33] (I think he's also running MS developer-preview releases of things, though)
[18:54:42] could be MS's Defender? or something with the router.
[18:55:16] if it was really a generic error with our connections, I'd think we'd see it more broadly (as in, almost all clients reporting it at least rarely, instead of a few clients reporting it more heavily and reproducibly)
[18:56:42] sorry, paladox is on a dev release of Win10 underneath
[18:56:47] another report was on Win7 Pro
[18:58:12] It would be nice to get a clean repro report from a knowledgeable user who's absolutely sure of no interference from: unreleased beta client-side OS/browser software, AV/Malware tools, etc...
[19:09:24] Traffic, Multimedia, Operations, User-Josve05a, User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736970 (Josve05a) I changed WiFi to my phone's personal hotspot, and that didn't change...
[20:22:58] Traffic, Multimedia, Operations, User-Josve05a, User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2737059 (BBlack) I've reproduced this now, at least once. I took a lot of retries. I u...
[20:48:47] I'm starting to have a theory that this could be something else with TCP in+out of nginx
[20:49:14] perhaps we're back to our TCP port issues with the timewait stuff, etc
[20:49:30] since the reports tend to happen in esams, and tend to happen at peak request-load, etc
[20:49:45] and upload has more data to transfer per req, leading to more parallelism
[20:50:19] so to that end, uploading a patchset we've discussed before as an alternative to keepalives or unix sockets: give varnish-fe 8x inbound ports and have nginx round-robin them.
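The port-exhaustion theory above can be made concrete with rough arithmetic. This is a minimal sketch, assuming Linux defaults that are not stated in the log (ephemeral port range 32768-60999 and a 60-second TIME_WAIT). Each nginx-to-varnish connection is identified by its (source IP, source port, destination IP, destination port) tuple, so with a single destination port the number of source ports caps how many connections can sit in TIME_WAIT at once. The function name and all figures are illustrative; real hosts may tune these sysctls.

```python
import math

# Assumed Linux defaults (not taken from the log; the real hosts may differ).
EPHEMERAL_PORTS = 60999 - 32768 + 1   # size of the default ip_local_port_range
TIME_WAIT_SECONDS = 60                # how long a closed socket lingers in TIME_WAIT


def max_new_conn_rate(local_ports: int, time_wait_s: float, dest_ports: int = 1) -> float:
    """Rough ceiling on sustained new connections per second from one
    source IP to one destination IP, if every closed connection holds
    its 4-tuple for the full TIME_WAIT period and nothing is reused."""
    return local_ports * dest_ports / time_wait_s


# One varnish-fe listening port versus eight ports that nginx rotates through.
single_port_rate = max_new_conn_rate(EPHEMERAL_PORTS, TIME_WAIT_SECONDS)
eight_port_rate = max_new_conn_rate(EPHEMERAL_PORTS, TIME_WAIT_SECONDS, dest_ports=8)

print(f"1 port : about {single_port_rate:.0f} new conns/sec")
print(f"8 ports: about {eight_port_rate:.0f} new conns/sec")
```

Under these assumed defaults, eight listening ports raise the ceiling eightfold without needing keepalives or unix sockets, which is the effect the patchset is after.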
[20:50:38] (we'll have to do frontend restarts across the board first, but we've got reboots coming anyways)
[20:51:50] should maybe raise the various conn/fd limits in nginx too JIC
[21:06:10] are you interested in some kind of info?
[21:06:34] I managed to reproduce quite easily after browsing three pages
[21:07:28] a net::ERR_SSL_BAD_RECORD_MAC_ALERT this time
[21:07:45] but just looking at the pcap will probably not give much insight…
[21:22:36] well the pcap would confirm that the server is sending RST to trigger the errors
[21:22:47] as opposed to the client seeing some kind of true error in SSL data and then sending the RST to the server
[21:24:45] yes, when it reports net::ERR_CONNECTION_RESET, the server is sending RSTs
[21:24:55] when net::ERR_SSL_BAD_RECORD_MAC_ALERT there are no resets
[21:25:00] well
[21:25:23] assuming modern client and HTTP/2, the client and server are probably multiplexing several image fetches over a single connection
[21:25:37] the RST would be for one TCP connection, but it would break all the images streaming through it at that time
[21:26:53] at least for me, when I've seen it there's been at least one connection reset error, sometimes just those, sometimes coupled with perhaps many of the bad mac alert things
[21:27:08] but I've not yet seen (myself) the bad mac alert messages without at least 1x conn reset message in the mix somewhere
[21:27:41] (we were also debugging something semi-related on cache_misc last week where we did a lot more repros, and that seemed to be the case then with network captures I saw)
[21:29:03] I saw some net::ERR_SSL_BAD_RECORD_MAC_ALERT without a tcp-level RST
[21:29:15] ok
[21:29:54] I would've thought we'd at least see them the other direction if that were the case
[21:30:14] (the theory being if the first thing that goes wrong is the client actually detected a bad mac, the client would RST->server)
[21:30:24] seems a tcp RST → net::ERR_CONNECTION_RESET,
[21:31:22] I also captured the preshared keys
[21:31:35] but wireshark doesn't seem to be decrypting the packets :/
[21:31:43] https://forum.avast.com/index.php?topic=191762.0
[21:31:57] ^ is one of the only very-recent google hits on the error, talks of Avast WebShield interference
[21:32:31] getting wireshark to see through modern TLS is hard heh
[21:32:35] I don't have any of this in the middle
[21:32:38] thanks to forward secrecy
[21:33:07] that's why I ran the browser with SSLKEYLOGFILE= :P
[21:33:22] and no, I don't have anything in the middle
[21:33:40] does your wireshark support x25519?
[21:34:11] Version 2.2.0
[21:34:49] it is showing some TLSv1.2 Record Layer: Application Data Protocol: http-over-tls subentries
[21:34:53] which I hadn't seen before
[21:35:54] I can send you the files if that's useful to you
[21:39:26] hmm, the client did send some RSTs before
[21:39:57] but that was on a different port than the server→client one
[23:17:57] bblack, happen to be around?
[23:19:01] I've been thinking about T133548
[23:19:01] T133548: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548
[23:20:37] How are we going to have multiple servers behind a domain while using LE, without NFS-like sharing of /var/acme/challenge?
[23:30:02] it's fine if we only have one server doing it, but how is it going to work behind LVS? moving to a backup server would involve temporary user-visible issues too
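The earlier question of whether the server or the client sends the first RST can be answered from a capture without decrypting anything, since TCP flags are visible in the clear. This is a rough sketch, assuming the capture has already been reduced to per-segment tuples by some other tool. The `Segment` type, the `rst_directions` helper, and the addresses (taken from documentation-only IP ranges) are all hypothetical and not from the log.

```python
from collections import Counter
from typing import Iterable, NamedTuple

RST = 0x04  # RST bit in the TCP flags field


class Segment(NamedTuple):
    """Hypothetical minimal view of one captured TCP segment."""
    src: str
    sport: int
    dst: str
    dport: int
    flags: int


def rst_directions(segments: Iterable[Segment], server_ip: str,
                   server_port: int = 443) -> Counter:
    """Count RST segments by direction relative to the TLS server endpoint."""
    counts: Counter = Counter()
    for seg in segments:
        if not seg.flags & RST:
            continue  # only resets matter here
        if seg.src == server_ip and seg.sport == server_port:
            counts["server->client"] += 1
        elif seg.dst == server_ip and seg.dport == server_port:
            counts["client->server"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample: one server-side reset, one client-side reset on another port.
SERVER = "198.51.100.7"
CLIENT = "192.0.2.10"
sample = [
    Segment(CLIENT, 50001, SERVER, 443, 0x18),  # PSH+ACK, ordinary data
    Segment(SERVER, 443, CLIENT, 50001, 0x04),  # RST from the server
    Segment(CLIENT, 50002, SERVER, 443, 0x14),  # RST+ACK from the client
]
result = rst_directions(sample, SERVER)
print(dict(result))
```

Grouping the counts by client port as well would show whether a client-sent RST belongs to the same connection as a server-sent one, which is the distinction raised at 21:39.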