[00:16:12] 10Traffic, 06Operations: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2727450 (10BBlack)
[00:16:15] 10Traffic, 06Operations, 13Patch-For-Review: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2727448 (10BBlack) 05Open>03Resolved a:03BBlack
[07:54:52] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2727782 (10elukey) Found other occurrences of the same issue but with different URIs on other hosts: ``` - D...
[08:38:53] 10Traffic, 06Operations: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2727839 (10Joe)
[11:17:26] 10Traffic, 06Operations: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2728212 (10Joe) 05Open>03Resolved
[12:32:30] 10Traffic, 10MediaWiki-Cache, 06Operations: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2728350 (10elukey) p:05Triage>03Normal
[12:33:26] 10Traffic, 10MediaWiki-Cache, 06Operations: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2636791 (10elukey) It is not super clear to me if we need to keep both Traffic and Operations tags (and if so what is requested from both), @hashar let me know :)
[12:43:46] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2728385 (10elukey) p:05Triage>03Normal
[13:58:13] ema, bblack: I'm still thinking about the behavior of https://phabricator.wikimedia.org/T141373 Right now I would like to collect more data on it. My idea was: create an article which is not used by anyone, with no references or pictures. Pretty much no way to get to it without the direct link. Then request this article from different IPs in such a way as to periodically hit the backend. Requesting from IP A, miss in backend + frontend A,
[13:58:13] then some time later request from B, hit backend + miss in frontend. Repeat this with every server with a time interval such that frontend A purges before it is requested from IP A again.
[13:59:01] This should result in periodic hits on the backend without hits on the frontend. This might give more insight into the behavior. What do you think?
[14:29:18] <_joe_> Snorri: except that won't work, because you have a ton of bots looking at the recent changes stream
[14:29:27] <_joe_> and then fetching the articles
[14:29:36] <_joe_> not always via the API
[15:11:09] I'm slightly puzzled by https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/55660/console loading the page fine, but subsequent requests for related resources show up in chrome's console as net::ERR_CONNECTION_RESET or net::ERR_SSL_BAD_RECORD_MAC_ALERT , chrome 53 on linux; chrome on osx loads the page just fine
[15:11:39] and trying to load a resource on a separate tab works fine, e.g. https://integration.wikimedia.org/ci/static/03f038ee/scripts/yui/yahoo/yahoo-min.js
[15:12:13] I'm talking to misc-web in esams
[15:13:28] godog: is this the same as the paladox ticket the other day?
[15:13:44] T148595
[15:13:44] T148595: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595
[15:14:41] godog: where do you see it in the console stuff?
[15:15:12] literally under "console", taking a screenshot now
[15:16:21] http://esaurito.net/~godog/sshot/screenshot_UzCBSl.png
[15:16:39] lol great domain!
[15:16:51] haha thanks
[15:18:24] godog: the page loads or doesn't load fine when the console alert happens? chrome-on-linux/osx works fine == has console msg, or not?
[15:19:49] bblack: doesn't load fine, no, on chrome 53 on linux with the console messages
[15:19:50] FWIW, I use Chromium: Version 53.0.2785.143 built on Debian stretch/sid, running on Debian stretch/sid (64-bit)
[15:19:53] and no console
[15:20:00] (no console output, I mean)
[15:20:05] on chrome 53 on osx loads fine, no console output either
[15:20:48] yeah same build as you 53.0.2785.143 but chrome not chromium
[15:21:55] any chance local clock time on your box is messed up?
[15:22:38] (I tried turning chrome off and on again, same result) but no, local clock seems fine, Wed Oct 19 16:22:32 IST 2016
[15:23:08] other things on misc-web seem fine though, e.g. grafana
[15:25:24] hmmmm
[15:25:32] godog: I can repro it on FF 43 on an old Linux, only with console open
[15:25:36] what's your linux running on? is it some strange platform?
[15:26:12] nope, standard amd64 / intel i7
[15:26:32] same
[15:26:34] godog: can you send your client IP in private? then at least maybe I can watch some logs while you reload
[15:27:11] sure
[15:27:41] volans: same error too?
[15:27:53] bad mac alert yes
[15:28:00] godog: reload again w/ repro?
[15:28:08] bblack: {{done}}
[15:28:20] I get nothing from nginx, ok
[15:28:50] bblack: if it helps, it's not all the time for me
[15:29:02] also w/ the console open, just one reload every ~10
[15:30:06] maybe a single host acting badly?
[15:30:11] yeah it is erratic for me sometimes too, http://esaurito.net/~godog/sshot/screenshot_Hxacji.png
[15:30:16] some 200s some fail
[15:31:18] I never got the page to fully load though without errors
[15:32:26] godog: I have only one error in console that is SyntaxError expected expression, got '<' console:6:4, which looks like the opening of the DOCTYPE tag that is not at the start of the document
[15:33:00] but just that in console, could just be a different way of showing it between FF and Chrome
[15:48:34] mostly what worries me is whether this is some subtle fallout of the openssl-1.1 stuff
[15:49:19] but paladox's report of something similar w/ IE was on commons (text or upload, one of the two), and before we had switched for that at any DC
[15:49:37] what is strange, bblack, is that when it happens most of the CSS/JS show this error while the HTML is loaded fine, but still some JS/CSS are loaded fine
[15:49:46] and when it doesn't happen, ALL resources are loaded fine
[15:50:46] this should exclude the single host, I would have expected to always have at least one failing resource in that case
[15:50:59] godog: is it the same for you?
[15:51:08] in any case, let me try doing a full restart of nginx on esams cache_misc (because if we want to test downgrading nginx, we should first test that the implicit restart doesn't temporarily clear things up on its own)
[15:52:50] relaunch browser to ensure no cached connection to the old copy of the daemon, and try repro again?
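A quick aside: the kind of repro godog and volans are attempting in the browser can also be tried from a shell. The sketch below is not from the log; it is stdlib-only, the iteration count and timeout are arbitrary, and the two URLs are the ones quoted above. Each fetch opens a fresh connection, unlike the browser's multiplexed reuse, so a clean run would not rule the problem out on its own.

```python
# Sketch: refetch the reported page plus one static sub-resource and record
# any TLS or connection-reset errors. Counts and timeouts are arbitrary.
import ssl
import urllib.error
import urllib.request

URLS = [
    "https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/55660/console",
    "https://integration.wikimedia.org/ci/static/03f038ee/scripts/yui/yahoo/yahoo-min.js",
]

for attempt in range(50):
    for url in URLS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resp.read()
                status = resp.status
        except (ssl.SSLError, ConnectionResetError, urllib.error.URLError) as exc:
            print(f"attempt {attempt}: FAIL {url}: {exc!r}")
        else:
            print(f"attempt {attempt}: OK {status} {url}")
```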
[15:53:04] volans: yeah I get that mixed success/fail sometimes, but most of the time only the html loads fine and all resources fail
[15:53:10] ok trying again
[15:53:29] ok trying again, godog then it's different than me
[15:54:09] also, can you look at Security tab of dev console too? it will show green/red dots on various resources and maybe some diff in TLS details
[15:54:51] bblack: same here
[15:55:08] the security tab of the devtool on FF shows just the SSL error
[15:55:32] same error, no more detail?
[15:55:55] (also, what's if FF's error? also bad record mac?)
[15:56:12] s/if/is/
[15:56:13] for the resources that fail, just the ssl_error_bad_mac_alert
[15:56:13] yeah same error here, Security tab says all OK
[15:56:22] for the ones that succeed what do you need?
[15:56:46] I assume the ones that succeed show normal expectations: TLSv1.2, ECDHE_ECDSA, AES_128_GCM
[15:57:04] yeah, The connection to this site is encrypted and authenticated using a strong protocol (TLS 1.2), a strong key exchange (ECDHE_ECDSA), and a strong cipher (AES_128_GCM).
[15:57:11] yes and SHA256
[15:58:07] the ones that fail don't have anything, no timing, no response or response headers (ofc)
[16:04:15] ok trying another debugging step, can you still repro on fresh browser/connection?
[16:04:57] yeah I can, within a freshly opened incognito window
[16:05:19] does incognito definitely not share TCP? I'm not 100% sure
[16:05:43] (when I do the seamless restart of nginx, it leaves the old daemon running in parallel handling all existing open connections, which can be long for http/2)
[16:07:02] I can repro, closed FF, ps fax | grep -i fire empty
[16:07:18] indeed, same result for me when closing and reopening chrome
[16:07:19] this time some of the resources have the time for connecting
[16:07:25] around 360ms
[16:08:33] yeah
[16:09:00] I tried a repro using a later FF-on-linux, I get different probably-unrelated things, but...
[16:09:46] FF 50.0b1
[16:09:48] The resource from “https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/55660/console” was blocked due to MIME type mismatch (X-Content-Type-Options: nosniff).
[16:10:21] lol
[16:10:42] some report in Chinese indicates a similar issue for someone else was an MTU issue
[16:10:51] (truncated packets -> bad mac ssl error)
[16:11:25] also, FF50 notes that doing a normal GET load of that page loads JS which in turn does an XHR POST back to the server
[16:11:40] something to do with server<->client time comparison for the JS on the page
[16:12:07] https://integration.wikimedia.org/ci/extensionList/hudson.console.ConsoleAnnotatorFactory/hudson.plugins.timestamper.annotator.TimestampAnnotatorFactory2/usersettings
[16:12:12] ^ is the URL POST'd to
[16:13:31] the POST body is empty, too
[16:13:39] the POST is successful also when the other resources fail
[16:14:07] for chrome this should be http/2, how old is your "old" testing FF?
[16:14:29] (I could try disabling http/2 too, although if that "works" it could point at lots of unrelated things)
[16:15:07] it is to show the timings (the POST)
[16:15:10] FF 43
[16:15:14] 43.0
[16:15:36] I have to go afk for a bit, ping me here though and I'll read later
[16:15:45] or do you want openssl versions, kernel and stuff?
[16:16:21] turned off http/2 now, try again?
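For the TLS-details comparison above (TLS 1.2 / ECDHE_ECDSA / AES_128_GCM) and the HTTP/2 toggle bblack just made, a minimal stdlib sketch like this shows what one handshake to the terminator actually negotiates, including whether h2 or http/1.1 is selected via ALPN. Only the hostname comes from the log; the rest is an assumed setup.

```python
# Sketch: open one TLS connection to the misc-web terminator and print what
# was negotiated, to compare against the browsers' reports and to see whether
# h2 or http/1.1 gets picked once HTTP/2 is toggled server-side.
import socket
import ssl

host = "integration.wikimedia.org"
ctx = ssl.create_default_context()
ctx.set_alpn_protocols(["h2", "http/1.1"])

with socket.create_connection((host, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print("TLS version:", tls.version())          # e.g. 'TLSv1.2'
        print("cipher:", tls.cipher())                # (name, protocol, secret_bits)
        print("ALPN:", tls.selected_alpn_protocol())  # 'h2' or 'http/1.1'
```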
[16:16:32] FF should use its own bundled NSS implementation
[16:16:56] g.dog's chrome should use bundled boringssl too, so two different implementations, neither of which is using openssl
[16:17:38] same repro for me
[16:18:01] (closed and reopened ofc)
[16:20:12] ok, trying two more things (turned off server-side TFO support, and server-side tcp metrics save)
[16:20:15] try again?
[16:21:11] ok trying
[16:21:29] if that doesn't work, I think we're down to trying nginx+openssl downgrade
[16:22:00] same at third retry
[16:22:23] btw the HTML is awful, I was looking to see if only resources defined in the HEAD or elsewhere were failing
[16:23:42] nginx downgraded, try once more?
[16:23:53] sorry, I know this is annoying, I wish I had a local repro
[16:24:15] (nginx+openssl downgraded I should say, they go together)
[16:24:36] still happening bblack :(
[16:24:41] hmmm
[16:24:48] the version I downgraded to, we've been on for a while
[16:24:54] (on the high volume clusters too)
[16:24:55] no problem at all, happy to help if possible
[16:25:04] so it seems unlikely there's some big bug there
[16:25:39] lol, noticed this in one of the dmesgs:
[16:25:41] [1050147.633557] TCP: request_sock_TCP: Possible SYN flooding on port 5666. Sending cookies. Check SNMP counters.
[16:25:46] I think that's the icinga nrpe port :P
[16:25:50] eheh
[16:26:18] hey
[16:26:28] what caches are you getting in X-Cache response header?
[16:26:35] cp3009 as the frontend?
[16:26:51] I've checked them before to see if I was getting a pattern
[16:27:01] the problem is that I get the header only for those that work
[16:27:08] the frontend should be consistent, from LVS source IP hashing
[16:27:17] (the final entry in X-Cache)
[16:27:37] I get different cp, 3007, 3008, 3009 reloading
[16:27:47] all values as pass ofc
[16:28:04] for the final one, the frontend?
[16:28:16] yes, the third value, let me confirm I can still get all 3
[16:28:46] mostly cp3007
[16:28:58] both when succeeding and when failing
[16:29:18] mostly?
[16:29:24] it's supposed to be hashed on your client IP
[16:29:43] now only that one, I was reloading many times
[16:29:51] but before I'm almost sure it was changing
[16:29:59] but I cannot say it 100%
[16:30:37] yeah seems always cp3007 now
[16:31:32] cp3009 has a bunch of memory errors the past week
[16:31:34] [Wed Oct 19 09:03:46 2016] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
[16:31:38] ... and similar
[16:32:16] it's depooled at all layers now, though
[16:32:42] try yet again?
[16:32:43] ok now I got the html from 3007 and ALL resources from 3010, and it failed
[16:32:49] before you said try again
[16:32:52] was loaded 1m ago
[16:32:54] ok
[16:33:10] so html from 3007 was fine, all the resources came from 3010, and they all failed?
[16:33:24] I still don't even get when it should be moving
[16:33:30] s/when/why/
[16:33:37] bblack: next time i'm there i'll swap with memory modules from the decom'ed ones, 3011-3022
[16:33:38] no, the 3010 ones were successes, but it was a case in which many resources were failing
[16:33:46] also the resources that fail seem to always be the same
[16:33:46] and that should be within the next few weeks
[16:34:15] volans: can you load just one of the failing resources on its own and still repro?
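To follow up the X-Cache discussion above, a small sketch along these lines can tally which frontend (the final entry in X-Cache) answers each successful fetch, to check whether one client really sticks to a single cp host under LVS source-IP hashing. The X-Cache value shown in the comment is only an illustration of the shape, not a real response, and failed fetches carry no headers, so they are just counted.

```python
# Sketch: count which frontend cp host (last X-Cache entry) serves each
# successful fetch. Stdlib-only; 20 iterations is an arbitrary choice.
import collections
import ssl
import urllib.error
import urllib.request

URL = "https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/55660/console"
frontends = collections.Counter()
failures = 0

for _ in range(20):
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            xcache = resp.headers.get("X-Cache", "")
    except (ssl.SSLError, ConnectionResetError, urllib.error.URLError):
        failures += 1
        continue
    if xcache:
        # hypothetical shape "cp1061 pass, cp3008 pass, cp3007 pass":
        # the final comma-separated entry names the frontend host.
        frontends[xcache.split(",")[-1].strip().split()[0]] += 1

print("frontend counts:", dict(frontends), "failures:", failures)
```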
[16:34:39] it still bugs me that ipvs sh is not sticking you to one frontend in general
[16:34:56] unless you're going through some kind of NAT that has multiple random exit IPs per connection or something
[16:35:14] bblack: so now ALL resources from 3010 and I can still repro, if I open a single resource no, I cannot repro
[16:35:17] they load fine
[16:35:42] ok maybe your earlier attempt was in the midst of the depool and it rehashed
[16:35:46] maybe it will stay on cp3010 now
[16:37:04] let me see if I can dredge up more tricky things to disable ...
[16:40:01] volans: one more time? :)
[16:40:06] sure :)
[16:40:32] I'm turning off all kinds of non-default things, we're about back to a very very stock sort of default nginx ssl install for these nodes
[16:40:40] still repro :(
[16:40:43] undoing tcp tweaks, undoing local config customizations, etc
[16:41:01] undid the cloudflare dynamic tls record sizing, put the session cache back to a sane size
[16:41:15] turned off TFO, turned off tcp metrics saving, turned tcp autocorking back on
[16:41:27] and of course downgrading back to the openssl+nginx we had before
[16:41:43] all this for me... so sweet :-P
[16:41:50] although, now that I say that... we did do a minor bugfix upgrade of the openssl-1.0.2 package not that long ago
[16:42:13] maybe 1.0.2j introduced some new subtle bug while fixing things (and it's also in 1.1.0b)?
[16:43:07] next step in terms of easy->reward, I'm going to disable the cipher you're using and force you to use something else
[16:43:18] need my IP?
[16:43:36] I guess you already have it :D
[16:44:48] volans: try again?
[16:45:44] same, at 6th retry
[16:45:47] lol
[16:45:55] ok that's nuts
[16:46:01] but why only with console open?
[16:46:10] I'm going to turn puppet on and bring everything back to where it was, so far out on a misconfigured limb now
[16:46:19] no actually now I was able to repro without console
[16:46:29] lol, ok
[16:47:42] I guess our puppet sysctl stuff doesn't check when there's no catalog change?
[16:47:55] it failed to reset sysctls in any case
[16:47:56] probably not
[16:48:11] bblack: if it helps, when it fails it takes more time loading before failing
[16:48:23] while when it's ok it loads pretty quickly
[16:51:10] ok everything back to sanity
[16:51:26] bblack: an additional datapoint, I can't repro with Ctrl+Shift+R
[16:51:34] only with Ctrl+r
[16:51:49] and you can only repro when the console is open, not when it's closed?
[16:51:58] I thought I heard that at some point
[16:52:02] I was able to repro with closed console too
[16:52:05] ok
[16:52:08] I thought it too
[16:52:14] just tried harder XD
[16:52:33] and being js/css I guess they should be 304
[16:52:45] and they are 304 when everything is ok
[16:53:02] all I can see is that mime-type issue, and that it does seem to load really slow
[16:54:06] so when forcing without cache it all works fine
[16:54:55] that seems... odd
[16:55:09] so are they 304s when they fail?
[16:55:11] but they are two different layers, SSL and HTTP :D
[16:55:16] or are they just loaded from local cache?
[16:55:34] when they fail they are nothing
[16:55:44] no size, no content, no mime
[16:55:46] well, ok
[16:56:37] can you repro whatever paladox saw in IE11?
[16:56:49] https://commons.wikimedia.org/wiki/File:FEZ_trial_gameplay_HD.webm
[16:56:56] he says he was viewing that with IE console open
[16:57:38] what is the issue for paladox?
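Since volans could only reproduce with Ctrl+R (conditional revalidations answered by small 304s) and not Ctrl+Shift+R (full 200 bodies), a sketch like the one below compares the two request shapes for a single static sub-resource over one keep-alive connection. The validators come from the first response; this does not reproduce the exact headers the browsers send, and if the server returns no validators the second fetch is just another 200.

```python
# Sketch: full fetch (roughly what Ctrl+Shift+R does) vs. conditional
# revalidation (roughly what Ctrl+R does) for one static sub-resource.
import http.client

HOST = "integration.wikimedia.org"
PATH = "/ci/static/03f038ee/scripts/yui/yahoo/yahoo-min.js"

conn = http.client.HTTPSConnection(HOST, timeout=10)
conn.request("GET", PATH)
first = conn.getresponse()
body = first.read()
etag = first.getheader("ETag")
last_mod = first.getheader("Last-Modified")
print("full fetch:", first.status, len(body), "bytes")

cond_headers = {}
if etag:
    cond_headers["If-None-Match"] = etag
if last_mod:
    cond_headers["If-Modified-Since"] = last_mod

conn.request("GET", PATH, headers=cond_headers)
second = conn.getresponse()
second.read()
print("conditional fetch:", second.status)  # 304 expected when validators match
conn.close()
```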
[16:57:54] T148595
[16:57:54] T148595: Visting [[c:File:FEZ_trial_gameplay_HD.webm]] in IE11 shows errors in developer console about insecure data:image/png;base64 "URL" - https://phabricator.wikimedia.org/T148595
[16:58:05] it's a load.php resource on the page, failing SSL stuff in a different way
[16:58:25] but it's common that it's an SSL error on a subresource, recent, seeing it in dev console. but on commons, with some build of IE
[16:59:29] maybe we have some strange interaction there between the R1 root issue, the 4 day window it was "bad", local browser caching for days of a previously-network-loaded subresource, etc
[17:00:14] no, all clear to me
[17:00:21] also playing the video, no issues
[17:00:24] trying to reload
[17:00:56] nope all clean, both network and console
[17:03:01] is the repro special to this one URL? I'm assuming you're still re-using the integration URL like g.dog had
[17:03:06] other thing, the images never fail, only css/js
[17:03:15] yes let me try other urls
[17:03:26] https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/55600/console
[17:03:32] ^ is a slight variant
[17:03:39] but phabricator is on the same terminators
[17:03:45] this is the one I was trying
[17:04:03] the one that godo* posted before
[17:04:26] so one possible theory that fits is this:
[17:04:36] the project page doesn't fail so far
[17:04:50] we've seen both "bad mac" and "connection reset". maybe "bad mac" is just another variant of what error can happen when the connection is closed/reset randomly during TLS
[17:05:00] and this is pass-traffic through varnish
[17:05:13] and maybe the integration backend misbehaves or bombs out on some requests randomly
[17:05:31] and that passes all the way back up the chain to an http protocol error at the nginx level, which causes it to immediately ungracefully close the connection
[17:05:33] also a single job page doesn't fail, trying console
[17:06:35] mmmh what did you fix? I cannot repro anymore :D
[17:07:02] oh finally repro again
[17:09:21] but only on console pages
[17:09:22] so far
[17:09:32] _joe_: So a new "unused" article does not work? How about an old article with almost no traffic? In my current data set I do have at least 1 article which is changed very seldom and has almost no traffic other than my current crawler. Do you think that might work?
[17:09:56] <_joe_> that could, yes
[17:10:41] bblack: now that I talked, also the single job page failed (id 55680), so maybe it is the backend bombing
[17:11:02] but is this limited to integration.w.o or also other sites on cache_misc?
[17:12:39] And the basic idea should work, right? Cycling through the frontend to always get misses and only hits on the backend. With 6 different IPs this would be 4 hours between requests. Using 7 IPs with a 4-hour delay should bypass the frontend, and with almost no other traffic it might produce relevant data. And if not, the traffic should not influence anything in a bad way.
[17:14:22] volans: I only of these repros for integration
[17:14:27] err, I only know
[17:14:46] volans: are you still seeming to stick to cp3010?
[17:14:48] other sites on the same SSL terminators? phab?
[17:14:58] yes I'm on 3010
[17:15:38] volans: other sites would be "git grep misc-addrs" in the DNS repo. there's a lot
[17:16:23] I got a repro with chromium 53.0.2785.143
[17:17:04] volans: do another repro?
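On Snorri's scheduling arithmetic a bit earlier ("with 6 different IPs this would be 4 hours between requests"), the constraint is that the same client IP, and therefore the same frontend, must not come back before its cached copy has expired. A tiny sketch of that arithmetic, assuming the roughly 24-hour frontend object lifetime those numbers imply (not a value stated in the log):

```python
# Sketch of the revisit-interval arithmetic. FRONTEND_TTL_H is an assumption
# inferred from the "6 IPs == 4 hours" figure above, not a documented value.
FRONTEND_TTL_H = 24   # assumed frontend object lifetime (hours)
DELAY_H = 4           # delay between consecutive requests (hours)

for n_ips in (6, 7):
    revisit = n_ips * DELAY_H  # how long until the same IP (same frontend) returns
    expired = revisit > FRONTEND_TTL_H
    print(f"{n_ips} IPs: same frontend revisited every {revisit}h "
          f"-> frontend copy already expired: {expired}")
# 6 IPs is the borderline case (revisit == TTL); 7 IPs gives 28h > 24h, so
# every request should miss its frontend and land on the shared backend.
```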
[17:17:12] no repro with chrome 52.0.2743.116 so far
[17:17:19] sure, retrying
[17:17:58] ok, holy crap at the number of requests that that created when I looked at varnishlog
[17:18:25] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2729247 (10elukey) Got some requests logged with the new query string but no trace of backend tags. Maybe this i...
[17:19:44] volans: what was the first failed resource on that attempt?
[17:20:24] which attempt?
[17:20:29] "sure, retrying"
[17:20:37] I saw a deluge of requests I can only assume was that attempt
[17:20:44] style.css
[17:20:48] color.css
[17:21:00] but I was trying to repro also on other sites
[17:21:10] if you want I can do a retry at a time
[17:21:26] if you're debugging live
[17:21:50] can you send me your IP to be sure?
[17:23:06] ok logging now
[17:23:10] try 1x repro attempt?
[17:23:27] done, it worked though
[17:23:32] all good
[17:23:45] bleh
[17:23:55] started fresh log, try again?
[17:24:00] do whatever works best for repro
[17:24:05] yay failed
[17:24:13] ok looking at log...
[17:24:21] style.css first in list failed
[17:24:28] color.css too
[17:24:32] and many others
[17:24:56] 69 requests in your attempt there
[17:25:01] http fetches I mean heh
[17:25:14] eheheh it has so many resources
[17:25:25] FF says 97 requests
[17:25:36] 1.3 MB, 5.15s
[17:25:54] it says color.css was a 304
[17:26:11] color.css was a bleh
[17:26:25] /ci/static/03f038ee/css/style.css
[17:26:30] is the style.css in question?
[17:26:35] (there's a bunch of others with longer paths)
[17:26:47] yes
[17:26:57] is this one
[17:27:21] they look ok, they're both 304s with pass through the layers
[17:27:43] responsive-grid.css ?
[17:27:54] dragdrop-min.js ?
[17:28:04] I can continue ofc :D probably useless
[17:29:13] ema: your repro I assume is also on the same integration URL(s)?
[17:29:20] bblack: right
[17:29:44] ema: we're probably running the same browser. what was the trick to repro?
[17:29:50] I still can't ever do it here
[17:30:07] I've set my mtu to 1280, but it might be irrelevant
[17:30:27] after setting it back to 1500 and restarting the browser I could still reproduce the problem
[17:30:41] nice subtle bug in varnishapi.py -> https://github.com/xcir/python-varnishapi/issues/65
[17:32:55] ema: care if I reboot cp1008?
[17:32:59] bblack: my completely random guess is that the problem might be related to path mtu discovery and that you're using ipv6 while me, godog and volans got repros with v4
[17:33:09] I'm also not hitting esams
[17:33:27] feel free to reboot the unicorn
[17:34:21] ema: true, but in that case if ICMP type 3 code 4 is blocked your connection hangs
[17:34:37] or you think the issue is in the fragmentation?
[17:34:43] (also, I'm not using v6)
[17:34:51] ok :)
[17:34:57] volans: yes I was thinking of fragmentation
[17:35:19] I have to go, though I see you can reproduce at will
[17:35:23] yeah the two most-likely candidates are either strange network-layer stuff with MTUs and whatnot
[17:35:54] or something with integration having buggy responses that make it through the cache-pass chain and cause nginx to interrupt the TLS connection abruptly
[17:36:08] ouch!
[17:36:11] I'd believe the network case if we could find some different repros
[17:36:21] see you tomorrow!
[17:36:25] bye godog!
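On the path-MTU theory ema and volans are weighing above, one low-effort client-side check is to ask the kernel what path MTU it has recorded for a connection to the terminator after a handshake and a small exchange. This is a Linux-only, stdlib-only sketch; it only shows the client kernel's view and cannot by itself confirm or rule out mid-path fragmentation trouble.

```python
# Sketch: read the kernel's cached path MTU for a connection to the terminator.
# IP_MTU is defined manually in case the socket module doesn't export it
# (value 14 is from <linux/in.h>); it only works on a connected socket.
import socket
import ssl

IP_MTU = getattr(socket, "IP_MTU", 14)
host = "integration.wikimedia.org"

ctx = ssl.create_default_context()
sock = socket.create_connection((host, 443), timeout=10)
with ctx.wrap_socket(sock, server_hostname=host) as tls:
    # Push some data both ways so PMTU discovery has had a chance to run.
    tls.sendall(b"HEAD / HTTP/1.1\r\nHost: " + host.encode() +
                b"\r\nConnection: close\r\n\r\n")
    tls.recv(4096)
    print("kernel path MTU for this connection:",
          tls.getsockopt(socket.IPPROTO_IP, IP_MTU))
```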
[17:36:30] surely it's not such a special edge case that only integration outputs can see it
[17:36:31] I've tried grafana and noc without luck
[17:36:33] bye godog
[17:36:34] bye godog :)
[17:37:04] maybe it's worth checking the logs of jenkins?
[17:37:09] I'll try redirecting myself to esams for integration
[17:37:18] (maybe!)
[17:37:35] being jenkins it might be a waste of time too
[17:38:49] heh
[17:38:58] I switched to esams and I repro'd on the 3rd try
[17:39:09] now we're talking :D
[17:39:42] I'll go try ulsfo too
[17:40:23] the HTML *never* failed for me
[17:43:49] I can repro the slow 304s easier
[17:43:59] after going to ulsfo and then back to esams, now I haven't yet reproduced the failure
[17:45:05] ok I saw something weird that time
[17:45:12] volans: so you did manage to reproduce with ctrl-r after an initial successful page load?
[17:45:13] I had the Network tab open
[17:45:35] and I saw a bunch of request lines start filling in there, and then they went red with (canceled), and disappeared, and more filled in
[17:45:43] once the page was done loading, it all looked like success
[17:45:54] oh no, the cancels are still there in the list
[17:46:11] bunch of pngs were canceled
[17:46:14] ema: I was able to repro w/ and w/o console/network open, w/ ctrl+r but *not* w/ ctrl+shift+r
[17:46:21] right
[17:46:25] canceled, no I don't have those
[17:46:44] I've only hit it once, I'm still looking at the results
[17:46:51] 251ms to load a png then canceled with no response data
[17:47:00] no ssl errors or any other console errors, just shows in Network list
[17:47:22] the same "canceled" PNGs show up successful later in the network list, too
[17:48:33] never saw a png failing for me
[17:48:38] always had the network tab
[17:48:41] open
[17:48:57] to see the failures given that my console is empty
[17:49:15] finally got another repro
[17:49:39] on a ctrl+shift+R, after several other success attempts at various kinds of reloads
[17:50:34] do you get an Uncaught Type Error from behavior.js at the bottom of your repros? could just be random fallout of missing JS, too.
[17:50:55] in console?
[17:51:29] yeah
[17:51:35] was at the bottom after all my bad mac errors
[17:51:55] yes I also do get random js errors
[17:52:07] Uncaught References
[17:52:26] also, in my latest repro, the first error in the console is a connection-reset one, then all the rest are bad mac
[17:52:29] is that always the pattern?
[17:52:42] no my console has only an html error always and that's it
[17:53:02] it's clean
[17:53:15] all my failures I see them in the network tab, I guess this is FF-specific vs Chrome
[17:53:29] I think this has to be the result of the esams nginx sending me a RST, or similar
[17:53:33] the html is ~4k while the other resources (which fail) seem to be smaller
[17:53:36] even so, we'd have to understand why
[17:53:52] but I can at least sniff locally for that kind of thing
[17:55:01] any network-related change in esams or esams<->eqiad?
[17:55:10] then since when... we don't have a start date
[17:55:39] I am getting RST from nginx
[17:56:47] so the first connection the browser made...
[17:57:39] seems to be going ok at the tcp headers level until this point:
[17:57:40] 17:55:18.685091 IP (tos 0x0, ttl 46, id 62026, offset 0, flags [DF], proto TCP (6), length 52) 91.198.174.217.443 > 192.168.42.12.54396: Flags [R.], cksum 0x212a (correct), seq 26780, ack 7599, win 105, options [nop,nop,TS val 284454778 ecr 22040297], length 0
[17:57:49] which is a RST from esams->me
[17:58:10] immediately after that my browser opens 6 new connections to esams (all with new source ports)
[17:58:20] (as in, 6x outbound SYN)
[17:58:48] then right after that I get 11x RSTs for the old connection that was RST'd before.
[17:58:53] (wtf)
[17:58:57] wow
[17:59:17] any repro in other DCs?
[17:59:26] those have unique seqnos, they're not duplicates
[17:59:27] do all the hosts in esams have the loopback alias?
[17:59:59] this reminds me of an RST sent when it was not set (playing with C some time ago)
[18:00:00] huh?
[18:00:09] we are using LVS-DR, right?
[18:00:32] can you check all hosts have the right lo:LVS alias (or whatever it is named)
[18:00:50] I know so many other things should fail if not, and it is so unlikely
[18:00:51] yes
[18:01:02] if they didn't, they wouldn't even accept the incoming requests
[18:01:18] but yes, they show up currently in 'ip -4 addr'
[18:01:42] yeah I know they would have RST all the conns
[18:07:14] great firewall of europe?
[18:21:20] I've gotta go, see you all tomorrow!
[18:21:27] bye!
[21:10:06] 10Traffic, 06Operations: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2730003 (10BBlack) 05Open>03Resolved a:03BBlack Done for now, assuming we don't find a reason to revert!
[21:10:37] 10Traffic, 06Operations: Extend check_sslxnn to check OCSP Stapling - https://phabricator.wikimedia.org/T148490#2730007 (10BBlack) a:05BBlack>03None
[21:14:10] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2730013 (10BBlack)
[21:14:13] 10Traffic, 06Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2730015 (10BBlack)
[21:14:49] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review, 07Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1149975 (10BBlack) Seems like the nginx-internal + ssl_stapling_proxy path is the way to go, so folding the other subtask back in (alre...
[21:15:31] 10Traffic, 06Operations: OCSP Stapling: support truly-independent ECC/RSA Certs+Staples - https://phabricator.wikimedia.org/T148132#2715770 (10BBlack)
[21:15:34] 10Traffic, 06Operations: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134#2730018 (10BBlack)
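For captures like the one bblack pasted above at 17:57:40 (one RST, then 6 fresh SYNs, then 11 more RSTs for the old connection), a small parser over `tcpdump -n -v` text output can summarize flags per flow so such patterns stand out. The regex only matches the IPv4 addr.port form shown in that capture line; the script name in the usage example below is made up.

```python
# Sketch: summarize TCP flags per (src, dst) flow from tcpdump text output.
import collections
import re
import sys

FLOW_RE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+\.\d+) > (\d+\.\d+\.\d+\.\d+\.\d+): Flags \[([^\]]+)\]"
)

flags_per_flow = collections.defaultdict(collections.Counter)

for line in sys.stdin:
    m = FLOW_RE.search(line)
    if m:
        src, dst, flags = m.groups()
        flags_per_flow[(src, dst)][flags] += 1

for (src, dst), counts in flags_per_flow.items():
    print(f"{src} > {dst}: {dict(counts)}")
```

It could be fed either live (`sudo tcpdump -l -n -v 'host 91.198.174.217 and tcp port 443' | python3 rst_summary.py`) or from a saved capture rendered back to text with `tcpdump -n -v -r capture.pcap`; in both cases the summary only prints once the input ends.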