[00:40:56] Krenair: honestly I'm not too worried about that part. We can take backups JIC and restore with a puppet role and an rsync
[00:41:20] Krenair: but they're non-canonicals anyways, and shouldn't be in prominent use to begin with. We're just preserving them because they exist and there's links to them.
[00:41:32] ok
[00:41:39] Krenair: if once in a blue moon we suffer a hardware failure and it takes a little bit to get them up... meh?
[00:42:05] in theory we could just let the scripts re-fetch them, but I think they might run into ratelimits possibly, hence backups and rsync
[00:42:43] I don't know if you saw, I uploaded some code for a basic service with nginx+letsencrypt
[00:42:49] I donno, it's possible that if we structure it right we don't even hit ratelimits
[00:43:04] where at?
[00:43:27] oh I see
[00:44:16] yeah there's some tricky bits with templating them out 100x per cert, and also templating out language subdomains for the ones that are currently wildcards expecting them
[00:45:17] yeah, it doesn't handle the wildcard stuff yet
[00:45:20] I had been thinking of maybe just moving them over first and dealing with LE after, too, but we can go either way
[00:45:37] (right now they have no TLS, so we could move them to a new IP that also has no TLS first, and then start tacking on TLS + LE)
[00:45:44] yes, possibly
[00:45:53] We *could* do HTTP-only for those wildcard ones
[00:46:15] yeah it's all such a tricky subject when it comes to how well we should support them
[00:46:24] Or we could leave them for now
[00:47:10] And in future decide whether we want to purchase certificates (unlikely), move them in as HTTP-only, or shut them down
[00:47:25] from a functional perspective I'd like to say we just deprecate and someday remove all the non-canonicals, they're a bad idea in general. but then there's legal/squatting/typo reasons for a lot of them.
[00:48:11] we could also park them and have them redir to HTTPS and then just hit a brick wall static page saying "yeah WMF owns this, but it's not useful, try these manual links to our projects!"
[00:48:28] but all of those things are uphill battles for every sub-community that has a pet domainname we currently redirect
[00:48:57] if we deploy a generic LE-based redirect service, it moves us past all of those battles for now at least, and lets us not have truly-insecure redirects on any of them.
[00:49:45] What we could do is take a list of all currently supported language subdomains in each case
[00:49:48] List those
[00:49:51] Stop adding new ones
[00:50:00] maybe?
[00:50:50] yeah
[00:51:06] it's a mess once we start trying to categorize and define a level of support for the hundreds of domains in the list
[00:51:43] (but most of those, we don't support in redirects.dat today, they're just in a list of things registered to us. so should be easy to park)
[00:54:55] in any case the next step is to decide if you want to use the code I uploaded. If so, we can polish it up for production usage, and we can extend it to do HTTP-only cases if wanted later on
[07:00:22] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736599 (10Arseny1992) T148830 describes a related caching issue T145811 and T93041 descri...
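For reference, a minimal sketch of the sort of nginx + Let's Encrypt redirect vhost being discussed above (this is not the uploaded code, and it skips the wildcard/language-subdomain templating entirely); the domain, paths and the certbot invocation are placeholder assumptions:

domain=wikipedia.example    # placeholder non-canonical domain
cat > /etc/nginx/sites-available/redirect-${domain} <<EOF
server {
    listen 80;
    server_name ${domain} www.${domain};
    # answer ACME http-01 challenges locally, redirect everything else
    location /.well-known/acme-challenge/ { root /var/acme/webroot; }
    location / { return 301 https://www.wikipedia.org\$request_uri; }
}
server {
    listen 443 ssl;
    server_name ${domain} www.${domain};
    ssl_certificate     /etc/acme/certs/${domain}/fullchain.pem;
    ssl_certificate_key /etc/acme/certs/${domain}/privkey.pem;
    location / { return 301 https://www.wikipedia.org\$request_uri; }
}
EOF
# any ACME client would do; certbot in webroot mode shown as an example
certbot certonly --webroot -w /var/acme/webroot -d ${domain} -d www.${domain}
ln -sf ../sites-available/redirect-${domain} /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx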
[09:06:20] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2737447 (10elukey) One thing that is worth mentioning: the `ReqAcct` tag in the logged requests without `Timesta...
[09:47:08] 10Traffic, 06Discovery, 06Maps, 06Operations, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2737493 (10Gehel) 05Open>03Resolved Resolved during maps incident as pointed out by @BBlack.
[11:31:32] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2737624 (10ema) As @BBlack mentioned on IRC, the reports tend to happen in esams at peak r...
[12:19:07] of course the "obvious" answer would be a TLS issue with nginx or openssl
[12:19:35] but we tried the downgrade during the cache_misc thing to no avail. tried all the other obvious things like disabling TFO, disabling dynamic record sizing, disabling tcp metrics save, etc
[12:20:57] I wasn't observing a lack of TCP ports during the repro period yesterday, either
[12:21:25] including TIME_WAIT, the port 3127 socket count was coming in slightly under 32K on each check
[12:22:13] we should have ~63K in our local_port_range, and nginx should be spreading conns between child procs and thus well under its own fd and connection limits
[12:22:38] but the fact that it was persistently just under a magic number like 32K is suspicious, too
[13:09:09] bblack: the jenkins issue in _misc is probably related to this one in upload though, right?
[13:09:24] I mean, it's the same error message on the client side
[13:10:14] yeah possibly
[13:10:20] the link between them is tricky, but there's something
[13:10:54] but in misc we would have to exclude port issues
[13:11:04] (I think?)
[13:11:06] it is the same issue client-side, but I still tend to think the "bad mac" stuff is just fallout of ungracefully killing the TLS conn
[13:11:40] and that integration could be causing connection aborts via backend behavior on streaming "pass", and the esams upload case could be having them for an entirely different reason
[13:12:18] but it's also worth questioning nginx and/or openssl one more time
[13:12:38] the only evidence the package updates aren't the problem is that integration's problem didn't go away when I downgraded the packages on esams cache_misc
[13:12:44] hey folks -- fwiw, I've experienced the failed thumbnails a couple of times myself
[13:13:21] but (a) if the causes are distinct, proof it didn't affect integration doesn't mean much and (b) that was before I had a self-repro, so we were relying on timing feedback between me and someone else
[13:13:36] from my own computer that is, no antiviruses, proxies or anything like that
[13:13:52] and with Firefox
[13:13:53] 10Traffic, 06Operations, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2737841 (10Joe) 05Open>03Resolved a:05ema>03Joe
[13:13:56] paravoid: I've gotten it twice, while trying hard on category pages
[13:14:15] I think it's real, but ... it's been a bitch to find an obvious cause so far
[13:15:07] paravoid: when did you get a repro? Was it at esams peak time?
[13:15:39] sorry, I don't remember
[13:15:44] and I didn't check :(
[13:15:49] I will if it happens again
[13:16:36] thanks.
I've tried reloading category pages a lot earlier this morning and never bumped into the issue
[13:17:00] ema: but you get them now?
[13:17:09] bblack: nope
[13:17:25] it was kind of weird, at some point almost no images were loaded for the page
[13:17:29] dozens of them
[13:17:32] in any case, the 8xports thing is a good measure to have in place in general
[13:17:36] so it seemed like an H2 issue
[13:17:39] but I didn't debug further
[13:17:52] if we put it in place and it doesn't fix anything, we can leave it there as a general improvement
[13:17:57] yes
[13:18:00] if I clicked on the thumb, it loaded
[13:18:10] the full-sized image I mean
[13:18:15] paravoid: yeah that's the same basic symptoms others had. it never fails to load a single image's URL on its own
[13:18:25] it's the category page with many many thumbs that will invoke the behavior
[13:18:55] paravoid: from what I understood reading the backlog here, the server sends an RST when the problem shows up so yeah, the whole h2 connection would be affected
[13:19:02] so, yes H2 is likely a factor, but killing H2 and seeing it go away doesn't mean we've fixed the underlying issue either. it may just make it rarer
[13:19:20] right, without H2 you'd probably only have 1x image fail if/when it happened
[13:19:32] shift+f5 didn't fix it though
[13:19:34] but with H2, you might have 37 images in flight multiplexed on a single connection that gets aborted or whatever
[13:21:18] in one of my (only two to go by though) repros, I counted them and there were exactly 64 broken images
[13:21:32] which sounds a lot like some implementation limit on one side or the other for max multiplexing
[13:22:00] the two biggest outstanding questions, IMHO, are:
[13:22:54] 1) Is the ssl bad record mac error just a fallout of a reset connection for other reasons, especially with many xfers multiplexed in one connection? Or is some TLS protocol failure causing that message and causing the connection to be reset?
[13:23:09] [I tend to think the former, but the latter is technically possible]
[13:23:26] 2) Even assuming it's not a TLS protocol issue and it is a conn-reset issue - what's causing it?
[13:23:35] I wonder if T148412 could be related
[13:23:35] T148412: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412
[13:24:41] the last good pointer there was an issue with choosing/using varnish backends, right?
[13:25:03] yes, and in particular varnishlog shows the request headers as being received, but 0 bytes are sent back to the client
[13:25:31] ok so...
that could be another sort of connection limit
[13:26:17] which the 8x varnishes would certainly improve
[13:26:23] thinking about this: a) our arbitrary limit-from-the-past is 10K conns for 1xFE->1xBE in the backend params for varnish<->varnish and b) One of the effects of dropping the cacheable size limit in the FEs on upload would be more BE conns
[13:26:34] ema: different layer of connections
[13:27:00] oh this is fe<->be
[13:27:08] the 256K limit (vs past much-higher limits) means a larger percentage of connections into the fe will result in connections to the local be
[13:30:14] just checking cp3044, for port 3128 I see ~7600 total connection table entries, but only ~232 in ESTABLISHED
[13:30:32] still, maybe bumping the 10K limit there isn't awful
[13:31:28] oh I'm wrong
[13:31:38] the existing parameterization in puppet is fe->be is 100K, be->be is 10K
[13:31:42] (or 1K, depending on cluster)
[13:31:50] so it's not that
[13:33:32] if we assume the ssl bad record mac == strange way to say "tcp connection reset", and we also assume the integration problem has a separate cause in jenkins/apache aborting connections bubbling up to nginx via pass+stream in a way that causes nginx unclean connection abort
[13:33:47] then we're really back to square one anyways on finding the separate cause for connection reset in this upload case
[13:34:03] should potentially re-test things tested before during the integration case
[13:35:08] some of the things tested then were far-out wildcards anyways, e.g. TFO (the overall attempts on that aren't high enough for it to be this, and you wouldn't expect it to break a conn after data already flowed)
[13:35:34] but some other bits could be related: maybe the dynamic record sizing from cloudflare, if we're looking for something closer to the TLS stack.
[13:35:50] and of course maybe this local port exhaustion thing
[13:36:00] and we can test package downgrade again
[13:36:18] the 0-bytes issue was reported by elukey on October the 17th, which is before the nginx upgrades FWIW
[13:37:09] last night while I had reproducers online, I tried disabling x25519 key exchange just in esams+upload too, just in case of some effect or bug on that, since it's new
[13:37:13] no effect
[13:44:20] (also, some repro reports are from UA combinations that lack both x25519 + chapoly, so that tends to rule out that new code in general)
[13:53:00] another reason to rule out max_connections as a possible source of troubles here is that varnish would return a 503 in that case and I don't think we've seen any
[13:55:13] yeah
[13:55:19] should, anyways :)
[13:55:54] right :)
[13:57:59] it would be good at any rate to monitor when the number of connections gets close to max_connections and such
[13:58:13] dreams for a prometheus future
[14:23:12] more fun
[14:23:30] apparently varnishd can fail to open one or more of its listen sockets, silently
[14:24:09] only happened to 2x instances out of 100
[14:24:26] they both failed to open 1/8 of their ipv4 listen sockets
[14:24:44] cp1053:3120 and cp1067:3124
[14:25:03] even checking lsof, it's just missing in the list (and the socket fd# are sequential)
[14:25:14] the prots are there for v6, but not v4 heh
[14:25:18] s/prots/ports/
[14:26:56] fixes itself on restart, just random fluke...
[14:27:06] <_joe_> reassuring...
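A quick shell sketch for auditing the silently-missing listen socket case above, assuming the eight 3120-3127 listen ports and standard netstat -tln output; the port range and awk field positions are assumptions about these hosts:

# flag any expected varnish listen port that lacks a v4 or v6 listener
for port in $(seq 3120 3127); do
  for proto in tcp tcp6; do
    netstat -tln | awk -v f="$proto" -v p=":${port}$" '$1 == f && $4 ~ p' | grep -q . \
      || echo "$(hostname): no ${proto} listener on :${port}"
  done
done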
[14:27:56] most likely there was a conflict with an ephemeral port (already being fixed up in a future commit pending)
[14:28:04] but still, you'd expect failure-to-start, or at least a log message :P
[14:31:15] <_joe_> bblack: yes
[14:31:31] <_joe_> btw, do we still have your script running to restart API appservers?
[14:31:35] yes
[14:31:42] it triggered over the weekend on uptimes, btw
[14:31:49] restarted a bunch of them in a relatively short window
[14:32:05] leading to some very small 5xx spike events every 10 minutes for a while
[14:32:06] <_joe_> which is expected
[14:32:19] <_joe_> which is why I prefer to have cron-splay for that
[14:32:48] also from the weekend: cp1052 has some ethernet failure of some kind - T148891
[14:32:49] T148891: cp1052 ethernet link down 2016-10-22 14:11 - https://phabricator.wikimedia.org/T148891
[14:33:05] <_joe_> yeah noticed that
[14:33:09] basically the link just went down (in host dmesg, like someone pulled the cable)
[14:33:36] after futzing with everything else, I was able to bring it back with a local ifconfig down->up, and then rebooted, and then a few minutes later after booting up fine, it went down again
[14:33:44] so it's left down/depooled now
[14:33:55] (well "down" on the interface, the host software is up)
[14:34:36] could be: switch port, cable, bnx card, ?
[14:34:37] who knows
[14:34:48] but it's causing all the ipsec spam crits in icinga for all the other related hosts
[14:35:16] I could comment it out in hieradata I guess
[14:35:44] in any case, on with the port# crap...
[14:42:11] 10Traffic, 10MediaWiki-General-or-Unknown, 06Operations, 06Release-Engineering-Team, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2738137 (10demon) p:05High>03Low >>! In T140658#2684595, @BBlack wrote: > Is there more to do here on the MW-C...
[14:50:23] the ipsec-spam crits could perhaps be avoided by ignoring messages about depooled hosts in check_strongswan
[14:53:08] yeah...
[14:53:15] sounds like an interesting task! :)
[14:54:00] which I'm gonna create now :)
[14:56:17] so I'm playing with the 8xport thing on 1x esams upload cache
[14:56:31] definitely one of the several changed things affected things
[14:56:55] root@cp3044:~# netstat -an|grep ':312[0-7] ' |wc -l
[14:56:55] 128460
[14:57:07] whereas before that was just under 32K all the time
[14:57:32] at first I thought with the ~32K and ~128K, it must be the change in the nginx #fds/#conns params
[14:57:43] but doubling those again doesn't shift the number up any more
[14:57:55] the trick is, my netstat sees both sides of the connection
[14:58:20] (maybe?)
[14:59:00] maybe not
[14:59:28] if it was just the 8x multiplier from the varnish ports and still not enough, you'd expect to see it around ~256K
[14:59:39] 10Traffic, 06Operations: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976#2738250 (10ema)
[14:59:50] in any case, there's also something to be said for the limit being soft-ish.
either way we're recycling timewait ports
[14:59:52] 10Traffic, 06Operations: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976#2738265 (10ema) p:05Triage>03Normal
[15:00:08] maybe this just relieves pressure on how quickly we recycle them, and recycling them too quickly can cause RST
[15:00:43] still, if it's only a rare problem at peak req-rate, multiplying by 8 should be enough, I'd think
[15:02:57] in any case, it's working on cp3044 as intended apparently, so releasing it to the rest with puppet enable->run
[15:03:06] nice
[15:03:32] from there, I guess we see if anyone can repro still or not
[15:06:56] (trying myself again...)
[15:08:54] bblack: how are you trying? Simple page reloads?
[15:09:50] I'm trying all forms randomly since I'm never sure what will get it
[15:10:06] pressing enter on the URL bar again, ctrl+r, ctrl+shift+r
[15:11:17] got one with ctrl+shift+r
[15:11:51] slight behavior change, though: they're all net::ERR_CONNECTION_RESET rather than the ssl bad record mac thing
[15:12:02] interesting
[15:12:06] usually the bad record mac was more common I thought, with connection reset often alongside it, but not spammed
[15:12:56] http://sd.keepcalm-o-matic.co.uk/i/keep-calm-and-press-ctrl-r.png
[15:12:59] * volans hide
[15:13:36] if you need my "repro machine" just ask bblack ;)
[15:14:45] well and I've also got some stranger repro now
[15:15:21] no console errors, so I thought this request was non-repro. but scrolling back up the list of fetched things in the Network tab shows several with the response code replaced with (canceled), and looking at the page there are broken images.
[15:15:44] bblack: do you have TFO enabled? For some reason I still cannot repro
[15:16:11] (I do have it enabled)
[15:16:42] not that it should make much of a difference
[15:18:08] yeah I don't think TFO is likely
[15:18:31] I think "(canceled)" might be unrelated. It might be me pressing reload before the previous reload is done.
[15:20:09] it bugs me that the #sockets (incl TIME_WAIT) for nginx->varnish was hovering a bit under 32K before, and now hovers a bit under 128K
[15:20:14] both sound like a limit is reached
[15:21:19] root@cp3030:~# netstat -an|grep ':312[0-7] ' |wc -l
[15:21:19] 200553
[15:21:46] ^ esams cache text blows past that limit at present though, so maybe it's a red herring that we're ~128K and just under on the esams upload cache I'm observing
[15:22:15] (esams cache text has higher reqs/sec, but upload has more connection parallelism for its slightly smaller reqrate)
[15:22:22] (longer request lifetime, if you will)
[15:24:29] yeah my testing gets confused by reloading too-fast
[15:25:01] if I do a pair of ctrl+shift+r too quickly, the network pane is showing requests from 2x reloads, and a bunch of (canceled) on the earlier ones, and sometimes 1x connection-reset in the log
[15:28:33] I still have yet to see a bad record mac thing though, that's something
[15:29:11] indeed
[15:29:17] (I still have to see a repro) :)
[15:30:48] so, something related we could tune as well
[15:31:00] https://nginx.org/en/docs/http/ngx_http_v2_module.html#http2_max_concurrent_streams
[15:31:16] in chrome net-internals, I can see our 128 (it's advertised to the client).
google's servers tend to use "100" instead
[15:31:29] reducing that number might eliminate some edge cases here
[15:31:55] we're telling the client they can start up 128x requests in parallel over a single http/2 connection
[15:32:07] and on the backend side of nginx, it's all HTTP/1.0 with single conn-per-req
[15:32:13] so it could cause high parallelism on that side
[15:33:38] and this category page I'm reloading to test, it makes 256 total reqs heh
[15:33:54] so it would be something that would push that edge case probably, unlike most pages
[15:40:39] this could still of course be a bug in nginx's http2 code where they haven't done a lot of testing of high-concurrency pages?
[15:40:51] in which case moving it from 128 to 64 would make it trip more and be more obvious
[15:40:57] either way, seems like a win to drop this number
[15:44:46] I managed to get a repro, just a single ERR_CONNECTION_RESET on https://commons.wikimedia.org/wiki/Category:Art
[15:45:00] heh
[15:45:12] how much lag from repro to typing here?
[15:45:22] < 1 minute
[15:45:34] I had just fired off a big salt to run puppet agent and reload nginx
[15:45:44] nginx reloads aren't supposed to trip up running connections though :P
[15:45:57] just odd timing heh
[15:50:01] it seems like total load time on a ctrl+shift+r has decreased slightly with the drop in http2_max_concurrent_streams
[15:50:11] no repro of any kind yet for me since that landed
[15:53:10] big repro here, dozens of ERR_CONNECTION_RESETs :(
[15:53:20] but no bad mac anymore?
[15:53:35] nope
[15:54:30] does your net-internals show the 64 limit on the current connection?
[15:54:36] (if not, it may still have been open/idle from before)
[15:54:59] it does
[15:55:53] might be a coincidence, but lowering the max seems to have made the issue much easier for me to reproduce
[15:55:53] did it get any easier to reproduce, or still rare?
[15:55:59] hmmm ok
[15:56:02] much easier yeah
[15:56:23] well, I could bump it to something crazy like 300
[15:56:53] maybe the repro requires (a) a page to actually attempt to reach the http parallelism max + (b) some kind of network timing (maybe easier for you because you're closer to esams)
[15:57:06] and in that case it maybe is an nginx http2 implementation issue
[15:57:24] note that till ~ half an hour ago I didn't get a single repro, now the issue shows up consistently at every page reload
[15:57:38] half an hour ago?
[15:58:11] ops meeting!
[15:59:16] the concurrency change didn't go in until ~ :45
[15:59:40] but if you have consistent repro now
[15:59:50] ema: which upload cache do you hit (frontend in xcache)?
[16:02:37] https://forum.nginx.org/read.php?2,265430,265430#msg-265430
[16:04:00] ^ sounds eerily familiar :P
[16:04:59] ema: if you can still repro easy, I can go play with the param just on your frontend. maybe try bumping it to 300 and see if that makes it go away...
[16:05:45] 10Traffic, 06Operations, 10ops-eqiad: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738539 (10faidon)
[16:05:49] (or drop it even further and confirm it makes things even worse, if that's possible)
[16:06:26] 10Traffic, 06Operations, 10ops-eqiad: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738539 (10BBlack) cp3021 is not in service. it's meant to be decom/spare, but waiting on the decom part...
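As a sanity check on what a given frontend actually advertises for SETTINGS_MAX_CONCURRENT_STREAMS (the value set by nginx's http2_max_concurrent_streams directive discussed above), something like the following should work, assuming the nghttp client from nghttp2 is available; the URL is just an example and the verbose output format varies a bit between versions:

nghttp -nv https://upload.wikimedia.org/ 2>&1 | grep -i max_concurrent_streams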
[16:06:53] 10Traffic, 06Operations, 10ops-esams: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738554 (10faidon)
[16:14:04] 10Traffic, 06Operations, 10ops-esams: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738592 (10BBlack) Found decom ticket: T130883
[16:21:56] bblack: no repros at the moment
[16:22:41] bleh
[16:24:49] bblack: however I had to move to another room (other access point, possibly other public IP I don't know now) because of the ops meeting
[16:25:02] the connection started lagging badly
[16:25:48] if (h2c->processing >= h2scf->concurrent_streams) {
[16:25:48] ngx_log_error(NGX_LOG_INFO, h2c->connection->log, 0,
[16:25:48] "concurrent streams exceeded %ui", h2c->processing);
[16:25:51] status = NGX_HTTP_V2_REFUSED_STREAM;
[16:25:52] ^ in nginx code
[16:25:55] goto rst_stream;
[16:26:04] ha
[16:26:06] I wonder if rst_stream means rst the whole connection?
[16:26:17] also, we wouldn't show log_info in our logs, generally too noisy
[16:26:31] guess which tool I'm gonna propose using now?
[16:27:19] systemtap :P
[16:27:25] :)
[16:27:31] so, rst_stream is supposed to only kill the one stream, not the whole TCP
[16:27:40] but it can fail and kill the whole TCP instead
[16:28:05] which is consistent with what I've seen, given that I've seen a single image failing as well as dozens of them
[16:29:34] cp3044 is the frontend I'm hitting at the moment (and no repros) for the record
[16:29:55] it would be nice to have separate settings for the advertised limit and the limit at which it does RST_STREAM
[16:30:03] e.g. advertise 100, but don't take action until 200
[16:30:32] rather than just starting the resets immediately, yeah
[16:30:40] another apparent generally-related issue: the HTTP2 RFC notes that this is initially "unlimited" until the server sends the client a SETTINGS frame that sets it explicitly
[16:30:55] it could be the client fires up >max_concurrent before it even gets the settings frame
[16:31:35] (the RFC also recommends a minimum of 100 for this value)
[16:32:42] in the nginx forum thread you mentioned above, they talk about "users who have good bandiwdth"
[16:32:49] yeah
[16:32:55] I certainly wasn't a user with "good bandiwdth" during the repros
[16:33:26] (and I've misspelled bandwidth twice the same way)
[16:33:28] it might be reasonable, as a client implementor, to assume 100x concurrency until the settings frame arrives, which might exacerbate things at the current "64"
[16:33:50] also, greater latency probably == larger time window to fire off concurrent requests before the first server settings frame arrives...
[16:34:11] oooh
[16:34:38] but I'm way over in texas and didn't make it easier for me to repro
[16:35:04] ~146ms RTT for me
[16:35:26] yeah but at a certain point I had ~8 seconds ping to upload downstairs...
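Since the "concurrent streams exceeded" message quoted above is only logged at info level, one low-tech alternative to a systemtap probe would be bumping error_log verbosity on a single test host while someone reproduces; a sketch only, with an assumed log path rather than whatever production actually uses:

# temporarily set something like:  error_log /var/log/nginx/error.log info;
nginx -t && systemctl reload nginx
# then watch for http/2 stream refusals during repro attempts
tail -f /var/log/nginx/error.log | grep --line-buffered 'concurrent streams exceeded'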
[16:35:32] heh
[16:35:47] true, I'm on very good wifi and a decent local wired connection after that
[16:35:48] if nobody fixes anything I'll try to repro after the meeting :)
[16:36:14] (I'm like 15 feet from an unladen 5GHz N-router with very few clients, and then 75Mbps VDSL from ATT from there)
[16:40:07] I got a repro
[16:41:12] good network conditions now though
[16:41:51] looking at chrome's network console I see a "black" failed and then a bunch of red failed
[16:43:21] so basically a request got response headers back but no body
[16:44:53] all the subsequent requests only show "provisional headers" (referer and UA)
[16:46:23] we could of course just shut off HTTP/2 and see if that makes reports die off
[16:46:43] but I fear it could be something more subtle and http/2 is just an antagonist that makes it easier to repro
[16:48:17] bblack: when you managed to reproduce, did you also notice one request kind of half-failing (resp headers OK but no body) and then the next ones failing entirely?
[16:50:00] I don't know that I was looking for the right thing there
[17:01:04] ema: I'm going to try going the other direction, and set the limit to 512, as an experiment (http2 concurrency)
[17:02:42] ok
[17:02:59] I'll go downstairs and try to repro again with the bad AP
[17:07:26] bblack: re: "Either that or move towards using a purer inbound TLS proxy that operates at the PROXY or TCP level with some solution for client IP forwarding and SNI forwarding, allowing varnish to handle all HTTP protocol details"
[17:07:45] bblack: nginx can be a plain TCP proxy as well
[17:07:55] that's the "stream" module AIUI
[17:08:03] yeah that's been on my mind too
[17:08:23] the stream_ssl and http_ssl modules aren't quite in sync in code though, there's some review to do on that idea
[17:08:36] (or in config options either, I think)
[17:09:28] and we'd lose some of the things we're doing now at the nginx http level (setting informative headers that varnish picks up: X-Client-IP (maybe solvable via PROXY protocol?), and then X-Connection-Properties, etc)
[17:10:03] yeah
[17:10:18] it's really fucking annoying that varnish can't do TLS
[17:10:25] :)
[17:10:48] it's also annoying that the PROXY protocol isn't a little more extensible, to send/receive extra custom-defined fields
[17:12:07] but: we could do something where X-C-P and such are logged by nginx to a custom log, and we move our stats gathering for that to there
[17:12:46] we'd lose the ability we have to easily correlate them in varnishlog, though (e.g. see that the funny requests for /foo/ are coming from clients with certain connection properties, which has been an interesting investigative hint in the past)
[17:14:41] heh
[17:14:54] with the 512 setting for http2 concurrency, repro has apparently become very easy for me now
[17:14:54] 64 bytes from 8.8.8.8: icmp_seq=33 ttl=47 time=3381 ms
[17:14:59] much better :P
[17:15:47] well, it was
[17:15:47] I got 2/2 failures right off the bat
[17:15:49] now I can't repro again
[17:15:56] (also, chrome seems to cap that 512 value at 256)
[17:17:34] I can repro consistently, bad network conditions and I'm hashed to cp3046 now
[17:18:27] it's still fairly easy for me to get connection resets
[17:18:33] still no bad mac like before though
[17:18:42] it seems like 512 is much worse than 64, from my pov
[17:21:10] so let's go a little more extreme and move it back to 32
[17:21:11] ?
[17:21:19] +1
[17:21:21] well I donno, there's issues potentially in both directions
[17:21:41] it could be this isn't actually the issue, but greater concurrency exacerbates it, and values <100 may cause new unique issues on top of the other one
[17:21:58] the current state is quite bad for me now, most of the images fail to load
[17:22:04] yeah me too
[17:22:47] let's try 32 anyways, and if that doesn't placate things, move back to 128 default and move on
[17:23:05] ok
[17:28:47] that's deployed now
[17:29:04] huge difference
[17:29:21] my repro has gone away too
[17:29:23] so far
[17:29:36] I can still repro, but a much smaller number of images fail to load now
[17:30:00] well, that's still a repro
[17:30:15] and fits the theory that it's an unrelated (to http2) connection-reset issue
[17:30:31] less http2 concurrency == fewer images affected by one TCP reset
[17:30:40] right
[17:32:38] I'm going to reset this back to defaults
[17:34:27] at this point I think we have to try package downgrade again
[17:34:41] the only evidence we have that it's not the new packages was the lack of effect while testing the integration issue
[17:34:53] which might be unrelated
[17:35:02] which may have had a separate cause, or could have been faulty timing/conditions on testing package rollback
[17:36:33] bblack: we could first try to downgrade only the two hosts we're currently being hashed to
[17:37:10] which is yours again?
[17:37:13] cp3046
[17:37:58] heh now I cannot repro anymore
[17:39:17] I just upgraded
[17:39:20] well, downgraded
[17:39:32] but just now, not back at :37 when you said that...
[17:40:23] eh timing is everything, maybe
[17:40:59] network conditions are great here now
[17:41:10] oh I hit the wrong host for you anyways
[17:41:22] I did cp3034 + cp3036, you're on cp3046 :)
[17:41:47] nice, now we know it's *definitely* not the downgrade that improved the situation :)
[17:42:20] can you try to reach a repro again, before I downgrade yours?
[17:42:25] trying
[17:43:20] I'm not getting any repro on my downgraded endpoint, yet
[17:43:24] repro!
[17:44:12] consistent-ish? or just one-off again?
[17:44:17] one-off for now
[17:44:28] ok downgrading cp3046
[17:44:45] I'm not sure how you ensure the old http2 conn is closed, either
[17:44:47] other than restart browser
[17:44:57] or get a RST :P
[17:44:59] (nginx is keeping old connection alive on the old copy of the daemon)
[17:48:12] I've added a 1000ms delay here and no repros (but crappy IRC)
[17:48:59] I can't repro either, still
[17:49:25] thinking ahead while we wait on a repro
[17:49:56] if the downgrade fixes it, we've got a connection reset issue with the newer nginx+openssl, which is apparently triggerable by pages that fetch a crapload of stuff over a single http2 conn
[17:50:12] 256 requests, 1.4MB is what chrome console says (that's between text+upload for that page, but most of it's upload)
[17:52:01] openssl code is new in general. and then specifically, we've switched chapoly implementation from cloudflare's to openssl's, and x25519 key exchange.
[17:52:08] nginx's changes for 1.11.4 (vs .3) aren't big or related, though
[17:52:22] and our browsers probably aren't using chapoly because we have AESNI. mine for sure anyways
[17:52:54] and I disabled x25519 on esams upload yesterday and others could still repro
[17:54:21] still, there could be some subtle bug on either side of that fence that only gets hit with specific connection profiles (in terms of xfer and packet spacing, etc... unique to http spamming images?)
[17:54:28] s/http/http2/ above
[17:58:14] the only real http2 fix since 1.11.4 in master is this (but seems unrelated): http://hg.nginx.org/nginx/rev/5e95b9fb33b7
[17:58:31] there was a spam of http2 fixups just before 1.11.3 was released, but we've been on 1.11.3 for a while (it's what I downgraded to)
[18:00:11] still no repro?
[18:00:20] nope
[18:01:06] and latency-wise I'm pretty much in the same situation as before thanks to tc
[18:01:17] I can't yet repro either
[18:02:14] checking whether all the caches still have the old package archived (since I'm installing from /var/cache/apt)
[18:02:29] all but 2x
[18:02:42] not awful, I can copy manually to those two
[18:02:56] if somebody leaves their linux laptop unlocked as root and walks away I suggest this command, possibly also in /etc/rc.local for added fun:
[18:02:59] tc qdisc add dev wlan0 root netem delay 2000ms 2000ms distribution normal
[18:03:27] :)
[18:05:45] repro!
[18:06:33] oh wait it says "canceled" in chrome's console
[18:07:10] yeah that's something else from spamming reloads or something, I think, if no other symptom
[18:11:31] I think all downgraded now, confirming success with salt...
[18:12:19] yes, all downgraded
[18:13:02] ok, we could perhaps try upgrading cp3046 once more to see if I can get a repro given that with the older version I didn't get a single one
[18:13:37] let's just try harder to get a repro from anyone in the downgraded state first
[18:13:46] or else we'll get a reporter on phab not looking at x-cache, etc
[18:13:55] sounds good
[18:15:26] if this kills it, then yeah we can bisect a bit
[18:15:50] try going back to nginx-1.11.3 + openssl-1.1 (with a few extra backported fixups), or try doing nginx-1.11.4 + openssl-1.0.2
[18:16:09] (on one cache host for repro or whatever, maybe)
[18:18:01] in the state they're in now, the dpkg state is sane. they show 8 packages needing upgrade (nginx stuff), but don't auto-upgrade because some deps are new (requires dist-upgrade)
[18:18:09] so it's a fairly safe state to sit in for a while
[18:27:33] oh we don't have nginx-full-dbg 1.11.4-1+wmf3, only +wmf2
[18:27:49] no
[18:28:59] I'm not sure why, either, but it didn't build it
[18:29:41] I think the only change from the +wmf2 build to +wmf3 was the last lua fixups, too
[18:29:41] ok, then the stap probe becomes a problem for another day :)
[18:30:32] see you
[18:30:53] see you!
[18:31:42] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2739171 (10BBlack) Can anyone still repro this issue? We've done a bunch of debugging tod...
[18:33:32] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2739176 (10Paladox) Nope carn't reproduce it yet.
[18:57:16] 10Traffic, 06Operations: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976#2738250 (10Dzahn) fwiw, i saw these in Icinga web-ui but icinga-wm was apparently not talking about them on IRC, but i did not see "disabled notifications" or ACKs in the web...
[19:00:55] https://trac.nginx.org/nginx/ticket/891
[19:01:21] ^ 9 month old ticket, but sounds eerily-related.
it could be that 1.11.4 and/or changes to openssl-1.1 io stuff just subtly change conditions and make an issue like that appear for us
[19:03:12] https://trac.nginx.org/nginx/ticket/1026 has some interesting things to say throughout, too. I don't think it's exactly our problem, but an interesting read nonetheless
[19:07:13] proxy_request_buffering off; is a possibility, based on what's in there
[19:07:28] it's remotely possible our disabling of the new autocorking stuff in sysctl is at fault, too
[19:07:38] but you'd think there's people trying this software combo on kernels too old for autocork, too
[19:08:45] (to be clear above: we already do "proxy_request_buffering off;", but it may be we need the default "on" behavior to mitigate this?)
[19:09:44] also, that and the use of http/1.1 (as opposed to 1.0, but still without persistence) are both under varnish_version4 conditional in our nginx config and relatively-new
[19:10:06] too old to be the only cause here, but again, could be a subtle bad interaction of non-standard tuning/config that worked for us in the past, but doesn't with the new libs/packages
[21:13:45] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2739847 (10CRoslof) If @Naveenpf or #operations is still interested in pursuing this task's specific request, I'd appreciate an answer to my clari...
[22:04:16] hello, I am one of the authors of https://tools.wmflabs.org/pageviews and many users in South America are saying data later than October 16 is not being shown
[22:04:20] proxy_request_buffering seems like it can't be it, it's only about POST bodies, and these aren't POST reqs in the triggering case...
[22:04:22] bd808 said people in that area might be hitting the esams varnish servers, and that you kind folks might be able to tell me how to cURL that server?
[22:05:26] I'm dumb. South America would hit codfw probably/maybe
[22:05:42] Portugal is not in europe :)
[22:06:05] oh oops, no I'm dumb
[22:06:09] wait... you said portugal right musikanimal ?
[22:06:10] it is in Europe
[22:06:13] either way
[22:06:16] yes
[22:06:27] is this UI on tools/pageviews pulling data from some prod endpoint?
[22:06:33] dang, alright... so I've got a handful of users in Portugal and one in Brazil
[22:06:41] yes
[22:06:48] what is that endpoint's URL?
[22:06:50] bblack: yes. it hits https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/...
[22:07:00] I can't test with a ... :P
[22:07:10] there are other API requests, too, but that is the only date-specific one
[22:07:11] https://tools.wmflabs.org/pageviews/?project=pt.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Wikip%C3%A9dia:P%C3%A1gina_principal
[22:07:23] grr not the right thing
[22:07:43] the only other API endpoint it hits is prop=info
[22:08:03] bblack: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/pt.wikipedia/all-access/user/Wikip%C3%A9dia%3AP%C3%A1gina_principal/daily/2016100400/2016102300
[22:08:24] * bd808 slinks out and lets musikanimal take it from here
[22:08:30] hehe, thanks bd808 !
[22:09:18] so Brazil and Portugal, but it's not the language because I can change that manually and everything works
[22:09:27] and one user was viewing it in Italian
[22:09:53] I was convinced it was location-specific until the report of Brazil came in...
[22:10:59] okay... it's not the RESTBase API.
You may disregard any concerns with that
[22:11:20] let me refrain from polluting this IRC channel until I know more... thanks!
[22:14:03] I've checked, the caches are outputting the same data everywhere for that query
[22:14:13] (bd808's link)
[22:14:52] the output includes data up through 20161023, and checking every single cache box individually, they all output the same data according to sha1sum
[22:16:05] for future hackers' reference, how to check something like that across all the cache frontends:
[22:16:08] salt -b 100 -v -t 5 -G cluster:cache_text cmd.run 'curl -H "X-Forwarded-Proto: https" -H "Host: wikimedia.org" http://127.0.0.1/api/rest_v1/metrics/pageviews/per-article/pt.wikipedia/all-access/user/Wikip%C3%A9dia%3AP%C3%A1gina_principal/daily/2016100400/2016102300 2>/dev/null|sha1sum'
[22:16:28] (and then if the sums agree, go execute it with full output on a single box and confirm what the output looks like and same sum)
[22:23:26] musikanimal: did you check UA and OS?
[22:23:49] yes they've tried multiple browsers, etc
[22:23:58] it's something in my code... even though I didn't change anything
[22:24:00] maybe our tests failed due to something different
[22:24:01] :/
[22:59:39] still racking my brain digging through tickets, source code, google, not finding any obvious explanation for the TCP resets
[23:00:12] it's interesting that nobody else is reporting it, but I doubt very many are on nginx-1.11 + openssl-1.1 yet. and this didn't really affect text, just the big multi image transfers on upload with http/2
[23:00:58] probably need a more-reliable repro. it seems likely an extremely large image thumbnails category page would be better
[23:01:58] is there a request parameter to get a category thumbnail page to display more than 200 at a time?
[23:03:23] I'm re-upgrading just cp3034 to experiment more
[23:04:00] (because it's the one my requests map to)
[23:13:11] fuck ipvs sh + pybal :P
[23:13:12] if it ever fails a healthcheck, I get scrambled to a different frontend persistently :P
[23:13:52] ah? I thought you'd be mapped back when the backend un-failed the healthchecks
[23:15:30] nope!
[23:15:40] I think it has to do with how pybal adds and removes servers from ipvs
[23:16:10] healthcheck dies it just deletes one entry, healthcheck succeeds it adds one (now at the end of some internal list)
[23:16:28] so you get 3x different hash orderings: the original, the one with 1x server down, and the new one with them all back up
[23:16:34] unless the one that went down happened to already be last in the list
[23:17:16] ema mentioned Category:Art earlier, it is much easier to repro with
[23:19:28] also Special:NewFiles has plenty of thumbs, and you can limit=250 or limit=500
[23:21:58] oh that's nice :)
[23:22:01] very easy repro with that
[23:23:25] hehe it's been my "are we ok?"
test case for the codfw switchover, works well enough
[23:23:52] downside is not all files you'd necessarily want to see, includes spam
[23:30:35] experiments undone
[23:30:40] (back to all on older packages/settings)
[23:31:02] I've toyed with every option or tunable that seems remotely relevant, nothing seems to stop the problem reproducing
[23:31:43] except downgrading of course
[23:32:13] I guess next step tomorrow will be splitting up the changes
[23:32:34] (building an alternate 1.11.4 that's on openssl-1.0.2 for instance, or vice-versa building a 1.11.3 that's on openssl-1.1)
[23:43:32] incidentally, I was able to repro with http/1.1 too
[23:43:33] (http2 disabled on the cache I was hitting)
[23:43:37] only with a very large newfiles page, with a few retries
[23:46:20] it'd be nice to have a cp1008/pinkunicorn-like environment for cache_upload for this. see if it happens isolated from other traffic, and debug harder with fewer public conns in the way
[23:49:30] hmmm one thing I didn't think to try earlier was downgrading the protocol for nginx->varnish-fe to http/1.0
[23:49:40] not that it's an answer, but it might be a pointer
[23:50:27] (to some kind of issue with connection semantics. backend conn:close within the fetch for 1x stream closing the whole http2 conn)
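Along the same lines as the salt+curl check further up, a small sketch for pinning a repro attempt to one frontend and comparing protocols; the address and URL are placeholders (substitute a real frontend IP and a large thumbnail URL), and --http2 needs a curl built against nghttp2:

fe_ip=10.0.0.1                                           # placeholder frontend address
url='https://upload.wikimedia.org/SOME/LARGE/THUMB.jpg'  # placeholder object
for proto in --http1.1 --http2; do
  echo "== ${proto} =="
  curl -s -o /dev/null -D - "${proto}" \
    --resolve upload.wikimedia.org:443:"${fe_ip}" \
    "${url}" | grep -iE '^(HTTP/|x-cache)'
done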