[00:00:09] http://hg.nginx.org/nginx/rev/e4c1f5b32868
[00:00:16] http://hg.nginx.org/nginx/rev/0fa883e92895
[00:00:34] ^ both of those are outlier possibilities too, for nginx-1.11.4 changes that could have some subtle related impact
[00:01:14] the latter refs: https://trac.nginx.org/nginx/ticket/1037
[00:03:13] (which doesn't sound like the right problem, but it's notable that it involves a lot of the same moving parts, and that the ticket says they weren't really fixing an nginx bug so much as working around bad backend behavior.... maybe the workaround broke us)
[01:28:29] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740896 (10Naveenpf) Hi @CRoslof, Please find my answer inline. Thank you naveenpf >>! In T144508#2644566, @CRoslof wrote: > I'm not sure I...
[01:35:14] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740901 (10Naveenpf) @Aklapper Can you please change title to.... add new IP address ? We have changed to new server for better performance. Our...
[01:37:21] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2740905 (10Dzahn)
[01:37:40] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508#2602033 (10Dzahn) >>! In T144508#2740901, @Naveenpf wrote: > @Aklapper Can you please change title to.... add new IP address ? Done
[07:23:04] ema, bblack: FYI, not sure if you've seen the sensationalist "SSL death message" advisory for openssl. it's already fixed; it only concerns distros which ship isolated security bugfixes rather than the latest bugfix releases, where the patch has already landed. also this probably doesn't affect "no-ssl3" builds anyway
[08:33:19] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2736599 (10Nenntmichruhigip) >>! In T148917#2736954, @Aklapper wrote: > @Paladox: Please s...
[08:52:52] so no more repros yet
[08:54:32] FWIW I didn't see any data consistency check alerts during the past hours
[08:54:50] (assuming that it is related, which is still not clear)
[08:59:11] elukey: how often did you receive the alerts in the past days?
[08:59:39] does it look like the downgrades correlate time-wise with the alerts not showing up anymore?
[09:05:40] so we have warnings and errors - the former do not stop the "refine" hadoop jobs (basically data transformations), while the latter do
[09:06:03] and I was seeing the errors more or less during early morning UTC time
[09:06:21] meanwhile warnings were spread throughout the day
[09:06:39] I correlated the ERRORs mostly with visits from the fb crawler
[09:07:04] the last big one was on sat 21st
[10:53:12] ema: thanks for the debug patch :)
[10:53:44] I was thinking about all the related things last night. I think I'm going to build on that and try to back out related diffs from nginx-1.11.4
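Going back to the refine warn/error pattern elukey describes above (ERRORs clustered in early-morning UTC, warnings spread through the day): a rough sketch of the hourly bucketing step. The log path and line format here are assumptions for illustration, not the real refine job output, so the regex would need adjusting to whatever the jobs actually emit.

```python
#!/usr/bin/env python3
"""Bucket refine-job WARN/ERROR log lines by hour (UTC) -- sketch only.

Reads log lines from stdin. The "2016-10-22T05:13:02 ERROR ..." timestamp
shape is an assumption; the real jobs may log differently. Correlating the
ERROR hours with crawler user-agents would be a separate join against
webrequest data and is not shown here.
"""
import re
import sys
from collections import Counter

# assumed line shape: ISO timestamp, whitespace, level, rest of message
LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T(\d{2}):\d{2}:\d{2}\s+(WARN|ERROR)\b")

warns, errors = Counter(), Counter()
for line in sys.stdin:
    m = LINE_RE.match(line)
    if not m:
        continue
    hour, level = m.groups()
    (errors if level == "ERROR" else warns)[hour] += 1

print("hour  WARN  ERROR")
for hour in sorted(set(warns) | set(errors)):
    print("{}    {:4d}  {:5d}".format(hour, warns[hour], errors[hour]))
```

Run against a log dump (e.g. `python3 bucket.py < refine.log`); if ERRORs really pile up in one or two UTC hours while WARNs stay flat, that is consistent with the crawler-burst correlation described above.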
[11:50:57] I think there is a debhelper backport in jessie-backports
[11:51:08] well either way
[11:51:17] and we should probably use that in the future, and add a dbgsym section to our repos or something
[11:51:20] until then, that patch is great :)
[11:51:58] I'm taking a shotgun approach to potentially reduce testing cycles
[11:52:16] which one is that?
[11:52:16] reverting the one most-likely offending nginx commit from 1.11.3->1.11.4, and adding in 3x post-1.11.4 bugfixes
[11:52:25] all in one go, for a new package to test
[11:52:28] interesting :)
[11:52:30] which commit is that?
[11:52:37] is this for the thumbnail issue?
[11:52:45] http://hg.nginx.org/nginx/rev/0fa883e92895
[11:52:55] ^ is the most-likely offender, if the problem lies in nginx at all
[11:53:13] and yes, for the thumbnail issue
[11:53:16] ok :)
[11:57:27] so, building that
[11:57:39] at least we have easier repro now, I feel pretty confident it's testable
[12:01:22] oh we do
[12:01:27] 10Traffic, 06Multimedia, 06Operations, 15User-Josve05a, 15User-Urbanecm: Thumbnails failing to render sporadically (ERR_CONNECTION_CLOSED or ERR_SSL_BAD_RECORD_MAC_ALERT) - https://phabricator.wikimedia.org/T148917#2741442 (10BBlack) Content-Encoding issues are a separate thing unrelated to this ticket....
[12:01:28] sorry, I haven't been following too closely
[12:04:09] TL;DR from the past 24h: it is the package upgrades that cause it (Friday we didn't think it was, but it clearly is now with more-stringent testing). And it's easier to reproduce the bigger the thumbnail page is, so Special:NewFiles?limit=500 is a better test. We've done multiple up/down-grade cycles on various servers and confirmed it's the packages that cause it.
[12:04:53] we just don't know *what* about the packages caused it. there's the update from nginx-1.11.3->nginx-1.11.4 in there, and also the OpenSSL 1.0->1.1 upgrade
[12:04:57] going after nginx first
[12:06:00] oh that's awesome
[12:18:45] <_joe_> some users are reporting broken ssl on OTRS
[12:19:07] I'm not surprised
[12:19:16] I think there's still lingering issues with globalsign, just not widespread
[12:19:43] <_joe_> bblack: see #-operations
[12:20:20] need better data on dates/times/errors, and yes OS/UA
[12:20:28] I think Sherry's case is likely the same though
[12:21:02] Sherry was still complaining to me as of yesterday at least. Her Safari 10 has intermittent (but consistently reproducible) SSL errors with us too, but FF on the same box doesn't
[12:22:29] are we still on R3 now?
[12:22:56] yes
[12:23:19] ok
[12:23:32] R3 seems safer than R1
[12:23:46] I don't know why hers are intermittent, either. Says reloading a few times sometimes works, sometimes doesn't.
[12:23:58] but without more corroborating cases, it's hard to say anything definitive about her case.
[12:24:40] getting off of GS (at least for now) would erase a lot of question-marks though
[12:24:57] ok
[12:32:18] v
[12:33:36] just a random character dropped there by mistake ^ :)
[12:34:42] heh
[12:34:57] well, I just went through an attempted upgrade (on just my terminator in esams) to the new nginx package
[12:35:21] sequence: no repro on a few tries, upgrade packages, repro easily, downgrade packages again, still repro
[12:35:34] I'm waiting for all the "shutting down" nginx procs to drain JIC, though
[12:35:46] maybe somehow I'm stuck on an old (upgraded-to-faulty) proc
[12:36:43] yeah that must have been it. Once the last relevant process died, I can't repro anymore
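A minimal sketch of the kind of repro counting described above. The real reproduction was a browser loading Special:NewFiles?limit=500 (lots of thumbnails per page); this just pulls one large response repeatedly through a single frontend and tallies connections that die mid-stream. The FRONTEND hostname and the attempt count are illustrative assumptions, not the actual hosts or values used.

```python
#!/usr/bin/env python3
"""Crude connection-survival counter for the T148917 repro -- sketch only.

Fetches a thumbnail-heavy page repeatedly through one nginx frontend and
counts attempts that fail mid-response (reset, TLS error, timeout).
FRONTEND is a hypothetical host under test; SITE/PATH are illustrative.
"""
import socket
import ssl

FRONTEND = "cp3030.esams.wmnet"              # hypothetical frontend under test
SITE = "en.wikipedia.org"                    # name used for SNI + Host header
PATH = "/wiki/Special:NewFiles?limit=500"    # bigger page == easier repro
ATTEMPTS = 50

ctx = ssl.create_default_context()

def fetch_once():
    """Return bytes read; raises if the connection dies mid-stream."""
    raw = socket.create_connection((FRONTEND, 443), timeout=30)
    with ctx.wrap_socket(raw, server_hostname=SITE) as tls:
        req = ("GET {} HTTP/1.1\r\n"
               "Host: {}\r\n"
               "User-Agent: t148917-repro-sketch\r\n"
               "Connection: close\r\n\r\n").format(PATH, SITE)
        tls.sendall(req.encode("ascii"))
        total = 0
        while True:
            chunk = tls.recv(65536)
            if not chunk:
                return total
            total += len(chunk)

failures = 0
for i in range(ATTEMPTS):
    try:
        nbytes = fetch_once()
        print("attempt {:2d}: ok, {} bytes".format(i + 1, nbytes))
    except OSError as exc:  # covers ssl.SSLError, ConnectionResetError, timeouts
        failures += 1
        print("attempt {:2d}: FAILED ({})".format(i + 1, exc))

print("{}/{} attempts failed".format(failures, ATTEMPTS))
```

Running the same loop before and after each package up/downgrade gives a rough pass/fail signal per build, though it's no substitute for the browser-based repro against the thumbnails themselves.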
[12:37:11] will keep trying though, JIC. but AFAIK nobody's been able to repro on the older packages (when we're sure it's being processed by a daemon from the older package)
[12:40:10] I got a screencap of the OTRS report in question. it doesn't seem like a very strong case for GS issues.
[12:41:55] so, my 1.11.4+wmf4 attempt is still a fail. We do have debug packages at least, if we want to go down that route
[12:42:14] (systemtap and try to find the codepath that's closing the connection to a specific client IP we're testing from?)
[12:43:22] another thing we can try is bundling up the same basic 1.11.4 package but built against openssl-1.0.2
[12:50:19] (or go back to 1.11.3+openssl-1.1. But at this point, I don't think there's much hope of that being better. most likely it's either in openssl-1.1 itself, or it's a general ongoing bug in nginx's use of openssl-1.1)
[12:52:36] oh, I guess I should also try re-upgrading and doing the HTTP/1.0 thing, since it was never tested yesterday.
[13:13:37] http/1.0 thing doesn't work either
[13:30:48] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2741637 (10ori) I suspect @elukey is right and the problem went away due to some change in the interval...
[13:31:16] ok, same nginx-1.11.4 baseline (including the same backports/fixups as the last repro), but built/running against openssl-1.0.2j (confirmed via build output + ldd): no repro
[13:31:29] so it's not an openssl-version-agnostic regression in nginx
[13:31:46] it's either a general issue with nginx+openssl-1.1, or a general issue in openssl-1.1
[13:33:10] will try to repro harder for a bit to confirm
[13:34:34] also, past testing has confirmed we can strip away most of our special tuning and it still happens, so I doubt it's specific to nginx+openssl-1.1+our_special_config (well, the not-strictly-functionally-necessary tuning bits anyway. could be specific to revproxying into varnish in general)
[13:38:13] and I did try disabling x25519 over the weekend with others' repro, and my browser's negotiating aesgcm, so those are unlikely too
[13:38:24] (the new crypto elements)
[13:40:32] 10Traffic, 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2741647 (10elukey) 05Open>03Resolved From an IRC conversation with BBlack we decided to close this t...
[13:42:53] so building against openssl-1.0.2j fixes it?
[13:43:18] yes
[13:43:35] but that doesn't necessarily implicate openssl, either. it could be that nginx needs further adaptation to changes in openssl, too.
[13:43:48] right, great find though :)
[13:43:59] https://github.com/openssl/openssl/issues/1774
[13:44:39] I don't think the openssl 1.1 patches for nginx have seen much real-world testing; even Debian (which is pushing to move to 1.1) only has packages in experimental...
[13:44:40] ^ less than 24h old, someone's found a bug in bn_mul_mont
[13:45:13] montgomery multiplication issue on AVX2 CPUs; in the repro test in that ticket, it only shows with certain RSA keys (data-dependent)
[13:45:28] but I think if bn_mul_mont is borked for at least some inputs, that potentially has fallout other than with RSA
[13:45:38] (I think it's a primitive used by EC code too)
[13:51:00] still, it seems an odd fallout if that's it
[13:51:27] you'd think it would fail at initial key exchange or server verification, not midway through the symmetric crypto
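As a side note on the "confirmed via build output + ldd" step above, and the earlier worry about lingering "shutting down" workers from an old package: a small sketch that checks which libssl/libcrypto the running nginx processes actually have mapped, straight from /proc. The process-name match and paths are assumptions; it would need to run with enough privilege on the cache host itself.

```python
#!/usr/bin/env python3
"""Report libssl/libcrypto mappings of running nginx processes -- sketch.

ldd on the installed binary only says what a *new* process would load;
after an up/downgrade cycle, old "shutting down" workers can still be
serving with the previous library mapped (it shows up as "(deleted)").
Linux-only; needs permission to read other users' /proc entries.
"""
import os

def nginx_pids():
    """Yield pids whose comm is exactly 'nginx'."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/{}/comm".format(entry)) as f:
                if f.read().strip() == "nginx":
                    yield int(entry)
        except OSError:
            continue  # process exited while we were looking

def mapped_ssl_libs(pid):
    """Return the set of libssl/libcrypto pathnames mapped by pid."""
    libs = set()
    try:
        with open("/proc/{}/maps".format(pid)) as f:
            for line in f:
                if "libssl" in line or "libcrypto" in line:
                    # maps fields: addr perms offset dev inode pathname...
                    libs.add(line.split(None, 5)[-1].strip())
    except OSError:
        pass
    return libs

for pid in sorted(nginx_pids()):
    for lib in sorted(mapped_ssl_libs(pid)):
        print("{:>7}  {}".format(pid, lib))
```

If any worker still shows the old library path (or a "(deleted)" mapping), repro results from that host can't be attributed to the package that's currently installed.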
[13:54:41] did openssl upstream write any kind of "how to port forward to 1.1" guide for server implementors?
[13:54:54] something like that might have hints at something missed in nginx
[13:55:58] bblack: this is their entire porting documentation... : https://wiki.openssl.org/index.php/1.1_API_Changes
[14:17:22] 10Wikimedia-Apache-configuration, 06Operations: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2741709 (10elukey) 05Open>03Resolved a:03elukey Will keep an eye on the logstash dashboard but everything looks good.
[14:36:01] it's possible there's something there with rbio/wbio, but it's hard to say
[14:36:28] it might be interesting (not as a real target, but as another way to test and reduce the problem surface) to build this nginx against libressl and/or boringssl and try to repro
[14:54:13] (it's also possible some of the changes post-1.11.4 that I haven't backported are relevant)
[14:56:51] another possibility is that it's the cloudflare dynamic record sizing patch. I've tested disabling the functionality of the patch via config, but I haven't tested removing the patch completely. There's at least one effect from it that can't be turned off.
[16:32:59] ema: regardless of all the other moving pieces right now (and there's a lot, ugh), we need to get the reboots for the kernels going too
[16:33:30] I can at least rip through some of the lower-hanging fruit today (maps, misc, lvs:secondary); we'll see where we are after that.
[17:07:31] 10Wikimedia-Apache-configuration, 06Operations, 06Performance-Team, 07HHVM, and 2 others: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2742105 (10hashar) Based on https://logstash.wikimedia.org/app/kibana...
[23:06:41] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2731084 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252306_bblack_5171.log`.
[23:07:55] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743386 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252307_bblack_5594.log`.
[23:32:44] 10Traffic, 06Operations: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2743429 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp1047.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201610252332_bblack_14641.log`.
[23:44:50] reminder-to-self for next round of nginx debugging: try moving the package forward to latest nginx-master, undoing the event pipe revert, and removing the cloudflare dynamic record size patch....
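On the cloudflare dynamic record sizing patch mentioned above: one hedged way to see whether it is having any effect on the wire is to watch the sizes of the TLS records a frontend sends back. The sketch below drives the handshake through Python's ssl.MemoryBIO so the raw 5-byte record headers can be logged before the bytes are handed to the TLS stack. The frontend hostname, SNI name, and request path are illustrative assumptions, not the actual test targets.

```python
#!/usr/bin/env python3
"""Log TLS record sizes sent by one frontend -- sketch only.

Uses ssl.MemoryBIO/SSLObject so the raw TLS record headers can be parsed
off the TCP stream before the bytes are fed to the TLS implementation.
FRONTEND/SITE/PATH are illustrative assumptions.
"""
import socket
import ssl
import struct

FRONTEND = "cp3030.esams.wmnet"   # hypothetical frontend under test
SITE = "upload.wikimedia.org"     # name used for SNI + Host header
PATH = "/"                        # any largish response is more interesting

ctx = ssl.create_default_context()
ctx.check_hostname = False        # poking a backend host directly
ctx.verify_mode = ssl.CERT_NONE

incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
tls = ctx.wrap_bio(incoming, outgoing, server_hostname=SITE)
sock = socket.create_connection((FRONTEND, 443), timeout=30)

records = []     # (content_type, length) for every record received
parse_buf = b""  # received bytes not yet parsed into record headers

def flush():
    """Send whatever the TLS engine has queued in the outgoing BIO."""
    data = outgoing.read()
    if data:
        sock.sendall(data)

def feed():
    """Read from the socket, log record sizes, hand the bytes to the BIO."""
    global parse_buf
    data = sock.recv(65536)
    if not data:
        incoming.write_eof()
        return False
    incoming.write(data)
    parse_buf += data
    while len(parse_buf) >= 5:
        ctype, _ver, length = struct.unpack("!BHH", parse_buf[:5])
        if len(parse_buf) < 5 + length:
            break
        records.append((ctype, length))
        parse_buf = parse_buf[5 + length:]
    return True

# pump the handshake through the memory BIOs
while True:
    try:
        tls.do_handshake()
        break
    except ssl.SSLWantReadError:
        flush()
        if not feed():
            raise RuntimeError("connection closed during handshake")
flush()

req = "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n".format(PATH, SITE)
tls.write(req.encode("ascii"))
flush()

# drain the response; we only care about the record sizes logged along the way
while True:
    try:
        if not tls.read(65536):
            break
    except ssl.SSLWantReadError:
        flush()
        if not feed():
            break
    except (ssl.SSLZeroReturnError, ssl.SSLEOFError):
        break
sock.close()

appdata = [length for ctype, length in records if ctype == 23]
print("application-data records: {}".format(len(appdata)))
if appdata:
    print("record payload min/median/max: {}/{}/{} bytes".format(
        min(appdata), sorted(appdata)[len(appdata) // 2], max(appdata)))
```

With dynamic record sizing active you'd expect smallish records early in a response growing toward ~16KB later; with the patch removed the sizes should be more uniform. Either way this only observes behaviour on the wire; by itself it says nothing about the connection-close bug.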