[00:07:40] 10Traffic, 10MobileFrontend, 6Operations, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2101562 (10Jdlrobson)
[00:33:36] 10Traffic, 10MobileFrontend, 6Operations, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2101616 (10Jdlrobson) SW...
[11:16:48] 10Traffic, 6Operations: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2102515 (10ema) I've seen the same issue on my test instance in labs. @Ottomata try adding codfw: 127.0.0.1 to cache::text::nodes in ./hieradata/labs.yaml
[11:27:59] 10Traffic, 6Operations: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2102532 (10ema) p:5Triage>3Normal
[12:28:54] 10Traffic, 6Operations: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (10faidon)
[12:29:18] elukey: perhaps https://phabricator.wikimedia.org/T129344 is something for you? :)
[12:31:52] paravoid: sure I can work on it after the vk4 porting!
[12:35:08] 10Traffic, 6Operations: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102834 (10elukey) a:3elukey
[13:31:19] moritzm: does our new 4.4 have a -kbuild- package?
[13:31:45] I had some issues with the existing one, it seems to use some kbuild pkg with a 4.3-based name, but then some other bits want the 4.4 name
[13:33:00] sorry I don't recall "some other bits", it was last night and I closed that window.
but it was systemtap stuff wanting the headers + kbuild packages, and something wouldn't work right with installing them
[13:33:26] not yet, that's built from linux-tools, I'll get to that soon along with perf
[13:34:09] it's a separate source package from src:linux
[13:34:26] should be ready by Friday I think
[13:35:13] ok thanks
[13:49:13] ii linux-headers-4.4.0-1-amd64 4.4-1~wmf1 amd64 Header files for Linux 4.4.0-1-amd64
[13:49:16] ii linux-headers-4.4.0-1-common 4.4-1~wmf1 amd64 Common header files for Linux 4.4.0-1
[13:49:19] ii linux-image-4.3.0-0.bpo.1-amd64 4.3.3-7~bpo8+1 amd64 Linux 4.3 for 64-bit PCs
[13:49:22] ii linux-image-4.4.0-1-amd64 4.4-1~wmf1 amd64 Linux 4.4 for 64-bit PCs
[13:49:25] ii linux-image-4.4.0-1-amd64-dbg 4.4-1~wmf1 amd64 Debugging symbols for Linux 4.4.0-1-amd64
[13:49:28] ii linux-image-amd64 4.4+71 amd64 Linux for 64-bit PCs (meta-package)
[13:49:31] ii linux-kbuild-4.4 4.4-4 amd64 Kbuild infrastructure for Linux 4.4
[13:49:34] bblack: this is the mix-and-matchi situation I have on sid ^
[13:49:54] s/matchi/match/ :)
[14:28:13] bblack: the only situation I could reproduce where both SSL_select_next_proto and SSL_get0_next_proto_negotiated are called for the same connection is when no agreement was found during ALPN
[14:28:23] and the nginx code seems to confirm this:
[14:28:28] SSL_get0_alpn_selected(c->ssl->connection, &data, &len);
[14:28:32] if (len == 0) {
[14:28:39] SSL_get0_next_proto_negotiated(c->ssl->connection, &data, &len);
[14:33:03] OK so in this scenario we would skew our statistics: the client sends a list of supported protocols with ALPN and we count those, nginx does not agree on any of those, SSL_get0_next_proto_negotiated gets called and we potentially count spdy twice
[14:35:01] ema: just managed to run vk without VUT :)
[14:35:08] \o/
[14:36:00] now I need to figure out args and other stuff that was done by VUT, but the basics are covered
[14:53:28] ema: I want to say in theory a client could send just h1 over
alpn and then spdy over npn and nginx's method would screw them out of spdy
[14:53:39] but I wouldn't think anyone would implement a client that stupidly
[14:56:12] right
[14:56:19] * ema cries
[14:56:20] ERROR: module version mismatch (#1 SMP Debian 3.19.3-9 (2015-11-10) vs #1 SMP Debian 3.19.3-9 (2016-01-04)), release 3.19.0-2-amd64
[14:58:19] ema: in any case, if we have a systemtap that should work (and the only thing I think needs fixing for a first iteration is logging spdy-only alpn, too), we can put it in puppet under modules/tlsproxy/files/utils/ or something just to get it in a repo
[14:58:46] wherever those horrible sniffer scripts I put are, I think it was there
[14:59:04] bblack: yes I've added spdy-only alpn. Will drop the .stp file there then
[14:59:10] thanks!
[15:00:44] bblack: sure thing! I've added VTC support to v3, please take a look at it if/when you have time https://gerrit.wikimedia.org/r/#/c/275779/
[15:03:51] ok
[16:10:47] fyi, https://phabricator.wikimedia.org/T66214 is being discussed today/tonight
[16:11:07] it's still very light on the implementation details (...) but "varnish" was mentioned
[16:14:09] paravoid: :)
[16:14:10] thanks
[16:14:27] I haven't been involved in it but I'm planning to attend that meeting
[16:18:16] I probably won't make that meeting
[16:19:14] so I've been staring at a mystery on and off for a few days now
[16:21:22] well there's a few layers of mystery yak-shaving here, but the deepest one is this:
[16:21:40] on cp3030, I can run this varnishlog for long periods and get zero or very few requests:
[16:21:43] varnishlog -u -c -n frontend -m 'RxRequest:GET' -m 'TxStatus:403'
[16:22:15] (so varnishlog claims cp3030 frontend does few to zero 403 responses to GETs)
[16:23:52] but I see GET->403 for cp3030 quite frequently in oxygen's sampled-1000
[16:24:09] e.g.
$ tail -2000 sampled-1000.json | grep cp3030 | grep 'frontend int(' | grep 'https=1' | jq 'select((.http_status|tostring) == "403" and .http_method == "GET")'
[16:24:37] in this case I'm going even narrower: GET->403 on cp3030, and only X-Cache indicating varnish-internal response and x-analytics indicating https=1
[16:25:00] the tail -2000 tends to be a short recent window, so repeating that command over and over tends to give you a few very recent ones
[16:25:19] the datestamps on them don't show much lag from reality->oxygen log, yet the whole time I can't catch them in cp3030 varnishlog
[16:26:06] the few I do catch on cp3030 are not internals, they're e.g. 88 TxHeader c X-Cache: cp1055 pass+chfp(0), cp3012 miss+chfp(0), cp3030 frontend pass+chfp(0)
[16:27:05] it's almost as if for some reason these varnish-frontend-generated 403s get logged to varnishkafka->oxygen somehow, but can't be seen in varnishlog
[16:27:55] a more fine-grained varnishlog to match what oxygen's looking for would be: varnishlog -u -c -n frontend -m 'RxRequest:GET' -m 'TxStatus:403' -m 'RxHeader:X-Cache:.*int'
[16:29:19] anyways, if they're showing up fairly routinely in sampled-1000, I should be seeing a pretty good stream of them in direct varnishlog...
[16:29:22] none of this makes sense
[16:34:54] moving up one layer of wtf: the GET->varnish-internal-403 reqs I see in oxygen sampled-1000 have no clear explanation. Some are for /beacon/ stuff, some are for normal /wiki/Foo. They're rare in terms of percentage, but they happen at a pretty constant background rate. There's nothing obvious about the requests that would match any of our VCL "error 403"
[16:37:39] ignoring the whole problem with varnishlog -vs- oxygen and just digging in VCL: text frontend only has 2 real sources of internally-generated 403 now: the "insecure POST" check in wikimedia-frontend (which should only apply to HTTP reqs, and only to POST, so can't be this?)
[16:37:48] and the 403 Noise matches in text-frontend
[16:38:23] of the 3x 403 Noise checks, one is POST-only and one only matches zero.wiki domains
[16:38:33] the other is this one:
[16:38:34] req.http.referer && req.http.referer ~ "^http://(www\.(keeprefreshing|refreshthis|refresh-page|urlreload)\.com|tuneshub\.blogspot\.com|itunes24x7\.blogspot\.com|autoreload\.net|www\.lazywebtools\.co\.uk)/"
[16:39:03] but the 403s I see don't seem to match that on the surface, but maybe that regex is horribly malformed in some subtle way
[16:40:46] the other options are things like some kind of corruption in the pipeline from varnish->vk->... analytics ...->oxygen-sampled-1000
[16:41:14] where varnishlog actually is correct and this really isn't happening, but something goes wrong generating the oxygen logs (some mis-mapping of the reported http status codes somehow, etc)
[17:27:50] bblack: oh yeah that sounds like a WTF indeed
[17:28:19] I have my own wtf of the day, the stap script starts fine on cp1048 but doesn't "get" any ssl function calls
[17:29:27] maybe I found out what the problem is, nginx was started on Mar02 and libssl got upgraded afterwards
[17:29:57] bblack, moritzm: shouldn't we automatically restart nginx after libssl upgrades?
[17:30:46] ema: the last libssl upgrade wasn't critical, which I think is why we didn't
[17:30:59] we probably should roll through them anyways though, just to be pedantic
[17:31:06] right
[17:31:29] bblack: is it OK for me to restart nginx on cp1048 or are there any precautions to take before doing that?
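(Editor's note: the "maybe that regex is horribly malformed in some subtle way" theory from the [16:39:03] message is easy to test outside VCL, since this pattern uses no PCRE features that Python's re lacks. A minimal sketch with the pattern transcribed verbatim from the log; the sample URLs below are hypothetical, not taken from real traffic.)

```python
import re

# 403 Noise referer regex from the VCL quoted above, transcribed verbatim.
# VCL's ~ operator is an unanchored regex match; re.match anchors at the
# start of the string, which is equivalent here because the pattern
# itself begins with ^.
NOISE_REFERER = re.compile(
    r"^http://(www\.(keeprefreshing|refreshthis|refresh-page|urlreload)\.com"
    r"|tuneshub\.blogspot\.com|itunes24x7\.blogspot\.com|autoreload\.net"
    r"|www\.lazywebtools\.co\.uk)/"
)

def is_noise(referer):
    """Return True if this Referer value would trip the 403 Noise rule."""
    return bool(NOISE_REFERER.match(referer))
```

Spot-checking it this way suggests the alternation groups as intended: it matches the listed refresher/reloader sites over plain http:// only, and nothing else, which would point the investigation back at the varnish->vk->oxygen pipeline rather than the regex.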
[17:40:20] bblack: the 403 thing is really strange
[17:40:57] I'm trying with varnishncsa without any regexp just for fun
[17:41:02] and indeed there are no 403s
[17:41:15] varnishncsa -F '%s' -n frontend | grep 40
[17:41:23] this shows a few 404s, but no 403s at all
[17:44:50] there are a few 403s if you look long enough, but usually they're from MediaWiki for OPTIONS requests
[17:45:29] if you filter down to GET and X-Cache matching 'int' for internally-generated on the frontend, those are the mystery ones that virtually never happen in varnishlog/ncsa AFAIK, but show up in oxygen sampled-1000
[17:45:51] (text-cluster btw)
[18:25:22] 7HTTPS, 10Traffic, 6Operations, 6WMF-Communications, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2104009 (10Florian) @Jalexander: Is there any news you could share or a status update? :)
[19:15:51] ema: your systemtap script is pretty crazy
[19:15:53] good crazy
[19:37:49] ema++ :)
[19:41:36] from a preliminary run of a slightly-earlier variant on eqiad upload for 1 hour, the stats look better than the sniffer ones did before (which have known inaccuracies, now):
[19:42:11] 38.5% none 50.9% both 8.6% spdy-only 1.9% h2-only
[19:42:46] and we've got ~ 2 months to the absolute cutoff, during which they should only get better
[20:27:29] 10Traffic, 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2104514 (10BBlack) Status updates on the 3x things mentioned a couple updates above: 1. (codfw direct): not yet t...
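(Editor's note: the tail/grep/jq pipeline used earlier ([16:24:09]) to pull the mystery GET->403s out of sampled-1000 can also be expressed as a small Python filter, which is handier for counting over longer windows. The field names below — hostname, http_method, http_status, x_cache, x_analytics — are assumptions about the sampled-1000 JSON schema inferred from the greps in the log, not confirmed here.)

```python
import json

def mystery_403s(lines):
    """Return the records matching what the grep/jq pipeline selects:
    cp3030, GET, status 403, a varnish-internal frontend response
    (X-Cache containing 'frontend int('), and https=1 in X-Analytics.
    Each element of `lines` is one JSON-encoded webrequest record."""
    hits = []
    for line in lines:
        r = json.loads(line)
        if (r.get("hostname") == "cp3030"
                and r.get("http_method") == "GET"
                and str(r.get("http_status")) == "403"  # tolerate str or int
                and "frontend int(" in r.get("x_cache", "")
                and "https=1" in r.get("x_analytics", "")):
            hits.append(r)
    return hits
```

Unlike the grep-based pipeline, this only matches https=1 and 'frontend int(' in their actual fields, so a URL that happened to contain one of those substrings could not produce a false positive.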
[20:58:18] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2104625 (10GWicke)
[21:27:59] 7HTTPS, 10Traffic, 6Operations, 6Research-and-Data, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2104704 (10DarTar) John Vanderberg [[ https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_referrer_policy#Also...
[21:42:14] 7HTTPS, 10Traffic, 6Operations, 6Research-and-Data, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2104753 (10Tgr) Note that before the change target sites which were linked over HTTPS received the full URL, not j...
[22:15:26] 7HTTPS, 10Traffic, 6Operations: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2104923 (10CCogdill_WMF) I can confirm the site is ready for SNI for our next event. Thanks for your help!
[22:15:33] 7HTTPS, 10Traffic, 6Operations: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2104924 (10CCogdill_WMF) 5Open>3Resolved
[23:04:16] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105088 (10brion) Quick note from IRC regarding the thumb-URL needs for mobile apps/web: The primary use case...
[23:11:54] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105138 (10Tgr) IMO the two main questions here are: # is this going to be supported by MediaWiki or just by so...
[23:13:55] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105146 (10Tgr) Also, a 100% compatible VCL-layer mapping of nice URLs to old URL is just not gonna happen. For...
[23:39:38] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105215 (10brion) >>! In T66214#2105138, @Tgr wrote: > IMO the two main questions here are: > # is this going t...
[23:48:10] 10Traffic, 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2105242 (10BBlack)
[23:53:27] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: BlockUse content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105268 (10Jrtorres432)
[23:55:19] 10Traffic, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105283 (10brion)
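(Editor's note: circling back to the ALPN/NPN discussion earlier in the log ([14:28:13]-[14:33:03]): nginx consults SSL_get0_alpn_selected first and falls back to SSL_get0_next_proto_negotiated only when ALPN produced no agreement, so naively counting both callbacks can count spdy twice for one connection. The sketch below is a toy model of deduplicated per-connection counting — the kind of tally behind the "38.5% none 50.9% both 8.6% spdy-only 1.9% h2-only" summary — not ema's actual systemtap script; the protocol strings are illustrative.)

```python
from collections import Counter

def classify(alpn_protos, npn_protos):
    """Classify one TLS connection by the client's advertised support.
    ALPN and NPN lists are unioned first, so a protocol seen via both
    extensions (or via NPN fallback after an ALPN disagreement) is
    counted at most once per connection."""
    protos = set(alpn_protos) | set(npn_protos)
    h2 = "h2" in protos
    spdy = any(p.startswith("spdy") for p in protos)
    if h2 and spdy:
        return "both"
    if h2:
        return "h2-only"
    if spdy:
        return "spdy-only"
    return "none"

def tally(connections):
    """Aggregate (alpn_list, npn_list) pairs into none/both/spdy-only/h2-only counts."""
    return Counter(classify(a, n) for a, n in connections)
```

The key point is the set union: a client that advertises spdy/3.1 over both ALPN and NPN still contributes one spdy observation, avoiding the double-count bblack and ema identified.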