[08:05:36] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2597465 (10SindyM3) >>! In T118388#2595125, @Multichill wrote: >>>! In T118388#2594683, @Dzahn wrote: >> @SindyM3 I think this should probably have a new tick... [08:12:48] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2597481 (10SindyM3) @Dzahn I'm sorry for not creating a new ticket. The admin of schippers.wikimedia.nl is Lemonbit and the send me this email. So I'm lookin... [08:35:16] 10Traffic, 06Operations: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2597553 (10Danielsberger) I have been working for some time with Ramesh on cache admission policies for Akamai workloads, and I can contribute the following takeaways # frequency-based admission -... [09:18:17] 10netops, 06Operations, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597652 (10akosiaris) Given the patch set above, I 've removed the rule for port 7000. I 've LGTMed the patch as well.Thanks! [09:19:10] 10netops, 06Operations, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597653 (10elukey) thanks a lot and sorry for the delay! [09:26:35] 10netops, 06Operations, 13Patch-For-Review: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2597664 (10akosiaris) 05Open>03Resolved Resolving then. Thanks! [11:09:53] http://spectrum.ieee.org/tech-talk/telecom/internet/researchers-map-locations-of-4669-servers-in-netflixs-content-delivery-network [11:28:01] for some reason I was convinced that Netflix used commercial CDNs [11:28:04] really nice [11:36:52] I remember also a facebook talk about their caching/pop infra since all the app-servers needs to reside in the U.S. (or at least, this was true a couple of years ago) [11:37:21] TLS termination on the caching pops and long lived connection with the US data centers [11:47:06] elukey: yeah they use AWS for "everything before you hit 'play'", their own CDN for video delivery [11:47:28] "the cloud" [11:48:37] the researchers found a typo in one of netflix's hostnames, Carrasco International Airport is abbreviated MDV instead of MVD :P [11:49:29] ahhaah [11:50:58] they also found a few servers in the middle of the pacific (Guam) [11:54:14] very interesting paper, they estimated traffic volumes by looking at the IPv4 ID field which on FreeBSD is predictable http://arxiv.org/pdf/1606.05519v1.pdf [11:55:10] any chance the ulsfo zayo link issues are confusing our testing, at least on the 503s? [11:56:23] bblack: possibly, we could perhaps look into icinga logs and try to correlate? [11:56:47] also the maintenance window is over in 4 minutes, we should repool ulsfo I guess [12:04:38] ema: I don't think icinga has an accurate view [12:04:44] bblack: mmh, only upload is affected by the 503 spikes though, so probably they're not due to the ulsfo zayo link [12:04:55] they still could be [12:05:06] upload's traffic and upload's software are different than text [12:05:25] text is shorter transactions on misses across that link, upload has bigger things that take a little longer [12:05:47] probably librenms + varnish-aggregate 503s would say if they correlate [12:06:26] if even some of the 503s are from ulsfo link problems, it's confusing what we're looking at with v4 [12:11:00] 10Traffic, 06Operations: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2597958 (10BBlack) >>! In T144187#2597553, @Danielsberger wrote: > frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonder... [12:13:56] ema: mostly I bring up the link issue because I saw Mark's email to Zayo this morning, that said: [12:13:59] We are a customer of yours, and we're currently experiencing problems with one of our 10G wavelength links, circuit OGYX/124337//ZYO, between Carrollton and San Francisco. [12:14:03] This circuit has been flapping repeatedly during the past week, including at least 11 flaps today since 05:31 UTC. [12:14:23] we had 3 links down concurrently [12:14:34] and ulsfo has been completely down a few times [12:14:38] so definitely, some 503s would be due to that [12:14:53] if "repeatedly during the past week" is our little 503 plateaus we've been seeing in ulsfo 503 rates.... the 0-byte CL thing could also be related. it would still be a varnish bug, but it would be one with a known trigger heh. [12:15:03] interesting [12:15:15] yeah that one link has indeed been down repeatedly [12:15:16] mark: would that be cr2-ulsfo:xe-1/3/0? [12:15:17] ema: do you have a librenms login? [12:15:19] the other two not I believe [12:15:22] checking [12:15:40] bblack: yes, I'm trying to find the right graphs to look at :) [12:15:53] ema: correct [12:16:00] ok [12:17:27] well the link issue should be fixed now maint is done [12:17:38] so if your issues are suddenly gone... :) [12:18:16] well, I depooled ulsfo-upload yesterday just to avoid users seeing the (minor) issues while neither of us was actively investigating overnight [12:18:19] so we don't know yet :) [12:20:05] bblack: when did you depool ulsfo upload yesterday? [12:21:00] a little bit after 20:05 UTC [12:21:14] ah, that explains the traffic drop here then https://librenms.wikimedia.org/graphs/to=1472645700/id=7518/type=port_bits/from=1472559300/ [12:21:19] (that was the commit time, and it was a pretty quick commit->merge->deploy) [12:45:26] I cannot really correlate the 503s with cr2-ulsfo:xe-1/3/0 issues [12:45:55] (assuming that cr2-ulsfo:xe-1/3/0 issues means In/Out traffic going to 0 here https://librenms.wikimedia.org/graphs/to=1472646900/id=7518/type=port_bits/from=1472546100/) [12:46:36] for instance, we had a long 503 plateau starting around 16:30, and the traffic graph for zayo looks fine at that ime [12:46:43] *time [12:50:57] oh well :( [12:51:48] so yeah we could recap theories, but really at this point, digging into the code path on the CL:N->CL:0 transition on the frontend is probably the most fruitful path [12:51:58] as you said yesterday, maybe systemtap or similar [12:52:02] yeah [12:52:26] can probably narrow down to only a few cases in the v4 source where CL could possibly be set to zero [12:53:28] also I haven't really been looking at the 503s in the same detail when they spike up, but I think we did think they were concurrent with the CL:0 traffic. they may have related underlying causes, and might be easier to understand in varnishlog [12:54:38] stupid idea of the day is coming, ready? :) [12:54:48] would it make sense to pass on CL:0 and Status:200 on the frontends to alleviate the issues a bit? [12:55:55] we don't know CL till the response side, though. [12:56:16] and we don't want to HFP, as it may work fine for later requests. we could beresp.ttl=0s+beresp.uncacheable, though [12:56:29] but it's just going to hide the issue from view for debugging [12:56:59] mmh yeah [12:57:32] well [12:57:45] on the other hand, we don't need the cache hits to see the issue I guess, just the ones that happen on miss [12:58:08] actually it might help pinpointing the issue given that we would get rid of the frontend hits with CL:0 [12:58:30] so yeah, maybe in the backend_response code for all layers, if status==200 and CL:0, beresp.ttl=0s + beresp.uncacheable [12:58:35] in a v4-only block for now [12:59:00] ok I'll give it a shot [13:07:58] 10Traffic, 06Operations: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2598030 (10BBlack) Tying the X-Cache idea together with the tuneable hitcount and size from the paper, we could look at something of the form `if (hc / (os * X) < Y) { uncacheable; }`, where `hc` is... [13:29:14] shall we reenable ulsfo? [13:29:59] yeah I think so. there's 2x commits there to revert [13:30:13] (alex's for all of ulsfo, and then my earlier one for upload-ulsfo) [13:30:45] once we do that i'll post a brief incident report [13:33:27] ema: ok for upload traffic, or want to puppet that patch out first? [13:35:06] EC logging patch for nginx works! :) [13:35:08] 20 RxHeader c X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-GCM-SHA256; EC=prime256v1; [13:35:11] 20 RxHeader c X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=AES128-SHA; EC=UNDEF; [13:35:11] \o/ [13:35:14] 20 RxHeader c X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-RSA-AES128-GCM-SHA256; EC=secp384r1; [13:35:22] bblack: I'd say we can proceed without puppeting the patch first, in any case we probably have to ban whatever has been cached with CL:0 and Status 200 [13:35:29] ema: ok [13:38:10] also notable is that, at least with openssl itself as the test client, you can't have different curves for the ECDHE and ECDSA parts [13:38:28] our ECDSA cert is fixed, it is defined and signed as a prime256v1 key [13:38:49] whereas ECDHE in openssl-1.0.x can do several curves, the important ones being the nist 256, 384, and 521 curves [13:39:07] at defaults it will use primve256v1 for both an ECDHE-ECDSA [13:39:35] but if you tell openssl s_client to use curve secp384r1, the server will negotiate that for the ECDHE part and then fall back to the RSA certificate, since it's not the same curve [13:39:39] bblack, ema: FYI, I recently uploaded a new kernel package to apt.wikimedia.org: 4.4.2-3+wmf3, which is based on 4.4.18 and includes a fix for (mitigated) CVE-2016-5696 amongst many other fix. It's now running on 36 jessie hosts, so should be ready for the cp/lvs hosts as well [13:39:52] I'm just not sure if that locked-together limitation is server-side or client-side in openssl-1.0.x [13:40:39] moritzm: ok thanks! we'll need to find some time to do depooled reboots and such. I'll at least get the package installed and ready today though. [13:41:35] I'm assuming that with openssl-1.1.x the client and server are both capable of using different curves for the ECDHE exchange and the ECDSA cert, or else openssl-1.1.x's support for x25519 would be mostly-pointless. [13:41:41] thanks, no hurry on that one, it doesn't fix any severe kernel issue except the mitigated one [13:46:11] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160831-ulsfo [13:50:22] uh "Comparison of different types: STRING '==' INT" [13:50:30] beresp.http.Content-Length == 0 [13:51:07] CL is a string apparently [13:53:35] :) [13:55:30] as much as I love automatic type conversion convenience... [13:55:47] in VCL it's probably better this way, or we'd have one more layer to decode on the truthiness of strings [13:56:29] i.e. hieradata->puppet->ruby->VCL all with different idea of whether the string "0" is false and so-on [13:56:49] yeah we would miss situations where 10 != "10" and go mad [13:56:50] (sadly we alrady face some of that, but at least lack of implicit VCL string->int takes away a little of it) [13:59:28] any reason why CL should be a string instead of something looking like a number? [14:00:12] CL doesn't get special treatment, it's an http header field and thus a string [14:00:25] ah! [14:00:47] it's not that awful though. in v3 we just use regexen to fixup size compares [14:00:56] in v4 vmod_std can convert to int for us, though [14:01:27] oh wait, it can't heh [14:01:47] they did include convertors for reals to integer or time, and time to integer or real, though [14:02:29] no, wrong the second time, right the first time [14:02:38] :) [14:02:45] it does have integer("foo", 5), which converts foo to an int, or falls back to 5 if conversion fails [14:03:33] so you could do e.g. if (integer(beresp.http.Content-Length, 32*1024*1024) >= 32*1024*1024) [14:03:43] in place of our current regex number-counting hacks to make such comparisons work [14:13:53] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598217 (10Dzahn) @SindyM3 No problem. I don't know who is in charge off that, but WMNL people would know. When looking at https://www.wikimedia.nl/blog i se... [14:19:31] they should have gone with lua [14:20:10] like ATS :) [14:20:54] https://docs.trafficserver.apache.org/en/5.3.x/reference/plugins/ts_lua.en.html [14:25:56] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598286 (10SindyM3) @Dzahn WMNL people don't know because I am one of them :( [14:28:08] httpd also offers a nice mod_lua, maybe it could be used in apservers land? [14:28:11] * elukey hides [14:41:03] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598330 (10siebrand) @sindyM3: it's your server that's being exploited and its maintained by a third party that you pay to keep you safe. It's probably time t... [14:41:26] systemtap is not doing what I say [14:41:45] (just wanted to share that) [14:43:04] :) [14:46:44] bad systemtap bad! plus the rolled newspaper [15:49:11] 07HTTPS, 10Traffic, 06Operations, 10Wiki-Loves-Monuments: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2598561 (10Akoopal) @siebrand I will send a mail about this, but imho lemonbit has done the correct steps. [17:22:09] 10netops, 06Operations: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2598949 (10Jgreen) [17:55:58] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2599097 (10dr0ptp4kt) Heads up @Jhernandez, I think this implies there's a different, more can... [17:57:40] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2562534 (10Mattflaschen-WMF) If this affects window.Geo, could you coordinate to avoid breakin... [18:02:12] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2599113 (10BBlack) The cookie itself isn't going away (and is normally set by our servers in t... [19:40:43] I took a long break, came back, managed to convince systemtap to be reasonable [19:41:09] it seems that all CL rewrites with status 200 happen in cnt_vdp [19:41:32] but I didn't manage to spot any rewrite to 0 [19:42:03] vdp is generally ESI and gzip right? [19:42:46] looks like [19:43:01] and Range I think [19:44:07] we've been here before, in this same area of code, debugging other things in the past... [19:44:19] I think this is where some of my failed patches where for the other CL:0 issue [19:44:24] s/where/were/ [19:44:34] resp_len = http_GetContentLength(req->resp); [19:44:34] if (bo != NULL) [19:44:34] req->resp_len = resp_len; [19:44:34] else [19:44:34] req->resp_len = ObjGetLen(req->wrk, req->objcore); [19:44:37] etc... [19:44:50] this is the stap script I'm running on cp4005: https://phabricator.wikimedia.org/P3953 [19:44:50] the whole thing about whether it's a busyobj or not [19:45:00] output in ~ema/T144257.log [19:45:49] something like this is happening (or well, something kinda related to this idea): [19:46:03] 1. the object has a defined, non-zero Content-Length header [19:47:00] 2. but the object is busy, so rather than defining the function-local "resp_len" from CL, we get it from the object, which is what's happening in the if/else pasted above [19:47:21] 3. somehow the busy object returns a length of zero, even though there is a non-zero CL [19:47:36] then way down at the bottom of cnt_vdp: [19:47:38] } else if (resp_len >= 0 && resp_len == req->resp_len) { /* Reuse C-L header */ [19:47:56] this causes resp_len != req->resp_len (resp_len is the real CL, and req->resp_len is now zero) [19:48:07] which leads to: [19:48:09] "Content-Length: %jd", req->resp_len) [19:48:27] right [19:48:58] I'm printing resp_len and req->resp_len and check if the latter is ever set to 0 [19:50:32] ok [19:50:35] but so far no zeros? [19:50:43] may just have to wait for another "event" [19:51:06] so far no zeroes [19:51:09] my gut feeling is this is a race condition on nukes and/or purges [19:51:39] also awesome systemtap, the probe uses 0% cpu and 0% memory :) [19:51:40] there's a "busy" object which is half-freed/nuked, which hits like it's still useful and does have a CL header, but the body's already partly or mostly evicted and ObjGetLen() gives zero [19:53:25] we have 503's ~14:00-18:00, but missing at present, probably matching the lack of CL:0 in your tap logs [19:53:46] right [19:53:55] the peak rate is lower before, perhaps due to the uncacheable/CL:0+200 thing before [19:54:24] I'm also printing stuff in places that make little sense like cnt_synth and anywhere CL gets written really [19:54:54] I'm out for a bit. nice work with systemtap. [19:55:20] I'd say leave it all as-is for now and see what turns up overnight from logs [19:55:59] it doesn't look that user-dangerous statistically, the failures we see now with the mitigating VCL in place [19:56:39] nice [19:57:18] cya :) [20:07:35] interestingly, there has been no output from the backend yet [20:07:43] only varnish-fe [20:30:39] oh well varnish-be is depooled on cp4005 [20:33:27] I'm now running the script on all cp4* upload hosts, output in ~ema/T144257.log [22:37:40] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600311 (10awight) a:03AndyRussG [23:10:47] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 4 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2600401 (10DStrine)