[07:27:33] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10Peachey88) >>! In T132521#2454541, @demon wrote: > Did we ever make that an official...
[07:29:48] Hello traffic people, I am going to upgrade varnishkafka in misc
[07:31:01] and then this afternoon I'll add the -T 700 argument to increase the VSL API timeout
[07:31:16] this should solve the analytics inconsistencies
[07:32:24] elukey: cool!
[10:07:09] thank you vcl for the free headache you gave me this morning
[10:12:14] :)
[10:40:50] ema: anything against https://gerrit.wikimedia.org/r/#/c/298929/ ?
[10:47:31] elukey: nothing, +1ed
[10:48:33] super thanks
[10:48:38] will merge after lunch
[11:26:19] done!
[11:26:29] I ran puppet on cp3009 and everything seems good
[11:26:39] logs are flowing and memory consumption is good
[11:27:56] nice
[12:03:38] bblack: https://phabricator.wikimedia.org/P3440
[12:03:43] (likely a brainfart)
[12:04:07] tentative solution to the problem "how to set bereq.http.Range on v4"
[12:04:45] I've cheated by putting vcl_backend_fetch in templates/varnish/upload-common.inc.vcl.erb but I know it's Wrong
[12:41:48] :)
[13:12:40] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2461831 (10BBlack) We probably have to be a little bit careful about broad "https all the things...
[13:14:22] 10Traffic, 10DBA, 06Labs, 06Operations, 10Tool-Labs: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2461834 (10jcrespo) 05Open>03Resolved a:03jcrespo
[13:32:02] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2461880 (10elukey) We are now in pending verification of fix, let's see if oozie will not complain during the next couple of days. It...
[13:53:55] Snorri_: re the Hitler article I'm monitoring - there was an actual logged PURGE circa 2016/07/13 16:51:30 (UTC), and the Age did reset to zero (I'm polling the response Age and X-Cache headers every 10 minutes or so). This seems to be a little over an hour *after* the last edit timestamp on that article, but was probably related (maybe the PURGE queues are that backlogged in jobrunner and related?)
[13:54:29] Snorri_: so far I haven't seen an Age drop (or a drop from the frontend cache I'm hitting at all) since then, my last poll shows:
[13:54:32] Thu Jul 14 13:53:26 UTC 2016
[13:54:35] < Age: 75691
[13:54:37] < X-Cache: cp1068 hit/1, cp1065 hit/419
[13:54:58] so it will hit the 24H mark in about another 3H.
[13:56:12] I'm hoping there are no more edits (if we're lucky, for days!), so we can see if something else pushes it out, or resets Age somehow
[14:15:56] I'm interested to see if it will be purged at 24H or if this will work as intended. Also, you are connecting to eqiad directly. If it does not happen there, it may just be happening in esams. That would mean the Age is not transitive, and it would be even stranger.
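As a sanity check on the arithmetic above (Age: 75691 at the 13:53 UTC poll, "24H mark in about another 3H"), a minimal Python sketch; the helper name is my own, not anything from the infrastructure:

```python
# Sanity-check of the poll arithmetic above: how long until the
# frontend's 1-day TTL cap (86400s) would expire for an object with a
# given Age header. Numbers are taken from the poll output in the log.
ONE_DAY = 24 * 3600  # the frontend TTL cap discussed in the log

def seconds_until_ttl_cap(age, cap=ONE_DAY):
    """Seconds remaining before an object of this Age hits the cap."""
    return max(cap - age, 0)

remaining = seconds_until_ttl_cap(75691)
print(remaining, round(remaining / 3600, 1))  # 10709 seconds ~ 3.0 hours
```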
[14:25:43] even in eqiad Age needs to be transitive, as we have both frontend and backend caches there
[14:26:21] it is possible that esams has a significantly different hot dataset that would have a different natural rollover behavior, but still I don't expect an organic effect like that to coincidentally line up on the 24H mark
[14:27:15] If the frontend re-fetching an old object cached in a backend resets Age (when the 1d TTL cap expires in the frontend), then yeah we have some kind of Age transitivity problem, which would be unexpected indeed...
[15:10:21] 10Traffic, 06Operations: Lower geodns TTLs from 600 to 300 - https://phabricator.wikimedia.org/T140365#2462333 (10BBlack)
[15:10:54] 10Traffic, 10netops, 06Operations: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#1341583 (10BBlack)
[15:10:56] 10Traffic, 06Operations: Lower geodns TTLs from 600 to 300 - https://phabricator.wikimedia.org/T140365#2462348 (10BBlack)
[15:20:58] 10Traffic, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2462364 (10elukey) Seems working! After 11 UTC no more empty dt fields. ``` ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatal...
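The X-Cache values being polled (e.g. `cp1068 hit/1, cp1065 hit/419`) encode, per cache layer, a hostname, a hit/miss status, and a hit counter. A minimal Python parser for that format; the helper name and tuple layout are illustrative, not a real API:

```python
# Parse an X-Cache header like "cp1068 hit/1, cp1065 hit/419" into
# (host, status, count) tuples, left-to-right as the header lists them.
# Entries without a counter (e.g. "cp1065 miss") get count 0.
def parse_x_cache(value):
    entries = []
    for part in value.split(","):
        host, _, status = part.strip().partition(" ")
        status, _, count = status.partition("/")
        entries.append((host, status, int(count) if count else 0))
    return entries

print(parse_x_cache("cp1068 hit/1, cp1065 hit/419"))
# [('cp1068', 'hit', 1), ('cp1065', 'hit', 419)]
```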
[16:41:29] Hitler stuff still on-track: no new PURGE yet since the one noted yesterday, and last poll is:
[16:41:32] Thu Jul 14 16:33:27 UTC 2016
[16:41:35] < Age: 85292
[16:41:37] < X-Cache: cp1068 hit/1, cp1065 hit/523
[16:54:29] Thu Jul 14 16:43:27 UTC 2016
[16:54:29] < Age: 85892
[16:54:29] < X-Cache: cp1068 hit/1, cp1065 hit/529
[16:54:29] Thu Jul 14 16:53:27 UTC 2016
[16:54:29] < Age: 11
[16:54:31] < X-Cache: cp1068 hit/1, cp1065 miss
[16:54:40] Thu Jul 14 16:54:11 UTC 2016
[16:54:40] < Age: 55
[16:54:40] < X-Cache: cp1068 hit/1, cp1065 hit/1
[16:55:03] I was doing 600s polling (the last one is manual), and it did roll over at the 1d mark
[16:55:37] there was no associated PURGE, and almost certainly the "event" that happened there was that the capped 1-day TTL on the frontend ran out
[16:55:59] but note cp1068 was still a hit on the miss, so the reset of Age: is the mystery here
[16:57:29] hmmm but that hit-counter shouldn't have incremented
[16:57:33] (on cp1068)
[16:57:36] something's not right...
[16:58:49] so maybe it's not Age-transitivity, but for some reason it fell out of the backend when it fell out of the frontend. It wasn't PURGE, and we know the appservers are still sending 31d s-maxage, and the TTL cap there is 7d
[17:00:45] confirmed it's not a VCL templating issue (the real live deployed VCL's ttl cap code does say "7d"), and mediawiki does send this with the response:
[17:00:48] < Cache-control: s-maxage=2678400, must-revalidate, max-age=0
[17:01:21] Okay, so... is the mem-cache purge through TTL maybe cascading to the backend?
[17:01:29] For some unknown reason?
[17:01:47] that doesn't make sense, but then again nothing about this does, yet.
[17:02:09] that it fell out of the frontend at the 86400s mark is expected, but why did the backend miss at the same time as well?
[17:02:26] we also know MW isn't sending Expires or ETag, so it's probably not related to those things
[17:02:47] exactly... and I'm not an expert on this. Never worked with varnish before my thesis. So...
[17:03:04] it could be that on expiry the frontend is still sending an IMS request to the backend, and the backend decides to IMS to the wiki, and the wiki's Date always updates and it always fails IMS checks?
[17:03:37] that's the best theory I have right now
[17:03:45] What do IMS and MW stand for?
[17:03:48] MW emits a fresh Date: header on every fetch of the article
[17:04:07] MW = MediaWiki (the application server from varnish's point of view). IMS is If-Modified-Since.
[17:04:29] Ahh.
[17:05:05] the idea is if a cache has a stale object with a Date header and needs to refresh, it can send the new request with "If-Modified-Since: ", and the responding server can either send fresh content or say "Nope, it's still the same old object with the same old date, keep using it" via a 304 response.
[17:06:04] It wouldn't be surprising that IMS checks would never 304 from MediaWiki. But I wouldn't think that the varnish-frontend doing an IMS check to the varnish-backend would cause a further IMS cascade to MediaWiki. It should've just satisfied it with the object it still had cached.
[17:06:41] And the new Date might set the IMS flag, and therefore the cache thinks it changed, so it loads a new object and therefore resets the Age?
[17:08:14] well ok, I tested IMS on MediaWiki, and it actually behaves better than I expected
[17:08:22] curl -vs -H 'Host: en.wikipedia.org' -H 'If-Modified-Since: Wed, 13 Jul 2016 17:07:07 GMT' http://mw1268.eqiad.wmnet/wiki/Adolf_Hilter >/dev/null
[17:08:35] < HTTP/1.1 304 Not Modified
[17:08:35] < Date: Thu, 14 Jul 2016 17:07:52 GMT
[17:08:39] < Server: mw1268.eqiad.wmnet
[17:08:39] < Cache-control: s-maxage=2678400, must-revalidate, max-age=0
[17:09:03] So it does get the 304 Not Modified.
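The If-Modified-Since exchange just described can be modeled in a few lines of Python. This is a toy origin-side check under the usual HTTP semantics, not MediaWiki's actual implementation:

```python
# Toy model of the origin-side If-Modified-Since check described above:
# if the client's IMS date is at or after the resource's Last-Modified,
# answer 304 (no body); otherwise send a full 200 with the content.
# Illustrative sketch only; not MediaWiki's real code path.
from email.utils import parsedate_to_datetime

def respond(last_modified, if_modified_since=None):
    lm = parsedate_to_datetime(last_modified)
    if if_modified_since is not None:
        ims = parsedate_to_datetime(if_modified_since)
        if lm <= ims:
            return 304  # "still the same old object, keep using it"
    return 200  # first fetch, or modified since: send full content

# The log's test: LM from the last purge, IMS slightly later -> 304.
print(respond("Wed, 13 Jul 2016 16:51:30 GMT",
              "Wed, 13 Jul 2016 17:07:07 GMT"))  # 304
```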
[17:09:08] so even though my IMS date is 24H ago (a little after the last legitimate purge), MW does tell me it's not-modified, and sends updated Date+CC
[17:10:38] So it's unlikely that this is the reason for the purge...
[17:10:38] I could see 304 re-setting Age, too (but need to test that)
[17:11:19] But why should the 304 reset the Age? At least logically it shouldn't, right?
[17:11:20] but still, the backend (in my case, cp1068) should've kept the object for 7 days before it even did an IMS check. The CC header says 31d, and we cap obj.ttl to 7d there. The object was only 1d old...
[17:11:52] Snorri_: I don't know that it does, but it could, and it wouldn't be catastrophic if it did. It would just mean we need to know that to interpret what Age really means in this context.
[17:13:33] interestingly, if I poll the backend (cp1068:3128) directly, while it is showing a reset Age value, the LM date is still the old one
[17:14:18] Well... my idea from when I started my thesis is not usable. At least not as easily as I had hoped.
[17:14:20] $ curl -vs -H 'Host: en.wikipedia.org' http://cp1068.eqiad.wmnet:3128/wiki/Adolf_Hilter 2>&1 >/dev/null |egrep '(Age|Last-Modified):'
[17:14:23] < Last-Modified: Wed, 13 Jul 2016 16:51:30 GMT
[17:14:25] < Age: 1033
[17:14:59] so it's showing Age on the backend as being reset recently when we hit the 24H mark, but it's also showing an LM timestamp from back at the last real purge, >24H ago
[17:15:44] if it had truly fetched new content from MediaWiki (a full 200 OK response), the LM date would've bumped
[17:15:49] on 304 I'm not sure
[17:15:55] Okay, so the Age is reset independently from the Last-Modified (consistent with my collected data)
[17:16:32] I still don't see why cp1068 would've reset its age at the 24H mark. Nothing should've expired at that point there.
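The TTL numbers in play above (MW sends `s-maxage=2678400`, i.e. 31d; the backend caps obj.ttl at 7d, the frontend at 1d) amount to taking the minimum of the origin's s-maxage and each layer's cap. A hedged Python sketch of that calculation; the function names are mine, not Varnish's:

```python
# Sketch of the TTL capping logic discussed above: each varnish layer
# takes the origin's s-maxage but caps it (7d on backends, 1d on
# frontends, per the log). Parsing helper is illustrative only.
import re

def s_maxage(cache_control):
    """Extract the s-maxage value (seconds) from a Cache-Control header."""
    m = re.search(r"s-maxage=(\d+)", cache_control)
    return int(m.group(1)) if m else None

def effective_ttl(cache_control, cap_seconds):
    sm = s_maxage(cache_control)
    return min(sm, cap_seconds) if sm is not None else 0

cc = "s-maxage=2678400, must-revalidate, max-age=0"  # 31 days from MW
print(effective_ttl(cc, 7 * 86400))  # backend: capped to 604800 (7d)
print(effective_ttl(cc, 1 * 86400))  # frontend: capped to 86400 (1d)
```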
[17:17:01] unless, as you said, somehow the frontend's (probably IMS) fetch from the backend *causes* the backend to re-check IMS with MediaWiki on its otherwise not-stale object.
[17:18:07] even the refreshed content in the frontend has that same >24H Last-Modified stamp
[17:18:21] That's the only thing I can think of right now. But as I said... you are obviously a lot more familiar with what happens with the caches and how varnish operates
[17:18:44] not always, apparently! :)
[17:19:03] hehehe
[17:19:42] so my best working theory so far is there's two effects I wasn't expecting: (1) Age gets reset on a successful IMS->304 response + (2) FE->BE IMS triggers BE->MW IMS, even though BE has an object with plenty of local TTL left in it
[17:19:53] Well... my collected data was only stored with full 200 OK responses, and the LM date was often >24h ago and still my Age reset.
[17:19:56] we can spin up some test daemons and validate whether those assertions are true and why
[17:20:52] I really don't think the object was re-transferred from MW->BE or BE->FE, but I think the FE's expiry after 1d (ttl cap) caused a chain of 304's and Age-resets all the way through to the applayer.
[17:22:27] I could see how this would make sense from varnish's perspective, too. If it gets a regular request for an object with plenty of cache lifetime left, it responds with the object without re-validating. But if it gets an IMS check on such an object, it tries IMS to the backend first, since it really doesn't know if it's been modified since, regardless of TTL. There could be some HTTP standards or sanity reason that things have to operate that way.
[17:22:32] Even though I feel I only understand like 20% of what happens here, it is really fascinating :)
[17:23:26] from our perspective it's not awesome, though. We'd really rather the backend trust its own TTL and assume no modification-since at a deeper layer.
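The "best working theory" above (a chain of IMS revalidations resetting Age at each layer when the frontend's cap expires) can be illustrated with a toy two-layer model. Everything here is an assumption encoding the suspected behavior from the log, not Varnish's actual logic:

```python
# Toy model of the suspected behavior: when the frontend's TTL cap
# expires it revalidates (IMS) against the backend, the backend in turn
# revalidates against the origin, and each 304 resets that layer's Age.
# This encodes the theory in the log; it is NOT real Varnish code.
class Layer:
    def __init__(self, name, ttl_cap):
        self.name, self.ttl_cap, self.age = name, ttl_cap, 0

    def tick(self, seconds, deeper=None):
        """Advance time; on cap expiry, revalidate and reset Age."""
        self.age += seconds
        if self.age >= self.ttl_cap:
            if deeper is not None:
                deeper.revalidate()  # FE->BE IMS triggers BE->MW IMS
            self.revalidate()

    def revalidate(self):
        self.age = 0  # suspected effect: a 304 resets Age at this layer

backend = Layer("cp1068", ttl_cap=7 * 86400)
frontend = Layer("cp1065", ttl_cap=1 * 86400)

backend.age += 86400                  # backend copy is also a day old
frontend.tick(86400, deeper=backend)  # frontend hits its 1d cap
print(frontend.age, backend.age)
# 0 0: both Ages reset at the 24H mark, despite ~6d of backend TTL left
```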
[17:23:58] (and the Age: reset is unfortunate for gathering statistics, but not terribly bad for us in any functional sense)
[17:24:41] perhaps this behavior is only in the case of "must-revalidate" being present in the applayer CC-header
[17:25:10] Well, this is not a bad assumption though. It could be some "failsafe" to be sure to check even long-living data, so no purge can go missing for some reason, or something like that
[17:26:19] from an HTTP standards perspective, quite possibly. from our perspective we've assumed s-maxage -based TTL calculations would overrule that.
[17:26:21] Yeah, for my original thesis it only means I have to look at the data of each day and then accumulate it.
[17:26:45] Snorri_: or calculate an artificial Age: header of your own from (now - Last-Modified)
[17:27:11] that would give the age from the last full re-fetch of the object into cache, but ignores that we've sent 304-checks since then.
[17:27:27] we saved the bytes-transferred, but not the requests/sec.
[17:28:10] anyways, I'll have to do more thinking and testing and reading to make sense of whether there's a way we can avoid this or improve on it without breaking anything
[17:28:31] thanks for bringing this up in the first place :)
[17:29:12] Okay, but... now I do have a question about the last modified field
[17:30:02] the answer is probably "it's caches all the way down" :)
[17:30:09] I thought the last modified field is only updated when the article is actually modified. Not when the object is fully refetched.
[17:30:52] correct
[17:30:55] Nope... that I already knew, as I discussed a lot with ema already. Getting to know your cache structure in the first place.
[17:31:29] LM remains "Wed, 13 Jul 2016 16:51:30 GMT" in every related fetch: if I do a full fetch direct from MediaWiki, or a 304 IMS fetch from it, or any kind of fetch from cache.
[17:31:59] Okay... good. Then the artificial age does not really help ;)
[17:32:16] If I had to take an educated guess, I think the LM date is set when MediaWiki stores the outputs into its parser cache, along with the content.
[17:32:29] yeah good point, it does mean the artificial age isn't useful
[17:33:14] I want to determine if the Age value can be used to approximately see how popular a website is. So the artificial age does not help
[17:34:06] But knowing there is the 1d purge (for some reason), I can look at a few days individually and then accumulate them into a longer time-span
[17:34:11] assuming my "best working theory" above is true (still need to spend some time confirming that), "Age" is useful, in that it tells us how long it's been since we did at least an IMS check to the application layer, which is a request. That it was a 304 and didn't send the content bytes along is not all that important.
[17:34:34] but there's no combination of available metadata that will indicate to us when the whole content was last fetched up into the cache infra from mediawiki :(
[17:35:29] too bad. Such metadata could be quite useful. In more cases than mine, I think.
[17:36:10] definitely
[17:36:46] we had assumed in the past that Age told us these things. And it probably approximately did, back when everything was 30d, before we started efforts to progressively cap the TTLs lower and lower at various varnish layers.
[17:37:10] (historically, MW emitted 31d TTLs, and all layers of varnish capped TTL around the 30d mark as well)
[17:37:17] could this be set up? With the switch to varnish 4 it would be the perfect time to try to integrate something like this, right?
[17:37:47] I don't know. first I need to understand the mechanics of this issue
[17:38:00] Yeah, I read that in the phabricators :) back then it should have done that, yeah.
[17:38:21] Sorry for putting more work on your table ;)
[17:39:04] well it's a mystery that needed solving anyways. Someone probably would've eventually noticed, and the sooner the better :)
[17:39:59] it probably means our earlier analyses based on statistical sampling of Age: were flawed, too.
[17:43:05] It might be. But with a longer TTL it might not be that flawed. With for example 7d as TTL in both mem and data cache, the data might be flawed, but maybe only in some insignificant way (guessing that most articles get modified more often than every 7d, at least if popular enough to stay in the cache that long)
[18:53:02] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2463436 (10Tbayer) Update (@faidon: I'm still unsure if you are receiving the updates from the internal ticket with Automattic): On April 21, I had asked Automattic's support about the...
[19:00:44] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2463476 (10faidon) @Tbayer, I still do. In fact I've been receiving these ticket (or series of tickets?) updates since Jul 24th 2015. They supposedly support HTTPS-by-default since Apri...
[19:13:45] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2463578 (10BBlack) Nice! The redirect functionality looks correct. However, the STS header is `strict-transport-security:max-age=31536000`, whereas it should be `strict-transport-secur...
[19:55:31] 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Blog: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2463813 (10Tbayer) >>! In T105905#2463476, @faidon wrote: > @Tbayer, I still do. In fact I've been receiving these ticket (or series of tickets?) updates since Jul 24th 2015. OK, from...
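Returning to the "artificial Age" idea floated at 17:26:45 (now - Last-Modified): it would be computed roughly as below, and, as the conversation concluded, it only reflects the last full fetch or purge, not the 304 revalidations since. The function name is my own:

```python
# The "artificial Age" idea from the log: derive an age from the
# Last-Modified header instead of trusting the (resettable) Age header.
# As noted in the discussion, LM only tracks the last real content
# change/purge, so this over-counts the object's age after 304s.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def artificial_age(last_modified, now=None):
    """Seconds since the Last-Modified timestamp."""
    now = now or datetime.now(timezone.utc)
    return int((now - parsedate_to_datetime(last_modified)).total_seconds())

# Values from the log: LM at the 2016-07-13 purge, polled ~24.4h later.
poll_time = datetime(2016, 7, 14, 17, 14, 23, tzinfo=timezone.utc)
print(artificial_age("Wed, 13 Jul 2016 16:51:30 GMT", now=poll_time))
# 87773 seconds, vs. the reset "Age: 1033" the backend actually reported
```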