[02:23:20] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765427 (10mmodell)
[02:23:27] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765440 (10mmodell) p:05Triage>03Normal
[02:27:17] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765442 (10mmodell)
[02:39:26] 10Traffic, 10Gerrit, 10Operations, 10Phabricator, 10periodic-update: Phabricator and Gerrit: Improve the way that maintenance downtime is communicated to users. - https://phabricator.wikimedia.org/T180655#3765463 (10demon) Ideas for implementation from IRC: * Environment variables--might require Apache r...
[04:37:03] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3765500 (10Krinkle) >>! In T180407#3761847, @BBlack wrote: > Does RL make use of the CP cookie information to use different module-loading strategies for H/1 vs H/2? I remember that being th...
[04:52:25] 10Domains, 10Traffic: Domain Hacks - https://phabricator.wikimedia.org/T180657#3765523 (10UpsandDowns1234)
[05:19:29] 10Domains, 10Traffic, 10Operations: Domain Hacks - https://phabricator.wikimedia.org/T180657#3765537 (10UpsandDowns1234)
[05:25:23] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765540 (10Bawolff)
[05:45:17] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765523 (10greg) w.wiki is the current domain that we'll use for short links, per T108649 (and T108557). Do we need more?
[07:46:51] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765669 (10UpsandDowns1234) I like domain hacks, and they make lives a lot easier. (Goo.gl/e) or (google.com)? (Group.me) or (GroupMe.com)?
[07:48:08] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765672 (10UpsandDowns1234) And w.wiki url shortening is disabled. Bummer.
[07:49:28] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765673 (10UpsandDowns1234) And the idea with url shortening is that the article name will be the thing after the url so that it is easy to enter in.
[08:31:35] bblack: right, especially when it comes to transient storage usage
[08:32:37] I was under the impression that all synth responses would end up there, but that comment seems to imply that only shortlived objects get stored in transient storage
[09:32:10] mmh maybe instead of just waiting a certain amount of time between varnish upgrades we should wait for a certain hitrate
[09:32:27] I'm now trying this:
[09:32:37] while ! varnishstat -n frontend -1 -f MAIN.cache_hit -f MAIN.cache_miss |
[09:32:41] awk '/hit/ {h=$3} /miss/ {m=$3} END { if ((h/(h+m)) < 0.6) { exit 1 } }' ; do
[09:32:44] echo "Waiting for frontend hitrate to reach 0.6" | ts
[09:32:46] sleep 60
[09:32:49] done
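[editor's note: as a standalone script, the loop above would look like the following. A sketch only: it assumes bash, the ts(1) utility from moreutils, and the instance name "frontend" used in the paste, and it adds a guard against dividing by zero on a cold cache:]

    #!/bin/bash
    # Wait until the frontend hitrate, hit/(hit+miss), reaches the target.
    target=0.6
    while ! varnishstat -n frontend -1 -f MAIN.cache_hit -f MAIN.cache_miss |
        awk -v t="$target" '/hit/ {h=$3} /miss/ {m=$3}
            END { if ((h+m) == 0 || (h/(h+m)) < t) exit 1 }'
    do
        echo "Waiting for frontend hitrate to reach ${target}" | ts
        sleep 60
    done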
[09:34:16] maybe 0.6 is ambitious though
[09:39:06] and instead of a single machine/layer we'd need overall DC hitrate as in https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&panelId=8&fullscreen&orgId=1
[09:39:35] godog: can we get numbers out of grafana AFAYK?
[09:42:12] ema: good question, I'm not sure
[09:45:14] ema: alternatively when "varnistats" scripts are ported to prometheus the same data should be available there too afaiui
[09:46:57] 10Domains, 10Traffic, 10Operations: Purchase domains mediawi.ki and media.wiki to use as a url shortener - https://phabricator.wikimedia.org/T180657#3765523 (10Dzahn) Please see T88873, T88873#1691739 (and T105829) for a long history of buying .wiki domains (that never ended up being used).
[09:47:54] godog: yes but I'd need, eg, the upload-ulsfo aggregate
[09:48:20] rather than host-specific values
[09:52:38] ema: hah, in that case it should actually already be possible
[09:53:01] which also raises the question of what to do if the query fails for some reason
[10:05:52] godog: tell me more about the "already possible" part :)
[10:07:13] ema: I was thinking that if you are looking for cache_hit and cache_miss, those are already in prometheus
[10:07:27] unlike the numbers in the varnish-caching dashboard
[10:09:06] so you can already query for e.g. ulsfo upload frontend hitrate
[10:11:25] right, is there some kind of api to get those values out of prometheus, let's say from a script running on a cache host? or neodymium with cumin integration (hehe)
[10:13:50] ema: yeah, look at what check_prometheus_metric does in puppet
[10:14:34] unexciting but essentially curl | jq
[10:14:55] oh yeah true!
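[editor's note: a minimal sketch of the "curl | jq" approach godog describes, for the DC-wide aggregate ema wants. The Prometheus hostname and the exporter metric/label names (varnish_main_cache_hit, site, layer) are illustrative assumptions here, not verified production names:]

    # Instant query against the Prometheus HTTP API (/api/v1/query) for an
    # aggregated frontend hitrate; hypothetical server and label names.
    prom='http://prometheus.example.org/api/v1/query'
    q='sum(rate(varnish_main_cache_hit{site="ulsfo",layer="frontend"}[5m]))
       / sum(rate(varnish_main_cache_hit{site="ulsfo",layer="frontend"}[5m])
           + rate(varnish_main_cache_miss{site="ulsfo",layer="frontend"}[5m]))'
    curl -sG "$prom" --data-urlencode "query=${q}" | jq -r '.data.result[0].value[1]'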
[10:47:18] * ema clicks around https://grafana.wikimedia.org/dashboard/db/varnish-daemons-hitrate proud of his grafana-fu
[11:43:50] <_joe_> ema: interesting data in that graph
[11:44:40] <_joe_> for instance what's making the hitrate limit for upload frontends almost hard-capped at 82-85%?
[11:45:32] <_joe_> it's almost like we do something that deliberately causes that
[11:47:08] _joe_: off the top of my head, for one thing we do not cache objects bigger than a certain size
[11:47:35] <_joe_> yeah but that won't make up 20% of requests, will it?
[11:47:41] <_joe_> I hope at least
[11:48:12] <_joe_> I was thinking of storage inefficiencies, but that would assume the traffic patterns are constant and equal in all DCs
[11:48:36] no, that was just one of the factors that came to mind
[11:48:44] <_joe_> maybe we have 20% of objects we serve through upload that are not cacheable according to the actual backend?
[11:49:29] it's a very complicated topic involving: cacheability of objects, admission policies, evictions, purges, storage, ...
[11:49:32] <_joe_> anyways, just curious, but I suspect understanding this could unveil something unexpected :)
[11:49:42] <_joe_> yeah I know, that's why it's interesting
[11:49:44] <_joe_> :)
[11:49:48] very!
[11:49:56] <_joe_> if it was simple it wouldn't be.
[11:54:38] lunch &
[13:13:14] _joe_: that graph he linked is the frontend only, which has very limited storage. the backends with larger storage pick up most of the rest to bring it to ~95%
[13:14:21] well, that's text, not quite 95. also, there are different definitions of "hitrate" :)
[13:14:33] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&var-cluster=text&var-site=esams
[13:14:40] ^ same cluster/site (text+esams)
[13:15:31] which shows "True Hitrate" being ~87% at the frontend and ~95% when including backends
[13:15:53] but then the "Disposition" graph is a different sort of view
[13:16:34] Hit: ~74% Miss: ~4% Pass: ~11% Int: ~12%
[13:16:59] the True-Hitrate numbers are checking hit/(hit+miss)
[13:17:37] Pass are ones that we can deduce are really uncacheable (due to backend response headers, or our own VCL rules about the inbound query)
[13:17:48] Int is internally-generated responses from Varnish (errors, redirects)
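[editor's note: those two sets of numbers agree, since Pass and Int fall outside hit/(hit+miss). A quick check:]

    # Disposition above: Hit ~74%, Miss ~4% of all requests.
    awk 'BEGIN { h = 74; m = 4; printf "hit/(hit+miss) = %.3f\n", h / (h + m) }'
    # prints 0.949, i.e. the ~95% overall True Hitrate quoted earlier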
[13:19:30] upload@esams in the same graph has a bigger spread in true-hitrate, with the FE averaging ~81% and the overall at ~96%
[13:20:30] (because cache_upload's data has more total objects of larger average size, and our memory-size-limited FE caches hard-cap the object sizes they're willing to cache in the name of preferring object-hitrate to byte-hitrate there)
[13:23:18] (note also the pass/int rates in Disposition for esams/upload are small. 0.1% pass, 1.0% int)
[13:24:32] upload's dataset is very cacheable, it's just also very big in every sense. the frontends are just doing their best to offload the largest percentage of requests they can, mostly on smaller objects.
[13:25:17] and even backends with their larger storage cannot actually contain the whole dataset well enough to completely prevent churn, but they get close. close enough that adding more storage at this point produces greatly diminishing returns.
[13:29:32] chasing the final ~4% of misses on cache_upload is ... tricky. known miss causes: actual new/cold objects, all objects >=1GB (we refuse to cache at every layer), lack of URL encoding normalization (at least, theoretically this could cause some of it), our constant cron restarts of backends to clear up buggy varnish behavior, which wipes cache contents. Edge-cases/races on inter-layer expiry just as objects are expiring out.
[13:30:31] that latter bit being about the expiry boundary. when backend1's copy of the object reaches internal ttl =~ 0, and another cache fetches the object from it right around that moment... there are some edge cases there that aren't ideal.
[13:47:15] and a pie chart too! https://grafana.wikimedia.org/dashboard/db/varnish-daemons-hitrate?orgId=1&panelId=3&fullscreen
[13:48:14] hitmiss showing up on misc instead of hitpass
[13:53:55] also note that the "uncacheable" or "pass" on upload@eqiad backends is deceptive
[13:54:11] most of that is the traffic we're choosing to pass rather than miss+fetch from remote DCs
[13:54:30] (so they're not actually uncacheable objects, they're just pass-by-choice for more-optimal global behavior)
[14:05:51] bblack: re: using a simple if(ttl<=0){hfm} in https://gerrit.wikimedia.org/r/#/c/391171/5/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[14:06:17] you were saying that it might complicate miss/pass statistics
[14:06:41] however, there's an explicit hit_misspass counter on which we could rely
[14:07:08] oh but there's the X-Cache ones mmh
[14:08:03] hit_misspass? what is that? :)
[14:08:25] hits on HFMs and HFPs?
[14:08:26] haha I meant hitmiss
[14:09:45] it's the number of hits on HFMs, as opposed to hitpass, the number of hits on HFPs
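[editor's note: those counters can be read directly with varnishstat. A sketch, assuming the Varnish 5.x counter names implied by the discussion (MAIN.cache_hitmiss / MAIN.cache_hitpass):]

    # Hits on hit-for-miss vs. hit-for-pass objects on the frontend instance:
    varnishstat -n frontend -1 -f MAIN.cache_hitmiss -f MAIN.cache_hitpass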
[14:10:00] right, but it's still potentially a source of statistical ambiguity
[14:10:32] do we call those misses for hit/(hit+miss) purposes, or do we consider them more like passes (outside the scope of true hitrate)
[14:11:29] I'd consider them as passes, as their only goal is to avoid request coalescing really
[14:11:48] if you ignore the whole issue about conditional request pass-through, basically there's no reason to ever use HFP, may as well turn them all into HFM. Except then a bunch of pass-y traffic now counts as misses.
[14:12:09] if you count them as passes, well, sometimes that's just not true and so it distorts our stats viewpoint
[14:12:27] e.g. when the exp() admission policy chooses to HFM a cacheable object, that really is a true miss at that layer
[14:13:12] yeah
[14:13:45] maybe a more meaningful way to think of this is the 'terminal-layer' approach. Is the request served at this layer, or is it passed further down the stack?
[14:14:28] and then you have hit/(hit+non_hit) as the hitrate formula
[14:15:25] yeah well there's int too
[14:16:09] yes, and pass
[14:16:57] the two layer-views that make sense are to view each layer in isolation, or to take a global view of the overall disposition like X-Cache-Status
[14:17:08] I don't know that it would produce meaningful results to blend layers in other ways
[14:20:08] are we saying that there are two classes of HFM? True misses (eg: exp) and passes (eg: ttl<0)?
[14:21:20] because if that's what we're saying for statistical purposes then we can maybe set X-CDIS accordingly
[14:24:37] I'm not sure
[14:24:54] depends how we sort out the HFM-vs-HFP issue on our existing uncacheables :)
[14:25:46] if we stick to the simple rule that existing HFPs stay as HFPs, and existing uncacheable+ttl=0 become HFMs with some non-zero ttl, then nothing changes and I think our pass/miss stats stay as correct as they are today.
[14:26:34] most such cases in our VCL today are fairly simple to convert one way or the other
[14:27:05] except for https://gerrit.wikimedia.org/r/#/c/391171/5/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[14:27:27] (the hard case we've been discussing)
[14:28:05] that's a case where we've been using HFP because there was no better option before, but some of the cases passing through that if-condition really don't belong as HFPs, and we need to re-think how we handle this.
[14:29:42] to recap some of the things I said about this case earlier:
[14:30:41] it used to be if(ttl<=0) { hfp; }, but this was causing undesirable hfps on what should be cacheable URLs sometimes, due to transient 5xx on cacheables, and expiry boundaries + grace mode.
[14:31:45] so now it's if(ttl<=0 && !5xx && there-was-not-another-varnish-beneath-us-indicating-a-hit) { hfp; }
[14:32:20] the third part of that conditional is kinda meh. it papers over some of the cracks, it's not ideal.
[14:33:30] earlier in the same vcl subroutine we have this:
[14:33:33] /* Don't cache private, no-cache, no-store objects */
[14:33:33] if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
[14:33:33]     set beresp.ttl = 0s;
[14:33:33]     // translated to hit-for-pass below
[14:33:35] }
[14:33:59] ("translated to hit-for-pass below" is referring to the block we've been talking about)
[14:34:51] in a world where HFM is an option, we could/should do better about this stuff
[14:35:53] the Cache-Control check (does it need expanding or refinement on that regex?) should result in a real HFP. CC indicates this is uncacheable.
[14:36:16] 5xx's shouldn't really generate HFM or HFP
[14:36:53] otherwise ttl<=0 without a CC header that says uncacheable should probably generate an HFM
[14:41:12] (5xx's may have CC:no-cache from the backend too, so it's important to pull apart that specific case in the above thinking)
[14:41:24] maybe we should log the two different cases (CC header that says uncacheable vs. ttl<=0) and see how many objects belong to each group
[14:42:31] just to understand the dataset a bit better
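[editor's note: a sketch of that logging, in the same style as the varnishncsa command ema pastes below. The VSL query syntax is real, but the exact invocations are illustrative; the ttl<=0-without-CC case has no response header to match on, so raw TTL records are one way to inspect it:]

    # Case 1: responses the backend explicitly marks uncacheable via Cache-Control:
    varnishncsa -g request \
        -q 'BerespHeader:Cache-Control ~ "(private|no-cache|no-store)"' \
        -F '%s %r %{Cache-Control}o'
    # Case 2: ttl<=0 without such a CC header; eyeball the TTL records instead:
    varnishlog -g request -i TTL -i BerespHeader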
[14:42:35] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766633 (10Aklapper) 105.66.130.22 is another Moroccan mobile (?) IP that could register on Phab. Might welcome investigation too.
[14:51:12] for purposes of cumin aliases, do these classes count as "nonprod"?
[14:51:15] or O:authdns::testns or O:cache::canary
[14:51:25] yes
[14:51:35] ok, cool, i'll add that with another minor change
[14:51:40] https://gerrit.wikimedia.org/r/#/c/391721/2/modules/profile/templates/cumin/aliases.yaml.erb
[14:52:11] nice
[15:21:58] mmh
[15:22:36] I'm looking at CC:uncacheable responses on upload, expecting to see basically none
[15:22:45] varnishncsa -g request -q 'BerespHeader:Cache-Control ~ "(private|no-cache|no-store)"' -F '%r %{Cache-Control}o'
[15:22:51] GET http://upload.wikimedia.org/wikipedia/en/thumb/2/2a/BYUtv_2010_logo.svg/200px-BYUtv_2010_logo.svg.png HTTP/1.1 no-cache
[15:23:05] [...]
[15:23:19] any reason why thumbnails would be CC:no-cache?
[15:24:25] ah, I forgot to print the status codes
[15:24:39] they're all 404/429
[15:25:13] still, we might want to cache the 404s I guess
[15:26:07] we do cache 404s in VCL
[15:26:35] I guess swift or thumbor (thumbor in your example above) are emitting no-cache for them though
[15:26:59] heh, the cacheable 404 issue came up in the past, iirc thumbs' 404s are not cacheable due to races at upload time, so after upload the user browser tries to load the thumbs but they are not there yet, and if the 404 is cached then it'll take a "long" time to recover
[15:27:20] I'll find the task, though things have changed enough that we should reconsider
[15:27:27] well the same issue exists with new articles or logo images, etc on the wikis too
[15:27:43] the acceptable tradeoff there was that the 404s are cacheable, but we cap caching them at like 10 minutes
[15:28:22] the same 404=10m cap applies on upload as well, if they're allowed to be cached by the backend at all
[15:29:50] ah ok, good to know!
[15:29:51] either way though, the pass-rate on upload is tiny, it's not a huge concern the way it is
[15:30:45] indeed, thumbor peaks at sth like 80/s for 404s
[15:31:25] swift sth similar, twice that
[15:32:06] btw, godog, do you have any insights on the issue of canonical URL encoding of upload.wikimedia.org URLs? :)
[15:33:01] skipping over 95% of all the lead-up mess, the bottom line about that topic from upload's perspective is this:
[15:33:34] for a given cache_upload image URL, there is a single representational encoding considered canonical by MediaWiki (e.g. whether certain special chars are %-encoded or not)
[15:33:46] it differs from the canonical encoding used by cache_text for actual wiki traffic
[15:34:18] this can be observed in this example:
[15:34:23] https://commons.wikimedia.org/wiki/File:Sweet_William_in_Aspen_(91167).jpg
[15:34:48] ^ correct MW encoding-normalization for that file (parens are un-encoded)
[15:35:04] yet the image links on that page point to: https://upload.wikimedia.org/wikipedia/commons/4/40/Sweet_William_in_Aspen_%2891167%29.jpg
[15:35:10] (parens are %-encoded)
[15:35:51] my observation is that Swift and cache_upload are tolerant of basically any aliasing there and accept all possible legitimate encodings
[15:36:10] so the same object is also fetchable as:
[15:36:13] https://upload.wikimedia.org/wikipedia/commons/4/40/Sweet_William_in_Aspen_(91167).jpg
[15:36:55] and:
[15:37:02] https://upload.wikimedia.org/wikipedia/commons/4/40/Sweet_William_in_%41spen_(91167).jpg
[15:37:12] (your browser will auto-correct that one, but curl can send it and it works)
[15:38:36] so I'm trying to figure out, basically, what MediaWiki's encoding rules are when it generates its "canonical" upload.wikimedia.org URLs, and why those rules make sense (e.g. if there are some issues on the Swift/Thumbor side that require certain chars to be encoded?)
[15:40:35] off the top of my head I don't recall any restriction like that around percent encoding in thumbor/swift
[15:40:54] so to be clear, the issue seems to be that mw is "over-encoding" characters that shouldn't be
[15:41:07] the only way I know to be sure is to try both encoded and un-encoded forms of every character, but first I have to find legitimate image URIs that contain examples of each heh
[15:41:18] but then when e.g. swift decodes the url it still works, because in swift it is stored with () in your case above
[15:41:46] godog: well in the case of those parens it's over-encoded on the MW side. I don't know what the whole encoding map is, there might be under-encoded cases vs MW's normalization, too.
[15:42:29] and yeah, I don't know what Swift does with it either. When it's first stored to swift, does it have the ( or the %28?
[15:42:56] (is there a canonical internal representation in Swift, or does swift re-encode some things based on the URL it got from MW? I have no idea about any of it at this point)
[15:43:24] so the object in swift is stored with (), Object: 4/40/Sweet_William_in_Aspen_(91167).jpg
[15:43:31] ok
[15:44:37] https://www.mediawiki.org/wiki/Manual:$wgIllegalFileChars was linked to me earlier, too
[15:45:03] which states that in addition to all the usual title rules, File:Foo in MW doesn't allow Foo to contain any of : / \
[15:45:36] (so they're converted to dashes on creation)
[15:46:16] and therefore, I'm guessing I won't be able to find any examples of legitimate upload URIs containing those characters un-encoded, although I'm not sure about encoded variants I guess.
[15:48:03] anyways, I guess the only real answer is "try it all and observe", I'll have to search and find some example URIs containing all the allowed characters and see how it looks in practice
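[editor's note: one way to "try it all and observe" for this specific example: fetch the same object under the three encodings from the conversation and compare status and cache disposition. A sketch only:]

    # HEAD the three aliases and show the status line plus X-Cache* headers:
    base='https://upload.wikimedia.org/wikipedia/commons/4/40'
    for name in \
        'Sweet_William_in_Aspen_%2891167%29.jpg' \
        'Sweet_William_in_Aspen_(91167).jpg' \
        'Sweet_William_in_%41spen_(91167).jpg'
    do
        echo "== ${name}"
        curl -sI "${base}/${name}" | grep -iE '^(HTTP/|x-cache)'
    done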
[15:48:29] *nod* so at swift upload time the container/object names must be urlencoded, perhaps mw is linking to the same objects with the same encoding to match
[15:48:34] but it doesn't strictly have to be
[15:48:53] what does "urlencoded" mean in this context?
[15:49:08] (because the rules for which chars need %-encoding vary a lot and are basically application-specific, in the general case)
[15:50:32] godog: for when I dig more on this later, do you have a handy command for the Object: lookup you did earlier, based on I guess the x-object-meta-sha1base36 header value I observe in varnish?
[15:50:50] (so I can see how they're internally stored)
[15:51:43] bblack: yeah, it involves essentially https://wikitech.wikimedia.org/wiki/Swift/How_To#Impersonating_a_specific_swift_account
[15:52:05] bblack: in the above the account used is "dispersion", you can use the "mw" account instead
[15:52:22] bblack: and then https://wikitech.wikimedia.org/wiki/Swift/How_To#Show_specific_info_about_a_container_or_object
[15:52:31] so what I did above was
[15:52:32] root@ms-fe1005:~# swift stat wikipedia-commons-local-public.40 '4/40/Sweet_William_in_Aspen_(91167).jpg'
[15:52:50] ironically the filename is quoted because bash
[15:53:48] bblack: to answer your previous question, I didn't check the code but I'm assuming we're talking urllib.quote() type of urlencoding
[15:56:43] python?
[15:56:57] or php? I donno, I'll look around
[15:58:23] python yeah
[15:58:34] on the swift side that is
[15:58:39] right
[15:58:46] I wonder what MW is doing before it sends it over to Swift, though
[15:59:15] anyways, meeting-time
[15:59:20] * godog too
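[editor's note: an aside that may explain the over-encoded links: Python 2's urllib.quote leaves only letters, digits, "_.-" and "/" un-encoded by default, so it %-encodes parens. If MW (or the tooling between MW and Swift) uses a quote()-style encoder, that would reproduce the %28...%29 form seen above:]

    # urllib.quote %-encodes '(' and ')', matching MW's canonical upload links:
    python2 -c "import urllib; print urllib.quote('4/40/Sweet_William_in_Aspen_(91167).jpg')"
    # -> 4/40/Sweet_William_in_Aspen_%2891167%29.jpg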
[16:18:04] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766948 (10Dispenser) 05Open>03Invalid The team squandered a perfect opportunity where a WP0 pirate broke the ISP blackholing, registered an account on mediawiki.org, and f...
[16:19:16] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766950 (10Aklapper) @Dispenser: See T174342#3737108 - I don't know who "the team" and why this task would be invalid.
[16:29:24] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3766994 (10Dispenser) @Aklapper The Checkuser information is irrecoverably gone and thus the task can no longer be completed and Invalid. You can change it to decline if you t...
[17:00:56] 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Puppet, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3767077 (10Train2104) This is breaking numerous...
[17:10:51] 10Wikimedia-Apache-configuration, 10Operations, 10Mobile, 10Puppet, 10Readers-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3767127 (10BBlack) Do we have answers about wha...
[17:12:49] ok so DNS stuff, let's start with the eqsin/ulsfo simulation bit
[17:13:19] https://gerrit.wikimedia.org/r/#/c/391357/
[17:14:20] XioNoX: lemme know when you're ready-ish
[17:14:32] Always ready!
[17:14:55] ok
[17:16:02] 10netops, 10Operations, 10monitoring, 10User-fgiunchedi: Backfill librenms data in graphite with historical RRDs - https://phabricator.wikimedia.org/T173698#3767149 (10ayounsi) Looks like the tools from the python-whisper package are good enough to tackle the conversion and backfilling ( https://github.com...
[17:17:31] bblack: who !log and who merges?
[17:18:47] XioNoX: I'll log it, do you want to merge, or?
[17:20:12] sure
[17:20:25] ok
[17:21:12] running authdns-update in a few sec
[17:21:46] running...
[17:26:58] 10Traffic, 10Operations: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712#3767222 (10ema)
[17:27:07] 10Traffic, 10Operations: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712#3767237 (10ema) p:05Triage>03Normal
[19:49:03] 10Traffic, 10Operations, 10Phabricator, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3767879 (10Aklapper) 05Invalid>03Open My point was that Checkuser info isn't the only source as I've posted IP ranges in T174342#3737108 and T174342#3766633 which makes thi...
[23:47:06] 10Traffic, 10Operations, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3768594 (10BBlack)