[08:37:12] 10Traffic, 06Operations, 13Patch-For-Review: graphite.wikimedia.org 503s on some css/js resources - https://phabricator.wikimedia.org/T135515#2301423 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi confirmed this is fixed on the graphite side, thanks @BBlack @ema ! re: mod_deflate, @elukey was mention...
[11:33:22] 10netops, 06DC-Ops, 06Operations, 10ops-esams: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2304535 (10mark)
[12:23:08] I've opened https://github.com/varnishcache/varnish-cache/issues/1956
[12:37:38] ema: awesome
[12:52:23] bblack: do we have a version of https://upload.wikimedia.org/wikipedia/commons/thumb/e/e8/Latency_map_world.png/800px-Latency_map_world.png including codfw?
[13:01:11] ema: not that I'm aware of. I think faidon built that from RIPE data
[13:02:43] ema: related: https://phabricator.wikimedia.org/T114659
[13:03:01] the little map I colored in there is what we used, but it was just a random guess based on geography
[13:04:10] for the no-failover case, we could do a little better with sending everyone to their optimal latency endpoint (or at least, fairly close)
[13:04:48] it would be nice, from the same data, to have a secondary choice for each country (or US state / CA province) too, but we can make reasonable guesses at the secondary choices based on primary
[13:05:12] (reasonable guesses meaning ulsfo fails to codfw, esams to eqiad, eqiad to codfw, codfw to eqiad)
[13:06:11] probably the most interesting thing to look at for primary choices vs where we're at today is Mexico and everything south of it through South America, and the islands nearby. see if any of them currently on eqiad are better at codfw
[13:06:12] this eventually found its way to https://github.com/RIPE-Atlas-Community/datacentre-latency-map
[13:06:20] but it'll need data, and we don't have probes set up yet
[13:06:27] I think we did map Mexico to codfw already as a good guess
[13:08:34] I'm always tempted to expand codfw's reach in the US states too, to give it more load, since it gets so little
[13:08:48] it would steal some load from eqiad that way, too, which might help when esams goes there
[13:09:01] but officially, our model is still not to consider load, only latency
[13:10:56] one thing we could do that doesn't violate that model, because it's outside the model, is switch our fallback primary choice (when geoip really gives us nothing to go on) from eqiad to codfw
[13:11:17] I'm really not sure what % of traffic falls in that bin
[13:19:53] while looking at config-geo I've discovered the existence of Jersey
[13:20:42] :)
[13:20:46] (not the new one, the old one)
[13:21:42] while looking at config-geo, after having this conversation, I've realized our config-geo has some notable faults :)
[13:22:20] we have a global 'default' key, but we don't have defaults anywhere else in the hierarchy
[13:22:29] especially in the EU it probably matters
[13:23:07] we have every EU country we've named listed as esams-first, but if there's any EU traffic that falls into a country we forgot, or all geoip knows is that the IP is in the EU continent but no better info, it's going to use the global default and go to eqiad
[13:23:24] ah!
[13:23:51] OC and AF only have a continent-level setting anyways, so they're ok
[13:24:16] having NA use the global default is probably fine, since the global default is also eqiad,codfw (or reverse if we do the above, but either way ok)
[13:24:44] but EU and AS could use a default entry
[13:25:09] AS is tricky since we split that area between esams and ulsfo mostly. but either way eqiad's not ideal, I don't think.
[13:26:00] probably not
[13:27:06] oh, there is a virtual "country" under "AS" for "AP" (Asia-Pacific)
[13:27:32] (which is already ulsfo)
[13:28:15] any specific reason for having max 3 entries in the failover list?
[13:28:18] it would be nice to know how much of Asia traffic ends up with no country ID, and then whether most of the AP stuff ends up in AP with mostly non-AP stuff falling all the way to AS, etc...
[13:29:16] ema: because codfw+eqiad are primaries, and every list we have is either just the primaries in either order, or it's a cache-only DC followed by the two primaries.
[13:29:36] ema: in theory you could put the 1-2 missing cache-only DCs after the primaries in every set. But if both primaries are dead we're dead anyways.
[13:30:17] right
[13:30:32] (although you could make the argument that we might want to shut out geodns user traffic from both primaries and force it all to the cache DCs only for some strange reason, which bypasses that argument, since they're actually up and we just don't want users hitting them directly)
[13:30:39] (but I have trouble imagining such a scenario)
[13:31:00] (maybe as a stress-test?)
[13:31:47] or because $mild_disaster is happening at both primaries
[13:31:56] yeah
[13:32:19] I'm not sure what kind of mild disaster would make us move users to cache DCs only, but leave applications working fine through the cache layers at one or both.
[13:32:40] we could stick them in there just for completeness :)
[13:34:37] why not!
[13:35:02] I'll do that
[13:35:03] :)
[13:35:14] wait a sec before you do, to avoid diff conflicts
[13:35:18] sure
[13:37:05] I'll merge at least the EU one, I'm not sure about AS, it can wait and conflict after yours :)
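A rough Python sketch of the config-geo lookup hierarchy being discussed: a country entry wins if present, otherwise a per-continent 'default', otherwise the global 'default'. This is an illustrative model only (the real file is a gdnsd geoip map, not Python), and the DC lists shown are just the examples mentioned in the conversation above.

# Toy model of the config-geo fallback hierarchy (illustrative only).
GEO_MAP = {
    'default': ['eqiad', 'codfw'],               # global default
    'EU': {
        'default': ['esams', 'eqiad', 'codfw'],  # the proposed continent-level default
        'NL': ['esams', 'eqiad', 'codfw'],
        'IT': ['esams', 'eqiad', 'codfw'],
    },
    'NA': {
        'MX': ['codfw', 'eqiad'],                # Mexico already mapped to codfw
        # (US/CA have per-state/province entries in the real map)
    },
    'AS': {
        'AP': ['ulsfo', 'codfw', 'eqiad'],       # the virtual Asia-Pacific "country"
    },
}

def resolve(continent, country):
    """Return the DC failover list for a client: country entry, then
    continent default, then global default."""
    cont = GEO_MAP.get(continent, {})
    return cont.get(country) or cont.get('default') or GEO_MAP['default']

# Without the EU 'default' entry, an EU country missing from the map (or a
# lookup that only resolves as far as the continent) falls all the way
# through to the global default and lands on eqiad instead of esams.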
[13:38:14] bblack, ema: nginx on the cp* hosts needs a restart for the expat update, do you want to make the restarts so that it doesn't interfere with other traffic work (or just tell me when the time is right and I'll do it)
[13:38:49] moritzm: I'll do it shortly, thanks
[13:39:31] ok, thanks
[13:42:13] not many RIPE probes in China unfortunately, perhaps the west might be better served by esams?
[13:42:57] ema: possible
[13:43:30] on the other hand, as much as we'd theoretically like to be operating on a latency basis only, esams already has a ton of load :)
[13:44:11] I doubt my default-switch for EU will make much percentage impact, just improve things for a few minor edge cases. But a chunk of China might :)
[13:44:13] yeah, adding China to it might not be wise
[13:46:49] interesting how we distinguish between, say, the Holy See and Italy, but Russia is all the same :P
[13:49:58] :)
[13:50:32] well, Holy See is technically a separate "country" (maxmind countries are sometimes different than real countries, but not in this case)
[13:50:41] we haven't ever mapped sub-country outside of US/CA
[13:51:34] I seem to recall faidon and I did some manual checks on Russia once, using Vladivostok as a test point
[13:51:51] since it's the far-east edge of Russia, and a major city with universities and fiber landing, etc
[13:52:18] I think it was inconclusive? I really don't remember. I remember we were baffled at the results a bit, since you'd expect ulsfo to be significantly faster for it
[13:52:33] http://blog.wikimedia.org/2014/07/09/how-ripe-atlas-helped-wikipedia-users/ says esams is better
[13:53:10] (counterintuitively)
[13:53:12] ah yes, apparently those results are recorded somewhere :)
[13:54:06] on a completely different note, I'm now trying to port my working varnishxcache script to varnishxcache4...
[13:54:19] it's a response header, one that starts out with a value from the backend, and then we edit it in VCL...
[13:54:35] so the naive approach that's like: varnishlog -n frontend -I RespHeader:^X-Cache:
[13:54:49] gets you 2x X-Cache lines per request (one before modification, one after that we really want)
[13:55:08] do we have some generic solution for this? to make varnishlog or python-varnishlog only pay attention to the last version of something?
[13:56:56] mmmh
[13:57:19] I don't think we have one, no
[13:58:19] what we really want is the old behavior: don't trace every modification, just show the final result headers after VCL is all done. I'm surprised varnishlog4 doesn't have a mode for that :/
[14:00:15] anyways, I can hack around it, but it's ugly
[14:01:33] yep
[14:03:44] hmmm but the corner cases suck
[14:04:11] the only way to know is that we get 2x X-Cache outputs for every transaction, and they share a transaction ID, and we want the second one
[14:04:51] but in theory, it's possible for a request to only have a first one and not a second one I think (if the backend responds with X-Cache, and then we take a synthetic error path and never go through deliver?)
[14:04:54] is there perhaps a way to revert to the old behavior by changing grouping?
[14:05:03] I tried, doesn't seem so
[14:05:22] I could do a cache of seen transaction IDs, and in the normal case it would stay very small
[14:05:51] if the xid is not in the cache, ignore this record and insert the xid into the cache. if the xid is in the cache, delete it from the cache and use this record.
[14:06:09] and some kind of timeout so half-done items get culled from the list if they sit there for, say, 5 minutes or longer.
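A minimal sketch of the seen-xid approach described above, assuming the caller is already parsing (transaction ID, X-Cache value) pairs out of the varnishlog output; the XCacheDedup class and its feed()/cull() interface are made up for illustration, not the actual varnishxcache4 code.

import time

class XCacheDedup:
    """Keep only the second (post-VCL) X-Cache record seen per transaction."""

    def __init__(self, timeout=300):
        self.timeout = timeout   # seconds before a half-done xid is culled
        self.pending = {}        # xid -> timestamp of the first sighting

    def feed(self, xid, value):
        """Return the final X-Cache value for xid, or None if this is only
        the first (pre-VCL, backend-supplied) sighting."""
        if xid in self.pending:
            # Second sighting: this is the post-VCL header we actually want.
            del self.pending[xid]
            return value
        self.pending[xid] = time.time()
        return None

    def cull(self):
        """Drop transactions that never produced a second record."""
        cutoff = time.time() - self.timeout
        for xid in [x for x, ts in self.pending.items() if ts < cutoff]:
            del self.pending[xid]

In the normal case the pending dict stays tiny (one entry per in-flight transaction); cull() would be run periodically to handle the corner case where a request only ever emits a single X-Cache record.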
[14:07:03] alternatively, we could fix this in a more natural way by changing how we do X-Cache
[14:07:37] we could have all backend instances use X-Cache-Internal or something, and frontend VCL uses that (if it exists) to build X-Cache for the client just once and deletes the internal one.
[14:07:44] then X-Cache is only set once on frontends
[14:08:10] that sounds good
[14:08:21] is there anything depending on backends setting X-Cache?
[14:08:27] stats, or so?
[14:08:33] no, it's purely informational output so far
[14:08:38] great
[14:08:51] and nothing looks at it but humans and analytics via varnishkafka, but that's frontend-only too
[14:09:24] varnishkafka has its own way of dealing with this, since it's recording a bunch of stuff from each txn and lets later copies overwrite earlier ones, IIRC
[14:10:12] at some point we'll need VCL-based cache loop detection too as a protective measure, which X-Cache might've been suited to.
[14:10:25] but we can either use X-Cache-Internal for that, or do something completely separate
[14:10:49] X-Been-Here-Already
[14:10:53] :)
[14:11:28] naively, we should be able to have all backends search X-Cache for their own hostname (before they modify X-Cache in this request), and 503 if they see their own name
[14:11:35] or whatever error we pick
[14:12:46] (the loop protection is because eventually, to get where we need to be, inter-cache routing will be per-application-service, and there are all kinds of ways that can theoretically go wrong with traffic going both ways between eqiad<->codfw for different applications based on req.url/hostname patterns, etc
[14:13:18] )
[14:14:41] in that model, we'll still set a cluster-global route table of paths between caches. but it will have an explicit loop in the route table for eqiad->codfw and codfw->eqiad
[14:15:01] and per-appservice data decides which of eqiad or codfw (or both) are usable for applayer traffic for that service
[14:15:17] and then the traffic drops out to the applayer wherever it makes sense for that service, as it traverses the routes.
[14:15:36] (that also theoretically supports deploying some services in remote cache DCs too)
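A toy Python model of the routing scheme just described: a cluster-global route table with the explicit eqiad<->codfw loop, per-appservice data saying where the applayer can take the traffic, and loop protection analogous to a cache erroring out when it finds its own name in X-Cache. The service names and the route_request() helper are hypothetical; the real mechanism would live in VCL and its generated config, not Python.

# Cluster-global route table of paths between caches (note the
# intentional eqiad<->codfw loop).
CACHE_ROUTE = {
    'ulsfo': 'codfw',
    'esams': 'eqiad',
    'eqiad': 'codfw',
    'codfw': 'eqiad',
}

# Per-appservice data: which DCs can serve this service at the applayer
# (hypothetical example services).
APPLAYER_DCS = {
    'service-a': {'eqiad'},
    'service-b': {'eqiad', 'codfw'},
}

def route_request(entry_dc, service):
    """Follow the cache route from entry_dc until a DC can hand the request
    to the applayer for this service; bail out if we'd revisit a cache."""
    seen = set()
    dc = entry_dc
    while dc not in APPLAYER_DCS.get(service, set()):
        if dc in seen:
            raise RuntimeError('cache routing loop detected at ' + dc)
        seen.add(dc)
        dc = CACHE_ROUTE[dc]
    return dc

# route_request('ulsfo', 'service-a') -> 'eqiad' (via codfw)
# route_request('esams', 'service-b') -> 'eqiad'
# A service with no usable applayer DC trips the loop check instead of
# bouncing between eqiad and codfw forever.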
[14:27:48] is there more work to do on T114659?
[14:27:49] T114659: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659
[14:37:38] ema: I donno really. the work done was a guess, we have no latency stats to support it. but really, we should be re-taking latency stats on the whole world (but mostly, the NA+SA side of the world) to check out what to move to codfw in the broader sense
[14:37:52] maybe close off that one and start a new one that's broader than NA. or re-title it.
[14:39:02] probably close makes more sense. it's old, and it was "done" well enough for the time.
[14:44:11] it's hard to imagine much changing for the rest of the world in the other hemisphere, since codfw is in between existing DCs from their pov
[14:44:33] but re-running the whole world is a good idea anyways. network conditions may have changed since the last run.
[14:59:42] bblack: are we routing India to eqiad instead of esams for load reasons?
[15:00:20] (I've started updating config-geo)
[15:02:28] ema: I think so
[15:03:17] ema: https://gerrit.wikimedia.org/r/#/c/257843/2
[15:03:36] I think that was just a re-write of a much earlier patch, where we had notes about how we were waiting on esams link improvements first
[15:04:09] ok
[15:04:58] paravoid: ^ do you remember if we're stalling that waiting on the new esams link(s)?
[15:08:50] yes
[15:08:51] we should
[15:08:57] we're paying a lot of money for overage charges right now
[17:49:18] https://grafana.wikimedia.org/dashboard/db/varnish-caching :)
[17:50:06] bblack: nice!
[17:51:05] this goes straight to my slides :)
[17:52:48] maybe it will look better after it gets a day or two of history :)
[17:56:17] bblack: I've added all DCs to config-geo, hopefully without breaking anything https://gerrit.wikimedia.org/r/#/c/289433/
[17:59:04] see you tomorrow o/
[18:00:03] cya!
[18:48:52] bblack: I've got to go, but quickly before I do...
[18:48:59] we really should investigate this purge storm a little more
[18:49:16] it seems like it's all wikidata-derived pages, mostly for certain persons
[18:49:29] look at this pattern: https://www.wikidata.org/w/index.php?title=Q6044835&action=history
[18:49:32] this may be the cause
[18:49:36] or at least part of the cause
[18:49:49] another interesting data point that may or may not be related is
[18:50:11] the eqiad-knams link's usage
[18:50:36] https://librenms.wikimedia.org/graphs/to=1463597400/id=4113/type=port_bits/from=1432061400/
[19:47:25] paravoid: which purge storm?
[19:53:47] 10Traffic, 06Operations, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2306727 (10BBlack) cp3048 today: 181G virtual, 88G resident
[20:00:59] paravoid: (by that I mean... it's been crazy since Dec, it doesn't seem much crazier now)
[20:37:33] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306885 (10Ottomata) Grr, these are getting close to full. Luca and I tried to dynamically set topic retention, but kafka d...
[20:46:11] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306971 (10Ottomata) I take it back! The command I had run previously looks like it had a larger retention.ms than the defa...
[21:10:01] 10Traffic, 10Analytics-Cluster, 06Analytics-Kanban, 06Operations, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2307079 (10Ottomata) Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config ove...