[07:12:50] bblack: it just happened to cp3030 now
[07:22:53] the timing seems interesting, this is the time of the day when esams traffic ramps up
[07:34:13] system load on cp3030 went up to > 250
[10:49:47] bblack: vtc tests pass, +1
[10:52:54] bblack: mmh although, I am now wondering, are we always properly unsetting x-miss2pass?
[10:53:16] we don't unset it in the hit-case, but then there's no backend it's being sent to
[10:53:39] unset req.http.X-MISS2PASS is in the if vcl_config.fetch('pass_random') conditional
[10:53:53] I guess technically in the real-pass case we're not unsetting it now (before this new patch)
[10:53:58] and yeah, that
[10:55:16] but it's just an optimization to not send extra headers further down the stack, it doesn't affect anything either way
[10:55:34] yeah
[10:55:48] anyways, moved it out past the conditional
[10:56:24] cool, ship it!
[10:59:40] did an agent run on all codfw caches, since that's the only case where it matters in present routing.
[10:59:56] ok
[10:59:59] for ulsfo I'd start with the low-numbered ones
[11:00:03] looking into the ulsfo situation
[11:00:05] alright
[11:00:12] cp[4008,4010,4018,4027-4028].ulsfo.wmnet
[11:01:30] starting
[11:06:25] bblack: now, is all of this still fundamentally due to file storage issues?
[11:13:10] I don't know. probably?
[11:13:31] mbox lag is in the picture, so probably? :)
[11:13:44] yeah but the file storage itself may not be at issue
[11:14:08] honestly, it's not the root issue with mailbox lag even in the upload case
[11:14:28] it's just that there are file-storage-level optimizations we can (did) make that help alleviate the mbox issue
[11:15:13] the root of the issue is how varnish (4, anyways?) handles expirations and evictions
[11:15:52] the idea that the rest of varnish is supposed to scale over hundreds of threads
[11:16:08] while only one poor thread takes care of expirations
[11:16:09] but evictions are all happening from a singular expiry thread
[11:16:17] yeah
[11:16:31] oh, for the record, we *did* have issues on text before
[11:16:33] and there's this one-way message queue from N workers -> 1x expiry-thread
[11:16:37] https://phabricator.wikimedia.org/T145661#3234480
[11:17:00] so there's locking on the queue itself, and then there's other locking on the LRU structures and such during actual expiry actions of the expiry thread
[11:18:40] so the architectural problem with the mbox lag is that centralized queueing of eviction work down to one thread and all the involved locking
[11:19:14] the reason it impacts file more than mem storages is two-fold:
[11:19:37] 1) file is slower, so various workers + expiry-evictions take longer, holding locks longer, exacerbating the issue
[11:20:31] 2) file is more subject to nuke-explosion since it can host larger objects (nuke explosion being when the storage of 1 new larger object forces LRU eviction of many smaller objects)
[11:21:00] so our file storage splitting attempts to address the second point
[11:21:02] also file is larger, holding more objects
[11:21:09] yeah, that too
[11:21:32] less than 2x the frontend though, at least in the case of cp3030 which I'm looking at now
[11:21:40] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=13&fullscreen&orgId=1&from=now-7d&to=now
[11:21:43] storage splitting keeps the object sizes in a given file storage within a constrained range (16x size spread), to minimize eviction-nukes-per-store
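To make the size-split idea above concrete, here is a minimal Python sketch of binning objects into per-size-class storages. It is purely illustrative: the 64 KB starting point and bin boundaries are invented here and are not the production cutoffs; the only property being shown is that a 16x size spread per bin bounds how much a single large admission can nuke.

```
# Toy illustration of the storage-splitting idea: each file storage bin only
# holds objects within a 16x size range, so making room for a new object never
# has to evict objects that are orders of magnitude smaller.
# Bin boundaries below are invented for illustration only.

def storage_bin(obj_size: int, min_size: int = 64 * 1024, spread: int = 16) -> int:
    """Return the index of the size-class bin an object of obj_size bytes lands in.

    Bin 0 holds [0, min_size), bin 1 holds [min_size, min_size*spread),
    bin 2 holds [min_size*spread, min_size*spread**2), and so on.
    """
    if obj_size < min_size:
        return 0
    bin_idx = 1
    upper = min_size * spread
    while obj_size >= upper:
        bin_idx += 1
        upper *= spread
    return bin_idx

if __name__ == "__main__":
    for size in (10_000, 100_000, 5_000_000, 80_000_000):
        print(f"{size:>12} bytes -> bin {storage_bin(size)}")
```

Within any one bin the largest object is at most ~16x the smallest, so a worst-case admission evicts on the order of 16 peer objects rather than thousands of tiny ones.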
[11:24:23] actually, there's yet another way in which nuke-explosion is worse on file than mem storage: with malloc storage, it's literally just malloc(object) and free(object), varnish isn't mallocing a giant pool and trying to manage it manually within
[11:25:27] it's up to the malloc implementation to do that, and it can do it pretty well. the limitations are soft. if we configure malloc=256GB, at any given point in time the real memory taken by the combined effect of the many malloc()s could be larger as the malloc lib shuffles things around and tries to deal with fragmentation efficiently and so on
[11:25:53] (notably, good mallocs like jemalloc do a size-split sharding for the malloc storage pools, not unlike what we did for file storage on upload)
[11:26:43] but for the file storage, it's a contiguous chunk of storage that varnish is managing for itself. so with malloc, if storage is filled with 1M objects and you allocate a 100M object, you just have to free 100 objects...
[11:27:05] but with "file", you have to keep freeing those 1M-sized objects until you find a *contiguous* 100M chunk
[11:27:48] (although to be fair, I think varnish does commonly split up the allocations into smaller chunks to avoid most of that problem, at least when streaming)
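A rough way to see the contiguity difference just described: model the file storage as an arena of 1 MB slots that is completely full, and count how many evictions it takes to fit a new 100 MB object. With malloc-style storage any 100 freed objects suffice; with a contiguous arena you keep evicting in LRU order until 100 adjacent slots happen to be free. The arena size and the random stand-in for LRU coldness below are made up for illustration, and this ignores the chunked-allocation mitigation mentioned at 11:27:48.

```
# Toy model of the nuke-explosion difference between malloc-like and
# contiguous ("file") storage, under the assumptions stated above.
import random

def file_evictions_for_contiguous(n_slots: int = 2048, need: int = 100, seed: int = 0) -> int:
    """Evict 1 MB objects in (random stand-in for LRU) order until `need`
    contiguous free slots exist; return how many evictions that took."""
    rng = random.Random(seed)
    lru_order = list(range(n_slots))
    rng.shuffle(lru_order)          # pretend this is coldness order
    free = [False] * n_slots
    evicted = 0
    for slot in lru_order:
        free[slot] = True
        evicted += 1
        run = 0                     # longest current run check
        for f in free:
            run = run + 1 if f else 0
            if run >= need:
                return evicted
    return evicted

if __name__ == "__main__":
    print("malloc-style: 100 evictions (any 100 x 1 MB objects)")
    print(f"file-style:   {file_evictions_for_contiguous()} evictions to expose a contiguous 100 MB run")
```

The exact numbers are meaningless, but the gap between "free 100 objects" and "free objects until a contiguous run appears" is the nuke-explosion being discussed.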
[11:29:18] re: cp3030, sometimes a little more than 2x, but I don't see 3x. but the weekly backend restarts are also a limiting factor for bringing a better long tail into backend storage
[11:29:35] (but still, there's fewer total objects under any given hotness threshold on text than upload)
[11:31:17] I don't think text really uses all the backend storage we give it, usually
[11:31:28] cp4010, which I'm looking at now, has ~4M fe objects and ~12M be, so 3x
[11:31:31] at least, not very effectively
[11:31:57] waiting for cp4008's backend to refill a bit and man! it takes a while
[11:33:04] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-6h&to=now&var-cluster=text&var-site=ulsfo
[11:33:27] ^ I'd say you can push the restart rate on those until you see a real negative effect on true hitrate there that scares you
[11:33:38] sounds good
[11:34:33] the small divergence of local+overall can be seen since your 4008 restart
[11:34:52] that's a loss of local-be cache contents as expected, but codfw and/or eqiad seem to be covering most of it fine
[12:04:22] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3620910 (TheDJ) FYI: Seems the change missed the cut-off for the Safari 11 release. iOS 11 Safari 11 not yet fixed: Mozilla/5...
[12:34:28] ema: re: all the complex things about possible coalesce implications of MISS2PASS: we could always go back to setting req.hash_ignore_busy in vcl_recv for this case, like earlier variants of the patch did
[12:35:03] I removed that on the theory that we weren't likely to experience those complex coalesce issues anyways, so why add another line of confusing code and behavior change, etc?
[12:35:29] but you could also argue that even if we're unlikely to experience the problems that warrant that hash_ignore_busy line, it's cheap insurance
[12:35:54] it has the downside of potentially causing parallel fetches with potential miss-fetches, but that problem doesn't scale up very big anyways
[12:36:39] to elaborate that point, the inefficiency it can cause goes like this example:
[12:37:06] 1) request for ObjA arrives at eqiad-fe, misses to eqiad-be, misses to applayer
[12:37:55] 2) a request for ObjA arrives at esams-fe, misses to esams-be, misses to eqiad-be, and ends up being concurrent inside of eqiad-be with the above
[12:38:33] if (2) arrives at eqiad-be before (1), it doesn't matter what the h_i_b setting is
[12:39:02] if (1) arrives at eqiad-be before (2), without h_i_b (2) would coalesce-stall and wait to reuse the object coming into cache on (1)'s behalf
[12:39:27] but with h_i_b (2) would ignore that and eqiad-be would make a second parallel request to the applayer for the same object
[12:39:54] but given the chashing that all the requests in question (the ones that matter for this discussion) are subject to
[12:40:19] at most the corner-case limit is that you end up with N parallel requests for this object, where N = count_of_datacenters
[12:41:27] (and you only get there in the weird scenario that ObjA is missing in every single cache layer+dc globally, but then gets suddenly requested from them all simultaneously with the timing precision that they all arrive at eqiad-be before the one from eqiad-fe manages to complete the fetch into storage)
[12:41:57] even then, ok the object gets requested 4-5 times from the applayer instead of once, and then the problem goes away
[12:43:27] it does sound like a cheap insurance cost against the possibility of coalesce-stall on the miss->pass conversion (or the object lookup itself, really)
[12:45:43] https://gerrit.wikimedia.org/r/#/c/379221/
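The bound argued at 12:39:54 and 12:40:19 can be sketched in a few lines of Python: within each DC, consistent hashing sends every request for a given object to the same backend, so however many frontends miss at once, at most one request per datacenter can be in flight toward the next layer, capping parallel applayer fetches at the number of datacenters even with hash_ignore_busy set. The datacenter names and backend counts below are illustrative, not the real topology, and the hashing shown is a simplified stand-in for the real chash.

```
# Sketch of why the hash_ignore_busy worst case is bounded by the DC count,
# under the simplifying assumptions stated above.
import hashlib

DCS = {"eqiad": 8, "codfw": 8, "esams": 8, "ulsfo": 6}   # backends per DC (made up)

def chash_backend(obj: str, dc: str) -> str:
    """Pick the backend for `obj` in `dc` the way a consistent hash would:
    deterministic per (object, dc), so all frontends in a DC agree."""
    h = int(hashlib.md5(f"{dc}/{obj}".encode()).hexdigest(), 16)
    return f"cp-{dc}-{h % DCS[dc]}"

if __name__ == "__main__":
    obj = "/wiki/Main_Page"
    # Pretend 1000 frontends per DC all miss on obj at the same instant:
    forwarded = {(dc, chash_backend(obj, dc)) for dc in DCS for _ in range(1000)}
    # The distinct forwarding paths collapse to one per DC, so even without
    # coalescing the applayer sees at most len(DCS) parallel fetches, not 4000.
    print(f"distinct backend paths for {obj}: {len(forwarded)} (datacenters: {len(DCS)})")
```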
[12:48:42] netops, Operations, monitoring, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3621017 (mark)
[12:48:46] netops, Operations, monitoring, User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3621014 (mark) Open>Resolved a: mark This will do for our aggregate power stats. Thanks!
[12:50:31] bblack: yup!
[12:53:33] all text-ulsfo backends done
[12:53:40] so, a bit later today, I'll probably try to resume those legacy ulsfo depools
[12:53:51] assuming the earlier problems stay dead, etc
[12:54:41] it seems likely the problems with those depools yesterday were all related to this. shouldn't happen now if we've fixed that problem. so, good to test.
[12:59:29] oh and could/should restart codfw text backends JIC too. I can do it later if you want, I don't think it's urgent.
[12:59:33] I'm off for a bit for now
[13:00:13] it's alright, I can do that :)
[13:00:44] so I was now trying to see if anything stood out looking at *eqiad* graphs during the esams issues of today
[13:01:23] for example, cp3030 had trouble from ~06:45 till ~07:15
[13:01:53] in that timeframe, pretty much all eqiad nodes have a ramp up in sessions closed with rx_timeout
[13:02:40] that means that the esams<->eqiad session timed out, I imagine
[13:03:12] eg, cp1067: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=26&fullscreen&orgId=1&from=1505889053672&to=1505892362885&var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops
[13:04:00] sc_rx_timeout – Session Err RX_TIMEOUT (DIAG)
[13:04:00] Number of session closes with Error RX_TIMEOUT (Receive timeout)
[13:04:38] any idea if sc_rx_timeout means "timeout after receiving some-but-not-all bytes of a request" or "timeout between requests while waiting for the next one" or both?
[13:04:39] it's not 100% clear to me whether, from the varnish standpoint, "receive" means received from the client or from the backend
[13:04:57] well yeah and that, but I'm assuming it's rx timeout on the esams<->eqiad reqs
[13:05:11] you'd think so, but then there's also
[13:05:16] sc_rx_bad – Session Err RX_BAD (DIAG)
[13:05:17] Number of session closes with Error RX_BAD (Received bad req/resp)
[13:05:59] so given that resp is mentioned, it seems that they might mean received from either the client or the backend?
[13:06:07] if the definition encompasses the idle timeout between requests, it's possibly just an expected effect
[13:06:43] that if the esams-be in question was getting all stalled and locked-up in general, its outbound request rate would slow to a crawl and the usually-persistent connections to eqiad would start dropping from timeout
[13:06:59] (its outbound request rate as perceived by eqiad, anyways)
[13:07:03] right
[13:10:45] in all software work, we tend to reassure ourselves with ideas that what we're doing is rational, validatable, testable, and sane. that it's more like math than it is.
[13:11:27] and usually that isn't really the case. usually there's a whole lot of guesswork and intuition and hoping going on, but we mask it in a bunch of ways that make it look more rational than it is
[13:12:07] still better than economics though
[13:12:25] at least we plot real stuff
[13:12:27] the complexity level of this caching crap that I've experienced here at the WMF, though. it pushes things so far out where you can't really plausibly deny the irrationality anymore. you lose your comfort blankets
[13:12:54] it's all obviously guesswork and intuition most of the time. and it gets uncomfortable and unnerving to be constantly smacked in the face with that realization :P
[13:14:35] even when we plot real stuff, it's almost like reading tea leaves
[13:15:23] we seldom go into anything with a solid prediction of the expected outcomes on the graphs a priori. we look for the results we want to find, and what we're really doing is just trying to reinforce or invalidate the intuitions that weren't based on the data :)
[13:16:57] * ema hugs bblack and starts codfw restarts
[13:19:19] I think the meta-lesson here is that in any sufficiently complex system, you can't rationally calculate or predict much of anything with the kind of certainty. Trying to calculate your way to the solution is never as effective as jumping through guesswork experiments using experienced intuitions as your guide.
[13:19:41] s/with the kind/with any kind/
[13:20:07] but I guess that's still not an excuse to be un-rigorous in the few places where you can be rigorous :)
[13:20:16] anyways, bbl in the day! :)
[13:20:29] o/
[14:07:13] i managed those caching clusters entirely based on intuition, bblack ;p
[14:07:21] aided by a tiny bit of squid stats
[14:13:45] netops, Operations, fundraising-tech-ops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621226 (Jgreen)
[14:15:05] netops, Operations, fundraising-tech-ops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621259 (Jgreen)
[14:15:27] netops, Operations, fundraising-tech-ops: remove fundraising firewall rules related to ganglia - https://phabricator.wikimedia.org/T176319#3621226 (Jgreen)
[15:02:52] Traffic, Operations, Browser-Support-Apple-Safari, Upstream: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3621439 (Nuria) Super thanks for following up
[16:12:10] Traffic, MediaWiki-API, Operations, monitoring, Services (watching): Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#3621638 (GWicke) FTR, this is the graph with the alert I mentioned: https://grafana.wikimedia.org/dashboard/db/restbase?pane...
[16:25:39] Traffic, Analytics, Operations: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3621689 (phuedx) Deterministic bucketing is also available in MediaWiki core via [[ https://github.com/wikimedia/mediawiki/blob/00c769eb8d7746dfddff525ccc813f276046dea8/resources/src/mediawiki/medi...
[17:11:02] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3621935 (RobH)
[17:17:30] netops, DC-Ops, Operations, ops-eqiad, procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3621976 (RobH)
[18:14:07] netops, Operations: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3622219 (ayounsi)
[18:14:27] km@km-tp ~> curl -I "https://upload.wikimedia.org/wikipedia/mediawiki/6/69/Hesaidsemanticga2.jpg" | grep content-type
[18:14:27] content-type: application/x-www-form-urlencoded
[18:14:36] is that intentional/expected?
[18:24:09] legoktm: definitely not, but not a Traffic issue
[18:24:18] content-type for upload is stored as metadata in Swift
[18:24:27] and is being set by MW
[18:24:52] ok, should I file a task for it in those projects then?
[18:26:43] Account: AUTH_mw
[18:26:43] Container: wikipedia-mediawiki-local-public
[18:26:43] Object: 6/69/Hesaidsemanticga2.jpg
[18:26:43] Content Type: application/x-www-form-urlencoded
[18:26:46] legoktm: yes
[18:29:56] https://phabricator.wikimedia.org/T176346
[18:35:32] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622292 (RobH)
[18:40:21] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622317 (RobH) I tried to connect to cr3-esams and pull chassis hardware info, with no joy: ``` robh@re0.cr3-esams> show chassis hardware...
[18:42:56] netops, DC-Ops, Operations, ops-esams, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#3622339 (RobH)
[19:50:34] netops, Operations, ops-ulsfo: connect new office link to asw-ulsfo - https://phabricator.wikimedia.org/T176350#3622485 (RobH)
[21:14:25] Traffic, Operations, ops-ulsfo: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622887 (BBlack)
[21:15:06] Traffic, Operations, ops-ulsfo: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622901 (BBlack) [just pre-creating the task, we're not quite ready to take action yet. These systems are now depooled, but we'll wait a few days before un-configuring in case a reason to repoo...
[21:33:39] Traffic, Varnish, Operations, Performance-Team, Wikimedia-Incident: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#3622947 (greg)
[21:37:36] ema: FYI no issues with retrying the ulsfo depool
[22:03:06] Traffic, Operations, Performance-Team, Wikimedia-Incident: Collect Backend-Timing in Graphite - https://phabricator.wikimedia.org/T131894#3622996 (Krinkle)