[09:03:03] 10Traffic, 10netops, 06Operations: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193128 (10Volans)
[09:04:44] oh cp2002 seems to have caused some trouble during the night
[09:17:53] OK it was near the weekly restart which happened at 7ish, bad timing
[09:25:55] 10Traffic, 10netops, 06Operations: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193170 (10Volans) Ping from various codfw hosts confirms packet loss: - from `lvs2004` ``` --- lvs2001.codfw.wmnet ping statistics --- 165 packets transmitted, 146 received, 1...
[09:41:56] 10Traffic, 10netops, 06Operations: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193128 (10ema) Note that, perhaps interestingly, the number of Icmp_Outmsgs on lvs2001 reached 1000 at a certain point and then flattened there. {F7631080}
[10:36:02] oh, interestingly esams hosts are also affected by the lag now
[10:36:10] cp3044 specifically
[10:36:37] I don't think that ever happened with esams routed to eqiad
[10:38:11] also in this case varnishd has been running for > 6 days, we're close to the weekly restart
[10:39:35] 10Traffic, 10netops, 06Operations, 13Patch-For-Review: lvs2001: intermittent packet loss from Icinga checks - https://phabricator.wikimedia.org/T163312#3193210 (10Volans) 05Open>03Resolved p:05Triage>03High a:03Volans Increased the max ICMP out packets to 3000 to overcome the bottleneck. Packet l...
[10:42:17] varnish-be depooled as it was causing a few 503s (not too many, but still)
[10:47:19] 10Traffic, 10Monitoring, 06Operations: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3193217 (10fgiunchedi) Upstream issue: https://github.com/prometheus/procfs/issues/40
[11:16:41] yeah, cp2002 got hit by mailbox lag when traffic switched over to it several hours before the weekly restart was due, and I depooled it for a little while and it recovered, then repooled
[11:17:16] I imagine the esams case is similarly from the additional impact of changing traffic patterns (changing hit rates underneath from codfw, etc?)
[11:19:31] most likely, yes. Interesting how esams hosts have never been affected in the past though (I think?)
[11:20:58] well, esams upload is also in another unique position this week
[11:21:08] having lost 2/12 to reboot issues, right?
[11:21:14] oh very true!
[11:21:18] I guess 3 are down
[11:31:46] volans: no, it's 2 down (cp3038 and cp3045)
[11:32:11] oh right, cp3003 is decom IIRC
[11:32:32] had forgot that
[11:32:43] it's maps :)
[11:35:01] bblack: I'm gonna repool cp3044, it recovered from the lag pretty quickly
[11:35:35] ok
[11:35:36] can't we use serverhands to have a look at cp3038/cp3045?
[11:35:51] since they also switched the switch recently
[11:36:00] yeah we should be able to at least have them pull the plugs for a minute (which will quite likely fix this)
[11:36:25] racadm racreset --harder
[11:36:33] :)
[11:38:57] maybe starting with asia we should start using manageable PDUs too, would save some pain in these cases
[11:39:46] I've used the APC ones in the distant past and they weren't awful
[11:41:34] e.g. these things (they come in different models for different layouts and specs and plugs): https://www.newegg.com/Product/Product.aspx?Item=9SIA24G0P79621
[11:42:14] you can ssh into them and turn individual ports on and off
[12:15:56] yup, I used similar kinds, APC isn't bad, and they usually have lots of features
[12:19:53] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (10Marostegui) Cool, I will talk to Jaime tomorrow in our weekly meeting and we will try to see how to fit our stuff before/after it. I w...
[13:09:33] 10Traffic, 10netops, 06DC-Ops, 06Operations: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193493 (10ayounsi)
[13:13:34] 10Traffic, 10netops, 06DC-Ops, 06Operations: Interface errors on asw-c-codfw:xe-7/0/46 - https://phabricator.wikimedia.org/T163323#3193512 (10BBlack) To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, a...
[13:16:46] from HN today: http://danluu.com/2choices-eviction/
[13:17:05] he's looking at the tiny and very different CPU cache case, but it would be interesting to study for large HTTP disk caches too :)
[13:17:29] the basic idea is that while a full LRU-based eviction beats the much simpler "evict a random entry"...
[13:18:00] you actually get pretty decent results with much less overhead and work than full LRU by just picking two random cache entries and evicting whichever of those two is least-recently-used :)
[13:18:39] oh so LRU-between-two-random?
[13:19:16] nice :)
[13:20:55] I imagine there are all kinds of variations you could try on that theme
[13:21:09] repeating from #-ops because there's too much spam there (sigh)
[13:21:15] but when you have millions of cache objects, just tracking global LRU is a PITA
[13:21:26] what are those HTTP 5xx alerts that are flapping?
[13:21:45] I would assume they're related to the RB spam in -ops?
[13:22:39] that was my guess but Eric said that the RB issue is eqiad-only and reads are codfw-only now, so it shouldn't
[13:23:24] currently in an interview, sorry
[13:24:34] "uri_host":"en.wikipedia.org","uri_path":"/api/rest_v1/page/html/Stapes/776001088"
[13:24:38] etc
[13:24:56] once I filter out upload.wikimedia.org and performance.wikimedia.org (both of which are also throwing some 5xx), I see lots of RB URLs
[13:25:32] bblack@oxygen:/srv/log/webrequest$ tail -10000 5xx.json|grep -v upload|grep -v performance|grep -v phab.wmfuser|wc -l
[13:25:35] 7565
[13:25:37] bblack@oxygen:/srv/log/webrequest$ tail -10000 5xx.json|grep -v upload|grep -v performance|grep -v phab.wmfuser|grep '/api/rest'|wc -l
[13:25:40] 7221
[13:26:10] I wonder if the upload 5xx rate is "normal" or related to the lvs2002 interface issue
[13:26:20] but relatively-unimportant all things considered
[13:38:58] unrelated but since I'm looking at icinga: cp2014 has an expiry mailbox lag CRIT
[13:40:46] yeah those alerts just aren't perfect yet, but they're still good-to-know
[13:41:31] we can depool it for a bit and clean it up
[13:42:02] (before it becomes an actual issue)
[13:43:07] [done, will repool later after stats recover]
[13:56:50] 10netops, 06Operations, 10ops-eqiad: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3193573 (10akosiaris)
[13:58:39] re: cp2014, what I've found these 2 recent times when I depooled to drain mailbox lag is that the trailing indicator is the expiry lock ops graph
[13:58:42] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=24&fullscreen&orgId=1&var-server=cp2014&var-datasource=codfw%20prometheus%2Fops&from=now-1h&to=now
[13:58:55] once the mailbox lag recovers, I've been waiting for that lockops graph to drop back down before repooling
[13:59:47] (which it has now done)
[13:59:56] 10netops, 06Operations, 10ops-eqiad: switchover icinga.wikimedia.org from einsteinium to tegmen - https://phabricator.wikimedia.org/T163324#3193593 (10akosiaris) a:05Cmjohnson>03akosiaris
[14:08:13] 10netops, 06Operations, 10ops-eqiad: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 - https://phabricator.wikimedia.org/T163326#3193640 (10akosiaris)
[14:10:22] yeah so mailbox lag, we need a better solution eventually
[14:10:59] the abrupt mid-week pattern changes tend to exacerbate it (esams from the 2x lost hosts, and codfw from the abrupt shift of actual query pattern by moving different traffic through it)
[14:11:59] I think the ideal short-to-medium term fix is to restart faster than 1/week
[14:12:14] :(
[14:12:16] but I don't want to go all the way to 1/day, and our current code solution doesn't do anything in between for the cronjob spacing
[14:13:30] we could also explore the idea of breaking up storage even more finely
[14:14:07] right now the bins have a 16x size range (meaning the max size of a bin is 16x the min size), which helped a lot
[14:14:18] we could make them 8x size range and twice as many bins, for instance
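(For context on the binning being discussed: the size-based bins are separate file storage backends, and objects are steered into them from VCL by Content-Length. A minimal sketch of that selection is below; the bin names, the exact boundaries above bin0, and the fallback for responses without a Content-Length are illustrative assumptions, not the production values.)

```
vcl 4.0;
import std;

sub vcl_backend_response {
    # Steer the object into one of the size bins; each bin is a separate
    # "-s file,..." storage backend covering a 16x min-to-max size range,
    # with bin0 covering 0-16K. A missing Content-Length falls back to 0
    # here, i.e. into bin0 (an assumption for this sketch).
    if (std.integer(beresp.http.Content-Length, 0) <= 16384) {
        set beresp.storage_hint = "bin0";   # 0 - 16KB
    } elsif (std.integer(beresp.http.Content-Length, 0) <= 262144) {
        set beresp.storage_hint = "bin1";   # 16KB - 256KB
    } elsif (std.integer(beresp.http.Content-Length, 0) <= 4194304) {
        set beresp.storage_hint = "bin2";   # 256KB - 4MB
    } elsif (std.integer(beresp.http.Content-Length, 0) <= 67108864) {
        set beresp.storage_hint = "bin3";   # 4MB - 64MB
    } else {
        set beresp.storage_hint = "bin4";   # > 64MB
    }
}
```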
[14:17:31] bblack: most of the objects seem to end up in bin0, at least in cp2017's case
[14:17:37] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=37&fullscreen&orgId=1&var-server=cp2017&var-datasource=codfw%20prometheus%2Fops
[14:18:20] what exactly does "N struct smf" mean in this context?
[14:18:49] I don't think it's 1x smf per object
[14:20:02] in the stats from developing the binning, bin0's size range was responsible for 77.6% of all requests, but 3% of all object storage in bytes
[14:20:50] (but bin0+1 are also going to tend to get covered better by frontends, too)
[14:23:07] not really sure what that means, the docs don't say much
[14:24:29] "bytes allocated" is perhaps more interesting?
[14:40:12] 10netops, 06Operations, 10ops-eqiad: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3193747 (10ayounsi) >if possible to migrate kafka1022 I believe you mean kafka1020 > any issue from the network capabilities perspective to move a kafka node in row...
[14:45:12] so yeah varnish-counters(7) says very little about those counters
[14:45:14] g_smf – N struct smf (INFO)
[14:45:14] g_smf_frag – N small free smf (INFO)
[14:45:14] g_smf_large – N large free smf (INFO)
[14:47:19] the total number of cached objects in cp2002's backend is ~5M though, so those ~58K N struct smf are definitely not 1x object
[14:48:21] *cp2005 ^
[14:54:44] so "struct smf" is basically chunks of underlying storage
[14:54:50] (looking at the code)
[14:55:06] they can contain arbitrary numbers of bytes
[14:55:32] the space in the storage is essentially managed in these chunks, and struct smf is the metadata for each chunk
[14:55:39] they seem to be variably-sized
[14:56:29] my best guess would be when allocating space for a new object insertion, it creates an smf as big as the object
[14:57:13] when an object is ejected/freed, that smf goes onto the free list, and at certain points in the code it attempts to merge up adjacent free smfs into larger ones (on the free list for re-use as space to split up for new objects)
[14:57:24] but the code is (as usual) hard to follow
[14:59:35] it's all in bin/varnishd/storage/storage_file.c
[15:00:58] so, excessive "struct smf" relative to expected object counts and eviction rates probably indicates something about excessive fragmentation
[15:01:21] which is kinda interesting. bin0 is the only bin that doesn't have a strict x16 size range, as it covers sizes 0-16K
[15:02:16] maybe we have a lot of truly-tiny objects in there too, and at the high request rates flowing through bin0 they cause contention/fragmentation due to the extreme size mismatch between ~15K objects and ~150B objects and such
[15:02:52] the old statistical data used during binning research says the average size of bin0-range objects was ~7328 bytes
[15:03:13] but that's a simple arithmetic mean, it doesn't tell us much about the distribution
[15:36:00] 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3194084 (10ema) Looks like the overwhelming majority of objects ending up in bin0 (thus smaller than 16k) is made of items smaller than 1676 bytes. I've ran the...
[15:36:44] bblack: your theory about small objects in bin0 seems correct at a first glance :)
[15:55:10] yikes, half of them are < ~1.5K :)
[15:56:01] so, back to binning, we could of course make bin0 be something like 0-1K, and bin1 1-16K and so on
[15:56:13] but really
[15:56:30] perhaps a better approach would be to refuse to cache objects <1K size in the upload backends
[15:56:39] which would keep bin0 to a 1K-16K range like a normal bin
[15:56:56] and leave it to frontends -> passthrough-backends to take care of caching the tiny files on their own
[15:57:24] right, if they're popular they'll be frontend hits anyways
[15:57:59] it's probably reasonable to assume basically all remotely-popular objects <1K size fit in FE storage simultaneously
[15:58:05] but maybe some rough math should confirm that
[15:58:39] malloc on new-gen frontends is currently 100G
[15:59:23] so that theoretically allows ~100M of them if they were the only thing in the frontend (which they're not)
[16:00:41] in those old stats I ran, 61 days of 1/1000 data only turned up ~10M total unique objects in the 0-16K size range
[16:00:57] so, seems like a reasonable assumption above, I guess
[16:01:48] we could also just start smaller and work our way up and make sure we don't cause problems
[16:02:06] e.g. refuse to backend-cache objects with CL < 256B first, see how that goes and work up to 1K
[16:02:28] watching out for excessive cache miss->pass on those small objects from not enough of the hot-ish ones fitting
[16:03:40] what sucks is the hit-for-pass objects probably take as much room as the tiny objects themselves :)
[16:03:52] so there's some pragmatic things to think about on issues like that
[16:04:03] haha yeah!
[16:06:45] I guess we could also just do them without hfp
[16:06:58] just a simple conversion of miss->pass at response time with 0s+uncacheable
[16:07:28] (like we're doing for size filters I think on FE)
[16:08:11] ah no we do regular hfp for large objects in the FE
[16:08:21] but we do a non-hfp pass on the N-hit-wonders
[16:11:00] interestingly we don't count the N-hits as pass though, we count them as miss statistically
[16:11:05] I guess that makes more sense in that case
[16:12:02] we'd do the same in this case I think, so that the hitrate works right for how it's handled FE-only
[16:12:52] so I guess the ideal plan would be something like the N-hit-wonder code on upload-fe today
[16:13:14] but in the backend, and doing the ttl=0+uncacheable on misses where CL < N
[16:13:44] and for bonus points, put the same CL < N + miss check in the frontend and use it to switch to cache_foo_random like pass-traffic
[16:14:03] (since those will never be cached by the backends, no point chashing them when they can be better-distributed randomly)
[16:15:08] very cool
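(To make the hfp vs. non-hfp distinction concrete, here is a rough sketch of the two upload-frontend filters being referenced. It is not the literal production VCL: the 601s hfp lifetime is an arbitrary illustrative value and the flow is simplified; the 256KB cutoff and the X-Cache-Int hit-count regex are the ones mentioned elsewhere in this conversation.)

```
vcl 4.0;
import std;

sub vcl_backend_response {
    # Large objects: create a real hit-for-pass object, so concurrent
    # requests for a big object don't serialize behind a single fetch.
    # The object itself isn't cached; the hfp marker lives in Transient.
    if (std.integer(beresp.http.Content-Length, 0) >= 262144) {
        set beresp.ttl = 601s;          # illustrative hfp lifetime
        set beresp.uncacheable = true;
        return (deliver);
    }

    # N-hit-wonders: only admit objects the backend layer has already seen
    # a few hits for; everything else becomes a one-off non-hfp pass
    # (ttl=0 + uncacheable), so no long-lived hfp object is created.
    if (beresp.http.X-Cache-Int !~ " hit/([4-9]|[0-9]{2,})$") {
        set beresp.ttl = 0s;
        set beresp.uncacheable = true;
    }
}
```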
[16:15:32] I was trying to find out the size of a hfp object out of curiosity
[16:15:44] is it just a struct objcore or is there gonna be more stuff involved?
[16:16:47] well, it's gotta have something more than just fixed fields, as it has to match all the relevant things (I guess URL + hashed fields + Vary headers. and maybe since the cache storage engine can't predict what you'd hash or vary on, that might mean all headers?)
[16:17:15] oh no, it wouldn't need all headers for the hash or vary stuff, that's just bad thinking
[16:17:23] hmmm
[16:17:31] yeah it could well be just a struct objcore
[16:22:18] qq - Dan in my team was trying to fetch https://analytics.wikimedia.org/datasets/periodic/reports/metrics/language/compact-language-links/frwiki.tsv from javascript and getting cached content, meanwhile it seems that adding any Cookie header would force a pass to the backend
[16:23:00] Tried to check VCL a bit but my understanding of it is really horrible :)
[16:23:55] (and the response carries Cache-Control: max-age=86400, public, must-revalidate)
[16:24:09] what is the best practice to suggest in this case?
[16:25:40] yeah so I don't know what you're actually looking for, but what's generally going on there is:
[16:26:16] analytics.wikimedia.org maps to the cache_misc cluster (along with a bazillion random misc services)
[16:27:07] we don't (in any programmatic sense) know which of those services have session cookies that matter and should Vary or block caching to avoid leaking data between users or between logged-in users and the public
[16:27:32] (much less know the names of said cookies, or other non-cookie methods of accomplishing the same)
[16:28:00] so the cache_misc code has little choice but to be very conservative and basically say "if I see a cookie, do not cache, because the cookie could be a session cookie"
[16:28:13] ahh now it makes sense
[16:28:22] and then we've carved out a few exceptions to that for common/well-known cookies that we know aren't session cookies
[16:28:37] ones we set ourselves at the wikimedia.org level, common google analytics cookies, etc
[16:29:22] the regex of cookie keys we ignore is: "(?i)(^|;\s*)(_ga[-_a-z]*|_utm[-_a-z]*|WMF-Last-Access(-Global)?|GeoIP|CP)=[^;]*"
[16:30:37] thanks a lot, now it makes perfect sense
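(A minimal VCL sketch of the conservative cookie policy just described, using the regex quoted above. The control flow here is a simplification for illustration; the real cache_misc VCL is more involved.)

```
vcl 4.0;

sub vcl_recv {
    if (req.http.Cookie) {
        # Strip the well-known non-session cookies (analytics cookies,
        # WMF-Last-Access, GeoIP, CP, ...) using the regex quoted above.
        set req.http.Cookie = regsuball(req.http.Cookie,
            "(?i)(^|;\s*)(_ga[-_a-z]*|_utm[-_a-z]*|WMF-Last-Access(-Global)?|GeoIP|CP)=[^;]*", "");
        if (req.http.Cookie ~ "^[;\s]*$") {
            # Nothing left but separators: no session cookie, safe to cache.
            unset req.http.Cookie;
        }
    }
    if (req.http.Cookie) {
        # Anything we don't recognize could be a session cookie, so be
        # conservative: bypass the cache entirely for this request.
        return (pass);
    }
}
```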
[16:31:35] probably the right move here for the future of cache_misc is to start making this part of the metadata about specific services in the req_handling hieradata
[16:32:17] to specify that a service either has no private sessions (ignore all cookies, etc), or should be treated as default above (only ignore well-known cookies), or specify a regex of its known session cookies and ignore all others for the above purposes.
[16:32:27] and then we could at least go through and fix up cases we have more info about
[16:33:14] or in another direction, both for cache_text and cache_misc we could start reviving the idea of X-Vary-Options or whatever its more RFC-standard replacement was
[16:33:36] (which lets the applayer actually tell us per-response "Vary: Cookie, but only on the following cookie key")
[16:36:58] so the "whitelist" is suggested by the backend and the cache only needs to respect it
[16:38:43] right
[16:39:05] in theory (and maybe in practice 15 years ago for very simple example sites)
[16:39:20] the idea was the only reason you'd probably use a cookie was for a login session
[16:39:45] and by emitting the response header "Vary: Cookie", you tell caches not to share contents between requests with different Cookie-header values and you're done
[16:40:29] the problem is now there's a ton of cookie-spam for non-login purposes, both within applications and globally, so "Vary: Cookie" ends up in practice just meaning "you don't get to cache jack, because every request has lots of cookies no matter what"
[16:40:49] :D
[16:41:33] so X-Vary-Options was something we did in the past here with squid, that allowed us to supply more metadata about "Vary:", like specifying specific fields in the Cookie header
[16:41:51] I think TimStarling here invented it
[16:42:11] and Varnish didn't support it when we switched from Squid to Varnish, so we lost it and had to go back to hacky Cookie-parsing and interpretation
[16:43:22] and now I think I recall that there's a more-standardized thing similar to X-Vary-Options now
[16:43:25] now I get why I found https://github.com/atdt/xvo
[16:43:36] but I can't recall what the header was called now
[16:55:06] oh, hfp objects end up in transient storage apparently
[16:56:48] if (bo->uncacheable || lifetime < cache_param->shortlived)
[16:56:48] storage_hint = TRANSIENT_STORAGE;
[16:57:00] (from bin/varnishd/cache/cache_fetch.c)
[16:58:39] so I guess we don't have to worry about them causing any fragmentation at the file storage level?
[17:01:35] yeah I guess
[17:01:49] the downside of that is that transient storage is kinda crazy
[17:02:08] that's another thing I've never really dug into, all the potential problems we have with transient
[17:02:24] because transient is basically unbounded malloc, so if a lot of things end up there it could get ugly
[17:02:45] I don't even remember what the default shortlived is either, which could put some short-TTL objects like RL in it
[17:03:11] 10s apparently (ttl+grace+keep)
[17:03:18] ok that's not awful
[17:03:38] I think you can replace transient storage with a bounded malloc of the same name, too
[17:03:49] which might turn out to be useful if we have tons of long-lived hfps
[17:04:09] also, it looks like headers are indeed stored in the hfp object
[17:04:11] (could explain why it's so hard to tune cache boxes' FE malloc storage size to avoid toom, too)
[17:04:27] s/toom/oom/
[17:04:27] https://phabricator.wikimedia.org/P5291 this one passes; adding a header, the size assertion fails
[17:06:06] I don't think we plot any of the transient metrics BTW, which would be interesting to look at :)
[17:06:52] SMA.Transient.g_bytes (which I think is the total malloc'd size of transient) on cp1065 is currently ~850MB
[17:07:00] for the backend, I mean
[17:07:29] ugh, not bad
[17:07:39] and the frontend there is 65MB transient
[17:08:04] when we consider we probably don't have many (if any) shortlived actual objects
[17:08:12] 850MB is a crapload of hfp objects :)
[17:08:38] of course cp1065 is a dumb place to look with so little traffic right now :)
[17:09:12] cp2014 is probably busier :)
[17:09:44] codfw cache_upload (cp2002):
[17:09:45] root@cp2002:~# varnishstat -1 |grep SMA.Transient.g_bytes
[17:09:45] SMA.Transient.g_bytes 33616711 . Bytes outstanding
[17:09:48] root@cp2002:~# varnishstat -1 -n frontend |grep SMA.Transient.g_bytes
[17:09:51] SMA.Transient.g_bytes 5561744129 . Bytes outstanding
[17:09:57] 5.5GB of frontend transient
[17:10:04] probably because of all the size-based HFPs
[17:11:04] cp2001 (codfw text) is ~1.6G trans for the backend and ~8G for the frontend
[17:11:09] so yeah, they do get significant
[17:11:30] we might want to bound them to a reasonable amount then
[17:12:35] hmmm on the hfp things (vs non-hfp)
[17:12:58] I was forgetting earlier that of course the whole reason for hfp is to avoid serialization stalls on heavily-hit objects
[17:13:48] what we have today makes sense
[17:14:10] N-hit-wonders don't use hfp (but if they started getting traffic, they'd quickly fail the N-hit-wonder test and no longer serialize)
[17:14:58] whereas the upload frontends' >= 256KB filter does use hfp, as those could be heavy
[17:16:46] right
[17:17:12] perhaps with very small objects we don't even care too much about serialization?
[17:19:27] I donno, serialization can be very painful if one of them is very hot
[17:19:32] even if they're tiny
[17:20:45] e.g. if the RTT through the caches+app stack is 10ms, and 10K reqs arrive for a single tiny object, the last of those will stall for 100s before it gets the object.
[17:21:03] mmh yeah
[17:22:57] so the plan is to give a try to hfp for objects with CL < N in upload-backend, possibly trying increasing values of N starting around 256?
[17:27:09] yeah although I'm still lost on whether it's better to hfp them or not
[17:27:31] when things are the other way around (we're passing on the front and hitting on the back) hfp is a necessary protection
[17:27:48] if we're hitting in the front and the backs are just handling FE misses, the traffic shouldn't be high
[17:28:10] especially if we're randomizing them on the FE because we know the BE won't cache it
[17:28:59] so maybe better to avoid further transient bloat and do the non-hfp version for the backends, skipping caching based on CL, with randomization on the frontend
[17:29:27] right
[17:30:01] because we seem to have 2 alternatives: frontend randomization and no-hfp on the backend, or frontend chashing with hfp on the backend
[17:30:19] basically we want this to look a lot like the 4-hit-wonder code in frontends today
[17:30:51] but replace the: beresp.http.X-Cache-Int !~ " hit/([4-9]|[0-9]{2,})$" check with a CL < N check
[17:30:56] and do it in the backend instead of frontend
[17:31:10] and on the frontend side, under the same conditions choose the randomized backend instead of the chashed one
[17:32:55] I like it that it's always a small stupid patch with an endless reasoning behind it
[17:33:39] s/always/often/ :)
[17:33:54] yes
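(A rough sketch of the backend half of the plan just settled on: convert misses for small objects into a non-hfp pass, mirroring the frontend N-hit-wonder code. The 256-byte threshold is the starting value floated above; the bereq.uncacheable guard and the handling of responses without a Content-Length are assumptions for this sketch, not a reviewed patch.)

```
vcl 4.0;
import std;

sub vcl_backend_response {
    # Only act on real misses (bereq.uncacheable is true for pass traffic),
    # and only when the response is known to be small. Such objects are
    # never admitted to backend storage: the miss becomes a one-off non-hfp
    # pass (ttl=0 + uncacheable), like the frontend N-hit-wonder code.
    if (!bereq.uncacheable
        && beresp.http.Content-Length
        && std.integer(beresp.http.Content-Length, 0) < 256) {
        set beresp.ttl = 0s;
        set beresp.uncacheable = true;
    }
}
```

The frontend half of the plan (choosing the randomized backend instead of the chashed one under the same conditions) is not sketched here.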
[17:35:01] the ultimate version of that phenomenon will be when we finally implement dsb's exp(-size/c) filter :)
[17:35:51] patch is a one-liner, reasoning is in the next CACM
[18:25:41] 10netops, 10DBA, 06Operations, 10ops-codfw: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3194902 (10RobH) row a done
[19:12:20] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3195178 (10faidon) The latest from Juniper: ``` Faidon, I just got more information on this Case. The current PR tracking this issue is 1238906. Which is in...
[19:17:28] 10netops, 06Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3195205 (10akosiaris) >>! In T133387#3195178, @faidon wrote: > The latest from Juniper: > ``` > Faidon, > > I just got more information on this Case. > > The...
[20:48:00] 10netops, 10DBA, 06Operations, 10ops-codfw: db20[7-9][0-9] switch ports configuration - https://phabricator.wikimedia.org/T162944#3195612 (10RobH) row b done
[23:24:14] bblack: still around? could the upload 5xx alarm be related to cp2014 mailbox lag at 7300190?
[23:24:51] we also have the Codfw HTTP alarming btw
[23:55:15] 10netops, 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3196208 (10chasemp) FYI @andrew labservices1001 will be caught up in this as it lives in [[ https://racktables.wikimedia.org/index.php?page=rack&...