[15:04:29] godog: o/
[15:04:38] can I use one of the existing swift accounts to test, or should I make a new one?
[15:05:00] also, what's with the user names?
[15:05:04] e.g. pagecompilation:zim
[15:05:08] why the two parts?
[15:09:36] oh hmmm, i see
[15:09:39] an account can have multiple users?
[15:10:31] ottomata: yeah exactly, one account and multiple users
[15:10:50] but yeah please make a new one, containers are isolated too that way
[15:10:53] ok, not sure what the user will be used for yet, but I will make an analytics:admin for now
[15:11:05] and we can add other users for specific purposes if we need to
[15:11:09] sounds good to me!
[15:17:23] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/512184 and https://gerrit.wikimedia.org/r/c/labs/private/+/512183/1/hieradata/common/swift/params.yaml
[15:17:43] will add a key to private too if that looks ok to you
[15:18:25] godog: do you just randomly generate a key?
[15:22:41] looks like it. proceeding, feel free to post review.
[15:26:58] ottomata: yeah keys are random, and patches LGTM!
[15:27:10] danke!
[15:27:43] ottomata: I will be off tomorrow btw, back on Mon tho
[15:27:56] i'll be off tomorrow too, and then at the analytics offsite next week
[15:28:02] so i'll just be testing this oozie stuff today
[15:28:18] but then get into actually using it after june 2
[15:28:34] godog: is there anything I need to do to add accounts other than let puppet run?
[15:29:30] ottomata: a puppet run and a rolling restart of swift-proxy on the ms-fe hosts, with depools of course
[15:29:37] oo
[15:30:00] ok, there are 4 in each of codfw and eqiad, ya?
[15:30:11] and a simple depool; sudo service swift-proxy restart; pool
[15:30:13] will be ok?
[15:30:32] yeah, 4 per site, and that should work, yes
[15:30:41] ok
[15:30:48] godog: you ok with me doing that now?
[15:30:48] with a little time for traffic to come back heh
[15:30:52] sure
[15:31:14] ottomata: should be safe yeah, try codfw first
[15:31:16] k
[15:31:27] i have to run now though
[15:31:48] k
[15:35:43] hmm, depool doesn't seem to be depooling...
[15:35:46] https://config-master.wikimedia.org/pybal/codfw/swift
[15:37:28] did depool output anything at all?
[15:38:08] ottomata: if you ran 'sudo depool' and it didn't output anything, try 'sudo -i depool'
[15:38:19] it did output
[15:38:22] Depooling all services on ms-fe2005.codfw.wmnet
[15:38:24] i will try -i
[15:38:39] that is better!
[15:38:39] codfw/swift/nginx/ms-fe2005.codfw.wmnet: pooled changed yes => no
[15:38:39] codfw/swift/swift-fe/ms-fe2005.codfw.wmnet: pooled changed yes => no
[15:38:58] editing docs...
[15:39:01] yeah, it needs the environment variables to get access to etcd, and unfortunately 'depool' doesn't output anything on error
[15:41:29] thanks
[15:49:25] (we should probably fix the latter, IMO it should say something if talking to etcd fails for whatever reason)
[15:49:26] hmmmm, i think something went wrong on ms-fe1005.
[15:49:29] i just depooled again
[15:49:43] after depool and swift-proxy restart
[15:49:56] i saw a lot of ConnectionTimeout in the logs.
[15:49:58] is that normal?
[15:50:12] actually i see a good number of them on e.g. ms-fe1006 too
[15:50:31] cdanis: ^ do you know?
[15:50:43] taking a look
[15:50:50] sorry ^^^^ 'after depool; swift-proxy restart; pool'*
[15:52:23] godog: hm, dispersion in eqiad has been unhappy for a week now?
[15:54:55] ottomata: looks like it is mostly errors talking to a single object server, ms-be1033
[15:55:09] ok phew, well, not caused by my restart then at least
[15:55:11] ?
[15:55:15] no, I do not think so
[15:55:33] oh, and that server is down
[15:56:15] let me take a look at that
[15:56:21] ok cool, i will wait.
[15:56:25] cdanis: should I repool 1005?
[15:56:29] yes, that seems fine to me
[15:56:31] ok
[16:00:17] ah
[16:00:19] https://phabricator.wikimedia.org/T223518
[16:08:14] ottomata: bblack: gentle ping on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511690/
[16:13:33] ohhhh
[16:13:34] hm
[16:14:17] i agree that'd be useful, hm
[16:15:42] cdanis: i don't suppose we could stick that in the X-Cache response header somehow?
[16:15:48] instead of adding a new field?
[16:16:13] e.g.
[16:16:35] cp1083 miss, cp1075 miss, mw1001 backend
[16:16:38] or something?
[16:16:39] or final?
[16:16:43] or whatever the proper term might be?
[16:16:54] I'll let bblack weigh in on that one
[16:17:01] k, will comment on the patch
[16:17:39] this sounds like reinventing request tracing! :)
[16:18:36] (which isn't live yet either, but still)
[16:18:57] it is a very poor substitute for such, but it seemed useful still :)
[16:19:01] there's another effort around attaching a UUID to requests at the edge and transiting it through all our layers and having them emit trace events to link it all up
[16:19:41] (every layer either tracks the existing UUID it finds on the request or generates a new one if it's missing (e.g. internal svc->svc reqs)), and then that goes to some opentracing-like stuff with spans
[16:19:59] perf team is working on that, I think we have a task open at the traffic layer to start generating the uuid
[16:20:15] but yeah, this still seems useful
[16:20:18] neat, I had been following some of those tasks but wasn't sure how actively they were being worked on
[16:20:37] not sure tbh!
[16:21:08] I'd say let's not mess with X-Cache though. Lots of other bits within and without traffic parse the X-Cache header and expect it to look a certain way.
[16:21:18] aye k.
[16:21:29] I think this info would be useful even with the x-request-id work
[16:21:36] tacking on a Server field to webrequest is useful
[16:21:45] just keep in mind it won't always be there
[16:21:55] it'll be nice to know the backend server without having to join on other tables
[16:21:57] aye
[16:22:01] only on misses, ya?
[16:22:14] cdanis: will need to discuss with the analytics team about that, since it is a webrequest schema change
[16:22:17] not urgent, ya?
[16:22:20] no, it'll be there on hits too presumably, but the mw backend server in question has to successfully respond and identify itself for it to appear
[16:22:37] not urgent, no
[16:22:40] oh right, the response was originally served by mwxxx
[16:22:46] varnish doesn't pick an MW server. Varnish hits LVS, which randomly hits an mw* server, which then responds and includes the Server: mw1001 header or whatever.
[16:22:59] ohhh hm.
[16:23:04] so if mw* fails to respond, you get nothing
[16:23:13] but if mw* gives an explicit 500 response, you will get a Server header
[16:23:22] cdanis: can you make a phab ticket and attach that patch to it?
[16:23:26] sure
[16:23:27] and add the Analytics tag?
[16:23:38] so it can help sync up the 500s, but it can't help track down a lot of 503 cases
[16:23:44] cdanis: is it ok if I proceed with rolling restarts of the remaining eqiad swift proxies?
[16:23:50] ottomata: sgtm
[16:24:07] we don't really have any good way to trace unresponsive 503s
[16:24:42] yeah, I had thought of this while trying to track down 500s
[16:24:42] only LVS knows which unresponsive server the request was routed to, and it only operates on the IP/ethernet header layers, it can't see inside the traffic or modify it.
[16:25:10] I've seen one or two cases where -- while it's really hard to be sure about this! -- we served a lot of 500s in a short interval, and the only thing obviously wrong at the lvs/pybal level was a single appserver on the fritz
[16:25:25] right
[16:25:47] you'd think those 500s would hit the mw fatal logs too, but maybe not if the 500 happens up in apache/fastcgi and not MW
[16:25:51] right
[16:26:37] the appserver brokenness in question at least once was apache being unable to reach hhvm via fastcgi
[16:27:58] ottomata: will any Analytics stuff get upset if a new field starts appearing in the input JSON? mostly I was imagining using this via logstash
[16:29:08] i think the field will be ignored
[16:29:14] iirc
[16:29:16] not 100% sure of that
[16:29:22] ok, I'll add a question to the task
[16:29:25] but, webrequest doesn't go to logstash anyway
[16:29:32] varnish50x does
[16:29:33] too much
[16:29:35] ohhh
[16:29:36] hm
[16:31:17] the data might be useful in webrequest anyways
[16:31:19] filed https://phabricator.wikimedia.org/T224236
[16:31:36] yeah, for sure, just wondering if that means the analytics schema changes need to gate the submission of this change
[16:31:49] I think they ignore new fields by default, but yeah, should check
[16:32:01] and then we'll want to ask them to add it so we can query on turnilo and such
[16:32:09] yeah
[16:36:58] bblack: what's the eventual procedure for rolling out varnishkafka changes like this one?
[16:38:55] I think it's just a normal puppet-merge and puppetization on our end, like any normal change
[16:39:05] can't say about the uptake side in analytics though, I dunno
[16:40:53] it's pretty simple i think, at least for hive.
[16:41:17] bblack: so puppet manages restarting varnishkafkas and that's pretty un-impactful?
[16:41:18] if we eventually merge the webrequest schema into the other event schemas, the schema changes should be more automated
[16:41:28] ya, that is very unimpactful
[16:41:31] cool
[16:41:36] the worst case is that logs don't get sent during the restart
[16:41:42] vk just reads the varnish shared memory log
[16:41:46] so it is decoupled from varnish
[16:42:05] what does that shared memory look like, a ring buffer? does vk know where it 'stopped' reading?
[16:43:12] ya, ring buffer i believe
[16:43:22] and no, it doesn't. it just starts reading and continues
[16:43:27] i believe at the end
[16:43:37] ah, got it
[16:47:06] yeah, I suspect we lose a few reqs, but whatever
[16:47:17] we have bigger problems than that :)
[16:48:03] varnish's shmlog buffer and how client libraries interact with it is interesting to study :)
[16:48:13] but only if you don't value your sanity
[16:49:04] https://varnish-cache.org/docs/trunk/reference/vsm.html is a start, but only the source can really tell you
[16:52:25] ... interesting indeed
[16:54:04] I would call it crazy, but I've done crazier things in the name of avoiding the traditional problems of locks, so I'm not one to talk!
[16:57:45] gdnsd's source has this lovely variable declaration in it:
[16:57:49] static __thread volatile sig_atomic_t thread_shutdown = 0;
[16:57:54] ahah
[16:58:12] which is the worst abuse of type-qualifier bingo or whatever I think I've ever written
[16:58:26] and googling any part of that will lead to all kinds of answers about how whatever you're doing has to be wrong
[16:58:36] but in this particular case, it's all correct and appropriate.
[16:58:48] https://github.com/gdnsd/gdnsd/blob/master/src/dnsio_udp.c#L105
[17:00:08] I remember from my past job a bunch of code and macros that were similar to a lower-level version of std::atomic, surrounded with all sorts of notes about how you almost certainly should never, ever use them
[17:00:38] C11 now has atomic stuff built-in too
[17:01:05] oh, cool! I hadn't been following any of that since C99
[17:01:10] but even their loosest version of atomics is more expensive than necessary for some things :/ (e.g. it still uses LOCK instruction prefixes where they're not necessary for the use-case)
[17:02:21] they consider all the normal memory models and such, but these things rarely consider the most-efficient case where there's actually only one possible writer.
[17:02:53] mm, nod
[17:04:18] (or where you don't actually care that newly-written values propagate to other threads/cpus instantly, only that the write is "atomic" in the sense of being tear-free... e.g. that incrementing a 32-bit or 64-bit counter by 1 won't show another thread a state where the bits are half-updated for the increment and showing a completely erroneous value)
[17:07:15] but using such primitives correctly is often hard regardless of the tools/language. I can see why, for general-purpose APIs, the cheapest atomics want to at least guarantee two writers don't stomp on each other to produce garbage, and that propagation happens in a timely fashion with respect to ordering guarantees, etc
[17:09:33] but if you want the tightest loop possible, while your thing has multiple threads and emits stats and can coordinate global shutdown correctly, etc... sometimes you need them.
[17:11:18] gdnsd's UDP threads are like that. If you strace them on a busy server, all you ever see is a neverending sequence of recvmmsg() and sendmmsg() syscalls, and they never invoke locks or stalls or spins of any kind outside of calling those two syscalls, even though they're emitting stats counters that can be gathered across threads, and they can coordinate thread shutdown quickly and correctly, and the response packets contain data from things that can be updated at runtime, like resource states and geoip database updates, etc.
[17:11:56] (the latter bits being handled by QSBR-style RCU)
[17:14:33] I managed to benchmark one thread out to where it fully saturated a CPU core on my laptop. over the loopback interface it was responding to 450K reqs/sec.
[17:15:23] (and is incapable of having any kind of hiccup on that, even when you reconfigure things or load new data, or even restart the whole daemon with a new version of the binary deployed)
[17:17:10] in theory these techniques are universal, but building the right abstractions to operate this way when doing much higher-level application things is hard.
[17:17:43] lock-free accumulating counters I've seen, but lock-free data reloads are tremendously hard
[17:18:00] well, RCU fixes that part
[17:18:11] https://liburcu.org/ !
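
(Editor's note: the "tear-free, single-writer counter" idea above can be sketched with C11 stdatomic and relaxed ordering. This is an illustration only, not gdnsd's actual stats code; the udp_stats_t / packets_handled names are invented. The point is that a counter with exactly one writer only needs tear-free loads and stores, which on x86-64 compile to plain MOVs, while even a relaxed atomic_fetch_add() still emits a LOCK-prefixed read-modify-write.)

    /* Sketch: one I/O thread owns the counter and is its only writer; a
     * stats-reporting thread reads it from elsewhere.  Relaxed ordering is
     * enough because we only need tear-free access, not instant visibility
     * or cross-variable ordering. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        atomic_uint_fast64_t packets_handled;  /* written by exactly one thread */
    } udp_stats_t;

    /* in the (single) I/O thread: */
    static inline void stats_bump(udp_stats_t *s) {
        /* separate load + store instead of atomic_fetch_add(): legal only
         * because this thread is the sole writer, and it avoids the LOCK'd
         * read-modify-write that fetch_add would still generate on x86 */
        uint_fast64_t v = atomic_load_explicit(&s->packets_handled,
                                               memory_order_relaxed);
        atomic_store_explicit(&s->packets_handled, v + 1,
                              memory_order_relaxed);
    }

    /* in the stats reporter thread: */
    static inline uint_fast64_t stats_read(udp_stats_t *s) {
        return atomic_load_explicit(&s->packets_handled, memory_order_relaxed);
    }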
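
(Editor's note: and a minimal sketch of the liburcu QSBR pattern being referenced, assuming the classic unprefixed API pulled in via #include <urcu-qsbr.h> (newer releases also ship it as <urcu/urcu-qsbr.h>) and linking with -lurcu-qsbr. The struct config and the function names are invented for illustration; this is not gdnsd's code.)

    /* QSBR flavor: readers never take a lock; instead each reader thread
     * periodically announces "I'm at a quiescent state" (holding no RCU
     * references), and synchronize_rcu() in the updater waits until every
     * registered reader has passed through one. */
    #include <urcu-qsbr.h>
    #include <stdlib.h>

    struct config { int some_tunable; };        /* hypothetical reloadable data */

    static struct config *live_config;          /* RCU-protected pointer */

    /* reader thread */
    void reader_loop(void) {
        rcu_register_thread();                  /* once per reader thread */
        for (;;) {
            struct config *cfg = rcu_dereference(live_config);
            /* ... handle one request using cfg ... */
            (void)cfg;
            rcu_quiescent_state();              /* no RCU references held here */
        }
        /* rcu_unregister_thread() on the way out */
    }

    /* updater, e.g. a config reload */
    void publish_config(struct config *fresh) {
        struct config *old = live_config;
        rcu_assign_pointer(live_config, fresh); /* readers see old or new, never torn */
        synchronize_rcu();                      /* wait out a grace period */
        free(old);                              /* no reader can still hold it */
    }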
[17:18:26] I just need someone to integrate that deeply and correctly with Rust :)
[17:19:08] and the QSBR variant in particular is what's magic
[17:19:25] the others are "nice", but QSBR is the deep magic part, and is kinda intrusive to abstract over
[17:19:50] (the reader threads need to explicitly declare idle-points where they're not referencing the updateable data, e.g. between requests or whatever)
[17:21:59] seems like a thing you could do right before calling recvmsg
[17:25:50] yeah
[17:26:20] it ends up being complicated in practice, especially when you mix in the desire to have threads shut down in an orderly way (finish their current transaction + send) on command in a timely fashion
[17:26:55] because if the thread goes idle (no requests) for a while, you'd be stuck waiting forever in recvmsg() and not declaring idle points to let data updates flow
[17:27:29] and you don't want to spam them in nonblocking mode without ever stalling in the kernel either; that just wastefully spins fast
[17:28:47] so it has a system of switching between long and short timeouts on the recvmsg() operation. So long as there's no more than a ~100ms gap between arriving packets, it just marks the idle point before each blocking recvmsg()
[17:29:46] if recvmsg() blocks for a full 100ms, then it switches the thread's RCU support "offline" (which is relatively expensive, but still cheap) so it doesn't stall writers, and then blocks for ~3s at a time in recvmsg() while waiting for either a thread termination signal or a new packet (which instantly switches it back to the fast mode)
[17:31:13] the terminating signal will end the blocking wait anyways, except in rare race conditions, so that's why it's only ~3s... it's a tradeoff that in a rare race, the thread could take up to ~3s to shut down on command. But waking up to re-enter recvmsg once every 3s isn't all that costly, either.
[17:33:18] but that kind of bullshit level of optimization is really only realistic for UDP
[17:33:40] if you were abstracting this for a more-generic application eventloop, you'd do more like gdnsd's TCP code
[17:33:54] going RCU offline is expensive enough that you only want to do it in the 'long' case?
[17:34:08] (which uses eventloop hooks to go offline when stalled on external events and/or quick quiescent states as above, but simpler and integrates better)
[17:34:42] to take RCU offline/online, it has to actually update a shared data structure. Which is lock-less, but still incurs some data synchronization costs.
[17:35:06] sure
[17:35:43] rcu_quiescent_state() is the "I'm momentarily at an idle point" call, vs rcu_thread_online()/rcu_thread_offline() if you know the thread's going to be stalled a while and not reading RCU-updatable data
[17:36:01] (which updates the thread-shared data structures to let the writer thread know not to wait for it)
[17:37:43] ideally, you did this because you have lots of traffic, and all you ever need is rcu_quiescent_state(). The other things are just so you don't have to put a giant asterisk next to your software that says "Oh, but if you stop sending it fast traffic, everything goes haywire and your data updates take forever and the thing never shuts down", etc
[17:38:25] I thought about just having another thread send it priming traffic too (some packet it knows to ignore), but that seemed pretty unnecessarily complicated.
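
(Editor's note: a compressed sketch of the fast/slow loop shape described above, not gdnsd's actual dnsio_udp.c. It assumes SO_RCVTIMEO-based receive timeouts, a plain recv() where the real code batches with recvmmsg(), the round 100ms/3s numbers from the discussion rather than the prime-flavored values gdnsd really picks, and a thread that has already called rcu_register_thread().)

    #include <urcu-qsbr.h>      /* rcu_quiescent_state(), rcu_thread_offline(), ... */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdbool.h>

    /* set from a signal handler delivered to this thread, as in the gdnsd
     * declaration quoted earlier */
    static __thread volatile sig_atomic_t thread_shutdown = 0;

    static void set_rcv_timeout(int fd, time_t sec, suseconds_t usec) {
        struct timeval tv = { .tv_sec = sec, .tv_usec = usec };
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }

    void udp_thread_loop(int fd) {
        char buf[65536];
        bool fast = true;                        /* saw traffic recently */
        set_rcv_timeout(fd, 0, 100 * 1000);      /* ~100ms "short" timeout */

        while (!thread_shutdown) {
            if (fast)
                rcu_quiescent_state();           /* cheap idle-point marker */

            ssize_t len = recv(fd, buf, sizeof(buf), 0);

            if (len >= 0) {
                if (!fast) {                     /* waking up from the slow path */
                    rcu_thread_online();         /* back online before touching
                                                    RCU-protected data again */
                    set_rcv_timeout(fd, 0, 100 * 1000);
                    fast = true;
                }
                /* ... parse the request, build and send the response ... */
            } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                if (fast) {                      /* idle for ~100ms: stop making
                                                    RCU writers wait on us */
                    rcu_thread_offline();
                    set_rcv_timeout(fd, 3, 0);   /* ~3s "long" timeout */
                    fast = false;
                }
            }
            /* EINTR (e.g. the shutdown signal) falls through and re-checks
             * thread_shutdown */
        }
        /* cleanup (rcu_thread_online()/rcu_unregister_thread()) omitted */
    }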
[17:39:01] would a thread that was taken offline (in the UDP mode) be a bit slower to reply to its first message when in recvmsg(), because it has to restart the RCU and re-read the data?
[17:39:12] yes
[17:39:29] but "a bit slower" is pretty negligible in practice
[17:39:54] it has to update a shared data structure, which involves some pinging about cache sync between CPUs in a multi-core/numa system, etc
[17:40:51] sure, I imagined it was fast enough
[17:40:51] the worst-case scenario is your rate of traffic happens to exactly map to the chosen timeouts (e.g. you're getting 1 packet exactly every 101ms), but then at 10pps you don't really need amazing efficiency, and the latency cost of the offline/online is trivial compared to network latencies.
[17:41:30] but that's also why the timeout values for the long and short timeouts are both carefully chosen to avoid stupid patterns
[17:41:48] cf. the deranged commentary circa https://github.com/gdnsd/gdnsd/blob/master/src/dnsio_udp.c#L58
[17:42:12] I picked numbers that come up as primes simultaneously at the seconds, milliseconds, and microseconds resolutions.
[17:42:24] lol
[17:42:51] <3
[17:46:57] the TCP stuff is similar-but-different
[17:47:09] it's using a proper eventloop library over epoll() and such.
[17:47:58] when re-entering the eventloop after processing the current events, if there are already more events pending it does the rcu_quiescent_state(), but if there are no events pending and it needs to block, it goes offline until it gets awoken by timeout/epoll stuff.
[17:48:48] but probably, given the relative heaviness of the TCP loop (it even dynamically allocates memory, the horror!), honestly just going offline every time you re-enter the loop might've been fine.
[17:49:24] but that's more the style of integration/abstraction you'd expect with higher-level application code exploiting stuff like this
[18:02:09] * volans 12605 bytes saved
[18:02:38] the macros aren't very readable without digging elsewhere in that repo too..
[18:02:58] but you can see the implementation of the 3 qsbr operations (online, offline, quiescent state) as they're used in practice, in the inlines defined in:
[18:03:01] https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h
[18:03:17] they're the 3 functions at the bottom of that, but they call some of the other bits
[18:03:32] and then you have to kinda parse out what CMM_LOAD_SHARED and stuff means in arch-specific terms on x86_64, etc....
[18:03:43] but it's various lock-free memory sequencing stuff
[18:03:52] (and barriers, etc)
[21:35:39] Great podcast about cosmic rays impacting computers: https://www.wnycstudios.org/story/bit-flip
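
(Editor's note: to close out the TCP/eventloop part of the discussion from 17:47, here is a rough sketch of that integration style using raw epoll rather than the eventloop library gdnsd actually uses; the function and variable names are invented. The idea: announce a quiescent state between batches of events, and only go fully RCU "offline" when the loop is about to block with nothing pending.)

    #include <urcu-qsbr.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void tcp_thread_loop(int epfd) {
        struct epoll_event evs[MAX_EVENTS];

        rcu_register_thread();
        for (;;) {
            /* non-blocking poll first: is anything already pending? */
            int n = epoll_wait(epfd, evs, MAX_EVENTS, 0);

            if (n == 0) {
                /* nothing pending: don't make RCU writers wait on us while
                 * we sit blocked in the kernel for who-knows-how-long */
                rcu_thread_offline();
                n = epoll_wait(epfd, evs, MAX_EVENTS, -1);   /* block */
                rcu_thread_online();
            } else {
                /* still busy: a cheap "idle point" announcement is enough */
                rcu_quiescent_state();
            }

            for (int i = 0; i < n; i++) {
                /* ... handle evs[i]: accept, read a request, write a response,
                 * dereferencing RCU-protected config/state as needed ... */
            }
        }
    }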