[05:08:59] good morning
[05:09:16] the transport between eqord and ulsfo (Telia) is still down
[05:09:33] and the last email from them is ~10h ago, saying that everything should be ok
[05:11:14] "Please bounce you port if it is still not up"
[05:11:15] :D
[05:14:03] XioNoX: --^
[06:08:41] interesting, we have this on the ulsfo front
[06:08:42] Laser receiver power : 0.0001 mW / -40.00 dBm
[06:13:54] followed up in the Telia email thread that Daniel started yesterday
[06:21:51] thanks!
[11:49:01] just an FYI all, I added a new feature to utils/pcc.py so that it supports something similar to "check experimental" (https://phabricator.wikimedia.org/T166066). If you add a `Hosts: ` entry to your commit message and run `./utils/pcc.py $change_number parse_commit`, PCC will parse that change's commit message for `Hosts: ` lines. Using this with `last` means that you can test the most
[11:49:07] recent push using `./utils/pcc.py last ...
[11:49:09] ... parse_commit`
[11:49:29] `./utils/pcc.py last parse_commit`
[12:48:19] that's super cool!
[12:52:15] indeed!
[12:52:45] nice, is that local only or will it work on ci too?
[13:00:20] nice work jbond42 :)
[13:59:06] nice, thanks jbond42 <3
[15:40:19] we don't do any traffic-layer request coalescing for logged-in-user requests, right? a session cookie is an immediate pass?
[15:40:48] cdanis: correct (based on the onboarding material)
[15:50:21] text-frontend.inc.vcl.erb is somewhat instructive on this
[15:59:01] correct, they're an immediate pass
[15:59:35] they're not terribly-efficiently implemented though, depending on what angle of this you're looking at....
[16:00:09] I'm asking because I've found evidence of some other coalescing or queuing happening at some other layer, comparing the traffic I'm sending to the frontends to packet captures on the poolcounters
[16:00:54] queueing could definitely happen
[16:00:59] (in a few places)
[16:01:00] ok!
[16:01:19] I'm sending queries with a max concurrency of 10, against a URL where PoolCounter should be enforcing a limit of 3
[16:01:19] the commentary at the top of evaluate_cookie in text-frontend tells you most of what's going on
[16:01:22] yeah :)
[16:01:36] weirdly I saw *less* queuing if I provided a session cookie, but not *no* queuing
[16:01:42] is that expected?
[16:02:06] are the queries of a kind that are normally cacheable for anon?
[16:02:34] or cacheable for logged-in, technically we have those too
[16:02:43] no
[16:02:45] < cache-control: private, must-revalidate, max-age=0
[16:02:57] we say all logged-in traffic is pass-mode because it's an easy approximation that's mostly true
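A quick way to observe the pass-for-logged-in behaviour described above from the outside (a sketch, not taken from the log; the article URL and cookie value are placeholders, and it assumes the text cluster's usual x-cache / x-cache-status response headers):

    # Anonymous request for an article: normally served from the frontend caches.
    curl -s -o /dev/null -D - 'https://en.wikipedia.org/wiki/Example' \
      | grep -iE '^(x-cache|x-cache-status|cache-control):'
    # Same URL with a session cookie: the traffic layer should pass it straight through.
    curl -s -o /dev/null -D - -H 'Cookie: enwikiSession=placeholder' \
      'https://en.wikipedia.org/wiki/Example' \
      | grep -iE '^(x-cache|x-cache-status|cache-control):'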
[16:03:07] what's the exception, stuff like load.php?
[16:03:09] but MW outputs which do not have "Vary: Cookie" actually can have cache hits for logged-in users
[16:03:23] right, I was guessing as much from evaluate_cookie
[16:03:50] so those will see some coalescing, and anything that can coalesce has at least one new way to have queueing due to coalesce attempts
[16:04:21] the Token=1 vary-slotting is a bit tricky and inefficient, too (but better than some other alternatives)
[16:04:53] because we're actually creating hit-for-pass objects in cache storage, for every cacheable object viewed by any logged-in user
[16:04:59] right
[16:05:13] The Token=1 hack at least keeps us from creating them per-session/user
[16:05:43] but there could be some black-box-observable "queueing" just from the acquisition of storage for those, for instance
[16:05:55] if the requests are otherwise fast, I guess
[16:06:32] there's probably a global lock to allocate the storage
[16:07:51] (at some level anyways. it's probably transient which is probably a bare malloc, but then malloc has to do something under the hood. it's jemalloc in the case of varnish, which does some kind of thread-pooling I think, which might help)
[16:08:07] but not as well as e.g. tcmalloc
[16:09:14] it might be interesting to first narrow down where it's happening by looking at the flow through varnish-be for these, if they're unique enough to find
[16:09:27] err I mean ats-be
[16:09:41] that could tell you if the queueing or whatever is in v-fe at all or not
[16:10:25] how do you normally peek at such things? tcpdump?
[16:11:12] I'm less-familiar in the ATS world, but you could use varnishlog to look at the output side of the varnish-fe in question
[16:11:15] ahh
[16:11:28] varnishlog -n frontend -b -q ....
[16:11:35] (-b for looking at the backend-facing request flow)
[16:56:58] * bd808 weirdly misses bblack's dissertations on the horror of varnish internals
[17:33:02] I should write up what I did here, it would be useful if other people ever want to trace Poolcounter traffic for particular key types
[17:33:08] even if it is a horrifying mess :)
[17:36:46] some sample output: https://phabricator.wikimedia.org/P11174
[17:37:21] the request concurrency is 10; mediawiki uses poolcounter to limit the execution concurrency to 3, which is plainly visible here
[18:14:25] https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production
[18:28:07] what was the short version? is there queueing somewhere?
[18:29:40] bblack: so when I run concurrency=10 HTTP queries directly on an appserver, I see 10 ACQ4MEs hit Poolcounter ~simultaneously, and then I see 3 requests finish, another 3 start, etc
[18:30:14] ok
[18:30:29] when I run concurrency=10 HTTP queries from home, not-logged-in, I saw *5* ACQ4MEs hit Poolcounter, and then the same pattern of 3 finish, 3 more start, etc
[18:30:41] and when logged in I saw 6 initial ACQ4MEs
[18:31:02] is some of that just latency effects though?
[18:31:27] it does take time for all 5 to serialize through various layers
[18:32:28] maybe? but the initial queries are issued within about 10ms of each other, and the responses take ~a second to be generated at the applayer
[18:33:54] so it shouldn't be that much of a factor
[18:33:56] ok
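A rough sketch of how a trace like the one pasted above can be captured (not the exact commands behind P11174; it assumes shell access to a poolcounter host and PoolCounter's plaintext protocol on its usual TCP port 7531):

    # Print the ASCII payload of traffic to the poolcounter daemon and pick out
    # the lock commands, so you can watch ACQ4MEs arrive while the test runs.
    sudo tcpdump -l -i any -A -s0 'tcp port 7531' \
      | grep --line-buffered -E 'ACQ4ME|RELEASE'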
[18:34:12] are they on separate TCP conns?
[18:34:39] yes
[18:35:19] yeah I donno then, clearly something is going on
[18:35:24] here, allow me to show you my embarrassing concurrency=10 HTTP scraper: (yes 'https://en.wikipedia.org/w/index.php?title=Special:Contributions/Xaosflux&offset=&limit=500&target=Xaosflux' | stdbuf -eL xargs -P10 -L1 curl -sv --trace-time -o/dev/null) |& egrep ' > GET | < HTTP/2 '
[18:35:39] heh
[18:35:41] (random enwiki user chosen by looking at main page contribs)
[18:36:09] but they all eventually succeed right?
[18:36:37] they do
[18:36:56] mine do too
[18:39:24] anyway, kind of strange, but not actually relevant to the thing I was really wanting to test, I think :)
[18:45:20] yeah, but it's the kind of thing that spawns interesting avenues and finds new dumb things to fix :)
[18:46:13] actually, the odds-on chance is it will uncover a deeply disturbing problem, which was documented 6 years ago in a 5-digit-numbered phab ticket that everyone long forgot about
[18:46:50] lol I was joking about 5-digit-numbered Phab tasks the other day
[18:46:59] they are the really scary ones
[18:47:04] yeah
[18:47:41] the scariness index of an open phab ticket is usually something like (10 - $digits)^2
[18:57:58] moritzm / jbond: I'm going through SRE onboarding and I'm at the step where I need to enable U2F for IDP. The steps to do so are here: https://wikitech.wikimedia.org/wiki/CAS-SSO
[18:57:58] Question (not at all time critical): Is U2F exclusively supported or is TOTP an option as well? I prefer to use TOTP where possible, but will go with U2F if that's not an option
[18:58:59] at this point the web SSO (for the services listed on wikitech) only supports U2F
[18:59:54] TOTP is supported code-wise, but not enabled at this point
[19:00:33] thanks for the reminder, I wanted to enable U2F for myself :)
[19:09:02] thanks moritzm
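For reference, the concurrency-10 test pipeline quoted at 18:35:24, reflowed for readability (same commands as in the log; to repeat the logged-in variant discussed above, add something like -H 'Cookie: ...' with a real session cookie to the curl arguments):

    # Fire an endless stream of identical GETs, at most 10 curls in flight at
    # once, printing timestamped request/response lines to watch the fan-out.
    URL='https://en.wikipedia.org/w/index.php?title=Special:Contributions/Xaosflux&offset=&limit=500&target=Xaosflux'
    ( yes "$URL" | stdbuf -eL xargs -P10 -L1 curl -sv --trace-time -o /dev/null ) \
      |& grep -E ' > GET | < HTTP/2 '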