[02:27:27] Krinkle: the modern otel name for the distinction is that Jaeger's "Tags" are directly on the span itself, whereas the "Process" ones come from a set of k/v pairs that are attached to the OTel Resource itself https://opentelemetry.io/docs/languages/js/resources/ (see the JS sketch below)
[02:28:24] basically a Resource is metadata about the thingy in production that is emitting the spans
[02:29:30] one of the reasons that's done is for efficiency of the wire protocol
[02:30:24] there's also a super-modern thing, which I don't even know if anyone actually uses, where a Resource can consist of a set of Instrumentation Scopes https://opentelemetry.io/docs/concepts/instrumentation-scope/
[02:32:29] ok so based on that JS example, it seems both the spans and the resource/process are emitted from the same source, possibly even submitted to Jaeger in the same HTTP body? So it's saying: here's some shared data for this reqId from my layer, and here are the spans I got during that same period within/under that.
[02:34:20] does that mean layers like Envoy, which convey most of their stuff implicitly, participate like any other server, just that they have basically no spans, only a process whose data they construct more or less implicitly from the trace* related headers they received?
[02:35:00] they participate like any other server; Envoy creates spans based on its timing of the HTTP requests it is receiving/sending
[02:35:11] in fact right now the only places that emit OTel anywhere in production are Envoys, for mw-* and also several other services
[02:36:01] ah right, it does have 1 span indeed. there's no distinction between spans inside a service vs the service's handling of a reqId overall. That just happens to be the root span.
[02:36:31] so I guess if Jaeger received a body with an empty array of spans and only process data, that'd probably be an error or not be visible.
[02:36:37] it'd be not visible, yeah
[02:37:51] is there some concept of root spans that e.g. means inside Jaeger we can see the difference between MW adding a span inside its reqId vs sessionstore/envoy adding stuff to that reqId out of band? Like is there some name or concept for "Jaeger got this stuff in the same Jaeger API submission from some source"
[02:39:41] there is a concept of a root span, yes
[02:39:58] we can see that difference because of the process block
[02:40:07] there's still a process block associated with every span
[02:40:21] I don't think you can tell it came in the same incoming batch or whatever, but you can tell it came from the same instance
[02:40:46] right, if you expand it, it either has the same process as other spans or doesn't; that's how you tell it came from another source.
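A minimal sketch of the Tags-vs-Process distinction discussed above, using the OTel JS SDK: the Resource k/v pairs end up in the shared "Process" block Jaeger shows for every span from this emitter, while setAttribute() calls show up as per-span "Tags". This assumes @opentelemetry/api, @opentelemetry/resources and @opentelemetry/sdk-trace-node (exact APIs vary between SDK versions), and the service/host values and span names are placeholders, not production config.

    import { Resource } from '@opentelemetry/resources';
    import {
      NodeTracerProvider,
      SimpleSpanProcessor,
      ConsoleSpanExporter,
    } from '@opentelemetry/sdk-trace-node';
    import { trace } from '@opentelemetry/api';

    // Resource attributes describe the process emitting the spans; in Jaeger
    // these surface as the "Process" block shared by all spans from this provider.
    const provider = new NodeTracerProvider({
      resource: new Resource({
        'service.name': 'mw-web',          // placeholder value
        'host.name': 'example-host-1001',  // placeholder value
      }),
    });
    // Older SDK versions use addSpanProcessor(); newer ones take spanProcessors
    // in the constructor instead.
    provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
    provider.register();

    // Attributes set on an individual span surface as that span's "Tags".
    const tracer = trace.getTracer('example-instrumentation');
    const span = tracer.startSpan('GET /wiki/Main_Page');
    span.setAttribute('http.method', 'GET');
    span.setAttribute('http.status_code', 200);
    span.end();

Because the resource is attached to the provider rather than repeated on every span, the exporter only needs to send those shared k/v pairs once per batch, which is the wire-protocol efficiency mentioned at 02:29:30.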
[02:41:18] right now you *can't* tell that, really, because in like 99% of traces the only thing sending any spans at all is the one Envoy on the mediawiki pod handling the query
[02:41:33] for mw-* and also several other services, when the mesh Envoy receives a req on its TLS-termination listener intended for the service it's running as a sidecar of, we make a sampling decision (set to 100% for mw-debug, set quite low, I think 0.1%, for prod mediawikis)
[02:41:58] then if sampling is enabled, we create a span; if we were the ones who enabled sampling it will be the root span
[02:42:16] those envoys also emit spans for any requests that the application makes to other services via that envoy, as long as the application propagates traceparent (see the sketch at the end of this log)
[02:43:30] ok so a root span in this case is 1:1 with a page in Jaeger / a reqId.
[02:44:33] the search UI btw is searching spans (not just root spans)
[02:44:36] I meant a name/concept for one or more spans for a given reqId from the same producer/source. E.g. would it visually look different if MW emitted the span for sessionstore instead of envoy (apart from the process tab inside the expanded view, and the difference in latency numbers)
[02:44:50] so if you search for, like, shellbox-syntaxhighlight you'll see the root spans labeled as mw-web, mw-jobrunner etc
[02:45:27] https://trace.wikimedia.org/trace/eeeeafd20f1bd7b409408b7fd247b625 and this is what it looks like, with the added complication that we apparently get the tracing service name wrong on the shellboxen
[02:45:30] I'll fix that next week
[02:46:06] ah okay, so that is the root node indeed.
[02:46:32] https://i.imgur.com/6m2ffX1.png
[02:47:04] two sides of the same HTTP request
[02:51:52] oh, and, a particularly egregious example https://phabricator.wikimedia.org/T368064
[03:43:45] cdanis: nice find.
[03:43:54] I feel like we've come across that before not too long ago.
[03:47:44] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731861
[03:47:45] ref T288848
[03:47:46] T288848: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848
[03:49:06] on baremetal the envoy for "mwapi" (loopback use case) pointed to api-rw.discovery.wmnet
[03:49:42] which was, I guess, dc-local kind of, but also 2021 is when that went out, so pre-multidc
[03:58:10] I can't find it, but I recall that we could, or perhaps even were going to, set this to local rw always and/or set it to ro (same thing effectively), based on the understanding that 1) POST loopbacks generally only happen in POST external requests anyway, so it's 99% correct (if at all; I can't think of any POST loopbacks right now), and 2) unlike pre-multidc, where the secondary dc was effectively read-only with MW config pointing to
[03:58:11] a local read-only primary, since multi-dc we no longer do this: MW is technically writable everywhere. If a write happens in the secondary DC, it'll go to the real primary db; this is slower of course but it's allowed / works.
[03:58:58] so setting the mwapi envoy to mw-ro (assuming mw-ro is mapped to the local dc) should be fine. In the (probably not possible?) case where it does a POST loopback from a GET, it'll still work.
[03:59:29] And while I can't think of a POST-from-GET loopback example, we do have a small trickle of db writes from GET requests, namely post-send deferred updates.
[04:00:09] and those naturally go across DCs when they happen.
[04:00:48] I guess we either never did that yet, or found a reason not to, or undid it in the move to k8s.
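A minimal sketch of the traceparent propagation mentioned at 02:42:16: if the application copies the incoming W3C traceparent/tracestate headers onto requests it sends out through its sidecar Envoy, that Envoy's client spans join the same trace. The port, path, and toy proxy setup here are made-up placeholders, not the real mesh listener config.

    import * as http from 'node:http';

    // Copy the incoming W3C trace context headers onto an outbound request so
    // the sidecar Envoy (which actually emits the spans) can join the same trace.
    function traceHeaders(incoming: http.IncomingMessage): Record<string, string> {
      const out: Record<string, string> = {};
      for (const name of ['traceparent', 'tracestate']) {
        // traceparent format: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
        const value = incoming.headers[name];
        if (typeof value === 'string') {
          out[name] = value;
        }
      }
      return out;
    }

    // Toy server: proxy a call to another service through the local mesh Envoy,
    // forwarding the trace headers it received.
    http
      .createServer((req, res) => {
        http.get(
          {
            host: '127.0.0.1',
            port: 6500,        // placeholder for the local Envoy listener port
            path: '/healthz',  // placeholder path
            headers: traceHeaders(req),
          },
          (upstream) => upstream.pipe(res)
        );
      })
      .listen(8080);

If the application drops these headers, the Envoy on the far side still creates its own span, but it starts a new trace instead of appearing under the caller's root span.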