[06:59:28] dcausse_: good morning!
[06:59:39] atsukoito: hey!
[06:59:56] morning backport session let's go
[07:00:08] atsukoito: just sent you a slack message
[08:02:32] dcausse: thanks for deployment and debugging session!
[08:08:57] atsukoito: np!
[09:32:20] errand+lunch
[10:39:29] dcausse: i wanted to ask you about wmf-opensearch-search-plugins, it seems like we only build it for opensearch1 and for bullseye
[10:41:47] what plugins will we need for ttmserver? cc ebernhardson Nikerabbit
[10:42:45] ah-h, i see, fuzzy_like_this and levenshtein_distance_score
[11:04:53] found plugins for opensearch2 and bullseye, will try those, and maybe rebuild it for trixie
[11:07:19] disregard this, it is present in the correct repo
[12:07:00] atsukoito: sounds good, pretty sure translate only needs the extra plugin but the wmf-opensearch-search-plugins debian package might be easier to use, since it has all the plugins we use in production
[12:07:47] errand again, back in ~45min
[13:20:56] back
[13:25:13] \o
[13:25:21] that did not fix cindy :(
[13:26:15] the UV_THREADPOOL_SIZE
[13:30:30] o/
[13:30:39] :(
[13:31:06] a bit clueless with cindy issues... was about to try hardcoding stuff in fresh /etc/hosts image to see if this helps
[13:31:27] or write a small nodejs loop to try reproducing
[13:32:05] hmm, i suppose that would probably be a reasonable step. I was going to try and change the `timeout:1 retry:3` in resolv.conf, increase the timeout, but hard to say. Indeed, it would help if it didn't take 15+ minutes to repro
[13:32:29] i have a tcpdump running now, but not sure it will answer anything. Might hint at what layer it disappears
[13:33:43] with sudo journalctl -u docker -f you definitely see some ns resolution being attempted on the search.eqiad1.wikimedia.cloud domain for the test wikis
[13:34:14] I wonder if we can blame defreitas/dns-proxy-server:3.5.2, all docker containers are using it
[13:36:48] it does seem plausible.
it might just be request frequency, but it seems like it always fails in the nodejs side of things
[13:37:03] we do way more requests w/ nodejs than the browser
[13:37:50] what I'm not sure about is that I see "options timeout:1 attempts:3 ndots:1" in fresh@/etc/resolv.conf
[13:39:03] yea i had wondered about that timeout:1 as well, to change that we have to compile a custom mwcli though. Shouldn't be hard
[13:39:16] it's because mwcli replaces its static files in the config directory on startup
[13:39:20] (or at least, it used to)
[13:39:36] was wondering if there was a way to override that
[13:39:59] dcausse: that's dns_opt in docker-compose
[13:40:57] tcpdump doesn't really say anything useful. Around the time the first error was printed, we have good sub-10ms req/resp cycles for the domain name it failed to look up :S
[13:41:07] yes, it'd be great if we could provide some overrides for the docker-compose files embedded in mwcli
[13:42:53] also that ndots:1 means we don't attempt all search domains for hosts like cirrustest.wiki.local.wmftest.net
[13:43:38] but I see some attempts on cirrustest.search.eqiad1.wikimedia.cloud. so there are some cases where we use the shortcut "cirrustest"
[13:44:03] but not sure these are the ones to blame, possibly unrelated to ns resolution done by node
[13:46:20] oh, surprised we are using the shortname :S
[13:46:49] or it's external traffic? I'm not sure, that's what I see running journalctl on docker
[13:46:55] oh, actually i do have a bunch of server failures in the tcpdump just before
[13:48:09] from defreitas/dns-proxy-server:3.5.2: Timed out while trying to resolve mailhog.search.eqiad1.wikimedia.cloud./AAAA, id=51928 class=IOException
[13:48:42] https://phabricator.wikimedia.org/P92510
[13:49:51] the dns-proxy is .10
[13:50:25] should we limit external traffic? we seem hammered by some worm or the like
[13:50:55] "hammered" is perhaps a bit strong
[13:51:01] so what is 127.0.0.11? I guess it's like a docker proxy?
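Side note on the ndots:1 remark at 13:42:53: a rough sketch of the order in which a glibc-style stub resolver tries names, following the behaviour described in resolv.conf(5). The function name `candidate_names` and the single search domain used in the example are assumptions for illustration (the search domain is inferred from the lookups quoted above, not read from the actual resolv.conf), and the resolver stops at the first name that answers, so later candidates are often never sent.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order (illustrative only)."""
    # A trailing dot marks the name as fully qualified: no search list at all.
    if name.endswith("."):
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search_domains]
    # With at least `ndots` dots, the name is tried as-is first and the
    # search list is only a fallback; otherwise the search list goes first.
    if name.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]


# Hypothetical search list, inferred from the log lines above.
SEARCH = ["search.eqiad1.wikimedia.cloud"]
short_order = candidate_names("cirrustest", SEARCH)
dotted_order = candidate_names("cirrustest.wiki.local.wmftest.net", SEARCH)
```

Under these assumptions the short name "cirrustest" goes to the search domain first, which would match the cirrustest.search.eqiad1.wikimedia.cloud attempts seen in journalctl, while the dotted wmftest.net name is tried as-is first.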
[13:51:31] hmm, are they publicly accessible? I thought we turned off those proxies
[13:51:33] checking horizon
[13:52:22] at a general level, the tcpdump makes it look like chrome started up, and the integration test suite was doing something. They didn't issue that many requests though, about 10 over 3s
[13:53:02] in the dns-proxy I see: # nameserver 127.0.0.11 # dps-comment
[13:53:11] in /etc/resolv.conf
[13:53:29] it looks like we do have proxies configured, although it 5xx's right now: cirrustest-cirrus-integ.wmflabs.org
[13:58:26] looking at the tcpdump, it did get a response but it took too long. There is a packet that would have answered an already-failed dns request, but it came 5s after requested
[13:58:57] from dps to the docker embedded server (apparently 127.0.0.11 is used for docker networking and resolves container names before forwarding)
[14:00:25] trying to write a reproduction, in theory fire off dozens of requests at once and see if it stalls?
[14:07:10] sure, I don't quite get why we need dps tho
[14:10:50] i suspect it's supposed to be a cache? but unclear as well.
[14:11:11] dcausse I left pairing, but if you have anything to discuss LMK
[14:11:50] inflatador: oh sorry, was distracted with cindy's issues
[14:12:39] dcausse np at all, we are all busy ;) ping me if I can help w/anything
[14:12:48] sure, thanks!
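Side note on the reproduction idea from 14:00:25 (fire off dozens of lookups at once and see whether any stall): a minimal sketch of that approach, written independently of any paste linked in this log. The names `burst_lookup` and `default_resolver`, the worker count, and the 1-second "slow" threshold are all assumptions for illustration; the real failures were seen from nodejs, so a Python loop may not exercise the same resolver path.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def default_resolver(host):
    # Blocking lookup through the system resolver (which reads /etc/resolv.conf).
    return socket.getaddrinfo(host, None)


def burst_lookup(hosts, resolver=default_resolver, workers=32, slow_after=1.0):
    """Resolve every host concurrently and report failed and slow lookups."""

    def timed(host):
        start = time.monotonic()
        try:
            resolver(host)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        return host, ok, time.monotonic() - start

    # One thread per in-flight lookup, so the burst really is concurrent.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed, hosts))

    failed = [host for host, ok, _ in results if not ok]
    slow = [host for host, ok, took in results if ok and took > slow_after]
    return {"total": len(results), "failed": failed, "slow": slow}
```

Inside the fresh container this could be called as `burst_lookup(["cirrustestwiki.mediawiki.local.wmftest.net"] * 64)`, repeated in a loop until `failed` or `slow` is non-empty. Passing a custom `resolver` makes it possible to swap in a different lookup mechanism without touching the timing logic.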
[14:12:53] probably more complex than necessary, but claude wrote this based on my description and reproduced in ~10s: https://phabricator.wikimedia.org/P92511
[14:13:35] as for the fix...i'm still unsure :S
[14:14:30] looks like there are some options to make chrome skip some of those lookups it does on startup, might help but is only a bandaid
[14:15:16] this does suggest if we lengthen the timeouts enough, it will probably work i guess
[14:15:56] still trying to find out where cirrustestwiki.mediawiki.local.wmftest.net is defined
[14:16:18] looking at sudo docker inspect mwcli-mwdd-default-mediawiki-web-1 I don't see any aliases for that
[14:16:32] dcausse: isn't *.local.wmftest.net a catchall?
[14:17:24] but it must say somewhere that it points to mwcli-mwdd-default-mediawiki-web-1 (10.0.0.4) no?
[14:19:55] hmm, i'm actually not sure :S anything under local.wmftest.net returns 127.0.0.1. But the browser is running in the fresh container, so it must route to mediawiki container somehow
[14:21:39] fresh targets the dps ns proxy container from resolv.conf
[14:21:42] yea, here i get 127.0.0.1, but inside the fresh container it gets 10.0.0.3
[14:22:48] sigh I hate these containers that do not even have "ps" :(
[14:23:12] indeed. the -f flag on ps from the host can help, but isn't great
[14:23:30] trying to understand how this java dns-proxy gets configured
[14:24:14] there is also a tedious nsenter way: PID=$(docker inspect -f '{{.State.Pid}}' mwcli-mwdd-default-mediawiki-fresh-1); nsenter -t $PID -n ps ax
[14:27:26] I'm trying to investigate in the mwcli-mwdd-default-dps-1 which as far as I can tell from the logs is a java based dns-proxy
[14:31:29] hmm, indeed not much for hints in that container :S no args passed to dns-proxy-server, config/ directory looks minimally configured. nothing interesting in the environ for dns-proxy-server
[14:33:10] the question is: is it this dns-proxy-server that does the mapping for the multi wiki domains?
[14:33:39] trying to see, this is the patch where they started using local.wmftest.net but i haven't quite decided where it happens: https://gitlab.wikimedia.org/repos/releng/cli/-/merge_requests/635/diffs
[14:34:01] commands/docker/hosts/add.go seems plausible
[14:34:19] and then the domain names get auto-added on wiki creation?
[14:35:30] :S That implies it re-writes the hosts files
[14:36:33] but haven't seen any hosts file with these names
[14:37:11] same
[14:42:25] going through the patch, nothing else stands out :S there is a bit for the nginx-proxy and the domains, but i don't think that's relevant
[14:48:15] i can't quite understand...it clearly knows via `./mw docker hosts writable` that the hosts file isn't writable, but unclear what it's doing instead
[14:49:32] there's some magic with this dns proxy: https://mageddo.github.io/dns-proxy-server/latest/en/5-tutorials/docker-reverse-proxy/
[14:49:50] it has an "admin" port but it's not exposed
[14:51:14] the domains end up here: cat ~/.config/mwcli/mwdd/default/record-hosts
[14:51:37] hmm, we could probably expose that ui somehow. might give insight
[14:51:54] this is on the host?
[14:52:01] yea that's on the host machine
[14:53:40] i see how destroy and foreachwiki use it...but unclear on dps still
[14:59:31] couple envs: mwcli-mwdd-default-mediawiki-web-1 has VIRTUAL_HOST=*.mediawiki.local.wmftest.net, mwcli-mwdd-default-nginx-proxy-1 has HOSTNAMES=.mediawiki.local.wmftest.net,keycloak.local.wmftest.net,dashboard.local.wmftest.net
[15:00:27] hmm, that sounds like a thread
[15:01:47] dps seems to inspect HOSTNAMES but not sure how it can access other containers' env vars
[15:03:08] chrome is being annoying, no mic detected...few minutes late
[15:03:17] Hi! Are you around for Wednesday meeting? We have guests :-)
[15:04:08] dcausse: ^^
[15:07:48] pfischer: oops joining
[15:22:57] do y'all know if we ran anything on cloudelastic recently?
We had an I/O stall for all the hosts about 45m ago https://grafana.wikimedia.org/goto/efly8xp3ewfeof?orgId=1
[15:23:19] query latency went up too https://grafana.wikimedia.org/goto/dfly8z2a4f7k0a?orgId=1
[16:13:49] There's a followup question that I don't know the answer to: Why is this user only getting the interlanguage search results /some of the time/ ? 🤔 - https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#c-Suffusion_of_Yellow-20260512211000-Quiddity_(WMF)-20260512210700
[16:22:14] Trey314159: ^ any idea? I would expect textcat to hit every time i guess
[16:22:32] I'll take a look
[16:37:44] heading out
[16:49:39] using the custom image with -XX:MaxHeapSize=200m (probably far too much, but whatever) seems to avoid the failures, the reproduction with 32 io threads ran for a minute with no failures. p99 is rough, from 50-250ms, but livable
[16:51:43] with 128 threads it pushed it into a fail after 80s, but that should be far more traffic than we send
[17:00:40] lol at claude's reason for the failure: A DNS server in C wouldn't have a heap. Choosing JVM + Quarkus + dependency injection for a protocol that fits in a UDP packet is the kind of decision that's fine until someone hammers it, and then it isn't.
[17:06:08] quiddity: I'm blaming our old frenemy shard term statistics. I can't replicate the problem, but I found a lot of similar examples of reloading the page and getting different results, which I think point to what's going on. In short, because of borderline term stats that vary across servers, some search servers get a suggestion and some don't, which overrides (or doesn't) the cross-language results.
[17:07:22] Speaking of java, is it still true that heap sizes over 32 GB don't do much good? Context is this CR https://gerrit.wikimedia.org/r/c/operations/puppet/+/1285926
[17:08:17] inflatador: it's not that it's "bad" necessarily, but when you go from 31G to 33G you probably have less heap space.
Because in 31G all pointers are 32bits, and in 33G all pointers are 64 bits
[17:09:35] if you want more details it's called compressed oops
[17:10:28] going from 20g to 96g you will absolutely get more space, at least double
[17:11:12] probably quadruple i guess
[17:12:20] ebernhardson ACK, ChatGPT tells me to try `java -XX:+PrintFlagsFinal -version | grep UseCompressedOops`
[17:13:02] probably reasonable, i think it's safe to assume any time heap is under the 2^32 limit compressed oops will be enabled
[17:13:56] got a pass from the test suite, started up the run-cindy.sh script and letting it vote again
[17:14:29] still need to write up a patch for the env that will auto-magically adjust the image and set the env var so we don't lose the fix when replacing the instance
[18:46:33] * ebernhardson oddly cannot find the adjusted notebook i used for the dym tests before...
[18:47:14] we have the main notebook in gitlab still, but i had adjusted it to handle the fact we had 3 test names, but they were all the same just different wikis with different configurations
[18:48:36] oh there it is, just hiding in a subdir. I should try to note use thoes :P
[18:49:01] * ebernhardson fails at typing...
[19:24:19] sigh, 47.2% of enwiki events have mismatch tests, which implies multiple buckets per user for some reason. Can never just run an old report and get results :(
[19:30:52] appears to come from data collection itself, so the javascript is seeing multiple buckets reported by the backend :(
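Side note on the bucket mismatch figure from 19:24:19: a rough sketch of how such a share could be computed. The log does not say how the notebook defines a mismatch, so this assumes an event counts as mismatched when its user shows up in more than one bucket; the function name `mismatch_share` and the `(user_id, bucket)` pair layout are hypothetical, not the notebook's actual schema.

```python
from collections import defaultdict


def mismatch_share(events):
    """Fraction of events whose user was seen in more than one bucket.

    `events` is an iterable of (user_id, bucket) pairs (assumed layout).
    """
    events = list(events)
    if not events:
        return 0.0
    # Collect every bucket each user was ever reported in.
    buckets_by_user = defaultdict(set)
    for user, bucket in events:
        buckets_by_user[user].add(bucket)
    # Count events belonging to users with conflicting bucket assignments.
    mismatched = sum(1 for user, _ in events if len(buckets_by_user[user]) > 1)
    return mismatched / len(events)


# Tiny made-up sample: user "a" appears in two buckets, "b" and "c" in one each.
sample = [("a", "control"), ("a", "test"), ("b", "control"), ("c", "test")]
share = mismatch_share(sample)
```

Grouping the same data by user instead of by event would show whether a few heavy users or most users are affected, which might help narrow down where the backend reports the extra buckets.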