[10:19:01] lunch
[13:21:56] .o/
[13:36:33] o/
[14:00:58] \o
[14:10:36] o/
[16:00:00] teacher conference, back in ~1h
[16:56:58] back
[17:09:15] * ebernhardson remembering now that trying to capture updates via tcpdump is tedious...because they go to 50 different machines :P
[17:09:16] Trey314159: just to check if expected, while trying to add a cindy test case using russian mapping I stumbled on 'ь' preferring '{' over 'm' (was trying to add a test case using existing page i.e. мутщЬ -> venom)
[17:11:02] random, maybe bad, thought: some use cases want original html...we could just store that in an unindexed field. But then we are getting off the path of providing search features into something else.
[17:11:20] dcausse: hmmm... let me take a look
[17:12:56] ebernhardson maybe a job for distributed tracing? Not that I know much about jaeger, but happy to take a look if it could be useful for tracing the lifecycle of an update
[17:13:40] inflatador: i'm not really trying to track a specific update, more i'm looking for requests from streaming updater that apply __DELETE_GROUPING__, to attempt to find an example of an update that does the wrong thing
[17:14:02] (we have evidence in the index that __DELETE_GROUPING__ is not always applied and sometimes gets indexed, but i can't reproduce so trying to find production evidence of it happening)
[17:14:34] ah, damn
[17:16:06] captured 5 minutes on three different machines, and none match the wireshark filter `tcp contains DELETE` :( But these are probably not that frequent..
[17:17:28] I kinda doubt we're capturing internal traffic with netflow, but let me ask in #wikimedia-sre
[17:17:53] it probably wouldn't help anyways, the data is encrypted over the wire. I tcpdump the `lo` interface between nginx and opensearch to get the unencrypted requests
[17:18:12] oh good point
[17:18:41] dcausse: I locally added мутщь/venom as a test case and it works. Your example above is мутщЬ, which maps to venoM (capital Ь). Is there any chance you used мутщЪ (which does map to `veno}`)?
[17:18:48] maybe a job for ansible then...let me check
[17:19:14] annoyingly it would probably be more specific to tcpdump from the updater, because it would capture everything. but that side is encrypted (and i doubt the container has tcpdump...but maybe some sidecar can be attached)
[17:19:26] otherwise, something very fishy is going on.
[17:19:48] Trey314159: sorry it's 'мутщь' I tested
[17:20:12] ь is ambiguous and seems to prefer { rather than m
[17:22:41] hm.. seems like we don't get the same behaviors, pasted your earlier comment and I see veno{
[17:23:50] re: tcpdump, we could do something like https://gist.github.com/xpi-d/d39f61f2a4fe44fd8a702b39d8454e9a but it's still gonna be 50 different pcap files to review
[17:23:55] the mw-api might apply some unicode normalization while parsing params, perhaps that's it? checking
[17:26:58] dcausse: yeah, something weird is happening. Converting ь to (uppercase) Ъ is very weird. I don't really see any ambiguity in the mapping in the code, but I will keep poking.
[17:28:10] Trey314159: is I add `if ( !isset($dwim[1][$c2]) ) { $dwim[1][$c2] = $c1; }` in stringToWrongKeyboardMaps I get "m" as the preferred transformation, somehow ь is ambiguous in the mapping
[17:28:15] s/is/if
[17:28:40] inflatador: hmm, i suppose something like that might work. I've used `clusterssh` for similar purposes before but it's super awkward with 50 hosts (it opens 50 ssh sessions in different windows, and has a place where one input types into all of them)
[17:30:01] i suppose i should also check...i was assuming the deletes come in somewhat regularly from live edits, but didn't actually verify in the stream
[17:31:30] ebernhardson ansible is much better for this use case than clusterssh. In a perfect world, you run the playbook and in a few minutes you'll have 50 pcap files on your desktop. I can give it a test run with a host or two if you're interested.
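Once per-host captures sit in one directory, a raw byte scan can narrow down which of the 50 pcap files are worth opening. This is a minimal Python sketch, not anything from the thread: it assumes unencrypted captures (such as those taken on the `lo` interface), and it assumes the marker is neither compressed nor split across packets. The names `file_contains` and `captures_with_marker` and the `*.pcap` pattern are hypothetical.

```python
from pathlib import Path

# Marker discussed above; searched as raw bytes, with no pcap parsing.
MARKER = b"__DELETE_GROUPING__"


def file_contains(path, marker=MARKER, chunk_size=1 << 20):
    """Return True if the raw bytes of `path` contain `marker`.

    Reads in chunks and keeps a short tail from the previous window so
    that a marker straddling a chunk boundary is still found.
    """
    overlap = len(marker) - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            tail = window[-overlap:] if overlap else b""


def captures_with_marker(directory, marker=MARKER, pattern="*.pcap"):
    """List the names of capture files in `directory` that contain `marker`."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if file_contains(p, marker)
    )
```

Calling `captures_with_marker("captures/")` on a hypothetical directory of fetched files returns only the file names worth loading into wireshark. It is cruder than a `tcp contains` display filter because it ignores packet boundaries, so a hit is only a hint.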
[17:33:30] yea deletes come in plenty often, looks like a one minute capture should have something
[17:33:32] Trey314159: please ignore, sorry about this, I messed something on my end, added an extra entry in the mapping by accident and this got me confused
[17:34:24] dcausse: No worries... I've done worse! I was very confused, though, because I was using print_r to dump the mapping in the constructor and it was not ambiguous....
[17:56:27] dinner
[18:52:55] back
[19:22:17] working on the tcpdump playbook. Let's see what happens when I give it 2 hosts
[19:29:09] inflatador: i got it working, sadly not finding what i need :(
[19:30:18] ebernhardson ACK, sorry it didn't pan out :(
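For reference, the wrong-keyboard confusion discussed earlier can be reproduced in a small Python sketch. It is illustrative only and is not the CirrusSearch code: `build_map`, `convert`, `PAIRS` and the `first_wins` flag are hypothetical names, and the layout strings assume the standard Russian ЙЦУКЕН and US QWERTY layouts. It shows how one stray extra entry for 'ь' changes the result, and how a guard in the spirit of the `!isset` check keeps the first mapping.

```python
# Letter keys of the Russian (ЙЦУКЕН) layout and the US QWERTY characters
# produced by the same physical keys, unshifted and shifted.
RU_LOWER = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_LOWER = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_UPPER = RU_LOWER.upper()
EN_UPPER = 'QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>'

PAIRS = list(zip(RU_LOWER + RU_UPPER, EN_LOWER + EN_UPPER))


def build_map(pairs, first_wins=False):
    """Build a character map from (source, target) pairs.

    With first_wins=False a later duplicate silently overrides an earlier
    entry; with first_wins=True the first entry for a character is kept.
    """
    mapping = {}
    for src, dst in pairs:
        if first_wins and src in mapping:
            continue
        mapping[src] = dst
    return mapping


def convert(text, mapping):
    """Map each character through `mapping`, leaving unknown ones as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)


clean = build_map(PAIRS)
print(convert("мутщь", clean))  # venom
print(convert("мутщЬ", clean))  # venoM
print(convert("мутщЪ", clean))  # veno}

# An accidental extra entry for 'ь' makes the last-wins map produce '{'.
stray = PAIRS + [("ь", "{")]
print(convert("мутщь", build_map(stray)))                   # veno{
print(convert("мутщь", build_map(stray, first_wins=True)))  # venom
```

With a clean mapping there is no ambiguity, which matches the print_r dump; the guard only changes the result when a duplicate key is present.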