[10:30:10] lunch
[11:13:02] Hello. We have a low memory alert for rdf-streaming-updater: https://phabricator.wikimedia.org/T402886#11143380 - not urgent, but I'm thinking of bumping the RAM available to the taskmanagers.
[11:16:30] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1184500
[11:20:00] Any objections?
[12:14:24] btullis: answered on the patch, I think I had this conversation with Alex already a couple years ago but can't find it in phab...
[12:21:13] dcausse: Ack, thanks. In that case, we can probably exclude this namespace from that specific alert, instead.
[12:22:33] btullis: if possible, yes. looking closer I think we already fine-tuned mem settings there, so possibly something not working as I expect, but I'm pretty sure the job is not mem-pressured
[12:23:35] could be rocksdb a bit greedy, I can try to set it to use 95% of the overhead instead of 100%?
[12:24:41] but the job is not oomkilled so I'm not too worried except for the fact that it's causing noise
[12:29:50] OK, thanks. I filed this instead: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1184509
[12:30:18] btullis: thanks!
[13:17:07] I didn't finish my stats homework ;( Are we supposed to do stats this week?
[13:27:50] inflatador: no I don't think so, we have the office hours
[13:30:08] * inflatador breathes a sigh of relief
[13:36:56] still needs a lot of work, but we have master-eligible info in a dashboard now: https://grafana.wikimedia.org/goto/zxIUBOrHg?orgId=1 . Will add to the percentiles dashboard shortly
[14:02:08] \o
[14:02:50] o/
[14:02:56] .o/
[14:09:09] * ebernhardson did not realize he implemented java \u syntax with utf-16 and not \u with a codepoint number like pcre
[14:10:05] should have known, but didn't consider it would be different and just made it match what the Pattern class does :P
[14:22:00] today's wed meeting question: What \u syntax is appropriate to support?
:P
[14:24:36] yes :)
[14:36:31] for codepoints the syntax is \x{codepoint} both for pcre and java, \u I don't see it in pcre, for java yes but it's slightly redundant with the preprocessor: "Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014."
[14:39:18] hmm, for some reason i was sure i had seen \u in pcre, but indeed i don't see it at https://www.pcre.org/original/doc/html/pcresyntax.html
[14:44:05] oh, it's in pcre2 https://www.pcre.org/current/doc/html/pcre2syntax.html
[14:44:24] only "If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode")"
[14:52:05] interesting... also supports \u{}
[14:54:24] IIUC you basically have 3 ways to match codepoints \N{U+hh..} \u{hh..} \x{hh..}
[14:55:44] and \o{ooo} if you only have a numpad :)
[14:56:38] I was just reviewing the patch. I dislike \x because the full syntax is wacky (in the way you'd expect Perl to be wacky); two-digit-no-brace style like \x65 is ok, and multibyte \x{123456} is okay, too. Blech
[14:56:41] \N
[14:56:52] \N{} also supports *names* of codepoints
[14:57:13] I don't have much preference but I'd be leaning toward matching codepoints rather than UTF-16 bytes
[14:59:29] i agree utf-16 bytes seems awkward and out of place
[14:59:52] could be fancy to match names of codepoints but unsure if it's easy or really useful
[15:06:58] no, no, don't do it!
[15:08:00] i have kinda mixed feelings, on the one hand i don't want to construct or maintain a table. I've also never used the named syntaxes myself. But they were probably designed for less technically advanced users, and that's like the definition of editors?
[15:08:32] I guess one question to answer would be, they were likely designed for less technical users, but do they work for them?
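(Editor's aside, not part of the log.) The codepoint vs UTF-16 distinction discussed above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the helper `utf16_code_units` is a hypothetical name invented here, and Python's `re` and `json` modules stand in for the Java, PCRE, and plugin behaviour under discussion, which this sketch does not reproduce.

```python
import json
import re
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as 4-digit hex strings.

    Hypothetical helper for illustration; not taken from the patch.
    """
    raw = text.encode("utf-16-be")
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    return [f"{unit:04X}" for unit in units]


em_dash = "\u2014"      # a BMP character: one codepoint, one code unit
hwair = "\U00010348"    # GOTHIC LETTER HWAIR: outside the BMP

print(utf16_code_units(em_dash))  # ['2014']
print(utf16_code_units(hwair))    # ['D800', 'DF48'] (a surrogate pair)

# JSON's \u notation is expressed in UTF-16 code units, so the
# non-BMP character is serialised as two \u escapes.
print(json.dumps(hwair))          # "\ud800\udf48"

# Python's re addresses the character by codepoint via \UXXXXXXXX.
print(bool(re.fullmatch(r"\U00010348", hwair)))  # True
```

A syntax based on code units needs two escapes for a character like this one, while a codepoint-based syntax needs a single escape, which is the trade-off weighed in the messages above.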
[15:12:37] hmm, apparently \u is from ECMAScript (== javascript), which is probably almost as prevalent as pcre in younger minds
[15:19:01] and in javascript \unnnn is a utf-16 code-unit, and \u{hhhh} or \u{hhhhhh} match the "unicode value" (codepoint number?)
[15:23:12] Hmm.. for me I'm just so used to the \uXXXX format because that's what OpenSearch (and formerly Elasticsearch) use internally for (some?) multibyte characters, and I'm happy to only have one way to think about them.. but most searchers probably don't spend as much time looking at analysis output of gothic character input as I do. That said, supporting \uXXXX and \u{XXXX..} would be pretty cool, as long as it doesn't confuse users
[15:23:12] too much.
[15:24:16] that would be the json notation, which uses \u with utf-16 code units
[15:24:49] apparently javascript, like java, is internally utf-16
[15:25:08] (vs python and php that are internally utf-8)
[15:26:50] re json: makes sense!
[17:37:59] lunch, back in ~40
[18:34:51] sorry, been back awhile
[18:49:23] now the real hard problem i forgot to mention...old dumps are at http://dumps.wikimedia.org/other/cirrussearch. Where do the new ones go?
[18:49:45] i'm assuming we need a new name and a transition period before turning off the old ones
[18:54:15] Why not just use the same location? They are divided by day. Are the old dumps even running? The last one is 08/25, so it seems not.
[18:55:38] they should in theory be running, can look at https://airflow-test-k8s.wikimedia.org/dags/mediawiki_cirrussearch_dump/grid
[18:56:27] says last sync was the data interval that ended on 8-25. The next dump dated 9-1 is running but will take forever for s4 (commons) and s8 (wikidata)
[18:57:25] we could in theory drop them in the same directory, but it seems confusing.
The contents of the files are the same, but the set of files is different
[19:00:10] i suppose i could try harder to align the names, old names of `aawiki-20250825-cirrussearch-content.json.gz` will become `index_name=aawiki_content/aawiki_content-20250825-00000.json.bz2`. not sure it matters though
[20:02:22] ebernhardson: are you all set on the \u patch? If so, I'm ready to approve it.
[20:06:47] Trey314159: yup, i was thinking similarly and just resolved the open convo :)
[20:07:00] then can get the patches prepped to release the new plugins
[20:08:09] Cool!
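(Editor's aside, not part of the log.) The dump rename mentioned at 19:00:10 can be sketched as a small Python mapping. This is a guess generalised from the single example pair in the log: the function name `new_dump_path`, the regular expression, and the assumption that the trailing `00000` is a zero-padded part number are all hypothetical and not confirmed by the discussion.

```python
import re

# Pattern inferred from the one old-style name quoted in the log
# (an assumption, not a documented naming scheme).
OLD_NAME = re.compile(
    r"^(?P<wiki>.+)-(?P<date>\d{8})-cirrussearch-(?P<kind>.+)\.json\.gz$"
)


def new_dump_path(old_name, part=0):
    """Map an old dump file name to the new-style path.

    `part` is assumed to be a zero-padded part number; the log only
    shows the value 00000.
    """
    match = OLD_NAME.match(old_name)
    if match is None:
        raise ValueError(f"unrecognised dump name: {old_name!r}")
    index_name = f"{match['wiki']}_{match['kind']}"
    return (
        f"index_name={index_name}/"
        f"{index_name}-{match['date']}-{part:05d}.json.bz2"
    )


print(new_dump_path("aawiki-20250825-cirrussearch-content.json.gz"))
# index_name=aawiki_content/aawiki_content-20250825-00000.json.bz2
```

Because the new layout can hold several part files per index, a mapping like this is one-to-many in general, which matches the remark that the set of files differs even though the contents are the same.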