[10:41:24] lunch
[13:13:01] \o
[13:14:19] o/
[13:16:19] dcausse: Regarding the SUP Java 17 update: it's a bit trickier than expected. Issues are caused by unauthorised usage of reflection. Some of that I could solve by updates, but other parts (Kryo, during windowed deduplication) are harder to get by. Are we fine with allowing access via JVM flag (`--add-opens java.base/java.lang=ALL-UNNAMED`), or shall we try to find a proper solution?
[13:58:45] pfischer: yes absolutely, that's what flink is doing upstream, see the thread here: T404340#11196543
[13:58:45] T404340: [EPIC] Upgrade flink jobs to java 17 - https://phabricator.wikimedia.org/T404340
[13:59:30] and it should already be pulled by the operator at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/charts/flink-kubernetes-operator/conf/flink-conf.yaml#27
[14:00:04] but needs https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189823 first to avoid having to set these options manually on a per-job basis
[14:02:56] if you hit this during maven builds with java17 I think we should copy these options to our build setup
[14:04:23] pfischer: but curious that you've hit kryo? we should fail whenever we rely on generic serialization
[15:17:36] ouch... applying the formula for sample size determination, i would need to manually review just under 1k queries to get a 95% confidence :S
[15:17:49] * ebernhardson1 should have just ignored that that calculation exists :P
[15:33:54] although i guess i can just change the numbers, that was 95% confidence that the real value is +- 1% of the estimated value. Curiously, 99% confidence on +- 2% is only 436
[15:34:27] then again, my sample flags 2.7% of queries as having question words, so +- 2% would be a bit much :P
[15:38:23] * ebernhardson is finding, like many things in stats, there are lots of magic numbers to pick :P
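(Editor's note on the 13:16 Java 17 question: a minimal, hypothetical sketch of the kind of deep reflective access that Java 17 refuses by default, which is roughly what Kryo-style generic serialization relies on. The class and field below are just an illustration, not SUP code.)

```java
import java.lang.reflect.Field;

public class AddOpensDemo {
    public static void main(String[] args) throws Exception {
        // Deep reflection into a java.base class: on Java 11 this only triggers an
        // "illegal reflective access" warning, but on Java 17 setAccessible throws
        // InaccessibleObjectException unless java.lang is opened to the caller's module.
        Field value = String.class.getDeclaredField("value");
        value.setAccessible(true);
        System.out.println(((byte[]) value.get("hello")).length);
    }
}
```

Run as `java AddOpensDemo` this fails on 17, while `java --add-opens java.base/java.lang=ALL-UNNAMED AddOpensDemo` succeeds; the same flag is what would go into the flink-conf / maven JVM options discussed above.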
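(Also an editor's note, on the 15:17-15:34 sample-size numbers: these look like the standard infinite-population formula for estimating a proportion,

$$ n = \frac{z^2 \, p(1-p)}{e^2} $$

and assuming p ≈ 0.027, the 2.7% flagged rate mentioned above, it roughly reproduces the figures in the chat: z = 1.96, e = 0.01 gives n ≈ 1009, i.e. about 1k reviews for 95% confidence at ±1%, while z = 2.576, e = 0.02 gives n ≈ 436 for 99% confidence at ±2%.)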
[16:00:49] first time I think I've seen this one: bulk action failed with status BAD_REQUEST: {"index":"testwikidatawiki_content_1756353103","type":"_doc","id":"193822","cause":{"type":"exception","reason":" Elasticsearch exception [type=illegal_argument_exception, reason=Document contains at least one immense term in field=\"labels.en.near_match"
[16:01:10] curious, i think i have seen "immense term" somewhere before, but it's very rare
[16:02:04] whose UTF8 encoding is longer than the max length 32766. 32k, even utf8 that's pretty big for a single term
[16:02:28] probably no need to index those big tokens as is
[16:03:01] 2 counts in 15 days
[16:03:15] and it's testwikidata...
[16:03:31] hmm, reminds me of the fix for sudachi, where we added a filter that breaks tokens at (iirc) 8k
[16:04:18] over two months still only testwikidata, same doc all the time
[16:04:43] will get retried by the saneitizer over and over but not sure that's a big deal
[16:07:11] i always have mixed feelings... the errors themselves are totally reasonable, the only problem is that they hide other errors
[16:07:47] by hide, i just mean if we know that docs fail indexing every day, we're less likely to look into a new doc that's failing to index
[16:08:30] but having 0 indexing errors is probably unlikely, so some strategy is necessary
[16:09:28] yes... tempted to add some error counters with separate buckets for errors we expect (e.g. document_missing_exception) and "unknown" errors where we should take a look from time to time
[16:10:31] and try to force ourselves to keep that unknown error count flat, by changing the mapping of wikidata to truncate those large labels for instance
[16:11:10] seems reasonable, although maybe worth pondering what worked and didn't about the cirrus side of that. We were also flagging error types there (still do, but ofc no indexing traffic now)
[16:12:47] sure, it always was a bit of a whack-a-mole game, but tbh Elastica did not really help in that regard
[16:14:56] yea, very much so
[16:20:11] what's different I think here with the SUP (need to double check) is that we have a clear separation between request failures vs individual bulk action failures; request failures are retried a fixed number of times and would fail the whole pipeline I think if they're not salvageable
[16:20:46] with cirrus this distinction was not so clear
[16:20:58] oh right, that does make a big difference. indeed in cirrus it was unclear
[16:23:36] heading out, have a nice week-end!
[16:26:08] .o/
[17:51:12] lunch, back in ~40
[18:41:39] still feeling a bit woozy from the vaccines... going to try and lie down, should be back in 1-2h
[18:45:13] * ebernhardson somehow did not expect it would be so hard to decide if queries are natural language or not....
[18:45:57] "Artists who studied under George Bellows and lived in Florida" is, kinda? like it has some structure, but certainly not how you would talk to another person
[20:00:10] sounds like a Jeopardy! category ;P
[20:03:20] there are actually multiples of this form, it almost feels bot-ish
[20:04:14] the sample is ~1% of queries that were flagged as potentially natural language, to have multiples of the same form in a 1% sample feels suspicious at least
[20:31:42] that is a weird one to appear multiple times
[20:32:38] it's not exactly the same, but same form. like: Pakistan footballer who played for Karachi and PIA, selection committee
[20:33:15] or `AFL footballer who played for Essendon, Adelaide and Geelong`. I dunno, maybe real people search that way?
[20:34:13] some are awfully specific: stunt coordinator who turned to acting after injuries collaborated with comedian banned work ethic issues
[20:36:45] Oh OK, I thought you meant lots of people were interested in George Bellows/Florida
[20:38:03] ryankemper looks like wdqs2016's blazegraph unit is crashlooping. Taking a look now, but it's not related to the LVS decom yesterday, is it?
[20:38:59] inflatador: a few of the old wdqs-public hosts are in a bit of a wonky state
[20:39:53] ryankemper ACK, you set a suppression yesterday right? If so, no big deal. I just saw the alert come thru #data-platform-alerts a couple of times today
[20:39:53] there's a couple of stale units that need cleaning up, which I was going to do by reimaging, but the 2 reimages I tried yesterday had issues
[20:41:10] yeah there might be a couple missing suppressions tho, I think I mainly set ones on probedown
[20:41:52] in any case basically every old public host besides 2009 can be downtimes
[20:42:20] downtimed* (2009 is the one still serving the legacy endpoint)
[20:51:05] ryankemper ACK, besides that, do you have anything for pairing? I'm just finishing off my Asana updates for the week, fine w/me to skip
[20:51:45] inflatador: I'm fine skipping, just that stuff and more spicerack iteration
[21:06:42] ryankemper ACK, see ya Monday
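(Editor's note on the error-bucket idea at 16:09-16:10: a minimal sketch of the bucketing, assuming a plain in-memory counter; names like FailureCounters and EXPECTED_TYPES are hypothetical, and in the actual SUP job this would presumably be wired into Flink's metric groups instead.)

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class FailureCounters {
    // Failure types we expect to see routinely and don't need to investigate.
    private static final Set<String> EXPECTED_TYPES = Set.of("document_missing_exception");

    private final Map<String, LongAdder> buckets = new ConcurrentHashMap<>();

    // Expected types each get their own bucket; everything else lands in "unknown",
    // which is the series that should stay flat and get looked at when it doesn't.
    public void record(String elasticErrorType) {
        String bucket = EXPECTED_TYPES.contains(elasticErrorType) ? elasticErrorType : "unknown";
        buckets.computeIfAbsent(bucket, k -> new LongAdder()).increment();
    }

    public long count(String bucket) {
        LongAdder adder = buckets.get(bucket);
        return adder == null ? 0 : adder.sum();
    }
}
```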
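(And on the 16:02/16:10 "immense term" discussion: one possible shape of the mapping-side fix is Elasticsearch's built-in `truncate` token filter, similar to the 8k cutoff mentioned for sudachi. The filter name, the 8192 length, and the analyzer wiring below are purely illustrative, not the actual CirrusSearch/wikidata config.)

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_long_token": { "type": "truncate", "length": 8192 }
      },
      "analyzer": {
        "near_match": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["truncate_long_token", "lowercase"]
        }
      }
    }
  }
}
```

That would keep any single token well under the 32766-byte Lucene term limit quoted in the error above.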