[07:13:25] o/
[08:14:23] o/
[09:01:30] gmodena: thanks for https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1105! added a question regarding deprecating q_by_ip_day, relatedly was it a request by the privacy team to drop it?
[09:03:04] articlecountry backfill is finally done :)
[09:22:32] dcausse: do you have the list of topics to increase retention for T388372?
[09:22:32] T388372: Increase retention of Wikidata RDF Stream (Kafka and/or Hadoop) - https://phabricator.wikimedia.org/T388372
[09:22:47] gehel: yes will comment on the ticket
[09:22:51] thanks!
[10:36:53] errand+lunch
[10:56:39] dcausse ack. i followed up to your comment in gitlab+mail
[12:32:07] lunch+errand
[14:00:12] ebernhardson: (sorry, I was out Friday!) If the Sudachi settings are the same (or very nearly the same) for Elastic and OpenSearch, and you can load a chunk of Japanese text without errors, then I'd call that good enough for now. I can test the internals in T386870, and if there are any weird inconsistencies we can look at whether they are related to porting it over.
[14:00:12] T386870: Regression Test OpenSearch Language Analysis - https://phabricator.wikimedia.org/T386870
[14:16:23] \o
[14:31:11] .o/
[14:32:20] o/
[14:39:04] o/
[14:53:44] i feel like lots of airflow SLAs fail so often...they are almost meaningless
[14:56:59] ebernhardson: yes... with wikidata dumps broken many are screaming :/
[14:57:38] hmm, i guess that implies what we actually need is more of a cascade of some sort. Like if B,C,D depend on A, only complain about A until it's done
[14:58:19] but i don't think airflow does any kind of dependencies like that, not sure if they expect you to stuff it all into one DAG or what
[14:59:20] what could help (but unsure that's going to be visible in the SLA email) is using DAG tags so that we can rapidly filter them
[14:59:50] for instance wdqs_streaming_updater_reconcile_hourly can be tagged differently as subgraph_query_mapping
[15:00:07] hmm, yea perhaps we can do something along those lines
[15:00:35] the analytics airflow instance uses them quite a lot
[15:01:01] but yes that does not solve the noise issue if wikidata dumps are down for several weeks
[15:37:55] OK, starting the cloudelastic1010 reimage...
[16:22:17] Dropping off to take dog out
[16:54:40] cloudelastic1010 hanging at the PXE boot screen as cloudelastic1009 did...good times as usual ;)
[17:00:10] isn't hardware fun?
[17:09:54] * inflatador used to think so ;P
[17:13:41] and...firmware update cookbook failing
[17:51:08] random thought: Transform the set of titles/redirects into n-grams starting from the beginning of a title and do some sort of frequency analysis to auto-detect "List of" and other such less-meaningful prefixes. Then feed those into completion building as some sort of ignorable prefix
[17:51:31] essentially identify high-frequency prefixes
[18:04:52] interesting, we already drop some words via stopwords perhaps we could indeed explore such frequencies?
[18:06:12] the edge-ngram fields might have some data there already? i.e. what would be the longest prefix that still has a very high freq
[18:06:29] it seems plausible, not sure if we would do it in hadoop as a once-in-a-while thing, or maintain the index in elastic. I suppose the basis of the idea is that i'm pretty sure the 'Trigonometric Identities' ticket didn't resolve because levenshtein sees `list of` at the beginning and decided it isn't close enough
[18:06:48] ebernhardson: definitely
[18:07:16] hmm, i suppose i didn't check what we already index
[18:09:10] title.prefix_asciifolding & redirect.prefix_asciifolding should have some info
[18:10:05] but unsure if elastic is going to be happy to let us scan the term frequencies like that
[18:11:42] it's not :P Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default.
[18:11:50] when attempting to ask for top 10 prefixes
[18:12:25] yes... perhaps offline exploration in hadoop would be good enough? I mean perhaps that list is relatively stable?
[18:13:13] Yea i can't imagine it changes very often
[18:36:17] with a quick hacky script in spark...`List of` is the 6th most popular prefix at 277k occurrences
[18:37:21] https://phabricator.wikimedia.org/P74175
[18:37:48] it also wants to strip `John` :P
[18:38:19] some of those are surprising, probably needs refinement, but plausible
[18:44:55] hm that John is surprising at 120k, https://en.wikipedia.org/w/index.php?search=all%3Aprefix%3AJohn&title=Special%3ASearch&profile=default&fulltext=1 only finds 56k
[18:45:14] i think it was because i had enwiki_general in there too. Re-ran against only enwiki_content (refresh paste)
[18:45:57] now it's better and the top 3 are totally reasonable to filter. But not sure how to know on an arbitrary wiki where the cutoff is, i was hoping for a sharper distinction
[18:46:38] 50k `john` and only 120k `the` is surprising to me
[18:47:18] Then things like `Battle of` at 11k could probably be filtered, but not the end of the world to lose it
[18:47:55] unsure about battle of tho...
[18:48:34] `the` is already filtered thanks to stopwords but might be interesting to get on languages where we don't have any stopwords declared
[18:49:30] now wondering what are those pages that start with List but not with List of :)
[18:50:36] lol, yea it's curious
[18:50:55] "List A cricket"
[18:56:00] some tech things too, like 'list coloring', 'list comprehension', etc
[18:56:35] in those cases stripping list would be meh, i suppose we would subtract `list of` from `list` and find `list` shouldn't be filtered
[18:58:12] yes
[19:08:07] looks like cloudelastic1010 is finally booting into the installer. I had to manually install NIC firmware this time
[19:26:05] dinner
[19:37:22] I'm riding back to San Antonio now...will keep an eye out on the reimage
[19:57:08] OK, cloudelastic1010's back in the cluster, unbanned, part of the voting config
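
(note) A rough sketch of the prefix idea from the 17:51 to 18:56 discussion, in plain Python rather than Spark. This is not the script behind P74175: the function names, the word-level n-grams, and the `max_words` / `min_count` thresholds are all assumptions made up for illustration. It counts leading word prefixes of titles, then subtracts the titles already explained by a longer accepted prefix (e.g. `list of`) before deciding whether the shorter one (`list`) is still frequent enough to be treated as ignorable.

```python
from collections import Counter


def leading_ngrams(title, max_words):
    """Yield the leading 1..max_words word prefixes of a title, lowercased."""
    words = title.lower().split()
    # Stop one word short of the full title so stripping a prefix
    # always leaves something behind to match on.
    for n in range(1, min(max_words, len(words) - 1) + 1):
        yield tuple(words[:n])


def ignorable_prefixes(titles, max_words=3, min_count=3):
    """Return {prefix: count} for prefixes that look safe to ignore.

    Hypothetical helper: min_count is an arbitrary cutoff, and picking it
    per wiki is exactly the open question from the chat.
    """
    counts = Counter()
    for title in titles:
        counts.update(leading_ngrams(title, max_words))

    accepted = {}
    covered = Counter()  # titles already explained by an accepted longer prefix
    # Longest prefixes first, so children are decided before their parents.
    for prefix in sorted(counts, key=len, reverse=True):
        residual = counts[prefix] - covered[prefix]
        keep = residual >= min_count
        if keep:
            accepted[" ".join(prefix)] = counts[prefix]
        if len(prefix) > 1:
            parent = prefix[:-1]
            # An accepted child explains all of its titles; a rejected child
            # only passes along what its own accepted descendants explained.
            covered[parent] += counts[prefix] if keep else covered[prefix]
    return accepted
```

With a handful of titles such as "List of rivers", "List A cricket" and "List coloring", `list of` can clear the cutoff while the residual count for bare `list` falls below it, which matches the "subtract `list of` from `list`" suggestion. Whether a fixed cutoff works across wikis is still open.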