[13:20:20] o/ chtnnh
[14:05:03] MediaWiki-extensions-ORES, Scoring-platform-team, Edit-Review-Improvements-RC-Page, Growth-Team, MediaWiki-Recent-changes: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906 (Tgr) The easiest (but more wasteful in terms of space) option would be to keep u...
[15:16:53] haksoat: halfak: regarding es tokenization, the tokenization API is exposed as a raw elasticsearch api; the endpoint is /enwiki/_analyze, /arwiki/_analyze, etc. Exposing this through a mediawiki api could be possible, and is probably what we couldn't promise to deliver. If it was truly necessary it could probably be done in a week by someone not doing other things (ofc there are always other
[15:16:59] things)
[15:18:16] ebernhardson, ooh that's useful. Can you help me build a sample query to enwiki/_analyze?
[15:18:32] FWIW, I don't think this needs to be part of the MW API.
[15:18:41] yup, one sec
[15:18:59] It just needs to be available and able to handle a load of ~100 qps or so for tokenizing wiki articles
[15:21:14] halfak: https://phabricator.wikimedia.org/P10965
[15:22:06] halfak: that instance (cloudelastic.wikimedia.org) is accessible from anywhere in cloud or prod; in prod you would point at search.svc.eqiad.wmnet
[15:22:26] This is great
[15:22:35] i wonder though, we have multiple analyzers and which one you want depends. hmm,
[15:23:07] Maybe haksoat's first pass can be looking through the analyzers and what we have in revscoring to make a recommendation.
[15:24:35] ebernhardson Is there a way to get a list of the analyzers?
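[editor's note] The /enwiki/_analyze call described above (and demonstrated in P10965) can be sketched in Python. The cloudelastic host and analyzer names come from this conversation; the exact scheme/port and any access requirements are assumptions.

```python
import json
import urllib.request

# Sketch, assuming the cloudelastic host mentioned above is reachable over
# plain HTTPS; in prod one would point at search.svc.eqiad.wmnet instead.
ES_HOST = "https://cloudelastic.wikimedia.org"

def build_analyze_request(wiki, text, analyzer="plain"):
    """Return (url, json_body) for an Elasticsearch _analyze call."""
    url = "%s/%s/_analyze" % (ES_HOST, wiki)
    body = json.dumps({"analyzer": analyzer, "text": text})
    return url, body

def analyze(wiki, text, analyzer="plain"):
    """POST the request and return the list of token dicts."""
    url, body = build_analyze_request(wiki, text, analyzer)
    req = urllib.request.Request(
        url, data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["tokens"]
```

Each returned token dict carries `token`, `start_offset`, `end_offset`, and `position` fields, which matter for the offset discussion later in this log.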
[15:25:39] this is the relevant api doc: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/indices-analyze.html For our case the values to test in analyzer would be "plain", "plain_search", "text", "text_search", "short_text" and "short_text_search"
[15:25:52] haksoat: it's not pretty, but search for "analyzer": in https://en.wikipedia.org/w/api.php?action=cirrus-settings-dump
[15:25:57] halfak Yeah. I'll do that.
[15:26:32] the difference between the foo and foo_search analyzers is that typically *_search only generates one token per input token; the non-search-suffixed analyzers may generate multiple tokens (typically we index multiple variants, but only search for one)
[15:28:44] at a high level, plain tries to do minimal normalization while text does heavy normalization. I forget what exactly short_text varies on; we use it for content that is single-sentence (titles, headings, etc.)
[15:31:07] i suppose one extra difficulty to figure out with respect to ORES will be how to handle changes to the analysis pipeline; you essentially want versioned analyzers instead of updating whenever we update. That could be done by fully specifying the analyzer in the api request instead of referencing an already defined one (to ensure the definition stays constant)
[15:38:40] ebernhardson Something like this: https://www.elastic.co/guide/en/elasticsearch/reference/current/test-analyzer.html#test-analyzer ?
[15:40:07] haksoat: hmm, that doesn't have an example of fully defining the analysis chain. Sec, i can work up an example
[15:40:47] Okay. Thanks.
[15:44:00] haksoat: this would be the enwiki plain_search analysis chain (one of the simplest). This could be extracted programmatically from the cirrus-settings-dump api and stored in a versioned file for later re-use.
https://phabricator.wikimedia.org/P10966
[15:44:59] haksoat: an analyzer is a composition of a tokenizer, 0 or more char filters, and 0 or more token filters; in the cirrus-settings-dump they are all named, and for the _analyze api we can pull all the named things into the definition directly
[15:45:15] How do we know when an analyzer has gotten a substantial update?
[15:46:10] halfak: Mostly it would be through trey mentioning something. Most analyzers are unchanged over a multi-year period, but trey goes through and fixes up individual languages as time goes by
[15:47:40] i suspect it would be relatively rare that the analysis changes, but when it does I imagine it would at a minimum reduce the quality of the model due to it having different inputs than it trained on
[15:48:36] Right. If changes are yearly, that's not bad for a start, but eventually we'll want our modeling process to be automated based on changes upstream.
[15:49:42] In ORES, we output semantic versions to help people track this downstream. I wonder if that is an option for these endpoints. Alternatively, we could always just have another endpoint that we query to know when an update has been made.
[15:51:04] hmm, we do version our analysis pipelines, but not in a way that helps you, unfortunately. Well, let me double check
[15:54:37] hmm, as it stands today this versioning wouldn't help, although it could maybe be changed at some point. Essentially we have version constants in specific classes, and those versions get stored in elasticsearch, but because those are by-class instead of by-language you will only see fairly major analysis changes that way and not minor things like a configuration change that updates the
[15:54:43] character filter maps or whatever
[15:56:05] the "easy" way off the top of my head would be a script that takes a mw api endpoint and an analyzer name, and emits the appropriate _analyze formatted request (minus the "text" field).
It could compare what comes out to the stored "version" it already has to know when things change. Would require committing/storing those versioned _analyze requests somewhere
[16:55:41] Jade, Scoring-platform-team (Current), Epic: Implement secondary Jade Integrations - https://phabricator.wikimedia.org/T229974 (ACraze)
[16:56:41] Jade, Scoring-platform-team (Current), Epic: Implement secondary Jade Integrations - https://phabricator.wikimedia.org/T229974 (ACraze) a: ACraze
[16:58:18] ebernhardson is there a downside to us updating when there's an update (aside from having to check)? Also, does an update usually break the endpoint?
[17:12:07] haksoat: updates won't break the endpoint; the only difference is you might get different tokens
[17:13:33] haksoat: so for example we used to convert the dotless I (ı) into the letter i in turkish, but a config change was deployed to protect that character. Depending on what kind of downstream nlp you are doing, those tokenization changes could change model performance
[17:14:38] the bigger concern would be when we do major updates, such as when chinese was moved to the current pipeline that normalizes everything into simplified form. Prior to that change you could get traditional or simplified chinese from the tokenizer; now it can only return simplified
[17:14:57] (those bigger updates are easier to communicate though, ofc)
[17:15:35] Amazing. That simplifies things ;)
[17:16:06] ebernhardson, do the endpoints produce whitespace tokens too?
[17:16:54] halfak: sadly i don't think there is any way to get whitespace tokens
[17:17:08] Noo. >:( Hmm
[17:17:16] I didn't mean angry face there
[17:17:18] Just sad face
[17:17:23] :)
[17:17:25] :D
[17:18:05] Elasticsearch comes with a whitespace analyzer
[17:18:40] haksoat: right, but that doesn't emit whitespace tokens, it tokenizes on whitespace
[17:19:57] I get it now. Thanks for clarifying.
[17:20:42] Can we tell the difference between a word, sentence, and paragraph break?
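[editor's note] The change-detection script sketched at 15:56 (store the fully-specified _analyze request, re-run it on a fixed probe text, compare against the stored result) could look roughly like this. The function names and the SHA-256 fingerprint scheme are illustrative assumptions, not an existing tool.

```python
import hashlib
import json

# Illustrative sketch: fingerprint the tokens that a stored, fully-specified
# _analyze request produces on a fixed probe text, and compare against the
# previously committed fingerprint to detect analysis-pipeline changes.
def token_fingerprint(tokens):
    """Stable hash over (token, start_offset, end_offset) triples."""
    canonical = json.dumps(
        [[t["token"], t["start_offset"], t["end_offset"]] for t in tokens])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def analysis_changed(stored_fingerprint, current_tokens):
    """True if the analyzer's output no longer matches the stored version."""
    return token_fingerprint(current_tokens) != stored_fingerprint
```

The stored _analyze request body plus its fingerprint would be committed somewhere versioned, per the discussion above; a mismatch would signal that the model may need retraining.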
[17:21:37] halfak: sadly we don't get that either :S Basically it only emits words
[17:21:46] dang
[17:22:20] halfak my two cents: you can hack your way to getting word, sentence, and paragraph breaks
[17:22:37] of course it's not pretty
[17:24:24] Jade, Scoring-platform-team (Current), Epic: Implement secondary Jade Integrations - https://phabricator.wikimedia.org/T229974 (ACraze)
[17:27:24] ebernhardson Will it be possible if we create a custom analyzer?
[17:29:55] haksoat: whitespace tokens? Best you could do is perhaps guessing based on term position and start/end offset. Basically assume whatever was in between was whitespace. It could then be sliced out of the source document (hax, basically)
[17:31:31] Hmmmm. Okay.
[17:32:57] for "red dog" it will report red at position 0 with start 0 and end 3, then dog at position 1 with start 4 and end 7. From that you could infer position 3 was whitespace
[17:34:39] Nice
[17:36:17] As chtnnh said, it won't be pretty, looking at the possible edge cases if we are to tell the difference between a word, sentence, and paragraph break.
[17:40:17] yes, i believe it would involve some if/else blocks
[17:49:01] posting our async update notes --
[17:49:17] halfak:
[17:49:19] Last week: I did a lot of meetings and tuning deck work. Otherwise I dug into uwsgi memory issues and ORES capex. I also worked with chtnnh on a bunch of things and got signed up as a mentor for GSOC.
[17:49:21] T: Somehow I got a ton of email over the weekend so I spent a good amount of time processing that. I talked about analyzing endpoints with ebernhardson and haksoat in IRC this morning and got haksoat started on a report comparing what's available in Elasticsearch vs. what we have in revscoring. I'll be trying to learn a bit more about uwsgi and get some feedback on my CapEx estimation strategy.
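[editor's note] The offset-based "hax" described at 17:29-17:32 (slice the separators back out of the source text using each token's start/end offsets) can be sketched as follows; the function name is made up for illustration.

```python
# Sketch of the offset hack described above: _analyze reports start/end
# offsets for every token, so whatever text sits between consecutive tokens
# can be sliced out of the source document as the separator.
def infer_separators(text, tokens):
    """Return (start, end, substring) gaps between consecutive tokens."""
    gaps = []
    prev_end = 0
    for tok in sorted(tokens, key=lambda t: t["start_offset"]):
        if tok["start_offset"] > prev_end:
            gaps.append((prev_end, tok["start_offset"],
                         text[prev_end:tok["start_offset"]]))
        prev_end = max(prev_end, tok["end_offset"])
    if prev_end < len(text):
        gaps.append((prev_end, len(text), text[prev_end:]))
    return gaps
```

Telling a word break from a sentence or paragraph break would then be the "if/else blocks" part: inspect each gap substring for ".", "\n\n", and so on, with all the edge cases that implies.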
[17:49:26] kevinbazira:
[17:49:42] Last Week:
[17:49:44] Focused on addressing UI issues identified in user-testing from https://phabricator.wikimedia.org/T247897
[17:49:46] - Pushed a couple of patchsets
[17:49:48] Reviewed a couple of Andy's patchsets
[17:49:50] T:
[17:49:52] Succeeded at setting up MW Vagrant with MW 1.35
[17:49:54] I think I have figured out a less "painful" way to set up MW Vagrant that is reproducible. I'll write its documentation and share it.
[17:49:56] and me:
[17:49:59] Last week: Focused on developing the secondary schemas for Jade, also did a fair amount of code review
[17:50:00] T: Mostly heads down working on the secondary integrations for Jade, also re-enabling hooks & fixing tests that were broken by the new schemas
[19:41:52] I've confirmed that SIGTERM still has the bad behavior and SIGHUP seems to have good behavior.
[20:04:56] halfak: Can you take a look at this when you get a chance: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
[20:05:04] I think we could find it useful. It's a pattern tokenizer that lets us specify our tokens using regex.
[20:05:22] Interesting. You can get a token regex out of our tokenizer.
[20:05:29] One sec, I'll make a demo.
[20:07:22] https://gist.github.com/halfak/0f73a39bd0108c38eddf609a4ebe056e
[20:07:52] Whoops. I left my bad guesses in there. Just fixed it.
[20:10:10] On a side note, I bet there are some improvements we can make in how we specify our regexes.
[20:10:40] https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py
[20:10:40] Taking a look...
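[editor's note] The pattern tokenizer linked above can be tried ad hoc: the _analyze API accepts an inline tokenizer definition, so no index changes are needed. A minimal sketch of building such a request body, with the helper name made up for illustration:

```python
import json

# Sketch: _analyze accepts a custom inline tokenizer definition, so the
# pattern tokenizer can be tried ad hoc. The body below would be POSTed to
# e.g. /enwiki/_analyze as discussed earlier in this log.
def pattern_analyze_body(text, pattern):
    """Build an _analyze body that uses an inline pattern tokenizer."""
    return json.dumps({
        "tokenizer": {"type": "pattern", "pattern": pattern},
        "text": text,
    })
```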
[20:16:30] I just came across this: "IMPORTANT: The regular expression should match the token separators, not the tokens themselves."
[20:16:30] Now I'm wondering how that could work.
[20:16:55] I'll do more checks though
[20:31:46] Scoring-platform-team (Current), Wikilabels, editquality-modeling, artificial-intelligence: Create follow-up edit quality campaign for ptwikipedia - https://phabricator.wikimedia.org/T246668 (Halfak) Sorry for the delay on this one. I'd been focusing on working on the articlequality model so I f...
[20:33:25] haksoat, oh. Yeah, that's a completely different problem.
[20:33:48] * halfak thinks.
[20:35:28] Scoring-platform-team (Current), Wikilabels, editquality-modeling, artificial-intelligence: Create follow-up edit quality campaign for ptwikipedia - https://phabricator.wikimedia.org/T246668 (Halfak) https://github.com/wikimedia/editquality/pull/220
[20:35:52] Scoring-platform-team, Discovery-Search, Epic, Growth-Team (Current Sprint): [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics) - https://phabricator.wikimedia.org/T240517 (Tgr)
[20:37:32] wikimedia/editquality#716 (ptwiki_followup_2020 - bbd453a : Aaron Halfaker): The build failed. https://travis-ci.org/wikimedia/editquality/builds/674583820
[20:37:46] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for ptwikipedia - https://phabricator.wikimedia.org/T246663 (Halfak) I forgot to take notes, but I did look into this. We used to match some templates weirdly. It looks lik...
[21:27:17] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for ptwikipedia - https://phabricator.wikimedia.org/T246663 (Halfak) I just did another update of our extractor and the feature set based on some more feedback from @He7d3r....
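[editor's note] On the "separators, not tokens" note at 20:16, in plain Python terms: Elasticsearch's pattern tokenizer behaves like re.split (the regex matches what lies between tokens), while a deltas/wikitext_split-style lexicon regex behaves like re.findall (the regex matches the tokens themselves). A quick illustration:

```python
import re

# The distinction found above, shown in plain Python: a separator-matching
# regex is used to split, while a token-matching regex is used to find.
text = "red dog, blue cat"

# pattern-tokenizer style: the regex matches the separators
split_tokens = [t for t in re.split(r"[\s,]+", text) if t]

# wikitext_split style: the regex matches the tokens themselves
found_tokens = re.findall(r"\w+", text)

assert split_tokens == found_tokens == ["red", "dog", "blue", "cat"]
```

Elasticsearch also ships a simple_pattern tokenizer that matches the tokens instead of the separators, which may be closer to what a revscoring-style token regex needs.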