[10:05:52] <dcausse>	 lunch+errand
[13:53:19] <pfischer>	 Design Research proposed a method for evaluating the potential alternatives to keyword search: https://docs.google.com/document/d/1-YiFuxx0DyIiy3vfS57oiMFS3CpIf1k8RBMEgMZChdo/edit?tab=t.0 They suggest a list of queries and approaches for comparing results. Could you please have a look? What do you think? Is it feasible?
[14:15:39] <ebernhardson>	 \o
[14:16:16] <ebernhardson>	 pfischer: what they want is a golden set basically, we have a tiny one at https://people.wikimedia.org/~ebernhardson/esltr/discernatron.json 
[14:17:01] <inflatador>	 ebernhardson psst! We're supposed to have the day off. (Don't tell anyone but I'm working a little too ;P )
[14:17:07] <ebernhardson>	 we could plausibly stand up the software that collected those again, but actually grading queries is tedious.
[14:17:16] <ebernhardson>	 inflatador: oh! no wonder there is no school today
[14:18:04] <inflatador>	 {◕ ◡ ◕}
[14:19:33] <ebernhardson>	 pfischer: at a general level, the "generally suggested" way to compare multiple ranking algorithms, as is done for the big IR conferences, is to collect the top N results from all search engines under test, grade them on a scale (often 4 points), and then calculate metrics like ndcg@k, precision@k, recall@k, map@k
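A minimal sketch of the pooled-grading evaluation described above, covering only ndcg@k and precision@k. Everything here is an assumption for illustration: the doc ids, the 0-3 grades, the two rankings, the choice of the 2^grade - 1 gain (one common nDCG variant), the treatment of unjudged results as grade 0, and the "grade >= 2 counts as relevant" cutoff for precision.

```python
import math

def grades_for(ranking, judgments):
    """Map a ranked list of doc ids to grades; unjudged docs count as 0 (assumption)."""
    return [judgments.get(doc, 0) for doc in ranking]

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades, using the 2^g - 1 gain."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(grades[:k]))

def ndcg_at_k(ranking, judgments, k):
    """DCG of the ranking divided by the DCG of the best possible ordering
    of everything judged for this query (the pool from all engines)."""
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(grades_for(ranking, judgments), k) / ideal_dcg

def precision_at_k(ranking, judgments, k, relevant_from=2):
    """Fraction of the top k results graded at or above the relevance cutoff."""
    top = grades_for(ranking, judgments)[:k]
    return sum(1 for g in top if g >= relevant_from) / k

# Hypothetical pooled judgments for one query, on a 4-point scale (0-3).
judgments = {"A": 3, "B": 2, "C": 0, "D": 1}

# Hypothetical top-3 results from two engines under test.
keyword_ranking = ["C", "A", "B"]
other_ranking = ["A", "B", "D"]

for name, ranking in [("keyword", keyword_ranking), ("other", other_ranking)]:
    print(name,
          round(ndcg_at_k(ranking, judgments, 3), 3),
          round(precision_at_k(ranking, judgments, 3), 3))
```

In a real comparison these per-query scores would be averaged over the whole query set for each engine; recall@k and map@k follow the same pattern from the same judgments.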
[14:21:51] <pfischer>	 ebernhardson: thanks! And the golden set (you linked) - Who scored the results?
[14:22:00] <dcausse>	 here it's unclear what they want to grade: a list of results, or the best snippet with the answer
[14:22:38] <ebernhardson>	 pfischer: mostly trey and i, but also a wide variety of random people (for example, at the SF office we had people grade queries for an hour for free pizza)
[14:22:41] <pfischer>	 dcausse: They even suggest “flipping” the response format. I am not sure I got that correctly
[14:22:48] <dcausse>	 they seem to have a list of queries on page 9
[14:23:09] <dcausse>	 no clue if they have more
[14:23:26] <dcausse>	 seems like the approach is a survey with these queries
[14:24:23] <pfischer>	 dcausse: yes, that’s how I read it, too.
[14:27:05] <ebernhardson>	 i worry it's far too small, the example there is 17 queries, i'm not sure you can get strong metrics off that. Maybe if they are wildly variable, like comparing only keyword to a natural language approach, but not sure you can tease out the difference between natural language approaches
[14:27:16] <dcausse>	 I guess my worry is whether these queries are actually representative of what we actually serve
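To put a rough number on the sample-size worry, here is a sketch using an exact two-sided sign test over per-query wins and losses. The sign test is a stand-in chosen for simplicity, not something proposed in the discussion, and the win/loss counts in the example are hypothetical.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Exact two-sided sign test.

    wins/losses: number of queries where engine A scored higher/lower
    than engine B on some per-query metric. Ties are assumed to have
    been dropped before calling this.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    extreme = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcomes over a 17-query set.
print(sign_test_p_value(12, 5))   # a modest edge for one engine
print(sign_test_p_value(17, 0))   # one engine wins every query
```

Under these assumptions, a 12-to-5 split over 17 queries gives a p-value of about 0.14, so a modest edge cannot be told apart from noise, while a clean sweep is clearly detectable. That matches the intuition that 17 queries may separate wildly different systems but not similar ones.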
[14:32:05] <ebernhardson>	 i guess i'll head back out then, enjoy a day off :)  Maybe hit up a bakery
[14:32:36] <dcausse>	 :)
[14:32:43] <dcausse>	 enjoy!
[14:33:34] <pfischer>	 Yeah, happy Columbus Day (right?)
[14:34:40] <pfischer>	 ebernhardson: IIUC, ML will end up with only one prototype that makes the cut and will be evaluated using the suggested approach, so it’s just this prototype vs. keyword search.
[14:35:31] <dcausse>	 the question-to-question approach has been discarded?
[14:36:34] <pfischer>	 Not yet, but they have to make a decision this week, AFAIK.
[14:37:56] <dcausse>	 oh ok
[17:39:31] <dcausse>	 dinner