I'm working with a customer who wants to improve the relevancy of the documents returned by Solr. They expect certain results to rank higher than others because the search term exactly matches words found in the 'name' attribute on the Catalog side.
For example (the client is pt_BR), a search for the term 'boneca' returns items with both Boneco and Boneca, and the Boneco items actually rank higher than the Boneca ones. Looking into this, the text_pt field (the one used to analyze this property) uses the BrazilianStemFilterFactory (BSF), which stems that word down to 'bonec'. I confirmed this is used in querying, and it would explain this behaviour.
So I understand why we're getting mixed results of Boneca and Boneco items together. Is there an easy way to force documents to be more relevant when the 'name' scores high because of an exact match like this?
One possible solution: create another property on the product item, such as 'tags', set to string type and multivalued on the indexed type, so they could enter words that should be searched with exact matches. This would require a bit more management of their products, but it could be helpful in special cases where they need to boost the relevancy of otherwise similar documents because of the way the terms are analyzed/tokenized during querying.
Are there other, more straightforward ways?
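Can you use KeywordRepeatFilterFactory in both the index-time and query-time analyzers? This filter emits each token twice, once marked as a keyword so that stemmers leave it unchanged, so it must be added before any stemming is performed.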
Ex: if the product name is "Boneca", the tokens generated should be "bonec" and "boneca". For "Boneco" the tokens would be "bonec" and "boneco".
If the user now searches for "boneca", the tokens used for the query would be "bonec" or "boneca" (assuming that the default operator is OR). The product named "Boneca" matches both terms, so it gets a higher score. The product named "Boneco" matches only the stem "bonec" and not "boneca", and hence gets a lower score.
Note: The results may vary further depending on the boost factors in use (out of the box these are 1, 0.5 and 0.25 x boost factor, and 2 x boost factor for phrases) and on the uniqueness of the term in the index and the product.
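The token flow described in this answer can be sketched in a few lines of Python. This is only a toy model: `toy_stem` is a hypothetical stand-in for BrazilianStemFilterFactory (it just drops one trailing "a" or "o"), and `analyze` imitates keeping the original token alongside its stem. It does not reproduce Solr's real analysis chain or scoring.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for the Brazilian stemmer: drop one trailing 'a' or 'o'."""
    return token[:-1] if token.endswith(("a", "o")) else token


def analyze(text: str) -> list[str]:
    """Emit each lowercased word unchanged, then its stem (skipped if identical)."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)          # the copy protected from stemming
        stem = toy_stem(word)
        if stem != word:             # avoid emitting a duplicate token
            tokens.append(stem)
    return tokens


def matched_terms(query: str, name: str) -> set[str]:
    """Terms shared by the analyzed query and the analyzed product name."""
    return set(analyze(query)) & set(analyze(name))


print(analyze("Boneca"))                          # ['boneca', 'bonec']
print(sorted(matched_terms("boneca", "Boneca")))  # ['bonec', 'boneca']
print(sorted(matched_terms("boneca", "Boneco")))  # ['bonec']
```

With an OR query, the "Boneca" product matches two query terms and the "Boneco" product only one, which is why the former would be expected to score higher, subject to the boost factors mentioned in the note.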
Your approach sounds good. If you add this tags field, though, you'll probably need a different type in your schema that doesn't use the stem filter; otherwise you'll get the same results. You could also index a field called "exact_name" or something like that, using the new type, and index the name as is into that field. Then, when querying, you can apply a larger boost value to that field. This is similar to your tags approach but should be less work for the product management team since it's automatic.
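As a rough illustration of boosting the unstemmed field at query time, the sketch below builds the parameters for a Solr eDisMax request. It assumes the query is sent to Solr directly, that the stemmed field is literally called `name`, and that a boost of 5 is appropriate; all three are assumptions for illustration, and a commerce platform that generates its own Solr queries would configure the boost elsewhere.

```python
from urllib.parse import urlencode


def build_query(term: str, exact_boost: float = 5.0) -> str:
    """Build a query string that searches both fields and boosts exact_name."""
    params = {
        "q": term,
        "defType": "edismax",                      # Extended DisMax query parser
        "qf": f"name exact_name^{exact_boost}",    # field names and boost are hypothetical
    }
    return urlencode(params)


print(build_query("boneca"))
```

A document whose `exact_name` contains the literal term then gets the boosted contribution on top of the stemmed `name` match, while a document that matches only through the stem gets the `name` contribution alone.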
Otherwise, you can try changing the index analyzer so that it indexes the exact words while keeping the stem filter in the query analyzer. I don't know how that will affect search, though; you may get some inconsistent results, since a stemmed query term such as 'bonec' would not match an unstemmed indexed term such as 'boneca'.
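The risk with mismatched analyzers can be seen in a small toy model. Here `toy_stem` is again a hypothetical stand-in for the Brazilian stemmer (it only drops one trailing "a" or "o"); the index side keeps words as typed and only the query side stems.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for the Brazilian stemmer: drop one trailing 'a' or 'o'."""
    return token[:-1] if token.endswith(("a", "o")) else token


def index_terms(name: str) -> set[str]:
    """Index-time analysis without stemming: lowercase and split only."""
    return set(name.lower().split())


def query_terms(query: str) -> set[str]:
    """Query-time analysis with stemming."""
    return {toy_stem(word) for word in query.lower().split()}


def matches(query: str, name: str) -> bool:
    """True if any analyzed query term equals an indexed term."""
    return bool(index_terms(name) & query_terms(query))


print(matches("boneca", "Boneca"))  # False: 'bonec' is not an indexed term
print(matches("trem", "Trem"))      # True: the toy stemmer leaves 'trem' unchanged
```

In this model a search succeeds only for words the stemmer leaves untouched, which is one way the inconsistent results could show up.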