Flexible ranking in Lucene 4

Over the summer I served as a Google Summer of Code mentor for David Nemeskey, a PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework.

These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc.) and queries (term, phrase, spans, etc.). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.

Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene, whether you’ve been frustrated with tuning TF/IDF weights and find that an alternative model works better for your case, have found it difficult to integrate custom logic that your application needs, or just want to experiment.

I’ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at Lucene Eurocon in October, and at the SFBay Apache Lucene/Solr Meetup later this month.

Some bullet points of the new scoring features:

- New ranking algorithms, in addition to Lucene’s Vector Space Model:
  - Okapi BM25
  - Language Models
  - Divergence from Randomness
  - Information-based models
- Added key statistics to the index format to support additional scoring models:
  - Term- and field-level statistics for collection frequencies and deriving averages.
  - Additional document-level statistics for computing normalization factors.
- Decoupled matching from ranking in Lucene’s core search classes:
  - Customize scoring without digging into the “guts”.
  - Customize explanations: essential for debugging relevance issues.
- Powerful low-level Similarity API, supporting advanced use cases:
  - Incorporate per-document values from Column Stride Fields into the score.
  - Use different scoring parameters or algorithms for different fields.
  - Fuse multiple scoring algorithms into a combined score.
- Convenient high-level SimilarityBase for everything else:
  - Write your own scoring function in one Java method.
  - Easy access to available index statistics.
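
To give a feel for what "a scoring function in one Java method" fed by index statistics can look like, here is a minimal, self-contained sketch of an Okapi BM25-style score. It deliberately does not use Lucene's classes: `Bm25Sketch`, its method names, and its parameter names are hypothetical, and the sample numbers in `main` are made up for illustration. The idf form and the defaults `k1 = 1.2`, `b = 0.75` are one common convention, assumed here rather than taken from the post.

```java
/**
 * Illustrative BM25-style scoring in plain Java.
 * Hypothetical names throughout; this is NOT Lucene's Similarity API.
 */
public class Bm25Sketch {
    // Conventional BM25 defaults (an assumption); both are tunable.
    static final double K1 = 1.2;   // controls term-frequency saturation
    static final double B = 0.75;   // controls document-length normalization

    /** Field-level statistic: average length, derived from two collection totals. */
    static double averageFieldLength(long sumTotalTermFreq, long docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    /** Term-level statistic: a non-negative idf variant, log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** The whole scoring function: one method over term, field and document statistics. */
    static double score(double tf, double docLen, double avgFieldLen,
                        long docCount, long docFreq) {
        // Document-level normalization: longer-than-average documents are penalized.
        double norm = K1 * (1.0 - B + B * docLen / avgFieldLen);
        return idf(docCount, docFreq) * tf * (K1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Made-up collection: 1000 documents holding 120000 terms in this field.
        double avgLen = averageFieldLength(120_000L, 1_000L);
        // Same term frequency, different document lengths.
        System.out.println("short doc: " + score(3, 60, avgLen, 1_000L, 25L));
        System.out.println("long doc:  " + score(3, 400, avgLen, 1_000L, 25L));
    }
}
```

The inputs are the kinds of statistics the list above mentions: collection totals for deriving an average field length, a document frequency for the term, and a per-document length for normalization. Swapping in a different model would mean changing only the body of `score`.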
