Flexible ranking in Lucene 4
Over the summer I served as a Google Summer of Code mentor for David Nemeskey, a PhD student at Eötvös Loránd University. David proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework.

These improvements are now committed to Lucene’s trunk: you can use these models in tandem with all of Lucene’s features (boosts, slops, explanations, etc.) and queries (term, phrase, spans, etc.). A JIRA issue has been created to make it easy to use these models from Solr’s schema.xml.

Relevance ranking is the heart of the search engine, and I hope the additional models and flexibility will improve the user experience for Lucene, whether you’ve been frustrated with tuning TF/IDF weights and find an alternative model works better for your case, found it difficult to integrate custom logic that your application needs, or just want to experiment.

I’ll be giving a talk about how you can practically apply some of the upcoming Lucene 4 search features at Lucene Eurocon in October, and at the SFBay Apache Lucene/Solr Meetup later this month.

Some bullet points of the new scoring features:
  • New ranking algorithms, in addition to Lucene’s Vector Space Model:
  • Added key statistics to the index format to support additional scoring models.
    • Term- and field-level statistics for collection frequencies and deriving averages.
    • Additional document-level statistics for computing normalization factors.
  • Decoupled matching from ranking in Lucene’s core search classes:
  • Powerful low-level Similarity API, supporting advanced use cases:
    • Incorporate per-document values from Column Stride Fields into the score.
    • Use different scoring parameters or algorithms for different fields.
    • Fuse multiple scoring algorithms into a combined score.
  • Convenient high-level SimilarityBase for everything else:
    • Write your own scoring function in one Java method.
    • Easy access to available index statistics.
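To make the last few bullets a little more concrete, here is a rough standalone sketch of what "a scoring function in one method" with access to index statistics can look like, and how two such functions might be fused into one score. It is deliberately not written against Lucene's classes: `ScoringSketch`, `Stats`, `ScoringFunction`, and `fuse` are hypothetical names made up for this illustration, and BM25 and the simplified TF-IDF formula are used only because they are familiar examples. Treat it as a sketch of the idea under those assumptions, not as the actual Similarity API.

```java
/**
 * Standalone illustration only. These are NOT Lucene classes; every name
 * here (ScoringSketch, Stats, ScoringFunction, fuse) is hypothetical.
 */
final class ScoringSketch {

    /** Hypothetical bundle of index statistics a scoring function can read. */
    static final class Stats {
        final long numDocs;          // documents in the collection
        final long docFreq;          // documents containing the term
        final double avgFieldLength; // average field length, in terms

        Stats(long numDocs, long docFreq, double avgFieldLength) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** The single method a custom ranking model supplies in this sketch. */
    interface ScoringFunction {
        double score(Stats stats, double freq, double docLen);
    }

    /** Okapi BM25 with the commonly used parameters k1 = 1.2 and b = 0.75. */
    static final ScoringFunction BM25 = (stats, freq, docLen) -> {
        double k1 = 1.2;
        double b = 0.75;
        double idf = Math.log(
            1.0 + (stats.numDocs - stats.docFreq + 0.5) / (stats.docFreq + 0.5));
        double lengthNorm = k1 * (1.0 - b + b * docLen / stats.avgFieldLength);
        return idf * freq * (k1 + 1.0) / (freq + lengthNorm);
    };

    /** A simplified TF-IDF style score, for comparison. Assumes docLen > 0. */
    static final ScoringFunction TF_IDF = (stats, freq, docLen) -> {
        double idf = 1.0 + Math.log((double) stats.numDocs / (stats.docFreq + 1.0));
        return Math.sqrt(freq) * idf / Math.sqrt(docLen);
    };

    /** Fuses two scoring functions by summing their scores. */
    static ScoringFunction fuse(ScoringFunction first, ScoringFunction second) {
        return (stats, freq, docLen) ->
            first.score(stats, freq, docLen) + second.score(stats, freq, docLen);
    }

    public static void main(String[] args) {
        // Made-up statistics: 1000 docs, term appears in 10, average length 100.
        Stats stats = new Stats(1000, 10, 100.0);
        double freq = 3.0;    // term frequency in the document being scored
        double docLen = 80.0; // length of that document's field

        System.out.println("BM25:   " + BM25.score(stats, freq, docLen));
        System.out.println("TF-IDF: " + TF_IDF.score(stats, freq, docLen));
        System.out.println("Fused:  " + fuse(BM25, TF_IDF).score(stats, freq, docLen));
    }
}
```

Summing is only one possible way to combine scores; a weighted sum or a per-field choice of function would fit the same shape, since each model is just one method over the same statistics.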
For more information about this GSoC project, take a look at its wiki page.