The Significance of 4.0.0
These 4.0.0 releases are over 2 years in the making, and represent the first truly “Revolutionary Release” from the Lucene project, compared to most of the “Evolutionary” X.0 releases in the past. To mark the occasion, I thought it would be fitting to collect comments and thoughts from various Lucene/Solr developers on how they feel about this event.
Personally, I can’t even begin to articulate the magnitude of this release. I have been working with Lucene & Solr for 8 years; in that time I have seen the project (and the code) shift and morph and grow in ways I would never have imagined when I first started getting involved. I would love to hear from other users about your impressions and what you think of the release, either on the mailing lists or in the comment form below.
Mark Miller…
And congrats to everyone. Took a long time for this day to come. 4.0 has always felt comfortably off in the distance to me – this will be an adjustment ;) This community creates fantastic software and it’s been a pleasure lending a hand to build up to this. Thanks to everyone for all the goodness you have contributed to make this release so awesome. Special thanks to our fearless release manager.
Yonik Seeley…
It’s awesome to finally see the release of Solr 4, getting SolrCloud scalability, NoSQL features, and of course performance improvements into the hands of users.
Grant Ingersoll…
Solr and Lucene 4.0 represent a watershed moment in the making of next generation open source search capabilities. By focusing on building a toolset to power the next generation of data driven, real time applications at large scale, the leading open source search library and server further cements their position as the goto place for search. On top of it all, 4.0 is more flexible than ever, appealing to both those who want something to just work to those who want control over all the details. Whether it is the new flexible indexing and scoring capabilities or the easy scaling of SolrCloud, chances are Lucene and Solr 4.0 have the tools needed to solve many difficult search problems.
Mike McCandless…
Every release brings great new features and bug fixes to Lucene and Solr. What I think is most special about 4.0 are the “meta features” like flexible indexing and scoring, which give users new freedom to explore fundamental changes to how Lucene stores its index and scores query hits. Such deep explorations were not really possible before, so I’m looking forward to the fun things our users create based on these meta features.
Steven Rowe…
4 more Lucene! 4 more Solr!
Uwe Schindler…
i am tired :( it is 2 am in the morning! i have no idea anymore….
(Annotated) Release Highlights
Solr 4.0 Release Highlights
The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. See wiki.apache.org/solr/SolrCloud for more details.
- Distributed indexing designed from the ground up for near real-time (NRT) and NoSQL features such as realtime-get, optimistic locking, and durable updates.
- High availability with no single points of failure.
- Apache Zookeeper integration for distributed coordination and cluster metadata and configuration storage.
- Immunity to split-brain issues due to Zookeeper’s Paxos distributed consensus protocols.
- Updates sent to any node in the cluster are automatically forwarded to the correct shard and replicated to multiple nodes for redundancy.
- Queries sent to any node automatically perform a full distributed search across the cluster with load balancing and fail-over.
- A collection management API.
- Smart SolrJ client (CloudSolrServer) that knows to send documents only to the shard leaders.
- Update durability – A transaction log ensures that even uncommitted documents are never lost.
- Real-time Get – The ability to quickly retrieve the latest version of a document, without the need to commit or open a new searcher.
- Versioning and Optimistic Locking – combined with real-time get, this allows read-update-write functionality that ensures no conflicting changes were made concurrently by other clients.
- Atomic updates – the ability to add, remove, change, and increment fields of an existing document without having to send in the complete document again.
- New spatial field types with polygon support.
- Pivot Faceting – Multi-level or hierarchical faceting where the top constraints for one field are found for each top constraint of a different field.
- Pseudo-fields – The ability to alias fields, or to add metadata along with returned documents, such as function query values and results of spatial distance calculations.
- A spell checker implementation that can work directly from the main index instead of creating a sidecar index.
- Pseudo-Join functionality – The ability to select a set of documents based on their relationship to a second set of documents.
- Function query enhancements including conditional function queries and relevancy functions.
- New update processors to facilitate modifying documents prior to indexing.
- A brand new web admin interface, including support for SolrCloud and improved error reporting.
- Numerous bug fixes and optimizations.
- New spatial field types with polygon support.
- Various Admin UI improvements.
- SolrCloud-related performance optimizations in writing the transaction log, PeerSync recovery, leader election, and ClusterState caching.
- Numerous bug fixes and optimizations.
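The versioning and optimistic locking item in the list above describes a read-update-write cycle that fails when another client has changed the document in the meantime. The following is a minimal sketch of that idea using a toy in-memory store. It is not Solr code: `TinyStore`, `VersionConflict`, and `increment` are hypothetical names invented for illustration, and the version counter is an assumption about how versions might be issued.

```python
import itertools


class VersionConflict(Exception):
    """Raised when the caller's expected version differs from the stored one."""


class TinyStore:
    """Hypothetical in-memory document store with per-document versions."""

    def __init__(self):
        self._docs = {}                    # doc id -> (version, fields)
        self._clock = itertools.count(1)   # monotonically increasing versions

    def get(self, doc_id):
        """Return (version, copy of fields) for the latest state of a document."""
        version, fields = self._docs[doc_id]
        return version, dict(fields)

    def put(self, doc_id, fields, expected_version=None):
        """Write a document; if expected_version is given, it must match."""
        current = self._docs.get(doc_id)
        if expected_version is not None:
            current_version = current[0] if current else None
            if current_version != expected_version:
                raise VersionConflict(
                    "expected %r, found %r" % (expected_version, current_version)
                )
        new_version = next(self._clock)
        self._docs[doc_id] = (new_version, dict(fields))
        return new_version


def increment(store, doc_id, field, amount, max_retries=5):
    """Read-update-write loop: re-read and retry when a conflict is detected."""
    for _ in range(max_retries):
        version, fields = store.get(doc_id)
        fields[field] = fields.get(field, 0) + amount
        try:
            return store.put(doc_id, fields, expected_version=version)
        except VersionConflict:
            continue  # another writer got in first; try again with fresh data
    raise VersionConflict("gave up after %d attempts" % max_retries)
```

A client that holds a stale version is rejected rather than silently overwriting newer data, which is the property the release notes describe. In this sketch the caller decides whether to retry, as `increment` does, or to surface the conflict.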
Lucene 4.0 Release Highlights
- The index formats for terms, postings lists, stored fields, term vectors, etc. are pluggable via the Codec API. You can select from the provided implementations or customize the index format with your own Codec to meet your needs.
- Similarity has been decoupled from the vector space model (TF/IDF). Additional models such as BM25, Divergence from Randomness, Language Models, and Information-based models are provided.
- The new doc values feature stores typed values per-document. It can be used for custom scoring factors (accessible via Similarity), for pre-sorted Sort values, and more.
- IndexWriter now flushes segments to disk concurrently, when the application uses multiple threads for indexing, resulting in substantial performance improvements.
- Per-document normalization factors (“norms”) are no longer limited to a single byte. Similarity implementations can use any DocValues type to store norms.
- New index statistics have been added, including the number of tokens for a term or field, number of postings for a field, and number of documents with a posting for a field. These support additional scoring models.
- A new default term dictionary/index (BlockTree) indexes shared prefixes instead of every n’th term. This is not only more time- and space-efficient, but can avoid going to disk at all for terms that do not exist in certain cases. Alternative term dictionary implementations are provided and pluggable via the Codec API.
- Indexed terms are no longer limited to UTF-16 char sequences; they can now be any binary value encoded as byte arrays. By default, text terms are encoded as UTF-8 bytes. Sort order of terms is defined by their binary value, which is identical to UTF-8 (Unicode code point) sort order.
- Substantially faster performance when using a Filter during searching.
- File-system based directories can rate-limit the IO (MB/sec) of merge threads, to reduce IO contention between merging and searching threads.
- A number of alternative Codecs and components have been added:
  - “Appending” works with append-only filesystems (such as Hadoop DFS)
  - “Memory” writes the entire terms+postings as an FST read into RAM
  - “Pulsing” inlines the postings for low-frequency terms into the term dictionary
  - “SimpleText” writes all files in plain-text for easy debugging/transparency
  - “Bloom” uses a bloom filter to sometimes avoid disk seeks when looking up terms
  - “Direct” holds all postings as simple byte[] and int[] for very fast performance at the cost of very high RAM consumption
  - “Block” uses a new index layout and compression scheme for improved performance
  - …among others.
- Term offsets can be optionally encoded into the postings lists and retrieved per-position.
- A new AutomatonQuery returns all documents containing any term matching a provided finite-state automaton.
- FuzzyQuery is 100-200 times faster than in past releases.
- A new spell checker, DirectSpellChecker, finds possible corrections directly against the main search index without requiring a separate index.
- Various in-memory data structures such as the term dictionary and FieldCache are represented more efficiently with less object overhead.
- All search logic is now required to work per segment; IndexReader was therefore refactored to differentiate between atomic and composite readers.
- Lucene 4.0 provides a modular API, consolidating components such as Analyzers and Queries that were previously scattered across Lucene core, contrib, and Solr. These modules also include additional functionality such as UIMA analyzer integration and a completely reworked spatial search implementation.
- A new “Block” PostingsFormat offering improved search performance and index compression. This will likely become the default format in a future release.
- All non-default codec implementations were moved to a separate codecs module. Just add lucene-codecs-4.0.0.jar to your classpath to test these out.
- Payloads can be optionally stored on the term vectors.
- Many bugfixes and optimizations.
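The list above notes that terms are now sorted by their binary value, and that for UTF-8 encoded text this matches Unicode code point order. The sketch below illustrates why that differs from sorting by UTF-16 code units (the order Java's `String.compareTo` uses) once characters outside the Basic Multilingual Plane are involved. It uses only the Python standard library and does not touch Lucene; the function names and sample terms are illustrative choices.

```python
def utf8_order(terms):
    """Sort terms by their UTF-8 bytes, compared as unsigned bytes."""
    return sorted(terms, key=lambda t: t.encode("utf-8"))


def utf16_order(terms):
    """Sort terms by UTF-16 code units (big-endian bytes preserve unit order)."""
    return sorted(terms, key=lambda t: t.encode("utf-16-be"))


def code_point_order(terms):
    """Sort terms by their sequence of Unicode code points."""
    return sorted(terms, key=lambda t: [ord(c) for c in t])


# U+FF5E sits in the BMP; U+1D11E needs a surrogate pair in UTF-16.
terms = ["\uff5e", "\U0001d11e", "z", "\u00e9"]

same_as_code_points = utf8_order(terms) == code_point_order(terms)
differs_in_utf16 = utf16_order(terms) != code_point_order(terms)
```

With these sample terms, the UTF-8 ordering agrees with code point order, while the UTF-16 ordering places U+1D11E before U+FF5E because its leading surrogate (0xD834) is numerically smaller than 0xFF5E.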