While many Lucene/Solr applications will never outgrow a single, well-configured machine, the fact is, more and more applications are pushing beyond the single machine limit due to either index size or query volume. In discussing Lucene and Solr best practices for performance and scaling, Mark Miller explains how to get the most out of a single machine, as well as how to harness multiple machines to handle large indexes, large query volume, or both.
By Mark Miller

One of the cache settings to be mindful of is the autowarm value. The autowarm setting tells Solr how many entries to take from the old cache and put into the new one when a new view of the index is opened (due to an index change). The document cache cannot be autowarmed, but for the other caches, you want to balance the autowarm count: high enough that a fair portion of the cache is carried into the new Searcher, giving it a nice boost in filling up, but not so high that it takes too much time to warm the new Searcher for use. The new view will not be available to users until the warming is done, so be sure to test that you are warming in an acceptable time frame.

It is also a good idea to use the Solr admin webpage to look at your cache statistics. If you have a very low hit rate, your cache may be doing more harm than good. If you have a very high eviction rate, your cache is likely too small, and it too may be doing more harm than good: with enough evictions, it is entirely possible that cached results are being tossed out before they are used, or after they are used only a handful of times. Check out the Solr Wiki on SolrCaching and be sure to use the appropriate settings for best performance.

This is not the first time we have needed to know things like how many unique values we might have in a field. A very useful tool for finding this kind of information is the LukeRequestHandler that Solr provides. Simply hitting solr/admin/luke or solr/admin/luke?wt=xslt&tr=luke.xsl will display a variety of great statistics about your data. Don't be afraid to slurp your data in, look at it with the LukeRequestHandler, tweak what you have done, and then start all over. For large indexes, you might sacrifice some information by adding numTerms=0, as in solr/admin/luke?numTerms=0. This can turn a call that takes many minutes on a large index into one that takes seconds, at the price of less detailed term data.

Lucene replication is mostly a do-it-yourself affair. The 'best practice' technique is to take advantage of Lucene's index file semantics. A Lucene index is made up of one or more individual segments, and a write-once scheme is used, so a segment's files do not change on index updates. Instead, new files are created, and the index is atomically switched to point at the unchanged old files plus any new files. This setup works well with index replication because it is quite easy to use something like rsync to replicate index changes efficiently – you can just copy the new files. For example, upon adding a few documents to an index that already has millions of documents, a new segment containing the few new documents will be written, and often only this segment will need to be replicated to the other machines. Segment merging will affect which segments need to be copied, but many times there will be large unchanged segments, allowing for efficient copying of small index deltas. So a classic configuration is a master that you add and update documents on, and n slave servers that you replicate the master index to (actually just the changed files in the index). When the time and bandwidth needed for replication are less of a concern, and high query throughput is more important, it can be wise to abandon the advantage of transferring only changed segments and instead replicate fully optimized indexes.
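Whichever files you ship, the tricky part of do-it-yourself replication is copying a live index without the IndexWriter deleting files out from under you mid-copy. A minimal sketch of the usual trick, pinning a commit point with Lucene's SnapshotDeletionPolicy (Lucene 2.9-era API; details vary between versions, and copyIfMissing is a hypothetical helper standing in for rsync or a real file copy):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    public class SnapshotCopy {
      public static void main(String[] args) throws Exception {
        File masterDir = new File(args[0]), slaveDir = new File(args[1]);
        SnapshotDeletionPolicy snapshotter =
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        IndexWriter writer = new IndexWriter(FSDirectory.open(masterDir),
            new StandardAnalyzer(), snapshotter,
            IndexWriter.MaxFieldLength.UNLIMITED);

        IndexCommit commit = snapshotter.snapshot(); // pin the commit point
        try {
          for (String name : commit.getFileNames()) {
            // Write-once semantics: a segment file that already exists on
            // the slave is identical, so only new files actually move.
            copyIfMissing(new File(masterDir, name), new File(slaveDir, name));
          }
        } finally {
          snapshotter.release(); // old files may be deleted again
        }
        writer.close();
      }

      // Hypothetical helper -- stand-in for rsync or a real file copy.
      static void copyIfMissing(File src, File dst) { /* ... */ }
    }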
Replicating only optimized indexes costs a bit more in terms of resources, but the master eats the cost of optimizing (so users don't see the standard machine slowdown effect that performing an optimize brings), and the slaves always get a fully optimized index to issue queries against, allowing for maximum query performance. Generally, bandwidth for replication is not much of a concern now, but keep in mind that optimizing a large index can be quite time consuming, so this strategy is not for every situation.

For high availability, you can use a load balancer to set up a virtual IP for each shard's set of slaves. If you are new to load balancing, HAProxy is a good open source software load balancer. If a slave server goes down, a good load balancer will detect the failure using some technique (generally a heartbeat system) and forward all requests to the remaining live slaves that served alongside the failed one. A single virtual IP should then be set up so that requests can hit a single IP and get load balanced out to each of the virtual IPs for the search slaves. With this configuration you will have a fully load balanced, search-side fault tolerant system (Solr does not yet support fault tolerant indexing). Incoming searches are handed off to one of the functioning slaves; that slave then distributes the search request across one slave for each of the shards in your configuration by issuing a request to each shard's virtual IP, with the load balancer choosing one of the available slaves. Finally, the results are combined into a single result set and returned. If any of the slaves go down, they are taken out of rotation and the remaining slaves are used. If a shard's master goes down, searches can still be served from the slaves until you have corrected the problem and put the master back into production.
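If you would rather load balance in the client than run a dedicated balancer in front of every slave pool, recent SolrJ versions include LBHttpSolrServer, which round-robins requests across a list of servers and skips dead ones until they recover. A minimal sketch with placeholder slave hosts (a software or hardware load balancer remains the more robust choice for the full architecture above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SlaveQuery {
      public static void main(String[] args) throws Exception {
        // Round-robins requests across the slaves; a server that fails
        // is marked dead and periodically retried.
        LBHttpSolrServer lb = new LBHttpSolrServer(
            "http://slave1:8983/solr", "http://slave2:8983/solr");
        QueryResponse rsp = lb.query(new SolrQuery("ipod"));
        System.out.println(rsp.getResults().getNumFound() + " hits");
      }
    }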
Support for omitting term frequencies is being tracked in SOLR-739 and should be available soon.
Searching each index segment individually with a single HitCollector is being tracked in LUCENE-1483.

Optimizing in both Lucene and Solr is an I/O intensive operation, and on a large index it can take real time to complete. You might also consider issuing a partial optimize. With a partial optimize, you tell Lucene/Solr how many resulting segments you want, which lets you improve search speed, perhaps a step at a time, without committing to a full optimization down to a single segment. Another strategy for maintaining a low segment count is to use a low merge factor on your IndexWriter when adding to the index. The merge factor roughly controls how many segments your index will span; using a value lower than 10 can help keep searches nice and fast. The tradeoff is that additions to the index will take a bit longer, as merging has to take place more often to keep the segment count low. For example, with a merge factor of 2 (the lowest allowed value), you would never have more than two segments.
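As a sketch, both knobs look like this with the Lucene 2.9-era IndexWriter API (newer versions move or rename these methods):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentControl {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File(args[0])),
            new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);

        // Low merge factor: merge more aggressively while indexing,
        // trading slower additions for fewer segments at search time.
        writer.setMergeFactor(2);

        // Partial optimize: merge down to at most 4 segments instead of
        // paying for a full optimize to a single segment.
        writer.optimize(4);
        writer.close();
      }
    }

On the Solr side, an update message of the form <optimize maxSegments="4"/> requests the same partial optimize (check the update syntax for your version).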
Lucene uses the FieldCache to efficiently access all of the values for a field in memory rather than going to disk. This is necessary for sorting and can be used for Solr's faceting, among other things. For a large index, the FieldCache can require a fair amount of RAM, especially if you load one for many fields (if you sort on many fields, for example). Understanding out-of-memory errors related to FieldCaches has been a common issue for many Lucene/Solr users.

A FieldCache caches the value (and possibly an ordinal) for every document in the index in memory. This allows for fast comparisons on a value for a given Document field. An ordinal simply indicates order, and might be used for something like Strings: instead of Tom, Dick, Clark, you might use 3, 2, 1 – sorting will be faster while maintaining the right order. For other types (integer, long, etc.), the value itself can serve as a good ordinal.

Most of a FieldCache is simply an array whose size depends on how many documents are in the index (including deleted docs that have not been merged out). So if you are sorting on a long field in an index with 10 million documents, the FieldCache will load 10 million longs into a long array: approximately 76.29 megabytes. Multiply that by the number of long fields that a FieldCache is built on to get your total long FieldCache memory usage, and repeat for your other field types to get an idea of total usage. Another example: an int array on a 100 million document index will consume over 380 megabytes.

The String type is a bit more complicated than the others. If you have a non-locale String-based FieldCache (that is, you are sorting on a String field but not supplying a Locale for String comparisons), an array of all of the unique terms in the index is loaded, and then a second array of integers is loaded with an entry for each document in the index. The second array is full of ordinals that index into the unique terms array. This makes access to the values less efficient (two array dereferences), but in the single-IndexReader case it lets you sort using integers rather than Strings, by comparing entries in the ordinal array. If you do supply a Locale for the String FieldCache, a String array is instead filled with the term from each document for that field, just like the other primitive types; ordinal compares do not work when you are using a Locale. This representation saves the extra index into the unique terms array on lookup, but it is still slower overall because you have to compare Strings rather than integer ordinals when sorting.
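The arithmetic above is easy to reproduce; a tiny sketch, assuming 8 bytes per long, 4 per int, and a megabyte counted as 1024 × 1024 bytes:

    public class FieldCacheSizing {
      public static void main(String[] args) {
        final double MB = 1024.0 * 1024.0;

        // One long per document, 10 million documents:
        long longBytes = 10_000_000L * 8;
        System.out.printf("long field, 10M docs: %.2f MB%n", longBytes / MB);
        // -> 76.29 MB

        // One int per document, 100 million documents:
        long intBytes = 100_000_000L * 4;
        System.out.printf("int field, 100M docs: %.2f MB%n", intBytes / MB);
        // -> 381.47 MB ("over 380 megabytes")
      }
    }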
The main RAM-eating data structures for a String FieldCache that does not compare using a Locale are the two arrays just described: the first contains all of the unique terms in the index for the FieldCache field, and the second holds, for each document in the index, an index into the first. As you can see, you can sort documents using only comparisons of the integers in the second array. You can't do this with a MultiSearcher, however – ordinals from different indexes cannot be usefully compared.
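In code, this pair of arrays is Lucene's FieldCache.StringIndex (2.x/3.x API; the field name "title" is just an example):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class OrdinalSort {
      static void demo(IndexReader reader) throws Exception {
        FieldCache.StringIndex si =
            FieldCache.DEFAULT.getStringIndex(reader, "title");

        // si.lookup : the unique terms for the field, in sorted order
        // si.order  : for each document, an index into si.lookup
        int docA = 0, docB = 1;

        // A cheap int comparison decides sort order -- no String compares:
        boolean aSortsFirst = si.order[docA] < si.order[docB];

        // The actual value is only materialized when it is needed:
        String valueA = si.lookup[si.order[docA]];
      }
    }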
- FilterCache – unordered sets of document ids. This is for caching filter queries. The cache stores enough information to filter out the right documents across the whole index for a given query, and set intersections on these filtered ids allow filter queries to be combined efficiently. It does not cache the order of returned documents, so it is no good for caching a query that relies on relevance or sort fields. If you are faceting with the enum method described below, this cache should be set to at least the total number of unique values in all of the fields you are faceting on that way, so that a filter can be cached for every value.
- QueryCache – ordered document ids. This is for caching the results of normal queries. This can require much less RAM than the FilterCache because it only caches the returned documents, while the FilterCache must cache the results for the whole index. The optimal size of this cache depends on a lot of factors. Essentially, you want to make sure that it is large enough so that the majority of the results of your really common queries are cached.
- DocumentCache – stores stored fields. Solr caches Documents in memory so that no request has to hit the disk for stored fields. This can be very valuable as stored fields are most often used for hit list displays. The Solr Wiki recommends that you set the size of this cache to at least <max_results> * <max_concurrent_queries>, to ensure that Solr does not need to re-fetch a document during a request.
- FacetQueries are handled by caching the results of a query as a filter. This FacetQuery set of documents is intersected against result sets to count how many documents the query condition is true for (the facet counts). If there are few enough results in the filter, it is maintained as a hashed set of document ids; if the number of results exceeds the 'hashDocSet' setting, a bit set is used instead.
- FacetFields allow for facet counts based on the distinct values in a field. There are two methods for FacetFields: one that performs well with few distinct values in a field, and another for when a field contains many distinct values (generally thousands and up – you should test what works best for you). The first method, facet.method=enum, works by issuing a FacetQuery for every unique value in the field. As mentioned, this is an excellent method when the number of distinct values in a field is small, but it requires excessive memory and breaks down when the number of distinct values gets large. When using this method, be careful to ensure that your FilterCache is large enough to contain at least one filter for every distinct value you plan on faceting on. The second method uses the Lucene FieldCache (a future version of Solr will actually use a different, non-inverted structure – the UnInvertedField). This method is actually slower and more memory intensive for fields with a low number of unique values, but if you have a lot of uniques, this is the way to go. It uses the FieldCache to look up the value of the given field for each document, and every time a document with a given value is found, that value's count is incremented. Both methods appear in the SolrJ sketch below.
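A hedged SolrJ sketch of choosing a facet method per field (the field names and server URL are placeholders, and the per-field facet.method parameter assumes a recent Solr, around 1.4):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetMethods {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);

        // Few distinct values: enum issues one cached filter per value.
        q.addFacetField("inStock");
        q.set("f.inStock.facet.method", "enum");

        // Many distinct values: count through the FieldCache instead.
        q.addFacetField("author");
        q.set("f.author.facet.method", "fc");

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetFields());
      }
    }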
A distributed search request looks like this: http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7… You simply add a shards parameter that contains each shard's URL, comma separated. This causes the select RequestHandler to search each of the listed URLs independently and then combine the results as if you had issued one search across one large index (a SolrJ version of this request is sketched after the component list below).

It's generally best to avoid using the URL to specify your shards, though. If you have set up a lot of shards, or you just don't want to deal with a bunch of URLs in a Solr GET request, it's much easier to set the shards parameter for your SearchHandler in solrconfig.xml. That way you can set it once and effectively forget about it for a while. Any RequestHandler that extends SearchHandler can use SearchComponents and perform a distributed search. However, only SearchComponents that are 'distributed aware' work with distributed searches. The current components that support distributed search are:

- The Query component that returns documents matching a query
- The Facet component, for facet.query and facet.field requests where facets are sorted by count (the default). Solr 1.4, the next version, will also support sorting by name.
- The Highlighting component
- The Debug component

For best results, you will want to load balance incoming requests across each of the shards. Each request that hits a shard will be distributed by that shard to itself and the other shards, and then the results are merged; you want to be sure to spread that duty evenly across your shards. Be careful of the deadlock warning in the Solr Wiki if you do this, though: you need to be sure that the number of threads serving HTTP requests in your container is greater than the number of requests you can receive from the shard itself and all of the other shards in your configuration, or you may experience a deadlock. Get the full details on setting up distributed search with Solr at the Solr Wiki.
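For reference, here is the same distributed request issued from SolrJ, as a minimal sketch (the shard host names are placeholders; in practice you would point them at the per-shard virtual IPs described earlier):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DistributedQuery {
      public static void main(String[] args) throws Exception {
        // Send the request to one shard; it fans out to all of them.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://shard1:8983/solr");

        SolrQuery q = new SolrQuery("ipod");
        q.set("shards", "shard1:8983/solr,shard2:8983/solr");

        QueryResponse rsp = server.query(q);  // merged result set
        System.out.println(rsp.getResults().getNumFound() + " total hits");
      }
    }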
- SolrPerformanceFactors – A list of common things to consider when thinking about Solr performance.
- NIOFSDirectory – Lucene Directory implementation that allows for faster multi-core/processor performance on non-Windows operating systems.
- FSDirectory – Standard Lucene Directory implementation.
- FieldCache – Lucene cache of document field values / ordinals.
- IndexDeletionPolicy – Lucene class that allows custom control over deletion of old index views.
- FieldSelector – Lucene class that allows selective field loading.
- RemoteSearchable – Lucene class that allows for distributing indexes across servers.
- Lucene Benchmark Contrib – great framework for benchmarking Lucene changes and settings.
- Lucene: Improve Searching Speed – General tips for improving Lucene search performance.
- TrieRangeQuery – Efficient large scale numerical range search.
- Solr Wiki: Distributed Search – More about Solr’s excellent distributed search features.
- Solr Wiki: Collection Distribution – Solr index replication with Unix scripts.
- Solr Wiki: Solr Replication – Solr index replication using a Java implementation.
- Solr Wiki: Solr Caching – Learn more about Solr’s cache support.
- Solr Wiki: LukeRequestHandler – Explore your index data.
- SystemInfoHandler – Quickly get info about Solr and your server.
- Solr Faceting – An overview of Solr’s faceting features.
- Java Performance – Performance tips for Java from Sun.
- Hot Backups with Lucene – Information on how to safely back up your Lucene indexes.
- Solr Wiki: Writing Distributed Search Components – Getting started writing your distributed SearchComponents with Solr.
- Sun JVM bug: Single FileChannel slower than using multiple FileChannels (Windows).
- LUCENE-1483 – Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector.
- SOLR-739 – Add support for OmitTf() to Solr.
- Skip Lists – Faster postings list intersection via skip pointers.
- Permuterm indexes – Efficient wildcard queries.
- BiWords – Efficient phrase searching.
- HAProxy – An open source software load balancer.