- fieldOriginal - The source of the content. This was the main field used for sorting.
- fieldSearch - Copy field of Original, but rounded to the nearest 100. This was the main field for searching.
- fieldFacet - Copy field of Original, but rounded based on a percentage of the original value so as to provide a sliding scale for faceting. This was the main field used for faceting.
Sorting, Faceting and Schema Design in Solr
I was recently with a client doing a Best Practices assessment when I came across a common source of confusion related to sorting, faceting and schema design.
As background, Solr provides a schema that describes the Fields and Field Types (FT) that are used by an application. Field Types describe how Solr should handle the information contained in a Field. For instance, the integer FT tells Solr to treat the contents of any Field of type integer as, you guessed it, an integer. By integer here, I mean good old-fashioned Java ints. Solr provides other FTs like long, double, float, string and date, as well as Text (which can be associated with Lucene’s analysis process).
Additionally, Solr provides several “sortable” FTs such as sint, slong, sdouble and sfloat. Therein lies the confusion. I think what happens is developers hear the word “sortable” and think they should use the sortable FT for any field they want to sort results by. However, there is some subtlety here. Namely, “sortable” FTs manipulate the content so that the lexicographic order is the same as the numeric order for use during search. Sortables are thus really meant to be used when doing things like range queries (e.g. price:[2 TO 100]) and not for sorting as it relates to returning results. Because of this transformation, sortables take up more space in the index (and in memory) than their non-sortable compadres.
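To make the lexicographic-versus-numeric point concrete, here is a minimal Java sketch. It is an illustration only: the `encode` and `decode` helpers are hypothetical and are not the encoding Solr's sortable field types actually use. It simply shows that plain decimal strings sort "wrong" as text, and that a fixed-width, offset encoding makes text order agree with numeric order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: NOT the encoding Solr's sortable field types actually use.
class SortableEncodingSketch {

    // Hypothetical encoding: shift the int into the range 0..2^32-1, then
    // zero-pad to 10 digits so that string order equals numeric order.
    static String encode(int value) {
        long shifted = (long) value - Integer.MIN_VALUE;
        return String.format(Locale.ROOT, "%010d", shifted);
    }

    // Reverses encode().
    static int decode(String encoded) {
        return (int) (Long.parseLong(encoded) + Integer.MIN_VALUE);
    }

    // Sorts the plain decimal strings as text, the way terms are ordered.
    static List<String> sortPlain(int... values) {
        List<String> terms = new ArrayList<>();
        for (int v : values) {
            terms.add(Integer.toString(v));
        }
        Collections.sort(terms);
        return terms;
    }

    // Sorts the encoded strings as text, then decodes them for display.
    static List<Integer> sortEncoded(int... values) {
        List<String> terms = new ArrayList<>();
        for (int v : values) {
            terms.add(encode(v));
        }
        Collections.sort(terms);
        List<Integer> decoded = new ArrayList<>();
        for (String t : terms) {
            decoded.add(decode(t));
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(sortPlain(2, 10, 100));   // text order of plain terms
        System.out.println(sortEncoded(2, 10, 100)); // text order of encoded terms
    }
}
```

With plain terms, a text-ordered range from 2 to 100 would miss 10 entirely, which is why range queries need an order-preserving encoding. The longer encoded strings also hint at why such fields cost more space.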
What’s this got to do with schema design? Well, this client had three fields, all defined as sortable integer FTs, as in:

So after reading the above I went and converted all the fields I was using to sort results from sint to integer. After implementing this change to the schema, Solr started throwing Java exceptions from the FieldCache while trying to convert to int. Changing back reversed this. We are trying to cascade several sort fields, BTW, and this seems to work fine. Here is the exception:
SEVERE: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at org.apache.lucene.search.FieldCacheImpl$3.parseInt(FieldCacheImpl.java:148)
at org.apache.lucene.search.FieldCacheImpl$7.createValue(FieldCacheImpl.java:262)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:245)
at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:239)
at org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:291)
at org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:188)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:168)
at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:56)
at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1108)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:834)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:269)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:160)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:177)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Unknown Source)
You will need to re-index.
Of course, and in fact I did a complete reindex of all documents and reloaded the Solr app in Tomcat after every schema change. I was wondering: does omitNorms=true on these sort fields have anything to do with this? Does cascading six sort fields have anything to do with this? All of this works as expected as long as our sort fields are type=sint.
What’s your schema field and FieldType definition for the field causing the problem? The exception makes me think the field in question has some missing entries.
Yes, I was thinking the same thing, but I don’t understand why. Note that I am searching with all of these as sort fields, and it seemed to work properly if I didn’t make the fields required and only populated a field when I had a value above zero. If boosted is '1' I want those docs first; otherwise I want the non-boosted docs to sort by the next sort field down, so I made the fields required=false. I am testing with only 16 docs, and my sort fields are based on the example’s stock field type definitions, sint and integer. Here they are (using integer):
uh they won’t post… hah! i will try removing the angle brackets:
field name="boosted" type="integer" indexed="true" stored="false" omitNorms="true" required="false" default="0"
field name="stocked" type="integer" indexed="true" stored="false" omitNorms="true" required="false" default="0"
field name="sales" type="integer" indexed="true" stored="false" omitNorms="true" required="false" default="0"
field name="views" type="integer" indexed="true" stored="false" omitNorms="true" required="false" default="0"
field name="image" type="integer" indexed="true" stored="false" omitNorms="true" required="true" default="0"
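For readers following along, the cascading sort described above (boosted docs first, ties falling through to the next field) can be sketched in plain Java. This is only an in-memory analogy, not how Solr executes a sort: the `Doc` class is hypothetical, only three of the fields are used, and the descending direction on every field is an assumption. Under that assumption, the request parameter would look something like `sort=boosted desc,stocked desc,sales desc`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CascadingSortSketch {

    // Hypothetical in-memory stand-in for an indexed document.
    static final class Doc {
        final String id;
        final int boosted;
        final int stocked;
        final int sales;

        Doc(String id, int boosted, int stocked, int sales) {
            this.id = id;
            this.boosted = boosted;
            this.stocked = stocked;
            this.sales = sales;
        }
    }

    // Highest boosted first; ties fall through to stocked, then to sales.
    // The descending direction on each field is an assumption.
    static final Comparator<Doc> ORDER =
            Comparator.comparingInt((Doc d) -> d.boosted).reversed()
                    .thenComparing(Comparator.comparingInt((Doc d) -> d.stocked).reversed())
                    .thenComparing(Comparator.comparingInt((Doc d) -> d.sales).reversed());

    // Returns the document ids in cascaded sort order without mutating the input.
    static List<String> sortedIds(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Doc d : copy) {
            ids.add(d.id);
        }
        return ids;
    }
}
```

A document that relies on the default of 0 for every field simply sorts last under this ordering, which matches the intent of leaving the fields optional.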
This comment from Yonik makes it sound like sorting via sint doesn’t even use the parseInt function, which would explain why switching back makes my exception go away:
> 2) If we sort on postid instead, would we need to use integer, or the sint
> type? I assume sint would be faster(?) but perhaps use more storage?
If you need range queries, SortableIntField values are ordered
correctly for them to work.
For sorting, both int and sint fields work… the difference is in how
the FieldCache entry is built.
For IntField, an Integer.parseInt(str) needs to be done for each distinct str.
SortableIntField is sorted like strings… the ordinal (order in the
index) is recorded for each distinct value.
So sint will build the FieldCache faster, but the string values will
cause the entry to be larger. After the FieldCache entry is built,
both int and sint should be comparable in speed.
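The two ways of building a cache entry described in the quoted reply can be sketched roughly as follows. This is a simplified model, not Lucene's FieldCache code: the class and method names are hypothetical, and the input is assumed to be one term per document. The string-style input is also assumed to be already encoded so that text order matches numeric order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class FieldCacheSketch {

    // int-style entry: Integer.parseInt runs once per distinct term,
    // and the cache keeps one int per document.
    static int[] buildParsed(String[] termPerDoc) {
        Map<String, Integer> parsedByTerm = new HashMap<>();
        int[] values = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            values[doc] = parsedByTerm.computeIfAbsent(
                    termPerDoc[doc], term -> Integer.parseInt(term));
        }
        return values;
    }

    // Hypothetical holder for a string-style entry.
    static final class OrdinalEntry {
        final String[] lookup; // distinct terms in sorted order
        final int[] order;     // per document: index into lookup

        OrdinalEntry(String[] lookup, int[] order) {
            this.lookup = lookup;
            this.order = order;
        }
    }

    // sint-style entry: no parsing at all. Each document records the ordinal
    // of its term, and the distinct term strings are kept alongside.
    static OrdinalEntry buildOrdinals(String[] termPerDoc) {
        String[] lookup = new TreeSet<>(Arrays.asList(termPerDoc)).toArray(new String[0]);
        int[] order = new int[termPerDoc.length];
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, termPerDoc[doc]);
        }
        return new OrdinalEntry(lookup, order);
    }
}
```

In this model the ordinal entry carries both an int per document and the distinct strings, which is the extra memory the reply refers to, while comparing two documents is an int comparison in either case.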
Robert, yes, you are right about how the FieldCache works w.r.t. int and sint types.
If you are sure you have re-indexed all of the documents, you could also try an optimize. Without that, some old terms for the field could remain, even though they no longer reference any existing documents. The Lucene FieldCache code for integers tries to parse the term *before* it iterates over the documents containing that term.
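A small simulation may help show why a leftover term is enough to trigger the exception. This is a sketch under stated assumptions, not the Lucene implementation: the index is modelled as a hypothetical map from each remaining term to the live documents that contain it, and a stale term is one whose document list is empty.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class StaleTermSketch {

    // termToDocs maps every term still present in the index to the live
    // documents containing it. After deletes, a term may remain with no docs.
    static int[] buildIntCache(SortedMap<String, int[]> termToDocs, int maxDoc) {
        int[] cache = new int[maxDoc];
        for (Map.Entry<String, int[]> entry : termToDocs.entrySet()) {
            // The term is parsed BEFORE its documents are visited, so a term
            // with zero live documents can still fail here.
            int value = Integer.parseInt(entry.getKey());
            for (int doc : entry.getValue()) {
                cache[doc] = value;
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> terms = new TreeMap<>();
        terms.put("", new int[0]);     // leftover term, no live documents
        terms.put("3", new int[] {0});
        terms.put("8", new int[] {1});

        try {
            buildIntCache(terms, 2);
        } catch (NumberFormatException e) {
            System.out.println("failed on stale term: " + e.getMessage());
        }

        // Dropping the unreferenced term stands in for what an optimize achieves.
        terms.remove("");
        System.out.println(Arrays.toString(buildIntCache(terms, 2)));
    }
}
```

The ordinal-based approach never calls a parser, which is consistent with the exception disappearing when the fields were switched back to sint.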
Oh really! Well, that in fact was the problem. I did not optimize at all, and I had deleted more documents than the 16 I am currently playing with. I even found that mentioned on the wiki. I am now sorting on integer fields successfully.
Thanks Yonik!
PS. Wondering since we are on this subject, how much overhead is it for one query to sort on several fields rather than have it sort on one field computed from the values of those several fields? I saw a formula of memory usage looking something like mem(facets) > mem(sorts) > mem(query results fields)
> how much overhead is it for one query to sort on several
> fields rather than have it sort on one field computed
> from the values of those several fields?
Speed-wise it shouldn’t matter too much, but memory-wise it could matter a great deal.
If it’s a numeric field like “float” then you will save memory if all your queries sort by a single field rather than multiple. If it’s a “string” field it gets more complicated and could be more or less memory usage depending on how big the strings are and how you combine them (combinatorial explosion of the resulting indexed terms can negate per-document savings).
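As a back-of-the-envelope illustration of the numeric case, the sketch below assumes each numeric sort field costs roughly one 4-byte slot per document once cached, and ignores all other overhead. The `combinedKey` packing is purely hypothetical: it assumes boosted and stocked are 0 or 1 and that sales fits in 29 bits, and it is only one of many ways to fold several values into a single sortable number.

```java
class SortMemorySketch {

    // Rough size of the cached sort arrays: one 4-byte slot per document for
    // each int or float sort field. Ignores object and cache overhead.
    static long approxSortBytes(int maxDoc, int numericSortFields) {
        return (long) maxDoc * numericSortFields * Integer.BYTES;
    }

    // Hypothetical combined key: boosted dominates, then stocked, then sales.
    // Assumes boosted and stocked are 0 or 1 and 0 <= sales < 2^29.
    static int combinedKey(int boosted, int stocked, int sales) {
        if (boosted < 0 || boosted > 1 || stocked < 0 || stocked > 1
                || sales < 0 || sales >= (1 << 29)) {
            throw new IllegalArgumentException("value outside the assumed range");
        }
        return (boosted << 30) | (stocked << 29) | sales;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000; // hypothetical index size
        System.out.println("three fields: " + approxSortBytes(maxDoc, 3) + " bytes");
        System.out.println("one combined: " + approxSortBytes(maxDoc, 1) + " bytes");
    }
}
```

Sorting descending on the combined key gives the same order as sorting descending on boosted, then stocked, then sales, as long as the range assumptions hold. The trade-off is that the key must be recomputed and re-indexed whenever any of its inputs changes.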