Introduction
During a past ecommerce webinar with Brian Doll of Sheetmusicplus.com, I presented a checklist of items that commonly occur in ecommerce applications. Due to time constraints, I then waved my hands, said Solr (and now LucidWorks) can do almost all of them out of the box, and left the rest as an exercise for the reader. Now that I have some time, let me fill in the blanks with some more concrete examples of how to do this.
Setup
For this example, I am using real estate data freely available from the NYC government. I am interested in this data because:
- It has product-like content: a name, a description, a price and a variety of metadata
- It’s mostly real (I embellished it with descriptions and a few other pieces and filled in some missing data; see the Indexer class in the source code). In fact, it’s so real that when setting up the app, one quickly sees how noisy the data is: missing values and the like. For instance, 1804 records don’t have the year built specified.
To get the example running:
- Unzip the ecommerce.zip file into the directory of your choice
- cd lucid_ecom
- In a separate terminal window: cd solr
- java -jar start.jar (just as if you were running the Solr tutorial. Note, I am running a relatively recent version of the Solr 3.x branch)
- Point your web browser at http://localhost:8983/solr/nyc and take a moment to familiarize yourself with the interface.
- ant delete-all (deletes the existing content)
- ant index
Implementing the Checklist
I’ve broken out each checklist item below and will cover each in more detail in the following subsections.
Keyword search
There really isn’t much to say here other than that Solr has built-in support for querying in all the “usual” ways one expects from a search engine: keywords, phrases, wildcards, fielded search and much, much more. For example, try:
- http://localhost:8983/solr/nyc?q=tottenville or just type tottenville in the search box.
- http://localhost:8983/solr/nyc?q=5+bedrooms+%22Staten+Island%22 (5 bedrooms “Staten Island”)
- http://localhost:8983/solr/nyc?q=5+bedrooms+borough_display%3ABro* (5 bedrooms borough_display:Bro* — Should match all 5 bedrooms in either the Bronx or Brooklyn)
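The URL-encoded forms above can be produced programmatically rather than by hand. Here is a minimal Python sketch (the /solr/nyc endpoint comes from the setup above; the helper name is my own) that builds the “Staten Island” query:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/nyc"

def solr_query_url(q, **params):
    """Build a URL-encoded Solr query URL like the examples above."""
    return BASE + "?" + urlencode({"q": q, **params})

# spaces become '+', embedded quotes become %22
url = solr_query_url('5 bedrooms "Staten Island"')
print(url)  # http://localhost:8983/solr/nyc?q=5+bedrooms+%22Staten+Island%22
```

Any extra keyword arguments (rows, fl, etc.) are encoded the same way, which keeps hand-built query strings out of your application code.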
High Quality relevance (precision @ < 10)
In many search applications, and ecommerce is no exception, users often abandon a search when the first page of results (often the top 10) is not relevant to their query. Thus, it is important that a search engine return good results on the first page. While some guidance (more on this in the coming sections) can help alleviate the abandonment problem, a strong first showing is often the quickest way to more clickthroughs. Since Solr is built on Lucene, which implements an industry-standard vector space approach to scoring, results are often quite good out of the box. Nevertheless, many ecommerce applications need one or more of the tools that Solr/Lucene provide out of the box to tweak relevance, such as:
- Document, field and token boosting (e.g. matches in the title field count more than matches in the description.)
- Query term boosting (provide weights for different terms, such as synonyms.)
- Disjunction Maximum Query scoring (aka the “dismax” parser or the extended dismax parser) for dealing with cross field matches.
- Automatic phrase generation from multiword queries even when the user did not explicitly quote the keywords.
- The ability to override low-level scoring information such as term frequency, document frequency, document length normalization and coordination factors.
- Function queries (more later) to allow values in fields (such as price) to be factors in scoring.
- Editorial Boosting/Sponsored Results (in Solr-speak it’s called the QueryElevationComponent — more later) to place specific results at the top.
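As a sketch of how several of these knobs come together, here is a hypothetical (e)dismax handler configuration. The field names and boost values are illustrative, not taken from the example app’s schema; the qf, pf and bf parameters themselves are standard dismax/edismax settings:

```xml
<requestHandler name="/products" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- matches in name count more than matches in description -->
    <str name="qf">name^3 description^1</str>
    <!-- boost documents where the query terms appear as a phrase -->
    <str name="pf">name^5 description^2</str>
    <!-- fold a field value (here price) into the score via a function query -->
    <str name="bf">recip(price,1,1000,1000)</str>
  </lst>
</requestHandler>
```

The example app’s own /nyc handler, discussed below, uses the same pf and bf mechanisms.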
Faceting/Discovery
One of Solr’s most appealing features is its out-of-the-box support for faceting (sometimes called navigators, parametric search or guided navigation) in a number of different ways (see Faceted-Search-Solr for a primer; also see http://wiki.apache.org/solr/SimpleFacetParameters). In the example application, the left-hand nav area shows facets for things like borough (field-based faceting), sale price (numeric range faceting), sale date (date range faceting) and pet friendly (facet by query). Solr also supports “multi-select” faceting. And while there isn’t support for true hierarchical faceting in Solr yet, there are ways to achieve it through intelligent modeling of your tokens. Last but not least, you may find https://issues.apache.org/jira/browse/SOLR-792 useful for doing grouped faceting (color: red, size: large).

Additionally, helping customers discover items of interest goes well beyond facets. Features like Did You Mean, Related Items/Searches, Collaborative Filtering/Recommenders (see Mahout for an open source solution), Auto-Suggest and others can go a long way toward increasing the user’s ability to find and purchase items from your store. I’ll cover many of these features below.
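To make the request side of faceting concrete, the following Python sketch builds the kinds of facet parameters the example app uses. The borough_display field appears in the queries above; sale_price and pets are my assumed names for the sale-price and pet-friendly fields:

```python
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "borough_display"),   # field-based faceting
    ("facet.range", "sale_price"),        # numeric range faceting
    ("facet.range.start", "0"),
    ("facet.range.end", "1000000"),
    ("facet.range.gap", "100000"),
    ("facet.query", "pets:true"),         # facet by query
]
query_string = urlencode(params)
print("http://localhost:8983/solr/nyc?" + query_string)
```

A list of tuples (rather than a dict) is used so repeated parameters like multiple facet.field entries encode cleanly.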
Flexible language analysis tools
Lucene and Solr have an extensive, open language analysis framework that makes it easy to do linguistic analysis. I won’t spend too much time here, as you can have a look at the included schema.xml for information on the various pieces I used. Also, have a look at the Solr wiki for more info. Suffice it to say, Solr has many tokenizers, stemmers and other token modification capabilities. In many cases, a good search system will use a variety of techniques (case changes, stemming, synonyms, etc.) to achieve the desired results. It is also often useful to build up a list of protected words for things like product names so that they do not get confused with other words that share a common root. Finally, keep in mind that of all the extension points to Lucene and Solr, writing your own TokenFilter is one of the easiest things you can do to extend the capabilities of your application.
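To make the protected-words idea concrete, here is a toy Python sketch (not Solr code; in a real schema the protected-words file passed to the stemming filter plays this role) of an analysis chain that lowercases and crudely stems everything except protected product names:

```python
PROTECTED = {"jeggings"}  # hypothetical product name, like an entry in protwords.txt

def crude_stem(token):
    """Toy suffix-stripping stemmer, standing in for a real stemmer."""
    for suffix in ("ings", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = [t.lower() for t in text.split()]  # tokenize + lowercase
    return [t if t in PROTECTED else crude_stem(t) for t in tokens]

print(analyze("Jeggings are trending"))  # ['jeggings', 'are', 'trend']
```

Without the protected set, the product name would be stemmed to “jegg” and would no longer match queries for the brand term; that is exactly the confusion a protected-words list prevents.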
Multilingual support
Solr contains support for most of the world’s commonly spoken languages, including English, Chinese, French, Spanish, Korean, German, Thai and many more. Lucene and Solr are also Unicode compliant.
Frequent Incremental Updates
Lucene, and thus Solr, has supported incremental updates since its inception, without the need to re-index the whole collection. It is also very fast at making new documents available for search. Additionally, with the combination of recent and upcoming work in Lucene, real-time search should be available soon. The one piece still missing is individual field update, but for certain types of fields (ratings, for instance) there may be easy workarounds.
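As a minimal sketch of what an incremental update looks like on the wire, the following Python builds the XML message Solr’s /update handler accepts (field names here are illustrative). One add plus a commit makes the new document searchable without touching the rest of the index:

```python
from xml.sax.saxutils import escape

def add_doc_xml(fields):
    """Render one document as a Solr <add> update message."""
    field_xml = "".join(
        '<field name="%s">%s</field>' % (escape(name), escape(str(value)))
        for name, value in fields.items()
    )
    return "<add><doc>%s</doc></add>" % field_xml

msg = add_doc_xml({"id": "prop-123", "price": 450000})
print(msg)  # POST this to /solr/nyc/update, then POST <commit/>
```

Values are escaped so descriptions containing &, < or > do not break the message.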
Ratings and Reviews
In working with many ecommerce customers on Solr, questions usually come up about how to incorporate ratings and reviews into search results without skewing results or introducing too much noise. On the ratings side, app developers often want to incorporate an item’s aggregate rating as a boost factor in the overall score; I discuss how to do this in detail in the section titled Editorial Relevance Controls below. On the review side, too much noise is often introduced by weighting review matches “on par” with matches in the product title or description. For instance, if I’m selling “Widget X” and a review of a different product says something like “You should also check out Widget X”, bringing back a match on that second product really isn’t all that useful for a customer searching for “Widget X”. To deal with this noise, people often take a couple of different approaches:
- They weight review matches lower than product matches via boosting (either at query time or indexing time)
- They only search reviews if they don’t feel they have high quality matches for the main product search
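The first approach, query-time boosting, can be sketched as follows. The field names name and review_text are assumptions; the ^ syntax is standard Lucene query-boosting notation:

```python
def boosted_query(terms, high_field="name", low_field="review_text",
                  high_boost=4, low_boost=0.5):
    """Weight matches in the product field above matches in reviews."""
    return "%s:(%s)^%s OR %s:(%s)^%s" % (
        high_field, terms, high_boost, low_field, terms, low_boost)

q = boosted_query("Widget X")
print(q)  # name:(Widget X)^4 OR review_text:(Widget X)^0.5
```

With dismax/edismax, the same effect is usually achieved declaratively via the qf parameter (e.g. name^4 review_text^0.5) rather than by building the query string yourself.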
Auto-suggest
Auto-suggest (aka auto-complete) is one of the cheapest (in terms of development cost) mechanisms available for increasing the chance that users find what they are looking for. I’ve heard of vendors adding auto-suggest and having it add millions to their bottom line. Simply by providing a drop-down list of ways to complete what a user has typed so far, an application can do a number of things:
- Reduce spelling errors thus leading to lower frustration and better results sooner rather than later
- Seed the user with items that they may want but weren’t explicitly looking for. After all, an intelligent auto-suggest box can very easily not only give completions, but it can also hook in related items too.
- Short-circuit search altogether and go directly to a landing page for a specific search
To set up auto-suggest in the example application, I:
- Applied the SOLR-1316 patch to the 3.x branch. This required a minor tweak to the HighFreqDictionary.java file; see the patch below.
- Added the necessary pieces to solrconfig.xml. See the /autosuggest SearchComponent in the solrconfig.xml in the appendix.
- Decided what fields to use in building the auto-suggest index (see schema.xml). I then “copy fielded” these into a field named suggest. Note that I used a non-stemming analyzer. I also used Solr’s word-based n-gram (shingle) filter with a shingle size of 5 so as to give phrase suggestions too. This is intended for demonstration purposes: you may wish not to use shingles and instead append terms as the user types, or you may want to use a different value for n. Also note that I did not spend much time evaluating what went into the suggest field that is used as a source; you will want to validate it and make sure it is aligned with your business goals.
- Built the auto-suggest data structures via the spell checker’s build command (see the next section)
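To show what the word-based n-gram (shingle) filter mentioned above produces, here is a toy Python equivalent. Solr’s shingle filter does this at index time as part of the analysis chain; this sketch just mimics its output for a small phrase:

```python
def shingles(text, max_size=5):
    """Emit all word n-grams up to max_size, like Solr's shingle filter."""
    words = text.split()
    out = []
    for n in range(1, max_size + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out

print(shingles("two bedroom apartment", max_size=2))
# ['two', 'bedroom', 'apartment', 'two bedroom', 'bedroom apartment']
```

Because multi-word shingles like “two bedroom” are indexed as single terms, a user typing “two be” can be offered the whole phrase as a completion, not just the next word.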
Did You Mean?
Just like auto-suggest, spell checking can help users find what they are looking for, especially given the propensity of manufacturers/product designers to use incorrectly spelled words in product names in order to better “brand” the product. Good spell checking goes beyond merely hooking up a dictionary of terms; it is also quite important to know when to suggest a term and when not to. Lucene/Solr covers the basics of setting up spell checking via the SpellCheckComponent, but a good spell checking application will need to go beyond merely setting up the component to achieve good results. First things first, however: let’s look at getting spell checking set up, and then we can examine what is needed to make it better.

First, we need to configure the SpellCheckComponent in the solrconfig.xml file. There is an example of this in the Solr tutorial, from which I changed the distance measure from the Levenshtein edit distance to the Jaro-Winkler distance. I did this based on past experience that users tend to misspell words toward the end of the word rather than the beginning, which the Jaro-Winkler distance accounts for. My configuration looks like:
    <searchComponent name="spellcheck">
      <str name="queryAnalyzerFieldType">textSpell</str>
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
      </lst>
      <!-- ... -->
    </searchComponent>

The whole point of a SearchComponent such as the SpellCheckComponent is to hook it into the main Solr request processing instead of having to make a separate call. Thus, I hooked the SpellCheckComponent into the /nyc RequestHandler so that all queries submitted to the “main” RequestHandler are also spell checked. Once the configuration is set up, the spelling index must be built (and maintained). This is handled by issuing a &spellcheck.build=true command to the spell checker, as in:

http://localhost:8983/solr/autosuggest?q=man&spellcheck=true&wt=xml&rows=0&indent=true&spellcheck.build=true

(Note, the &q param can be anything.) Once the configuration is hooked up and the spell checking data structure is built, the last piece is to hook it into the UI. (Note, I set up solrconfig.xml to automatically do spell checking on every query request.) To hook into the UI, I co-opted the suggest.vm file and spruced it up a bit to provide links, etc. Other than that, it is exactly the same as auto-suggest, since both are just different implementations of spell checking. See the Solr wiki on the SpellCheckComponent for more information.
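On the application side, the response carries a spellcheck section alongside the normal results. As a sketch of pulling corrections out of it, here is a Python snippet working against a hand-made sample in the JSON-style flat-list shape Solr uses by default (the exact shape varies with the json.nl setting; the sample is for illustration, not captured from the app):

```python
# hand-made sample mimicking Solr's spellcheck output
response = {
    "spellcheck": {
        "suggestions": [
            "brookln", {"numFound": 1, "suggestion": ["brooklyn"]},
        ]
    }
}

def top_corrections(resp):
    """Pair each misspelled term with its best suggestion."""
    raw = resp["spellcheck"]["suggestions"]
    # the flat list alternates: term, suggestion-info, term, suggestion-info, ...
    return {raw[i]: raw[i + 1]["suggestion"][0] for i in range(0, len(raw), 2)}

print(top_corrections(response))  # {'brookln': 'brooklyn'}
```

In the example app the equivalent work happens in the Velocity template (suggest.vm) rather than in Python, but the structure being walked is the same.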
Related Searches/Items
In many ecommerce applications, stores position related items next to a particular item so as to inspire the user either to buy an additional item or to consider an alternative. Naturally, the “relation” is determined by the store and might take a variety of forms, such as: accessories, enhanced versions, cheaper versions, alternatives from different manufacturers or items popular with other users. Similarly, a store may wish to give users not only suggestions and spelling corrections, but also alternative search terms or other popular searches. For instance, if a user searches for TVs, a store may want to suggest they search for “LCD TVs” or “HD TVs”, etc.

When it comes to related items, many Solr users hand-craft a second query (given an original query and a particular item) using the original terms of the query and some of the terms that describe the item. For instance, an application might use the category of the item plus some of its keywords to craft the query, submit it to Solr and then display the first few results. This can also be done automatically using Solr’s built-in More Like This (MLT) capability, though you may need to do some tuning to get the results you desire. For the sake of the example, I incorporated MLT into the application. You can see it on the left-hand side, just below the map, under the “Similar Properties” heading. MLT is configured in the solrconfig.xml file as part of the /nyc RequestHandler. Note that in a typical application you may not wish to generate MLT results for every search query, but instead provide them only once a user chooses a particular document, as MLT can add a fair amount of overhead to the process.
Other Solr applications often calculate related items offline or through some type of collaborative filtering (see Apache Mahout’s recommender capability for an open source library that does this) and either add the information to the document and re-index, or integrate it at the application level. In these cases the integration is not hard, but it is beyond the scope of this article.

As for related searches, there is no support built into Solr currently, but there is a JIRA issue open to track the idea. Related searches can often be determined through a combination of log analysis (looking for patterns in a user session) and synonyms, or via collaborative filtering/recommenders. Also have a look at Mahout’s Frequent Pattern Mining capabilities. One could also index the queries into another index (a Solr core) and simply issue fuzzy queries against it.
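For reference, the MLT wiring for a handler looks roughly like the following. This is a sketch using the standard MLT parameters; the actual field list in the example app’s solrconfig.xml may differ (the neighborhood field here is an assumption):

```xml
<requestHandler name="/nyc" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- turn on the MoreLikeThis component for returned documents -->
    <str name="mlt">true</str>
    <!-- fields to mine for "interesting" terms -->
    <str name="mlt.fl">description,neighborhood</str>
    <!-- how many similar documents to bring back per result -->
    <str name="mlt.count">3</str>
  </lst>
</requestHandler>
```

Putting these in the handler’s defaults is what lets MLT piggyback on the main search request instead of requiring a separate call.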
Editorial Relevance Controls
Whether it’s called “editorial controls”, “sponsored results”, “best bets” or any other name, the ability to implement business goals as part of search is a fundamental need of any ecommerce solution. Hidden in the various names is a desire for total control of search relevance without sacrificing speed or hindering the engine from working well when no business rules apply. Solr and Lucene offer myriad mechanisms for achieving business goals, ranging from the typical boost values on documents, fields, tokens and query terms to the hardcore “gotta have it exactly my way” option of cracking open the source and adding your own query mechanism. In between these two extremes are function queries, payloads, the QueryElevationComponent (for pinning specific results as well as excluding specific documents), similarity adjustments, augmented queries (such as automatic phrase boosting) and much more. Of these, most people rely on function queries, the dismax extensions and the QueryElevationComponent to achieve their relevance goals.
In the working example, I made a couple of changes to demonstrate some of the relevance ideas described here:
- The /nyc RequestHandler has the QueryElevationComponent hooked in and keyed off of the elevate.xml file. In that file, I mapped the query “3 bedroom Brooklyn” to rank a specific document higher and exclude one other. See http://localhost:8983/solr/admin/file/?file=elevate.xml for the mapping. To see this, add &enableElevation=false to the query, as in: http://localhost:8983/solr/nyc?q=3+bedroom+Brooklyn&enableElevation=false
- I set up “phrase boosting” to generate phrases against the description field. See the /nyc RequestHandler (it’s the “pf” setting in solrconfig.xml).
- I added a “boost function” to rank documents higher based on the commission paid for selling the property (note, I randomly assigned a value to this field for pedagogical reasons). See the “bf” setting in the /nyc RequestHandler.
- Also, don’t forget creative domain modeling: for instance, if you want to support landing pages and banners, why not just create them as documents in your index (assign a type to them) and make sure they are at the top of the results (other possibilities include doing two queries, one for landing pages first and then one for the results)
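A boost function like the one in the third bullet above is typically one of Solr’s stock function queries; which exact function the example app’s bf uses is in its solrconfig.xml. As a sanity check on the shapes of two common choices, here are Solr’s recip(x,m,a,b) = a / (m*x + b) and log(x) (base 10) in Python:

```python
import math

def recip(x, m, a, b):
    """Solr's recip(x,m,a,b) = a / (m*x + b): large for small x, decaying as x grows."""
    return a / (m * x + b)

def log_boost(x):
    """Solr's log(x), base 10: grows slowly with x, e.g. for a commission-based boost."""
    return math.log10(x)

# recip is handy when small values (e.g. document age) should score highest
print(recip(0, 1, 1000, 1000), recip(9000, 1, 1000, 1000))  # 1.0 0.1
# log_boost rewards larger values (e.g. commission) with diminishing returns
print(log_boost(100), log_boost(1000))  # 2.0 3.0
```

The bf contribution is folded into the relevance score, so an otherwise-equal property paying a higher commission ranks higher, without completely swamping the keyword score.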
If you are so inclined, you can also extend Solr and Lucene. Before you do, however, you might want to search for your issue in JIRA, or ask on the appropriate mailing list. If that doesn’t help, I recommend starting with the Solr Plugins wiki page and digging into the source from there if necessary. My advice: if you think you need a new Query class (a low-level Lucene mechanism for custom scoring), see whether you can solve your problem via a FunctionQuery (even a custom one) or some other mechanism first, before going down the Query path.