Author | Erick Erickson

Payloads are neat,

but where’s a complete example for Solr?…

Whenever I discuss payloads in Solr, I’m a bit frustrated that I can’t find an example that gives me all the pieces in a single place. So I decided to create one for Solr 4.0+ (actually, 4.8.1 at the time of this writing, but this should apply to the whole 4.x code line). There are many helpful fragments out there, our
Update processors have been around for a long time, but they don’t seem to have garnered much attention. This post is intended to give them a little more visibility and show a simple use-case that I ran across recently.

I’m assuming a reasonable familiarity with Solr schema elements here, especially analysis chains, stored/indexed data etc. so I’ll get right to the point.

The high-level problem

There are, as you probably already know, a ton… of

Hard commits, soft commits and transaction logs

As of Solr 4.0, there is a new “soft commit” capability, and a new parameter for hard commits – openSearcher. Currently, there’s quite a bit of confusion about the interplay between soft and hard commit actions, and especially what it all means for the transaction log. The stock solrconfig.xml file explains the options, but with the usual limits of documentation-in-an-example; if there were a full explanation of everything, the …
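
As a rough illustration of the settings under discussion, the fragment below sketches how the two commit types and the transaction log might be declared in solrconfig.xml. The element names (updateLog, autoCommit, autoSoftCommit, maxTime, openSearcher) are the ones found in the stock Solr 4.x solrconfig.xml; the interval values are assumptions picked for illustration, not recommendations taken from this post.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log: records updates that have not yet been hard-committed -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit: flushes index segments to disk and starts a new transaction log.
       With openSearcher=false, newly indexed documents do NOT become visible here. -->
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- milliseconds; illustrative value -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit: opens a new searcher so recent documents become searchable,
       without the cost of a durable flush to disk. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- milliseconds; illustrative value -->
  </autoSoftCommit>
</updateHandler>
```

In an arrangement like this, the soft commit interval governs how quickly documents become visible, while the hard commit interval governs durability and how large the transaction log is allowed to grow.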

Or “Why can’t you answer a simple question?”…

Client after client and user after user (on the users’ list) ask the perfectly reasonable question: “Given documents of size X, what kind of hardware do we need to run Solr?” I shudder whenever that question is asked because our answer is inevitably “It Depends ™”. This is like asking “how big a machine will I need to run a C program?” You have to know what

Experimenting with join performance…

We recently had a client who wanted some up-front sense of how Solr joins performed. Naturally, the client wanted to use joins in the most painful way, so I set out to make a prototype. Of course I ran into some issues, but one of the delights of working for Lucid is that I have ready access to many of the people who wrote the code, something to treasure! Being able
Or, “Trunk can use about 1/3 of the memory required by 3.x”

Please note that these tests were created with an eye toward emphasizing the differences. For instance, I chose to sort on strings since I knew that was an area that would highlight the improvements. Even so, I’m quite impressed. If you want to skip all the setup info, drop to the section “Measurement Results”.

Estimating hardware requirements…

At Lucid, we often get asked
More NOW evil

Prompted by a subtle issue a client raised, I was thinking about date boosting. According to the Wiki, a good way to boost by date is something like the following: http://localhost:8983/solr/select?q={!boost b=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)}ipod (see: date boosting link). And this works well, no question. However, there’s a subtle issue when paging. NOW evaluates to the current time, and every subsequent request will have a different value for NOW. This blog post…

Or “How to never re-use cached filter query results even though you meant to”:…

Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see
Two popular methods of indexing existing data are the Data Import Handler (DIH) and Tika (Solr Cell)/ExtractingRequestHandler. These can be used to index data from a database or structured documents (say Word documents, or PDF, or…). These are great tools for getting things up and running quickly, and I have seen production sites that work well with one or both of these tools.

OK, then why talk about SolrJ?…

Well, somewhere in the architectural document

Wildcard query terms aren’t analyzed, why is that?

Prior to the current 3.x branch (which will be released as 3.6) and the trunk (4.0) Solr code, users have frequently been perplexed by wildcard searches being un-analyzed, which often shows up as unexpected case sensitivity. Say you have an analysis chain in your schema.xml file defined as follows and a field named lc_field of this type:
<fieldType name="lowercase" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Now, you index