Mark Miller recently
posted a brief intro to Span Queries, so I thought I would piggyback on top of his work and show how to get started with
Payloads (see also [1]).
Introduction
Like Spans, payloads involve the position of terms, but go one step further. Namely, a Payload in Apache Lucene is an arbitrary byte array stored at a specific position (i.e. a specific token/term) in the index. A payload can be used to store weights for specific terms or things like part of speech tags or other semantic information. If you read Brin and Page’s (you know, the Google guys) original paper
Anatomy of a Large-Scale Hypertextual Search Engine, they describe what is essentially a payload functionality, whereby they store information about font, etc. at a specific position in the index (remember when you could get your pages ranked number one by using really big fonts?) and then utilize it in search.
There are three parts to taking advantage of payloads in Lucene. Solr requires a fourth step, which I will explain in a moment.
- Add a Payload to one or more Tokens during indexing.
- Override the Similarity class to handle scoring payloads
- Use a Payload aware Query during your search
For Solr, step 3 requires you to have your own Query Parser, as none of the existing Solr Query Parsers support the BoostingTermQuery. Thus, the fourth step for Solr is to add a Query Parser that supports payloads (and Spans would be nice, too! Please donate if you do this!)
Adding Payloads during indexing
(I’m using Lucene 2.9-dev)
I’m going to use the same indexing code I did for my post on
co-occurrence analysis, but with a few modifications.
First off, I’m going to change Analyzers to one of my own creation:
class PayloadAnalyzer extends Analyzer {
private PayloadEncoder encoder;
PayloadAnalyzer(PayloadEncoder encoder) {
this.encoder = encoder;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
result = new LowerCaseFilter(result);
result = new DelimitedPayloadTokenFilter(result, '|', encoder);
return result;
}
}
In this Analyzer, I have the basic whitespace tokenizer and a lower case filter, but then I add in the recently added DelimitedPayloadTokenFilter (DPTF). The DPTF allows you to add payloads to tokens simply by marking up the tokens with a special character followed by the payload value. For instance, I changed my sample docs from the co-occurrence example to now include payload information. Specifically, I said that all nouns should be weighted by 10, all verbs by 5 and all adjectives by 2 (I used
http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php to tag the sentences; any errors are likely mine.) Everything else has no payload. I also stripped all punctuation. My DOCS array now looks like:
public static String[] DOCS = {
"The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
"The quick red fox jumped over the lazy brown dogs",//no boosts
"The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
"Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
"Mary had a little lamb whose fleece was white as snow",
"Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
"Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
"Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
"The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
"The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
};
The DOCS array simply marks each noun, verb and adjective with a | (pipe symbol) and then a float indicating the boost. I also added some docs that have no boosts at all to demonstrate the differences at query time. The DPTF will then use this to encode the payloads using the PayloadEncoder. A PayloadEncoder is an interface that tells the DPTF how to convert the payload to a byte array. Also note that Lucene’s contrib/analysis package contains several other TokenFilters for adding payloads to a Token and, of course, you can write your own as well. Furthermore, the PayloadHelper class can help encode/decode payloads for common types.
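To make the markup-to-payload step concrete, here is a minimal plain-Java sketch of the idea: split a token on the delimiter and turn the trailing float into a 4-byte array. It is not Lucene code. The class and method names are hypothetical, and the big-endian byte layout is an assumption made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Locale;

/** Toy stand-in for the delimiter-splitting and float-encoding steps; not Lucene code. */
class DelimitedPayloadSketch {

    /** Term text plus optional payload bytes (null when the token carries no delimiter). */
    static final class TokenWithPayload {
        final String term;
        final byte[] payload;

        TokenWithPayload(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Encodes a float as 4 bytes, most significant byte first (assumed layout). */
    static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Reads the 4 bytes starting at offset back into a float. */
    static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    /** Splits "fox|10.0" into the lower-cased term "fox" and a 4-byte payload. */
    static TokenWithPayload parse(String rawToken, char delimiter) {
        int idx = rawToken.indexOf(delimiter);
        if (idx < 0) {
            // No delimiter: the token is indexed without a payload.
            return new TokenWithPayload(rawToken.toLowerCase(Locale.ROOT), null);
        }
        String term = rawToken.substring(0, idx).toLowerCase(Locale.ROOT);
        float weight = Float.parseFloat(rawToken.substring(idx + 1));
        return new TokenWithPayload(term, encodeFloat(weight));
    }

    public static void main(String[] args) {
        for (String raw : "The quick|2.0 red|2.0 fox|10.0".split("\\s+")) {
            TokenWithPayload t = parse(raw, '|');
            String shown = t.payload == null ? "none" : String.valueOf(decodeFloat(t.payload, 0));
            System.out.println(t.term + " -> payload " + shown);
        }
    }
}
```

Tokens without a delimiter simply carry no payload, which is how the unboosted docs in the DOCS array behave.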
Overriding the Similarity Class
The next step, which should happen before indexing, is to override the Similarity class to handle payloads. While it isn’t strictly required that this happens before indexing in THIS case, it is a good habit in case you have made other changes to the Similarity class that are required during indexing (such as overriding how norms are encoded).
Overriding the Similarity is done on both the IndexWriter and the IndexSearcher. See [3] below for the full code, including the calls to set the similarity. My Similarity implementation simply converts the byte array to a float and returns the float, as in:
class PayloadSimilarity extends DefaultSimilarity {
@Override
public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
}
}
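The implementation above ignores length because every payload in this index is known to be a 4-byte float. If your index might mix payload types, a more defensive decode could look like the sketch below. It is plain Java rather than a Similarity subclass. The class name and the neutral fallback value of 1.0 are assumptions made for illustration, not Lucene behavior.

```java
import java.nio.ByteBuffer;

/** Defensive variant of the float decode; a sketch, not part of Lucene. */
class DefensivePayloadDecodeSketch {

    /** Hypothetical fallback used when the payload cannot hold a float. */
    static final float NEUTRAL_BOOST = 1.0f;

    /** Decodes a big-endian float at offset, or returns the fallback if the bytes do not fit. */
    static float scorePayload(byte[] bytes, int offset, int length) {
        if (bytes == null || length < 4 || offset < 0 || offset + 4 > bytes.length) {
            return NEUTRAL_BOOST;
        }
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        // Two bytes of padding followed by the float 5.0, to exercise a non-zero offset.
        byte[] buffer = ByteBuffer.allocate(6).put((byte) 0).put((byte) 0).putFloat(5.0f).array();
        System.out.println(scorePayload(buffer, 2, 4)); // well-formed payload
        System.out.println(scorePayload(buffer, 4, 2)); // too short, falls back
    }
}
```

Whether a silent fallback or an exception is the better choice depends on how much you trust the indexing side.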
Executing the Query
Currently, Lucene has one payload aware Query called the BoostingTermQuery (BTQ for short, see [2] for another Payload aware query that may be in Lucene 2.9), which can be used just like any other query. For instance:
IndexSearcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(payloadSimilarity);
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
ScoreDoc doc = topDocs.scoreDocs[i];
System.out.println("Doc: " + doc.toString());
System.out.println("Explain: " + searcher.explain(btq, doc.doc));
}
In this example, I create the BTQ and hand it to the searcher and then print out the results. Easy peasy, yet so powerful.
Running this yields:
-----------
Results for body:fox of type: org.apache.lucene.search.payloads.BoostingTermQuery
Doc: doc=0 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 0), product of:
7.071068 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
10.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=0)
Doc: doc=2 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 2), product of:
7.071068 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
10.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=2)
Doc: doc=1 score=0.42344445
Explain: 0.42344445 = (MATCH) fieldWeight(body:fox in 1), product of:
0.70710677 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
1.0 = scorePayload(…)
1.9162908 = idf(body: fox=3)
0.3125 = fieldNorm(field=body, doc=1)
Notice how both Doc 0 and Doc 2, where “fox” carries a payload of 10.0, rank ahead of doc 1, where it carries no payload, even though all three have the same term frequency and length.
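The explain output is just a product of four factors, so the scores can be checked by hand. The sketch below multiplies the values printed above; the class and method names are hypothetical, and the numbers are copied from the output rather than computed by Lucene. Note that 0.70710677 is the square root of the reported phraseFreq of 0.5.

```java
/** Recomputes the BoostingTermQuery scores from the factors in the explain output. */
class BoostedScoreSketch {

    /** score = tf * payload * idf * fieldNorm, mirroring the nesting of the explain tree. */
    static float score(float tf, float payload, float idf, float fieldNorm) {
        float btq = tf * payload;      // the inner "btq" product
        return btq * idf * fieldNorm;  // the outer "fieldWeight" product
    }

    public static void main(String[] args) {
        float tf = 0.70710677f;        // tf(phraseFreq=0.5)
        float idf = 1.9162908f;        // idf(body: fox=3)
        float fieldNorm = 0.3125f;     // fieldNorm(field=body)

        System.out.println("with payload 10.0: " + score(tf, 10.0f, idf, fieldNorm));
        System.out.println("without payload:   " + score(tf, 1.0f, idf, fieldNorm));
    }
}
```

Since the payload is a plain multiplier here, the boosted docs score exactly ten times higher than the unboosted one.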
Running a simple TermQuery (ignoring payloads) with the exact same Term, on the other hand, yields:
-----------
Results for body:fox of type: org.apache.lucene.search.TermQuery
Doc: doc=0 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 0), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=0)
Doc: doc=1 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 1), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=1)
Doc: doc=2 score=0.59884083
Explain: 0.59884083 = (MATCH) fieldWeight(body:fox in 2), product of:
1.0 = tf(termFreq(body:fox)=1)
1.9162908 = idf(docFreq=3, numDocs=10)
0.3125 = fieldNorm(field=body, doc=2)
As you can see, in the TermQuery case, all the docs are scored exactly the same.
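For reference, the TermQuery numbers can be reproduced with a few lines of arithmetic. The sketch below assumes the classic formulas tf = sqrt(freq) and idf = ln(numDocs / (docFreq + 1)) + 1, which is how DefaultSimilarity is generally documented for this generation of Lucene; treat them as assumptions and check them against your version. The fieldNorm is taken from the explain output rather than recomputed, because norms are stored in a lossy single-byte encoding. All names are hypothetical.

```java
/** Reproduces the TermQuery score from assumed classic tf and idf formulas. */
class TermQueryScoreSketch {

    /** Assumed term-frequency factor: square root of the raw frequency. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Assumed inverse document frequency: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Product of the three factors shown in the TermQuery explain output. */
    static float score(int termFreq, int docFreq, int numDocs, float fieldNorm) {
        return tf(termFreq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // "fox" occurs once per doc, in 3 of the 10 docs; fieldNorm copied from the output.
        System.out.println("idf   = " + idf(3, 10));
        System.out.println("score = " + score(1, 3, 10, 0.3125f));
    }
}
```

None of these factors looks at the payload, which is why all three docs tie.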
Next Steps
As you can see from above, getting started with Payloads is pretty easy. In reality, the only hard part is determining what exactly to put in your payload and then how it should factor into your score. Lucene takes care of the rest. Tools like UIMA and OpenNLP, as well as offerings from proprietary vendors, can often be used to provide higher level lexical, syntactic and semantic information about tokens, thus giving you the power to create very expressive payloads and richer search applications.
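As one illustration of turning tagger output into the delimited markup used in the DOCS array, here is a small plain-Java sketch. The tag names (NOUN, VERB, ADJ and so on) and the class itself are hypothetical; a real tagger will have its own tag set, and the weights are simply the 10/5/2 scheme from this post.

```java
import java.util.HashMap;
import java.util.Map;

/** Builds "token|weight" markup from parallel token and part-of-speech arrays. */
class PosPayloadMarkupSketch {

    /** Weights from the post: nouns 10, verbs 5, adjectives 2; everything else gets no payload. */
    static final Map<String, Float> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("NOUN", 10.0f);
        WEIGHTS.put("VERB", 5.0f);
        WEIGHTS.put("ADJ", 2.0f);
    }

    /** Joins the tokens with spaces, appending delimiter and weight where a tag has one. */
    static String markUp(String[] tokens, String[] tags, char delimiter) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
            Float weight = WEIGHTS.get(tags[i]);
            if (weight != null) {
                sb.append(delimiter).append(weight.floatValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "quick", "red", "fox", "jumped", "over", "the", "lazy", "brown", "dogs"};
        String[] tags = {"DET", "ADJ", "ADJ", "NOUN", "VERB", "PREP", "DET", "ADJ", "ADJ", "NOUN"};
        System.out.println(markUp(tokens, tags, '|'));
    }
}
```

The resulting string can be fed straight to an analyzer chain that ends in the DelimitedPayloadTokenFilter.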
[1] See Michael Busch’s talk at the last SF Meetup for more details on payloads:
http://www.meetup.com/SFBay-Lucene-Solr-Meetup/files/
[2]
https://issues.apache.org/jira/browse/LUCENE-1341
[3] Full class:
package com.lucidimagination.noodles;
import junit.framework.TestCase;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import java.io.Reader;
import java.io.IOException;
/**
*
*
**/
public class PayloadTest extends TestCase {
Directory dir;
public static String[] DOCS = {
"The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
"The quick red fox jumped over the lazy brown dogs",//no boosts
"The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
"Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
"Mary had a little lamb whose fleece was white as snow",
"Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
"Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
"Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
"The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
"The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
};
protected PayloadSimilarity payloadSimilarity;
@Override
protected void setUp() throws Exception {
dir = new RAMDirectory();
PayloadEncoder encoder = new FloatEncoder();
IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(encoder), true, IndexWriter.MaxFieldLength.UNLIMITED);
payloadSimilarity = new PayloadSimilarity();
writer.setSimilarity(payloadSimilarity);
for (int i = 0; i < DOCS.length; i++) {
Document doc = new Document();
Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
doc.add(id);
//Index the body through the PayloadAnalyzer so the payloads are stored with the term positions
Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);
doc.add(text);
writer.addDocument(doc);
}
writer.close();
}
public void testPayloads() throws Exception {
IndexSearcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(payloadSimilarity);//set the similarity. Very important
BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
TopDocs topDocs = searcher.search(btq, 10);
printResults(searcher, btq, topDocs);
TermQuery tq = new TermQuery(new Term("body", "fox"));
topDocs = searcher.search(tq, 10);
printResults(searcher, tq, topDocs);
}
private void printResults(IndexSearcher searcher, Query query, TopDocs topDocs) throws IOException {
System.out.println("-----------");
System.out.println("Results for " + query + " of type: " + query.getClass().getName());
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
ScoreDoc doc = topDocs.scoreDocs[i];
System.out.println("Doc: " + doc.toString());
System.out.println("Explain: " + searcher.explain(query, doc.doc));
}
}
class PayloadSimilarity extends DefaultSimilarity {
@Override
public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
return PayloadHelper.decodeFloat(bytes, offset);//we can ignore length here, because we know it is encoded as 4 bytes
}
}
class PayloadAnalyzer extends Analyzer {
private PayloadEncoder encoder;
PayloadAnalyzer(PayloadEncoder encoder) {
this.encoder = encoder;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
result = new LowerCaseFilter(result);
result = new DelimitedPayloadTokenFilter(result, '|', encoder);
return result;
}
}
}
Powerful payload – allowing control of boost on term level! Great introduction. I’m wondering good use cases of using payload in real life search engine. what type of searches will benefit from using payload?
The Google paper linked above provides some examples (they stored font and other information in the payload). You could also use it to override specific term weights. I’ve also used it to store part of speech information that allowed me to boost words that were a specific part of speech.
In this example, the weight of each word are known. If I do not want to allow users to
Inquiries, know the weight of fox | 10. 0 to 10, how can I do?
Useful intro. A couple of questions:
What is a payload of a span supposed to look like? Is it merely the concatenation of the payloads of the constituent term occurrences? I ask because I’m creating new span operators and want to get the payloads correct.
Second question: I see that NearSpansOrdered allocates a new hashset for every match, to handle potential payloads (in the shrinkToAfterShortestMatch method). Doesn’t that make the operator really slow, even for non-payload queries? Or am I missing something obvious?
Thanks!
How to integrate the Lucene payloads with solr katta patch.
Not sure how to integrate w/ Katta/Solr. That being said, the TokenFilter is there to support it, so it should just work from the indexing side.
I am not able to find the BoostingTermQuery class in Lucene 3.0. Can someone help explain why this class is not there in 3.0, or where to get it?
It’s now called the PayloadTermQuery
Thanks Grant but i am struggling to use PayTermQuery because of PayloadFunction. I didnt find any example.
In the example mention here if I want to extract payload value how can I do that e.g. 10.0 = scorePayload(…)
I am not sure any method to get this value.
Could you please help me to get this value ?
Ajay,
See http://www.lucidimagination.com/blog/2010/04/18/refresh-getting-started-with-payloads/
I posted an update to this for Lucene 3.0. If you want the same functionality of the BoostingTermQuery, just use the AveragePayloadFunction.
Hi Grant,
Thanks for the reply. PayTermQuery works now but I also wants to get some API to get scorepayload value which is currently getting printed in searcher.explain(query, doc.doc) but I want something like getPayLoadValue() which will return payload associated with term.
is there any open API ?
Ajay,
Have a look at the TermPositions class, either that or Span.getPayload()
Thanks Grant I am able to manage getpayload value using TermPosition class but now I am having one issue. I have document with doc_id and Title and I want to get payload for a given term in a given document but current API’s takes input as term only. There is no parameter for document id.
API example:
Term t = new Term("Title", "Java");
TermPositions positions = ir.termPositions(t);
I want to iterate for all documents and wants to get payload for each document separately. Is there any API available ?
Hey, Grant
Its a great article. After reading it, i feel payload can have many applications. But One thing i was not able to figure out, in the article u are using a specific term to query. How can i use it with query parser so that i can search multiple terms.
Hi Arpit,
Unfortunately, there is no Query Parser support yet, but see https://issues.apache.org/jira/browse/SOLR-1337
In your example you use payloads to add metadata to a term. The metadata you add is weight, which is used to change a term’s weight when returned in a search result.
I have a different usecase and I’d like you to comment whether my usecase is consistent with the intended use of the payload feature.
Specifically, I’d like to store a video transcript time-code with each term (the timecode is the number of milliseconds of elapsed time from the start of the video where the term occurs). the timecode is not data that affects Lucene’s indexing or searching behavior, rather it is simply data I want associated with the term.
Hi Peter,
Payloads would work fine for that. You will need to use TermPositions to access it or SpanQuery if you want to get the actual payload out at some point other than as part of a search.
Great post. One question: Will payload term boosting work with MLT handler query? i.e. will the matches be ranked by matching on terms with more weights? If not, is there a way to achieve this?
Is it possible to rank the documents only by its payload? Please help