ApacheCon - Advanced Indexing with Lucene (Lucene Payloads)

Presented by Michael Busch of IBM

This class was packed out - people were sitting in every chair, and others on the floor in the back of the room. Michael starts by briefly telling us about some new features in Lucene 2.4 - the latest release. For instance, payloads have been introduced to allow a certain amount of metadata to be stored in the index. He gives a very good explanation of how inverted indexes work and how payloads work in the latest release.

I won’t try to explain it in depth to you without the slides - the pictures speak a thousand words. Basically, for each word in any indexed document, they store an array of three pieces of information:

document ID - the ID of the document that it appeared in

position - the position in the document

payload - a byte array of metadata that you can access later when it is returned in a search result

(I’ll try to upload a snapshot of the slide later - they’re not currently available on the ApacheCon site).
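As a rough illustration of that structure, here is a toy inverted index in Python. It is only a sketch of the idea, not Lucene's API and not code from the slides; the names `Posting` and `build_index` are made up for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: int     # ID of the document the term occurred in
    position: int   # token position within that document
    payload: bytes  # arbitrary per-occurrence metadata


def build_index(docs):
    """Map each term to its postings.

    docs: {doc_id: [(term, payload_bytes), ...]} in token order.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, (term, payload) in enumerate(docs[doc_id]):
            index[term].append(Posting(doc_id, position, payload))
    return dict(index)


docs = {
    0: [("lucene", b""), ("payloads", b"\x01")],
    1: [("payloads", b"\x02"), ("rock", b"")],
}
index = build_index(docs)

# Every occurrence of "payloads" carries its own doc ID, position and payload.
for posting in index["payloads"]:
    print(posting)
```

The point to notice is that the payload hangs off each individual occurrence of a term, not off the term or the document as a whole.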

Payload Use Cases:

Score certain occurrences of a term higher than others. This was slightly complicated, but basically you could store a number as the payload for terms that you wanted to boost. Then you could create a type of term query that pays attention to that payload to boost the term in the results.
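To make the boosting idea concrete, here is a minimal Python sketch. It assumes the boost is stored as a 4-byte float in the payload and that occurrences without a payload count as 1.0; both choices, and the function names, are assumptions for illustration. Lucene's real scoring is considerably more involved.

```python
import struct


def encode_boost(boost: float) -> bytes:
    """Pack a boost factor into a 4-byte big-endian float payload."""
    return struct.pack(">f", boost)


def decode_boost(payload: bytes) -> float:
    """Read the boost back; an empty payload means 'no boost' (1.0)."""
    return struct.unpack(">f", payload)[0] if payload else 1.0


# Postings for a single term: (doc_id, position, payload).
postings = [
    (0, 3, encode_boost(5.0)),  # e.g. the term appeared somewhere important
    (0, 17, b""),               # ordinary occurrence
    (1, 8, b""),                # ordinary occurrence
]


def score_by_payload(postings):
    """Sum the per-occurrence boosts for each document."""
    scores = {}
    for doc_id, _position, payload in postings:
        scores[doc_id] = scores.get(doc_id, 0.0) + decode_boost(payload)
    return scores


scores = score_by_payload(postings)
print(scores)
```

Document 0 ends up ranked above document 1 mostly because one of its occurrences carried a boost in its payload, which is the effect the payload-aware term query is after.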

Store a unique document ID. If you’ve used Lucene, you are probably very aware that document IDs change if you reindex things. With payloads, you can index a single term per document that stores a unique ID in the payload. You’ll have to see the code in the slides to see exactly how to accomplish this. In the use case he gave, they wanted to cache the unique document ID indexed by the Lucene document ID. Reading them all into the cache took only 430 milliseconds, as opposed to 16.5 seconds if you saved this as a term in the document. This shows one strength of payloads - iterating through documents to retrieve a certain piece of information.
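The shape of that cache can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code from the slides: the postings list stands in for the single marker term indexed once per document, and the IDs and the `load_id_cache` name are invented for the example.

```python
# One posting per document for a hypothetical marker term.
# Each entry is (lucene_doc_id, payload), where the payload holds the unique ID.
id_postings = [
    (0, b"order-1001"),
    (1, b"order-1002"),
    (2, b"order-2005"),
]


def load_id_cache(postings, max_doc):
    """Build a list where cache[lucene_doc_id] is the unique ID."""
    cache = [None] * max_doc
    for doc_id, payload in postings:
        cache[doc_id] = payload.decode("utf-8")
    return cache


cache = load_id_cache(id_postings, max_doc=3)
print(cache[1])
```

The whole cache is filled by walking one posting list from start to finish, which is the kind of sequential access that payloads are good at.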

Efficient Numeric Search. I think this is the most interesting one yet. If you’ve done much Lucene indexing and searching, you realize the problem of searching for dates, especially with any fine level of granularity: every unique date is stored in the dictionary (the index’s list of terms). Using payloads, you could store the date in the payload. But to search for it, you would need to iterate through all documents to find matching documents. To improve this, he suggests a hybrid approach: store the month and year in the term, and then store the day in the payload. This way, when you do a date search, you can use a regular TermQuery to search for documents that match the month and year, and refine it by day when necessary (this could easily be extended to day / hour / minute / etc). I will definitely be investigating using this in my current Lucene use.
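Here is a small Python sketch of the hybrid approach, again as a toy model and not Lucene code. The term format (`YYYY-MM`), the single-byte day payload and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def month_term(d: date) -> str:
    """Coarse part of the date, used as the indexed term."""
    return f"{d.year:04d}-{d.month:02d}"


def index_dates(doc_dates):
    """Index each document under its year-month term; the day goes in the payload."""
    index = defaultdict(list)
    for doc_id, d in sorted(doc_dates.items()):
        index[month_term(d)].append((doc_id, bytes([d.day])))
    return index


def search_month(index, year, month):
    """Plain term lookup: every document in the given month."""
    return [doc_id for doc_id, _payload in index.get(f"{year:04d}-{month:02d}", [])]


def search_day(index, d: date):
    """Term lookup on the month, then refine by the day stored in the payload."""
    return [
        doc_id
        for doc_id, payload in index.get(month_term(d), [])
        if payload[0] == d.day
    ]


doc_dates = {
    0: date(2008, 11, 5),
    1: date(2008, 11, 6),
    2: date(2008, 12, 6),
    3: date(2008, 11, 6),
}
index = index_dates(doc_dates)

print(search_month(index, 2008, 11))
print(search_day(index, date(2008, 11, 6)))
```

The dictionary only ever holds one term per month, and the day-level refinement only has to look at the postings for that one month rather than at every document.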

Flexible Indexing

Token has traditionally had only: type / offset / position increment / payload / flags. This makes it hard to add additional data to the token. There is work (uncommitted as of right now) to introduce a new Token API. It will allow you to add new attributes to the token. These changes will also split the DocumentsWriter into several classes, following a consumer model. These improvements will make Lucene much more flexible - right now they are only in trunk (not in 2.4). The search side of these additional attributes is not implemented yet. There’s no API to search for your custom attributes for now - but there will be - stay tuned.