The Lucene Server Project


Table of Contents

How does it work?
Quick start
Let's go through an example.
The sample index.
File handlers
How to invoke services?
Configuration File Reference.

Abstract

The Lucene Server project is an attempt to extend the Jakarta Lucene tool with server capabilities.

Lucene is a robust Java API that lets you create indexes from text sources and run powerful searches on them. With Lucene, an index must be created programmatically, and there is almost no support for managing indexes in a distributed environment. In other words, out of the box, Lucene is well suited to adding indexing and searching to a single application, but not to providing index and search services to multiple applications.

The Lucene Server project comes with a Java API that aims to:

  • make it easy to create indexes declaratively, simply by providing an XML configuration document.
  • make it easy to customize how Lucene handles different kinds of data sources.
  • provide index management and search services that can be accessed from several applications.
  • enable scheduling of batch tasks.

How does it work?

Lucene Server is a Java package called org.apache.lucene.server.

This package contains a class called IndexServer, which implements a Lucene index server. The main method of the IndexServer class creates a new IndexServer instance and makes it available to other Java applications through RMI.

When a new IndexServer object is created, it reads the configuration file passed as a parameter to determine how the indexes are defined and which data sources are to be indexed.

Once the IndexServer is registered as a remote object, any Java application on any other host can access its services, namely index creation and management.

The index server also creates RemoteSearchable objects (part of the Lucene API) for searching.
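
If you are new to RMI, the bind-then-lookup cycle described above can be sketched with the JDK alone. This is only an illustration, not the Lucene Server code: DemoIndexService, DemoIndexServer and RmiSketch are hypothetical names, and the sketch pins RMI to the loopback address so that it runs in a single JVM.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface; the real one is org.apache.lucene.server.IndexService.
interface DemoIndexService extends Remote {
    String buildIndex(String indexName) throws RemoteException;
}

// Hypothetical server object that only reports what it was asked to do.
class DemoIndexServer implements DemoIndexService {
    public String buildIndex(String indexName) {
        return "built " + indexName;
    }
}

class RmiSketch {
    // Starts a registry, binds the server, calls it the way a client would, then shuts down.
    static String roundTrip(int port, String indexName) {
        // Assumption for this sketch: advertise the loopback address in the stub.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        DemoIndexServer server = new DemoIndexServer();
        Registry registry = null;
        try {
            // Server side: create the registry and bind the exported object by name.
            registry = LocateRegistry.createRegistry(port);
            Remote stub = UnicastRemoteObject.exportObject(server, 0);
            registry.rebind("IndexServer", stub);

            // Client side: look the service up by URL and invoke it through the stub.
            DemoIndexService client =
                (DemoIndexService) Naming.lookup("//127.0.0.1:" + port + "/IndexServer");
            return client.buildIndex(indexName);
        } catch (Exception e) {
            throw new IllegalStateException("RMI round trip failed", e);
        } finally {
            // Unexport everything so the JVM can exit; a real server would stay up.
            try { UnicastRemoteObject.unexportObject(server, true); } catch (Exception ignored) { }
            try {
                if (registry != null) UnicastRemoteObject.unexportObject(registry, true);
            } catch (Exception ignored) { }
        }
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 4000;
        System.out.println(roundTrip(port, "SampleIndex"));
    }
}
```

In a real deployment the server and the client live in different JVMs, usually on different hosts, and only the lookup part runs on the client.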

Quick start

Let's go through an example.

  • Download the zipped archive of Lucene Server and extract it anywhere you want. The rest of this document assumes you have extracted the archive to the C:\ directory (apologies to UNIX users).

  • Make sure that your CLASSPATH environment variable contains a path to the Lucene jar file.

  • Move to the C:\LuceneServer-0.1\Sample directory

  • Run the startis.bat (startis.sh for UNIX) script. It starts a Lucene Server on your local host using the isconfig.xml configuration file. The following messages should be displayed :

    C:\LuceneServer-0.1\Sample>startis
    C:\LuceneServer-0.1\Sample>echo off
    IndexServer bound
    Error while binding searcher for index SampleIndex
    C:\LuceneServer-0.1\Sample\data\SampleIndex not a directory                    
    

    The error message appears because the SampleIndex index has not yet been built.

    The program doesn't stop; let it run (remember it is a server).

  • Run the buildindex.bat script. It builds the SampleIndex index.

    C:\LuceneServer-0.1\Sample>buildindex
    C:\LuceneServer-0.1\Sample>echo off
    Sample index built ...
    C:\LuceneServer-0.1\Sample>
    

  • Finally run the search.bat script and type a query.

    C:\LuceneServer-0.1\Sample>search
    C:\LuceneServer-0.1\Sample>echo off
    
    Enter a query (type q to quit) : apache
    Search query string : apache
    *** Found 1 document(s) .
    Keyword<file:C:\LuceneServer-0.1\Sample\data\The_Apache_Software_License.txt>
    ***
    
    Enter a query (type q to quit) : q
    Bye ...
    C:\LuceneServer-0.1\Sample>
    

The sample index.

Now you should inspect the isconfig.xml configuration file.

First, note that IndexServer.dtd is used as the document type. This DTD is part of the Lucene Server distribution; you can copy it anywhere you want on your file system, but don't forget to refer to it in your configuration file.

Next let's see the server configuration itself (for more details see the Configuration file reference).

This configuration file defines a server that manages a single index named SampleIndex. The server will be bound to TCP port 4000, as indicated by the port attribute. The entire definition of that index is enclosed in the IndexManager element.

The IndexDirectory element indicates that the Lucene files for that index will be stored in the ./data/SampleIndex directory.

To add documents to the index, we use the StandardAnalyzer analyzer, as specified in the AnalyzerClass element.

The Source element defines which files are to be indexed. In this case, SampleIndex will contain all *.txt files located in the ./data/ directory and its subdirectories. You are probably wondering what the FileHandlerClass is. It is a Java class that defines how files are imported into the index; in particular, it is up to this class to create the fields in the index (see the next section).
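
To make the file selection concrete, here is a minimal sketch of collecting every file that matches a pattern under a directory and its subdirectories. It uses only the JDK and is not how Lucene Server itself scans a Source; the SourceScan class and its matchingFiles method are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SourceScan {
    // Returns every regular file under root whose file name matches the glob, sorted by path.
    static List<Path> matchingFiles(Path root, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> matcher.matches(p.getFileName()))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Scans the directory given as first argument, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        matchingFiles(root, "*.txt").forEach(System.out::println);
    }
}
```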

Finally, we specify that SampleIndex must be rebuilt automatically every day at 8 PM. This is the meaning of the Tasks/Build element.
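
A daily task of this kind boils down to waiting until the next occurrence of a given time of day and then repeating every 24 hours. The sketch below shows one way to do that with the JDK scheduler; it is an assumption about how such a task could be driven, not the scheduler used by Lucene Server, and DailyBuildSchedule is a hypothetical name.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class DailyBuildSchedule {
    // Time to wait from 'now' until the next occurrence of 'runAt' (today or tomorrow).
    static Duration delayUntilNext(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    // Runs the task at the next 'runAt', then every 24 hours (daylight saving shifts are ignored).
    static ScheduledFuture<?> scheduleDaily(ScheduledExecutorService pool, Runnable task, LocalTime runAt) {
        long delayMillis = delayUntilNext(LocalDateTime.now(), runAt).toMillis();
        return pool.scheduleAtFixedRate(task, delayMillis, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        Duration d = delayUntilNext(LocalDateTime.now(), LocalTime.of(20, 0));
        System.out.println("Next rebuild in " + d.toMinutes() + " minutes");
    }
}
```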

File handlers

The FileHandlerClass must be a Java class that implements the org.apache.lucene.server.FileHandlerInterface interface. This interface defines a method that creates Lucene documents from a file.

public interface FileHandlerInterface 
{
	org.apache.lucene.document.Document[] toDocuments(XFile f) throws Exception;
}                    
                

When the server builds an index, it invokes the appropriate toDocuments method for each file in the Source, then adds the returned documents to the index.

For instance, let's see how the org.apache.lucene.server.samples.TextFileHandler class handles a file.

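import java.io.InputStream;
import java.io.StringWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.server.FileHandlerInterface;
import org.apache.lucene.server.XFile;

public class TextFileHandler implements FileHandlerInterface
{
	public Document[] toDocuments(XFile f) throws Exception
	{
		// read file content into a StringWriter; each byte is written as one
		// character, so this is only correct for single-byte encodings
		StringWriter sw = new StringWriter();
		InputStream is = f.createInputStream();
		try
		{
			int c;
			while ((c = is.read()) != -1)
			{
				sw.write(c);
			}
		}
		finally
		{
			// release the stream even if reading fails
			is.close();
		}
		
		//create a new Lucene Document
		Document doc = new Document();
		//add fields to this Document: "file" is stored as a keyword,
		//"content" is analyzed as text
		doc.add(Field.Keyword("file", f.toString()));
		doc.add(Field.Text("content", sw.toString()));
		
		Document[] dlist = {doc};
		return dlist;
	}
}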

As you can see, this file handler creates the "file" and "content" fields in the index.
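
Because toDocuments returns an array, a handler is free to produce several documents from one file. The following sketch shows the idea with one document per paragraph. It is not part of the Lucene Server distribution: to stay runnable without Lucene it uses a plain Map as a stand-in for a Lucene Document, and ParagraphHandlerSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParagraphHandlerSketch {
    // Each Map is a stand-in for a Lucene Document: field name -> field value.
    static List<Map<String, String>> toDocuments(String fileName, String text) {
        List<Map<String, String>> docs = new ArrayList<>();
        // One or more blank lines separate paragraphs.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String content = paragraph.trim();
            if (content.isEmpty()) {
                continue;
            }
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("file", fileName);      // same field names as TextFileHandler
            doc.put("content", content);
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        System.out.println(toDocuments("notes.txt", "first paragraph\n\nsecond paragraph"));
    }
}
```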

You probably also noticed that the parameter of toDocuments is not a java.io.File, as one might expect. In the Lucene Server project, all classes manipulate org.apache.lucene.server.XFile instead of File. XFile is an interface that defines services similar to those of java.io.File.

But what is it for? Let's take an example.

Imagine you run your index server in a Windows environment and you want to index files that are located on a UNIX host. You can share these files by using Samba. Unfortunately, java.io.File is not able to access Samba files; you should instead use the SmbFile class provided by the jcifs tool. XFile allows access to File, SmbFile, or anything else through a unified interface.

To create the appropriate XFile instances from a string name, you need to use an XFileFactory. If you do not specify an XFileFactory in the configuration file (as in our example), the default SimpleXFileFactory class is used, but it can only deal with normal files.
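
The factory idea can be sketched as follows. Everything here is hypothetical: DemoXFile is a cut-down counterpart of XFile, the remote implementation is a stub that serves fixed bytes so the sketch runs without jcifs, and choosing the implementation from an "smb://" prefix is an assumption made for the example, not the documented behavior of any XFileFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical, cut-down counterpart of XFile: only what this sketch needs.
interface DemoXFile {
    String getName();
    InputStream createInputStream() throws IOException;
}

// Backed by the local file system.
class LocalDemoXFile implements DemoXFile {
    private final String path;
    LocalDemoXFile(String path) { this.path = path; }
    public String getName() { return path; }
    public InputStream createInputStream() throws IOException {
        return Files.newInputStream(Paths.get(path));
    }
}

// Placeholder for a remote backend such as Samba; serves fixed bytes instead of real remote data.
class StubRemoteDemoXFile implements DemoXFile {
    private final String url;
    StubRemoteDemoXFile(String url) { this.url = url; }
    public String getName() { return url; }
    public InputStream createInputStream() {
        return new ByteArrayInputStream("remote content".getBytes(StandardCharsets.UTF_8));
    }
}

class DemoXFileFactory {
    // Picks an implementation from the name; callers only ever see DemoXFile.
    static DemoXFile create(String name) {
        if (name.startsWith("smb://")) {
            return new StubRemoteDemoXFile(name);
        }
        return new LocalDemoXFile(name);
    }

    public static void main(String[] args) {
        System.out.println(create("smb://unixhost/share/a.txt").getClass().getSimpleName());
        System.out.println(create("data/a.txt").getClass().getSimpleName());
    }
}
```

A file handler written against the interface never needs to know which backend it is reading from.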

How to invoke services?

When you ran the startis script, the IndexServer class created a new RMI registry on port 4000 and bound an instance of IndexServer under the name IndexServer.

Now you are ready to invoke services on that remote object (if you are not familiar with Java RMI, you should take a look at the related tutorial). The list of all available services is given by the org.apache.lucene.server.IndexService interface.

For instance here is the code of the org.apache.lucene.server.samples.BuildSampleIndex class that orders the server to rebuild the SampleIndex.

import java.rmi.Naming;
import java.rmi.RMISecurityManager;
import org.apache.lucene.server.IndexService;
                
public class BuildSampleIndex 
{
    public static void main(String[] args) 
    {
        if (System.getSecurityManager() == null) 
        {
            System.setSecurityManager(new RMISecurityManager());
        }

        try 
        {
            String name = "//localhost:4000/IndexServer";
            IndexService server = (IndexService) Naming.lookup(name);
            server.BuildIndexFromScratch("SampleIndex");
            System.out.println("Sample index built ...");
        } 
        catch (Exception e) 
        {
            System.err.println("IndexServer error : " + 
                               e.getMessage());
            e.printStackTrace();
        }
    }
}

And now, how do you invoke search services?

Lucene Server focuses on index creation and management, so it does not provide a search service of its own. Instead, it binds one org.apache.lucene.search.RemoteSearchable object per index to the RMI registry. By convention, the RemoteSearchable for a given index is bound under the name of that index plus "_searcher" (for instance, the RemoteSearchable remote object for searching the SampleIndex index is bound to SampleIndex_searcher).
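
The naming convention is easy to capture in a small helper, which avoids typing the suffix by hand in every client. This is only a convenience sketch; SearcherNames and searcherUrl are hypothetical names, not part of the Lucene Server API.

```java
class SearcherNames {
    // Suffix appended to the index name, as described in the convention above.
    static final String SEARCHER_SUFFIX = "_searcher";

    // Builds the RMI URL under which the searchable of an index is bound.
    static String searcherUrl(String host, int port, String indexName) {
        return "//" + host + ":" + port + "/" + indexName + SEARCHER_SUFFIX;
    }

    public static void main(String[] args) {
        System.out.println(searcherUrl("localhost", 4000, "SampleIndex"));
    }
}
```

The resulting string is what a client would pass to Naming.lookup.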

Let's see how the org.apache.lucene.server.samples.SampleSearch class (used in the search script) performs a query over the SampleIndex index.

...

String name = host + "IndexServer";
IndexService server = (IndexService) Naming.lookup(name);
						
Searchable rsearchable1 = (Searchable) Naming.lookup(host + "SampleIndex_searcher");
Searchable[] searchables = {rsearchable1};
MultiSearcher multi = new MultiSearcher(searchables);
			
Analyzer ana = (Analyzer)server.getAnalyzerClass("SampleIndex").newInstance();

...
			
Query query = QueryParser.parse(Squery,"content", ana);
Hits hits = multi.search(query);

Note that to parse the query correctly, you need to know which analyzer to use; that is why the IndexServer provides the getAnalyzerClass service, which returns the analyzer class used to build a given index.

Configuration File Reference.

The reference documentation for the configuration file is available here.