Indexing data with Apache Lucene
In this recipe, we will demonstrate how to index a large amount of data with Apache Lucene. Indexing is the first step towards fast searching. Under the hood, Lucene uses an inverted full-text index: it takes all the documents, splits them into words or tokens, and then builds a map from each token to the documents that contain it, so that at search time it knows in advance exactly which documents to look at for a given term.
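To make the idea concrete, here is a minimal sketch of an inverted index built by hand with plain Java collections (no Lucene involved); Lucene's real index adds term frequencies, positions, and compressed on-disk structures on top of this basic mapping:

import java.util.*;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = { "to be or not to be", "to index is to search fast" };

        // token -> IDs of the documents that contain that token
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }

        // looking up a term is now a single map access, not a scan over all documents
        System.out.println("'to' appears in documents " + index.get("to"));       // [0, 1]
        System.out.println("'index' appears in documents " + index.get("index")); // [1]
    }
}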
Getting ready
The following steps need to be carried out:
- To download Apache Lucene, go to http://lucene.apache.org/core/downloads.html, and click on the Download button. At the time of writing, the latest version of Lucene was 6.4.1. Once you click on the Download button, it will take you to the mirror websites that host the distribution:
- Choose any appropriate mirror for downloading. Once you click a mirror website, it will take you to the directory of the distribution. Download the lucene-6.4.1.zip file onto your system:
- Once you have downloaded it, unzip the distribution. You will see a nicely organized folder structure, as follows:
- Open Eclipse, and create a project named LuceneTutorial. To do that, open Eclipse and go to File. Then go to New... and Java Project. Type in the name of the project and click on Finish:
- Now you will insert the JAR files necessary for this recipe as external libraries into your project. Right-click on your project name in the Package Explorer. Select Build Path and then Configure Build Path... This will open the properties for your project:
- Click on the Add External JARs... button, and then add the following JAR files from the Lucene 6.4.1 distribution:
  - lucene-core-6.4.1.jar, which can be found in lucene-6.4.1\core of your unzipped Lucene distribution
  - lucene-queryparser-6.4.1.jar, which can be found in lucene-6.4.1\queryparser of your unzipped Lucene distribution
  - lucene-analyzers-common-6.4.1.jar, which can be found in lucene-6.4.1\analysis\common of your unzipped Lucene distribution

After adding the JAR files, click on OK:
- For indexing, you will be using the writings of William Shakespeare in text format. Open a browser, and go to http://norvig.com/ngrams/. This will open a page named Natural Language Corpus Data: Beautiful Data. Among the files listed in the Download section, you will find a .txt file named shakespeare. Download this file anywhere on your system.
- Unzip the files and you will see that the distribution contains three folders: comedies, historical, and tragedies:
- Create a folder in your project directory. Right-click on your project in Eclipse, go to New, and then click Folder. As the folder name, type in input and click on Finish:
- Copy the shakespeare.txt file from step 8 into the folder you created in step 9.
- Follow the instructions in step 9 to create another folder named index. At this stage, your project folder will look like this:
Now you are ready for coding.
How to do it...
- Create a package in your project named org.apache.lucene.demo, and create a Java file in the package named IndexFiles.java:
- In that Java file, you will create a class named IndexFiles:

public class IndexFiles {
- The first method you will write is called indexDocs. The method indexes any given file using the given index writer. If a directory is provided as the argument, the method recursively iterates over the files and directories found under that directory. This method indexes one document per input file:

Tip
This method is relatively slow; therefore, for better performance, put multiple documents into your input file(s).

static void indexDocs(final IndexWriter writer, Path path) throws IOException {
- writer is the index writer that writes the index in which the information about the given file or directory will be stored
- path is the file to index, or the directory containing the files to be indexed
- If a directory is provided, it is traversed recursively:
if (Files.isDirectory(path)) { Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
- You will then override a method named visitFile to visit a file based on the given path and its basic file attributes:

@Override public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
- Next, you will call a static method that you will create later, named indexDoc. We have deliberately left the catch block empty, letting you decide what to do if a file cannot be indexed (one possible handling is sketched after this step):

try { indexDoc(writer, file, attrs.lastModifiedTime().toMillis()); } catch (IOException ignore) { }
- Return from the visitFile method:

return FileVisitResult.CONTINUE; }
- Close the blocks:
} ); }
- In the else block, call the indexDoc method. Remember that in the else block, you are dealing with a file, not a directory:

else { indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis()); }
- Close the indexDocs() method:

}
- Now create a method to deal with indexing of a single document:
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
- First, open the file in a try-with-resources block and create a new, empty document:

try (InputStream stream = Files.newInputStream(file)) { Document doc = new Document();
- Next, add the path of the file as a field. As the field name, use "path". The field will be searchable, that is, indexed. However, note that the field is not tokenized, and term frequency or positional information is not indexed:

Field pathField = new StringField("path", file.toString(), Field.Store.YES); doc.add(pathField);
- Add the last modified date of the file in a field named "modified":

doc.add(new LongPoint("modified", lastModified));
- Add the contents of the file to a field named "contents". The reader that you specify here ensures that the text of the file is tokenized and indexed, but not stored:

doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
Note
If the file is not in UTF-8 encoding, then searching for special characters will fail.
- If a new index is being created, simply add the document to the index:
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) { System.out.println("adding " + file); writer.addDocument(doc); }
- There is a chance that the document has already been indexed. Your else block handles those cases: you use updateDocument to replace the old document matching the exact path, if present:

else { System.out.println("updating " + file); writer.updateDocument(new Term("path", file.toString()), doc); }
- Close the try block and the method:
} }
- Now let's create the main method for the class.
public static void main(String[] args) {
- You will provide three options from the console when you run your program:
  - The first option is -index, and its parameter is the folder where the index will be stored
  - The second option is -docs, and its parameter is the folder that contains your text files
  - The last option is -update, which denotes that you want to update an existing index rather than create a new one

To hold the values of these three options, create and initialize three variables:
String indexPath = "index"; String docsPath = null; boolean create = true;
- Set the values of the three options:
for(int i=0;i<args.length;i++) { if ("-index".equals(args[i])) { indexPath = args[i+1]; i++; } else if ("-docs".equals(args[i])) { docsPath = args[i+1]; i++; } else if ("-update".equals(args[i])) { create = false; } }
- Set the document directory:
final Path docDir = Paths.get(docsPath);
- Now you will start indexing the files in your directory. First, set the timer, as you will be timing the indexing latency:
Date start = new Date();
- For indexing, create a directory and an analyzer (in this case, you will be using the basic StandardAnalyzer), along with an index writer configuration:
try { Directory dir = FSDirectory.open(Paths.get(indexPath)); Analyzer analyzer = new StandardAnalyzer(); IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
- With the index writer configuration in place, and based on whether you are creating or updating the index, set the open mode for indexing. If you choose to create a new index, the open mode will be set to CREATE. Otherwise, it will be CREATE_OR_APPEND:

if (create) { iwc.setOpenMode(OpenMode.CREATE); } else { iwc.setOpenMode(OpenMode.CREATE_OR_APPEND); }
- Create an index writer and index the documents in the document directory:
IndexWriter writer = new IndexWriter(dir, iwc); indexDocs(writer, docDir);
- Close the writer:

writer.close();
- At this point, you are almost done with the coding. Just complete the tracking of time for indexing:
Date end = new Date(); System.out.println(end.getTime() - start.getTime() + " total milliseconds");
- Close the try block. We intentionally left the catch block blank so that you can decide what to do in the case of an exception during indexing:

} catch (IOException e) { }
- Close the main method and close the class:
} }
- Right-click on your project in Eclipse, select Run As, and click on Run Configurations...:
- Go to the Arguments tab in the Run Configurations window. In the Program Arguments option, put -docs input\ -index index\. Click on Run:
- The output of the code lists each file as it is added to the index and ends with the total indexing time in milliseconds.
How it works...
The complete code for the recipe is as follows:
package org.apache.lucene.demo;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

public class IndexFiles {

    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path)) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            Document doc = new Document();
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);
            doc.add(new LongPoint("modified", lastModified));
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }

    public static void main(String[] args) {
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;

        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i])) {
                indexPath = args[i + 1];
                i++;
            } else if ("-docs".equals(args[i])) {
                docsPath = args[i + 1];
                i++;
            } else if ("-update".equals(args[i])) {
                create = false;
            }
        }

        final Path docDir = Paths.get(docsPath);
        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + indexPath + "'...");
            Directory dir = FSDirectory.open(Paths.get(indexPath));
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
            if (create) {
                iwc.setOpenMode(OpenMode.CREATE);
            } else {
                iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            }
            IndexWriter writer = new IndexWriter(dir, iwc);
            indexDocs(writer, docDir);
            writer.close();
            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
        }
    }
}
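The lucene-queryparser JAR added earlier is not actually used by the indexing code itself; it comes into play once you want to query the index you have just built. The following standalone class is a minimal sketch (not part of this recipe's code; the class name SearchCheck and the query term hamlet are just examples) that verifies the index by searching the contents field and printing the stored path of each hit:

package org.apache.lucene.demo;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SearchCheck {
    public static void main(String[] args) throws Exception {
        // open the index that IndexFiles wrote into the "index" folder
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // parse a query against the tokenized "contents" field
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            Query query = parser.parse("hamlet"); // example term; replace with any word
            TopDocs results = searcher.search(query, 10);
            System.out.println(results.totalHits + " matching documents");
            for (ScoreDoc hit : results.scoreDocs) {
                // "path" was stored, so it can be retrieved and printed
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}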