Java Data Science Cookbook

Indexing data with Apache Lucene

In this recipe, we will demonstrate how to index a large amount of data with Apache Lucene. Indexing is the first step toward searching data quickly. Under the hood, Lucene uses an inverted full-text index. In other words, it takes all the documents, splits them into words or tokens, and records for each token the documents that contain it, so that it knows in advance exactly which documents to look at when a term is searched.
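
To make the idea of an inverted index concrete, here is a minimal sketch in plain Java. It is not how Lucene is implemented (a real Lucene index also keeps term frequencies, positions, and on-disk segments); the class name InvertedIndexSketch and the naive tokenizer are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene or of the recipe's code.
public class InvertedIndexSketch {

    // Maps each lowercased token to the sorted ids of the documents that contain it.
    public static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenizer (an assumption): lowercase, then split on
            // anything that is not a letter or a digit.
            String[] tokens = docs.get(docId)
                    .toLowerCase(Locale.ROOT)
                    .split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "To be, or not to be",
                "Brevity is the soul of wit",
                "To thine own self be true");
        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println("be -> " + index.get("be"));   // prints "be -> [0, 2]"
        System.out.println("wit -> " + index.get("wit")); // prints "wit -> [1]"
    }
}
```

Looking up a term is then a single map access rather than a scan over every document, which is the property that makes searching an indexed collection fast.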

Getting ready

The following are the steps to be implemented:

  1. To download Apache Lucene, go to http://lucene.apache.org/core/downloads.html, and click on the Download button. At the time of writing, the latest version of Lucene was 6.4.1. Clicking on the Download button takes you to the mirror websites that host the distribution.

  2. Choose any appropriate mirror for downloading. Clicking on a mirror website takes you to a directory listing of the distribution. Download the lucene-6.4.1.zip file onto your system.

  3. Once you have downloaded it, unzip the distribution. You will see a neatly organized folder structure.

  4. Open Eclipse, and create a project named LuceneTutorial. To do that, go to File, then New..., and then Java Project. Type the name of the project and click on Finish.

  5. Now add the JAR files necessary for this recipe to your project as external libraries. Right-click on your project name in the Package Explorer. Select Build Path and then Configure Build Path... This will open the properties of your project.

  6. Click on the Add External Jars button, and then add the following JAR files from the Lucene 6.4.1 distribution:
    • lucene-core-6.4.1.jar, which can be found in lucene-6.4.1\core of your unzipped Lucene distribution
    • lucene-queryparser-6.4.1.jar, which can be found in lucene-6.4.1\queryparser of your unzipped Lucene distribution
    • lucene-analyzers-common-6.4.1.jar, which can be found in lucene-6.4.1\analysis\common of your unzipped Lucene distribution

    After adding the JAR files, click on OK.

  7. For indexing, you will be using the writings of William Shakespeare in text format. Open a browser, and go to http://norvig.com/ngrams/. This will open a page named Natural Language Corpus Data: Beautiful Data. Among the files in the Download section, you will find a .txt file named shakespeare. Download this file to any location on your system.
  8. Unzip the files and you will see that the distribution contains three folders: comedies, historical, and tragedies.

  9. Create a folder in your project directory. Right-click on your project in Eclipse, go to New, and then click on Folder. Type input as the folder name and click on Finish.

  10. Copy the shakespeare.txt file from step 8 into the input folder you created in step 9.
  11. Follow the instructions in step 9 to create another folder named index. At this stage, your project contains both an input folder and an index folder.

Now you are ready for coding.

How to do it...

  1. Create a package in your project named org.apache.lucene.demo, and create a Java file in the package named IndexFiles.java.

  2. In that Java file, you will create a class named IndexFiles:
             public class IndexFiles { 
    
  3. The first method you will write is called indexDocs. The method indexes any given file using the given index writer. If a directory is provided as the argument, the method recursively iterates over the files and directories found under it. This method indexes one document per input file:
    Tip

    Indexing one document per file is relatively slow. For better performance, put multiple documents into your input file(s).

            static void indexDocs(final IndexWriter writer, Path path) 
              throws IOException { 
    
    • writer is the index writer that writes to the index in which the given file or directory information will be stored
    • path is the file to index, or the directory containing the files to index
  4. If a directory is provided, the directory will be iterated or traversed recursively:
            if (Files.isDirectory(path)) { 
              Files.walkFileTree(path, new SimpleFileVisitor<Path>() { 
    
  5. You will then be overriding a method named visitFile to visit the file or directory based on the given path and basic file attributes:
            @Override 
              public FileVisitResult visitFile(Path file, 
                BasicFileAttributes attrs) throws IOException { 
    
  6. Next, call a static method named indexDoc, which you will create later. The catch block is deliberately left empty so that you can decide what to do if a file cannot be indexed:
            try { 
                indexDoc(writer, file, 
                   attrs.lastModifiedTime().toMillis()); 
              } catch (IOException ignore) { 
      
           } 
    
  7. Return from the visitFile method:
            return FileVisitResult.CONTINUE; 
           } 
    
  8. Close the blocks:
        } 
             ); 
        } 
    
  9. In the else block, call the indexDoc method. Remember that in the else block, you are dealing with files, not directories:
            else { 
             indexDoc(writer, path,  
               Files.getLastModifiedTime(path).toMillis()); 
           } 
    
  10. Close the indexDocs() method:
           } 
    
  11. Now create a method to deal with indexing of a single document:
            static void indexDoc(IndexWriter writer, Path file, long 
              lastModified) throws IOException { 
    
  12. First, create a try block to create a new empty document:
            try (InputStream stream = Files.newInputStream(file)) { 
              Document doc = new Document(); 
    
  13. Next, add the path of the file as a field named "path". The field is indexed, and therefore searchable, but it is not tokenized, and no term frequency or positional information is indexed for it:
            Field pathField = new StringField("path", file.toString(),  
              Field.Store.YES); 
            doc.add(pathField); 
    
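  14. Add the last modified date of the file as a field named "modified". A LongPoint is indexed so that it can be filtered efficiently by range, but its value is not stored: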
            doc.add(new LongPoint("modified", lastModified)); 
    
  15. Add the contents of the file to a field named "contents". The reader that you specify will make sure that the text of the file is tokenized and indexed, but not stored:
            doc.add(new TextField("contents", new BufferedReader(new 
              InputStreamReader(stream, StandardCharsets.UTF_8)))); 
    
    Note

    If the file is not in UTF-8 encoding, then searching for special characters will fail.

  16. Add the document to the index. If the writer was opened in CREATE mode, the index is new, so the document can simply be added:
            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) { 
                System.out.println("adding " + file); 
                writer.addDocument(doc); 
            } 
    
  17. There is a chance that the document has been indexed already. Your else block will handle those cases. There, you use updateDocument instead of addDocument, so that any old document matching the exact path is replaced:
            else { 
                System.out.println("updating " + file); 
                writer.updateDocument(new Term("path", file.toString()),  
                  doc); 
           } 
    
  18. Close the try block and the method:
            }
            }
  19. Now create the main method for the class:
            public static void main(String[] args) {
  20. You can provide three options from the console when you run your program:
    • The first option is -index, and its parameter is the folder in which the index will be stored
    • The second option is -docs, and its parameter is the folder that contains your text files
    • The last option is -update, a flag that takes no parameter; if it is present, an existing index is updated instead of a new one being created

    To hold the values of these three options, create and initialize three variables:

            String indexPath = "index"; 
            String docsPath = null; 
            boolean create = true; 
    
  21. Set the values of the three options:
            for(int i=0;i<args.length;i++) { 
             if ("-index".equals(args[i])) { 
                indexPath = args[i+1]; 
                i++; 
             } else if ("-docs".equals(args[i])) { 
                docsPath = args[i+1]; 
                i++; 
             } else if ("-update".equals(args[i])) { 
                create = false; 
             } 
           } 
    
  22. Set the document directory:
            final Path docDir = Paths.get(docsPath); 
    
  23. Now you will start indexing the files in your directory. First, set the timer, as you will be timing the indexing latency:
            Date start = new Date(); 
    
  24. For indexing, create a directory, an analyzer (in this case, a basic standard analyzer), and an index writer configuration:
           try { 
     
             Directory dir = FSDirectory.open(Paths.get(indexPath)); 
             Analyzer analyzer = new StandardAnalyzer(); 
             IndexWriterConfig iwc = new IndexWriterConfig(analyzer); 
    
  25. With the index writer configured and based on the input regarding the creation or update of the index, set the open modes for the indexing. If you choose to create a new index, the open mode will be set to CREATE. Otherwise, it will be CREATE_OR_APPEND:
             if (create) { 
                iwc.setOpenMode(OpenMode.CREATE); 
             } else { 
                iwc.setOpenMode(OpenMode.CREATE_OR_APPEND); 
             } 
    
  26. Create an index writer and use it to index the documents:
            IndexWriter writer = new IndexWriter(dir, iwc); 
            indexDocs(writer, docDir);  
    
  27. Close the writer:
           writer.close(); 
    
  28. At this point, you are almost done with the coding. Just complete the tracking of time for indexing:
            Date end = new Date(); 
            System.out.println(end.getTime() - start.getTime()
              + " total milliseconds"); 
    
  29. Close the try block. We intentionally left the catch block blank so that you can decide what you do in the case of an exception during indexing:
            } catch (IOException e) { 
            } 
    
  30. Close the main method and close the class:
           } 
           } 
    
  31. Right-click on your project in Eclipse, select Run As, and click on Run Configurations....

  32. Go to the Arguments tab in the Run Configurations window. In the Program Arguments option, enter -docs input\ -index index\. Click on Run.

  33. The program prints one "adding ..." line for each file it indexes, followed by the total indexing time in milliseconds.
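
The directory traversal from steps 3 to 10 can be tried on its own, without Lucene on the classpath. The following is a minimal sketch under that assumption: the class name WalkSketch and the method collectFiles are hypothetical, and the point where the recipe would call indexDoc is marked with a comment.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it only lists files and does not index them.
public class WalkSketch {

    // Collects every file visited under root, or root itself when it is not
    // a directory, mirroring the if/else structure of indexDocs.
    public static List<Path> collectFiles(Path root) throws IOException {
        final List<Path> found = new ArrayList<>();
        if (Files.isDirectory(root)) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // The recipe would call indexDoc(writer, file,
                    // attrs.lastModifiedTime().toMillis()) at this point.
                    found.add(file);
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            found.add(root);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the input folder created in the Getting ready section.
        Path root = Paths.get(args.length > 0 ? args[0] : "input");
        for (Path p : collectFiles(root)) {
            System.out.println(p + " last modified at "
                    + Files.getLastModifiedTime(p).toMillis());
        }
    }
}
```

Running it against the input folder before indexing is a quick way to confirm which files the indexer is going to see.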

How it works...

The complete code for the recipe is as follows:

package org.apache.lucene.demo; 
 
import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.document.Field; 
import org.apache.lucene.document.LongPoint; 
import org.apache.lucene.document.StringField; 
import org.apache.lucene.document.TextField; 
import org.apache.lucene.index.IndexWriter; 
import org.apache.lucene.index.IndexWriterConfig.OpenMode; 
import org.apache.lucene.index.IndexWriterConfig; 
import org.apache.lucene.index.Term; 
import org.apache.lucene.store.Directory; 
import org.apache.lucene.store.FSDirectory; 
import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.nio.charset.StandardCharsets; 
import java.nio.file.FileVisitResult; 
import java.nio.file.Files; 
import java.nio.file.Path; 
import java.nio.file.Paths; 
import java.nio.file.SimpleFileVisitor; 
import java.nio.file.attribute.BasicFileAttributes; 
import java.util.Date; 
 
public class IndexFiles { 
   static void indexDocs(final IndexWriter writer, Path path) throws 
     IOException { 
      if (Files.isDirectory(path)) { 
         Files.walkFileTree(path, new SimpleFileVisitor<Path>() { 
            @Override 
            public FileVisitResult visitFile(Path file, 
              BasicFileAttributes attrs) throws IOException { 
               try { 
                  indexDoc(writer, file, 
                    attrs.lastModifiedTime().toMillis()); 
               } catch (IOException ignore) { 
               } 
               return FileVisitResult.CONTINUE; 
            } 
         } 
               ); 
      } else { 
         indexDoc(writer, path, 
            Files.getLastModifiedTime(path).toMillis()); 
      } 
   } 
 
   static void indexDoc(IndexWriter writer, Path file, long 
      lastModified) throws IOException { 
      try (InputStream stream = Files.newInputStream(file)) { 
         Document doc = new Document(); 
         Field pathField = new StringField("path", file.toString(), 
           Field.Store.YES); 
         doc.add(pathField); 
         doc.add(new LongPoint("modified", lastModified)); 
         doc.add(new TextField("contents", new BufferedReader(new 
            InputStreamReader(stream, StandardCharsets.UTF_8)))); 
 
         if (writer.getConfig().getOpenMode() == OpenMode.CREATE) { 
            System.out.println("adding " + file); 
            writer.addDocument(doc); 
         } else { 
            System.out.println("updating " + file); 
            writer.updateDocument(new Term("path", file.toString()), 
              doc); 
         } 
      } 
   } 
   public static void main(String[] args) { 
      String indexPath = "index"; 
      String docsPath = null; 
      boolean create = true; 
      for(int i=0;i<args.length;i++) { 
         if ("-index".equals(args[i])) { 
            indexPath = args[i+1]; 
            i++; 
         } else if ("-docs".equals(args[i])) { 
            docsPath = args[i+1]; 
            i++; 
         } else if ("-update".equals(args[i])) { 
            create = false; 
         } 
      } 
 
      final Path docDir = Paths.get(docsPath); 
 
      Date start = new Date(); 
      try { 
         System.out.println("Indexing to directory '" + indexPath + 
           "'..."); 
 
         Directory dir = FSDirectory.open(Paths.get(indexPath)); 
         Analyzer analyzer = new StandardAnalyzer(); 
         IndexWriterConfig iwc = new IndexWriterConfig(analyzer); 
 
         if (create) { 
            iwc.setOpenMode(OpenMode.CREATE); 
         } else { 
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND); 
         } 
         IndexWriter writer = new IndexWriter(dir, iwc); 
         indexDocs(writer, docDir); 
 
         writer.close(); 
 
         Date end = new Date(); 
         System.out.println(end.getTime() - start.getTime()
           + " total milliseconds"); 
 
      } catch (IOException e) { 
      } 
   } 
}
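
One thing to watch in the main method above: if the -docs option is omitted, docsPath is still null when Paths.get(docsPath) is called, and an option given as the last argument without a value makes args[i+1] read past the end of the array. The following is one possible way to guard against both, shown as a standalone sketch. The class name IndexArgsSketch, its fields, and the wording of the messages are assumptions for illustration, not part of the recipe.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that parses the same three options as IndexFiles.main.
public class IndexArgsSketch {
    public final String indexPath;
    public final Path docDir;
    public final boolean create;

    private IndexArgsSketch(String indexPath, Path docDir, boolean create) {
        this.indexPath = indexPath;
        this.docDir = docDir;
        this.create = create;
    }

    public static IndexArgsSketch parse(String[] args) {
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
        for (int i = 0; i < args.length; i++) {
            // The i + 1 bound check means an option without a value is ignored
            // instead of throwing ArrayIndexOutOfBoundsException.
            if ("-index".equals(args[i]) && i + 1 < args.length) {
                indexPath = args[++i];
            } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
                docsPath = args[++i];
            } else if ("-update".equals(args[i])) {
                create = false;
            }
        }
        if (docsPath == null) {
            throw new IllegalArgumentException(
                    "Usage: IndexFiles [-index INDEX_PATH] -docs DOCS_PATH [-update]");
        }
        Path docDir = Paths.get(docsPath);
        if (!Files.isReadable(docDir)) {
            throw new IllegalArgumentException(
                    "Document directory '" + docDir.toAbsolutePath()
                            + "' does not exist or is not readable");
        }
        return new IndexArgsSketch(indexPath, docDir, create);
    }

    public static void main(String[] args) {
        try {
            IndexArgsSketch parsed = parse(args);
            System.out.println("index=" + parsed.indexPath
                    + ", docs=" + parsed.docDir + ", create=" + parsed.create);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```

With checks like these, a missing or mistyped -docs argument produces a readable message rather than a NullPointerException before indexing starts.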