Java Data Science Cookbook

Searching indexed data with Apache Lucene

Now that you have indexed your data, you will be searching the data using Apache Lucene in this recipe. The code for searching in this recipe depends on the index that you created in the previous recipe, and therefore, it will only successfully execute if you followed the instructions in the previous recipe.

Getting ready

  1. Complete the previous recipe, then go to the index directory that you created in your project in step 11 of that recipe. Make sure that it contains some index files.

  2. Create a Java file named SearchFiles in the org.apache.lucene.demo package that you created in the previous recipe.

  3. Now you are ready to type in some code in the SearchFiles.java file.

How to do it...

  1. Open SearchFiles.java in the editor of Eclipse and create the following class:
            public class SearchFiles { 
    
  2. Create two String constants. The first holds the path of the index that you created in the previous recipe. The second holds the name of the field to search, which in our case is the contents field of the index:
            public static final String INDEX_DIRECTORY = "index"; 
            public static final String FIELD_CONTENTS = "contents"; 
    
  3. Start creating your main method:
            public static void main(String[] args) throws Exception { 
    
  4. Create an IndexReader by opening the index in your index directory:
            IndexReader reader = 
              DirectoryReader.open(FSDirectory.open
                (Paths.get(INDEX_DIRECTORY))); 
    
  5. The next step will be to create a searcher that will search the index:
             IndexSearcher indexSearcher = new IndexSearcher(reader); 
    
  6. As your analyzer, create a standard analyzer:
             Analyzer analyzer = new StandardAnalyzer(); 
    
  7. Create a query parser by providing two arguments to the QueryParser constructor, the field where you will be searching and the analyzer you have created:
            QueryParser queryParser = new QueryParser(FIELD_CONTENTS,  
              analyzer); 
    
  8. In this recipe, you will be using a predefined search term. In this search, you are trying to find the documents that contain both "over-full" and "persuasion":
            String searchString = "over-full AND persuasion"; 
    
  9. Using the search string, create a query:
            Query query = queryParser.parse(searchString); 
    
  10. The searcher now looks in the index for documents that match the query. The second argument sets the maximum number of results to return, which in our case is 5:
            TopDocs results = indexSearcher.search(query, 5); 
    
  11. Create an array to hold the hits:
            ScoreDoc[] hits = results.scoreDocs; 
    
  12. Note that during indexing, we used only one document, shakespeare.txt, so in our case this array can hold at most one hit.
  13. You will also want to know the total number of documents that matched the query:
            int numTotalHits = results.totalHits; 
            System.out.println(numTotalHits + " total matching documents"); 
    
  14. Finally, iterate through the hits. Each hit carries the ID of a matching document. Use that ID to fetch the document, and then print its path together with the score that Lucene calculated for it against your search term:
            for(int i=0;i<hits.length;++i) { 
             int docId = hits[i].doc; 
             Document d = indexSearcher.doc(docId); 
             System.out.println((i + 1) + ". " + d.get("path") + " score=" 
               + hits[i].score); 
            } 
    
  15. Close the method and the class:
            } 
            } 
    
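  16. If you run the code, it prints the number of matching documents, followed by one numbered line per hit showing the document path and its score. Because shakespeare.txt is the only indexed document and contains both terms, the search reports a single matching document.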

  17. Open the shakespeare.txt file in the input folder of your project folder. Search manually, and you will find that both "over-full" and "persuasion" are present in the document.
  18. Change the searchString in step 8, as follows:
            String searchString = "shakespeare"; 
    
  19. Keep the rest of the code as it is and run it again. This time the search reports no matching documents.

  20. Open the shakespeare.txt file again and check whether the term shakespeare appears in it. You will find that it does not.
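
To see why the first search string matches and the second does not, it can help to picture what an index lookup with AND does. The following is a minimal sketch in plain Java, not Lucene's implementation: the TinyIndex class, its tokenize, add, and searchAll methods, and the two sample texts are all hypothetical and exist only for illustration. It assumes a simple tokenizer that lowercases text and splits on anything that is not a letter or a digit, so a hyphenated word is stored as two separate terms.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not how Lucene stores or searches data.
class TinyIndex {
    // Maps each term to the sorted IDs of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Assumed tokenizer: lowercase, then split on non-letter, non-digit characters.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Records every term of the text under the given document ID.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND semantics: keep only the documents that contain every given term.
    Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms narrow it down
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // Made-up sample texts, not taken from shakespeare.txt.
        index.add(0, "a sample line with persuasion and an over-full heart");
        index.add(1, "a sample line with persuasion only");

        System.out.println(index.searchAll("persuasion", "full"));    // prints [0]
        System.out.println(index.searchAll("persuasion", "missing")); // prints []
    }
}
```

A term that was never indexed has no entry in the map, so any AND query that includes it returns an empty result, which mirrors the zero-hit search for shakespeare in step 19.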

The complete code for this recipe is as follows:

package org.apache.lucene.demo; 
import java.nio.file.Paths; 
import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.Document; 
import org.apache.lucene.index.DirectoryReader; 
import org.apache.lucene.index.IndexReader; 
import org.apache.lucene.queryparser.classic.QueryParser; 
import org.apache.lucene.search.IndexSearcher; 
import org.apache.lucene.search.Query; 
import org.apache.lucene.search.ScoreDoc; 
import org.apache.lucene.search.TopDocs; 
import org.apache.lucene.store.FSDirectory; 
 
public class SearchFiles {
   public static final String INDEX_DIRECTORY = "index";
   public static final String FIELD_CONTENTS = "contents";

   public static void main(String[] args) throws Exception {
      // Open the index that the previous recipe wrote to disk.
      IndexReader reader = DirectoryReader.open(FSDirectory.open
        (Paths.get(INDEX_DIRECTORY)));
      IndexSearcher indexSearcher = new IndexSearcher(reader);

      // Parse the search string against the contents field.
      Analyzer analyzer = new StandardAnalyzer();
      QueryParser queryParser = new QueryParser(FIELD_CONTENTS,
         analyzer);
      String searchString = "shakespeare";
      Query query = queryParser.parse(searchString);

      // Ask for at most five of the best-scoring hits.
      TopDocs results = indexSearcher.search(query, 5);
      ScoreDoc[] hits = results.scoreDocs;

      int numTotalHits = results.totalHits;
      System.out.println(numTotalHits + " total matching documents");

      // Print the path field and the score of each returned hit.
      for (int i = 0; i < hits.length; ++i) {
         int docId = hits[i].doc;
         Document d = indexSearcher.doc(docId);
         System.out.println((i + 1) + ". " + d.get("path") + " score="
           + hits[i].score);
      }

      // Release the index files once searching is done.
      reader.close();
   }
}
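
The listing above reads two different numbers from the search result: results.totalHits counts every matching document, while hits.length can never exceed the limit of 5 passed to search. The following plain-Java sketch illustrates that distinction under stated assumptions. It does not use Lucene, the TopN class and its topN and countMatches methods are hypothetical, the scores are made up, and a score of zero or less is simply treated as "no match".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: mimics the shape of a "top n" result, not Lucene's collector.
class TopN {
    // Returns the document IDs of the n best-scoring matches, best first.
    static List<Integer> topN(float[] scores, int n) {
        Comparator<Integer> byScore = Comparator.comparingDouble(i -> scores[i]);
        // Min-heap on score, so the weakest kept hit is the first to be evicted.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) {
                continue;                 // assumed convention: not a match
            }
            heap.add(doc);
            if (heap.size() > n) {
                heap.poll();              // drop the current weakest hit
            }
        }
        List<Integer> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());    // highest score first
        return best;
    }

    // Counts every match, including those that do not fit in the top n.
    static int countMatches(float[] scores) {
        int count = 0;
        for (float score : scores) {
            if (score > 0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up scores for eight documents; document 1 does not match.
        float[] scores = {0.9f, 0f, 0.4f, 0.7f, 0.1f, 0.3f, 0.8f, 0.2f};
        System.out.println(countMatches(scores) + " total matching documents"); // prints 7 total matching documents
        System.out.println(topN(scores, 5));                                    // prints [0, 6, 3, 2, 5]
    }
}
```

In this recipe the two numbers coincide because only one document was indexed, but with a larger index the total can be far greater than the number of hits returned.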
Note

Visit https://lucene.apache.org/core/2_9_4/queryparsersyntax.html for the query syntax supported by Apache Lucene.