Working with Lucene.Net
By: O'Reilly Media
    2007-08-16

    Table of Contents:
  • Working with Lucene.Net
  • Using Lucene.Net
  • Creating an index
  • Adding data to a document
  • Searching an index



    Working with Lucene.Net - Creating an index


    (Page 3 of 5 )

    Indexing is the process of analyzing raw text data and converting it into a format that allows Lucene.Net to search that data quickly. A Lucene.Net index is optimized for fast random access to all words stored in the index. When you create a Lucene.Net index, you can define multiple fields and store different data in each one. For example, if you are indexing Microsoft Office files (Word, Excel, PowerPoint, etc.), you can create a field for the filename, a field for the file date, and a field for the body of the document. At search time, you can then restrict a query to filenames, file dates, or document bodies alone, or combine two or more fields in a single query.
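
    To make the idea of per-field lookup concrete, here is a minimal sketch of a toy inverted index. It is an illustration only and does not reflect how Lucene.Net stores its data; the class name ToyFieldIndex and its methods are hypothetical, and the "analysis" is just lowercasing and splitting on whitespace.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only: maps "field:term" keys to the set of document
// IDs that contain that term in that field. Not Lucene.Net's real format.
public sealed class ToyFieldIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // Index every whitespace-separated word of 'text' under 'field'.
    public void Add(int docId, string field, string text)
    {
        string[] words = text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            string key = field + ":" + word;
            SortedSet<int> docs;
            if (!postings.TryGetValue(key, out docs))
            {
                docs = new SortedSet<int>();
                postings[key] = docs;
            }
            docs.Add(docId);
        }
    }

    // Look up one term in one field: a dictionary hit, not a scan of
    // every document. Returns a copy so callers cannot alter the index.
    public SortedSet<int> Search(string field, string term)
    {
        SortedSet<int> docs;
        return postings.TryGetValue(field + ":" + term.ToLowerInvariant(), out docs)
            ? new SortedSet<int>(docs)
            : new SortedSet<int>();
    }
}
```

    Because every term is keyed by its field, a search for "report" in the body field never touches filename entries, which is the narrowing behavior described above.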

    Example 4-1 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution. This example application shows you how to create an index and populate it with data. It assumes that you have a folder holding several raw text files; if you don’t, you’ll need to create one and populate it with some files. The index itself is stored in a subfolder called index, which the example application creates in the current working directory. Because the application exits if that subfolder already exists, delete any index left over from a previous run first.

    Example 4-1. A Lucene.Net command-line sample application to index a filesystem

    using System;
    using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
    using IndexWriter = Lucene.Net.Index.IndexWriter;
    using Document = Lucene.Net.Documents.Document;
    using Field = Lucene.Net.Documents.Field;
    using DateTools = Lucene.Net.Documents.DateTools;

    namespace Lucene.Net.Demo
    {
      class IndexFiles
      {
        internal static readonly System.IO.FileInfo INDEX_DIR =
            new System.IO.FileInfo("index");

        [STAThread]
        public static void Main(System.String[] args)
        {
          System.String usage = typeof(IndexFiles) + " <root_directory>";
          if (args.Length == 0)
          {
            System.Console.Error.WriteLine("Usage: " + usage);
            System.Environment.Exit(1);
          }

          // Check whether the "index" directory already exists.
          // If it does, exit rather than overwrite it; if not,
          // the IndexWriter below creates it.
          bool tmpBool = System.IO.Directory.Exists(INDEX_DIR.FullName);
          if (tmpBool)
          {
            System.Console.Out.WriteLine("Cannot save index to '" +
                INDEX_DIR + "' directory, please delete it first");
            System.Environment.Exit(1);
          }

          // Make sure the folder of files to index exists.
          System.IO.FileInfo docDir = new System.IO.FileInfo(args[0]);
          tmpBool = System.IO.Directory.Exists(docDir.FullName);
          if (!tmpBool)
          {
            System.Console.Out.WriteLine("Document directory '" +
                docDir.FullName + "' does not exist or is not readable, " +
                "please check the path");
            System.Environment.Exit(1);
          }

          System.DateTime start = System.DateTime.Now;
          try
          {
            // Passing true tells IndexWriter to create a new index.
            IndexWriter writer =
                new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
            System.Console.Out.WriteLine("Indexing to directory '" +
                INDEX_DIR + "'...");
            IndexDocs(writer, docDir);
            System.Console.Out.WriteLine("Optimizing...");
            writer.Optimize();
            writer.Close();

            // Ticks are 100-nanosecond units, so report the elapsed
            // TimeSpan in milliseconds rather than subtracting raw ticks.
            System.DateTime end = System.DateTime.Now;
            System.Console.Out.WriteLine((end - start).TotalMilliseconds +
                " total milliseconds");
          }
          catch (System.IO.IOException e)
          {
            System.Console.Out.WriteLine(" caught a " + e.GetType() +
                "\n with message: " + e.Message);
          }
        }

        // Recursively walk the filesystem: descend into directories
        // and add every file found to the index.
        public static void IndexDocs(IndexWriter writer,
                        System.IO.FileInfo file)
        {
          if (System.IO.Directory.Exists(file.FullName))
          {
            System.String[] files =
                System.IO.Directory.GetFileSystemEntries(file.FullName);
            if (files != null)
            {
              for (int i = 0; i < files.Length; i++)
              {
                IndexDocs(writer, new System.IO.FileInfo(files[i]));
              }
            }
          }
          else
          {
            System.Console.Out.WriteLine("adding " + file);
            writer.AddDocument(IndexDocument(file));
          }
        }

        public static Document IndexDocument(System.IO.FileInfo f)
        {
          // Make a new, empty document
          Document doc = new Document();

          // Add the path of the file as a field named "path".
          // Use a field that is indexed (i.e., searchable), but don't
          // tokenize the field into words.
          doc.Add(new Field("path", f.FullName, Field.Store.YES,
              Field.Index.UN_TOKENIZED));

          // Add the last modified date of the file to a field named
          // "modified". Use a field that is indexed (i.e., searchable),
          // but don't tokenize the field into words.
          doc.Add(new Field("modified",
              DateTools.TimeToString(f.LastWriteTime.Ticks,
                  DateTools.Resolution.MINUTE),
              Field.Store.YES, Field.Index.UN_TOKENIZED));

          // Add the contents of the file to a field named "contents".
          // Specify a reader, so that the text of the file is tokenized
          // and indexed, but not stored. Note that this StreamReader
          // reads the file in the system's default encoding. If the
          // file uses a different encoding, searching for special
          // characters will fail.
          doc.Add(new Field("contents",
              new System.IO.StreamReader(f.FullName,
                  System.Text.Encoding.Default)));

          // Return the document
          return doc;
        }
      }
    }
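
    The "modified" field in Example 4-1 is stored as a single untokenized string at minute resolution. One benefit of a fixed-width date string is that ordinary string comparison matches chronological order. The sketch below illustrates that idea using only the .NET base class library; the class name SortableDate and the "yyyyMMddHHmm" format are assumptions chosen for illustration, and the exact string produced by DateTools.TimeToString() may differ.

```csharp
using System;
using System.Globalization;

public static class SortableDate
{
    // Format a timestamp as a fixed-width string at minute resolution.
    // Seconds and smaller units are dropped. The format is an assumption
    // for illustration, not necessarily what DateTools emits.
    public static string ToMinuteString(DateTime time)
    {
        return time.ToString("yyyyMMddHHmm", CultureInfo.InvariantCulture);
    }
}
```

    With this format, string.CompareOrdinal() on two such strings gives the same ordering as comparing the original DateTime values truncated to the minute.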

    The key Lucene.Net references used in this example application are StandardAnalyzer, IndexWriter, Document, and Field. We’ll take a look at each of these next.

    Understanding analyzers.  An analyzer, combined with a stemmer, plays an important role in Lucene.Net. During indexing, an analyzer and a stemmer take a stream of raw text and break it into searchable terms. In addition, they remove any “noise” from the text (commas, periods, question marks, etc.), as well as common words (“this,” “that,” “then,” “is,” “a,” etc.). Removing noise and common words keeps the index smaller, which in turn speeds up searching.
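
    The steps described above (split text into terms, drop punctuation, drop common words) can be sketched in a few lines of plain C#. This is a toy written for illustration, not Lucene.Net’s StandardAnalyzer; the class name ToyAnalyzer and its five-word stop list are hypothetical, and no stemming is attempted.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ToyAnalyzer
{
    // A tiny, hypothetical stop-word list taken from the examples in
    // the text; real analyzers ship much longer lists.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "this", "that", "then", "is", "a" });

    // Break raw text into lowercase terms, discarding punctuation
    // ("noise") and stop words.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var current = new StringBuilder();
        // Append a trailing space so the final word is flushed too.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                string term = current.ToString();
                current.Clear();
                if (!StopWords.Contains(term))
                {
                    terms.Add(term);
                }
            }
        }
        return terms;
    }
}
```

    Calling ToyAnalyzer.Analyze("Is this a test?") leaves only the term "test"; everything else is either punctuation or a stop word.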

    If you want to index non-English data, you can write your own analyzer and stemmer. However, chances are that someone has already written one that fits the bill and contributed it to Lucene.Net. Currently, stemmers are available for the following languages: Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. These stemmers can be found in the contrib folder of the distribution. If you want to write your own, you can use one of the available analyzers and stemmers as a model.

    Our example application uses the standard analyzer that comes with Lucene.Net.

    Understanding the role of the IndexWriter.  The following line:

      IndexWriter writer =
          new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

    creates or opens an index. This is done through the IndexWriter object. An IndexWriter is used whenever you want to add anything to or delete anything from an index. The first parameter is the path to the index. The second parameter is an analyzer (discussed in the previous section). If you wrote your own analyzer, you will specify it here. The last parameter tells the IndexWriter constructor to create a new index (true) or open an existing one (false).

    Once an index has been created or opened, you’re ready to modify it. In our example, we are indexing a filesystem, which means we will read a folder and the subfolders it contains. As we iterate through the filesystem, the IndexDocs() method passes each file we visit to IndexDocument(), which opens the file as text and constructs what is known as a Lucene.Net Document. IndexDocs() then hands that Document to the writer’s AddDocument() method to be indexed.
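
    The traversal itself is independent of Lucene.Net. As a minimal sketch, the same recursion can be written with a callback in place of the indexing step; the class name FileWalker and the visit parameter are hypothetical names introduced here, and error handling (for example, unreadable folders) is left out.

```csharp
using System;
using System.IO;

public static class FileWalker
{
    // Visit 'path': recurse into directories and call 'visit' once
    // for every file found, mirroring the shape of IndexDocs().
    public static void Walk(string path, Action<string> visit)
    {
        if (Directory.Exists(path))
        {
            foreach (string entry in Directory.GetFileSystemEntries(path))
            {
                Walk(entry, visit);
            }
        }
        else
        {
            visit(path);
        }
    }
}
```

    For instance, FileWalker.Walk(root, p => Console.WriteLine("adding " + p)) prints every file under root; in Example 4-1, the equivalent spot is where writer.AddDocument() is called.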

    Think of a Document as a virtual document that contains metadata: the title, author, publication date, and chapters. For each file you index, a separate Document is created, like so:

      Document doc = new Document();

    This article is excerpted from chapter four of the book Windows Developer Power Tools, written by James Avery and Jim Holmes (O'Reilly; ISBN: 0596527543).

    © 2003-2008 by Developer Shed. All rights reserved.