Working with Lucene.Net - Creating an index
Indexing is the process of analyzing raw text data and converting it into a format that will allow Lucene.Net to search that data quickly. A Lucene.Net index is optimized for fast random access to all words stored in the index. When you create a Lucene.Net index, you have the option to create multiple fields and store different data in each field. For example, if you are indexing Microsoft Office (Word, Excel, PowerPoint, etc.) files, you can create a field for the filename, a field for the file date, and a field for the body of the document. In this way, at search time, you can narrow your query to only filenames, file dates, or the body of the document, or you can combine two or more fields in a single query.
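To make the idea of fields concrete, here is a minimal sketch of a fielded index in plain C#. It is a toy model for illustration only and does not reflect how Lucene.Net stores its data; the class name TinyFieldIndex and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a fielded index: NOT Lucene.Net's implementation.
// Maps "field:term" to the set of document ids that contain that term.
public class TinyFieldIndex
{
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    // Split the text on whitespace and record each lowercased term under the field.
    public void Add(int docId, string field, string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        foreach (string term in text.ToLowerInvariant().Split(
            separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string key = field + ":" + term;
            HashSet<int> docs;
            if (!postings.TryGetValue(key, out docs))
            {
                docs = new HashSet<int>();
                postings[key] = docs;
            }
            docs.Add(docId);
        }
    }

    // Return the ids of documents whose given field contains the term.
    public ISet<int> Search(string field, string term)
    {
        HashSet<int> docs;
        return postings.TryGetValue(field + ":" + term.ToLowerInvariant(), out docs)
            ? new HashSet<int>(docs)
            : new HashSet<int>();
    }
}

public static class TinyFieldIndexDemo
{
    public static void Main()
    {
        var index = new TinyFieldIndex();
        index.Add(1, "filename", "budget.xls");
        index.Add(1, "body", "quarterly sales figures");
        index.Add(2, "filename", "sales.doc");
        index.Add(2, "body", "meeting notes about the budget");

        // Narrow the query to one field: only document 2 mentions "budget" in its body.
        Console.WriteLine(string.Join(",", index.Search("body", "budget")));

        // Combine two fields: documents that match in both.
        ISet<int> both = index.Search("body", "sales");
        both.IntersectWith(index.Search("filename", "budget.xls"));
        Console.WriteLine(string.Join(",", both));
    }
}
```

Keying each term by its field name is what lets the same word be found in one field and ignored in another.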
Example 4-1 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution. This example application shows you how to create an index and populate it with data. It assumes that you have a folder holding several raw text files. If you don’t have such a folder, you’ll need to create one and populate it with some files. In addition, you will need a location where the index will be stored. The example application creates a folder called index in the current working directory for this purpose, and it exits if that folder already exists.
Example 4-1. A Lucene.Net command-line sample application to index a filesystem
using System;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using IndexWriter = Lucene.Net.Index.IndexWriter;
using Document = Lucene.Net.Documents.Document;
using Field = Lucene.Net.Documents.Field;
using DateTools = Lucene.Net.Documents.DateTools;

namespace Lucene.Net.Demo
{
    class IndexFiles
    {
        internal static readonly System.IO.FileInfo INDEX_DIR =
            new System.IO.FileInfo("index");

        [STAThread]
        public static void Main(System.String[] args)
        {
            System.String usage = typeof(IndexFiles) + " <root_directory>";
            if (args.Length == 0)
            {
                System.Console.Error.WriteLine("Usage: " + usage);
                System.Environment.Exit(1);
            }

            // Check whether the "index" directory exists.
            // If it does, exit; otherwise, the IndexWriter creates it below.
            bool tmpBool = System.IO.Directory.Exists(INDEX_DIR.FullName);
            if (tmpBool)
            {
                System.Console.Out.WriteLine("Cannot save index to '" +
                    INDEX_DIR + "' directory, please delete it first");
                System.Environment.Exit(1);
            }

            System.IO.FileInfo docDir = new System.IO.FileInfo(args[0]);
            tmpBool = System.IO.Directory.Exists(docDir.FullName);
            if (!tmpBool)
            {
                System.Console.Out.WriteLine("Document directory '" +
                    docDir.FullName + "' does not exist or is not readable, " +
                    "please check the path");
                System.Environment.Exit(1);
            }

            System.DateTime start = System.DateTime.Now;
            try
            {
                IndexWriter writer =
                    new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
                System.Console.Out.WriteLine("Indexing to directory '" +
                    INDEX_DIR + "'...");
                IndexDocs(writer, docDir);
                System.Console.Out.WriteLine("Optimizing...");
                writer.Optimize();
                writer.Close();

                // Ticks are 100-nanosecond units, so report the elapsed
                // time through TimeSpan.TotalMilliseconds instead.
                System.DateTime end = System.DateTime.Now;
                System.Console.Out.WriteLine((end - start).TotalMilliseconds +
                    " total milliseconds");
            }
            catch (System.IO.IOException e)
            {
                System.Console.Out.WriteLine(" caught a " + e.GetType() +
                    "\n with message: " + e.Message);
            }
        }

        public static void IndexDocs(IndexWriter writer,
            System.IO.FileInfo file)
        {
            // A directory is walked recursively; anything else is indexed.
            if (System.IO.Directory.Exists(file.FullName))
            {
                System.String[] files =
                    System.IO.Directory.GetFileSystemEntries(file.FullName);
                if (files != null)
                {
                    for (int i = 0; i < files.Length; i++)
                    {
                        IndexDocs(writer, new System.IO.FileInfo(files[i]));
                    }
                }
            }
            else
            {
                System.Console.Out.WriteLine("adding " + file);
                writer.AddDocument(IndexDocument(file));
            }
        }

        public static Document IndexDocument(System.IO.FileInfo f)
        {
            // Make a new, empty document
            Document doc = new Document();

            // Add the path of the file as a field named "path".
            // Use a field that is indexed (i.e., searchable), but don't
            // tokenize the field into words.
            doc.Add(new Field("path", f.FullName, Field.Store.YES,
                Field.Index.UN_TOKENIZED));

            // Add the last modified date of the file to a field named
            // "modified". Use a field that is indexed (i.e., searchable),
            // but don't tokenize the field into words.
            doc.Add(new Field("modified",
                DateTools.TimeToString(f.LastWriteTime.Ticks,
                    DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));

            // Add the contents of the file to a field named "contents".
            // Specify a reader, so that the text of the file is tokenized
            // and indexed, but not stored. Note that the StreamReader is
            // told to use the system's default encoding. If the file uses
            // a different one, searching for special characters will fail.
            doc.Add(new Field("contents",
                new System.IO.StreamReader(f.FullName,
                    System.Text.Encoding.Default)));

            // Return the document
            return doc;
        }
    }
}
The key Lucene.Net references used in this example application are StandardAnalyzer, IndexWriter, Document, and Field. We’ll take a look at each of these next.
Understanding analyzers. An analyzer, combined with a stemmer, plays an important role in Lucene.Net. During indexing, an analyzer and a stemmer take a stream of raw text and break it into searchable terms. In addition, they remove any “noise” from the text (commas, periods, question marks, etc.), as well as common words (“this,” “that,” “then,” “is,” “a,” etc.), and the stemmer reduces each remaining word to its root form so that variants such as “index” and “indexing” match the same term. Removing noise and common words keeps the index smaller, which speeds up searching.
If you want to index non-English data, you can write your own analyzer and stemmer. However, chances are that someone has already written one that fits the bill and contributed it to Lucene.Net. Currently, the following stemmers are supported: Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. These stemmers can be found in the contrib folder of the distribution. If you want to write your own, you can use one of the available analyzers and stemmers as a model.
Our example application uses the standard analyzer that comes with Lucene.Net.
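The analysis steps described above can be sketched in a few lines of plain C#. This is a simplified illustration, not the StandardAnalyzer or any stemmer shipped with Lucene.Net: the class name ToyAnalyzer, its short stop-word list, and its crude suffix-stripping rule are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy analysis pipeline; NOT Lucene.Net's StandardAnalyzer or a real stemmer.
public static class ToyAnalyzer
{
    // Tiny illustrative stop-word list (an assumption, not Lucene.Net's list).
    private static readonly HashSet<string> StopWords = new HashSet<string>
        { "this", "that", "then", "is", "a", "the" };

    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var current = new StringBuilder();

        // Append a trailing space so the last token is flushed by the loop.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
                continue;
            }

            // Any other character (punctuation, whitespace) ends the token.
            if (current.Length > 0)
            {
                string token = current.ToString();
                current.Clear();
                if (!StopWords.Contains(token))
                {
                    terms.Add(Stem(token));
                }
            }
        }
        return terms;
    }

    // Crude suffix stripping: drops a trailing "ing" or "s" from longer words.
    private static string Stem(string token)
    {
        if (token.Length > 5 && token.EndsWith("ing", StringComparison.Ordinal))
            return token.Substring(0, token.Length - 3);
        if (token.Length > 3 && token.EndsWith("s", StringComparison.Ordinal))
            return token.Substring(0, token.Length - 1);
        return token;
    }

    public static void Main()
    {
        List<string> terms = Analyze("Is this a test? Indexing files, then searching.");
        Console.WriteLine(string.Join(" ", terms)); // prints "test index file search"
    }
}
```

Real stemmers use language-specific rules far more careful than this one, which is why the article points to the contributed implementations rather than hand-rolled code.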
Understanding the role of the IndexWriter. The following line:
IndexWriter writer =
    new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
creates or opens an index. This is done through the IndexWriter object. An IndexWriter is used whenever you want to add anything to or delete anything from an index. The first parameter is the path to the index. The second parameter is an analyzer (discussed in the previous section). If you wrote your own analyzer, you will specify it here. The last parameter tells the IndexWriter constructor to create a new index (true) or open an existing one (false).
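Example 4-1 always passes true and refuses to run if the index folder already exists. If you would rather add to an existing index, one possible approach is to choose the flag at runtime. The sketch below assumes a hypothetical helper, ShouldCreate(), that is not part of Lucene.Net; it only inspects the folder and does not verify that a non-empty folder actually holds a valid index.

```csharp
using System;
using System.IO;

public static class IndexMode
{
    // Returns true when the IndexWriter should be asked to create a new index,
    // i.e. when the folder is missing or empty.
    // Hypothetical helper for illustration; not a Lucene.Net API.
    public static bool ShouldCreate(string indexPath)
    {
        return !Directory.Exists(indexPath)
            || Directory.GetFileSystemEntries(indexPath).Length == 0;
    }

    public static void Main()
    {
        bool create = ShouldCreate("index");
        Console.WriteLine(create ? "create a new index" : "open the existing index");

        // With Lucene.Net referenced, the flag would be passed as the third
        // constructor argument shown in the article:
        // IndexWriter writer =
        //     new IndexWriter(INDEX_DIR, new StandardAnalyzer(), create);
    }
}
```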
Once an index has been created or opened, you’re ready to modify it. In our example, we are indexing a filesystem, which means we will read a folder and the subfolders it contains. As we iterate through the filesystem, the IndexDocs() method visits each file and passes it to IndexDocument(), which opens the file as text and constructs what is known as a Lucene.Net Document. IndexDocs() then hands that Document to the writer’s AddDocument() method, which adds it to the index.
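The traversal itself does not depend on Lucene.Net. As a rough sketch of the same recursion, the hypothetical FileWalker class below yields every file under a root folder using only System.IO; in the real example, each yielded path is where the Document would be built and added to the index.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FileWalker
{
    // Recursively yield every file under the given path, mirroring the shape
    // of IndexDocs(): directories are descended into, files are returned.
    public static IEnumerable<string> Walk(string path)
    {
        if (Directory.Exists(path))
        {
            foreach (string entry in Directory.GetFileSystemEntries(path))
            {
                foreach (string file in Walk(entry))
                {
                    yield return file;
                }
            }
        }
        else if (File.Exists(path))
        {
            yield return path;
        }
    }

    public static void Main(string[] args)
    {
        // Walk the folder given on the command line, or the current folder.
        string root = args.Length > 0 ? args[0] : ".";
        foreach (string file in Walk(root))
        {
            Console.WriteLine("adding " + file);
        }
    }
}
```

Returning a lazy sequence instead of indexing inside the recursion keeps the walk separate from the indexing step, which makes each part easier to test on its own.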
Think of a Document as a virtual document that contains metadata: the title, author, publication date, and chapters. For each file you index, a separate Document is created, like so:
Document doc = new Document();
This article is excerpted from chapter four of the book Windows Developer Power Tools, written by James Avery and Jim Holmes (O'Reilly; ISBN: 0596527543). Check it out today at your favorite bookstore. Buy this book now.