Data is everywhere, whether it’s on the Internet, your local system, or networked hard drives. The challenge often isn’t in collecting and organizing your data but in finding it. Businesses collect data in a staggering array of formats, including Microsoft Outlook or Excel files, Access or SQL databases, PDFs, HTML files, plain old text files, and perhaps even custom application formats. That data often then gets scattered across a dizzying number of locations on different servers.
Chances are that your customers will need to deal with disparate data formats and with data stored in multiple locations. Furthermore, they will probably want to be able to exert some control over how searches are performed. Customers may want to be able to limit searches to certain keywords or to a particular set of data folders on a particular server, or to filter out information older than a particular date.
Google Desktop has made a splash by bringing this functionality to end users. Now you have the power to bring the same indexing and searching capabilities into your applications using Lucene.Net, a high-performance, scalable search engine library written in the C# language and utilizing the .NET Framework.
Lucene.Net at a Glance
Tool: Lucene.Net
Version covered: 1.4.3, 1.9, 1.9.1, and 2.0
Home page: http://incubator.apache.org/lucene.net/
Power Tools page: http://www.windevpowertools.com/tools/144
Summary: .NET-based search engine API for indexing and searching contents
License type: Apache License, version 2.0
Online resources: API documentation, mailing list at ASF
Supported Frameworks: .NET 1.1, 2.0
Getting Started
Lucene.Net is an open source project currently under incubation at the Apache Software Foundation (ASF). The source code can be downloaded from the project’s home page as a .zip archive or checked out from the Subversion repository.
Lucene.Net requires a Microsoft C# compiler and version 1.1 or 2.0 of the .NET Framework. It works with either Microsoft Visual Studio 2003 or 2005. The source comes with a solution for Visual Studio 2003.
NUnit is required if you want to run the test code. It can be downloaded from its home page at http://www.nunit.org.
You’ll also need SharpZipLib (discussed later in this chapter) if you want to support compressed indexing in Lucene.Net versions 1.9 and 1.9.1. SharpZipLib can be downloaded from its home page at http://www.icsharpcode.net/OpenSource/SharpZipLib/.
Lucene.Net is not a standalone search engine application. Out of the box, it can’t index and search your data or the Web: it can’t extract or read your binary data (such as Microsoft Office or PDF files), make use of SQL data, or crawl the Web.
Understanding this is the key to appreciating what Lucene.Net does offer: a rich set of APIs that you call first to create a Lucene.Net index and later to search that index. Extracting raw text from your binary data is your job. You have to write the code to read formats such as Microsoft Office files, extract the raw text from the files, and pass this raw text to Lucene.Net, where it can finally be indexed and later searched.
After your raw text data has been indexed, you can use Lucene.Net’s API to search this data. Indexing and searching via Lucene.Net’s APIs is easy and yet very powerful.
A Brief History of Lucene.Net
Lucene.Net’s origins can be traced back to its parent project, Apache Lucene. Apache Lucene is written in Java, is well established as an ASF project, and has solid followers in the open source community. Lucene.Net is a port of Apache Lucene to C# that utilizes the Microsoft .NET Framework, and it preserves the look and feel of Apache Lucene’s API.
If you open any C# file and its corresponding Java file, you’ll see that, with the exception of the naming conventions, the class names and method names are the same. For example, org.apache.lucene.store.FSDirectory.createOutput() in Java becomes Lucene.Net.Store.FSDirectory.CreateOutput() in C#. It’s not only the classes and methods that are ported to C#, though; the Lucene algorithms are ported too, as well as the Lucene index format.
This consistent port offers a number of advantages. First, it means someone familiar with Lucene’s Java implementation will have an easy time reading Lucene.Net’s C# code.
More importantly, it means applications using Lucene.Net can coexist with applications using the Java version. Indexes can be read, modified, and shared between either version. What’s more, both the Java and C# versions can share Lucene’s lock file, so Apache Lucene and Lucene.Net can use the same index concurrently.
Finally, in addition to the C# port of Lucene’s core code, the Lucene test code is also ported to C#. All NUnit tests pass as they do with the Java version. This should give you a high level of confidence in the C# port of the code.
Two groups of APIs make up Lucene.Net: the indexing APIs and the search APIs. You will spend most of your time writing code for the search APIs. However, before you can start searching, you must create indexes.
Indexing is the process of analyzing raw text data and converting it into a format that will allow Lucene.Net to search that data quickly. A Lucene.Net index is optimized for fast random access to all words stored in the index. When you create a Lucene.Net index, you have the option to create multiple fields and store different data in each field. For example, if you are indexing Microsoft Office (Word, Excel, PowerPoint, etc.) files, you can create a field for the filename, a field for the file date, and a field for the body of the document. In this way, at search time, you can narrow your query to only filenames, file dates, or the body of the document, or you can mix two or more fields in the same query and get a search hit.
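The idea of per-field indexing and searching can be sketched in a few lines. This is a minimal illustration rather than code from the Lucene.Net distribution: it assumes the Lucene.Net 1.9 or 2.0 assembly is referenced, builds a throwaway index in memory with RAMDirectory, and uses hypothetical field names (filename, body) and sample values.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

class FieldSketch
{
    static void Main()
    {
        // Build a throwaway index in memory rather than on disk.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Hypothetical values standing in for text extracted from a Word file.
        doc.Add(new Field("filename", "budget.doc",
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("body", "Quarterly budget review for the sales team",
            Field.Store.NO, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(dir);

        // Look for the word only in the "body" field.
        // StandardAnalyzer lowercased the body text at index time.
        Hits hits = searcher.Search(new TermQuery(new Term("body", "budget")));
        Console.WriteLine(hits.Length() + " hit(s) in body");

        // The same word searched in "filename": that field was indexed
        // as the single untokenized term "budget.doc".
        hits = searcher.Search(new TermQuery(new Term("filename", "budget")));
        Console.WriteLine(hits.Length() + " hit(s) in filename");

        searcher.Close();
    }
}
```

Because each query names the field it targets, the same word can match in one field and not in another, which is what lets a search be narrowed to filenames, dates, or body text.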
Example 4-1 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution. This example application shows you how to create an index and populate it with data. It assumes that you have a folder holding several raw text files. If you don’t have such a folder, you’ll need to create one and populate it with some files. In addition, you will need an empty folder where the index will be stored. The example application will create a subfolder called index for this purpose.
Example 4-1. A Lucene.Net command-line sample application to index a filesystem
using System;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using IndexWriter = Lucene.Net.Index.IndexWriter;
using Document = Lucene.Net.Documents.Document;
using Field = Lucene.Net.Documents.Field;
using DateTools = Lucene.Net.Documents.DateTools;

namespace Lucene.Net.Demo
{
    class IndexFiles
    {
        internal static readonly System.IO.FileInfo INDEX_DIR =
            new System.IO.FileInfo("index");

        [STAThread]
        public static void Main(System.String[] args)
        {
            // The folder to index is expected as the first argument.
            if (args.Length == 0)
            {
                System.Console.Out.WriteLine("Usage: IndexFiles <root_directory>");
                System.Environment.Exit(1);
            }

            // Check whether the "index" directory exists.
            // If it does, exit rather than overwrite it;
            // otherwise, IndexWriter will create it.
            bool tmpBool = System.IO.Directory.Exists(INDEX_DIR.FullName);
            if (tmpBool)
            {
                System.Console.Out.WriteLine("Cannot save index to '" + INDEX_DIR +
                    "' directory, please delete it first");
                System.Environment.Exit(1);
            }

            System.IO.FileInfo docDir = new System.IO.FileInfo(args[0]);
            tmpBool = System.IO.Directory.Exists(docDir.FullName);
            if (!tmpBool)
            {
                System.Console.Out.WriteLine("Document directory '" + docDir.FullName +
                    "' does not exist or is not readable, " +
                    "please check the path");
                System.Environment.Exit(1);
            }

            System.DateTime start = System.DateTime.Now;
            try
            {
                IndexWriter writer = new IndexWriter(INDEX_DIR,
                    new StandardAnalyzer(), true);
                System.Console.Out.WriteLine("Indexing to directory '" +
                    INDEX_DIR + "'...");
                IndexDocs(writer, docDir);
                System.Console.Out.WriteLine("Optimizing...");
                writer.Optimize();
                writer.Close();

                // Report elapsed time in milliseconds (Ticks are 100 ns units).
                System.TimeSpan elapsed = System.DateTime.Now - start;
                System.Console.Out.WriteLine(elapsed.TotalMilliseconds +
                    " total milliseconds");
            }
            catch (System.IO.IOException e)
            {
                System.Console.Out.WriteLine(" caught a " + e.GetType() +
                    "\n with message: " + e.Message);
            }
        }

        // Recursively walk a folder, indexing every file found.
        public static void IndexDocs(IndexWriter writer, System.IO.FileInfo file)
        {
            if (System.IO.Directory.Exists(file.FullName))
            {
                System.String[] files =
                    System.IO.Directory.GetFileSystemEntries(file.FullName);
                if (files != null)
                {
                    for (int i = 0; i < files.Length; i++)
                    {
                        IndexDocs(writer, new System.IO.FileInfo(files[i]));
                    }
                }
            }
            else
            {
                System.Console.Out.WriteLine("adding " + file);
                writer.AddDocument(IndexDocument(file));
            }
        }

        public static Document IndexDocument(System.IO.FileInfo f)
        {
            // Make a new, empty document
            Document doc = new Document();

            // Add the path of the file as a field named "path".
            // Use a field that is indexed (i.e., searchable), but don't
            // tokenize the field into words.
            doc.Add(new Field("path", f.FullName,
                Field.Store.YES, Field.Index.UN_TOKENIZED));

            // Add the last modified date of the file to a field named
            // "modified". Use a field that is indexed (i.e., searchable),
            // but don't tokenize the field into words.
            doc.Add(new Field("modified",
                DateTools.TimeToString(f.LastWriteTime.Ticks,
                    DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));

            // Add the contents of the file to a field named "contents".
            // Specify a reader, so that the text of the file is tokenized
            // and indexed, but not stored. Note that this StreamReader
            // assumes the file is in the system's default encoding. If that's
            // not the case, searching for special characters will fail.
            doc.Add(new Field("contents",
                new System.IO.StreamReader(f.FullName,
                    System.Text.Encoding.Default)));

            // Return the document
            return doc;
        }
    }
}
The key Lucene.Net references used in this example application are StandardAnalyzer, IndexWriter, Document, and Field. We’ll take a look at each of these next.
Understanding analyzers. An analyzer, combined with a stemmer, plays an important role in Lucene.Net. During indexing, an analyzer and a stemmer take a stream of raw text and break it into searchable terms. In addition, they remove any “noise” from the text (commas, periods, question marks, etc.), as well as common words (“this,” “that,” “then,” “is,” “a,” etc.). Removing noise and common words greatly speeds up searching.
If you want to index non-English data, you can write your own analyzer and stemmer. However, chances are that someone has already written one that fits the bill and contributed it to Lucene.Net. Currently, stemmers are available for the following languages: Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish. These stemmers can be found in the contrib folder of the distribution. If you want to write your own, you can use one of the available analyzers and stemmers as a model.
Our example application uses the standard analyzer that comes with Lucene.Net.
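To see what the standard analyzer actually does to a piece of text, you can run it by hand and print the terms it produces. The following is a small sketch, not part of the demo code; it assumes the TokenStream, Next(), and TermText() members of the Lucene.Net 1.9 and 2.0 API, and the sample sentence is made up.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class AnalyzerSketch
{
    static void Main()
    {
        Analyzer analyzer = new StandardAnalyzer();

        // The field name is passed along to the analyzer; "contents" matches
        // the field used in the indexing example.
        TokenStream stream = analyzer.TokenStream("contents",
            new StringReader("This is a test: the Quick, brown fox!"));

        // Next() returns null once the stream is exhausted.
        Token token;
        while ((token = stream.Next()) != null)
        {
            Console.WriteLine(token.TermText());
        }
        stream.Close();
    }
}
```

Running a few sample strings through the analyzer this way is a quick check on which words survive as searchable terms and which are dropped as noise or common words.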
Understanding the role of the IndexWriter. The following line:
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
creates or opens an index. This is done through the IndexWriter object. An IndexWriter is used whenever you want to add anything to or delete anything from an index. The first parameter is the path to the index. The second parameter is an analyzer (discussed in the previous section). If you wrote your own analyzer, you will specify it here. The last parameter tells the IndexWriter constructor to create a new index (true) or open an existing one (false).
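If you want to add to an index across several runs instead of rebuilding it each time, the create flag can be chosen at runtime. The sketch below is one possible approach, assuming the static IndexReader.IndexExists() method of the Lucene.Net 1.9 and 2.0 API; OpenWriter is a hypothetical helper name, not part of the library.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

class OpenOrCreateSketch
{
    // Hypothetical helper: append to the index at 'path' if one exists,
    // otherwise create a new one there.
    static IndexWriter OpenWriter(string path)
    {
        bool create = !IndexReader.IndexExists(path);
        return new IndexWriter(path, new StandardAnalyzer(), create);
    }

    static void Main()
    {
        IndexWriter writer = OpenWriter("index");
        // ... call writer.AddDocument(...) for each new document here ...
        writer.Close();
    }
}
```

Passing true for an existing index discards its contents, so deriving the flag from an existence check is a common safeguard.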
Once an index has been created or opened, you’re ready to modify it. In our example, we are indexing a filesystem, which means we will read a folder and the subfolders it contains. As we iterate through the filesystem, the IndexDocs() method passes each file it visits to IndexDocument(), which opens the file as text and constructs what is known as a Lucene.Net Document. The result is then handed to the IndexWriter’s AddDocument() method to be indexed.
Think of a Document as a virtual document that contains metadata: the title, author, publication date, and chapters. For each file you index, a separate Document is created, like so:
Document doc = new Document();
Once you’ve created a Document, you’ll need to add data to it. This is done by creating one or more Fields for each piece of metadata in your file. For example, in the sample application, we created a Field called path that holds the path to the file we are indexing, a Field called modified that holds the date the file was last modified, and a Field called contents that holds the document’s raw text content. You can create more Fields as your application requires. When you create a Field, you can also specify what type of Field it is.
The three Fields in our sample application are added to a Document like so:
doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("modified", DateTools.TimeToString(f.LastWriteTime.Ticks, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("contents", new System.IO.StreamReader(f.FullName, System.Text.Encoding.Default)));
After you’ve populated a Document object with Field objects, you’re ready to add the Document to the index:
writer.AddDocument(IndexDocument(file));
Running the IndexFiles application. From the command line, run the IndexFiles application against the folder you have populated with raw text files. You can also simply point IndexFiles to the Lucene.Net source directory, and IndexFiles will index the Lucene.Net source files for you. To start IndexFiles, issue the following command from the bin directory: IndexFiles C:\Lucene.Net\. As it runs, IndexFiles creates a directory called index in the current directory and stores the index in it.
Searching in Lucene.Net is similar to indexing and offers great functionality. It’s expected that you will spend more time in Lucene.Net’s search APIs than in the indexing ones.
There are several ways you can search your index. You can use Lucene.Net to search one index, or you can search multiple indexes using MultiSearcher. Distributing your data across two or more indexes allows faster searching, better tuning, and greater control.
For example, you can separate your data into date ranges, perhaps creating an index for each month. This will allow you to narrow your search to a particular month’s index or combine multiple months’ indexes. (Obviously, this kind of index creation doesn’t have to be date-related; it can be based on any useful criteria.)
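A month-per-index layout like this might be searched as follows. This is a sketch only: the folder names index-2006-01 and index-2006-02 are hypothetical, each is assumed to hold an index built the same way as in Example 4-1, and the code assumes the MultiSearcher constructor that takes an array of Searchable objects in Lucene.Net 1.9 and 2.0.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

class MultiSearchSketch
{
    static void Main()
    {
        // Hypothetical folders, one index per month.
        Searchable[] months = new Searchable[] {
            new IndexSearcher("index-2006-01"),
            new IndexSearcher("index-2006-02")
        };

        // The MultiSearcher presents the separate indexes as one.
        Searcher searcher = new MultiSearcher(months);

        Hits hits = searcher.Search(
            new TermQuery(new Term("contents", "lucene")));
        Console.WriteLine(hits.Length() + " total matching documents");

        // Closing the MultiSearcher also closes the underlying searchers.
        searcher.Close();
    }
}
```

To restrict a search to a single month, pass only that month’s IndexSearcher in the array, or use the IndexSearcher directly.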
In addition to the MultiSearcher, Lucene.Net also offers the RemoteSearchable capability. With RemoteSearchable, you can rely on Lucene.Net’s web server API to search one or more indexes residing on different servers.
Lucene.Net also gives you the power and flexibility of searching on one or more fields, individually weighting any of your fields, and applying Boolean query criteria such as AND, OR, NOT, NEAR, and DATE_RANGE. What’s more, you can update an index and search it at the same time. Once the index update is done, just close your searcher and reopen it, and your updated data will be available.
Our Lucene.Net example application will show you how to search the index that we created in Example 4-1, where we indexed the filesystem. Example 4-2 shows a slightly modified version of the demo code found in Lucene.Net’s source-code distribution.
Example 4-2. A Lucene.Net command-line sample application to search an index
using System;
using Analyzer = Lucene.Net.Analysis.Analyzer;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using Document = Lucene.Net.Documents.Document;
using QueryParser = Lucene.Net.QueryParsers.QueryParser;
using Hits = Lucene.Net.Search.Hits;
using IndexSearcher = Lucene.Net.Search.IndexSearcher;
using Query = Lucene.Net.Search.Query;
using Searcher = Lucene.Net.Search.Searcher;

namespace Lucene.Net.Demo
{
    class SearchFiles
    {
        [STAThread]
        public static void Main(System.String[] args)
        {
            try
            {
                Searcher searcher = new IndexSearcher(@"index");
                Analyzer analyzer = new StandardAnalyzer();

                // Read queries from standard input, using the
                // system's default encoding
                System.IO.StreamReader streamReader = new System.IO.StreamReader(
                    System.Console.OpenStandardInput(),
                    System.Text.Encoding.Default);

                while (true)
                {
                    System.Console.Out.Write("Query: ");
                    System.String line = streamReader.ReadLine();

                    // Stop at end of input or on an empty line
                    if (line == null || line.Length <= 0)
                        break;

                    // Parse the user's input; terms that don't name a
                    // field are searched in the "contents" field
                    Query query = QueryParser.Parse(line, "contents", analyzer);

                    Hits hits = searcher.Search(query);
                    System.Console.Out.WriteLine(hits.Length() +
                        " total matching documents");

                    // Show the results one page at a time
                    int HITS_PER_PAGE = 10;
                    for (int start = 0; start < hits.Length(); start += HITS_PER_PAGE)
                    {
                        int end = System.Math.Min(hits.Length(), start + HITS_PER_PAGE);
                        for (int i = start; i < end; i++)
                        {
                            Document doc = hits.Doc(i);
                            System.String path = doc.Get("path");
                            if (path != null)
                            {
                                System.Console.Out.WriteLine(i + ". " + path);
                            }
                            else
                            {
                                System.String url = doc.Get("url");
                                if (url != null)
                                {
                                    System.Console.Out.WriteLine(i + ". " + url);
                                    System.Console.Out.WriteLine(" - " + doc.Get("title"));
                                }
                                else
                                {
                                    System.Console.Out.WriteLine(i + ". " +
                                        "No path nor URL for this document");
                                }
                            }
                        }

                        if (hits.Length() > end)
                        {
                            System.Console.Out.Write("more (y/n) ? ");
                            line = streamReader.ReadLine();
                            if (line == null || line.Length <= 0 || line[0] == 'n')
                                break;
                        }
                    }
                }
                searcher.Close();
            }
            catch (System.Exception e)
            {
                System.Console.Out.WriteLine(" caught a " + e.GetType() +
                    "\n with message: " + e.Message);
            }
        }
    }
}
In this example application, the key Lucene.Net references being used are StandardAnalyzer, Document, QueryParser, Hits, IndexSearcher, Query, and Searcher.
Understanding searchers. A Searcher is the front door to your index. Through it, you can search single or multiple indexes located locally on your hard drive or remotely on different machines. The following line:
Searcher searcher = new IndexSearcher(@"index");
creates a Searcher object by instantiating an IndexSearcher. The parameter passed to IndexSearcher is the name of a folder containing an index, expressed as either a full path or a relative path.
Using analyzers in searching. We used analyzers when we created the index. Why do we need them again during searching? During indexing, we used an analyzer to clean up our raw text. The same rules must be applied on the text a user types at the search prompt. Furthermore, the same type of analyzer must be used for searching as for indexing, or the search results will not be correct—or, even worse, no hits may be returned at all.
This line creates the matching analyzer:
Analyzer analyzer = new StandardAnalyzer();
Revisiting documents. We also covered the Document class during indexing. At search time, we use a Document object to hold information about a hit resulting from a search query. The Document object contains the fields and the data in those fields.
In our example application, a reference to a Document object is retrieved like so:
Document doc = hits.Doc(i);
Parsing user input with QueryParser. A QueryParser works hand-in-hand with an analyzer. The job of the QueryParser is to take a user’s query, apply the same rules as the analyzer, and figure out what the user is searching for.
For example, if your search query is +cat +dog, the QueryParser will know that you are searching for both the words cat and dog and that they must be in the same field.
The + option marks a term as a required part of the query.
Lucene.Net supports several such power-search features. You can do a Boolean search using OR, AND, and NOT terms, and you can limit your search to a particular field.
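A quick way to get a feel for this syntax is to parse a few sample queries and print how the parser interpreted them. The sketch below assumes the instance form of the parser (a QueryParser constructed with a default field and an analyzer, then Parse() called on it) from the Lucene.Net 1.9 and 2.0 API; the title field is hypothetical and the sample queries are made up.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class QuerySyntaxSketch
{
    static void Main()
    {
        // "contents" is the default field for terms that don't name one.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

        string[] samples = {
            "+cat +dog",        // both terms required
            "cat OR dog",       // either term
            "cat AND NOT dog",  // cat, excluding documents that contain dog
            "title:lucene"      // restrict the term to a hypothetical "title" field
        };

        foreach (string text in samples)
        {
            Query query = parser.Parse(text);
            // ToString(field) shows the parsed form of the query,
            // omitting the field name for terms in the default field.
            Console.WriteLine(text + "  =>  " + query.ToString("contents"));
        }
    }
}
```

Printing the parsed query is also a handy debugging aid when a search returns unexpected results, because it shows exactly which terms and fields the parser settled on.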
In our example application, a QueryParser is used like so:
Query query = QueryParser.Parse(line, "contents", analyzer);
Here, we pass three parameters to the parser. The first is the string that the user typed (the search query). The second parameter is the name of the default field that we will search. You can specify multiple fields, or no field at all, leaving it up to the user to identify the field to search in. The final parameter is the analyzer.
Working with search hits. A Hits collection is what you get back as a result of running a search query. If your search query returns hits, you use the Hits object to iterate over a list of Document objects.
In our example application, a reference to a Hits object is returned like so:
Hits hits = searcher.Search(query);
Remember that we instantiated a Searcher object and pointed it at our index folder. Now we’re passing it a reference to the Query object discussed previously. This kind of abstraction is what makes Lucene.Net so flexible and powerful; working with an index is consistent, regardless of whether you’re using one or more indexes and whether they’re local or remote. Additionally, the search behavior is consistent, whether you have one query or a combination of queries.
Running the SearchFiles application. When you’re ready to run the application, move to the folder where the index was created during indexing. Once you are in that folder, run the SearchFiles application by just typing its name (using the fully qualified pathname if you haven’t copied it to the same directory as the indexes).
Getting Support
Since Lucene.Net is an open source project in incubation at the ASF, support for it is provided through its mailing list, which is listed on the project’s home page. Subscribe to the mailing list and post your questions there. Questions are answered in a timely fashion, and the community is looking to grow.
Lucene.Net in a Nutshell
Lucene.Net is a powerful, fast, and feature-rich search engine. In addition, it is open source, is in incubation at the ASF, and has a support community.
Today, Lucene.Net is being used to index and search filesystems, email data, web pages, and even source code. What’s more, Lucene.Net is being used in commercial applications as a web service search engine, as an embedded search engine for Outlook, and as a desktop search engine for Novell Linux via the Mono compiler.
As applications become more and more complex and generate more and more data, the addition of a search feature is becoming a logical solution. Lucene.Net’s APIs make it possible to integrate powerful search capabilities into your applications. What’s more, Lucene.Net provides the means to scale; supports different languages; and is cross compatible with Apache Lucene at the API, algorithmic, and index levels.
—George Aroush, committer for Lucene.Net