Creating a Personal Search Engine
by Sixto Luis Santos

Search facilities have become an expected part of every web site, but offering one is not always possible. For example, if yours is a personal web site that is not always connected to the Internet, or you are in charge of an intranet with confidential information, you may not want to, or may not be able to, use the site indexing capabilities of commercial search engines like Beseen or AltaVista. That is exactly why we set out to implement a simple text search facility with the tools we already have: an ASP-capable web server and the objects available to VBScript.

Our solution is based principally on two of those objects: the FileSystemObject, in charge of retrieving the target pages' text, and the RegExp object, which does the actual search and extracts each document's title. We encapsulated the search functionality within two self-contained procedures to give us flexibility in the search page design. This means that you can change the search page to match the look and feel of your site without requiring major changes in the code.

Figure 1 - Our search engine in action...
Our program relies heavily on the RegExp object. This object allows us to do search, or search-and-replace, operations using regular expressions. A regular expression is a pattern of text that consists of ordinary characters and special characters, known as metacharacters. The pattern describes one or more strings to match when searching a body of text; it serves as a template for matching a character pattern against the string being searched. For more information on regular expressions and scripting technologies in general, please refer to the Microsoft Scripting Technologies web site at http://msdn.microsoft.com/scripting/default.htm.

We begin by creating our starting procedure. This is the procedure we call to start the search process. It takes a single parameter, SearchString, that holds our search criteria.

First, we do a standard instantiation of the objects.

Second, we set up the RegExp objects, and here's where the magic begins. Setting a RegExp's Global property to True instructs the object to find every match of our search pattern. If we set it to False, as is the case for the GetTitle object, the search stops at the first match found. The IgnoreCase property should be self-explanatory: it simply instructs the object to do case-insensitive searches. The Pattern property is where we state the search expression. Note the difference between Regex.Pattern and GetTitle.Pattern in the listing. In the former we just feed in the content of the SearchString parameter as it came from the user, so whatever the user types is interpreted as a regular expression. In the latter we construct a special pattern to match text enclosed in <title> tags. Observe the metacharacters right between <title> and </title>. The parentheses group the two alternatives. The . matches any single character except the newline character, and \n matches the newline character itself (vbLf in VBScript; the carriage return half of a vbCrLf pair is already matched by the .). The pipe character | in between means "or", and the asterisk * means "match zero or more of the preceding group". In summary, this pattern matches anything (any number of characters or newlines) between <title> and </title>. Because * is greedy, a file that contains more than one </title> is matched up to the last one.

Third, we make sure that our path variables contain their trailing slashes, as we will be using these as the base path for our matched documents.

Fourth, we start the actual search process by calling the SearchFiles procedure.

Fifth and last, we display a message if no matches were found and we do some object cleanup. Find below the code for our starting procedure.

Listing 1 - Starting Procedure

<%
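Sub Search(SearchString)
    ' fs, Regex, GetTitle, RootFld, RootFolder, rfLen and MatchedCount are
    ' assumed to be page-level variables declared elsewhere (not shown),
    ' because SearchFiles uses them as well.
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set GetTitle = New RegExp
    Set Regex = New RegExp

    ' Main search: find every occurrence, ignoring case
    With Regex
        .Global = True
        .IgnoreCase = True
        .Pattern = Trim(SearchString)
    End With

    ' Title search: stop at the first <title>...</title> block
    With GetTitle
        .Global = False
        .IgnoreCase = True
        .Pattern = "<title>(.|\n)*</title>"
    End With

    ' Translate the virtual root into a physical path
    RootFolder = Server.MapPath(RootFld)

    ' Make sure both paths end with their separator
    If Right(RootFld, 1) <> "/" Then
        RootFld = RootFld & "/"
    End If

    If Right(RootFolder, 1) <> "\" Then
        RootFolder = RootFolder & "\"
    End If
    ' Position where the path relative to the root begins
    rfLen = Len(RootFolder) + 1

    ' Start the recursive search at the root folder
    SearchFiles RootFolder

    If MatchedCount = 0 Then
        Response.Write " <B>No Matches Found.</b><BR>"
    End If

    ' Release the objects
    Set Regex = Nothing
    Set GetTitle = Nothing
    Set fs = Nothing
End Sub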
%>
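The effect of the Global property described above can be checked in isolation. The following is a minimal sketch; the sample text and the variable names are made up for illustration and are not part of the search page.

```vbscript
' Sketch: how the Global property changes the number of matches returned.
' "sample", "allCount" and "firstCount" are illustrative names only.
Dim re, sample, allCount, firstCount

Set re = New RegExp
re.Pattern = "asp"
re.IgnoreCase = True                  ' match "ASP", "asp" and "Asp" alike

sample = "ASP pages, asp scripts and Asp objects"

re.Global = True                      ' collect every match
allCount = re.Execute(sample).Count   ' 3

re.Global = False                     ' stop after the first match
firstCount = re.Execute(sample).Count ' 1

Set re = Nothing
```

This is why the Regex object uses Global = True (we want a count of all hits per file) while GetTitle uses Global = False (one title is enough).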
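Because Search hands the user's text to Regex.Pattern unchanged, characters such as ( or * are treated as metacharacters, and an unbalanced one makes the pattern invalid. If you would rather treat the input as a literal phrase, one possible approach is to escape it first. EscapePattern below is a hypothetical helper, not part of the original listings, and it deliberately uses only plain string functions.

```vbscript
' Sketch of a hypothetical helper: prefix every regular expression
' metacharacter with a backslash so the text is matched literally.
Function EscapePattern(Text)
    Const MetaChars = "\^$.|?*+()[]{}"
    Dim i, ch, Result

    Result = ""
    For i = 1 To Len(Text)
        ch = Mid(Text, i, 1)
        ' Escape the character if it has a special meaning in a pattern
        If InStr(MetaChars, ch) > 0 Then Result = Result & "\"
        Result = Result & ch
    Next

    EscapePattern = Result
End Function
```

With this in place, the assignment in Search could read .Pattern = EscapePattern(Trim(SearchString)). The trade-off is that users can then no longer type regular expressions of their own.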
The next part of our project is the search engine itself. This engine takes the form of a self-calling procedure, otherwise known as a recursive procedure. We decided to implement the engine recursively to simplify the process of traversing a directory tree. Note that in a recursive procedure, a new and independent set of local variables and objects is created each time it is called.

First, we get the current 'root' folder where files and other folders may exist. Then we iterate through each file in the folder. We compare each file's extension to a global variable (not shown) holding a list of extensions for valid files (e.g. html, asp, txt, etc.). If a match is found, the file is opened to get the text contained inside, and the RegExp search is applied.

If the search returned one or more matches, we then try to get hold of the document's title by executing the GetTitle RegExp search. This, of course, will only return something for HTML and some ASP files. If we find a title, we use it as our results entry text; otherwise we use the file name. Note that we need to strip out the <title> tags ourselves. In version 5.5 of the scripting engine (as found in Windows 2000) a SubMatches collection is available. It returns what was matched by each captured group (a part of the pattern enclosed in parentheses), avoiding the need to trim the match manually. Unfortunately, there's no SubMatches collection in the more popular versions 4 or 5 of the scripting engine.

Anyway, once we have our entry's name, we proceed to construct the line that will be displayed on our results page. We add some miscellaneous (also known as fancy or mostly useless) information to the entry, and do some HTML formatting as we go. Check out the somewhat commented code of the recursive procedure below.

Listing 2 - Recursive Search Procedure

<%
Sub SearchFiles(FolderPath)
    Dim fsFolder
    Dim fsFolder2
    Dim fsFile
    Dim fsText
    Dim FileText
    Dim FileTitle
    Dim FileTitleMatch
    Dim MatchCount
    Dim OutputLine

    ' Get the starting folder
    Set fsFolder = fs.GetFolder(FolderPath)

    ' Iterate through every file in the folder
    For Each fsFile In fsFolder.Files
        ' Compare the last three characters of the file name with the list of valid target files
        If InStr(1, ValidFiles, Right(fsFile.Name, 3), vbTextCompare) > 0 Then
            DocCount = DocCount + 1

            ' Open the file to read its content
            Set fsText = fsFile.OpenAsTextStream
            ' ReadAll raises an error on an empty file, so check for content first
            If fsText.AtEndOfStream Then
                FileText = ""
            Else
                FileText = fsText.ReadAll
            End If

            ' Apply the regex search and get the count of matches found
            MatchCount = Regex.Execute(FileText).Count
            MatchedCount = MatchedCount + MatchCount

            If MatchCount > 0 Then
                DocMatchCount = DocMatchCount + 1

                ' Apply another regex to get the html document's title
                Set FileTitleMatch = GetTitle.Execute(FileText)
                If FileTitleMatch.Count > 0 Then
                    ' Strip the title tags (Mid skips the 7 characters of "<title>")
                    FileTitle = Trim(Replace(Mid(FileTitleMatch.Item(0), 8), "</title>", "", 1, 1, 1))
                    ' In case the title is empty
                    If FileTitle = "" Then
                        FileTitle = "No Title (" & fsFile.Name & ")"
                    End If
                Else
                    ' Create an alternate entry name (if no title found)
                    FileTitle = "No Title (" & fsFile.Name & ")"
                End If

                ' Create the entry line with proper formatting
                ' Add the entry number
                OutputLine = " <b>" & DocMatchCount & ".</B> "
                ' Add the document name and link
                OutputLine = OutputLine & "<A href=" & Chr(34) & RootFld & Replace(Mid(fsFile.Path, rfLen), "\", "/") & Chr(34) & "><B>"
                OutputLine = OutputLine & FileTitle & "</B></a>"
                ' Add the document information
                OutputLine = OutputLine & "<font size=1><br> Criteria matched " & MatchCount & " times - Size: "
                OutputLine = OutputLine & FormatNumber(fsFile.Size / 1024, 2, -1, 0, -1) & "K bytes"
                OutputLine = OutputLine & " - Last Modified: " & FormatDateTime(fsFile.DateLastModified, vbShortDate) & "</Font><br>"

                ' Display entry
                Response.Write OutputLine
                Response.Flush
            End If

            fsText.Close
        End If
    Next

    ' Iterate through each subfolder and recursively call this procedure
    For Each fsFolder2 In fsFolder.SubFolders
        SearchFiles fsFolder2.Path
    Next

    ' Do some object clean-up
    Set FileTitleMatch = Nothing
    Set fsText = Nothing
    Set fsFile = Nothing
    Set fsFolder2 = Nothing
    Set fsFolder = Nothing
End Sub
%>
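For readers who do have version 5.5 of the scripting engine, the SubMatches collection mentioned above removes the manual tag stripping. The sketch below is one way it might look. ExtractTitle is a hypothetical function name, and the pattern uses [\s\S]*? (a non-greedy "any character" run, which also assumes the 5.5 engine) instead of the pattern from Listing 1.

```vbscript
' Sketch, assuming scripting engine 5.5 or later: let a captured group
' return the title text so the <title> tags need not be stripped by hand.
Function ExtractTitle(Html)
    Dim re, matches

    Set re = New RegExp
    re.Global = False        ' the first title is enough
    re.IgnoreCase = True
    ' The parentheses capture everything between the tags;
    ' *? stops at the first </title> rather than the last one
    re.Pattern = "<title>([\s\S]*?)</title>"

    Set matches = re.Execute(Html)
    If matches.Count > 0 Then
        ExtractTitle = Trim(matches(0).SubMatches(0))
    Else
        ExtractTitle = ""    ' no title found; the caller falls back to the file name
    End If

    Set matches = Nothing
    Set re = Nothing
End Function
```

On older engines this code fails on the SubMatches call, which is why Listing 2 trims the match with Mid and Replace instead.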
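Listing 2 decides whether a file is searchable by looking for the last three characters of its name anywhere inside ValidFiles. That works for a list such as the one mentioned above, but it also accepts names that merely end in a fragment of a listed extension. If that matters for your site, a stricter check could look like the sketch below. HasValidExtension is a hypothetical helper, and it assumes the list is comma-separated, which the article does not state.

```vbscript
' Sketch of a hypothetical helper: compare the whole extension against
' a comma-separated list (the list format is an assumption).
Function HasValidExtension(FileName, ValidList)
    Dim dotPos, ext

    HasValidExtension = False

    ' Find the last dot; no dot or a trailing dot means no extension
    dotPos = InStrRev(FileName, ".")
    If dotPos = 0 Or dotPos = Len(FileName) Then Exit Function

    ext = LCase(Mid(FileName, dotPos + 1))

    ' Wrap both sides in commas so only whole entries match
    HasValidExtension = (InStr(1, "," & LCase(ValidList) & ",", "," & ext & ",", vbBinaryCompare) > 0)
End Function
```

A call such as HasValidExtension(fsFile.Name, "html,htm,asp,txt") could then replace the InStr test at the top of the file loop.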
As you can see, it is very easy to create a simple search engine without spending big bucks on third-party solutions. Bear in mind that this is a very simplistic approach to the search engine problem. Aside from the absent-minded nature of this engine (it will match text inside code procedures or inside HTML tags, something not always desirable), a robust solution would index each file in a separate process and store the information in a database for fast retrieval. Even so, the solution presented here is sure to satisfy many web developers in need of a simple search facility, and it certainly demonstrates what can be done with the sometimes neglected tools available in every ASP developer's toolbox.

Feel free to send your comments and suggestions to sixtos@prtc.net (threat mail is strongly discouraged).