
Lucene Website Crawler and Indexer

Java Lucene website crawler and indexer

Introduction

This project uses the Java Lucene indexing library to build a compact yet powerful web crawling and indexing solution. There are many powerful open source internet and enterprise search solutions built on Lucene, such as Solr and Nutch. These projects, although excellent, may be overkill for simpler requirements.

Background

A CodeProject article that inspired me in creating this demo was the .NET searcharoo search engine created by craigd. He created a web search engine designed to search entire websites by recursively crawling the links from the home page of the target site. This JSearchEngine project differs from searcharoo in that it uses the Lucene indexer rather than the custom indexer used in searcharoo. Another difference between the projects is that searcharoo has a function that uses Windows document IFilters to parse non-HTML pages. If there is enough interest, I may extend the project to use the document filters from the Nutch web crawler to index PDF and Microsoft Office files.

Using the Code

The solution is made up of two projects, one called JSearchEngine and one called JSP; both were created with the NetBeans IDE version 6.5.

Indexer/Crawler

The JSearchEngine project is the nuts and bolts of the operation. In the main method, the home page of the site to be crawled and indexed is hard-coded. Since it is a command line app, the code can easily be modified to take the home page as a command line parameter (a sketch of this appears after the listing). The main control function for the crawler is shown below; it works as follows:

  1. The indexDocs function is called with the first page as a parameter.
  2. The URL of the first page is used to build a Lucene Document object. The Document is made up of field and value pairs, such as the <title> tag as the field name and the actual title text as the value. This is all taken care of by the Document constructor.
  3. Once the Document has been built, Lucene adds it to its index. The workings of Lucene are outside the scope of this article as they are covered here.
  4. After the document has been indexed, the links from the document are parsed into a string array, and each of those strings is recursively indexed by the indexDocs function. The HTMLParser library from htmlparser.sourceforge.net is used to extract the links.
  5. Only URLs containing the home domain will be followed; this prevents the crawler from following external links and attempting to crawl the internet!
  6. The indexer excludes zip files as it cannot index them.
Java
private static void indexDocs(String url) throws Exception {

    // index the page
    Document doc = HTMLDocument.Document(url);
    System.out.println("adding " + doc.get("path"));
    try {
        indexed.add(doc.get("path"));
        writer.addDocument(doc);          // add docs unconditionally
        //TODO: only add HTML docs
        //and create other doc types

        // get all links on the page, then index them
        LinkParser lp = new LinkParser(url);
        URL[] links = lp.ExtractLinks();

        for (URL l : links) {
            // make sure the URL hasn't already been indexed
            // make sure the URL contains the home domain
            // ignore URLs with a querystring by excluding "?"
            if ((!indexed.contains(l.toURI().toString())) &&
                (l.toURI().toString().contains(beginDomain)) &&
                (!l.toURI().toString().contains("?"))) {
                // don't index zip files
                if (!l.toURI().toString().endsWith(".zip")) {
                    System.out.print(l.toURI().toString());
                    indexDocs(l.toURI().toString());
                }
            }
        }

    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
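
For completeness, here is a minimal sketch of how the surrounding class might look, with the home page taken as a command line parameter rather than hard-coded. This is an illustration, not the article's exact source: it assumes the Lucene 2.x API that was current when the article was written, and the index path and fallback URL are placeholders.

Java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class JSearchEngine {

    // shared state used by indexDocs() above
    private static IndexWriter writer;
    private static List<String> indexed = new ArrayList<String>();
    private static String beginDomain;

    public static void main(String[] args) throws Exception {
        // take the start page from the command line instead of hard-coding it
        String homePage = (args.length > 0) ? args[0] : "http://www.example.com/";
        beginDomain = new URL(homePage).getHost();

        // Lucene 2.x-style constructor; true = create a fresh index
        writer = new IndexWriter("/opt/lucene/index", new StandardAnalyzer(), true);

        indexDocs(homePage);  // crawl recursively from the home page

        writer.optimize();    // merge index segments for faster searching
        writer.close();
    }
}

With this in place, the crawler could be run as, for example, java JSearchEngine http://www.mysite.com/.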

JSP Search Client

Once the target site has been completely indexed, the index can be queried; further sites can also be added to the index before querying. Since the index is Lucene based, it can be queried with any compatible Lucene library, such as the Java or .NET implementation; this demo uses the Java implementation. The JSP project is a set of JavaServer Pages used to search the index and display the results. To run this web app, deploy the compiled .war file on a J2EE compatible server such as GlassFish or Tomcat. The following mark-up is the entry point for the web app; it takes a search term and passes it to the results.jsp page, which queries the index and displays the results:

HTML
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>JSP Search Page</title>
</head>
<body>
    <form name="search" action="results.jsp" method="get">
        <p>
            <input name="query" size="44"/> Search Criteria
        </p>
        <p>
            <input name="maxresults" size="4" value="100"/> Results Per Page
            <input type="submit" value="Search"/>
        </p>
    </form>
</body>

The following is the main Java code from the results page. The variables are initialized with parameters passed from the search page in order to construct a Lucene index searcher: 

Java
String indexName = "/opt/lucene/index";
IndexSearcher searcher = null;
Query query = null;
Hits hits = null;
int startindex = 0;
int maxpage = 50;
String queryString = null;
String startVal = null;
String maxresults = null;
int thispage = 0;

searcher = new IndexSearcher(indexName);
queryString = request.getParameter("query");
Analyzer analyzer = new StandardAnalyzer();
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString);

hits = searcher.search(query);

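The listing above does not show how the paging variables are populated. As a rough sketch, assuming a startat request parameter (a name I have chosen for illustration; maxresults comes from the search form shown earlier), they might be initialized like this:

Java
// hypothetical continuation: derive the paging values from the request
startVal   = request.getParameter("startat");     // assumed parameter name
maxresults = request.getParameter("maxresults");  // from the search form above

try {
    startindex = Integer.parseInt(startVal);      // index of the first hit to show
} catch (NumberFormatException e) {
    startindex = 0;                               // default to the first page
}
try {
    maxpage = Integer.parseInt(maxresults);       // hits per page
} catch (NumberFormatException e) {
    maxpage = 50;                                 // default page size
}

// number of hits actually shown on this page
thispage = Math.min(maxpage, hits.length() - startindex);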
Once the Hits object has been populated with the search results, it is possible to loop through the hits and display them in the page, for example by rendering each hit as a table row that links the document title to its path:

JSP
    for (int i = startindex; i < (thispage + startindex); i++) {  // for each hit on this page
%>
    <tr>
        <%
        Document doc = hits.doc(i);          // get the next document
        String doctitle = doc.get("title");  // get its title
        String url = doc.get("path");        // get its path field
        %>
        <td><a href="<%= url %>"><%= doctitle %></a></td>
    </tr>
<%
    }
%>

Points of Interest  

Since there are two separate projects, they can be mixed and matched with other programming environments that are Lucene compatible; for example, the JSP project could easily be modified to query an index created by Lucene.Net.

Further Information

Both of these projects are described in more detail in a four part series on my blog.

History

  • 31st January, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
tek-dev (Engineer)
Ireland

The author of this article is a web designer and software developer; he is also currently completing a PhD in software engineering in web services development.
