Click here to Skip to main content
15,887,027 members
Articles / Programming Languages / C++

Simple Text Indexer Using SQLite Database

Rate me:
Please Sign up or sign in to vote.
3.93/5 (10 votes)
14 Jun 2006CPOL5 min read 62.2K   1.8K   46   9
Simple Text Indexer Using SQLite Database

Introduction

Text Indexer Using SQLite

SQLite is a powerful, free, open source database. It has an excellent implementation of the B-Tree, which is fast and small. SQLite is not copyrighted and therefore you can use it freely in any commercial projects. This project uses the SQLite’s B-Tree extensively to index text files. Each text document is scanned and the content is tokenized to build a dictionary in the SQLite database. Later searches are conducted against the dictionary and related map tables to retrieve the file names containing the search string.

This article assumes the reader is well versed in C++ and data structures. An in-depth understanding of Windows APIs is also necessary.

Why use B-Tree?

A well balanced B-Tree has several advantages over a Hashtable:

  1. It is faster to insert a new item.
  2. It is fast and easy to search for ‘Nearest Hit’ items. This is very important in building a search dictionary. Searches can be done with partial string, for example to search for the word ‘where’, we can start with the string whe’.

Once the first “Nearest Hit” node is obtained, it is easy to walk through the B-Tree list to obtain the matching nodes.

Design

The indexer consists of a set of B-Trees (Word Dictionary and MAP tables) and a collection of ‘observers’ looking for ‘user input’, ‘file changes’ and ‘system idle time’ to update or query the B-Trees.  Each observer runs in separate threads.

  1. ‘User Input’ observer reads user’s input and queries the indexer for matches
  2. ‘File Change’ observer waits for file change notifications from the observed directories and queue the change information in a list.
  3. ‘System Idle Time’ observer wakes up when the system is idling and processes the queued file change information. This observer is responsible for indexing modified files.

File Change Observer

This observer opens the directory names specified in the ‘settings.ini’ file (located in the same folder as ‘INDEXER.EXE’). Later it creates an IoCompletionPort and waits for file changes. ‘IoCompletionPort’ allows a single thread to observe several directories at the same time. It uses ‘ReadDirectoryChangesW’ API to retrieve the file changes from the system whenever an IoCompletionPort reports a completion status. All the new, modified or deleted files names are put in a file name list.

Class Name: CFileObserver (filechg.h and filechg.cpp)

System Idle Observer

This observer sleeps most of the time and whenever it wakes up it uses the ‘NtQuerySystemInformation’ API exposed by NTDLL.LL to calculate the CPU usage. If the CPU usage is less than 5% and there are file names in the file name list (queued up by the File Change Observer), it will de-queue the file names, will parse and index the content of the files.

Class Name: CIdleObserver (observer.h and observer.cpp)

User Input Observer

This observer waits for user inputs on ‘STDIN’ and queries the indexer to retrieve file names containing the input string. It runs in the main application thread and will continue to run until user presses CTRL+C.

Class Name: CInputObserver (observer.h and observer.cpp)

The Indexer

This class indexes the file content. It uses the dictionary lookup algorithm to index and search words. There is only ONE dictionary in the sample code provided. If you need high performance lookup, you should create an array of dictionaries based on the hash value of a portion of the word (say, the first three characters of the word). In such a case, the dictionary is a Hashtable of B-Trees and should provide fast lookups. See below for the schema design of the indexer.

  1. Dictionary (m_dict of CIndexer class)

    This table maps words to its address. An address is 64bit integer. Every query will look up the address of the word from this table and later uses the address for all other table lookups.

    Word1 Address1
    Word2 Address2
    Word3 Address3

  2. File Map Table (m_file of CIndexer class)

    This table maps file names to FILE IDs. Each FILE ID is a unique 64bit integer.

    File name1 FILE ID1
    File nam2 FILE ID2
    File name3 FILE ID3

  3. File ID to File Name map table (m_fileId of CIndexer class)

    This table is a reverse lookup table for looking up a file name given the FILE ID.

    FILE ID1 File name1
    FILE ID2 File name2
    FILE ID3 File name3

  4. WORD address to FILE ID map table (MAP table) (m_map of CIndexer class)

    This table maps a word address to a set of FILE IDs. Each word address entry is a primary key to a vector record containing the FILE IDs of all the files containing at least one occurrence of the word.

    Address1 FILE ID1,
    FILE ID2
    Address2 FILE ID1
    Address3 FILE ID3

Indexing Algorithm

  • Create an entry in the FILE ID table if the file name is new. Also, create an entry in the reverse lookup table.
  • Parse the content of the file by reading one line at a time and tokenizing the string. Each token is assumed to be a WORD and index the word.
  • Lookup the Dictionary to get the word’s address. If the word is not present in the dictionary, create an entry.
  • Lookup up the MAP table using the word’s address as the primary key to retrieve the vector of FILE IDs. Insert the new FILE ID into the vector and save it in the database.
  • Create a line number and column number entry in the Line number map table.

Searches Algorithm

  • Get the nearest hit word addresses from Dictionary (For example, a search of ‘w’ may result in ‘where’, ‘while’, ‘which’, etc.).
  • For every word address, lookup the MAP table to retrieve the FILE IDs
  • For every FILE IDs, lookup the FILE ID to File Name reverse lookup table to retrieve the file name.
  • Lookup the line number map table to obtain the line number and column number of the word in the file.

Source Code

The source code is arranged as follows.

Directory ‘./sqlite’ contains all the SQLite source code. I have picked only the relevant ‘C’ and ‘H’ files from the SQLite source tree and have modified some of the files to remove the extra dependencies. If you want to use the latest code from the SQLite site (http://www.sqlite.org), you will have to modify the code to make it work.

  • Directory ‘./indexer’ contains all the indexer code
  • All the project files are based on Visual Studio 2005
  • Binary file is created in ./bin/win32 directory
  • Settings file (settins.ini) is located in ./bin/win32 directory
  • The index file is created in %TEMP%/index directory

History

  • 14th June, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
United States United States
I have been writing code for a living for last three hundred years! I have written code in almost all the languages- C/C++, JAVA, C#, VB, Pascal, Delphi, JScript and so on.

Comments and Discussions

 
GeneralCombo use Pin
Anderson Luís19-Jun-06 15:38
Anderson Luís19-Jun-06 15:38 
GeneralRe: Combo use Pin
Chulliyan20-Jun-06 7:46
Chulliyan20-Jun-06 7:46 
GeneralHmmm Thanks Pin
Randor 15-Jun-06 2:20
professional Randor 15-Jun-06 2:20 
GeneralRe: Hmmm Thanks Pin
Neville Franks15-Jun-06 9:20
Neville Franks15-Jun-06 9:20 
GeneralRe: Hmmm Thanks Pin
Quynh Nguyen19-Jun-06 7:01
Quynh Nguyen19-Jun-06 7:01 
GeneralRe: Hmmm Thanks Pin
Chulliyan19-Jun-06 8:32
Chulliyan19-Jun-06 8:32 
GeneralRe: Hmmm Thanks Pin
Neville Franks19-Jun-06 9:45
Neville Franks19-Jun-06 9:45 
GeneralRe: Hmmm Thanks Pin
Randor 22-Jul-06 2:59
professional Randor 22-Jul-06 2:59 
GeneralRe: Hmmm Thanks Pin
Neville Franks22-Jul-06 14:57
Neville Franks22-Jul-06 14:57 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.