Click here to Skip to main content
15,886,873 members
Articles / Programming Languages / C#
Article

Word Search Engine

Rate me:
Please Sign up or sign in to vote.
3.17/5 (15 votes)
1 May 2006CPOL4 min read 77.4K   2K   44   4
This article describes a new Word Search Engine for your application or website.

Sample Image

Introduction

If you need a word search engine for your website or for your application, you must create it, or send your data to One Internet Search engine and send money for indexing your site, but this application can index your text/HTML/ASP files and save the words in its database to search in future.

This application use the SQLite database as its DBMS. This is an Open Source DBMS, and can be used freely. For more information, you can visit this site.

For connecting from .NET code to SQLite, use the .NET wrapper from Finisar Corporation. This wrapper makes the ADO.NET libraries for SQLite, and it is very user-friendly. Its DLL can be download from SourceForge.

C#
using Finisar.SQLite;

Database

You can see the E/R chain diagram of the database in this picture:

Image 2

This diagram has a relation m<-> n, this relation should be converted to a table (index table) named Word_Page.

Word_Page
WidPidQTY

Create database (class DBWSE)

I think an application is complete if it can create database in code and its executable file does not include an empty database. For creating the required database, we can use the CREATE SQL commands.

C#
private void CreateDataBase(string path)
{
    try
    {
        FileStream fi = File.Create(path);
        fi.Close();
        DBFile = path;
        OpenDataBase();
        string strSQL = "Create Table words (" +
        "WID INTEGER PRIMARY KEY ," +
        "Word NVarChar(50)" +
        ")";
        SQLiteCommand sqd = new SQLiteCommand(strSQL, sqconn);
        sqd.ExecuteNonQuery();
        sqd.CommandText ="Create Table pages(" +
        "PID INTEGER PRIMARY KEY ," +
        "path NVarChar(100) NOT NULL," +
        "nofw INTEGER NOT NULL," +
        "Date_add NVARChar(10) NOT NULL" +
        ")";
        sqd.ExecuteNonQuery();
        sqd.CommandText ="Create Table word_page(" +
        "WID INTEGER FORIGEN KEY REFERENCES words (wid)," +
        "PID INTEGER FORIGEN KEY REFERENCES pages (pid)," +
        "QTY INTEGER NOT NULL," +
        "PRIMARY KEY (WID,PID)" +
        ")";
        sqd.ExecuteNonQuery();
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

This function creates an empty database with this file.

Add to Word, Page, and Word_Page tables

Now we must send data to the database to be saved. I think to get more speed, we must use SQL commands; this means, "we must use SQL INSERT commands".

C#
public int AddPage(string path, int nofw)
{
    int i = PID(path);
    if (i < 0)
    {
        // the page must be add 
        string strSQL = "INSERT INTO pages (path,nofw,Date_add)" +
            "VALUES ('" + path + "'," + nofw + ",'" + 
            DateTime.Now.ToShortDateString() + "')";
        SQLiteCommand sqc = new SQLiteCommand(strSQL, sqconn);
        sqc.CommandText = strSQL;
        sqc.ExecuteNonQuery();
        //return PID
        i = PID(path);
    }
    return i;
}

Next function calls the "PID function" to find the PID (page ID) of the new page. Moreover, now we add this page data to the database.

C#
private int createWord(string word)
{
    string strsql = "INSERT INTO words (word) VALUES ('" + word + "')";
    SQLiteCommand sqc = new SQLiteCommand(strsql, sqconn);
    sqc.ExecuteNonQuery();
    return WID(word);
}

This function adds a word to the "Word" table.

C#
public void AddWord(int page, string word, int QTY)
{
    int i = WID(word);
    if (i < 0)
        i = createWord(word);
    string strsql = "INSERT INTO word_page (WID,PID,QTY) VALUES " +
        "(" + i.ToString() + "," + page.ToString() + 
        "," + QTY.ToString() + ")";
    SQLiteCommand sqc = new SQLiteCommand(strsql, sqconn);
    sqc.ExecuteNonQuery();
}

In this solution, a word cannot be repeated. For this, check if WID finds a word, if yes, return the ID of the word, otherwise return -1. Moreover, we must add the word to the Words table. In fact, this function saves the index of the page without repeating any of data.

Search in the database

Now, we can search the database for our word.

C#
public DataTable SearchWord(string word)
{
    string strSQL = "SELECT pages.path,pages.nofw,pages.Date_add,word_page.QTY " +
        "FROM words INNER JOIN word_page ON words.wid=word_page.wid" +
        " INNER JOIN pages ON word_page.pid=pages.pid" +
        " WHERE words.word='"+word+"'";
    SQLiteDataAdapter sqd = new SQLiteDataAdapter(strSQL, sqconn);
    DataTable dt = new DataTable();
    sqd.Fill(dt);
    return dt;
}

This function returns a DataTable for use in our application.

Create data for saving in the database (index pages)

Now, you have a good database to save pages and its words, but you must create data for this database. This means that you must index the pages!

Initially, this work would be very easy! Because, you can divide a paragraph to words using a splitter. An example is: space , ',' , ')' , '(' , '[' ,'\' and etc., but this solution cannot be very precise.

You must remove trivial and repetitive words, and then you should start indexing.

Split paragraphs

C#
string[] split = words.Split (new Char[] { ' ', ',', '.', ':',';','{','}','[',
     ']','(',')','\'','"','\\','<','>','=','+','-','/','*','#','&','%','^',
     '`','~','0','1','2','3','4','5','6','7','8','9' });

Create a list of words (class ScanFile)

C#
private void StartMonitoring(string p)
{
    //first step must read file
    StreamReader stR = File.OpenText(p);
    string words = stR.ReadToEnd();
    stR.Close();
    string[] split = words.Split(new Char[] { ' ', ',', '.', ':',';','{','}','[',
        ']','(',')','\'','"','\\','<','>','=','+','-','/','*','#','&','%','^',
        '`','~','0','1','2','3','4','5','6','7','8','9' });
    max = split.Length;
    int index;
    int k = 0;
    list1.Clear();
    for (int i = 0; i < split.Length; i++)
    {
        WordInfo word = new WordInfo();
        word.Word = split[i].Trim();
        if (word.Word.Length > 2)
        {
            SearchableWord = word.Word;
            index = list1.FindIndex(HaveWord);
            if (index < 0)//not found
            {
                    word++;
                    list1.Add(word);
                    k++;
            }
            else
            {
                // increment count of word
                list1[index]++;
                k++;
            }
        }
            // Progress( this is a event )
        OnProgress(System.EventArgs.Empty);
    }
    total = k; 
}
private static bool HaveWord(WordInfo str)
{
    if (str.Word.ToUpper() == SearchableWord.ToUpper())
    return true;
    else
    return false;
}

In the function, I use the WordInfo class and the List <Word> class, and create a new event "OnProgress" to use when scanning files.

C#
SCF.Progress += new EventHandler(SCF_Progress);

To find a word in List, we must define a delegate:

C#
private static bool HaveWord(WordInfo str)

However, we cannot find a word by this argument. To solve the problem, I used a private variable for comparing strings:

C#
private static string SearchableWord;

Trivial word (class RepluseTrivial)

To Repulse Trivial words, you must have a list of those words. I think by saving the words in the database, we can quickly access the target word by using a SQL command. For this, create a table of trivial words using a SQL command and insert/delete trivial words in this table using SQL commands.

When creating an object of the RepluseTrivial class, you must send a list of words to this object and call the Repulse function to remove the trivial word.

C#
public void Repluse()
{
    List<WordInfo> temp = new List<WordInfo>();
    for (int i = 0; i < list1.Count; i++)
        if (!IsTrivial(list1[i].Word))
            temp.Add(list1[i]);
    list1 = temp;
}

This function calls the IsTrivial function to checks for trivial words.

Using the Code

Now you have a list of non-trivial words and a good database to save these words by using the said algorithm, so search by the "search function". However, this function can be developed as an engine that can quietly search such as the Google engine and Microsoft engine etc. (but not equal in performance).

We can show the progress using a progress bar:

C#
public Form1()
{
    InitializeComponent();
    SCF = new ScanFile();
    SCF.Progress += new EventHandler(SCF_Progress);
}

void SCF_Progress(object sender, EventArgs e)
{
    progressBar1.Maximum = SCF.MaxProgress;
    progressBar1.Value++;
}

To create a list of non-trivial words:

C#
private void button2_Click(object sender, EventArgs e)
{
    progressBar1.Value = 0;
    SCF.Scan(textBox1.Text);
    label2.Text = "Total Word:"+SCF.Total.ToString();
    List<WordInfo> list1=new List<WordInfo>(); 
    list1 = SCF.WordList;
    string str = 
      Path.GetDirectoryName(Application.ExecutablePath)+ 
      "\\Trivial.db3";
    RepluseTrivial rt = new RepluseTrivial(str);
    list1 = rt.Repluse(list1);
    listView1.Items.Clear();
    for (int i = 0; i < list1.Count; i++)
    {
        ListViewItem li = new ListViewItem(new string[] { i.ToString(), 
                          list1[i].Count.ToString(), list1[i].Word });
        listView1.Items.Add(li);
    }
}

Now save the list to the database:

C#
private void button3_Click(object sender, EventArgs e)
{
    string str = Path.GetDirectoryName(Application.ExecutablePath);
    DBWSE db = new DBWSE(str+""\\WSEDB.db3");
    int i=db.AddPage(textBox1.Text,SCF.Total);
    progressBar1.Maximum = SCF.WordList.Count;
    progressBar1.Value = 0;
    for (int j=0;j<SCF.WordList.Count;j++)
    {
        db.AddWord(i, SCF.WordList[j].Word, SCF.WordList[j].Count);
        progressBar1.Value++;
    }
}

You can search and show the results in a DataGridView very easily:

C#
private void button7_Click(object sender, EventArgs e)
{
    string str = Path.GetDirectoryName(Application.ExecutablePath);
    DBWSE db = new DBWSE(str + "\\WSEDB.db3");
    dataGridView2.DataSource = db.SearchWord(textBox3.Text);
}

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Austria Austria
I am Mohsen Ahmadian,
I was born in Tehran (capital city of Iran), I live in Vienna
I am a freelancer and a web developer
I Love Codeproject and C++, Java, C#, PHP
if you found my articles are good, buy a coffee for me
https://www.buymeacoffee.com/mohsenahmadian

Comments and Discussions

 
GeneralI have a exception [modified] Pin
vinh2b3-Dec-06 21:46
vinh2b3-Dec-06 21:46 
GeneralI like the ideas you've shared! Pin
AbuseByUnkindPeople28-May-06 0:17
AbuseByUnkindPeople28-May-06 0:17 
You could do better than Lucene and since Sequal express is limited (not suppoorting full text search) your article is definately worthy.

GeneralAlternative approaches Pin
rmcubedmm9-May-06 1:44
rmcubedmm9-May-06 1:44 
SuggestionRe: Alternative approaches Pin
ray pixar16-Aug-11 0:17
ray pixar16-Aug-11 0:17 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.