Click here to Skip to main content
15,881,089 members
Articles / Database Development / SQL Server

Full-Text Indexing Files with Microsoft SQL Server

Rate me:
Please Sign up or sign in to vote.
3.75/5 (9 votes)
2 Feb 2007CPOL6 min read 140.3K   52   17
Indexing Word, Excel and other types of documents is easy with SQL Server

Introduction

This article is about the full-text search capability of SQL Server 2000 (2005). It is an easy to use, very fast and extensible solution to index and search in various types of documents' content. For example in Word, Excel, Adobe portable document format (PDF) and HTML files.

Prerequisites

To go through with this example, you will need Microsoft SQL Server 2000 Server (at least) access, a database with DB owner right, and of course the client tools.

Creating a Table

In order to index files stored in a database table, we have to create two table fields. In the first field, we will store the content of the document in binary format, in the second we will store the extension of the file, for example ".doc" or ".xls".
It's a good idea to store the full name and the size information of the files, because probably you'll need these in real situations, but you won't need them for full-text indexing.

Storing the sizes of the files can be optional, because you can query them with the DataLength function, but this could take a long time, much longer than reading these values from a field.

So our create table script can be this:

SQL
CREATE TABLE [dbo].[Doc] (
 [ID]  uniqueidentifier ROWGUIDCOL  NOT NULL ,
 [Extension] [varchar] (10) NOT NULL ,
 [Content] [image] NOT NULL ,
 [FileSize] [int] NOT NULL ,
 [FileName] [nvarchar] (500) NOT NULL ,
 [Stamp] [timestamp] NOT NULL 
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Doc] WITH NOCHECK ADD 
 CONSTRAINT [PK_Doc] PRIMARY KEY  CLUSTERED 
 (
  [ID]
 )  ON [PRIMARY] 
GO

ALTER TABLE [dbo].[Doc] ADD 
 CONSTRAINT [DF_Doc_ID] DEFAULT (newid()) FOR [ID]
GO

A simple idea is to use a unique identifier type as the primary key of the document table. It can be useful in many cases if you are a web developer. Maybe you would like to use this id in the URL (query string) when you make a download page, and you don't want to give the chance to your users to download all the documents without seeing all your advertisements, etc. by simply increasing a parameter at the end of the URL.

Please notice that a primary key has been created by the script. It is always very important because this can guarantee that you can't insert two records with the same id into the database (a table is a set of records, you can't make the difference between records with the same values, except with cursors), and a good clustered index can really improve the performance of the SELECT statements.

Also, there is a timestamp field – called stamp – on the table. It will be necessary for us and we will talk about it later.

Creating Full-text Index

I think nobody knows the proper format of the full-text index creation statement by head. So, we will create this index from Microsoft SQL Server Enterprise Manager.
First choose your server, then your database. Choose tables then on the right, search for the table you've created (Doc).
After right clicking on the table, choose "Define Full-Text indexing on a table" submenu from "Full-Text Index table" menu.

Image 1

After these, a wizard will appear.
In the "Select an index step", choose your primary key. Later, when we query the table, we will get back this (unique) index's values from the full-text search engine as a result.

In the "Select Table Columns", step mark the "Content" column, and in that row in the document type column, choose the extension field. Then step away to another row, because the Next button will be activated only after you change the selection.

Image 2

In this step, we gave the indexable column to the server and also defined the types of each document.

In the "Select a Catalog" step, you can create a new full-text catalog or choose an existing one. I offer that you make a new catalog for these data, because the collected indexes can be huge if the count of the documents increase.

Image 3

In the "Select or Create Population Schedules", you can schedule the incremental and the full population of the catalog or the table. You can leave it empty in this case, because we will use the change tracking feature of the SQL Server. It means, we tell the SQL server to update its full-text catalog "immediately" after an insert or an update occurs.
In real situations, it is generally offered to make a full population every week or daily and an incremental daily or more often, but it depends on your server utilization.
The cause is that the change tracking and incremental update feature doesn't recognize if someone uses WriteText or UpdateText statements.

When we designed the table, we've put a timestamp field into it. The timestamp field is needed for incremental index updates, without that field all the incremental updates do full updates. The full-text index engine can find the updates on this property change. This is because the WriteText and UpdateText statements are not noticed by the full-text engine. These statements don't update the timestamp field like normal Insert and Update do.

So finish the wizard, by pressing the Next button few times and then Finish.

After the full-text index has been defined, we have to turn the change tracking functionality on. We can do it by right clicking on the table name in Enterprise Manager and then choose Change Tracking then the "Update index in background" from Full-Text Index Table menu. From this time, the full-text index engine will watch our table for changes and update its index if necessary.

An interesting result of our index creation:

SQL
if (select DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')) <> 1 
exec sp_fulltext_database N'enable' 
GO

if not exists (select * from dbo.sysfulltextcatalogs where name = N'Doc')
exec sp_fulltext_catalog N'Doc', N'create' 
GO

exec sp_fulltext_table N'[dbo].[Doc]', N'create', N'Doc', N'PK_Doc'
GO

exec sp_fulltext_column N'[dbo].[Doc]', N'Content', N'add', 1033, N'Extension' 
GO

exec sp_fulltext_table N'[dbo].[Doc]', N'activate'  
GO

Testing

Now we insert some records and then we create a few select queries. Let's run this insert script in Query Analyzer:

SQL
INSERT INTO [DOC] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'Hello John! It''s me: Garfield!', 30, 'Cartoon1.txt')

INSERT INTO [Doc] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'Oh my god!', 30, 'Shout.txt')

INSERT INTO [DOC] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'NOWAN's web site: <a href="%22http://www.nowan.hu/'%22">http://www.nowan.hu/'</a>, 30, 'nowan.txt')

These are simple inserts, you can examine them yourself. After you inserted the records, you can run a SELECT statement to check the table:

SQL
SELECT * FROM [Doc]

Full-Text Search

To create a full-text query, you have to get closer with the next statements:

The first two statements have two parameters. The first is the column name and the second is the searched string. These functions give back Boolean values.

The second two statements are more interesting. These functions return tables that have two columns: Key and Rank. It means that we can get back the unique id of the searched record or records and also we can get back the hit probability (Rank):

SQL
SELECT * FROM ContainsTable([doc], Content, '"nowan"')

Another good trait of containstable and freetexttable is that you can give difficult expressions as searched string. For example, you can use "OR" and "AND" logical terms:

Image 4

Of course, the result table of these statements can be joined to real tables. So if we would like to get the original data row from the original table, we could use a select statement like this:

SQL
SELECT Doc.* FROM [Doc] 
    INNER JOIN ContainsTable([doc], Content, '"nowan"') AS FT 
    ON Doc.ID = FT.[Key]

Indexed File Types

The SQL Server can create indexes typically from text files and Microsoft Office files. This type list is extensible by iFilters designed for the old Indexing Service. You can get an iFilter from Adobe too for indexing PDF files.

Hungarian Version

You can find the Hungarian version of this article here.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Team Leader GSGroup
Hungary Hungary

Comments and Discussions

 
QuestionDo you have an example on how to search on Word document? Pin
jdavidmtzc2-Aug-12 12:18
jdavidmtzc2-Aug-12 12:18 
QuestionHow to use the Transact-SQL functions CONTAINSTABLE and CONTAINS to do a search on a full-text enabled table Pin
elizas23-Mar-10 23:38
elizas23-Mar-10 23:38 
QuestionHow to use the Transact-SQL functions CONTAINSTABLE and CONTAINS to do a search on a full-text enabled table Pin
elizas23-Mar-10 23:35
elizas23-Mar-10 23:35 
CONTAINS is a conditional predicate and used in a where clause to search columns containing character-based data types.

This term looks for a match based on a particular word or phrase which is provided in the sql query.
Basically it looks for an exact match, but it can be extended or modified to look for the inflectional matches.

Basic syntax :

CONTAINS({column | column_list | *} , "<search_conditions>")

In above:
- First argument: can take a single column name or multiple column names or "*" for all columns.
(note that a full text index can be created combining multiple columns of a table)

- Second argument: search_condition, we can put a word or a phrase.

http://www.mindfiresolutions.com/How-to-use-the-TransactSQL-functions-CONTAINSTABLE-and-CONTAINS-to-do-a-search-on-a-fulltext-enabled-table-181.php[^]
Cheers,
Eliza

GeneralFew questions Pin
Pratik.Patel9-Feb-09 14:44
Pratik.Patel9-Feb-09 14:44 
GeneralRe: Few questions Pin
István Kovács (HU)5-Dec-11 10:09
István Kovács (HU)5-Dec-11 10:09 
Generalfull-text index tab is not active Pin
Member 220985825-May-08 22:33
Member 220985825-May-08 22:33 
AnswerRe: full-text index tab is not active Pin
István Kovács (HU)26-May-08 9:36
István Kovács (HU)26-May-08 9:36 
QuestionDatabase size .. is it an issue for full text catalog ?? Pin
reej26-Mar-08 0:30
reej26-Mar-08 0:30 
AnswerRe: Database size .. is it an issue for full text catalog ?? Pin
István Kovács (HU)26-Mar-08 13:52
István Kovács (HU)26-Mar-08 13:52 
QuestionCan you explain how to search on Word document? Pin
dadvir11-Jul-07 22:02
dadvir11-Jul-07 22:02 
AnswerRe: Can you explain how to search on Word document? Pin
István Kovács (HU)11-Jul-07 23:12
István Kovács (HU)11-Jul-07 23:12 
GeneralDotLucene is better Pin
Jan Seda3-Feb-07 3:09
professionalJan Seda3-Feb-07 3:09 
GeneralRe: DotLucene is better Pin
Great George Smith24-Jun-07 6:59
Great George Smith24-Jun-07 6:59 
Questionwhy Full-Text search not support thai language? Pin
kookai29-Aug-06 21:12
kookai29-Aug-06 21:12 
AnswerRe: why Full-Text search not support thai language? [modified] Pin
István Kovács (HU)30-Aug-06 5:04
István Kovács (HU)30-Aug-06 5:04 
GeneralRe: why Full-Text search not support thai language? Pin
kookai30-Aug-06 16:46
kookai30-Aug-06 16:46 
AnswerRe: why Full-Text search not support thai language? Pin
volkan.ozcelik8-Sep-06 21:34
volkan.ozcelik8-Sep-06 21:34 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.