Click here to Skip to main content
15,886,693 members
Articles / Programming Languages / Python
Tip/Trick

Text Indexing with Prime Factors

Rate me:
Please Sign up or sign in to vote.
5.00/5 (9 votes)
19 Jan 2017CPOL4 min read 14.5K   5   6
Explanation and demo of using prime factors to index text records

Introduction

This report details the proof of concept for using prime numbers for indexing text with the intention of searching for words within a data set. This technique is shown to significantly outperform regular string comparison.

The targeted use case for this algorithm is in word frequency analysis for sentiment analysis.

Background

This technique utilizes the knowledge described by the unique-prime-factorization theorem (also known as the Fundamental theorem of arithmetic) where it is understood that every integer greater than 1 has a unique list of prime factors.

Algorithm Theory

The process is broken down into indexing and searching:

Indexing

  1. Defining the dictionary of the data set
  2. Index the dictionary by storing a unique prime number for each word in the dictionary
  3. Index the text records by storing the product of the prime numbers that make up each text record

Searching

  1. Fetch the unique prime number of the word to search for from the indexed dictionary
  2. Find the modulo of each indexed text record against the value of the indexed word being searched for
  3. If the modulo calculation equals 0, then the search word exists in that record

Example

Take the data set of text:

  • "black cat on mat",
  • "black hat for you",
  • "cat sat on you"

The dictionary of this data set is:

  • "black",
  • "cat",
  • "on",
  • "mat",
  • "hat",
  • "for",
  • "you",
  • "sat"

Now assign prime number to each word in dictionary:

  • "black": 2,
  • "cat" : 3,
  • "on" : 5,
  • "mat" : 7,
  • "hat" : 11,
  • "for" : 13,
  • "you" : 17,
  • "sat" : 19

Now turn each text record into the product of its prime numbers:

  • "black cat on mat" = "black(2) cat(3) on(5) mat(7)" = 2 x 3 x 5 x 7 = 210
  • "black hat for you" = black(2) hat(11) for(13) you(17) = 2 x 11 x 13 x 17 = 4862
  • "cat sat on you" = "cat(3) sat(19) on(5) you(17)" = 3 x 19 x 5 x 17 = 4845

Now the indexed data set is:

  • 210,
  • 4862,
  • 4845

Searching for a Single Word

To search for the word "sat"(prime equivalent 19) in the data set, we calculate the modulo of its equivalent prime across the data set and where the result is '0', the word exists within that text record:

  • 210 % 19 = 1 (not a factor)
  • 4862 % 19 = 17 (not a factor)
  • 4845 % 19 = 0 (is a factor "cat sat on you")

Another example, search for "you" (prime equivalent 17):

  • 210 % 17 = 6 (not a factor)
  • 4862 % 17 = 0 (is a factor "black hat for you")
  • 4845 % 17 = 0 (is a factor "cat sat on you")

Searching for Multiple Words

Because of the associative property of factors, to search for multiple words, you do the same as before except calculate the modulo using the product of the words that you are searching for:


Search for "you" and "cat"

"you"(17) x "cat"(3) = 51

  • 210 % 51 = 6 (not a factor)
  • 4862 % 51 = 17 (not a factor)
  • 4845 % 51 = 0 (is a factor "cat sat on you")

Search for "cat" and "on"

"cat"(3) x "on"(5) = 15

  • 210 % 15 = 0 (is a factor "black cat on mat")
  • 4862 % 15 = 2 (not a factor)
  • 4845 % 15 = 0 (is a factor "cat sat on you")

Search for "black", "hat" and "you"

"black"(2) x "hat"(11) x you(17) = 374

  • 210 % 374 = 210 (not a factor)
  • 4862 % 374 = 0 (is a factor "black hat for you")
  • 4845 % 374 = 231 (not a factor)

Performance vs 'numpy string find'

Normal string comparison entails performing compound character comparison for each character within a string which has a complexity of O(n) where n is the length of the string. After indexing: the prime factor approach only requires a single modulo operation per string record which has a complexity of O(1) which in turn results in faster searching across records as illustrated in the plot below of the search speed against the average length of documents in the data set.

The slight gradient in the prime factor plot can be attributed the Pythons implementation which uses arbitrary precision arithmetic to store the indexed document value.

See the full python implementation and performance comparison technique and results here.

Points of Interest

Word Order

Due to the association of multiplication, when the documents are indexed into one product, the information regarding the order of the words is lost.

Appending Documents

When new documents need to be added, they can be indexed individually and any new words can be appended to the dictionary without affecting the currently index documents.

Large Numbers

The size of the stored product value can become very large with long documents so special care should be taken in ensuring the data type used to store this value is able to store the potentially large values..

Conclusion

Prime factors can be used for text searching provided that searching for word order isn't a requirement. This technique can therefore prove especially useful in word frequency analysis problems supplementing techniques like TF-IDF and Bag of Words.

Appendix

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United Kingdom United Kingdom
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
PraiseClever! Pin
Pete Goodsall25-Oct-16 0:19
Pete Goodsall25-Oct-16 0:19 
GeneralRe: Clever! Pin
ZackAkil25-Oct-16 23:39
ZackAkil25-Oct-16 23:39 
QuestionInteresting technique, question about size Pin
jpmik24-Oct-16 19:55
jpmik24-Oct-16 19:55 
AnswerRe: Interesting technique, question about size Pin
ZackAkil24-Oct-16 23:52
ZackAkil24-Oct-16 23:52 
QuestionInteresting Pin
A. A. J. Rodriguez24-Oct-16 9:57
A. A. J. Rodriguez24-Oct-16 9:57 
In a system I designed many years ago, I used a similar mechanism to establish user privileges. For example, we categorized and gave a prime to each general ability within the system:
2: Data Entry
3: Warehouse
5: Accounts
7: Management
9: Administration

And then used a power of the prime to specify what a user was allowed to do within each category:
3: Warehouse employee
9 (3^2): Warehouse supervisor
27 (3^3): Warehouse manager

This way, we could take privileges from each category and assign them to each employee by multiplying each desired category-privilege level.

It was an interesting proof of concept, but it soon proved unwieldy when some combinations of privileges created numbers that were too large for our database.
AnswerRe: Interesting Pin
ZackAkil24-Oct-16 23:39
ZackAkil24-Oct-16 23:39 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.