Text Indexing with Prime Factors

ZackAkil

Rate me:

5.00/5 (9 votes)

19 Jan 2017CPOL4 min read

14.5K

Explanation and demo of using prime factors to index text records

Introduction

This report details the proof of concept for using prime numbers for indexing text with the intention of searching for words within a data set. This technique is shown to significantly outperform regular string comparison.

The targeted use case for this algorithm is in word frequency analysis for sentiment analysis.

Background

This technique utilizes the knowledge described by the unique-prime-factorization theorem (also known as the Fundamental theorem of arithmetic) where it is understood that every integer greater than 1 has a unique list of prime factors.

Algorithm Theory

The process is broken down into indexing and searching:

Indexing

Defining the dictionary of the data set
Index the dictionary by storing a unique prime number for each word in the dictionary
Index the text records by storing the product of the prime numbers that make up each text record

Searching

Fetch the unique prime number of the word to search for from the indexed dictionary
Find the modulo of each indexed text record against the value of the indexed word being searched for
If the modulo calculation equals 0, then the search word exists in that record

Example

Take the data set of text:

"black cat on mat",
"black hat for you",
"cat sat on you"

The dictionary of this data set is:

"black",
"cat",
"on",
"mat",
"hat",
"for",
"you",
"sat"

Now assign prime number to each word in dictionary:

"black": 2,
"cat" : 3,
"on" : 5,
"mat" : 7,
"hat" : 11,
"for" : 13,
"you" : 17,
"sat" : 19

Now turn each text record into the product of its prime numbers:

"black cat on mat" = "black(2) cat(3) on(5) mat(7)" = 2 x 3 x 5 x 7 = 210
"black hat for you" = black(2) hat(11) for(13) you(17) = 2 x 11 x 13 x 17 = 4862
"cat sat on you" = "cat(3) sat(19) on(5) you(17)" = 3 x 19 x 5 x 17 = 4845

Now the indexed data set is:

210,
4862,
4845

Searching for a Single Word

To search for the word "sat"(prime equivalent 19) in the data set, we calculate the modulo of its equivalent prime across the data set and where the result is '0', the word exists within that text record:

210 % 19 = 1 (not a factor)
4862 % 19 = 17 (not a factor)
4845 % 19 = 0 (is a factor "cat sat on you")

Another example, search for "you" (prime equivalent 17):

210 % 17 = 6 (not a factor)
4862 % 17 = 0 (is a factor "black hat for you")
4845 % 17 = 0 (is a factor "cat sat on you")

Searching for Multiple Words

Because of the associative property of factors, to search for multiple words, you do the same as before except calculate the modulo using the product of the words that you are searching for:

Search for "you" and "cat"

"you"(17) x "cat"(3) = 51

210 % 51 = 6 (not a factor)
4862 % 51 = 17 (not a factor)
4845 % 51 = 0 (is a factor "cat sat on you")

Search for "cat" and "on"

"cat"(3) x "on"(5) = 15

210 % 15 = 0 (is a factor "black cat on mat")
4862 % 15 = 2 (not a factor)
4845 % 15 = 0 (is a factor "cat sat on you")

Search for "black", "hat" and "you"

"black"(2) x "hat"(11) x you(17) = 374

210 % 374 = 210 (not a factor)
4862 % 374 = 0 (is a factor "black hat for you")
4845 % 374 = 231 (not a factor)

Performance vs 'numpy string find'

Normal string comparison entails performing compound character comparison for each character within a string which has a complexity of O(n) where n is the length of the string. After indexing: the prime factor approach only requires a single modulo operation per string record which has a complexity of O(1) which in turn results in faster searching across records as illustrated in the plot below of the search speed against the average length of documents in the data set.

The slight gradient in the prime factor plot can be attributed the Pythons implementation which uses arbitrary precision arithmetic to store the indexed document value.

See the full python implementation and performance comparison technique and results here.

Points of Interest

Word Order

Due to the association of multiplication, when the documents are indexed into one product, the information regarding the order of the words is lost.

Appending Documents

When new documents need to be added, they can be indexed individually and any new words can be appended to the dictionary without affecting the currently index documents.

Large Numbers

The size of the stored product value can become very large with long documents so special care should be taken in ensuring the data type used to store this value is able to store the potentially large values..

Conclusion

Prime factors can be used for text searching provided that searching for word order isn't a requirement. This technique can therefore prove especially useful in word frequency analysis problems supplementing techniques like TF-IDF and Bag of Words.

Appendix

Jupyter notebook of implementation and performance testing: https://github.com/ZackAkil/text-indexing-using-prime-factors/blob/master/text%20indexing%20using%20prime%20factors.ipynb
Python package implementation: https://pypi.python.org/pypi/primetext/

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

ZackAkil

United Kingdom

This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.