I used a trie data structure to fetch sentences when given any word as input.
It works at good speed for a text file of a few GB. The problem is that with a huge text file the time complexity grows and fetching the sentences takes much longer.
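
Roughly, the lookup structure I built looks like this sketch (Python is used here just for illustration; the names and details are simplified, not my exact code):

class TrieNode:
    def __init__(self):
        self.children = {}       # character -> TrieNode
        self.sentence_ids = []   # sentences whose text contains this word

class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, sentence_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if sentence_id not in node.sentence_ids:
            node.sentence_ids.append(sentence_id)

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []        # word never seen
        return node.sentence_ids

sentences = ["i am watching cricket",
             "playing cricket is my hobby",
             "cricket is one of my favorite sports",
             "we dont miss ipl matches anytime"]

trie = WordTrie()
for sid, sentence in enumerate(sentences):
    for word in sentence.split():
        trie.insert(word, sid)

print([sentences[sid] for sid in trie.lookup("cricket")])   # all three matching sentences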

So I decided to split the given huge text file into 26 smaller files, where each file contains only the sentences starting with a particular letter of the alphabet.

Let's take the example sentences below (listed under "What I have tried").

File 1, consisting of only the sentences starting with the letter C:

cricket is one of my favorite sports


File 2, consisting of only the sentences starting with the letter I:

i am watching cricket

File 3, consisting of only the sentences starting with the letter P:

playing cricket is my hobby

File 4, consisting of only the sentences starting with the letter W:

we dont miss ipl matches anytime


Now the problem is this: if I search for the word "cricket", only the sentences from file 1 are displayed, since that file holds all the sentences starting with the letter C. But files 2 and 3 also have the word "cricket" in their sentences.
How do I solve this problem?

What I have tried:

Let's take a few sentences that are in a file, say:

i am watching cricket. playing cricket is my hobby. cricket is one of my favorite sports. we dont miss ipl matches anytime.

Now we have to separate these sentences. I used the dot (full stop) as the delimiter to split them, and it produced something like this:

i am watching cricket
playing cricket is my hobby
cricket is one of my favorite sports
we dont miss ipl matches anytime
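
In code, the splitting step is roughly this (a Python sketch of what I did; my actual code may differ):

text = ("i am watching cricket. playing cricket is my hobby. "
        "cricket is one of my favorite sports. we dont miss ipl matches anytime.")

# Split on the full stop and drop empty or whitespace-only pieces.
sentences = [part.strip() for part in text.split(".") if part.strip()]
for sentence in sentences:
    print(sentence)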
Posted; updated 24-May-16 22:50pm
Comments
Matt T Heffron 25-May-16 16:35pm    
This is essentially a duplicate of the question you asked 2 days ago.
The main difference here is that you are assuming a specific partial implementation (tries and split files).
I gave a very extensive solution (click here) that traded pre-processing time to gain fast lookup time.
(Admittedly, I gave the solution in C#, but the principles are all the same.)

The way I would do it is different: I'd create a "mapping" file which contains indexes into the big file. Probably two sets of indexes.
The first set would be the index (starting offset) and length of each sentence, plus a "Sentence ID" value to individually identify the sentence.
The second set would be each word. Or rather, each different word, with the Sentence IDs it appears in and the offset within each sentence.
So entries might look like this:
the     1,0;1,93;3,0;3,116;4,0;5,37;5,72;5,90
way     1,4
i       1,8
...
for the words in the sentences that start this reply.
When you want to find a word, you just look in the "Words" mapping table: it tells you immediately whether the word is in the text at all, which sentences it appears in, and the offset within each sentence where it is located.
Since the number of words in any language is fairly small (the average active vocabulary of an English native speaker is around 20,000 words), this should be a lot quicker and easier to work with than parsing the whole file each time, and it only needs to be created (or modified) when your input text file changes.
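
As a rough illustration only (a Python sketch with made-up names, not production code), building that second set from a list of sentences could look like this:

from collections import defaultdict

# word -> list of (sentence_id, offset_within_sentence)
word_map = defaultdict(list)

sentences = ["The way I would do it is different",    # sample data only
             "Probably two sets of indexes"]

for sid, sentence in enumerate(sentences, start=1):
    offset = 0
    for word in sentence.split(" "):
        word_map[word.lower()].append((sid, offset))
        offset += len(word) + 1    # +1 for the separating space

print(word_map["the"])   # every (sentence ID, offset) pair where "the" occurs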
Comments
Member 12540483 25-May-16 2:53am    
But how do you map the mapping file to the huge file, and how do you assign a Sentence ID to every sentence in the huge file?
If you do that, won't it still have to search for the sentences by comparing the word to all the sentences in the text file? Isn't that still time consuming? @OriginalGriff
OriginalGriff 25-May-16 3:46am    
Indexes! :laugh:
Each sentence has an index to its starting position in the file, and a length.
So the first sentence starts at index zero, the second at index 103 (say).
That way, when you know which sentence you want, you can use Seek to go straight to its start and read as many characters as it is long. Each sentence gets an ID value which is one larger than the previous one.

When you want to look for a word, you just scan your "Word mapping" file to find it, and that gives you a list of all the sentences it's used in.
Since the mapping files are way, way smaller than the whole file (and I'd probably hold them in memory anyway), your search is reduced from scanning a few GB each time to looking through 20,000 or so entries, which is a much more manageable number!
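
As a sketch (Python for illustration; the file name and the index values here are invented), reading a sentence back once its offset and length are known is just a seek and a read:

# sentence_id -> (byte_offset, length); built once when the big file is indexed.
sentence_index = {1: (0, 103), 2: (103, 88)}    # example values only

def read_sentence(path, sentence_id):
    offset, length = sentence_index[sentence_id]
    with open(path, "rb") as f:
        f.seek(offset)                        # jump straight to the sentence
        return f.read(length).decode("utf-8")

# The word-mapping table gives the sentence IDs for a word; each sentence is
# then fetched directly, without scanning the whole multi-GB file.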
Patrice T 25-May-16 4:52am    
5ed
Member 12540483 25-May-16 4:55am    
any other solution to this?
@OriginalGriff
OriginalGriff 25-May-16 5:16am    
It's not as difficult as you think, really!
You already have the code to do the complicated bits: read and identify whole sentences, and break those into component words.
The rest is pretty much trivial!
I guess you have indexed a file, a set of files, or whatever is consistent with your needs, the point of the index being to find answers fast.
Quote:
The problem is that with a huge text file the time complexity grows and fetching the sentences takes much longer.
This sentence translates to "I designed a complicated workaround to avoid hitting a design flaw in my search routine". You should explain why you have a time complexity problem with huge files.

By building the index with a set of information for each word ({Word, FileName, WordOffsetInFile} or {Word, FileName, SentenceOffsetInFile}), the time complexity to retrieve a sentence is basically the sentence length.
I won't detail the various techniques used to compress the index.

Quote:
How do I solve this problem?
To solve your problem, you need to explain why you have a time complexity problem, and probably show the related code.
My solution: don't split the huge file; fix the flaw in your code.
Comments
OriginalGriff 25-May-16 5:18am    
:thumbsup:
