We have 20 GB of text files containing strings (sentences separated by the dot character). I need to process them so that fetching the sentences that include a given word is faster.

What I have tried:

I have created a trie data structure to fetch the sentences out of a given text file, but it does not scale to a huge file like 20 GB.
Comments
Kornfeld Eliyahu Peter 23-May-16 7:25am    
Learn about the 'sliding window' to see how you can read a file that large...
Sergey Alexandrovich Kryukov 23-May-16 8:42am    
The whole issue is: we don't know what "store or process" means, exactly. The presence of 20 GB text files strongly suggests a kind of technology abuse. A file of a few MB or so can still be quickly and fully loaded by a reasonably fast text editor. If a file is much bigger, the whole idea of having such text files looks questionable.
—SA
Richard MacCutchan 23-May-16 9:46am    
You just need to read the file in smaller chunks and process each chunk individually.
Matt T Heffron 23-May-16 12:28pm    
There's an ambiguity in your question:
At first you state "we have 20 GB of text files", using the plural "files".
Then you refer to "the huge file like 20 GB".
So, the question is: are the individual files 20 GB, or is that the total size of all of the files?
This can make a big difference in how you could approach the problem.

It is like searching in a book!
Do you need to read the whole book at once to be able to find a word in it?
The answer is no.
As long as each part you read (a line or a page) is longer than the word you search for, that is enough.
You just have to take care of a word that starts in one part and ends in the next, as the sketch below shows.
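
For illustration only, here is a minimal sketch of that idea in C#. The ChunkScanner class, its method name, and the 1 MB chunk size are my own assumptions, not from this answer; it performs a plain linear scan instead of building an index:
C#
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkScanner
{
  // Scan the file one chunk at a time, carrying any unfinished sentence
  // over to the next chunk so nothing is lost at a boundary.
  public static IEnumerable<string> SentencesContaining(string path, string word)
  {
    const int ChunkSize = 1 << 20;  // 1 MB of characters per read
    char[] buffer = new char[ChunkSize];
    string carry = "";              // unfinished tail of the previous chunk
    using (StreamReader reader = File.OpenText(path))
    {
      int read;
      while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
      {
        string text = carry + new string(buffer, 0, read);
        int start = 0, dot;
        while ((dot = text.IndexOf('.', start)) >= 0)
        {
          string sentence = text.Substring(start, dot - start + 1);
          if (sentence.Contains(word))
            yield return sentence.Trim();
          start = dot + 1;
        }
        carry = text.Substring(start);  // no dot yet; finish it next chunk
      }
      if (carry.Contains(word))         // file may not end with a dot
        yield return carry.Trim();
    }
  }
}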

I fully agree with Sergey Alexandrovich Kryukov: a 20 GB text file strongly suggests technology abuse.
 
 
Your "store or process" is vague, but it implies to me that it would be acceptable to pre-process the file(s). In that case, the thought that came to me was to index the files so that the information can be efficiently retrieved.

Caveat: this is just thoughts, none of it has been tried!

So, this architecture might fit in memory...
Create a class to represent a sentence:
C#
public class Sentence
{
  public string FilePath { get; set; }  // if there are multiple files to index together
  public long BeginOffset { get; set; } // byte offset where the sentence starts
  public int Length { get; set; }       // length of the sentence in bytes
}

Then the class implementing the whole word-lookup solution:
C#
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.IO;
using System.Text;

public class WordLookup
{
  // maps each word to every sentence in which it occurs
  Dictionary<string, List<Sentence>> LookupMap = new Dictionary<string, List<Sentence>>();
  public bool AddFileToIndex(string filePath)
  {
    using (StreamReader reader = File.OpenText(filePath))
    {
      Stream bs = reader.BaseStream;
      if (!bs.CanSeek)
        return false;      // can't index non-seekable files

      StringBuilder word = new StringBuilder();
      int charcode;
      Sentence sent = new Sentence() { FilePath = filePath,
                                       BeginOffset = bs.Position };
      bool inSentence = false;  // haven't started the sentence yet;
      bool inWord = false;      // likewise
      while (0 <= (charcode = reader.Read()))
      {
        char c = (char)charcode;
        if (char.IsWhiteSpace(c))
        {
          // exclude leading white space from a sentence
          if (!inSentence)
            sent.BeginOffset = bs.Position;
          // skip white space before a word
          if (!inWord)
            continue;
        }

        if (EndsWord(c))  // see the helper methods defined in the edit below
        {
          if (inWord)
          {
            InternalSentencesForWord(word.ToString(), true).Add(sent);
            word.Clear();
            inWord = false;
          }
          if (EndsSentence(c))  // likewise defined in the edit below
          {
            long pos = bs.Position;
            sent.Length = (int)(pos - sent.BeginOffset);
            sent = new Sentence() { FilePath = filePath,
                                    BeginOffset = pos};
            inSentence = false;
          }
        }
        else
        {
          inWord = inSentence = true;
          word.Append(c);
        }
      }
      // just in case the file ends without properly terminating a sentence.
      if (inWord)
      {
        InternalSentencesForWord(word.ToString(), true).Add(sent);
      }
      if (inSentence)
      {
        long pos = bs.Position;
        sent.Length = (int)(pos - sent.BeginOffset);
      }
    }
    return true;
  }

  public ReadOnlyCollection<Sentence> SentencesForWord(string w)
  {
    return new ReadOnlyCollection<Sentence>(InternalSentencesForWord(w, false));
  }

  private List<Sentence> InternalSentencesForWord(string w, bool addToMap)
  {
      List<Sentence> sentences;
      if (!LookupMap.TryGetValue(w, out sentences))
      {
        sentences = new List<Sentence>();
        if (addToMap)
          LookupMap.Add(w, sentences);
      }
      return sentences;
  }
}

Use the SentencesForWord(w) method to get the collection of Sentence instances that will let you open the appropriate file, seek to the beginning of the sentence, read in only that sentence and then display it.
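
As a sketch of my own (the ReadSentence helper is hypothetical, not part of the answer above), retrieval could look like this; it assumes the recorded offsets are true byte offsets into a single-byte-encoded file, which is exactly what the edit below works to guarantee:
C#
using System.IO;
using System.Text;

public static string ReadSentence(Sentence s)
{
  using (FileStream fs = new FileStream(s.FilePath, FileMode.Open, FileAccess.Read))
  {
    fs.Seek(s.BeginOffset, SeekOrigin.Begin);   // jump straight to the sentence
    byte[] bytes = new byte[s.Length];
    int total = 0, n;
    while (total < bytes.Length &&
           (n = fs.Read(bytes, total, bytes.Length - total)) > 0)
      total += n;                               // Read may return fewer bytes than asked
    return Encoding.ASCII.GetString(bytes, 0, total);
  }
}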

Obviously, you don't want to do this indexing each time you do a lookup, only when the file(s) change. Persisting this efficiently is left to you.
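
Purely as a sketch of one possibility (the SaveIndex method is mine, not part of the answer), the map could be dumped to a flat binary file from inside WordLookup:
C#
// Hypothetical persistence: word -> list of sentence references.
public void SaveIndex(string indexPath)
{
  using (BinaryWriter w = new BinaryWriter(File.Create(indexPath)))
  {
    w.Write(LookupMap.Count);
    foreach (KeyValuePair<string, List<Sentence>> pair in LookupMap)
    {
      w.Write(pair.Key);
      w.Write(pair.Value.Count);
      foreach (Sentence s in pair.Value)
      {
        w.Write(s.FilePath);
        w.Write(s.BeginOffset);
        w.Write(s.Length);
      }
    }
  }
}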

In fact, it would probably be even better to keep this information structure in a database, either local or on a server. That depends on your application, and how often your input file(s) change. Also, I didn't explore how to efficiently remove things from this index without rebuilding the whole thing from scratch.

Good luck.

=====
Edit MTH:

The indexing is actually much harder than I thought, because StreamReader buffers its input: BaseStream.Position (for disk files) reports how far the reader has filled its internal buffer, not the byte position corresponding to the next character to be read.
It's not incredibly difficult if you can guarantee single-byte characters, but the fix below depends on reflection and on knowing the details of the actual implementation of StreamReader!

First add:
C#
using System;
using System.IO;
using System.Reflection;

class TrackedStreamReader : StreamReader
{
  public TrackedStreamReader(string path)
    : base(path)
  {
    // NOTE: these private field names match the .NET Framework
    // implementation of StreamReader; other runtimes may differ.
    Type bst = typeof(StreamReader);
    CharPosField = bst.GetField("charPos", BindingFlags.DeclaredOnly | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    CharLenField = bst.GetField("charLen", BindingFlags.DeclaredOnly | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
  }

  private FieldInfo CharPosField;
  private FieldInfo CharLenField;

  // Logical position of the next character to be read. Only valid when
  // every character is a single byte in the file, so that one buffered
  // char corresponds to exactly one file byte.
  public long Position
  {
    get
    {
      int charPos = (int)CharPosField.GetValue(this);
      int charLen = (int)CharLenField.GetValue(this);
      return BaseStream.Position - charLen + charPos;
    }
  }
}
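
A quick sanity check (my own snippet, assuming a small ASCII file named test.txt): with single-byte characters, Position should advance by exactly one for each character read:
C#
using (TrackedStreamReader reader = new TrackedStreamReader("test.txt"))
{
  long before = reader.Position;
  reader.Read();                                // consume one character
  Console.WriteLine(reader.Position - before);  // prints 1 for ASCII input
}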

Then change the AddFileToIndex to use that:
C#
public bool AddFileToIndex(string filePath)
{
  using (TrackedStreamReader reader = new TrackedStreamReader(filePath))
  {
    if (!reader.BaseStream.CanSeek)
      return false;      // can't index non-seekable files

    if (reader.EndOfStream)
      return false;      // Skip empty files, AND preload buffer so file preamble can be checked!

    StringBuilder word = new StringBuilder();
    Sentence sent = new Sentence() {
      FilePath = filePath,
      BeginOffset = reader.Position
    };
    bool inSentence = false;  // haven't started the sentence yet;
    bool inWord = false;      // likewise
    HashSet<string> wordsThisSentence = new HashSet<string>();
    while (!reader.EndOfStream)
    {
      char c = (char)reader.Read();
      if (char.IsWhiteSpace(c))
      {
        // exclude leading white space from a sentence
        if (!inSentence)
          sent.BeginOffset = reader.Position;
        // skip white space before a word
        if (!inWord)
          continue;
      }

      if (EndsWord(c))
      {
        if (inWord)
        {
          string w = word.ToString();
          // have we already seen this word in this sentence?
          if (!wordsThisSentence.Contains(w))
          {
            wordsThisSentence.Add(w);
            InternalSentencesForWord(w, true).Add(sent);
          }
          word.Clear();
          inWord = false;
        }

        if (EndsSentence(c))
        {
          long pos = reader.Position;
          sent.Length = (int)(pos - sent.BeginOffset);
          sent = new Sentence() {
            FilePath = filePath,
            BeginOffset = pos
          };
          wordsThisSentence.Clear();
          inSentence = false;
        }
      }
      else
      {
        inWord = inSentence = true;
        word.Append(c);
      }
    }
    // just in case the file ends without properly terminating a sentence.
    if (inWord)
    {
      string w = word.ToString();
      if (!wordsThisSentence.Contains(w))  // same duplicate guard as above
        InternalSentencesForWord(w, true).Add(sent);
    }
    if (inSentence)
    {
      long pos = reader.Position;
      sent.Length = (int)(pos - sent.BeginOffset);
    }
  }
  return true;
}

// this is the simplest test
private bool EndsWord(char c)
{
  return !char.IsLetterOrDigit(c);
}

// another simple case
private bool EndsSentence(char c)
{
  return c == '.' || c == '?' || c == '!';
}
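
Tying the pieces together, usage might look like this; the file path and search word are placeholders, and ReadSentence is the hypothetical helper sketched earlier:
C#
WordLookup lookup = new WordLookup();
if (lookup.AddFileToIndex(@"C:\data\big.txt"))
{
  foreach (Sentence s in lookup.SentencesForWord("example"))
    Console.WriteLine(ReadSentence(s));
}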

I'm still thinking through dealing with multiple byte characters!
(There's probably eventually an article in all of this...)
=====
 
