We have 20 GB of text files containing strings (sentences separated by the dot character). I need to process them so that fetching the sentences that include a given word is faster.

What I have tried:

I have created a trie data structure to fetch the sentences out of a given text file, but it does not scale to a huge file like 20 GB.
Comments
Kornfeld Eliyahu Peter 23-May-16 7:25am    
Learn about the 'sliding window' to see how you can read a file that large...
Sergey Alexandrovich Kryukov 23-May-16 8:42am    
The whole issue is: we don't know what "store or process" means, exactly. The presence of 20 GB text files strongly suggests a kind of technology abuse. A file of a few MB or so can still be quickly and fully loaded by a reasonably fast text editor. If a file is much bigger, the whole idea of having such text files looks questionable.
—SA
Richard MacCutchan 23-May-16 9:46am    
You just need to read the file in smaller chunks and process each chunk individually.
Matt T Heffron 23-May-16 12:28pm    
There's an ambiguity in your question:
At first you state "we have 20 GB of text files", using the plural "files".
Then you refer to "the huge file like 20 GB".
So, the question is: are the individual files 20 GB, or is that the total size of all of the files?
This can make a big difference in how you could approach the problem.

It is like searching in a book!
Do you need to read the whole book at once to be able to find a word in it?
The answer is no.
As long as each part you read (a line or a page) is longer than the word you search for, that is enough.
You just have to take care of a word that starts in one part and ends in the next, as the sketch below shows.
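
For illustration only, here is a minimal sketch of that idea in C#. The ChunkScanner class, its method name, and the 1 MB chunk size are my own assumptions, not from this answer; it performs a plain linear scan instead of building an index:
C#
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkScanner
{
  // Scan the file one chunk at a time, carrying any unfinished sentence
  // over to the next chunk so nothing is lost at a boundary.
  public static IEnumerable<string> SentencesContaining(string path, string word)
  {
    const int ChunkSize = 1 << 20;  // 1 MB of characters per read
    char[] buffer = new char[ChunkSize];
    string carry = "";              // unfinished tail of the previous chunk
    using (StreamReader reader = File.OpenText(path))
    {
      int read;
      while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
      {
        string text = carry + new string(buffer, 0, read);
        int start = 0, dot;
        while ((dot = text.IndexOf('.', start)) >= 0)
        {
          string sentence = text.Substring(start, dot - start + 1);
          if (sentence.Contains(word))
            yield return sentence.Trim();
          start = dot + 1;
        }
        carry = text.Substring(start);  // no dot yet; finish it next chunk
      }
      if (carry.Contains(word))         // file may not end with a dot
        yield return carry.Trim();
    }
  }
}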

I fully agree with Sergey Alexandrovich Kryukov: a 20 GB text file strongly suggests technology abuse.
 
 
Your "store or process" is vague, but it implies to me that it would be acceptable to pre-process the file(s). In that case, the thought that came to me was to index the files so that the information can be efficiently retrieved.

Caveat: this is just thoughts, none of it has been tried!

So, this architecture might fit in memory...
Create a class to represent a sentence:
C#
public class Sentence
{
  public string FilePath { get; set; }  // if there are multiple files to index together
  public long BeginOffset { get; set; } // byte offset where the sentence starts
  public int Length { get; set; }       // length of the sentence in bytes
}

Then the class implementing the whole word-lookup solution:
C#
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.IO;
using System.Text;

public class WordLookup
{
  // maps each word to every sentence in which it occurs
  Dictionary<string, List<Sentence>> LookupMap = new Dictionary<string, List<Sentence>>();
  public bool AddFileToIndex(string filePath)
  {
    using (StreamReader reader = File.OpenText(filePath))
    {
      Stream bs = reader.BaseStream;
      if (!bs.CanSeek)
        return false;      // can't index non-seekable files

      StringBuilder word = new StringBuilder();
      int charcode;
      Sentence sent = new Sentence() { FilePath = filePath,
                                       BeginOffset = bs.Position };
      bool inSentence = false;  // haven't started the sentence yet;
      bool inWord = false;      // likewise
      while (0 <= (charcode = reader.Read()))
      {
        char c = (char)charcode;
        if (char.IsWhiteSpace(c))
        {
          // exclude leading white space from a sentence
          if (!inSentence)
            sent.BeginOffset = bs.Position;
          // skip white space before a word
          if (!inWord)
            continue;
        }

        if (EndsWord(c))  // see the helper methods defined in the edit below
        {
          if (inWord)
          {
            InternalSentencesForWord(word.ToString(), true).Add(sent);
            word.Clear();
            inWord = false;
          }
          if (EndsSentence(c))  // likewise defined in the edit below
          {
            long pos = bs.Position;
            sent.Length = (int)(pos - sent.BeginOffset);
            sent = new Sentence() { FilePath = filePath,
                                    BeginOffset = pos};
            inSentence = false;
          }
        }
        else
        {
          inWord = inSentence = true;
          word.Append(c);
        }
      }
      // just in case the file ends without properly terminating a sentence.
      if (inWord)
      {
        InternalSentencesForWord(word.ToString(), true).Add(sent);
      }
      if (inSentence)
      {
        long pos = bs.Position;
        sent.Length = (int)(pos - sent.BeginOffset);
      }
    }
    return true;
  }

  public ReadOnlyCollection<Sentence> SentencesForWord(string w)
  {
    return new ReadOnlyCollection<Sentence>(InternalSentencesForWord(w, false));
  }

  private List<Sentence> InternalSentencesForWord(string w, bool addToMap)
  {
      List<Sentence> sentences;
      if (!LookupMap.TryGetValue(w, out sentences))
      {
        sentences = new List<Sentence>();
        if (addToMap)
          LookupMap.Add(w, sentences);
      }
      return sentences;
  }
}

Use the SentencesForWord(w) method to get the collection of Sentence instances that will let you open the appropriate file, seek to the beginning of the sentence, read in only that sentence and then display it.
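
As a sketch of my own (the ReadSentence helper is hypothetical, not part of the answer above), retrieval could look like this; it assumes the recorded offsets are true byte offsets into a single-byte-encoded file, which is exactly what the edit below works to guarantee:
C#
using System.IO;
using System.Text;

public static string ReadSentence(Sentence s)
{
  using (FileStream fs = new FileStream(s.FilePath, FileMode.Open, FileAccess.Read))
  {
    fs.Seek(s.BeginOffset, SeekOrigin.Begin);   // jump straight to the sentence
    byte[] bytes = new byte[s.Length];
    int total = 0, n;
    while (total < bytes.Length &&
           (n = fs.Read(bytes, total, bytes.Length - total)) > 0)
      total += n;                               // Read may return fewer bytes than asked
    return Encoding.ASCII.GetString(bytes, 0, total);
  }
}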

Obviously, you don't want to do this indexing each time you do a lookup, only when the file(s) change. Persisting this efficiently is left to you.
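
Purely as a sketch of one possibility (the SaveIndex method is mine, not part of the answer), the map could be dumped to a flat binary file from inside WordLookup:
C#
// Hypothetical persistence: word -> list of sentence references.
public void SaveIndex(string indexPath)
{
  using (BinaryWriter w = new BinaryWriter(File.Create(indexPath)))
  {
    w.Write(LookupMap.Count);
    foreach (KeyValuePair<string, List<Sentence>> pair in LookupMap)
    {
      w.Write(pair.Key);
      w.Write(pair.Value.Count);
      foreach (Sentence s in pair.Value)
      {
        w.Write(s.FilePath);
        w.Write(s.BeginOffset);
        w.Write(s.Length);
      }
    }
  }
}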

In fact, it would probably be even better to keep this information structure in a database, either local or on a server. That depends on your application, and how often your input file(s) change. Also, I didn't explore how to efficiently remove things from this index without rebuilding the whole thing from scratch.

Good luck.

=====
Edit MTH:

The indexing is actually much harder than I thought, because StreamReader buffers its input: BaseStream.Position (for disk files) reports how far the reader has filled its internal buffer, not the byte position corresponding to the next character to be read.
It's not incredibly difficult if you can guarantee single-byte characters, but the fix below depends on reflection and on knowing the details of the actual implementation of StreamReader!

First add:
C#
using System;
using System.IO;
using System.Reflection;

class TrackedStreamReader : StreamReader
{
  public TrackedStreamReader(string path)
    : base(path)
  {
    // NOTE: these private field names match the .NET Framework
    // implementation of StreamReader; other runtimes may differ.
    Type bst = typeof(StreamReader);
    CharPosField = bst.GetField("charPos", BindingFlags.DeclaredOnly | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    CharLenField = bst.GetField("charLen", BindingFlags.DeclaredOnly | BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
  }

  private FieldInfo CharPosField;
  private FieldInfo CharLenField;

  // Logical position of the next character to be read. Only valid when
  // every character is a single byte in the file, so that one buffered
  // char corresponds to exactly one file byte.
  public long Position
  {
    get
    {
      int charPos = (int)CharPosField.GetValue(this);
      int charLen = (int)CharLenField.GetValue(this);
      return BaseStream.Position - charLen + charPos;
    }
  }
}
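
A quick sanity check (my own snippet, assuming a small ASCII file named test.txt): with single-byte characters, Position should advance by exactly one for each character read:
C#
using (TrackedStreamReader reader = new TrackedStreamReader("test.txt"))
{
  long before = reader.Position;
  reader.Read();                                // consume one character
  Console.WriteLine(reader.Position - before);  // prints 1 for ASCII input
}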

Then change the AddFileToIndex to use that:
C#
public bool AddFileToIndex(string filePath)
{
  using (TrackedStreamReader reader = new TrackedStreamReader(filePath))
  {
    if (!reader.BaseStream.CanSeek)
      return false;      // can't index non-seekable files

    if (reader.EndOfStream)
      return false;      // Skip empty files, AND preload buffer so file preamble can be checked!

    StringBuilder word = new StringBuilder();
    Sentence sent = new Sentence() {
      FilePath = filePath,
      BeginOffset = reader.Position
    };
    bool inSentence = false;  // haven't started the sentence yet;
    bool inWord = false;      // likewise
    HashSet<string> wordsThisSentence = new HashSet<string>();
    while (!reader.EndOfStream)
    {
      char c = (char)reader.Read();
      if (char.IsWhiteSpace(c))
      {
        // exclude leading white space from a sentence
        if (!inSentence)
          sent.BeginOffset = reader.Position;
        // skip white space before a word
        if (!inWord)
          continue;
      }

      if (EndsWord(c))
      {
        if (inWord)
        {
          string w = word.ToString();
          // have we already seen this word in this sentence?
          if (!wordsThisSentence.Contains(w))
          {
            wordsThisSentence.Add(w);
            InternalSentencesForWord(w, true).Add(sent);
          }
          word.Clear();
          inWord = false;
        }

        if (EndsSentence(c))
        {
          long pos = reader.Position;
          sent.Length = (int)(pos - sent.BeginOffset);
          sent = new Sentence() {
            FilePath = filePath,
            BeginOffset = pos
          };
          wordsThisSentence.Clear();
          inSentence = false;
        }
      }
      else
      {
        inWord = inSentence = true;
        word.Append(c);
      }
    }
    // just in case the file ends without properly terminating a sentence.
    if (inWord)
    {
      string w = word.ToString();
      if (!wordsThisSentence.Contains(w))  // same duplicate guard as above
        InternalSentencesForWord(w, true).Add(sent);
    }
    if (inSentence)
    {
      long pos = reader.Position;
      sent.Length = (int)(pos - sent.BeginOffset);
    }
  }
  return true;
}

// this is the simplest test
private bool EndsWord(char c)
{
  return !char.IsLetterOrDigit(c);
}

// another simple case
private bool EndsSentence(char c)
{
  return c == '.' || c == '?' || c == '!';
}
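
Tying the pieces together, usage might look like this; the file path and search word are placeholders, and ReadSentence is the hypothetical helper sketched earlier:
C#
WordLookup lookup = new WordLookup();
if (lookup.AddFileToIndex(@"C:\data\big.txt"))
{
  foreach (Sentence s in lookup.SentencesForWord("example"))
    Console.WriteLine(ReadSentence(s));
}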

I'm still thinking through dealing with multiple byte characters!
(There's probably eventually an article in all of this...)
=====
 
