Let's say I have n domains, and for each domain I have a separate text corpus (that is, a text corpus specific to each marketplace: US, UK, JP, ...). The same word can appear in many different corpora. I want to create a flat file that associates each word with the list of domain IDs where it is present, along with its frequency of occurrence. I need some suggestions: how can we represent this information at run time in a memory-efficient way? n can be between 400 and 500.

What I have tried:

I can think of a flat file that stores domainID,count pairs against each word.
The entire data set could then be loaded into a trie, where each word carries its values as metadata at its last node.
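The trie idea described above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names `TrieNode`, `WordTrie`, `add`, and `lookup` are hypothetical, and the per-word metadata is assumed to be a plain mapping of domain ID to count stored on the node where the word ends.

```python
class TrieNode:
    # __slots__ keeps per-node memory overhead down, which matters
    # when the trie holds a very large vocabulary.
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.counts = None   # domain_id -> count, set only on word-final nodes


class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, domain_id, count=1):
        """Record `count` occurrences of `word` in the given domain."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        if node.counts is None:
            node.counts = {}
        node.counts[domain_id] = node.counts.get(domain_id, 0) + count

    def lookup(self, word):
        """Return {domain_id: count} for `word`, or {} if it was never added."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return dict(node.counts) if node.counts else {}


trie = WordTrie()
trie.add("busy", 1)
trie.add("busy", 1)
trie.add("busy", 7, count=3)
trie.add("bus", 2)
print(trie.lookup("busy"))
```

Only word-final nodes carry a counts mapping, so shared prefixes such as "bus" and "busy" cost a single path in the trie.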
Comments
Richard MacCutchan 2-Jun-16 6:01am    
Use a Dictionary type or a compact database.
Sergey Alexandrovich Kryukov 2-Jun-16 10:47am    
As there can be too many words, and even the number of domains is considerable, you should not rely on the possibility of storing it all in memory; you need persistent storage. Yes, it can be a "flat" file, but why not a database?
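As one possible reading of the database suggestion, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table name `word_count`, its columns, and the helper functions are assumptions made for illustration; an in-memory database is used only so the example is self-contained, and a file path would be used for real persistence.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; pass a file name to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE word_count (
        word      TEXT    NOT NULL,
        domain_id INTEGER NOT NULL,
        freq      INTEGER NOT NULL,
        PRIMARY KEY (word, domain_id)
    )
    """
)


def add_occurrences(conn, word, domain_id, n=1):
    """Add n occurrences of word in a domain, creating the row if needed."""
    cur = conn.execute(
        "UPDATE word_count SET freq = freq + ? WHERE word = ? AND domain_id = ?",
        (n, word, domain_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO word_count (word, domain_id, freq) VALUES (?, ?, ?)",
            (word, domain_id, n),
        )


def domains_for(conn, word):
    """Return [(domain_id, freq), ...] for word, ordered by domain ID."""
    rows = conn.execute(
        "SELECT domain_id, freq FROM word_count WHERE word = ? ORDER BY domain_id",
        (word,),
    )
    return rows.fetchall()


add_occurrences(conn, "busy", 1)
add_occurrences(conn, "busy", 1)
add_occurrences(conn, "busy", 7, n=3)
conn.commit()
print(domains_for(conn, "busy"))
```

The composite primary key on (word, domain_id) gives one row per word and domain, and lets the lookup by word use the key's index instead of scanning the table.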
—SA
Matt T Heffron 2-Jun-16 12:29pm    
Virtual +5.
OP could create a Data Access Layer, starting with Dictionary as Richard suggested, while working out the non-storage part of the application (e.g., parsing, indexing, lookup, ...).
When "ready", switch to database without needing to change (much of) the rest of the application.
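A rough sketch of that layering, assuming a hypothetical `WordIndex` interface with a dictionary-backed implementation behind it. All class and function names here are made up for illustration; the point is only that the indexing code talks to the interface, so a database-backed class could replace `DictWordIndex` later.

```python
from abc import ABC, abstractmethod
from collections import Counter, defaultdict


class WordIndex(ABC):
    """Hypothetical data access interface: the rest of the app sees only this."""

    @abstractmethod
    def add(self, word, domain_id, n=1):
        """Record n occurrences of word in the given domain."""

    @abstractmethod
    def counts(self, word):
        """Return {domain_id: count} for word."""


class DictWordIndex(WordIndex):
    """In-memory implementation backed by a dictionary of Counters."""

    def __init__(self):
        self._data = defaultdict(Counter)

    def add(self, word, domain_id, n=1):
        self._data[word][domain_id] += n

    def counts(self, word):
        # .get() avoids creating an empty entry for unknown words.
        return dict(self._data.get(word, {}))


def index_corpus(index, domain_id, text):
    """Naive whitespace tokenizer; depends only on the WordIndex interface."""
    for word in text.lower().split():
        index.add(word, domain_id)


index = DictWordIndex()
index_corpus(index, 1, "Busy day ahead")
index_corpus(index, 2, "busy busy week")
print(index.counts("busy"))
```

Because `index_corpus` never touches the storage directly, swapping in a class that implements the same two methods on top of a database should not require changes to the parsing or lookup code.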
Sergey Alexandrovich Kryukov 2-Jun-16 13:24pm    
Thank you, Matt.
That's a good point. Separating and abstracting out the data layer is key here (which I took for granted :-).
—SA
manishhsinam 3-Jun-16 9:16am    
Thanks, guys, for the inputs; really helpful.
Allow me to add more complexity to the problem.
We need richer metadata.

Ex : "Busy day ahead"
I need to know :
-- the count of unigrams, bigrams, and trigrams [for every domain-specific corpus]

Wildcard queries as well:
ex: the count for all bigrams starting with "Busy"*.
ex: the count for all trigrams starting with "Busy day"*. And so on.
Any suggestions on how to capture this information effectively in a DB or with a trie approach?
I am thinking of some kind of partially loaded trie.
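One way the n-gram and wildcard requirements could be handled with a trie is to key the nodes on whole words rather than characters, so that a prefix query becomes a walk down the trie followed by a sum over the children. The sketch below assumes whitespace tokenization, lowercasing, and n-grams up to length 3; `NGramTrie` and its methods are hypothetical names, and partial loading from disk is not covered.

```python
class NGramNode:
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}  # next token -> NGramNode
        self.counts = {}    # domain_id -> count of the n-gram ending at this node


class NGramTrie:
    def __init__(self, max_n=3):
        self.max_n = max_n
        self.root = NGramNode()

    def add_text(self, domain_id, text):
        """Count every n-gram (n = 1..max_n) of text for the given domain."""
        tokens = text.lower().split()
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_n]:
                child = node.children.get(tok)
                if child is None:
                    child = node.children[tok] = NGramNode()
                node = child
                node.counts[domain_id] = node.counts.get(domain_id, 0) + 1

    def _find(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def count(self, ngram, domain_id):
        """Exact count of one n-gram in one domain."""
        node = self._find(ngram.lower().split())
        return node.counts.get(domain_id, 0) if node else 0

    def count_extensions(self, prefix, domain_id):
        """Total count of n-grams that extend prefix by exactly one token,
        e.g. all bigrams matching "busy *" or trigrams matching "busy day *"."""
        node = self._find(prefix.lower().split())
        if node is None:
            return 0
        return sum(c.counts.get(domain_id, 0) for c in node.children.values())


ngrams = NGramTrie(max_n=3)
ngrams.add_text(1, "Busy day ahead")
ngrams.add_text(1, "busy day today")
ngrams.add_text(1, "busy week ahead")
ngrams.add_text(2, "busy day ahead")
print(ngrams.count_extensions("busy", 1))      # bigrams "busy *" in domain 1
print(ngrams.count_extensions("busy day", 1))  # trigrams "busy day *" in domain 1
```

In a database the same wildcard query could be expressed as a prefix match on an n-gram column, so the choice between the two approaches mostly comes down to whether the whole structure fits in memory.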





This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


