Word association with domain-specific metadata

Question

Let's say I have n domains, and each domain has its own text corpus (that is, a corpus specific to one marketplace: US, UK, JP, ...). The same word can appear in many different corpora. I want to create a flat file that associates each word with the list of domain IDs where it occurs, along with its frequency of occurrence in each domain. I need some suggestions: how can we represent this information at run time in a memory-efficient way? n can be between 400 and 500.

What I have tried:

I can think of a flat file that stores domainID,count pairs against each word.
I could load this entire data set into a trie, where each word carries its metadata as the value at its last node.
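
To make the idea above concrete, here is a minimal sketch of a character trie whose terminal nodes hold the per-domain counts. The flat-file layout (`word<TAB>domainID:count,domainID:count`) and the names `WordTrie` and `load_flat_file` are assumptions made for illustration; the question does not fix a format.

```python
import io


class TrieNode:
    """One character step in the trie; __slots__ keeps per-node overhead low."""
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.counts = None   # {domain_id: count}, set only where a word ends


class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, domain_id, count=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.counts is None:
            node.counts = {}
        node.counts[domain_id] = node.counts.get(domain_id, 0) + count

    def lookup(self, word):
        """Return {domain_id: count} for an exact word, or {} if absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return {}
        return dict(node.counts or {})


def load_flat_file(lines, trie):
    """Parse the assumed layout: word<TAB>domainID:count,domainID:count"""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, _, pairs = line.partition("\t")
        for pair in pairs.split(","):
            domain_id, _, count = pair.partition(":")
            trie.add(word, int(domain_id), int(count))


# Hypothetical sample data standing in for the real flat file.
SAMPLE = "busy\t1:12,7:3\nday\t1:40\n"
trie = WordTrie()
load_flat_file(io.StringIO(SAMPLE), trie)
print(trie.lookup("busy"))
```

A plain Python trie like this spends far more memory on nodes than on the counts themselves, so it is mainly useful for checking the lookup logic before choosing a compact representation.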

Posted 1-Jun-16 23:38pm

manishhsinam

Comments

Richard MacCutchan 2-Jun-16 6:01am

Use a Dictionary type or a compact database.
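
As a rough illustration of the dictionary suggestion, the sketch below builds a word -> {domain ID: count} mapping and writes it out as flat-file lines. Whitespace tokenization, lowercasing, and the line layout are assumptions, not something stated in the thread.

```python
from collections import Counter, defaultdict


def build_index(corpora):
    """corpora: {domain_id: text}. Returns {word: {domain_id: count}}."""
    index = defaultdict(dict)
    for domain_id, text in corpora.items():
        # Counter tallies each word once per corpus.
        for word, n in Counter(text.lower().split()).items():
            index[word][domain_id] = n
    return index


def to_flat_lines(index):
    """Yield one line per word: word<TAB>domainID:count,domainID:count"""
    for word in sorted(index):
        pairs = ",".join(f"{d}:{c}" for d, c in sorted(index[word].items()))
        yield f"{word}\t{pairs}"


# Hypothetical two-domain example.
corpora = {1: "busy day ahead busy", 2: "a busy week"}
index = build_index(corpora)
for line in to_flat_lines(index):
    print(line)
```

Only the domains in which a word actually occurs get an entry, so a word seen in 3 of 500 domains costs 3 pairs rather than 500 slots.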

Sergey Alexandrovich Kryukov 2-Jun-16 10:47am

As there can be too many words, and even the number of domains is considerable, you should not rely on being able to store it all in memory; you need persistent storage. Yes, it can be a "flat" file, but why not a database?
—SA

Matt T Heffron 2-Jun-16 12:29pm

Virtual +5.
OP could create a Data Access Layer, starting with Dictionary as Richard suggested, while working out the non-storage part of the application (e.g., parsing, indexing, lookup, ...).
When "ready", switch to database without needing to change (much of) the rest of the application.
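
One way to picture the data access layer described above is an abstract store with a dictionary-backed implementation first and a database-backed one later. This is only a sketch: the interface name `CountStore`, its two methods, and the table layout are all hypothetical, and SQLite stands in for whatever database is eventually chosen.

```python
import sqlite3
from abc import ABC, abstractmethod


class CountStore(ABC):
    """Hypothetical data-layer interface; the rest of the app sees only this."""

    @abstractmethod
    def add(self, word, domain_id, count=1): ...

    @abstractmethod
    def counts(self, word): ...


class InMemoryStore(CountStore):
    def __init__(self):
        self._data = {}

    def add(self, word, domain_id, count=1):
        per_domain = self._data.setdefault(word, {})
        per_domain[domain_id] = per_domain.get(domain_id, 0) + count

    def counts(self, word):
        return dict(self._data.get(word, {}))


class SqliteStore(CountStore):
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS word_count ("
            " word TEXT NOT NULL, domain_id INTEGER NOT NULL,"
            " n INTEGER NOT NULL, PRIMARY KEY (word, domain_id))"
        )

    def add(self, word, domain_id, count=1):
        # Create the row if missing, then increment it.
        self._db.execute(
            "INSERT OR IGNORE INTO word_count VALUES (?, ?, 0)",
            (word, domain_id),
        )
        self._db.execute(
            "UPDATE word_count SET n = n + ? WHERE word = ? AND domain_id = ?",
            (count, word, domain_id),
        )
        self._db.commit()

    def counts(self, word):
        rows = self._db.execute(
            "SELECT domain_id, n FROM word_count WHERE word = ?", (word,)
        )
        return dict(rows.fetchall())


def index_text(store, domain_id, text):
    """Application code: works with any CountStore implementation."""
    for word in text.lower().split():
        store.add(word, domain_id)
```

`index_text` never learns which store it was given, which is what lets the switch to a database happen without touching the parsing and lookup code.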

Sergey Alexandrovich Kryukov 2-Jun-16 13:24pm

Thank you, Matt.
That's a good point. Separating and abstracting out the data layer is key here (which I took for granted :-).
—SA

manishhsinam 3-Jun-16 9:16am

Thanks, guys, for the inputs; really helpful.
Let me add more complexity to the problem.
We need richer metadata.

Ex: "Busy day ahead"
I need to know:
-- count of unigrams, bigrams, trigrams [for all domain-specific corpora]

Wildcard queries as well
ex: count for all bigrams starting with "Busy"*.
ex: count for all trigrams starting with "Busy day"*. And so on.
Any suggestions on how to capture this info effectively in a DB or with the trie approach?
I am thinking of some kind of partially loaded trie approach.
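
A word-level trie is one possible fit for the queries listed above: each path of up to three tokens is an n-gram, and the node at the end of the path holds its per-domain count. The sketch below keeps everything in memory and does not attempt the partial loading mentioned; the class and method names are hypothetical, and tokenization is a plain lowercase whitespace split.

```python
class NgramNode:
    __slots__ = ("children", "counts")

    def __init__(self):
        self.children = {}  # next token -> NgramNode
        self.counts = {}    # domain_id -> count of the n-gram ending here


class NgramTrie:
    def __init__(self, max_n=3):
        self.root = NgramNode()
        self.max_n = max_n

    def add_text(self, domain_id, text):
        tokens = text.lower().split()
        for start in range(len(tokens)):
            node = self.root
            # Walk up to max_n tokens from each start position, counting
            # the unigram, bigram and trigram that begin there.
            for token in tokens[start:start + self.max_n]:
                node = node.children.setdefault(token, NgramNode())
                node.counts[domain_id] = node.counts.get(domain_id, 0) + 1

    def _find(self, tokens):
        node = self.root
        for token in tokens:
            node = node.children.get(token)
            if node is None:
                return None
        return node

    def count(self, ngram):
        """Exact n-gram count per domain."""
        node = self._find(ngram.lower().split())
        return dict(node.counts) if node else {}

    def count_extensions(self, prefix):
        """Per-domain total over all n-grams equal to prefix plus one word,
        e.g. prefix "busy" covers every bigram of the form "busy *"."""
        node = self._find(prefix.lower().split())
        totals = {}
        if node:
            for child in node.children.values():
                for domain_id, c in child.counts.items():
                    totals[domain_id] = totals.get(domain_id, 0) + c
        return totals


trie = NgramTrie()
trie.add_text(1, "Busy day ahead")
trie.add_text(2, "busy day today")
print(trie.count_extensions("busy day"))
```

With this layout a wildcard query only touches the direct children of one node, so its cost depends on how many distinct words follow the prefix rather than on the size of the corpus.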

Sergey Alexandrovich Kryukov 3-Jun-16 10:38am

You see, with this clarification you introduce a sharp turn from the initial formulation of the problem. It can all grow as far as natural language analysis, which would make the initially formulated problem ridiculously insignificant compared to that really advanced topic.
—SA

manishhsinam 3-Jun-16 13:40pm

I agree with your point; it is my bad, the two problems have different scopes.
But it is true that I need to solve both.
Are there any resources you guys can suggest that I can refer to?
Is there any open source available that could fit here?

I agree that, since the data is huge, I cannot think of loading everything in memory,
but I strongly feel the run-time data structure should take care of wildcard queries, to avoid overloading the DB with different combinations of n-grams.
Any suggestions will be highly appreciated.

Sergey Alexandrovich Kryukov 3-Jun-16 13:59pm

Resource? I think this is not about resources. It is you who needs to analyze the scope and picture some data model for the problem. This should be the starting point. Everything else should be developed based on general programming approaches, without any special resources. Right now, you are probably interested in some decisions on the data layer alone. Probably, the first decision would be between SQL and NoSQL, or some other storage not related to SQL or an SQL-like representation of the relational model (SQL is criticized for not being a pure relational model). I would say an SQL-based database would very likely be an adequate solution, but it's better to keep NoSQL in mind. Perhaps you have to read up on those concepts. How comfortable are you with the relational model?

If you manage to describe the data scope comprehensively, maybe we could advise in more detail on the data model and data layer. As I said, abstraction of the data layer is the key in your situation.

—SA
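
For the SQL option discussed above, one possible relational layout is a single table keyed by n-gram length, n-gram text, and domain ID. The sketch below uses SQLite purely as a stand-in; the table and column names are hypothetical, and the thread does not settle on any schema. The "starts with" query is written as a range comparison, which under SQLite's default binary collation can use the primary-key index.

```python
import sqlite3
from collections import Counter


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ngram ("
        " n INTEGER NOT NULL, gram TEXT NOT NULL,"
        " domain_id INTEGER NOT NULL, freq INTEGER NOT NULL,"
        " PRIMARY KEY (n, gram, domain_id))"
    )
    return db


def add_text(db, domain_id, text, max_n=3):
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[(n, " ".join(tokens[i:i + n]))] += 1
    for (n, gram), c in counts.items():
        # Create the row if missing, then increment it.
        db.execute(
            "INSERT OR IGNORE INTO ngram VALUES (?, ?, ?, 0)",
            (n, gram, domain_id),
        )
        db.execute(
            "UPDATE ngram SET freq = freq + ?"
            " WHERE n = ? AND gram = ? AND domain_id = ?",
            (c, n, gram, domain_id),
        )
    db.commit()


def prefix_counts(db, prefix, n):
    """Per-domain total of all n-grams of length n starting with prefix.

    Tokens are joined by a single space, so every match lies between
    prefix + " " and prefix + "!" ("!" is the character right after
    the space in ASCII order)."""
    low = prefix.lower() + " "
    high = prefix.lower() + "!"
    rows = db.execute(
        "SELECT domain_id, SUM(freq) FROM ngram"
        " WHERE n = ? AND gram >= ? AND gram < ? GROUP BY domain_id",
        (n, low, high),
    )
    return dict(rows.fetchall())


db = open_db()
add_text(db, 1, "Busy day ahead")
add_text(db, 2, "busy day today")
print(prefix_counts(db, "busy day", 3))
```

Storing every n-gram as its own row trades disk space for simple queries; whether that is acceptable depends on the corpus sizes, which the thread does not state.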


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)