I am trying to determine the type of MS Word documents and categorize them for a project. The aim of the project is document clustering (i.e. grouping) based on the content of the documents. The objective is semi-supervised learning: grouping documents using both labelled and unlabelled data. I am reading each document word by word in C#, but I can't find a way to categorize a document based on its content. Can anyone suggest a solution? Thanks.
Updated 21-Oct-12 21:30pm
Comments
Sergey Alexandrovich Kryukov 22-Oct-12 2:04am    
The question does not seem to make any sense, but you can try to explain if you think it does.
--SA
Zoltán Zörgő 22-Oct-12 13:14pm    
I suppose you downvoted the answers you don't like. If so, please keep in mind that the answers are not always at fault; it may be that the question is not well formulated.
If it wasn't you, then please disregard this comment.

That's called semantic analysis of a text. The easiest way is to define a set of words that are common for each specific document category. Then you compute statistics for the document over those word classes and pick the best-matching group.
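The word-set approach described above can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, chosen for brevity; the same logic can be written in C# with a dictionary of hash sets. The category names, the keyword sets, and the functions `tokenize`, `score_categories` and `classify` are all hypothetical placeholders, not part of any library, and the scoring rule (share of the document's words that fall in a category's set) is only one possible statistic.

```python
import re
from collections import Counter

# Hypothetical category vocabularies; replace them with words typical of your own corpus.
CATEGORY_KEYWORDS = {
    "finance": {"invoice", "payment", "budget", "tax"},
    "legal": {"contract", "clause", "liability", "court"},
    "technical": {"server", "database", "compile", "exception"},
}


def tokenize(text):
    """Lower-case the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def score_categories(text, category_keywords=CATEGORY_KEYWORDS):
    """Return, per category, the fraction of the document's words found in its keyword set."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in category_keywords}
    return {
        category: sum(counts[word] for word in keywords) / total
        for category, keywords in category_keywords.items()
    }


def classify(text, category_keywords=CATEGORY_KEYWORDS):
    """Pick the highest-scoring category, or None when no keyword matched at all."""
    scores = score_categories(text, category_keywords)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, `classify("The invoice lists the payment due and the tax owed.")` picks `"finance"`, because three of the ten words are in that keyword set and none are in the others. A document with no matching words returns `None`, which leaves it unlabelled instead of forcing it into a group.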
If you need a deeper analysis, you have to make use of a thesaurus (a semantic graph of a language). For English you can use this one: http://wordnet.princeton.edu/, but not every language has such a thesaurus already made :(
If you have to go even deeper, you will have to do research. Start here: http://www.sersc.org/journals/IJSIP/vol1_no1/papers/07.pdf, http://en.wikipedia.org/wiki/Document_classification
 
Comments
Legor 22-Oct-12 4:11am    
This is good advice for the question topic.
Zoltán Zörgő 22-Oct-12 13:12pm    
And why the "1" vote? I am really interested...
Word files have the extension .doc or .docx. Read all the files from your drive in a loop, put their contents into strings, check what each string contains (e.g. with string.Contains), and categorize the files accordingly.

Thanks,
Ambesha
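The loop suggested in this answer might look something like the sketch below, written in Python for brevity and using only the standard library. It assumes .docx files only (the binary .doc format is not handled) and assumes the main document part sits at its conventional path `word/document.xml`. The functions `read_docx_text` and `categorize_folder` and the `rules` mapping are hypothetical names introduced for illustration.

```python
import zipfile
from pathlib import Path
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(source):
    """Return the plain text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(source) as package:
        # Assumption: the main part is stored at the conventional location.
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the text nodes back together.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def categorize_folder(folder, rules):
    """Map each .docx file name in `folder` to the first category whose phrase it contains."""
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        text = read_docx_text(path).lower()
        results[path.name] = next(
            (category for category, phrase in rules.items() if phrase.lower() in text),
            None,  # no phrase matched; leave the file uncategorized
        )
    return results
```

Calling `categorize_folder("C:/docs", {"finance": "budget", "legal": "contract"})` (the folder path is only an example) would return a dictionary from file name to category. A plain substring test is crude: it matches inside longer words and stops at the first rule that fits, so it is best treated as a starting point.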
 
It's not clear what you mean by reading it "word by word", but if the document is being read using Open XML then you can just get the document properties (CoreFilePropertiesPart) and look for the subject, keywords or category.
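As a rough illustration of the metadata route, the sketch below reads the same three properties straight from the .docx package with Python's standard library instead of the Open XML SDK named in the answer. It assumes the core properties part is stored at its conventional path `docProps/core.xml`; `read_core_properties` is a hypothetical helper name. In C#, the equivalent values would come from the SDK's document properties.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside the core properties part of an Open XML package.
CP_NS = "{http://schemas.openxmlformats.org/package/2006/metadata/core-properties}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def read_core_properties(source):
    """Return subject, keywords and category from a .docx package (None when absent)."""
    with zipfile.ZipFile(source) as package:
        # Assumption: the core properties part sits at its conventional path.
        if "docProps/core.xml" not in package.namelist():
            return {"subject": None, "keywords": None, "category": None}
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "subject": root.findtext(DC_NS + "subject"),
        "keywords": root.findtext(CP_NS + "keywords"),
        "category": root.findtext(CP_NS + "category"),
    }
```

This only helps when the authors actually filled in those fields; many documents leave them empty, in which case the function returns `None` values and a content-based method is still needed.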
 
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


