What is the best way to find the category of an MS Word document in C#?

Question

I am trying to determine the type of MS Word documents and categorize them for a project. The aim of the project is document clustering (i.e., grouping) based on the content of the documents. The objective is semi-supervised learning: grouping documents based on both labelled and unlabelled data. I am reading each document word by word in C#, but I can't find a way to categorize a document based on its content. Can anyone suggest a solution? Thanks.

Posted 21-Oct-12 19:41pm

manikandansanthi

Updated 21-Oct-12 21:30pm

Comments

Sergey Alexandrovich Kryukov 22-Oct-12 2:04am

The question does not seem to make any sense, but you can try to explain if you think it does.
--SA

Zoltán Zörgő 22-Oct-12 13:14pm

I suppose you downvoted the answers you didn't like. In that case, please keep in mind that the answers are not always at fault; it may be that the question is not well formulated.
If it wasn't you, then please disregard this comment.

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

**Zoltán Zörgő** · Answer 1 · 2012-10-21T21:10:00

That's called semantic analysis of a text. The easiest way is to define a set of words that are common for a specific document category. Then you compute statistics for the document over those word classes, and you choose the best-matching group.
If you need deeper analysis, you have to make use of a thesaurus (a semantic graph of a language). For English you can use this one: http://wordnet.princeton.edu/, but not every language has such a thesaurus ready-made :(
If you have to go even deeper, you will have to do research. Start here: http://www.sersc.org/journals/IJSIP/vol1_no1/papers/07.pdf, http://en.wikipedia.org/wiki/Document_classification
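The keyword-set idea from the first paragraph of this answer could look roughly like the sketch below. It is only an illustration: the class name `KeywordCategorizer`, the two categories, and their keyword lists are made up for the example, and it assumes the document text has already been extracted into a string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordCategorizer
{
    // Hypothetical keyword sets; a real project would derive these
    // from the labelled documents.
    public static readonly Dictionary<string, string[]> CategoryKeywords =
        new Dictionary<string, string[]>
        {
            { "Finance", new[] { "invoice", "payment", "budget", "tax" } },
            { "Legal",   new[] { "contract", "clause", "liability", "court" } },
        };

    // Counts how many words of the text belong to each category's keyword set
    // and returns the category with the highest count, or null if nothing matched.
    public static string Categorize(string text)
    {
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };
        var words = text.ToLowerInvariant()
                        .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        int bestScore = 0;
        foreach (var pair in CategoryKeywords)
        {
            var keywords = new HashSet<string>(pair.Value);
            int score = words.Count(w => keywords.Contains(w));
            if (score > bestScore)
            {
                bestScore = score;
                best = pair.Key;
            }
        }
        return best;
    }
}
```

Raw counts favour long documents and large keyword lists, so dividing each score by the number of words in the document is a sensible next step before comparing categories.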

**Ambesha** · Answer 2 · 2012-10-21T21:31:00

Word files have the extension .doc or .docx. Read all the files from your drive in a loop, put their contents into strings, check what each string contains (e.g., with string.Contains), and categorize accordingly.

Thanks,
Ambesha
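A minimal sketch of the loop described in this answer might look as follows. The names `FolderCategorizer`, `CategorizeByContains`, and `CategorizeFolder` are invented for the example. The `extractText` callback is also hypothetical: turning a .doc or .docx file into plain text needs a separate library and is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FolderCategorizer
{
    // Returns the first category whose marker phrase occurs in the text
    // (case-insensitive), or "Uncategorized" when none does.
    // Each pair is (category name, marker phrase); order decides ties.
    public static string CategorizeByContains(
        string text, IEnumerable<KeyValuePair<string, string>> markers)
    {
        foreach (var marker in markers)
        {
            if (text.IndexOf(marker.Value, StringComparison.OrdinalIgnoreCase) >= 0)
                return marker.Key;
        }
        return "Uncategorized";
    }

    // Walks a folder, keeps only .doc/.docx files, and categorizes each one.
    // extractText is a caller-supplied (hypothetical) function that maps a
    // file path to the document's plain text.
    public static Dictionary<string, string> CategorizeFolder(
        string folder,
        Func<string, string> extractText,
        IEnumerable<KeyValuePair<string, string>> markers)
    {
        var result = new Dictionary<string, string>();
        foreach (var path in Directory.EnumerateFiles(folder))
        {
            var ext = Path.GetExtension(path);
            bool isWord = ext.Equals(".doc", StringComparison.OrdinalIgnoreCase)
                       || ext.Equals(".docx", StringComparison.OrdinalIgnoreCase);
            if (!isWord)
                continue;
            result[path] = CategorizeByContains(extractText(path), markers);
        }
        return result;
    }
}
```

A plain substring test also matches inside longer words ("tax" inside "syntax"), so this works best with distinctive marker phrases.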

**Mark Storen** · Answer 3 · 2012-10-21T20:34:00

It's not clear what you're doing reading it "word by word", but if the document is being read using Open XML then you can just get the document properties (CoreFilePropertiesPart) and look for the subject, keywords or category.
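As a rough illustration of reading those properties, the sketch below swaps the Open XML SDK for a dependency-free approach: a .docx file is a ZIP package, and Word conventionally stores the core properties in `docProps/core.xml`. That path is an assumption (strictly, the part is located through the package relationships), the class name `DocxCoreProperties` is made up, and none of this applies to the older binary .doc format.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxCoreProperties
{
    static readonly XNamespace Cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Reads category, subject and keywords from a .docx stream.
    // Properties that are not present come back as null.
    public static (string Category, string Subject, string Keywords) Read(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the core properties part sits at the path Word normally uses.
            var entry = zip.GetEntry("docProps/core.xml");
            if (entry == null)
                return (null, null, null);

            using (var stream = entry.Open())
            {
                var root = XDocument.Load(stream).Root;
                return ((string)root.Element(Cp + "category"),
                        (string)root.Element(Dc + "subject"),
                        (string)root.Element(Cp + "keywords"));
            }
        }
    }
}
```

Typical use would be `using (var fs = File.OpenRead(path)) { var props = DocxCoreProperties.Read(fs); }`. These fields are only useful when the authors actually filled them in, so many documents will return null for all three.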
