I am reading an MS Word document using C#. I want only the words (upper-case and lower-case letters), not spaces, commas, numbers, special characters, symbols, etc. Kindly help me with a good solution, with code. Thanks in advance.
Comments
Nitesh Kejriwal 13-Oct-12 4:56am    
Can you show the sample doc file you are trying to read?

1 solution

Hi,
A reliable, professional-grade solution is not a trivial task and requires a lot of programming. One good example available online is my free Semantic Analyzer, which extracts words and sentences from arbitrary text (multilingual, by the way) and then applies a concordance calculator to compute the frequency of word occurrences: Semantic Analyzer

In general, you must first get a string containing the plain text of interest (no formatting, etc.), then remove all special characters (such as ",", ":", ";") using either String.Replace() or a regular expression, and then apply String.Split() with " " as the separator. You will get an array of strings containing the words in the text. In a real-world solution you must do much more string processing, e.g. replacing runs of consecutive blank spaces "     " with a single one " ". As mentioned above, a complete production-grade solution goes far beyond the scope of a single article, and is also subject/domain-specific. You should probably start with a simple prototype and then tailor it to fit your particular case. For your immediate needs, you can use my free online semantic analyzer, which provides reasonable accuracy.
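As a rough illustration of the steps above (not the code behind the Semantic Analyzer), here is a minimal sketch. It assumes the plain text of the document is already in a string; the names WordExtractor and ExtractWords are hypothetical, made up for this example. It treats only A-Z and a-z as letters, as the question asks, so a word such as "don't" is split into "don" and "t".

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Hypothetical helper: returns the alphabetic words found in a plain-text string.
    public static string[] ExtractWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return new string[0];

        // Replace every run of non-letter characters (digits, punctuation,
        // symbols, whitespace) with a single space.
        // Assumption: only A-Z and a-z count as letters; a pattern such as
        // @"\P{L}+" could be used instead to keep letters of other alphabets.
        string lettersOnly = Regex.Replace(text, "[^A-Za-z]+", " ");

        // Split on the space separator; RemoveEmptyEntries drops the empty
        // strings produced by a leading or trailing space.
        return lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}

public static class Program
{
    public static void Main()
    {
        string sample = "Hello, World! 123 test-case; done.";

        // Prints one word per line: Hello, World, test, case, done
        foreach (string word in WordExtractor.ExtractWords(sample))
            Console.WriteLine(word);
    }
}
```

Because the regular expression collapses each run of unwanted characters into one space, the separate step of squeezing repeated spaces is not needed in this sketch.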

Kind regards,
AB
 
 
Comments
RaviRanjanKr 13-Oct-12 23:59pm    
My 5+
DrABELL 14-Oct-12 0:07am    
Thanks!
Marco Bertschi 16-Oct-12 5:21am    
Maybe the Microsoft.Office.Interop.Word assembly, the Word interop library that comes with the MS Office suite, provides some additional help, but I am not sure about it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


