Hi, how can I find abbreviations in a text file? Are there any rules or algorithms I can use?

Thanks
M.H
Comments
Richard MacCutchan 18-Jun-12 11:11am    
Abbreviations of what?
Sergey Alexandrovich Kryukov 18-Jun-12 11:54am    
I assumed the general-case abbreviations used in a language.
I answered, please see.
--SA
merh 18-Jun-12 13:25pm    
Thanks for your answer.
I am working with the Swedish language. I would like to find the abbreviations and replace them with the full words.
I already have a limited dictionary of abbreviations.
How can I recognize an abbreviation?
You say that words which cannot be recognized should make the whole text invalid.
How can I find the words that make the whole text invalid?
Thanks
Merh

1 solution

A basic feel for language should tell you that this is impossible with 100% confidence, even in theory. Languages simply do not work like that.

There are no characteristic features that universally distinguish an abbreviation from a "regular" word, and some abbreviations have the form of a "regular" word. It depends heavily on the language, though. In many Western languages, one sign of an abbreviation is the use of capital letters beyond the first one. This is not a reliable characteristic, because it produces a number of false negatives: the rule won't work on abbreviations like English "codec" or on those shamefully lazy abbreviations like "math", "rehab" and many more.
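To make the capital-letter rule concrete, here is a minimal sketch in Python. The function names `looks_like_abbreviation` and `find_candidates` are my own and purely illustrative. The rule is only a heuristic: it finds candidates, not confirmed abbreviations.

```python
import re
from typing import List

def looks_like_abbreviation(word: str) -> bool:
    """Heuristic only: True if any character after the first is uppercase."""
    return any(ch.isupper() for ch in word[1:])

def find_candidates(text: str) -> List[str]:
    """Return the words in `text` that match the capital-letter heuristic."""
    # \w+ matches Unicode letters too, so Swedish characters are handled.
    return [w for w in re.findall(r"\w+", text) if looks_like_abbreviation(w)]

print(find_candidates("The HTTP codec and the PhD student"))
# -> ['HTTP', 'PhD']   ("codec" is missed: a false negative)
```

Note that the same rule would also flag a brand name such as "iPhone", so it can produce false positives as well.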

Another method would be to use a dictionary, treating as abbreviations both the words marked in the dictionary as abbreviations and the words not found in the dictionary at all. This method would give a lot of false negatives and false positives. Note that some abbreviations are intentionally designed to match previously existing "real" words exactly, such as "DRY" ("Don't Repeat Yourself"), "KISS" ("Keep It Simple, Stupid") and many more. Besides, this method can hardly work at all for complex languages like Russian, Hungarian, etc., where dictionaries usually do not list all forms of a word (there can be hundreds of them). The dictionary user is expected to apply the grammar rules to derive the word form in question; only exceptions to the rules may be listed, and not always.
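The dictionary method can be sketched like this. The tiny `TOY_DICTIONARY` and the function name `dictionary_candidates` are assumptions for illustration only; a real dictionary would be far larger and would still have the problems described above.

```python
import re
from typing import Dict, List

# Hypothetical toy dictionary: word -> True if the entry is marked as an abbreviation.
TOY_DICTIONARY: Dict[str, bool] = {
    "keep": False, "the": False, "code": False,
    "dry": False, "in": False, "etc": True,
}

def dictionary_candidates(text: str, dictionary: Dict[str, bool]) -> List[str]:
    """Flag words marked as abbreviations and words missing from the dictionary."""
    candidates = []
    for word in re.findall(r"\w+", text):
        marked = dictionary.get(word.lower())  # None means "not found"
        if marked is None or marked:
            candidates.append(word)
    return candidates

print(dictionary_candidates("Keep the code DRY in Stockholm etc", TOY_DICTIONARY))
# -> ['Stockholm', 'etc']
```

Both failure modes show up even in this small example: "Stockholm" is flagged only because it is missing from the dictionary (a false positive), while "DRY" is missed because it matches the regular word "dry" (a false negative).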

A real solution to the problem could be found within the framework of full semantic analysis of the context, but at the current state of linguistics and computer science, no fully successful system of that kind exists yet.

[EDIT]

In practice, you can simply keep a very limited dictionary of the abbreviations allowed in a given context. With this approach, any word that cannot be recognized should make the whole text invalid.
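A rough sketch of this approach in Python, using Swedish since that is the language mentioned in the comments. The `ABBREVIATIONS` and `VOCABULARY` contents and the function name `expand_or_reject` are illustrative assumptions. The tokenization is deliberately naive (split on whitespace, keep periods inside tokens), so treat it as a starting point rather than a finished solution.

```python
from typing import Dict, List, Set, Tuple

# Illustrative assumptions: a small allowed-abbreviation list and a known vocabulary.
ABBREVIATIONS: Dict[str, str] = {
    "t.ex.": "till exempel",
    "bl.a.": "bland annat",
    "osv.": "och så vidare",
}
VOCABULARY: Set[str] = {"vi", "äter", "frukt", "äpplen", "päron"}

def expand_or_reject(text: str, abbreviations: Dict[str, str],
                     vocabulary: Set[str]) -> Tuple[str, List[str]]:
    """Expand allowed abbreviations; collect tokens that are not recognized.

    A non-empty list of unknown tokens means the whole text is invalid.
    """
    expanded, unknown = [], []
    for raw in text.split():
        token = raw.strip(",;:!?")  # periods stay: they may belong to an abbreviation
        key = token.lower()
        if not token:
            expanded.append(raw)  # the token was pure punctuation
        elif key in abbreviations:
            expanded.append(raw.replace(token, abbreviations[key]))
        elif key.rstrip(".") in vocabulary:
            expanded.append(raw)  # known word, possibly ending a sentence
        else:
            unknown.append(token)
            expanded.append(raw)
    return " ".join(expanded), unknown

print(expand_or_reject("Vi äter frukt, t.ex. äpplen osv.", ABBREVIATIONS, VOCABULARY))
# -> ('Vi äter frukt, till exempel äpplen och så vidare', [])
print(expand_or_reject("Vi äter frukt m.m.", ABBREVIATIONS, VOCABULARY)[1])
# -> ['m.m.']   (not recognized, so the text is rejected)
```

The list of unknown tokens answers the follow-up question directly: those are the words that make the text invalid, and each one has to be added either to the vocabulary or to the abbreviation list by a human.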

—SA
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


