Finding abbreviations in a text file

Question

See more:

C#

Windows

C#4.0

Hi, how can I find abbreviations in a text file? Are there any rules or algorithms which I can use?

Thanks
M.H

Posted 18-Jun-12 4:56am

merh

Comments

Richard MacCutchan 18-Jun-12 11:11am

Abbreviations of what?

Sergey Alexandrovich Kryukov 18-Jun-12 11:54am

I assumed you meant abbreviations in the general sense, as used in a language.
I have answered; please see my solution.
--SA

merh 18-Jun-12 13:25pm

Thanks for your answer.
I am working with the Swedish language. I would like to find the abbreviations and replace them with the full words.
I already have a limited dictionary of abbreviations.
How can I recognize an abbreviation?
You say that words which cannot be recognized should make the whole text invalid.
How can I find the words which make the whole text invalid?
Thanks
Merh

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2012-06-18T05:52:00

A basic feel for language should tell you that this is impossible, even in theory, with 100% confidence. Languages simply do not work like that.

There are no characteristic features which can universally tell an abbreviation from a "regular" word, and some abbreviations have the form of a "regular" word. It depends heavily on the language, though. In many Western languages, one sign of an abbreviation is the use of capital letters beyond the first one. This is not a reliable characteristic, though, because the method gives a number of false negatives. The rule won't work on abbreviations like English "codec" and those shameful lazy abbreviations like "math", "rehab" and a lot more.
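
As a rough illustration of the capital-letter rule only, here is a minimal C# sketch. The names `CapitalHeuristic`, `LooksLikeAbbreviation` and `FindCandidates` are made up for this example, and the tokenization (runs of Unicode letters) is an assumption, not something the rule itself prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class CapitalHeuristic
{
    // A run of letters from any alphabet, so Swedish å, ä and ö are covered.
    private static readonly Regex Word = new Regex(@"\p{L}+");

    // True when any character after the first one is an uppercase letter.
    public static bool LooksLikeAbbreviation(string word)
    {
        for (int i = 1; i < word.Length; i++)
        {
            if (char.IsUpper(word[i]))
            {
                return true;
            }
        }
        return false;
    }

    // Returns the candidate abbreviations in the order they appear in the text.
    public static List<string> FindCandidates(string text)
    {
        var result = new List<string>();
        foreach (Match m in Word.Matches(text))
        {
            if (LooksLikeAbbreviation(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

For the input "The NASA codec uses MPEG for math." this should report only "NASA" and "MPEG"; "codec" and "math" slip through, which is exactly the false-negative problem described above.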

Another method would be to use a dictionary and treat as abbreviations both the words the dictionary marks as abbreviations and the words it does not list at all. This method would give a lot of false negatives and false positives. Note that some abbreviations are intentionally designed to match an existing "real" word exactly, such as "DRY" ("Don't Repeat Yourself"), "KISS" ("Keep It Simple, Stupid") and a lot more. Besides, this method can hardly work at all on heavily inflected languages like Russian, Hungarian, etc., where dictionaries usually do not list all forms of a word (there can be hundreds of them). The dictionary user is expected to apply the grammar rules to derive the word form in question; only the exceptions to the rules may be listed, and not always.
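
A minimal sketch of that dictionary lookup, assuming a tiny hard-coded word list; `WordKind`, `DictionaryClassifier` and the entries are all hypothetical and only serve to show how the false results arise.

```csharp
using System;
using System.Collections.Generic;

public enum WordKind { Regular, Abbreviation, Unknown }

// Hypothetical classifier for illustration; the entries are sample data.
public static class DictionaryClassifier
{
    // The value is true when the dictionary marks the entry as an abbreviation.
    private static readonly Dictionary<string, bool> Entries =
        new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase)
        {
            { "dry", false },
            { "kiss", false },
            { "codec", true },
            { "the", false }
        };

    public static WordKind Classify(string word)
    {
        bool marked;
        if (!Entries.TryGetValue(word, out marked))
        {
            // Not listed at all: this method treats it as an abbreviation candidate.
            return WordKind.Unknown;
        }
        return marked ? WordKind.Abbreviation : WordKind.Regular;
    }
}
```

With this data, `Classify("DRY")` comes back as `Regular` (a false negative, because the lookup ignores case and "dry" is an ordinary word), while a surname or any inflected form missing from the list comes back as `Unknown` (a false positive).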

A real solution would require full semantic analysis of the context, but at the present level of development of linguistics and computer science, no fully successful system of that kind exists so far.

[EDIT]

In practice, you can simply keep a very limited dictionary of the abbreviations allowed in a certain context. With this approach, any word which cannot be recognized should make the whole text invalid.
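
Something like the following could serve as a starting point for the limited-dictionary approach. It is only a sketch: `AbbreviationExpander`, `TryExpand` and both tables are hypothetical, the Swedish entries are sample data, and splitting on whitespace is an assumption that will not suit every text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; a real application would load both tables from files.
public static class AbbreviationExpander
{
    private static readonly Dictionary<string, string> Abbreviations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "t.ex.", "till exempel" },
            { "bl.a.", "bland annat" },
            { "osv.", "och så vidare" }
        };

    private static readonly HashSet<string> KnownWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "vi", "äter", "frukt", "äpplen", "och", "päron"
        };

    // Expands known abbreviations. Returns false and fills 'unrecognized' when
    // any token is neither an allowed abbreviation nor a known word.
    public static bool TryExpand(string text, out string expanded, out List<string> unrecognized)
    {
        unrecognized = new List<string>();
        var output = new List<string>();
        string[] tokens = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                     StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            string full;
            // Surrounding punctuation is ignored only for the ordinary-word lookup.
            string bare = token.Trim('.', ',', ';', ':', '!', '?');
            if (Abbreviations.TryGetValue(token, out full))
            {
                output.Add(full);
            }
            else if (KnownWords.Contains(bare))
            {
                output.Add(token);
            }
            else
            {
                unrecognized.Add(token);
            }
        }
        expanded = unrecognized.Count == 0 ? string.Join(" ", output) : null;
        return unrecognized.Count == 0;
    }
}
```

The `unrecognized` list is what tells you which words made the text invalid. Note the limits of the sketch: an abbreviation followed directly by a comma ("t.ex.,") is not matched, and an abbreviation at the end of a sentence takes the sentence's full stop with it.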

—SA