Develop Your Own Language Translation System

10 Aug 2010, CPOL, 4 min read
Understanding an Example Based Machine Translation (EBMT) system and how to create your own using existing tools

Abstract

This article describes the development of an Example Based Machine Translation (EBMT) system, written in Java on the Linux platform, for translating from one language to another. In this particular case, I translate English sentences into Hindi. The principle behind EBMT is simple: the system decides on an appropriate translation of an input sentence by analyzing the pre-translated sentences in its database. Therefore, the larger the database of pre-translated sentences, the greater the accuracy of the EBMT system.

This article is greatly inspired by the works of Ralf Brown and Balakrishnan who have done extensive research in this field.

Introduction and Background

Example based translation is essentially translation by analogy. If an EBMT system is given a set of sentences in the source language (the language one is translating from) and their corresponding translations in the target language, it can use these examples to translate other, similar source language sentences into the target language. The basic premise is that if a previously translated sentence occurs again, the same translation is likely to be correct again.

Software Used

Developing your own machine translation system is a difficult task. However, there are some tools that can help accelerate the process. I used the following tools in my EBMT system:

  1. Moses Decoder
  2. GIZA++
  3. IRSTLM

Block Diagram

EBMT.png

Description

I divided the entire EBMT system into four modules.

1. Module I: Exact Match Algorithm

In this module, the input English sentence is first checked against every sentence in the available bilingual corpus for an exact match. If a match is found, the corresponding Hindi sentence is retrieved and displayed as output.

When the input is a paragraph, it is first broken down into sentences, and each sentence is then translated one by one.
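
The exact-match step can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the class `ExactMatchTranslator` and its methods are hypothetical names, the corpus is held in an in-memory hash map instead of being scanned sentence by sentence, and matching is assumed to ignore case and trailing punctuation. Sentence splitting uses the JDK's `java.text.BreakIterator`.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of Module I; class and method names are hypothetical.
final class ExactMatchTranslator {
    // Maps a normalized English sentence to its stored Hindi translation.
    private final Map<String, String> corpus = new HashMap<>();

    void addExample(String english, String hindi) {
        corpus.put(normalize(english), hindi);
    }

    // Assumption: matching ignores case, extra whitespace and final punctuation.
    static String normalize(String sentence) {
        return sentence.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[.!?]+$", "")
                .replaceAll("\\s+", " ");
    }

    // Returns the stored translation, or empty if there is no exact match.
    Optional<String> translateSentence(String english) {
        return Optional.ofNullable(corpus.get(normalize(english)));
    }

    // Breaks a paragraph into sentences so each can be looked up separately.
    static List<String> splitSentences(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = paragraph.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        ExactMatchTranslator translator = new ExactMatchTranslator();
        translator.addExample("Anshul plays football", "Anshul football khelta hai");
        for (String sentence : splitSentences("Anshul plays football. Anshul plays cricket.")) {
            System.out.println(sentence + " -> "
                    + translator.translateSentence(sentence).orElse("[no exact match]"));
        }
    }
}
```

Sentences without an exact match come back as an empty `Optional`, which is the point at which the later modules would take over.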

2. Module II: Sentence Rule Based Translation

Every language has a grammar that describes how the words in a sentence should be organized. For instance, consider English vs. Hindi. English follows Subject-Verb-Object (SVO) word order, while Hindi follows Subject-Object-Verb (SOV) word order. To illustrate, compare the following two sentences:

English: Anshul plays football

Hindi: Anshul football khelta hai

This module converts the input sentence into a tokenized format in which specific words are replaced by their grammatical roles. For example, the above English sentence is converted to

<Subject> plays <Object>

This helps in generalizing the translation process.

Besides this, there are many other linguistic rules that must be taken into consideration while translating sentences.
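
One way to picture a sentence rule is as a pair of templates that share slot names. The sketch below is only an illustration under simplifying assumptions: `TemplateTranslator` and its methods are hypothetical, every rule is assumed to contain both a `<Subject>` and an `<Object>` slot, each slot is assumed to hold exactly one word, and agreement rules (gender, number, tense) are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of Module II; all names here are hypothetical.
final class TemplateTranslator {
    private static final class Rule {
        final Pattern english;      // English template compiled to a regex
        final String hindiTemplate; // Hindi template using the same slot names

        Rule(Pattern english, String hindiTemplate) {
            this.english = english;
            this.hindiTemplate = hindiTemplate;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Turns "<Subject> plays <Object>" into a regex with one named group per slot.
    // Assumption: each slot is filled by exactly one word.
    private static Pattern compile(String template) {
        String regex = Pattern.quote(template)
                .replace("<Subject>", "\\E(?<subject>\\S+)\\Q")
                .replace("<Object>", "\\E(?<object>\\S+)\\Q");
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Assumption: both templates contain the <Subject> and <Object> slots.
    void addRule(String englishTemplate, String hindiTemplate) {
        rules.add(new Rule(compile(englishTemplate), hindiTemplate));
    }

    // Returns the translation produced by the first matching rule, if any.
    Optional<String> translate(String sentence) {
        for (Rule rule : rules) {
            Matcher m = rule.english.matcher(sentence.trim());
            if (m.matches()) {
                return Optional.of(rule.hindiTemplate
                        .replace("<Subject>", m.group("subject"))
                        .replace("<Object>", m.group("object")));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TemplateTranslator translator = new TemplateTranslator();
        // SVO on the English side, SOV on the Hindi side.
        translator.addRule("<Subject> plays <Object>", "<Subject> <Object> khelta hai");
        System.out.println(translator.translate("Anshul plays football").orElse("[no rule matched]"));
    }
}
```

Because the slots are named and not positional, the Hindi template is free to place the object before the verb, which is how the SVO to SOV reordering is expressed here.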

3. Module III: Phrase Decoder

When the first two modules fail to translate the input, we divide the sentence into phrases and run algorithms based on statistical machine translation over them to find the most probable translation of the input sentence.

Mathematically, we try to find out:

$$H^* = \arg\max_{H} P(H \mid E) \qquad (1)$$

This may look complicated, so let me explain how we arrive at this equation.

According to Bayes' theorem,

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In this case, we need to find the translated sentence A that has the maximum probability of being the correct translation of a given input sentence B. Since B is fixed while we search over the candidates A, P(B) is a constant and does not affect which A attains the maximum.

Thus, we want:

$$A^* = \arg\max_{A} P(A \mid B)$$

$$A^* = \arg\max_{A} \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$A^* = \arg\max_{A} P(B \mid A)\,P(A)$$

With A = H (the Hindi sentence) and B = E (the English sentence), this is the same as (1).

This module tries to find the most probable Hindi translation of an English sentence by finding, for each English phrase, the Hindi phrase H that maximizes $P(E \mid H)\,P(H)$. The translated phrases are then joined together to complete the sentence.

Note:

  • $P(H)$ = [language model probability]:

    I used the IRST Language Model (IRSTLM), which measures fluency: it estimates the probability of a Hindi sentence and so provides a set of fluent sentences to test as potential translations.

  • $P(E \mid H)$ = [translation model probability H->E]:

    I used GIZA++, which measures faithfulness: it estimates the probability of an English sentence given a Hindi sentence, and so tests whether a given fluent sentence is a translation of the input.

  • $\arg\max_{H}$

    I used the Moses Decoder, which uses heuristic search to find H* effectively and efficiently.
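
To make the arg max concrete, here is a toy sketch that scores a handful of candidate Hindi sentences by the product of a translation-model probability and a language-model probability, and keeps the best one. It does not call Moses, GIZA++ or IRSTLM: the class `NoisyChannelDemo` is hypothetical, and the probabilities are made-up numbers chosen only to show the mechanics.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the arg max in equation (1); this is not the Moses API.
final class NoisyChannelDemo {
    static final class Candidate {
        final String hindi;
        final double translationProb; // stands in for the translation model score of E given H
        final double languageProb;    // stands in for the language model score of H

        Candidate(String hindi, double translationProb, double languageProb) {
            this.hindi = hindi;
            this.translationProb = translationProb;
            this.languageProb = languageProb;
        }

        // Sum of logs equals the log of the product; working in log space
        // avoids underflow when many small probabilities are multiplied.
        double score() {
            return Math.log(translationProb) + Math.log(languageProb);
        }
    }

    // Exhaustive arg max over the candidates; returns null for an empty list.
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.score() > best.score()) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up probabilities for the input "Anshul plays football".
        List<Candidate> candidates = Arrays.asList(
                new Candidate("Anshul football khelta hai", 0.6, 0.05),
                new Candidate("Anshul khelta hai football", 0.6, 0.001),
                new Candidate("football Anshul khelta hai", 0.6, 0.002));
        System.out.println(best(candidates).hindi);
    }
}
```

All three candidates are equally faithful in this toy setup, so the language-model term decides the winner by preferring the fluent SOV order. A real decoder cannot enumerate every candidate like this, which is why a heuristic search is needed.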

4. Module IV: Word Decoder

This is the last attempt by the EBMT system to translate the input sentence. When Module III also fails, the system breaks the sentence into words. For every word, it looks up the dictionary translation and simply stitches the results into a translated sentence.
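
A word-by-word fallback of this kind might look like the sketch below. The class `WordDecoder` is a hypothetical name, the dictionary is a plain in-memory map, and copying unknown words through unchanged is an assumption and not something the original system is known to do.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch of Module IV; names are hypothetical.
final class WordDecoder {
    private final Map<String, String> dictionary = new HashMap<>();

    void addEntry(String english, String hindi) {
        dictionary.put(english.toLowerCase(Locale.ROOT), hindi);
    }

    // Translates each word independently and joins the results in the original order.
    String translate(String sentence) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : sentence.trim().split("\\s+")) {
            // Assumption: words missing from the dictionary are copied through unchanged.
            out.add(dictionary.getOrDefault(word.toLowerCase(Locale.ROOT), word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        WordDecoder decoder = new WordDecoder();
        decoder.addEntry("plays", "khelta hai");
        System.out.println(decoder.translate("Anshul plays football"));
    }
}
```

For "Anshul plays football" this yields "Anshul khelta hai football": every word is translated, but the English word order is kept, which is why this module is only used as a last resort.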

Setup of EBMT

Basic preparation of an EBMT system requires you to do the following:

  1. Develop a bilingual corpus of pre-translated sentences from the source language to the target language.
  2. Once you have a corpus of decent size, install GIZA++, Moses and IRSTLM on your system.
  3. IRSTLM requires a monolingual file as well. This can easily be created by separating the two languages of the bilingual corpus.
  4. Finally, train on your corpus with GIZA++. At the back end, shell scripts and Perl scripts are run that compute probabilities and generate various files such as the alignment file, translation table, fertility file, distortion table, etc.

EBMT2.png
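
Separating the bilingual corpus into monolingual files, as mentioned in step 3, could be done along these lines. The corpus format is an assumption: the sketch supposes one sentence pair per line, with the English and Hindi sentences separated by a tab. `CorpusSplitter` is a hypothetical helper, not part of any of the tools above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper; assumes one "english<TAB>hindi" pair per line.
final class CorpusSplitter {
    // Returns one column (0 = English, 1 = Hindi) from every line, preserving line order.
    static List<String> column(List<String> bilingualLines, int index) {
        List<String> out = new ArrayList<>();
        for (String line : bilingualLines) {
            String[] parts = line.split("\t", -1);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
            }
            out.add(parts[index].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Anshul plays football\tAnshul football khelta hai");
        System.out.println("English side: " + column(corpus, 0));
        System.out.println("Hindi side:   " + column(corpus, 1));
    }
}
```

Each resulting list can then be written to its own file, for example with `java.nio.file.Files.write`, one sentence per line. Keeping the line order intact means line N of the English file still corresponds to line N of the Hindi file.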

Result

Training with GIZA++ took 1.5 days, after which my EBMT system was ready!

EBMT4.png

Future Work

Machine translation is a research field with a lot of work already done and a lot more yet to be done. I merely demonstrated how you can use existing tools to create your own machine translation system. This is my first step towards innovation and I have a long way to go...

History

  • 11th August, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

