What is statistical machine translation?

Machine translation

Machine translation systems automatically transfer a source text into a target language without the direct involvement of a professional translator. The resulting rough translation is subsequently corrected and revised by specially trained editors, so-called post-editors. A distinction is currently made between rule-based and statistical machine translation (MT) on the one hand and neural machine translation (NMT) on the other.

Rule-based machine translation systems analyze and translate a source text using extensive dictionaries and grammatical rules. The more sophisticated the linguistic rule set for the respective language pair and the more specific and extensive the technical vocabulary in the dictionaries, the more precise the final result in terms of lexis, grammar and syntax, factual correctness and general comprehensibility. Style and flow, however, remain rather mechanical and often appear stilted. This method and its hybrid forms are the basis of locally installed software applications such as Apertium, Systran®, Promt® or Babylon®.
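To make the rule-based principle more tangible, here is a deliberately minimal Python sketch. The toy dictionary, the single pass-through rule and the example sentence are invented purely for illustration and are in no way representative of the products named above, which use far richer lexica and grammar rules.

```python
# Minimal sketch of the rule-based idea: look each word up in a bilingual
# dictionary and apply one simple "rule" (unknown words pass through).
# Dictionary and example are invented toy data.

DICTIONARY = {
    "der": "the", "hund": "dog", "beisst": "bites", "den": "the", "mann": "man",
}

def rule_based_translate(sentence: str) -> str:
    tokens = sentence.lower().split()
    translated = [DICTIONARY.get(tok, tok) for tok in tokens]
    return " ".join(translated)

if __name__ == "__main__":
    print(rule_based_translate("Der Hund beisst den Mann"))
    # -> "the dog bites the man"
```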

Statistical machine translation systems are based on as large and broadly diversified a corpus of existing texts as possible for a given language pair. These bilingual data sets are analyzed for frequency and consistency using purely quantitative methods and compared with the text to be translated. Based on the statistical approximation values obtained, the most similar sentence fragments are combined into a translation. Although the hit rate tends to rise as the sentence database grows, translation quality appears to decline again beyond a certain volume of reference text, because the matches found then become too diffuse and arbitrary.
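The frequency principle described above can be illustrated with a short Python sketch. The aligned phrase pairs below are invented toy data; real statistical systems work with millions of phrase pairs and combine frequency counts with additional language-model scores.

```python
# Toy illustration of the statistical principle: count how often source
# fragments co-occur with target fragments in a tiny, invented parallel
# corpus, then pick the most frequent match for each fragment.
from collections import Counter, defaultdict

PARALLEL_FRAGMENTS = [
    ("guten morgen", "good morning"),
    ("guten morgen", "good morning"),
    ("guten abend", "good evening"),
    ("vielen dank", "thank you very much"),
    ("vielen dank", "many thanks"),
    ("vielen dank", "thank you very much"),
]

def build_phrase_table(pairs):
    table = defaultdict(Counter)
    for src, tgt in pairs:
        table[src][tgt] += 1
    return table

def translate_fragment(fragment, table):
    candidates = table.get(fragment)
    if not candidates:
        return fragment  # unknown fragments are passed through
    # Choose the statistically most frequent target fragment.
    return candidates.most_common(1)[0][0]

if __name__ == "__main__":
    table = build_phrase_table(PARALLEL_FRAGMENTS)
    print(translate_fragment("vielen dank", table))   # -> "thank you very much"
    print(translate_fragment("guten abend", table))   # -> "good evening"
```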

Statistical machine translations are usually stylistically fluent and relatively easy to read, but often incomplete and misleading in content, because syntactic relationships are frequently reproduced incorrectly and the appropriate expressions and technical terms are not used. This method is characteristic of online translation providers such as Google Translate®, KantanMT®, Asia Online® or Yandex Translate®.

Neural translation systems combine the components of a language according to the principle of qualitative similarity. They analyze the conceptual contexts of language elements in large text corpora using deep-learning algorithms. Across several neural processing layers, this statistical method builds an abstract language model that can then be applied to input texts. The application and output of this "learned" model in the target language follows the principle of probability.
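The probability principle can be sketched in a few lines of Python. The candidate words and scores below are invented and merely illustrate how a softmax turns a model's raw scores into output probabilities; they do not come from any trained system.

```python
# Sketch of the probability principle behind neural translation output:
# at each step the model assigns a score to every candidate target word,
# a softmax turns the scores into probabilities, and the most probable
# candidate is emitted. The scores are invented for illustration.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["house", "home", "building", "hut"]
logits = [2.1, 1.3, 0.4, -0.5]  # hypothetical raw model scores

probs = softmax(logits)
for word, p in sorted(zip(candidates, probs), key=lambda x: -x[1]):
    print(f"{word:10s} {p:.2f}")
# The decoder would output "house", the candidate with the highest probability.
```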

Because of the computing power required, neural translation systems run on high-performance servers with powerful graphics processors that are accessed via the Internet.

There are currently two different types of neural translation system: on the one hand, recurrent neural networks (RNN), whose development began at the end of 2014 in the course of speech-recognition research, and on the other hand, convolutional neural networks (CNN), which emerged from deep-learning approaches to the machine processing of image and audio data.

The difference between the two neural translation methods lies in the structure of the training data on which they are based. While convolutional neural networks process a fixed number of morphological word fragments in parallel across many layers, recurrent networks work sequentially on inputs of arbitrary length based on whole words.
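The contrast can be sketched with invented numbers in place of learned weights: the recurrent pass consumes whole-word vectors one after another and carries a hidden state forward, while the convolutional pass scores fixed-width windows of sub-word fragments independently of one another, which is what allows parallel processing.

```python
# Contrast sketch of the two processing styles, with toy numbers instead of
# learned weights: a recurrent step depends on the previous hidden state,
# while a convolution slides a fixed-width window over sub-word fragments
# and could score all windows in parallel.

def recurrent_pass(word_vectors):
    """Sequential: each step depends on the previous hidden state."""
    hidden = 0.0
    for vec in word_vectors:               # arbitrary-length input
        hidden = 0.5 * hidden + sum(vec)   # stand-in for the learned update
    return hidden

def convolutional_pass(fragment_vectors, width=3):
    """Parallel: every window of fixed width is scored independently."""
    windows = [fragment_vectors[i:i + width]
               for i in range(len(fragment_vectors) - width + 1)]
    return [sum(sum(v) for v in window) for window in windows]

words = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.5]]     # whole-word embeddings
fragments = [[0.1], [0.3], [0.2], [0.4], [0.1]]  # sub-word fragments

print(recurrent_pass(words))          # one value built up step by step
print(convolutional_pass(fragments))  # one score per window, order-independent
```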

Both NMT approaches achieve significantly more convincing translation results than all previous translation engines. Short sentences with a limited vocabulary are translated fluently, lexically correctly and almost perfectly idiomatically. Complex sentence structures, specialist terminology and words with a very low frequency of occurrence, on the other hand, are not adequately rendered even by neural translation engines. As with earlier statistical MT methods, lexical misinterpretations, inconsistencies, omissions and incorrect sentence references also occur with neural machine translation.

As a result, providers of translation engines are now beginning to offer intelligent hybrid solutions that combine rule-based and statistical translation technologies with a neural network. Providers of rule-based RNN translation engines include Personal Translator® and Systran Pure Neural Machine Translation®. Statistical RNN translation engines are used, for example, by Microsoft Translator® and Google Translate®. Providers such as DeepL® and Facebook®, in contrast, use machine translation technologies based on convolutional neural networks.

For more information on the technologies used for translation, we are at your disposal with detailed specialist advice.