How can machine learning help humans better understand language phenomena? Part I

by Metaminds / 10 February


In this article, I will talk briefly about how statistical approaches and statistical learning have transformed our understanding of language. The focus will be on the language of translations, without going too deep into the history of the field. Translations have been an important part of human interaction throughout history, so understanding all aspects of this process is essential for producing better translations and facilitating communication.


Corpus-based translation studies emerged in the early ’80s, when researchers observed that translations show a distributional pattern of word occurrences that differs significantly from the one observed in original texts written in the same language. Translations were nicknamed “the third code”: a language variety, or lect, that is clearly different from the source language and, more strikingly, from texts originally written in the target language. The term translationese was coined from translation + -ese (as in Chinese, Portuguese, Legalese) to highlight that a translation is a type of language variety that emerges from the contact between the source and target languages. These differences appear regardless of the proficiency of the translator or the quality of the translation; they are visible as a statistical phenomenon in which certain patterns of the source language are transferred into the target texts.


Second-language acquisition research has long been acquainted with the terms “language transfer” and “interlanguage”, which denote a similar process in which features of a speaker’s native language are transferred during the language acquisition phase. In 1979, Gideon Toury proposed a theory of translation based on these language acquisition principles, adding that some translation phenomena are linguistic universals and appear with a strong tendency regardless of the source and target languages. Unlike language learners, professional translators usually translate into their mother tongue, which helps keep the output as close as possible to the norms of the target language.


Together with the development of computational approaches, more and more research has shifted towards empirical hypothesis testing at the corpus level. More exactly, corpus studies became central to the development of translation studies and linguistic hypotheses, revealing several distributional phenomena that characterize translated texts across corpora, genres, and languages. Among the most important phenomena, we count 1) simplification, the tendency of translators to make do with fewer words or to produce language that is closer to conventional grammaticality, and 2) explicitation, described as a rise in the level of cohesive explicitness in the target-language texts.
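To make these two phenomena more concrete, here is a minimal Python sketch of the kind of surface measures that are often used as rough proxies for them. It is an illustration under my own assumptions, not the method of any particular study: the tokenizer is deliberately crude, the function names are hypothetical, and the CONNECTIVES list is an invented example of cohesive markers. Type-token ratio and mean sentence length stand in for simplification, and the share of connectives stands in for explicitation.

```python
import re

# Illustrative list of cohesive markers; an assumption, not taken from the article.
CONNECTIVES = {"because", "therefore", "however", "thus", "moreover"}

def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Distinct words divided by total words; lower values suggest less lexical variety."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    """Average number of word tokens per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

def connective_rate(text):
    """Share of tokens that are connectives; a rough proxy for cohesive explicitness."""
    tokens = tokenize(text)
    return sum(t in CONNECTIVES for t in tokens) / len(tokens) if tokens else 0.0
```

Comparing these values between a set of translated texts and a set of comparable originals gives a first, very coarse picture. Type-token ratio in particular depends on text length, so the texts being compared should be of similar size.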


Corpus linguistics has changed drastically over the years. As John McHardy Sinclair stated in 1991: “Thirty years ago when this research started it was considered impossible to process texts of several million words in length. Twenty years ago it was considered marginally possible but lunatic. Ten years ago it was considered quite possible but still lunatic. Today it is very popular.”


The methodology employed to study translation-related phenomena has also changed over the years. In the beginning, word counts combined with basic statistical modeling dominated the analysis, and the majority of interpretations were drawn from specific examples and thorough manual investigation. With the development of statistical learning, new methods such as text classification, regression analysis, neural networks, and generic tools of artificial intelligence have been employed to analyze the differences between translations and originals. Processing billion-word corpora is a feasible task nowadays, so (computational) linguists employ AI tools such as BERT, or language models fine-tuned on different language varieties, to identify whether the data presents statistically learnable differences. Besides being useful for a handful of researchers who aim to contribute to our understanding of language phenomena, you may ask, what other use cases are there for such investigations?

Machine translation is one of the most important fields where translationese continues to have a significant impact. Beyond a certain point, having more data does not make an AI model smarter, but having the right kind of data can help a model learn faster, better, and in a way that is closer to the processes at work in human translation.


We have reached a point at which linguists may rely on statistical learning to understand whether, why, and how two texts belong to different linguistic categories. This is also one of the reasons why it is of utmost importance to build explainable machine learning: models need to justify their decisions so that researchers can draw the correct interpretation of the results. How to build explainable and unbiased AI is still an open problem of high interest, but we will leave that for another time.
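As a closing illustration of the classification approach described above, here is a minimal sketch of a text classifier that separates translations from originals. It is a small multinomial Naive Bayes model written with the Python standard library only, swapped in here for the much larger models mentioned earlier because its decisions are easy to inspect. Everything in it is an assumption made for illustration: the FUNCTION_WORDS list, the function names, and the toy sentences are invented and carry no empirical weight.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of English function words.
FUNCTION_WORDS = ["the", "of", "and", "that", "which",
                  "however", "therefore", "in", "to", "it"]

def features(text):
    """Count how often each function word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in FUNCTION_WORDS)

def train(labeled_texts):
    """Fit log priors and add-one smoothed log-likelihoods from (text, label) pairs."""
    label_counts, word_counts = Counter(), {}
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(features(text))
    total_docs = sum(label_counts.values())
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + len(FUNCTION_WORDS)  # add-one smoothing
        model[label] = {
            "prior": math.log(label_counts[label] / total_docs),
            "loglik": {w: math.log((counts[w] + 1) / total) for w in FUNCTION_WORDS},
        }
    return model

def predict(model, text):
    """Return the label with the highest posterior log-score for the text."""
    feats = features(text)
    scores = {
        label: p["prior"] + sum(n * p["loglik"][w] for w, n in feats.items())
        for label, p in model.items()
    }
    return max(scores, key=scores.get)

def most_indicative(model, label, other, k=3):
    """Function words whose log-likelihood is highest under `label` relative to `other`."""
    ratios = {w: model[label]["loglik"][w] - model[other]["loglik"][w]
              for w in FUNCTION_WORDS}
    return sorted(ratios, key=ratios.get, reverse=True)[:k]

# Invented toy data, for demonstration only.
toy_data = [
    ("however the result of the test is that which it is", "translated"),
    ("therefore the aim of the work is that which follows", "translated"),
    ("we went to town and it rained", "original"),
    ("she ran to it and sang in it", "original"),
]
model = train(toy_data)
print(predict(model, "however therefore which"))
print(most_indicative(model, "translated", "original"))
```

The last function is the part that speaks to explainability: instead of only reporting a label, it lists the features that pull a text towards one category, which is the kind of evidence a linguist can then check against real examples.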