About parallel corpora
Parallel corpora: what are they?
A parallel corpus is a special case of a linguistic corpus, one of the main tools used by linguists in the 21st century. Like most linguistic corpora, a parallel corpus usually comes with metadata (information about each text: when it was created, by whom, how long it is, etc.) and with markup, also called annotation (each word is assigned its lemma, i.e. its dictionary form, along with grammatical information, etc.).
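To make the distinction between metadata and markup concrete, here is a minimal Python sketch of one annotated text. The class names (`Token`, `CorpusText`), the field names and the tag strings are all assumptions made for illustration; real corpora use their own formats and tag sets.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # its dictionary (initial) form
    gram: str   # grammatical information, here a free-form tag string


@dataclass
class CorpusText:
    meta: dict                                   # metadata about the whole text
    tokens: list = field(default_factory=list)   # markup: one Token per word


# Hypothetical example text; the metadata values are invented.
text = CorpusText(
    meta={"author": "unknown", "year": 2001, "words": 3},
    tokens=[
        Token("Cats", "cat", "NOUN,plural"),
        Token("were", "be", "VERB,past,plural"),
        Token("sleeping", "sleep", "VERB,participle"),
    ],
)


def find_by_lemma(text, lemma):
    """Return every word form in the text whose lemma matches."""
    return [t.form for t in text.tokens if t.lemma == lemma]


print(find_by_lemma(text, "be"))
```

Because each word carries its lemma, a search for "be" finds the form "were", which a plain string search would miss.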
A parallel corpus is a collection of texts in two languages at once, typically original texts together with their translations. An important part of the markup of a parallel corpus is alignment: each sentence (or at least each paragraph) in language X is linked to the corresponding sentence in language Y. Thanks to alignment, a parallel corpus becomes a useful tool for several categories of users:
- foreign language students and teachers (words and expressions can now be looked up not in a dictionary but in context, and the same contexts show which words they combine with in the other language);
- translators (a parallel corpus is a large database of the solutions that earlier translators found for particular expressions and techniques);
- specialists in statistical or neural NLP: over the last decade, almost all major companies have abandoned the development of rule-based translators (i.e., systems built on a loaded dictionary and a set of hand-written translation rules). Statistical and neural systems instead need large amounts of data in two languages, in which each sentence (or smaller segment) is paired with its counterpart. Of course, a parallel corpus built for programmers is designed differently (markup and metadata are not always needed there);
- linguists and translation scholars (such databases support many conclusions in the comparative study of grammar, semantics and vocabulary).
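The alignment described above, and the way learners and translators use it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sentence pairs are invented, and `concordance` is a hypothetical helper, not the interface of any real corpus tool.

```python
import re

# Hypothetical sentence-aligned corpus: each item pairs a sentence in
# language X (English) with its translation in language Y (French).
aligned = [
    ("The cat is sleeping.", "Le chat dort."),
    ("I have a black cat.", "J'ai un chat noir."),
    ("The dog is barking.", "Le chien aboie."),
]


def concordance(pairs, word):
    """Return the aligned pairs whose source sentence contains `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


# Show every context of "cat" next to its translation.
for src, tgt in concordance(aligned, "cat"):
    print(f"{src}  |  {tgt}")
```

A machine translation system would consume the same list of pairs directly as training data, which is why such corpora often omit metadata and word-level markup.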
Here are the best-known examples of parallel corpora:
- Reverso Context: one of the most user-friendly corpora, covering many language pairs; used by foreign language learners and translators;
- OPUS: a combined collection of parallel corpora that are often used for machine translation;
- Bible translations: among the oldest parallel corpora, aligned by verse in the 13th-16th centuries;
- EuroParl: a corpus of proceedings of the European Parliament, an EU body that works in 24 official languages (the EU has 27 member states);
- Other well-known parallel corpora can be found here.