Russian-Chinese Parallel Corpus of the Russian National Corpus

The Russian National Corpus (RNC) is one of the largest and highest-quality families of corpora for the Russian language. It contains a large number of so-called subcorpora: small databases, each dedicated to a specific area of language research (syntax, stress, etc.). One of these subcorpora is the parallel corpus, which is in turn divided into twenty corpora, each pairing Russian with a foreign language.

You can find out what parallel corpora are here.

A little history

Our corpus originated within the RNC itself, in 2016. In 2019, it became available on two sites: the “old” version is on the RNC website, and the “new” version is on the website of the HSE corpora.

In 2020, we received support from the HSE for the development of our project.

We have not broken away from our roots and still associate ourselves with the RNC; however, for a number of reasons, it is much easier for us to update the version of the Corpus on the HSE website. For that reason, we focus here on the news, algorithms, composition and team of the version of the Corpus hosted on the website of the HSE corpora.

The current state of the Corpus

The Corpus contains more than 2.3 million words. It consists of 30 literary texts by Russian and Chinese authors of the 19th to 21st centuries, including Liu Zhenyun, F. M. Dostoevsky, L. E. Ulitskaya, Lu Xun and others.

Today, the Corpus has Russian and English interfaces; we are working on a Chinese version of the site.

You can read about what exactly you can do in our Corpus in the instructions on the search page: click the orange question mark icon at the top of the page.

What are we notable for?

At present, our project is the only parallel corpus under development in Russia that has four useful properties at once:

it covers the language pair of Russian and Putonghua;
it is available on the internet;
it has a user-friendly search system;
it is grammatically annotated.

We know of only one analogue of our project, which is currently being developed in Beijing.

Our team

Our project involves students, teachers and researchers of the following institutes:

Dozens of people work on the Corpus, but we still have a huge number of unresolved tasks and not enough active and courageous participants to take them on. So if you are interested in our project, be sure to look at our vacancies!