Main: 2.3M, SocMed: 0.83M
Main: ??M, SocMed: ??M
Main: ??M, SocMed: ??M
Main: 1.75M, SocMed: 1.85M
Main: 2.63M, SocMed: 3.59M
Main: 1.75M, SocMed: 0.014M
Main: 9.57M, SocMed: 2.66M

Corpora of Uralic languages of the Volga-Kama area

This page contains links to the corpora of Uralic languages spoken in the Volga-Kama area and adjacent regions. Since this is a work in progress, only the Erzya, Moksha and Udmurt corpora are available right now; corpora for the Komi and Mari languages will follow soon.

The languages represented here are “middle-sized” in terms of number of speakers, which ranges from tens of thousands to several hundred thousand. On the one hand, these numbers are lower than those of the major European Uralic languages (Hungarian, Finnish and Estonian); on the other, there are plenty of minority Uralic languages spoken in Russia that have far fewer speakers. These languages also have a middle level of digital presence. Digital press, blogs, social media etc. exist for each of them, but the total volume of such texts is orders of magnitude smaller than for Hungarian, Finnish or Estonian.

The corpora available here mostly contain texts that were published on the internet in one way or another. There are two corpora for each language: the “Main” corpus and the Social media corpus. The Social media corpus contains open posts and comments written on social media (at the moment only vkontakte, the most popular social media platform in Russia) and, in some cases, on forums. All other texts (newspapers, blogs, fiction, Bible translations, Wikipedia) go to the Main corpus.

You can find more detailed information about these corpora and their development in this paper. Please consider citing this paper if your research is based on these corpora:

Timofey Arkhangelskiy. 2019. Corpora of social media in minority Uralic languages. Proceedings of the fifth Workshop on Computational Linguistics for Uralic Languages, pages 125–140, Tartu, Estonia, January 7 - January 8, 2019.

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of corpora are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word cat followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
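To make the two example queries concrete, here is a minimal Python sketch over a tiny hand-annotated English toy corpus. The `Token` fields, the tag names (`NOUN`, `gen`, etc.) and both helper functions are assumptions made up for illustration; they are not the annotation scheme or the query interface of the corpora linked from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str                   # the word as it appears in the text
    lemma: str                  # dictionary form
    pos: str                    # part of speech (hypothetical tag set)
    case: Optional[str] = None  # grammatical case, if any


# Hypothetical toy corpus: three sentences annotated by hand.
SENTENCES = [
    [Token("the", "the", "DET"), Token("cat's", "cat", "NOUN", "gen"),
     Token("tail", "tail", "NOUN", "nom"), Token("twitched", "twitch", "VERB")],
    [Token("cats", "cat", "NOUN", "nom"), Token("sleep", "sleep", "VERB")],
    [Token("the", "the", "DET"), Token("dog", "dog", "NOUN", "nom"),
     Token("chased", "chase", "VERB"), Token("a", "a", "DET"),
     Token("cat", "cat", "NOUN", "nom")],
]


def find_tokens(sentences, **features):
    """Return (sentence index, token) pairs whose annotation matches all features."""
    return [
        (i, tok)
        for i, sent in enumerate(sentences)
        for tok in sent
        if all(getattr(tok, name) == value for name, value in features.items())
    ]


def find_lemma_followed_by_pos(sentences, lemma, pos):
    """Return (sentence index, form, next form) where a form of `lemma`
    is immediately followed by a token with part of speech `pos`."""
    return [
        (i, first.form, second.form)
        for i, sent in enumerate(sentences)
        for first, second in zip(sent, sent[1:])
        if first.lemma == lemma and second.pos == pos
    ]


# "Find all nouns in the genitive case"
genitive_nouns = find_tokens(SENTENCES, pos="NOUN", case="gen")

# "Find all forms of the word cat followed by a verb"
cat_then_verb = find_lemma_followed_by_pos(SENTENCES, "cat", "VERB")
```

The point of the sketch is that such queries address the annotation (lemma, part of speech, case) rather than the raw character strings, which is what distinguishes a corpus search from an ordinary text search.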

— Can I use the corpus as a library?

No, these corpora are not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each of the sentences you get, i.e. look at their neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This is done for copyright protection.
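The behaviour described above (hits returned in random order, context expansion capped per sentence) can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the `ToyCorpus` class, the limit of two expansions and the one-sentence-per-step window are assumptions, not the actual implementation or settings of the corpus platform.

```python
import random


class ToyCorpus:
    """Hypothetical model of a corpus with limited context expansion."""

    def __init__(self, sentences, max_expansions=2, seed=None):
        self.sentences = list(sentences)
        self.max_expansions = max_expansions  # assumed limit, for illustration
        self._rng = random.Random(seed)
        self._expansions = {}  # sentence index -> number of expansions used

    def search(self, word):
        """Return indices of sentences containing `word`, in random order."""
        target = word.lower()
        hits = [
            i for i, sent in enumerate(self.sentences)
            if target in (w.strip(".,!?").lower() for w in sent.split())
        ]
        self._rng.shuffle(hits)
        return hits

    def expand(self, idx):
        """Widen the context of sentence `idx` by one sentence on each side.

        Raises PermissionError once the per-sentence limit is exhausted.
        """
        used = self._expansions.get(idx, 0)
        if used >= self.max_expansions:
            raise PermissionError("context expansion limit reached")
        used += 1
        self._expansions[idx] = used
        start = max(0, idx - used)
        end = min(len(self.sentences), idx + used + 1)
        return self.sentences[start:end]


corpus = ToyCorpus([
    "The morning was quiet.",
    "A cat sat on the fence.",
    "It watched the birds.",
    "The birds ignored it.",
    "Then the cat jumped down.",
    "Nobody noticed.",
    "The day went on.",
])

hits = corpus.search("cat")     # indices 1 and 4, in random order
first_view = corpus.expand(4)   # sentences 3..5
second_view = corpus.expand(4)  # sentences 2..6; a third call would raise
```

Because the window around any hit can only grow a fixed number of times, a reader can never reach the whole text from a single hit, which is the copyright-protection effect described above.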

You can find answers to other frequently asked questions on the pages of the individual corpora.

Authors

All stages of development of these corpora, with some exceptions, were performed by Timofey Arkhangelskiy (you can find more detailed information on the pages of the corpora). All social media corpora and almost all other corpora were developed in 2018–2019 as a part of a postdoctoral project supported by the Alexander von Humboldt Foundation. All corpora presented here are hosted by the School of Linguistics at HSE, Moscow.

Contacts


If you have questions or would like to propose collaboration, please contact Timofey Arkhangelskiy. You can also use my morphological analyzers and the tsakorpus corpus platform, which are open source and freely available.

timarkh@gmail.com