Language: Romanian - LINDAT/CLARIAH-CZ Catalog Search Results

Creator:: Zeman, Daniel and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: tokenization, word segmentation, morphology, tagging, syntax, parsing, and universal dependencies
Language:: Afrikaans, Arabic, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Persian, Finnish, French, Old French (842-ca. 1400), Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Thai, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
Description:: CoNLL 2017 and 2018 shared tasks: Multilingual Parsing from Raw Text to Universal Dependencies. This package contains the test data in the form in which they were presented to the participating systems: raw text files and files preprocessed by UDPipe. The metadata.json files contain lists of files to process and to output; README files in the respective folders describe the syntax of metadata.json. For full training, development and gold standard test data, see Universal Dependencies 2.0 (CoNLL 2017) and Universal Dependencies 2.2 (CoNLL 2018); the download links are at http://universaldependencies.org/. For more information on the shared tasks, see http://universaldependencies.org/conll17/ and http://universaldependencies.org/conll18/. Contents: conll17-ud-test-2017-05-09 ... CoNLL 2017 test data; conll18-ud-test-2018-05-06 ... CoNLL 2018 test data; conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata and filenames modified so that it is digestible by the 2017 systems.
Rights:: Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB

31. CoNLL 2017 Shared Task System Outputs

Creator:: Zeman, Daniel, Potthast, Martin, Straka, Milan, Popel, Martin, Dozat, Timothy, Qi, Peng, Manning, Christopher, Shi, Tianze, Wu, Felix G., Chen, Xilun, Cheng, Yao, Björkelund, Anders, Falenska, Agnieszka, Yu, Xiang, Kuhn, Jonas, Che, Wanxiang, Guo, Jiang, Wang, Yuxuan, Zheng, Bo, Zhao, Huaipeng, Liu, Yang, Teng, Dechuan, Liu, Ting, Lim, Kyungtae, Poibeau, Thierry, Sato, Motoki, Manabe, Hitoshi, Noji, Hiroshi, Matsumoto, Yuji, Kırnap, Ömer, Önder, Berkay Furkan, Yuret, Deniz, Straková, Jana, Vania, Clara, Zhang, Xingxing, Lopez, Adam, Heinecke, Johannes, Asadullah, Munshi, Kanerva, Jenna, Luotolahti, Juhani, Ginter, Filip, Kuan, Yu, Sofroniev, Pavel, Schill, Erik, Hinrichs, Erhard, Nguyen, Dat Quoc, Dras, Mark, Johnson, Mark, Qian, Xian, Vilares, David, Gómez-Rodríguez, Carlos, Aufrant, Lauriane, Wisniewski, Guillaume, Yvon, François, Dumitrescu, Stefan Daniel, Boroş, Tiberiu, Tufiş, Dan, Das, Ayan, Zaffar, Affan, Sarkar, Sudeshna, Wang, Hao, Zhao, Hai, Zhang, Zhisong, Hornby, Ryan, Taylor, Clark, Park, Jungyeul, de Lhoneux, Miryam, Shao, Yan, Basirat, Ali, Kiperwasser, Eliyahu, Stymne, Sara, Goldberg, Yoav, Nivre, Joakim, Akkuş, Burak Kerim, Azizoglu, Heval, Cakici, Ruket, Moor, Christophe, Merlo, Paola, Henderson, James, Wang, Haozhou, Ji, Tao, Wu, Yuanbin, Lan, Man, de la Clergerie, Eric, Sagot, Benoît, Seddah, Djamé, More, Amir, Tsarfaty, Reut, Kanayama, Hiroshi, Muraoka, Masayasu, Yoshikawa, Katsumasa, Garcia, Marcos, and Gamallo, Pablo
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: dependency parser and parsebank
Language:: Arabic, Bulgarian, Russia Buriat, Czech, Catalan, Church Slavic, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Swedish, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
Description:: This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.
Rights:: Licence Universal Dependencies v2.0, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0, and PUB

32. CoNLL 2018 Shared Task System Outputs

Creator:: Zeman, Daniel, Potthast, Martin, Duthoo, Elie, Mesnard, Olivier, Rybak, Piotr, Wróblewska, Alina, Che, Wanxiang, Liu, Yijia, Wang, Yuxuan, Zheng, Bo, Liu, Ting, Li, Zuchao, He, Shexia, Zhang, Zhuosheng, Zhao, Hai, Wu, Yingting, Tong, Jia-Jun, Nguyen, Dat Quoc, Verspoor, Karin, Wan, Hui, Naseem, Tahira, Lee, Young-Suk, Castelli, Vittorio, Ballesteros, Miguel, Hershcovich, Daniel, Abend, Omri, Rappoport, Ari, Smith, Aaron, Bohnet, Bernd, de Lhoneux, Miryam, Nivre, Joakim, Shao, Yan, Stymne, Sara, Kırnap, Ömer, Dayanık, Erenay, Yuret, Deniz, Kanerva, Jenna, Ginter, Filip, Miekka, Niko, Leino, Akseli, Salakoski, Tapio, Lim, KyungTae, Park, Cheoneum, Lee, Changki, Poibeau, Thierry, Bhat, Riyaz Ahmad, Bhat, Irshad, Bangalore, Srinivas, Qi, Peng, Dozat, Timothy, Zhang, Yuhao, Manning, Christopher, Boroș, Tiberiu, Dumitrescu, Stefan Daniel, Burtica, Ruxandra, Arakelyan, Gor, Hambardzumyan, Karen, Khachatrian, Hrant, Rosa, Rudolf, Mareček, David, Straka, Milan, Seker, Amit, More, Amir, Tsarfaty, Reut, Önder, Berkay Furkan, Gümeli, Can, Jawahar, Ganesh, Muller, Benjamin, Fethi, Amal, Martin, Louis, Villemonte de la Clergerie, Eric, Sagot, Benoît, Seddah, Djamé, Özateş, Şaziye Betül, Özgür, Arzucan, Gungor, Tunga, Öztürk, Balkız, Ji, Tao, Liu, Yufang, Wang, Yijun, Wu, Yuanbin, Lan, Man, Chen, Danlu, Lin, Mengxiao, Hu, Zhifeng, and Qiu, Xipeng
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: parsed data, conllu, and universal dependencies
Language:: Afrikaans, Arabic, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Persian, Finnish, French, Old French (842-ca. 1400), Irish, Galician, Gothic, Ancient Greek (to 1453), Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Latin, Latvian, Dutch, Norwegian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Thai, Turkish, Uighur, Ukrainian, Urdu, Vietnamese, and Chinese
Description:: Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.
Rights:: Licence Universal Dependencies v2.2, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2, and PUB

33. ConsILR - Consortium for the Romanian Language: Resources & Tools

Type:: lexicalConceptualResource
Language:: English and Romanian
Description:: Resources and tools developed for Romanian
Rights:: Not specified

34. Contributia României la victoria asupra fascismului

Type:: text and monografie
Subject:: Dějiny států a území na Balkánském poloostrově, sborníky, konference vědecké, hnutí antifašistická, Rumunsko, odboj, odpor, antifašismus, antikomunismus, světové dějiny 1939-1945, zahraniční periodika a sborníky, and zahraniční konference, kongresy
Language:: Romanian
Rights:: unknown

35. Corpus for training and evaluating diacritics restoration systems

Creator:: Náplava, Jakub, Straka, Milan, Hajič, Jan, and Straňák, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: diacritical marks generation and natural language correction
Language:: Czech, Vietnamese, Romanian, Polish, Slovak, Spanish, Croatian, Irish, Latvian, Hungarian, French, and Turkish
Description:: Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains a (substantially larger) training set collected from (general) Web texts. All sets, except for the Wikipedia and Web training sets, which can contain similar sentences, are disjoint. Data are segmented into sentences, which are further tokenized into words. All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained within the attached stripping_diacritics.zip. This script has two modes. We generally recommend using the method called uninames, which behaves better for some languages. The code for training a recurrent neural network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
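
Note:: The description above mentions a stripping script with two modes, one of them called uninames. The sketch below is not that script; it is a minimal illustration, under the assumption that a name-based mode works from Unicode character names while the other mode works from Unicode normalization. The function names strip_diacritics_uninames and strip_diacritics_normalize are hypothetical and do not come from the package.

```python
import unicodedata


def strip_diacritics_uninames(text: str) -> str:
    """Name-based stripping (assumed approach, not the packaged script).

    For each character, take its Unicode name, cut it at the first
    " WITH " (e.g. "LATIN SMALL LETTER L WITH STROKE"), and look up the
    character that carries the remaining base name.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        base_name, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character with that base name; keep the original
        out.append(ch)
    return "".join(out)


def strip_diacritics_normalize(text: str) -> str:
    """Normalization-based stripping: decompose (NFD), drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    for word in ["Łódź", "Việt", "Ştefan"]:
        print(word, strip_diacritics_uninames(word), strip_diacritics_normalize(word))
```

The two functions differ on letters such as Ł, which has no canonical decomposition and therefore survives normalization-based stripping; this may be the kind of case in which a name-based mode behaves better.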

36. Cultura moldovenească în timpul lui Ştefan cel Mare :

Type:: text and monografie
Subject:: Dějiny států a území na Balkánském poloostrově, Štěpán, Berza, Mihai,, sborníky, dějiny kultury, Rumunsko, dějiny vědy, umění, kultury a techniky, kulturní vztahy, světové dějiny středověku (do r. 1492), světové dějiny 1492-1648, and zahraniční periodika a sborníky
Language:: Romanian
Description:: Moldavian culture in the time of Stephen the Great.
Rights:: unknown

37. DaMuEL 1.0: A Large Multilingual Dataset for Entity Linking

Creator:: Kubeša, David and Straka, Milan
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: entity linking, NEL, NER, dataset, and knowledge base
Language:: Afrikaans, Arabic, Armenian, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Maltese, Marathi, Modern Greek (1453-), Northern Sami, Norwegian Nynorsk, Persian, Polish, Portuguese, Romanian, Russian, Scottish Gaelic, Serbian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Uighur, Ukrainian, Urdu, Vietnamese, and Wolof
Description:: We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

38. Dan al II-lea, Sigismund de Luxemburg şi cruciada târzie :

Creator:: Cîmpeanu, Liviu
Type:: text and studie
Subject:: Dějiny Evropy, Zikmund Lucemburský,, Dan, Radu, panovníci čeští, dokumenty, výpravy křížové, panovníci valašští, války proti Turkům, řád, němečtí rytíři, Rumunsko, světové dějiny středověku (do r. 1492), vojenské operace, války, bitvy, and české země 1419-1471
Language:: Romanian
Description:: Dan II, Sigismund of Luxembourg and the later crusades. A new document from the archive of the Teutonic Order.
Rights:: unknown

39. Deep Universal Dependencies 2.4

Creator:: Zeman, Daniel and Droganova, Kira
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: semantic dependency and universal dependencies
Language:: Afrikaans, Assyrian Neo-Aramaic, Akkadian, Amharic, Arabic, Belarusian, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Mandarin Chinese, Coptic, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Finnish, French, Irish, Gothic, Ancient Greek (to 1453), Mbyá Guaraní, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Komi-Zyrian, Karelian, Latin, Latvian, Lithuanian, Literary Chinese, Marathi, Erzya, Dutch, Norwegian, Old Russian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Sanskrit, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Tamil, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Warlpiri, Wolof, Yoruba, and Galician
Description:: Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Rights:: Licence Universal Dependencies v2.4, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.4, and PUB

40. Deep Universal Dependencies 2.5

Creator:: Zeman, Daniel and Droganova, Kira
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: semantic dependency and universal dependencies
Language:: Afrikaans, Assyrian Neo-Aramaic, Akkadian, Amharic, Arabic, Belarusian, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Mandarin Chinese, Coptic, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Finnish, French, Irish, Gothic, Ancient Greek (to 1453), Mbyá Guaraní, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Komi-Zyrian, Karelian, Latin, Latvian, Lithuanian, Literary Chinese, Marathi, Erzya, Dutch, Norwegian, Old Russian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Sanskrit, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Tamil, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Warlpiri, Wolof, Yoruba, Galician, Bhojpuri, Komi-Permyak, Livvi, Moksha, Scottish Gaelic, and Skolt Sami
Description:: Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Rights:: Licence Universal Dependencies v2.5, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5, and PUB

41. Deep Universal Dependencies 2.6

Creator:: Zeman, Daniel and Droganova, Kira
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: semantic dependency and universal dependencies
Language:: Afrikaans, Assyrian Neo-Aramaic, Akkadian, Amharic, Arabic, Belarusian, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Mandarin Chinese, Coptic, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Finnish, French, Irish, Gothic, Ancient Greek (to 1453), Mbyá Guaraní, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Komi-Zyrian, Karelian, Latin, Latvian, Lithuanian, Literary Chinese, Marathi, Erzya, Dutch, Norwegian, Old Russian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Sanskrit, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Tamil, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Warlpiri, Wolof, Yoruba, Galician, Bhojpuri, Komi-Permyak, Livvi, Moksha, Scottish Gaelic, Skolt Sami, Icelandic, Albanian, and Persian
Description:: Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Rights:: Licence Universal Dependencies v2.6, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.6, and PUB

42. Deep Universal Dependencies 2.7

Creator:: Zeman, Daniel and Droganova, Kira
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: semantic dependency and universal dependencies
Language:: Afrikaans, Assyrian Neo-Aramaic, Akkadian, Amharic, Arabic, Belarusian, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Mandarin Chinese, Coptic, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Finnish, French, Irish, Gothic, Ancient Greek (to 1453), Mbyá Guaraní, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Komi-Zyrian, Karelian, Latin, Latvian, Lithuanian, Literary Chinese, Marathi, Erzya, Dutch, Norwegian, Old Russian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Sanskrit, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Tamil, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Warlpiri, Wolof, Yoruba, Galician, Bhojpuri, Komi-Permyak, Livvi, Moksha, Scottish Gaelic, Skolt Sami, Icelandic, Albanian, Persian, Akuntsu, Apurinã, Khunsari, Manx, Mundurukú, Nayini, Soi, South Levantine Arabic, and Tupinambá
Description:: Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Rights:: Licence Universal Dependencies v2.7, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7, and PUB

43. Deep Universal Dependencies 2.8

Creator:: Zeman, Daniel and Droganova, Kira
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: semantic dependency and universal dependencies
Language:: Afrikaans, Assyrian Neo-Aramaic, Akkadian, Amharic, Arabic, Belarusian, Breton, Bulgarian, Russia Buriat, Catalan, Czech, Church Slavic, Mandarin Chinese, Coptic, Welsh, Danish, German, Modern Greek (1453-), English, Estonian, Basque, Faroese, Finnish, French, Irish, Gothic, Ancient Greek (to 1453), Mbyá Guaraní, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kazakh, Northern Kurdish, Korean, Komi-Zyrian, Karelian, Latin, Latvian, Lithuanian, Literary Chinese, Marathi, Erzya, Dutch, Norwegian, Old Russian, Nigerian Pidgin, Polish, Portuguese, Romanian, Russian, Sanskrit, Slovak, Slovenian, Northern Sami, Spanish, Serbian, Swedish, Tamil, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Warlpiri, Wolof, Yoruba, Galician, Bhojpuri, Komi-Permyak, Livvi, Moksha, Scottish Gaelic, Skolt Sami, Icelandic, Albanian, Persian, Akuntsu, Apurinã, Khunsari, Manx, Mundurukú, Nayini, Soi, South Levantine Arabic, Tupinambá, Beja, Western Frisian, Urubú-Kaapor, Kangri, K'iche', Low German, Makuráp, Western Armenian, and Central Siberian Yupik
Description:: Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional deep-syntactic and semantic annotations. Version of Deep UD corresponds to the version of UD it is based on. Note however that some UD treebanks have been omitted from Deep UD.
Rights:: Licence Universal Dependencies v2.8, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.8, and PUB

44. Deltacorpus

Creator:: Mareček, David, Yu, Zhiwei, Zeman, Daniel, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: part of speech, tagging, semi-supervised, and cross-language
Language:: Belarusian, Bosnian, Bulgarian, Czech, Serbo-Croatian, Croatian, Upper Sorbian, Macedonian, Polish, Russian, Slovak, Slovenian, Serbian, Ukrainian, Latvian, Lithuanian, Afrikaans, Danish, German, English, Faroese, Western Frisian, Swiss German, Icelandic, Limburgan, Luxembourgish, Low German, Dutch, Norwegian Nynorsk, Norwegian, Scots, Swedish, Yiddish, Aragonese, Asturian, Catalan, French, Galician, Haitian, Italian, Latin, Lombard, Neapolitan, Piemontese, Portuguese, Romanian, Spanish, Venetian, Walloon, Breton, Welsh, Scottish Gaelic, Irish, Modern Greek (1453-), Armenian, Albanian, Dimli (individual language), Persian, Gilaki, Kurdish, Tajik, Bengali, Bishnupriya, Gujarati, Fiji Hindi, Hindi, Marathi, Nepali (macrolanguage), Urdu, Amharic, Arabic, Egyptian Arabic, Hebrew, Estonian, Finnish, Hungarian, Basque, Georgian, Chuvash, Azerbaijani, Turkish, Uzbek, Kazakh, Tatar, Yakut, Korean, Mongolian, Telugu, Kannada, Malayalam, Tamil, Newari, Vietnamese, Indonesian, Javanese, Malagasy, Maori, Malay (macrolanguage), Pampanga, Sundanese, Tagalog, Waray (Philippines), Swahili (macrolanguage), Esperanto, Ido, Interlingua (International Auxiliary Language Association), and Volapük
Description:: Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

45. Deltacorpus 1.1

Creator:: Mareček, David, Yu, Zhiwei, Zeman, Daniel, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: part of speech, tagging, semi-supervised, and cross-language
Language:: Belarusian, Bosnian, Bulgarian, Czech, Serbo-Croatian, Croatian, Upper Sorbian, Macedonian, Polish, Russian, Slovak, Slovenian, Serbian, Ukrainian, Latvian, Lithuanian, Afrikaans, Danish, German, English, Faroese, Western Frisian, Swiss German, Icelandic, Limburgan, Luxembourgish, Low German, Dutch, Norwegian Nynorsk, Norwegian, Scots, Swedish, Yiddish, Aragonese, Asturian, Catalan, French, Galician, Haitian, Italian, Latin, Lombard, Neapolitan, Piemontese, Portuguese, Romanian, Spanish, Venetian, Walloon, Breton, Welsh, Scottish Gaelic, Irish, Modern Greek (1453-), Armenian, Albanian, Dimli (individual language), Persian, Gilaki, Kurdish, Tajik, Bengali, Bishnupriya, Gujarati, Fiji Hindi, Hindi, Marathi, Nepali (macrolanguage), Urdu, Amharic, Arabic, Egyptian Arabic, Hebrew, Estonian, Finnish, Hungarian, Basque, Georgian, Chuvash, Azerbaijani, Turkish, Uzbek, Kazakh, Tatar, Yakut, Korean, Mongolian, Telugu, Kannada, Malayalam, Tamil, Newari, Vietnamese, Indonesian, Javanese, Malagasy, Maori, Malay (macrolanguage), Pampanga, Sundanese, Tagalog, Waray (Philippines), Swahili (macrolanguage), Esperanto, Ido, Interlingua (International Auxiliary Language Association), and Volapük
Description:: Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). Changes in version 1.1: 1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset. 2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0. 3. Balto-Slavic languages, Germanic languages and Romance languages were each tagged by a classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
Rights:: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB

46. Documente privind istoria Romîniei.

Type:: text, prameny, and edice
Subject:: Dějiny Evropy, Rumunsko, přehledná zpracování světových dějin (chronologicky), and přehledná zpracování (tematicky)
Language:: Romanian
Rights:: unknown

47. Documente privind revolutia de la 1848 in tările române :

Type:: text and dokumenty
Subject:: Dějiny států a území na Balkánském poloostrově, revoluce 1848-1849, Rumunsko, and světové dějiny 1789-1918
Language:: Romanian and German
Rights:: unknown

48. Eşuarea încercârilor monarhiei şi a reactiunii interne şi externe de a râsturna regimul democrat popular /

Creator:: Bâlteanu, Boris
Type:: text and studie
Subject:: Dějiny států a území na Balkánském poloostrově, Rumunsko, politické dějiny, politici, and světové dějiny od r. 1945 do současnosti
Language:: Romanian
Rights:: unknown

49. Filozoful Jan Patočka /

Creator:: Dubský, Ivan,
Type:: text and biografie
Subject:: Filozofie, Patočka, Jan,, filozofové čeští, Československo 1945-1992, and filozofie, filozofové
Language:: Romanian
Rights:: unknown

50. Gabriel Bethlen (1613-1629) /

Creator:: Bunta Péter,
Type:: text, monografie, and biografie
Subject:: Dějiny Evropy, Bethlen, Gabriel,, knížata sedmihradská, povstání protihabsburská, šlechta, buržoazie, měšťanstvo, podnikatelé, and světové dějiny 1492-1648
Language:: Romanian
Rights:: unknown

51. HamleDT 2.0

Creator:: Zeman, Daniel, Mareček, David, Mašek, Jan, Popel, Martin, Ramasamy, Loganathan, Rosa, Rudolf, Štěpánek, Jan, and Žabokrtský, Zdeněk
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank, Stanford dependencies, Prague dependencies, harmonization, common annotation style, and Interset
Language:: Arabic, Bulgarian, Bengali, Catalan, Czech, Danish, German, Modern Greek (1453-), English, Spanish, Estonian, Basque, Persian, Finnish, Ancient Greek (to 1453), Hindi, Hungarian, Italian, Japanese, Latin, Dutch, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Tamil, Telugu, and Turkish
Description:: HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.
Rights:: HamleDT 2.0 Licence Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-2.0, and ACA

52. HamleDT 3.0

Creator:: Zeman, Daniel, Mareček, David, Mašek, Jan, Popel, Martin, Ramasamy, Loganathan, Rosa, Rudolf, Štěpánek, Jan, and Žabokrtský, Zdeněk
Publisher:: Charles University
Type:: text and corpus
Subject:: annotated corpus, morphology, syntax, dependency, treebank, harmonized annotation, and common annotation style
Language:: Arabic, Basque, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Modern Greek (1453-), Ancient Greek (to 1453), Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, and Turkish
Description:: HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style. Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
Rights:: HamleDT 3.0 License Terms, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-3.0, and PUB

53. Iaşii :

Creator:: Andronic, Alexandru,
Type:: text and monografie
Subject:: Dějiny států a území na Balkánském poloostrově, města středověká, urbanizace, Rumunsko, města, obce, and světové dějiny středověku (do r. 1492)
Language:: Romanian
Rights:: unknown

54. In tres partes divisa. Polonia śi criza Cehoslovacă în documente diplomatice Româneşti (septembrie-octombrie 1938) /

Creator:: Anghel, Florin
Subject:: diplomacie rumunská, dokumenty diplomatické, vztahy mezinárodní, politika zahraniční, vztahy československo-polské, vztahy rumunsko-československé, krize mnichovská, politické dějiny, politici, zahraniční politika, mezinárodní vztahy, světové dějiny 1918-1945, Polsko, Rumunsko, and Československo 1938-1945
Language:: Romanian
Description:: In tres partes divisa: Poland and the Czechoslovak crisis in Romanian diplomatic documents (September-October 1938).
Rights:: unknown

55. Începutul activitãtii revolutionare a lui C. Dobrogeanu-Gherea /

Creator:: Haupt, Georges,
Type:: text and studie
Subject:: Dějiny států a území na Balkánském poloostrově, Gherea-Dobrogeanu, Constantin,, hnutí dělnické, sociologové, publicisté, Rumunsko, Rusko, dělnictvo, chudina, and světové dějiny 1789-1918
Language:: Romanian
Rights:: unknown

56. Internationala întîi şi Romînia /

Creator:: Deac, Augustin,
Type:: text and monografie
Subject:: Politické strany a hnutí, vztahy mezinárodní, internacionála první (1864-1876), Rumunsko, zahraniční politika, mezinárodní vztahy, and světové dějiny 1789-1918
Language:: Romanian
Description:: Institutul de istorie a partidului de pe lîngă C.C. al P.M.R.
Rights:: unknown

57. Istoria ilustrată a Românilor /

Creator:: Giurescu, Dinu Constantin,
Type:: text and monografie
Subject:: Dějiny států a území na Balkánském poloostrově, dějiny států, Rumunsko, přehledná zpracování (tematicky), and přehledná zpracování světových dějin (chronologicky)
Language:: Romanian
Rights:: unknown

58. Istoria matematicii in România.

Creator:: Andonie, George Ștefan,
Type:: text and monografie
Subject:: Matematika, dějiny vědy, vědy exaktní, matematika, Rumunsko, matematika, kybernetika, and přehledná zpracování světových dějin (chronologicky)
Language:: Romanian
Rights:: unknown

59. Istoria Romäniei în date /

Type:: text and příručky
Subject:: Dějiny států a území na Balkánském poloostrově, dějiny, obecné přehledy, Rumunsko, politické dějiny, politici, and přehledná zpracování světových dějin (chronologicky)
Language:: Romanian
Rights:: unknown

60. Istoria Romäniei.

Type:: text and monografie kolektivní
Subject:: Dějiny států a území na Balkánském poloostrově, dějiny států, Rumunsko, přehledná zpracování (tematicky), and světové dějiny 1789-1918
Language:: Romanian
Rights:: unknown

61. Istoria sclavajului in Dacia romană /

Creator:: Tudor, David,
Type:: text and monografie
Subject:: Dějiny zemí starověkého světa, antika, provincie, otroci, Rumunsko, společenská struktura, Etruskové, starověký Řím, and epigrafika
Language:: Romanian
Rights:: unknown

62. Istoria stiintelor în România :

Type:: text and collective monograph
Subject:: General biology, history of biology, biology, Romania, surveys of world history (chronological), and life sciences
Language:: Romanian
Rights:: unknown

63. Istorich sunt oameni modesti, spre pąguba breslei /

Creator:: Matej, Dorin
Subject:: Czech-Romanian relations, foreign policy, international relations, world history from 1918 to the present, Romania, and Czechoslovakia 1918-1992
Language:: Romanian
Description:: [interview with Radek Pech, Czech ambassador to Romania]
Rights:: unknown

64. JRC-Acquis

Publisher:: Joint Research Centre of the EU
Type:: corpus
Language:: Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Modern Greek (1453-), Hungarian, Italian, Latvian, Maltese, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish
Description:: A large parallel corpus of EU law (the Acquis Communautaire) in 22 languages.
Rights:: Not specified

65. Juvenilie Josefa Macůrka /

Creator:: Macůrek, Josef,
Type:: text, festschrifts, and collected works
Subject:: Historical science. Auxiliary historical sciences. Archival science, Macůrek, Josef,, historians (anniversaries, obituaries, etc.), Czechoslovakia 1918-1992, and historiography, historical sciences, historians
Language:: Czech, Romanian, and French
Description:: Text in Czech, Romanian, and French
Rights:: unknown

66. Karel Zdeněk Líman :

Type:: text and monograph
Subject:: Architecture, Líman, Karel Zdeněk,, Carol, Czech architects, Romanian Czechs, architecture, Czech lands 1848-1918, Czechoslovakia 1918-1938, architecture, architects, Romania, world history 1789-1918, and world history 1918-1945
Language:: Romanian and English
Rights:: unknown

67. La industria checoeslovaca :

Type:: text and informational publication
Subject:: National economy and economic policy, industry, Czechoslovakia 1918-1938, and industry, manufactories, mining, breweries
Language:: Romanian
Description:: At the end of the book, [64] pages of advertisements and promotional texts in Czech, Spanish, English, and German
Rights:: unknown

68. Lectii în ajutorul celor care studiaza istoriă P. M. R. /

Type:: text and collected volumes
Subject:: History of states and territories on the Balkan Peninsula, labour movement, social democratic political parties, communist political parties, Romania, working class, the poor, world history 1789-1918, world history from 1918 to the present, and political parties and movements, elections
Language:: Romanian
Rights:: unknown

69. Les relations politiques roumano-francaises au début du XX siěcle (1900-1916 ) /

Creator:: Vesa, Vasile,
Type:: text and monograph
Subject:: International relations, world politics, Romanian-French relations, foreign policy, Romania, France, foreign policy, international relations, and world history 1789-1918
Language:: French and Romanian
Rights:: unknown

70. Lupta pentru unitate naţională a tăriolor române 1590-1630 :

Type:: text and documents
Subject:: History of states and territories on the Balkan Peninsula, Turkish wars, political history, Romania, political history, politicians, and world history 1492-1648
Language:: Romanian
Description:: Inst. de istorie Nicolae Iorga
Rights:: unknown

71. MEBA word aligner

Creator:: Tufiş, Dan and Ceauşu, Alexandru
Publisher:: Research Institute for Artificial Intelligence, Romanian Academy of Sciences
Type:: toolService
Subject:: word aligner
Language:: English and Romanian
Description:: MEBA is a lexical aligner, implemented in C#, based on an iterative algorithm that uses pre-processing steps: sentence alignment ([[http://www.clarin.eu/tools/sal-sentence-aligner|SAL]]), tokenization, POS-tagging and lemmatization (through [[http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts|TTL]]), and sentence chunking. Similar to the YAWA aligner, MEBA generates the links step by step, beginning with the most probable (anchor links). The links to be added at any later step are supported or restricted by the links created in the previous iterations. The aligner has different weights and different significance thresholds on each feature and iteration. Each of the iterations can be configured to align different categories of tokens (named entities, dates and numbers, content words, functional words, punctuation) in decreasing order of statistical evidence. MEBA has an individual F-measure of 81.71% and it is currently integrated in the platform [[http://www.clarin.eu/tools/cowal-combined-word-aligner|COWAL]]. More detailed descriptions are available in [[http://www.racai.ro/~tufis/papers|the following papers]]: -- Dan Tufiş (2007). Exploiting Aligned Parallel Corpora in Multilingual Studies and Applications. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Intercultural Collaboration. First International Workshop (IWIC 2007), volume 4568 of Lecture Notes in Computer Science, pp. 103-117. Springer-Verlag, August 2007. ISBN 978-3-540-73999-9. -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2006). Improved Lexical Alignment by Combining Multiple Reified Alignments. In Toru Ishida, Susan R. Fussell, and Piek T.J.M. Vossen (eds.), Proceedings of the 11th Conference EACL2006, pp. 153-160, Trento, Italy, April 2006. Association for Computational Linguistics. ISBN 1-9324-32-61-2. -- Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu (2005). Combined Aligners. In Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, pp. 107-110, Ann Arbor, USA, June 2005. Association for Computational Linguistics. ISBN 978-973-703-208-9.
Rights:: Not specified
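
The step-by-step linking described for MEBA (anchor links first, later links restricted by earlier ones) can be illustrated with a small sketch. This is not MEBA's implementation: the function name `iterative_align`, the score dictionary, the threshold list, and the one-to-one and no-crossing restrictions are all assumptions chosen only to show the general idea of multi-pass alignment with decreasing thresholds.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def iterative_align(scores: Dict[Link, float], thresholds: List[float]) -> List[Link]:
    """Greedy multi-pass linking (hypothetical sketch, not MEBA itself).

    Each pass accepts candidate links whose score reaches that pass's
    threshold, best first. A candidate is rejected if it reuses an already
    linked token or crosses a link accepted earlier (assumed restrictions).
    """
    links: List[Link] = []
    used_src: Set[int] = set()
    used_tgt: Set[int] = set()
    for threshold in thresholds:  # expected in decreasing order
        candidates = sorted(
            ((score, pair) for pair, score in scores.items() if score >= threshold),
            reverse=True,
        )
        for _score, (i, j) in candidates:
            if i in used_src or j in used_tgt:
                continue  # token already linked in this or an earlier pass
            if any((i - a) * (j - b) < 0 for a, b in links):
                continue  # would cross an existing link
            links.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(links)


# Toy scores for a three-token sentence pair (made-up numbers).
toy_scores = {(0, 0): 0.95, (1, 2): 0.9, (2, 1): 0.6, (2, 2): 0.55, (1, 1): 0.5}
toy_links = iterative_align(toy_scores, thresholds=[0.8, 0.4])
print(toy_links)
```

In the toy run the first pass fixes the two high-confidence anchors, and the second pass rejects `(2, 1)` because it would cross the anchor `(1, 2)`.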

72. Misiunea militară a generalului Milan Rastislav Štefánik în România în lumina documentelor din arhivele române şi franceze /

Creator:: Kopecký, Peter,
Type:: text and study
Subject:: History of Czechia and Slovakia, Štefánik, Milan Rastislav,, archival documents, Slovak politicians, foreign travels, Czechoslovak-Romanian relations, foreign policy, international relations, world history 1914-1918, Romania, foreign archival science, and Czech lands 1914-1918
Language:: Romanian
Rights:: unknown

73. Morpho-syntactically annotated corpora provided for the PARSEME Shared Task on Semi-Supervised Identification of Verbal Multiword Expressions (edition 1.2)

Creator:: Guillaume, Bruno, Ramisch, Carlos, Waszczuk, Jakub, Monti, Johanna, Di Buono, Maria Pia, Sangati, Federico, Speranza, Giulia, Carlino, Carola, Güngör, Tunga, Yirmibeşoğlu, Zeynep, Sak, Haşim, Saraçlar, Murat, Giouli, Voula, Foufi, Vassiliki, Ramisch, Renata, Rademaker, Alexandre, Vale, Oto, Wilkens, Rodrigo, Candito, Marie, Crabbé, Benoît, Segonne, Vincent, Liebeskind, Chaya, Stymne, Sara, Hajič, Jan, Ginter, Filip, Luotolahti, Juhani, Straka, Milan, Zeman, Daniel, Barbu Mititelu, Verginica, Cristescu, Mihaela, Vaidya, Ashwini, Bhatia, Archna, Lichte, Timm, Ehren, Rafael, Jiang, Menghan, Xu, Hongzhi, Walsh, Abigail, Irimia, Elena, and Dowling, Meghan
Publisher:: PARSEME
Type:: text and corpus
Subject:: morphosyntactic annotation, dependency trees, and morphological analysis
Language:: German, Modern Greek (1453-), Basque, French, Irish, Hebrew, Hindi, Italian, Polish, Portuguese, Romanian, Swedish, Turkish, and Chinese
Description:: This multilingual resource contains corpora for 14 languages, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora, to help discover unseen verbal MWEs. The corpora are provided in CoNLL-U (https://universaldependencies.org/format.html) format. They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe). VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
Rights:: PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2-raw, and PUB
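
Since the corpora above are distributed in CoNLL-U, a minimal reading sketch may help. It assumes the standard ten tab-separated CoNLL-U columns and ignores any further columns (the cupt format is described only as "inspired by" CoNLL-U, so its extra column is not modelled here). The names `read_conllu`, `CONLLU_FIELDS`, and the sample sentence are illustrative and not part of the PARSEME tooling.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) < len(CONLLU_FIELDS):
            raise ValueError(f"expected at least 10 columns, got {len(cols)}")
        # Keep the first ten columns; extra columns are ignored in this sketch.
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Made-up sample with a verb-particle construction ("give up").
SAMPLE = (
    "# text = Give up now\n"
    "1\tGive\tgive\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tup\tup\tADP\t_\t_\t1\tcompound:prt\t_\t_\n"
    "3\tnow\tnow\tADV\t_\t_\t1\tadvmod\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

For real files, passing an open file handle (`open(path, encoding="utf-8")`) in place of `SAMPLE.splitlines()` should work the same way, since the function only iterates over lines.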

74. Neologismos económicos en las lenguas románicas a través de la prensa

Publisher:: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra
Type:: lexicalConceptualResource
Subject:: terminology database
Language:: Catalan, French, Galician, Italian, Portuguese, Romanian, and Spanish
Description:: Multilingual terminological resource containing 3,875 entries from the Economics, Finance and Banking domains.
Rights:: Not specified

75. Noi contribuţii privitoare la originea şi activitatea arhitectului Johann Freywald /

Creator:: Rădvan, Laurențiu,
Type:: text and study
Subject:: Architecture, History of states and territories on the Balkan Peninsula, Freywald, Johann,, architects, urban architecture, neoclassicism, Czech-Romanian relations, cultural relations, Czech lands 1792-1847, architecture, architects, Romania, and world history 1789-1918
Language:: Romanian
Description:: New contributions to the origin and work of the architect Johann Freywald.
Rights:: unknown

76. Noul sfat al animalelor /

Creator:: Flaška z Pardubic a Rychmburku, Smil,
Type:: text, sources, and translations
Subject:: Czech poetry, Flaška z Pardubic a Rychmburku, Smil,, Czech literature, medieval literature, Czech lands 1306-1419, and literature, writers
Language:: Romanian
Description:: Translated from Czech
Rights:: unknown

77. Ocuparea Cehoslovaciei (august 1968) reflectată de Radiodifuziunea Romănă /

Creator:: Denize, Eugen
Subject:: occupation, Romanian radio, public opinion, Munich 1938, Prague Spring 1968, occupation 1939, 1968, world history from 1945 to the present, Romania, Czechoslovakia 1948-1969, and television, radio
Language:: Romanian
Rights:: unknown

78. Omagiu lui Constantin Daicoviciu cu prilejul împlinirii a 60 de ani :

Type:: text and festschrifts
Subject:: Historical science. Auxiliary historical sciences. Archival science, Daicoviciu, Constantin,, archaeology, foreign periodicals and collections, surveys (thematic), world history - prehistory and antiquity, and world history of the Middle Ages (to 1492)
Language:: Romanian
Rights:: unknown

80. ParaCrawl Corpus version 1.0

Creator:: Koehn, Philipp, Heafield, Kenneth, Forcada, Mikel L., Esplà-Gomis, Miquel, Ortiz-Rojas, Sergio, Sánchez, Gema Ramírez, Cartagena, Víctor M. Sánchez, Haddow, Barry, Bañón, Marta, Střelec, Marek, Samiotou, Anna, and Kamran, Amir
Publisher:: ParaCrawl
Type:: text and corpus
Subject:: ParaCrawl, parallel corpus, CommonCrawl, machine translation, and text corpora
Language:: English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Finnish, Latvian, Russian, and Estonian
Description:: The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of web sites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download, please visit: http://paracrawl.eu/releases.html
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB