Search Results
Creator:
Droganova, Kira , Zeman, Daniel , Kanerva, Jenna , and Ginter, Filip
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
universal dependencies , ellipsis , and gapping
Language:
English , Czech , Finnish , Russian , and Slovak
Description:
An artificially created treebank of elliptical constructions (gapping) in the annotation style of Universal Dependencies. The data are taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered, sentences where gapping could be applied are identified, and those sentences are then transformed by omitting one or more words, resulting in sentences with gapping. For details, see Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
Rights:
Licence Universal Dependencies v2.1 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1 , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) , http://creativecommons.org/licenses/by-nc/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) , http://creativecommons.org/licenses/by-nc-nd/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) , http://creativecommons.org/licenses/by-nd/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution 4.0 International (CC BY 4.0) , http://creativecommons.org/licenses/by/4.0/ , and PUB
Creator:
Zeman, Daniel and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
tokenization , word segmentation , morphology , tagging , syntax , parsing , and universal dependencies
Language:
Afrikaans , Arabic , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Thai , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
CoNLL 2017 and 2018 shared tasks:
Multilingual Parsing from Raw Text to Universal Dependencies
This package contains the test data in the form in which they were presented
to the participating systems: raw text files and files preprocessed by UDPipe.
The metadata.json files contain lists of files to process and to output;
README files in the respective folders describe the syntax of metadata.json.
For full training, development and gold standard test data, see
Universal Dependencies 2.0 (CoNLL 2017)
Universal Dependencies 2.2 (CoNLL 2018)
See the download links at http://universaldependencies.org/.
For more information on the shared tasks, see
http://universaldependencies.org/conll17/
http://universaldependencies.org/conll18/
Contents:
conll17-ud-test-2017-05-09 ... CoNLL 2017 test data
conll18-ud-test-2018-05-06 ... CoNLL 2018 test data
conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata
and filenames modified so that it is digestible by the 2017 systems.
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Zeman, Daniel , Potthast, Martin , Straka, Milan , Popel, Martin , Dozat, Timothy , Qi, Peng , Manning, Christopher , Shi, Tianze , Wu, Felix G. , Chen, Xilun , Cheng, Yao , Björkelund, Anders , Falenska, Agnieszka , Yu, Xiang , Kuhn, Jonas , Che, Wanxiang , Guo, Jiang , Wang, Yuxuan , Zheng, Bo , Zhao, Huaipeng , Liu, Yang , Teng, Dechuan , Liu, Ting , Lim, Kyungtae , Poibeau, Thierry , Sato, Motoki , Manabe, Hitoshi , Noji, Hiroshi , Matsumoto, Yuji , Kırnap, Ömer , Önder, Berkay Furkan , Yuret, Deniz , Straková, Jana , Vania, Clara , Zhang, Xingxing , Lopez, Adam , Heinecke, Johannes , Asadullah, Munshi , Kanerva, Jenna , Luotolahti, Juhani , Ginter, Filip , Kuan, Yu , Sofroniev, Pavel , Schill, Erik , Hinrichs, Erhard , Nguyen, Dat Quoc , Dras, Mark , Johnson, Mark , Qian, Xian , Vilares, David , Gómez-Rodríguez, Carlos , Aufrant, Lauriane , Wisniewski, Guillaume , Yvon, François , Dumitrescu, Stefan Daniel , Boroş, Tiberiu , Tufiş, Dan , Das, Ayan , Zaffar, Affan , Sarkar, Sudeshna , Wang, Hao , Zhao, Hai , Zhang, Zhisong , Hornby, Ryan , Taylor, Clark , Park, Jungyeul , de Lhoneux, Miryam , Shao, Yan , Basirat, Ali , Kiperwasser, Eliyahu , Stymne, Sara , Goldberg, Yoav , Nivre, Joakim , Akkuş, Burak Kerim , Azizoglu, Heval , Cakici, Ruket , Moor, Christophe , Merlo, Paola , Henderson, James , Wang, Haozhou , Ji, Tao , Wu, Yuanbin , Lan, Man , de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , More, Amir , Tsarfaty, Reut , Kanayama, Hiroshi , Muraoka, Masayasu , Yoshikawa, Katsumasa , Garcia, Marcos , and Gamallo, Pablo
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
dependency parser and parsebank
Language:
Arabic , Bulgarian , Russia Buriat , Czech , Catalan , Church Slavic , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , French , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Swedish , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Zeman, Daniel , Potthast, Martin , Duthoo, Elie , Mesnard, Olivier , Rybak, Piotr , Wróblewska, Alina , Che, Wanxiang , Liu, Yijia , Wang, Yuxuan , Zheng, Bo , Liu, Ting , Li, Zuchao , He, Shexia , Zhang, Zhuosheng , Zhao, Hai , Wu, Yingting , Tong, Jia-Jun , Nguyen, Dat Quoc , Verspoor, Karin , Wan, Hui , Naseem, Tahira , Lee, Young-Suk , Castelli, Vittorio , Ballesteros, Miguel , Hershcovich, Daniel , Abend, Omri , Rappoport, Ari , Smith, Aaron , Bohnet, Bernd , de Lhoneux, Miryam , Nivre, Joakim , Shao, Yan , Stymne, Sara , Kırnap, Ömer , Dayanık, Erenay , Yuret, Deniz , Kanerva, Jenna , Ginter, Filip , Miekka, Niko , Leino, Akseli , Salakoski, Tapio , Lim, KyungTae , Park, Cheoneum , Lee, Changki , Poibeau, Thierry , Bhat, Riyaz Ahmad , Bhat, Irshad , Bangalore, Srinivas , Qi, Peng , Dozat, Timothy , Zhang, Yuhao , Manning, Christopher , Boroș, Tiberiu , Dumitrescu, Stefan Daniel , Burtica, Ruxandra , Arakelyan, Gor , Hambardzumyan, Karen , Khachatrian, Hrant , Rosa, Rudolf , Mareček, David , Straka, Milan , Seker, Amit , More, Amir , Tsarfaty, Reut , Önder, Berkay Furkan , Gümeli, Can , Jawahar, Ganesh , Muller, Benjamin , Fethi, Amal , Martin, Louis , Villemonte de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , Özateş, Şaziye Betül , Özgür, Arzucan , Gungor, Tunga , Öztürk, Balkız , Ji, Tao , Liu, Yufang , Wang, Yijun , Wu, Yuanbin , Lan, Man , Chen, Danlu , Lin, Mengxiao , Hu, Zhifeng , and Qiu, Xipeng
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parsed data , conllu , and universal dependencies
Language:
Afrikaans , Arabic , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Thai , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.
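The subject keywords for this record name CoNLL-U as the data format. As a rough illustration of how such files might be read, the sketch below assumes the standard CoNLL-U layout (ten tab-separated columns per token, `#` comment lines, a blank line between sentences). The function name `read_conllu` and the sample sentence are hypothetical and are not taken from the package.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the standard CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical sample, invented for illustration only.
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(sample.splitlines()))
```

Each token keeps all columns as strings, so multiword-token ranges and empty nodes in the `id` column pass through unchanged.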
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Náplava, Jakub , Straka, Milan , Hajič, Jan , and Straňák, Pavel
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
diacritical marks generation and natural language correction
Language:
Czech , Vietnamese , Romanian , Polish , Slovak , Spanish , Croatian , Irish , Latvian , Hungarian , French , and Turkish
Description:
Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. Moreover, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except for the Wikipedia and Web training sets, which can contain similar sentences. The data are segmented into sentences, which are further tokenized into words.
All data in the corpus contain diacritics. To strip the diacritics, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the method called uninames, which behaves better for some languages.
The code for training a recurrent-neural-network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
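The idea of stripping diacritics by Unicode character name can be sketched as follows. This is a minimal illustration that assumes a name-based mode works by dropping the " WITH ..." part of each character's Unicode name. It is not the attached diacritization_stripping.py, and the function name `strip_diacritics_by_name` is hypothetical.

```python
import unicodedata


def strip_diacritics_by_name(text: str) -> str:
    """Replace each character by its base letter, using Unicode names."""
    # Compose first so that letter + combining mark pairs become single
    # characters with names such as "LATIN SMALL LETTER R WITH CARON".
    text = unicodedata.normalize("NFC", text)
    result = []
    for ch in text:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            # Unnamed characters (e.g. control characters) are kept as-is.
            result.append(ch)
            continue
        base_name, separator, _ = name.partition(" WITH ")
        if separator:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                # No character carries the shortened name; keep the original.
                pass
        result.append(ch)
    return "".join(result)


czech = strip_diacritics_by_name("Příliš žluťoučký kůň")
vietnamese = strip_diacritics_by_name("đường")
```

Characters such as đ and ł have no canonical Unicode decomposition, so a name-based lookup handles them where removing combining marks after NFD normalization would leave them unchanged.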
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Galuščáková, Petra , Garabík, Radovan , and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parallel corpus and Czech-Slovak corpus
Language:
Slovak and Czech
Description:
Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and the downloaded website of the European Commission [5]. The corpus is published both in plaintext format and with an automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Kubeša, David and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
entity linking , NEL , NER , dataset , and knowledge base
Language:
Afrikaans , Arabic , Armenian , Basque , Belarusian , Bulgarian , Catalan , Chinese , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , Galician , German , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Maltese , Marathi , Modern Greek (1453-) , Northern Sami , Norwegian Nynorsk , Persian , Polish , Portuguese , Romanian , Russian , Scottish Gaelic , Serbian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , Uighur , Ukrainian , Urdu , Vietnamese , and Wolof
Description:
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
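The role of the QID as a join key between the two components can be sketched as below. The records, field names and the helper `entity_view` are invented for illustration only; the actual DaMuEL file layout is not described here and may differ.

```python
# Toy language-agnostic knowledge base, keyed by Wikidata QID.
# All records and field names are hypothetical.
knowledge_base = {
    "Q1085": {"type": "LOC"},
}

# Toy language-specific texts, stored separately per language and
# again keyed by QID.
language_texts = {
    "en": {"Q1085": {"label": "Prague", "aliases": ["Praha"]}},
    "cs": {"Q1085": {"label": "Praha", "aliases": []}},
}


def entity_view(qid: str, lang: str) -> dict:
    """Merge language-agnostic and language-specific data for one entity."""
    merged = {"qid": qid}
    merged.update(knowledge_base.get(qid, {}))
    merged.update(language_texts.get(lang, {}).get(qid, {}))
    return merged


czech_view = entity_view("Q1085", "cs")
english_view = entity_view("Q1085", "en")
```

Because the QID is the same in every language, one knowledge-base entry can be combined with any number of per-language text records without duplicating the shared information.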
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , and Galician
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.4 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.4 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , and Skolt Sami
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.5 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , and Persian
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.6 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.6 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , Persian , Akuntsu , Apurinã , Khunsari , Manx , Mundurukú , Nayini , Soi , South Levantine Arabic , and Tupinambá
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.7 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , Persian , Akuntsu , Apurinã , Khunsari , Manx , Mundurukú , Nayini , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Western Armenian , and Central Siberian Yupik
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.8 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.8 , and PUB
Creator:
Mareček, David , Yu, Zhiwei , Zeman, Daniel , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
part of speech , tagging , semi-supervised , and cross-language
Language:
Belarusian , Bosnian , Bulgarian , Czech , Serbo-Croatian , Croatian , Upper Sorbian , Macedonian , Polish , Russian , Slovak , Slovenian , Serbian , Ukrainian , Latvian , Lithuanian , Afrikaans , Danish , German , English , Faroese , Western Frisian , Swiss German , Icelandic , Limburgan , Luxembourgish , Low German , Dutch , Norwegian Nynorsk , Norwegian , Scots , Swedish , Yiddish , Aragonese , Asturian , Catalan , French , Galician , Haitian , Italian , Latin , Lombard , Neapolitan , Piemontese , Portuguese , Romanian , Spanish , Venetian , Walloon , Breton , Welsh , Scottish Gaelic , Irish , Modern Greek (1453-) , Armenian , Albanian , Dimli (individual language) , Persian , Gilaki , Kurdish , Tajik , Bengali , Bishnupriya , Gujarati , Fiji Hindi , Hindi , Marathi , Nepali (macrolanguage) , Urdu , Amharic , Arabic , Egyptian Arabic , Hebrew , Estonian , Finnish , Hungarian , Basque , Georgian , Chuvash , Azerbaijani , Turkish , Uzbek , Kazakh , Tatar , Yakut , Korean , Mongolian , Telugu , Kannada , Malayalam , Tamil , Newari , Vietnamese , Indonesian , Javanese , Malagasy , Maori , Malay (macrolanguage) , Pampanga , Sundanese , Tagalog , Waray (Philippines) , Swahili (macrolanguage) , Esperanto , Ido , Interlingua (International Auxiliary Language Association) , and Volapük
Description:
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Mareček, David , Yu, Zhiwei , Zeman, Daniel , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
part of speech , tagging , semi-supervised , and cross-language
Language:
Belarusian , Bosnian , Bulgarian , Czech , Serbo-Croatian , Croatian , Upper Sorbian , Macedonian , Polish , Russian , Slovak , Slovenian , Serbian , Ukrainian , Latvian , Lithuanian , Afrikaans , Danish , German , English , Faroese , Western Frisian , Swiss German , Icelandic , Limburgan , Luxembourgish , Low German , Dutch , Norwegian Nynorsk , Norwegian , Scots , Swedish , Yiddish , Aragonese , Asturian , Catalan , French , Galician , Haitian , Italian , Latin , Lombard , Neapolitan , Piemontese , Portuguese , Romanian , Spanish , Venetian , Walloon , Breton , Welsh , Scottish Gaelic , Irish , Modern Greek (1453-) , Armenian , Albanian , Dimli (individual language) , Persian , Gilaki , Kurdish , Tajik , Bengali , Bishnupriya , Gujarati , Fiji Hindi , Hindi , Marathi , Nepali (macrolanguage) , Urdu , Amharic , Arabic , Egyptian Arabic , Hebrew , Estonian , Finnish , Hungarian , Basque , Georgian , Chuvash , Azerbaijani , Turkish , Uzbek , Kazakh , Tatar , Yakut , Korean , Mongolian , Telugu , Kannada , Malayalam , Tamil , Newari , Vietnamese , Indonesian , Javanese , Malagasy , Maori , Malay (macrolanguage) , Pampanga , Sundanese , Tagalog , Waray (Philippines) , Swahili (macrolanguage) , Esperanto , Ido , Interlingua (International Auxiliary Language Association) , and Volapük
Description:
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic languages, Germanic languages and Romance languages were tagged by classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Galuščáková, Petra , Garabík, Radovan , and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parallel corpus and English-Slovak corpus
Language:
Slovak and English
Description:
English-Slovak parallel corpus consisting of several freely available corpora (Acquis [1], Europarl [2], Official Journal of the European Union [3] and part of OPUS corpus [4] – EMEA, EUConst, KDE4 and PHP) and a downloaded copy of the European Commission website [5]. The corpus is published both in plaintext format and with an automatic morphological annotation.
References:
[1] http://langtech.jrc.it/JRC-Acquis.html/
[2] http://www.statmt.org/europarl/
[3] http://apertium.eu/data
[4] http://opus.lingfil.uu.se/
[5] http://ec.europa.eu/
This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Zeman, Daniel , Mareček, David , Mašek, Jan , Popel, Martin , Ramasamy, Loganathan , Rosa, Rudolf , Štěpánek, Jan , and Žabokrtský, Zdeněk
Publisher:
Charles University
Type:
text and corpus
Subject:
annotated corpus , morphology , syntax , dependency , treebank , harmonized annotation , and common annotation style
Language:
Arabic , Basque , Bengali , Bulgarian , Catalan , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Modern Greek (1453-) , Ancient Greek (to 1453) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Persian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , and Turkish
Description:
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
Rights:
HamleDT 3.0 License Terms , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-3.0 , and PUB
Creator:
Zeman, Daniel , Bouma, Gosse , and Seddah, Djamé
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , enhanced universal dependencies , shared task , and parsing
Language:
Arabic , Bulgarian , Czech , Dutch , English , Estonian , Finnish , French , Italian , Latvian , Lithuanian , Polish , Russian , Slovak , Swedish , Tamil , and Ukrainian
Description:
This package contains data used in the IWPT 2020 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.5 (http://hdl.handle.net/11234/1-3105) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.6 (http://hdl.handle.net/11234/1-3226), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
Rights:
Licence Universal Dependencies v2.5 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5 , and PUB
Creator:
Zeman, Daniel , Bouma, Gosse , and Seddah, Djamé
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , enhanced universal dependencies , shared task , and parsing
Language:
Arabic , Bulgarian , Czech , Dutch , English , Estonian , Finnish , French , Italian , Latvian , Lithuanian , Polish , Russian , Slovak , Swedish , Tamil , and Ukrainian
Description:
This package contains data used in the IWPT 2021 shared task. It contains training, development and test (evaluation) datasets. The data is based on a subset of Universal Dependencies release 2.7 (http://hdl.handle.net/11234/1-3424) but some treebanks contain additional enhanced annotations. Moreover, not all of these additions became part of Universal Dependencies release 2.8 (http://hdl.handle.net/11234/1-3687), which makes the shared task data unique and worth a separate release to enable later comparison with new parsing algorithms. The package also contains a number of Perl and Python scripts that have been used to process the data during preparation and during the shared task. Finally, the package includes the official primary submission of each team participating in the shared task.
Rights:
Licence Universal Dependencies v2.7 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7 , and PUB
Creator:
Veselý, Bohumil
Publisher:
Národní filmový archiv
Type:
video and clip
Subject:
katedrála exteriér , slavnost církevní , průvod církevní , církev římskokatolická , lebka sv. Vojtěcha , hrad exteriér , Galerie osobností , Places::Praha::Hradčany::Pražský hrad::katedrála sv. Víta /ext./ , Places::Praha::Hradčany::Pražský hrad::Brána gigantů , People::Beran Josef (1888-1969) , People::Opasek Jan Anastasius (1913-1999) , and People::Verzich Maurus (1911-1992)
Language:
Slovak
Description:
Cardinal Josef Beran leads a procession with the skull of Saint Adalbert of Prague in a fragmented segment from film weekly newsreel. The procession starts at Saint Vitus Cathedral at Prague Castle and walks through the Gate of Giants. Cardinal Beran blesses the remains of Saint Adalbert of Prague on Hradčany Square in Prague.
Rights:
http://creativecommons.org/licenses/by-nc-nd/4.0/ , PUB , and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Creator:
Krátký film and Veselý, Bohumil
Publisher:
Národní filmový archiv
Type:
video and clip
Subject:
kytice , narozeniny Kubín Josef Štefan 100. , Galerie osobností , Places::Dobříš::zámek::hlavní sál , People::Kubín Josef Štefan (1864-1965) , and Československý filmový týdeník 1964/42
Language:
Slovak
Description:
Writer Josef Štefan Kubín in a garden. Kubín accepts a bouquet of flowers on the occasion of his 100th birthday in Dobříš Chateau in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1964, issue no. 42.
Rights:
http://creativecommons.org/licenses/by-nc-nd/4.0/ , PUB , and Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Creator:
Galuščáková, Petra and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
machine translation , errors classification , and CS-SK translation
Language:
Slovak and Czech
Description:
Manual classification of errors in Czech-Slovak translation according to the classification introduced by Vilar et al. [1]. The first 50 sentences from the WMT 2010 test set [2] were translated by 5 MT systems (Česílko, Česílko2, Google Translate and two Moses setups), and MT errors were manually marked and classified. The classification was used in the MT system comparison in [3]. A reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://matrix.statmt.org/test_sets/list
[3] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
This work has been supported by the grants Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Galuščáková, Petra and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
machine translation , errors classification , and EN-SK translation
Language:
Slovak and English
Description:
Manual classification of errors of English-Slovak translation according to the classification introduced by Vilar et al. [1]. 50 sentences randomly selected from WMT 2011 test set [2] were translated by 3 MT systems described in [3] and MT errors were manually marked and classified. Reference translation is included.
References:
[1] David Vilar, Jia Xu, Luis Fernando D’Haro and Hermann Ney. Error Analysis of Machine Translation Output. In International Conference on Language Resources and Evaluation, pages 697-702. Genoa, Italy, May 2006.
[2] http://www.statmt.org/wmt11/evaluation-task.html
[3] Petra Galuščáková and Ondřej Bojar. Improving SMT by Using Parallel Data of a Closely Related Language. In Human Language Technologies - The Baltic Perspective - Proceedings of the Fifth International Conference Baltic HLT 2012, volume 247 of Frontiers in AI and Applications, pages 58-65, Amsterdam, Netherlands, October 2012. IOS Press.
This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Bojar, Ondřej and Galuščáková, Petra
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
machine translation , evaluation , and manual ranking
Language:
Slovak and Czech
Description:
Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems (Česílko, Česílko2, Google Translate and two Moses setups) on three data sets (100 sentences randomly selected from books, 100 sentences randomly selected from Acquis corpus and 50 first sentences from WMT 2010 test set). Ranking was applied in MT systems comparison in [1].
References:
[1] Ondřej Bojar, Petra Galuščáková, and Miroslav Týnovský. Evaluating Quality of Machine Translation from Czech to Slovak. In Markéta Lopatková, editor, Information Technologies - Applications and Theory, pages 3-9, September 2011.
This work has been supported by the grant Euro-MatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic).
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Hajič, Jan and Hric, Jan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text , computationalLexicon , and lexicalConceptualResource
Subject:
Slovak and morphological dictionary
Language:
Slovak
Description:
Slovak morphological dictionary modeled after the Czech one. It consists of (word form, lemma, POS tag) triples, reusing the Czech morphological system for POS tags and lemma descriptions.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Rosa, Rudolf
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
Wikipedia , text corpora , and monolingual corpus
Language:
Abkhazian , Achinese , Adyghe , Afrikaans , Akan , Tosk Albanian , Amharic , Old English (ca. 450-1100) , Arabic , Official Aramaic (700-300 BCE) , Aragonese , Egyptian Arabic , Assamese , Asturian , Atikamekw , Avaric , Aymara , South Azerbaijani , Azerbaijani , Bashkir , Bambara , Bavarian , Central Bikol , Belarusian , Bengali , Bislama , Banjar , Tibetan , Bosnian , Bishnupriya , Breton , Buginese , Bulgarian , Russia Buriat , Catalan , Min Dong Chinese , Cebuano , Czech , Chamorro , Chechen , Cherokee , Church Slavic , Chuvash , Cheyenne , Central Kurdish , Cornish , Corsican , Cree , Crimean Tatar , Kashubian , Welsh , Danish , German , Dinka , Dimli (individual language) , Dhivehi , Lower Sorbian , Dzongkha , Modern Greek (1453-) , English , Esperanto , Estonian , Basque , Ewe , Extremaduran , Faroese , Persian , Fijian , Finnish , French , Arpitan , Northern Frisian , Western Frisian , Fulah , Friulian , Gagauz , Gan Chinese , Scottish Gaelic , Irish , Galician , Gilaki , Manx , Goan Konkani , Gothic , Guarani , Gujarati , Hakka Chinese , Haitian , Hausa , Hawaiian , Serbo-Croatian , Hebrew , Herero , Fiji Hindi , Hindi , Hiri Motu , Croatian , Upper Sorbian , Hungarian , Armenian , Igbo , Ido , Inuktitut , Interlingue , Iloko , Interlingua (International Auxiliary Language Association) , Indonesian , Inupiaq , Icelandic , Italian , Jamaican Creole English , Javanese , Lojban , Japanese , Kara-Kalpak , Kabyle , Kalaallisut , Kannada , Kashmiri , Georgian , Kanuri , Kazakh , Kabardian , Kabiyè , Khmer , Kikuyu , Kinyarwanda , Kirghiz , Komi-Permyak , Komi , Kongo , Korean , Karachay-Balkar , Kölsch , Kurdish , Ladino , Lao , Latin , Latvian , Lak , Lezghian , Ligurian , Limburgan , Lingala , Lithuanian , Lombard , Northern Luri , Latgalian , Luxembourgish , Ganda , Literary Chinese , Marshallese , Maithili , Malayalam , Marathi , Moksha , Eastern Mari , Minangkabau , Macedonian , Malagasy , Maltese , Mongolian , Maori , Western Mari , Malay (macrolanguage) , 
Creek , Mirandese , Burmese , Erzya , Mazanderani , Min Nan Chinese , Neapolitan , Nauru , Navajo , Ndonga , Low German , Nepali (macrolanguage) , Newari , Dutch , Norwegian Nynorsk , Norwegian , Novial , Pedi , Nyanja , Occitan (post 1500) , Livvi , Oriya (macrolanguage) , Oromo , Ossetian , Pangasinan , Pampanga , Panjabi , Papiamento , Picard , Pennsylvania German , Pfaelzisch , Pitcairn-Norfolk , Pali , Piemontese , Western Panjabi , Pontic , Polish , Portuguese , Pushto , Quechua , Vlax Romani , Romansh , Romanian , Rusyn , Rundi , Macedo-Romanian , Russian , Sango , Yakut , Sanskrit , Sicilian , Scots , Samogitian , Sinhala , Slovak , Slovenian , Northern Sami , Samoan , Shona , Sindhi , Somali , Southern Sotho , Spanish , Albanian , Sardinian , Sranan Tongo , Serbian , Swati , Saterfriesisch , Sundanese , Swahili (macrolanguage) , Swedish , Silesian , Tahitian , Tamil , Tatar , Tulu , Telugu , Tama (Colombia) , Tetum , Tajik , Tagalog , Thai , Tigrinya , Tonga (Tonga Islands) , Tok Pisin , Tswana , Tsonga , Turkmen , Tumbuka , Turkish , Twi , Tuvinian , Udmurt , Uighur , Ukrainian , Urdu , Uzbek , Venetian , Venda , Veps , Vietnamese , Vlaams , Volapük , Võro , Waray (Philippines) , Walloon , Wolof , Wu Chinese , Kalmyk , Xhosa , Mingrelian , Yiddish , Yoruba , Yue Chinese , Zeeuws , Zhuang , Chinese , Zulu , and Dotyali
Description:
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
A script for downloading a new version of the data is included, but note that Wikipedia limits the download speed when many dumps are requested, so fetching all of them takes a few days (one or a few can be downloaded quickly).
Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Rights:
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) , http://creativecommons.org/licenses/by-sa/3.0/ , and PUB
Creator:
(:unav) Unknown author
Publisher:
Masaryk University, NLP Centre
Type:
text and corpus
Subject:
Slovak large corpus
Language:
Slovak
Description:
Slovak large web corpus skTenTen, comprising 876,003,720 tokens.
Lexical Computing Ltd.
Rights:
Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) , http://creativecommons.org/licenses/by-nc-nd/3.0/ , and PUB
Creator:
Rosa, Rudolf , Zeman, Daniel , Mareček, David , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
other and toolService
Subject:
parsing , dependency parser , cross-lingual parsing , and universal dependencies
Language:
Slovak , Croatian , and Norwegian
Description:
Trained models for UDPipe used to produce our final submission to the VarDial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the NO model on a concatenation of DA and SV data. The scripts and commands used to create the models are part of a separate submission (http://hdl.handle.net/11234/1-1970).
The models were trained with UDPipe version 3e65d69 from 3rd Jan 2017, obtained from https://github.com/ufal/udpipe -- their functionality with newer or older versions of UDPipe is not guaranteed.
We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLL-U format. The models only use the form, UPOS, and Universal Features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission.
SK -- tag and parse with the model:
udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included. It is applied in the same way (udpipe --tag --parse sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu).
HR -- prune the Features to keep only Case and parse with the model:
python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe
NO -- put the UPOS annotation aside, tag Features with the model, merge with the left-aside UPOS annotation, and parse with the model (this hassle is because UDPipe cannot be told to keep UPOS and only change Features):
cut -f1-4 no-ud-predPoS-test.conllu > tmp
udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe
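The column merge in the NO recipe (columns 1-4 from the original file, the remaining columns from the tagger output) can also be sketched in Python. This is a minimal illustration and not part of the submission: the function name merge_upos is hypothetical, and the sketch assumes that the original file and the tagger output are line-aligned. Unlike the shell pipeline, it copies comment and blank lines straight from the original file.

```python
def merge_upos(original_lines, tagged_lines):
    """Keep columns 1-4 (ID, FORM, LEMMA, UPOS) from the original CoNLL-U
    lines and take columns 5-10 from the tagger output, line by line."""
    if len(original_lines) != len(tagged_lines):
        raise ValueError("inputs are not line-aligned")
    merged = []
    for orig, tagged in zip(original_lines, tagged_lines):
        orig = orig.rstrip("\n")
        tagged = tagged.rstrip("\n")
        # Blank lines (sentence breaks) and comments are copied unchanged.
        if not orig or orig.startswith("#"):
            merged.append(orig)
            continue
        left = orig.split("\t")[:4]     # ID, FORM, LEMMA, UPOS (kept)
        right = tagged.split("\t")[4:]  # XPOS, FEATS, ... (from the tagger)
        merged.append("\t".join(left + right))
    return merged
```

The merged lines can then be written to a file and passed to udpipe --parse as in the command above.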
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Rosa, Rudolf , Zeman, Daniel , Mareček, David , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
suiteOfTools and toolService
Subject:
parsing , dependency parser , universal dependencies , and cross-lingual parsing
Language:
Czech , Slovak , Slovenian , Croatian , Danish , Swedish , and Norwegian
Description:
Tools and scripts used to create the cross-lingual parsing models submitted to VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).
For each source (SS, e.g. sl) and target (TT, e.g. hr) language,
you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
SS-ud-train.conllu
TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
OpenSubtitles2016.SS-TT.SS
OpenSubtitles2016.SS-TT.TT
!!! If they are originally called ...TT-SS... instead of ...SS-TT...,
you need to symlink them (or move, or copy) !!!
- target tagging model
TT.tagger.udpipe
All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017
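The renaming of the parallel data mentioned above (files called ...TT-SS... instead of ...SS-TT...) can be sketched as follows. This is only an illustration under stated assumptions: the function names plan_links and apply_links are hypothetical, and the sketch assumes the files follow exactly the OpenSubtitles2016.SS-TT.SS naming pattern shown in the list.

```python
import os


def plan_links(filenames, src, tgt, prefix="OpenSubtitles2016"):
    """Return (link_name, existing_name) pairs for files that exist only
    under the swapped TT-SS name and need an SS-TT alias."""
    present = set(filenames)
    plan = []
    for lang in (src, tgt):
        wanted = f"{prefix}.{src}-{tgt}.{lang}"
        swapped = f"{prefix}.{tgt}-{src}.{lang}"
        if wanted not in present and swapped in present:
            plan.append((wanted, swapped))
    return plan


def apply_links(directory, plan):
    """Create the symlinks from a plan inside the given directory."""
    for wanted, swapped in plan:
        # Relative target, so the link stays valid if the directory moves.
        os.symlink(swapped, os.path.join(directory, wanted))
```

For the sl-hr setup one would call, for example, apply_links(".", plan_links(os.listdir("."), "sl", "hr")).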
You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016
The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of source treebank
- pre-training of target word embeddings
- simplification of morpho feats (use only Case)
- and finally, training and evaluating the parser
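The simplification of morphological features (keeping only Case) can be illustrated with a short Python sketch. It is not the script bundled with the submission; the function names keep_features and prune_line are hypothetical, and the sketch assumes standard 10-column CoNLL-U lines with FEATS in the sixth column.

```python
def keep_features(feats, keep=("Case",)):
    """Reduce a CoNLL-U FEATS value to the listed feature names."""
    if feats == "_":
        return "_"
    kept = [f for f in feats.split("|") if f.split("=", 1)[0] in keep]
    # An empty feature set is written as an underscore in CoNLL-U.
    return "|".join(kept) if kept else "_"


def prune_line(line, keep=("Case",)):
    """Prune the FEATS column of one token line; comments, blank lines
    and anything else without 10 columns pass through unchanged."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 10:
        return line
    cols[5] = keep_features(cols[5], keep)
    return "\t".join(cols)
```

Passing a longer tuple as keep would correspond to setups that retain more features, such as the cs-sk one.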
Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in
specific cases (see paper for details).
Moreover, cs-sk also adds more morpho features, selecting those that
seem to be very often shared in parallel data.
The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.
Rights:
GNU General Public License 2 or later (GPL-2.0) , http://opensource.org/licenses/GPL-2.0 , and PUB
Creator:
Gajdošová, Katarína , Šimková, Mária , and et al.
Publisher:
Jazykovedný ústav Ľ. Štúra Slovenskej akadémie vied
Type:
text and corpus
Subject:
dependency , treebank , syntax , and morphology
Language:
Slovak
Description:
Slovak Dependency Treebank (Slovenský závislostný korpus) was created as part of the Slovak National Corpus at the Ľ. Štúr Institute of the Slovak Academy of Sciences. The annotation follows the guidelines of the Prague Dependency Treebank (Czech), slightly modified in the spirit of Slovak grammatical tradition. Morphological tags, lemmas and dependency relations have been assigned manually to every word.
The present dataset is a subset of the original treebank. We automatically selected the sentences where the two human annotators 100% agreed on the analysis. This increases the quality and trustworthiness of the data but it also results in selecting short sentences most of the time. An extended version may be published in the future when manually merged and checked annotation is available.
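The agreement-based selection can be pictured with a minimal Python sketch. The data representation here is an assumption made for illustration only (each sentence as a list of per-token tuples of lemma, tag, head and dependency relation), and the function name agreed_sentences is hypothetical; the actual selection procedure is not documented in this record.

```python
def agreed_sentences(annotations_a, annotations_b):
    """Return the sentences on which both annotators agree completely.

    Each argument is a list of sentences in the same order; a sentence is a
    list of (lemma, tag, head, deprel) tuples, one per token (hypothetical
    representation chosen for this sketch).
    """
    if len(annotations_a) != len(annotations_b):
        raise ValueError("both annotators must cover the same sentences")
    selected = []
    for sent_a, sent_b in zip(annotations_a, annotations_b):
        # 100% agreement: every token has identical lemma, tag, head, deprel.
        if sent_a == sent_b:
            selected.append(sent_a)
    return selected
```

Because a single differing token excludes the whole sentence, longer sentences are less likely to survive, which is consistent with the bias towards short sentences noted above.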
The selected sentences have been converted to the CoNLL-X file format (original token IDs are preserved in the FEATS column). This PDT-style annotation will serve as the source for the first Slovak dataset in the Universal Dependencies (to be published separately).
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text , mlmodel , and languageDescription
Subject:
MorphoDiTa , Slovak , morphological analysis , morphological generation , and PoS tagging
Language:
Slovak
Description:
Slovak models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex SK 170914 and the PoS tagger is trained on automatically translated Prague Dependency Treebank 3.0 (PDT).
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Kondratyuk, Dan and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
syntax , dependency parser , and universal dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , and Maltese
Description:
Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Riyaz Ahmad , Bick, Eckhard , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Claudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kenney, Jessica , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lam, Lucia , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , Mori, Keiko Sophie , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Paiva, Valeria , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, 
Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Saulīte, Baiba , Schuster, Sebastian , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Spadine, Carolyn , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Wallin, Lars , Wang, Jing Xian , Washington, Jonathan North , Wirén, Mats , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Swedish Sign Language , Ukrainian , Uighur , and Vietnamese
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v1.4 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-1.4 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Candito, Marie , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Choi, Jinho , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , 
Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , and Urdu
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This release is special in that the treebanks will be used as training/development data in the CoNLL 2017 shared task (http://universaldependencies.org/conll17/). Test data are not released, except for the few treebanks that do not take part in the shared task. 64 treebanks will be in the shared task, and they correspond to the following 45 languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Urdu, Uyghur and Vietnamese.
This release fixes a bug in http://hdl.handle.net/11234/1-1976. Changed files: ud-tools-v2.0.tgz (conllu_to_text.pl, conllu_to_conllx.pl; added text_without_spaces.pl), ud-treebanks-conll2017.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt) and ud-treebanks-v2.0.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt, ar_nyuad-ud-dev.txt, ar_nyuad-ud-test.txt, ar_nyuad-ud-train.txt, cop-ud-dev.txt, cop-ud-test.txt, cop-ud-train.txt, sa-ud-dev.txt, sa-ud-test.txt, sa-ud-train.txt).
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Burchardt, Aljoscha , Candito, Marie , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Cinková, Silvie , Çöltekin, Çağrı , Connor, Miriam , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Eli, Marhaba , Elkahky, Ali , Erjavec, Tomaž , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Missilä, Anna , 
Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nainwani, Pinkey , Nedoluzhko, Anna , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Real, Livy , Reddy, Siva , Rehm, Georg , Rinaldi, Larissa , Rituma, Laura , Rosa, Rudolf , Rovati, Davide , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Stella, Antonio , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Northern Sami , Upper Sorbian , Russia Buriat , and Northern Kurdish
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Because of the shared task, the test data were kept hidden and not released together with the training and development data of UD 2.0. This release therefore complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983), making it a full release of UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. The present release also includes the development data already released with UD 2.0. Unlike regular UD releases, this one uses the folder-file structure that was visible to the systems participating in the shared task.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Candito, Marie , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Choi, Jinho , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , 
Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , and Urdu
Description:
This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Burchardt, Aljoscha , Candito, Marie , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cinková, Silvie , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Erjavec, Tomaž , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Ion, Radu , Irimia, Elena , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , 
Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shinsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Müürisep, Kaili , Nainwani, Pinkey , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rinaldi, Larissa , Rituma, Laura , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Sagot, Benoît , Saleh, Shadi , Samardžić, Tanja , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Washington, Jonathan North , Wirén, Mats , Wong, Tak-sum , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , and Telugu
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.1 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Andersen, Erik , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Ben Moshe, Yifat , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon , Davidson, Elizabeth , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, 
Neslihan , Karahóǧa, Ritván , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, 
Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Sichinava, Dmitry , Siewert, Janine 
, Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Sourov, Shafi , Spadine, Carolyn , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , and Umbrian
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.10 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.10 , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Czech , Church Slavic , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Scottish Gaelic , Irish , Galician , Gothic , Ancient Greek (to 1453) , Ancient Hebrew , Hebrew , Hindi , Croatian , Hungarian , Armenian , Western Armenian , Indonesian , Icelandic , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , and Chinese
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Belieni, Juan , Bengoetxea, Kepa , Ben Moshe, Yifat , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Castro, Maria Clara , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kaasen, 
Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Katz, Boris , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, 
Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pintucci, Rodrigo , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , 
Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João Ricardo , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Sonnenhauser, Barbara , Sourov, Shafi , Spadine, Carolyn , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. 
, Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Wille, Vanessa Berwanger , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , and Saya
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.11 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.11 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alnajjar, Khalid , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aoyama, Tatsuya , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Avelãs, Mariana , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Behzad, Shabnam , Bengoetxea, Kepa , Benli, İbrahim , Ben Moshe, Yifat , Berk, Gözde , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Branco, António , Brokaitė, Kristina , Burchardt, Aljoscha , Campos, Marisa , Candito, Marie , Caron, Bernard , Caron, Gauthier , Carvalheiro, Catarina , Carvalho, Rita , Cassidy, Lauren , Castro, Maria Clara , Castro, Sérgio , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Costa, Francisco , Courtin, Marine , Cristescu, Mihaela , Dale, Ingerid Løyning , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Doyle, Adrian , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eguchi, Masaki , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Essaidi, Farah , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Fethi, Amal , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Gerardi, Fabrício Ferraz , Gerdes, Kim , Gessler, Luke , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , 
Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Kåsen, Andre , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Kyle, Kris , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Levine, Lauren , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lin, Yi-Ju Jessica , Lindén, Krister , Liu, Yang Janet , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Martins, Cláudia , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Miller, Aaron , Mischenkova, Karina , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , 
Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Peng, Siyao Logan , Pereira, Rita , Pereira, Sílvia , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pinter, Yuval , Pinto, Clara , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Querido, Andreia , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramos, Joana , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabi, Arij , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah 
, Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João , Silveira, Aline , Silveira, Natalia , Silveira, Sara , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Sither, Ted , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Solberg, Per Erik , Sonnenhauser, Barbara , Sourov, Shafi , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , Vak, Socrates , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. 
, Wallin, Lars , Walsh, Abigail , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhu, Yilun , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , Saya , Borôro , Kirghiz , Algerian Arabic , and Old Irish (to 900)
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.12 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.12 , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Czech , Church Slavic , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Scottish Gaelic , Irish , Galician , Gothic , Ancient Greek (to 1453) , Ancient Hebrew , Hebrew , Hindi , Croatian , Hungarian , Armenian , Western Armenian , Indonesian , Icelandic , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , Chinese , Norwegian , Erzya , and Manx
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
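UDPipe writes its analyses in the CoNLL-U format used by Universal Dependencies, so downstream code typically starts by reading that format. The following is a minimal sketch of such a reader, not part of UDPipe itself: the function name parse_conllu, the field names in CONLLU_FIELDS, and the SAMPLE text are illustrative assumptions (the sample is hand-written, not actual model output).

```python
from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in order. The lowercase key
# names are an assumption of this sketch, chosen for readability.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Comment lines (starting with '#') are skipped, and blank lines end
    a sentence. Multiword-token ranges (e.g. '1-2') and empty nodes
    (e.g. '1.1') are kept as ordinary rows; callers can filter on 'id'.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


# Hand-written illustrative input, not real UDPipe output.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["id"], token["form"], token["upos"], token["deprel"])
```

All values are kept as strings, including "head", because CoNLL-U allows "_" in most columns; converting to integers is left to the caller.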
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alnajjar, Khalid , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aoyama, Tatsuya , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Avelãs, Mariana , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Behzad, Shabnam , Belieni, Juan , Bengoetxea, Kepa , Benli, İbrahim , Ben Moshe, Yifat , Berk, Gözde , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Branco, António , Brokaitė, Kristina , Burchardt, Aljoscha , Campos, Marisa , Candito, Marie , Caron, Bernard , Caron, Gauthier , Carvalheiro, Catarina , Carvalho, Rita , Cassidy, Lauren , Castro, Maria Clara , Castro, Sérgio , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Claudia , Corbetta, Daniela , Costa, Francisco , Courtin, Marine , Crabbé, Benoît , Cristescu, Mihaela , Cvetkoski, Vladimir , Dale, Ingerid Løyning , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Doyle, Adrian , Dozat, Timothy , Droganova, Kira , Duran, Magali Sanches , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eguchi, Masaki , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Essaidi, Farah , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Fethi, Amal , Foster, Jennifer , Fransen, Theodorus , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Gerardi, Fabrício Ferraz , Gerdes, Kim , Gessler, Luke , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guiller, Kirian , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huang, Yidi , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , 
Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jagodzińska, Sandra , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Kåsen, Andre , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kettnerová, Václava , Kharatyan, Lilit , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Kocharov, Petr , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Kyle, Kris , Laan, Käbi , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Levine, Lauren , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lin, Yi-Ju Jessica , Lindén, Krister , Liu, Yang Janet , Ljubešić, Nikola , Lobzhanidze, Irina , Loginova, Olga , Lopes, Lucelene , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Martins, Cláudia , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Miller, Aaron , Mischenkova, Karina , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , 
Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nunes, Maria das Graças Volpe , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Pardo, Thiago Alexandre Salgueiro , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Peng, Siyao Logan , Pereira, Rita , Pereira, Sílvia , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Pierre-Louis, Claudel , Piitulainen, Jussi , Pinter, Yuval , Pinto, Clara , Pintucci, Rodrigo , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Querido, Andreia , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Ramos, Joana , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, 
Mathilde , Rehm, Georg , Riabi, Arij , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , Savary, Agata , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schang, Emmanuel , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João , Silveira, Aline , Silveira, Natalia , Silveira, Sara , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Sither, Ted , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Solberg, Per Erik , Sonnenhauser, Barbara , Sourov, Shafi , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Tamburini, Fabio , Tan, Mary Ann C. , 
Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , Vak, Socrates , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Wille, Vanessa Berwanger , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Wu, Qishen , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhu, Yilun , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , Saya , Borôro , Kirghiz , Algerian Arabic , Old Irish (to 900) , Classical Armenian , Georgian , Haitian , Highland Puebla Nahuatl , Macedonian , Middle French (ca. 1400-1600) , and Veps
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.13 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.13 , and PUB
Creator:
Nivre, Joakim , Abrams, Mitchell , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Blokland, Rogier , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erjavec, Tomaž , Etienne, Aline , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ion, Radu , Irimia, Elena , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , 
Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shinsuke , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rießler, Michael , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Sadde, Shoval , Saleh, Shadi , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, 
Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tanaka, Takaaki , Tellier, Isabelle , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Wallin, Lars , Washington, Jonathan North , Williams, Seyi , Wirén, Mats , Woldemariam, Tsegay , Wong, Tak-sum , Yan, Chunxiao , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , and Yoruba
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Nivre, Joakim , Abrams, Mitchell , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aplonova, Katya , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Blokland, Rogier , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erjavec, Tomaž , Etienne, Aline , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kahane, 
Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Kopacewicz, Kamil , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lam, Lucia , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Mori, Shinsuke , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rießler, Michael , Rinaldi, Larissa , Rituma, Laura , 
Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Rueter, Jack , Sadde, Shoval , Sagot, Benoît , Saleh, Shadi , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tanaka, Takaaki , Tellier, Isabelle , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Wang, Jing Xian , Washington, Jonathan North , Williams, Seyi , Wirén, Mats , Woldemariam, Tsegay , Wong, Tak-sum , Yan, Chunxiao , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , and Maltese
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.3 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.3 , and PUB