Search Results
Creator:
Ramisch, Carlos , Cordeiro, Silvio Ricardo , Savary, Agata , Vincze, Veronika , Barbu Mititelu, Verginica , Bhatia, Archna , Buljan, Maja , Candito, Marie , Gantar, Polona , Giouli, Voula , Güngör, Tunga , Hawwari, Abdelati , Iñurrieta, Uxoa , Kovalevskaitė, Jolanta , Krek, Simon , Lichte, Timm , Liebeskind, Chaya , Monti, Johanna , Parra Escartín, Carla , QasemiZadeh, Behrang , Ramisch, Renata , Schneider, Nathan , Stoyanova, Ivelina , Vaidya, Ashwini , Walsh, Abigail , Aceta, Cristina , Aduriz, Itziar , Antoine, Jean-Yves , Arhar Holdt, Špela , Berk, Gözde , Bielinskienė, Agnė , Blagus, Goranka , Boizou, Loic , Bonial, Claire , Caruso, Valeria , Čibej, Jaka , Constant, Matthieu , Cook, Paul , Diab, Mona , Dimitrova, Tsvetana , Ehren, Rafael , Elbadrashiny, Mohamed , Elyovich, Hevi , Erden, Berna , Estarrona, Ainara , Fotopoulou, Aggeliki , Foufi, Vassiliki , Geeraert, Kristina , van Gompel, Maarten , Gonzalez, Itziar , Gurrutxaga, Antton , Ha-Cohen Kerner, Yaakov , Ibrahim, Rehab , Ionescu, Mihaela , Jain, Kanishka , Jazbec, Ivo-Pavao , Kavčič, Teja , Klyueva, Natalia , Kocijan, Kristina , Kovács, Viktória , Kuzman, Taja , Leseva, Svetlozara , Ljubešić, Nikola , Malka, Ruth , Markantonatou, Stella , Martínez Alonso, Héctor , Matas, Ivana , McCrae, John , de Medeiros Caseli, Helena , Onofrei, Mihaela , Palka-Binkiewicz, Emilia , Papadelli, Stella , Parmentier, Yannick , Pascucci, Antonio , Pasquer, Caroline , Pia di Buono, Maria , Puri, Vandana , Raffone, Annalisa , Ratori, Shraddha , Riccio, Anna , Sangati, Federico , Shukla, Vishakha , Simkó, Katalin , Šnajder, Jan , Somers, Clarissa , Srivastava, Shubham , Stefanova, Valentina , Taslimipoor, Shiva , Theoxari, Natasa , Todorova, Maria , Urizar, Ruben , Villavicencio, Aline , and Zilio, Leonardo
Publisher:
PARSEME
Type:
text and corpus
Subject:
Multiword expressions , verbal multiword expressions , light-verb constructions , verb-particle constructions , inherently reflexive verbs , verbal idioms , and multi-verb constructions
Language:
Bulgarian , German , Modern Greek (1453-) , Spanish , Persian , French , Hebrew , Hungarian , Italian , Lithuanian , Polish , Portuguese , Romanian , Slovenian , Turkish , Hindi , Basque , English , and Croatian
Description:
This multilingual resource contains corpora in which verbal MWEs (VMWEs) have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). VMWEs were annotated in 19 languages according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. The corpora were used in the 1.1 edition of the PARSEME Shared Task (2018).
For most languages, morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.1 (2018).
The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1
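As a rough illustration of how the cupt files might be read, the sketch below collects the annotated VMWEs of each sentence. It rests on assumptions not stated in this record: that each token line is tab-separated, that the last column carries the VMWE annotation, that `*` and `_` mean "no annotation", that the first token of an expression is tagged `id:CATEGORY` and later tokens just `id`, and that `;` separates several expressions on one token. The function name `read_vmwes` and the sample sentence are hypothetical; check the guidelines linked above before relying on any of this.

```python
def read_vmwes(lines):
    """Yield, per sentence, a dict mapping VMWE id -> category and token forms.

    Assumes tab-separated token lines whose LAST column holds the VMWE tag
    (an assumption about the cupt layout, not taken from this record).
    """
    vmwes = {}
    has_tokens = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):          # comment / metadata line
            continue
        if not line.strip():              # blank line closes a sentence
            if has_tokens:
                yield vmwes
            vmwes, has_tokens = {}, False
            continue
        cols = line.split("\t")
        has_tokens = True
        form, tag = cols[1], cols[-1]
        if tag in ("*", "_"):             # token is not part of any VMWE
            continue
        for part in tag.split(";"):       # a token may belong to several VMWEs
            mwe_id, _, category = part.partition(":")
            entry = vmwes.setdefault(mwe_id, {"category": None, "tokens": []})
            if category:                  # category appears on the first token only
                entry["category"] = category
            entry["tokens"].append(form)
    if has_tokens:                        # flush a final sentence without blank line
        yield vmwes


# Hypothetical three-token sentence with one verb-particle construction.
SAMPLE = [
    "\t".join(["1", "She", "she", "PRON", "_", "_", "2", "nsubj", "_", "_", "*"]),
    "\t".join(["2", "gave", "give", "VERB", "_", "_", "0", "root", "_", "_", "1:VPC.full"]),
    "\t".join(["3", "up", "up", "ADP", "_", "_", "2", "compound:prt", "_", "_", "1"]),
    "",
]

result = list(read_vmwes(SAMPLE))
print(result)
```

Keying on the numeric id rather than on adjacency keeps discontinuous expressions (for example "gave it up") together, which is the main reason for collecting tokens per id.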
Rights:
PARSEME Shared Task Data (v. 1.1) Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.1 , and PUB
Creator:
Ramisch, Carlos , Guillaume, Bruno , Savary, Agata , Waszczuk, Jakub , Candito, Marie , Vaidya, Ashwini , Barbu Mititelu, Verginica , Bhatia, Archna , Iñurrieta, Uxoa , Giouli, Voula , Güngör, Tunga , Jiang, Menghan , Lichte, Timm , Liebeskind, Chaya , Monti, Johanna , Ramisch, Renata , Stymme, Sara , Walsh, Abigail , Xu, Hongzhi , Palka-Binkiewicz, Emilia , Ehren, Rafael , Stymne, Sara , Constant, Matthieu , Pasquer, Caroline , Parmentier, Yannick , Antoine, Jean-Yves , Carlino, Carola , Caruso, Valeria , Di Buono, Maria Pia , Pascucci, Antonio , Raffone, Annalisa , Riccio, Anna , Sangati, Federico , Speranza, Giulia , Cordeiro, Silvio Ricardo , de Medeiros Caseli, Helena , Miranda, Isaac , Rademaker, Alexandre , Vale, Oto , Villavicencio, Aline , Wick Pedro, Gabriela , Wilkens, Rodrigo , Zilio, Leonardo , Rizea, Monica-Mihaela , Ionescu, Mihaela , Onofrei, Mihaela , Chen, Jia , Ge, Xiaomin , Hu, Fangyuan , Hu, Sha , Li, Minli , Liu, Siyuan , Qin, Zhenzhen , Sun, Ruilong , Wang, Chenweng , Xiao, Huangyang , Yan, Peiyi , Yih, Tsy , Yu, Ke , Yu, Songping , Zeng, Si , Zhang, Yongchen , Zhao, Yun , Foufi, Vassiliki , Fotopoulou, Aggeliki , Markantonatou, Stella , Papadelli, Stella , Louizou, Sevasti , Aduriz, Itziar , Estarrona, Ainara , Gonzalez, Itziar , Gurrutxaga, Antton , Uria, Larraitz , Urizar, Ruben , Foster, Jennifer , Lynn, Teresa , Elyovitch, Hevi , Ha-Cohen Kerner, Yaakov , Malka, Ruth , Jain, Kanishka , Puri, Vandana , Ratori, Shraddha , Shukla, Vishakha , Srivastava, Shubham , Berk, Gozde , Erden, Berna , and Yirmibeşoğlu, Zeynep
Publisher:
PARSEME
Type:
text and corpus
Subject:
multiword expressions , verbal multiword expressions , light verb construction , verb-particle constructions , inherently reflexive verbs , verbal idioms , and multi-verb constructions
Language:
German , Modern Greek (1453-) , Basque , French , Irish , Hebrew , Hindi , Italian , Polish , Portuguese , Romanian , Swedish , Turkish , and Chinese
Description:
This multilingual resource contains corpora in which verbal MWEs (VMWEs) have been manually annotated, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on Semi-supervised Identification of Verbal MWEs (2020).
VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).
For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format.
Morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
Rights:
PARSEME Shared Task Data (v. 1.2) Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2 , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) , http://creativecommons.org/licenses/by-nc/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) , http://creativecommons.org/licenses/by-nc-nd/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) , http://creativecommons.org/licenses/by-nc/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution 4.0 International (CC BY 4.0) , http://creativecommons.org/licenses/by/4.0/ , and PUB
Creator:
Zeman, Daniel and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
tokenization , word segmentation , morphology , tagging , syntax , parsing , and universal dependencies
Language:
Afrikaans , Arabic , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Thai , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
CoNLL 2017 and 2018 shared tasks:
Multilingual Parsing from Raw Text to Universal Dependencies
This package contains the test data in the form in which they were presented
to the participating systems: raw text files and files preprocessed by UDPipe.
The metadata.json files contain lists of files to process and to output;
README files in the respective folders describe the syntax of metadata.json.
For full training, development and gold standard test data, see
Universal Dependencies 2.0 (CoNLL 2017)
Universal Dependencies 2.2 (CoNLL 2018)
See the download links at http://universaldependencies.org/.
For more information on the shared tasks, see
http://universaldependencies.org/conll17/
http://universaldependencies.org/conll18/
Contents:
conll17-ud-test-2017-05-09 ... CoNLL 2017 test data
conll18-ud-test-2018-05-06 ... CoNLL 2018 test data
conll18-ud-test-2018-05-06-for-conll17 ... CoNLL 2018 test data with metadata
and filenames modified so that it is digestible by the 2017 systems.
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Zeman, Daniel , Potthast, Martin , Straka, Milan , Popel, Martin , Dozat, Timothy , Qi, Peng , Manning, Christopher , Shi, Tianze , Wu, Felix G. , Chen, Xilun , Cheng, Yao , Björkelund, Anders , Falenska, Agnieszka , Yu, Xiang , Kuhn, Jonas , Che, Wanxiang , Guo, Jiang , Wang, Yuxuan , Zheng, Bo , Zhao, Huaipeng , Liu, Yang , Teng, Dechuan , Liu, Ting , Lim, Kyungtae , Poibeau, Thierry , Sato, Motoki , Manabe, Hitoshi , Noji, Hiroshi , Matsumoto, Yuji , Kırnap, Ömer , Önder, Berkay Furkan , Yuret, Deniz , Straková, Jana , Vania, Clara , Zhang, Xingxing , Lopez, Adam , Heinecke, Johannes , Asadullah, Munshi , Kanerva, Jenna , Luotolahti, Juhani , Ginter, Filip , Kuan, Yu , Sofroniev, Pavel , Schill, Erik , Hinrichs, Erhard , Nguyen, Dat Quoc , Dras, Mark , Johnson, Mark , Qian, Xian , Vilares, David , Gómez-Rodríguez, Carlos , Aufrant, Lauriane , Wisniewski, Guillaume , Yvon, François , Dumitrescu, Stefan Daniel , Boroş, Tiberiu , Tufiş, Dan , Das, Ayan , Zaffar, Affan , Sarkar, Sudeshna , Wang, Hao , Zhao, Hai , Zhang, Zhisong , Hornby, Ryan , Taylor, Clark , Park, Jungyeul , de Lhoneux, Miryam , Shao, Yan , Basirat, Ali , Kiperwasser, Eliyahu , Stymne, Sara , Goldberg, Yoav , Nivre, Joakim , Akkuş, Burak Kerim , Azizoglu, Heval , Cakici, Ruket , Moor, Christophe , Merlo, Paola , Henderson, James , Wang, Haozhou , Ji, Tao , Wu, Yuanbin , Lan, Man , de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , More, Amir , Tsarfaty, Reut , Kanayama, Hiroshi , Muraoka, Masayasu , Yoshikawa, Katsumasa , Garcia, Marcos , and Gamallo, Pablo
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
dependency parser and parsebank
Language:
Arabic , Bulgarian , Russia Buriat , Czech , Catalan , Church Slavic , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , French , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Swedish , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Zeman, Daniel , Potthast, Martin , Duthoo, Elie , Mesnard, Olivier , Rybak, Piotr , Wróblewska, Alina , Che, Wanxiang , Liu, Yijia , Wang, Yuxuan , Zheng, Bo , Liu, Ting , Li, Zuchao , He, Shexia , Zhang, Zhuosheng , Zhao, Hai , Wu, Yingting , Tong, Jia-Jun , Nguyen, Dat Quoc , Verspoor, Karin , Wan, Hui , Naseem, Tahira , Lee, Young-Suk , Castelli, Vittorio , Ballesteros, Miguel , Hershcovich, Daniel , Abend, Omri , Rappoport, Ari , Smith, Aaron , Bohnet, Bernd , de Lhoneux, Miryam , Nivre, Joakim , Shao, Yan , Stymne, Sara , Kırnap, Ömer , Dayanık, Erenay , Yuret, Deniz , Kanerva, Jenna , Ginter, Filip , Miekka, Niko , Leino, Akseli , Salakoski, Tapio , Lim, KyungTae , Park, Cheoneum , Lee, Changki , Poibeau, Thierry , Bhat, Riyaz Ahmad , Bhat, Irshad , Bangalore, Srinivas , Qi, Peng , Dozat, Timothy , Zhang, Yuhao , Manning, Christopher , Boroș, Tiberiu , Dumitrescu, Stefan Daniel , Burtica, Ruxandra , Arakelyan, Gor , Hambardzumyan, Karen , Khachatrian, Hrant , Rosa, Rudolf , Mareček, David , Straka, Milan , Seker, Amit , More, Amir , Tsarfaty, Reut , Önder, Berkay Furkan , Gümeli, Can , Jawahar, Ganesh , Muller, Benjamin , Fethi, Amal , Martin, Louis , Villemonte de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , Özateş, Şaziye Betül , Özgür, Arzucan , Gungor, Tunga , Öztürk, Balkız , Ji, Tao , Liu, Yufang , Wang, Yijun , Wu, Yuanbin , Lan, Man , Chen, Danlu , Lin, Mengxiao , Hu, Zhifeng , and Qiu, Xipeng
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parsed data , conllu , and universal dependencies
Language:
Afrikaans , Arabic , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Thai , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Kubeša, David and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
entity linking , NEL , NER , dataset , and knowledge base
Language:
Afrikaans , Arabic , Armenian , Basque , Belarusian , Bulgarian , Catalan , Chinese , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , Galician , German , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Maltese , Marathi , Modern Greek (1453-) , Northern Sami , Norwegian Nynorsk , Persian , Polish , Portuguese , Romanian , Russian , Scottish Gaelic , Serbian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , Uighur , Ukrainian , Urdu , Vietnamese , and Wolof
Description:
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, enabling the combination of the knowledge base with language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
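The role of the QID as a join key can be sketched as follows. This is only an illustration under assumptions: the field names (`type`, `label`), the in-memory dict layout, and the helper `combine_by_qid` are all hypothetical and do not reflect DaMuEL's actual file format; only the idea of keying both components on the Wikidata QID comes from the description above.

```python
def combine_by_qid(knowledge_base, per_language):
    """Attach language-specific labels to language-agnostic entity records.

    knowledge_base: {qid: {...language-agnostic fields...}}   (hypothetical layout)
    per_language:   {lang: {qid: {"label": ...}}}             (hypothetical layout)
    """
    merged = {}
    for qid, entity in knowledge_base.items():
        record = dict(entity)  # copy so the input knowledge base is not modified
        record["labels"] = {
            lang: records[qid]["label"]
            for lang, records in per_language.items()
            if qid in records  # an entity may lack text in some languages
        }
        merged[qid] = record
    return merged


# Toy data: Q1085 is the Wikidata identifier for Prague.
kb = {"Q1085": {"type": "LOC"}}
texts = {
    "en": {"Q1085": {"label": "Prague"}},
    "cs": {"Q1085": {"label": "Praha"}},
    "de": {},  # no entry for this entity in German in this toy example
}

merged = combine_by_qid(kb, texts)
print(merged)
```

Because the identifier is language-agnostic, adding another language only means adding another entry to the per-language mapping; the knowledge-base side stays unchanged.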
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , and Galician
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-2988). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.4 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.4 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , and Skolt Sami
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3105). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.5 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , and Persian
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3226). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.6 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.6 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , Persian , Akuntsu , Apurinã , Khunsari , Manx , Mundurukú , Nayini , Soi , South Levantine Arabic , and Tupinambá
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3424). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.7 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7 , and PUB
Creator:
Zeman, Daniel and Droganova, Kira
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
semantic dependency and universal dependencies
Language:
Afrikaans , Assyrian Neo-Aramaic , Akkadian , Amharic , Arabic , Belarusian , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Mandarin Chinese , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Finnish , French , Irish , Gothic , Ancient Greek (to 1453) , Mbyá Guaraní , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Komi-Zyrian , Karelian , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Erzya , Dutch , Norwegian , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Tagalog , Turkish , Ukrainian , Urdu , Vietnamese , Warlpiri , Wolof , Yoruba , Galician , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Icelandic , Albanian , Persian , Akuntsu , Apurinã , Khunsari , Manx , Mundurukú , Nayini , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Western Armenian , and Central Siberian Yupik
Description:
Deep Universal Dependencies is a collection of treebanks derived semi-automatically from Universal Dependencies (http://hdl.handle.net/11234/1-3687). It contains additional deep-syntactic and semantic annotations. The version of Deep UD corresponds to the version of UD it is based on. Note, however, that some UD treebanks have been omitted from Deep UD.
Rights:
Licence Universal Dependencies v2.8 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.8 , and PUB
Creator:
Mareček, David , Yu, Zhiwei , Zeman, Daniel , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
part of speech , tagging , semi-supervised , and cross-language
Language:
Belarusian , Bosnian , Bulgarian , Czech , Serbo-Croatian , Croatian , Upper Sorbian , Macedonian , Polish , Russian , Slovak , Slovenian , Serbian , Ukrainian , Latvian , Lithuanian , Afrikaans , Danish , German , English , Faroese , Western Frisian , Swiss German , Icelandic , Limburgan , Luxembourgish , Low German , Dutch , Norwegian Nynorsk , Norwegian , Scots , Swedish , Yiddish , Aragonese , Asturian , Catalan , French , Galician , Haitian , Italian , Latin , Lombard , Neapolitan , Piemontese , Portuguese , Romanian , Spanish , Venetian , Walloon , Breton , Welsh , Scottish Gaelic , Irish , Modern Greek (1453-) , Armenian , Albanian , Dimli (individual language) , Persian , Gilaki , Kurdish , Tajik , Bengali , Bishnupriya , Gujarati , Fiji Hindi , Hindi , Marathi , Nepali (macrolanguage) , Urdu , Amharic , Arabic , Egyptian Arabic , Hebrew , Estonian , Finnish , Hungarian , Basque , Georgian , Chuvash , Azerbaijani , Turkish , Uzbek , Kazakh , Tatar , Yakut , Korean , Mongolian , Telugu , Kannada , Malayalam , Tamil , Newari , Vietnamese , Indonesian , Javanese , Malagasy , Maori , Malay (macrolanguage) , Pampanga , Sundanese , Tagalog , Waray (Philippines) , Swahili (macrolanguage) , Esperanto , Ido , Interlingua (International Auxiliary Language Association) , and Volapük
Description:
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Mareček, David , Yu, Zhiwei , Zeman, Daniel , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
part of speech , tagging , semi-supervised , and cross-language
Language:
Belarusian , Bosnian , Bulgarian , Czech , Serbo-Croatian , Croatian , Upper Sorbian , Macedonian , Polish , Russian , Slovak , Slovenian , Serbian , Ukrainian , Latvian , Lithuanian , Afrikaans , Danish , German , English , Faroese , Western Frisian , Swiss German , Icelandic , Limburgan , Luxembourgish , Low German , Dutch , Norwegian Nynorsk , Norwegian , Scots , Swedish , Yiddish , Aragonese , Asturian , Catalan , French , Galician , Haitian , Italian , Latin , Lombard , Neapolitan , Piemontese , Portuguese , Romanian , Spanish , Venetian , Walloon , Breton , Welsh , Scottish Gaelic , Irish , Modern Greek (1453-) , Armenian , Albanian , Dimli (individual language) , Persian , Gilaki , Kurdish , Tajik , Bengali , Bishnupriya , Gujarati , Fiji Hindi , Hindi , Marathi , Nepali (macrolanguage) , Urdu , Amharic , Arabic , Egyptian Arabic , Hebrew , Estonian , Finnish , Hungarian , Basque , Georgian , Chuvash , Azerbaijani , Turkish , Uzbek , Kazakh , Tatar , Yakut , Korean , Mongolian , Telugu , Kannada , Malayalam , Tamil , Newari , Vietnamese , Indonesian , Javanese , Malagasy , Maori , Malay (macrolanguage) , Pampanga , Sundanese , Tagalog , Waray (Philippines) , Swahili (macrolanguage) , Esperanto , Ido , Interlingua (International Auxiliary Language Association) , and Volapük
Description:
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Bojar, Ondřej , Straňák, Pavel , Zeman, Daniel , Jain, Gaurav , and Damani, Om Prakesh
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
English-Hindi parallel corpus and parallel corpus
Language:
Hindi and English
Description:
English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the EMILLE parallel corpus. Project identifiers: FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+).
Rights:
Creative Commons - Attribution 3.0 Unported (CC BY 3.0) , http://creativecommons.org/licenses/by/3.0/ , and PUB
Creator:
Zeman, Daniel , Mareček, David , Mašek, Jan , Popel, Martin , Ramasamy, Loganathan , Rosa, Rudolf , Štěpánek, Jan , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
treebank , Stanford dependencies , Prague dependencies , harmonization , common annotation style , and Interset
Language:
Arabic , Bulgarian , Bengali , Catalan , Czech , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , Ancient Greek (to 1453) , Hindi , Hungarian , Italian , Japanese , Latin , Dutch , Portuguese , Romanian , Russian , Slovak , Slovenian , Swedish , Tamil , Telugu , and Turkish
Description:
HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.
Rights:
HamleDT 2.0 Licence Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-2.0 , and ACA
Creator:
Zeman, Daniel , Mareček, David , Mašek, Jan , Popel, Martin , Ramasamy, Loganathan , Rosa, Rudolf , Štěpánek, Jan , and Žabokrtský, Zdeněk
Publisher:
Charles University
Type:
text and corpus
Subject:
annotated corpus , morphology , syntax , dependency , treebank , harmonized annotation , and common annotation style
Language:
Arabic , Basque , Bengali , Bulgarian , Catalan , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Modern Greek (1453-) , Ancient Greek (to 1453) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Persian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , and Turkish
Description:
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using Universal Dependencies (UD). All of the corpora that are distributed in HamleDT in full are also part of the UD project; only some corpora from the Patch group (where HamleDT provides only the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
Rights:
HamleDT 3.0 License Terms , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-3.0 , and PUB
Creator:
Bojar, Ondřej , Diatka, Vojtěch , Straňák, Pavel , Tamchyna, Aleš , and Zeman, Daniel
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parallel corpus , English-Hindi parallel corpus , and sentence-parallel
Language:
Hindi and English
Description:
HindEnCorp parallel texts (sentence-aligned) come from the following sources:
Tides contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.
EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus.

For the current release, we are extending the parallel corpus using these sources:
Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, the English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.
The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling and word choice to sentence structure. A little bit of control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.
Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary. Project identifier: LM2010013.
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Creator:
Parida, Shantipriya and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
image and corpus
Subject:
parallel corpus , corpus , multilingual , machine translation , shared task , English-Hindi parallel corpus , image captioning , and multi-modal
Language:
English and Hindi
Description:
Data
----
Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account. The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.
Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity.
Dataset Formats
--------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
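The seven-column layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes the files carry no header row and no quoting, and the names `Segment` and `read_segments` as well as the sample row are made up for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Segment(NamedTuple):
    """One row of a text file: a caption for a rectangular image region."""
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    hindi: str


def read_segments(handle: TextIO) -> Iterator[Segment]:
    """Yield one Segment per tab-delimited line (assumes no header row)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 7:
            raise ValueError(f"line {line_no}: expected 7 columns, got {len(row)}")
        image_id, x, y, width, height, english, hindi = row
        yield Segment(image_id, int(x), int(y), int(width), int(height),
                      english, hindi)


# Made-up sample row, not taken from the dataset.
sample = "12345\t15\t20\t100\t50\ta man\tएक आदमी\n"
segments = list(read_segments(io.StringIO(sample)))
```

For a real file, open it with `encoding="utf-8"` and `newline=""` and pass the handle to `read_segments`.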
Data Statistics
----------------
The statistics of the current release are given below.
Parallel Corpus Statistics
---------------------------
Dataset          Segments   English Words   Hindi Words
--------------   --------   -------------   -----------
Train               28932          143178        136722
Dev                   998            4922          4695
Test                 1595            7852          7535
Challenge Test       1400            8185          8665   (Released separately)
--------------   --------   -------------   -----------
Total               32925          164137        157617
The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:
@article{hindi-visual-genome:2019,
title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
journal={Computaci{\'o}n y Sistemas},
note={In print. Presented at CICLing 2019, La Rochelle, France},
year={2019},
}
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Parida, Shantipriya and Bojar, Ondřej
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
multilingual , neural machine translation , multi-modal , English-Hindi parallel corpus , image captioning , and image annotation
Language:
English and Hindi
Description:
Data
----
Hindi Visual Genome 1.1 is an updated version of Hindi Visual Genome 1.0. The update concerns primarily the text part of Hindi Visual Genome, fixing translation issues reported during the WAT 2019 multimodal task. In the image part, only one segment and thus one image were removed from the dataset.
Hindi Visual Genome 1.1 is used in the "WAT 2020 Multi-Modal Machine Translation Task".
Hindi Visual Genome is a multimodal dataset consisting of text and images suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.
The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome.
A third test set, called the "challenge test set", consists of 1.4K segments and was released for the WAT2019 multi-modal task. The challenge test set was created by searching for (particularly) ambiguous English words based on the embedding similarity and manually selecting those where the image helps to resolve the ambiguity. However, the surrounding words in the sentence often also include sufficient cues to identify the correct meaning of the ambiguous word.
Dataset Formats
--------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
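To crop the region a caption describes, the X, Y, Width and Height values usually have to be turned into corner coordinates first. The sketch below assumes that X and Y give the top-left corner in pixels, which the description does not state explicitly; `region_to_box` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def region_to_box(x: int, y: int, width: int, height: int,
                  image_width: Optional[int] = None,
                  image_height: Optional[int] = None) -> Tuple[int, int, int, int]:
    """Convert (x, y, width, height) to (left, top, right, bottom).

    Assumes (x, y) is the top-left corner in pixels. If the image size is
    given, the box is clipped so that it stays inside the image.
    """
    left, top = max(x, 0), max(y, 0)
    right, bottom = x + width, y + height
    if image_width is not None:
        right = min(right, image_width)
    if image_height is not None:
        bottom = min(bottom, image_height)
    if right <= left or bottom <= top:
        raise ValueError("region is empty after clipping")
    return left, top, right, bottom


box = region_to_box(15, 20, 100, 50)
clipped = region_to_box(15, 20, 100, 50, image_width=100, image_height=60)
```

The resulting 4-tuple follows the (left, top, right, bottom) order that common imaging libraries expect for crop boxes.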
Data Statistics
----------------
The statistics of the current release are given below.
Parallel Corpus Statistics
---------------------------
Dataset          Segments   English Words   Hindi Words
--------------   --------   -------------   -----------
Train               28930          143164        145448
Dev                   998            4922          4978
Test                 1595            7853          7852
Challenge Test       1400            8186          8639
--------------   --------   -------------   -----------
Total               32923          164125        166917
The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:
@article{hindi-visual-genome:2019,
title={{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
journal={Computaci{\'o}n y Sistemas},
volume={23},
number={4},
pages={1499--1505},
year={2019}
}
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Bojar, Ondřej , Straňák, Pavel , and Zeman, Daniel
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
news and web texts
Language:
Hindi
Description:
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens. Project identifiers: FP7-ICT-2007-3-231720 (EuroMatrix Plus), 7E09003 (Czech part of EM+).
Rights:
Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) , http://creativecommons.org/licenses/by-nc/3.0/ , and PUB
Creator:
Bafna, Niyati , Žabokrtský, Zdeněk , España-Bonet, Cristina , van Genabith, Josef , Kumar, Lalit "Samyak Lalit" , Suman, Sharda , and Shivay, Rahul
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Kavita Kosh Project
Type:
text and corpus
Subject:
dialect continuum , dialect variation , Indic , Indo-Aryan , Indian , and Hindi
Language:
Hindi , Marathi , Magahi , Awadhi , Bhojpuri , Braj , Haryanvi , Rajasthani , Korku , Garhwali , Chhattisgarhi , Bhili , Sanskrit , Angika , Bundeli , Kumaoni , Bhadrawahi , Bengali , Gujarati , Panjabi , Nimadi , Kanauji , Malvi , and Uncoded languages
Description:
HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India
Languages
This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions.
Namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit.
This data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ . Here are the main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are closely related to the standard Hindi dialect genealogically (such as Haryanvi and Bhojpuri), although the collection also contains languages such as Bengali and Gujarati which are more distant relatives.
- They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
- All except Sanskrit are living languages.
Data
Categorising them by pre-existing available NLP resources, we have:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages have growing interest and some datasets of a relatively small size as compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection are previously zero-resource languages. These are the languages for which this dataset is the most relevant.
Script
This dataset is entirely in Devanagari. Content in the case of languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.
Format
The dataset contains a single text file containing folksongs per language. Folksongs are separated from each other by an empty line. The first line of a new piece is the title of the folksong, and line separation within folksongs is preserved.
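A per-language file in this layout can be split into individual folksongs as sketched below. This is a minimal illustration under the assumptions that the files are UTF-8 and that blank lines occur only between songs; `parse_folksongs` and the sample text are hypothetical.

```python
from typing import Dict, List


def parse_folksongs(text: str) -> List[Dict[str, object]]:
    """Split a per-language file into songs.

    Songs are separated by empty lines; the first line of each song is
    its title, and the remaining lines are kept in their original order.
    """
    songs: List[Dict[str, object]] = []
    block: List[str] = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            songs.append({"title": block[0], "lines": block[1:]})
            block = []
    if block:  # last song when the file does not end with a blank line
        songs.append({"title": block[0], "lines": block[1:]})
    return songs


# Made-up placeholder text standing in for Devanagari content.
sample = "Title A\nline 1\nline 2\n\nTitle B\nline 3\n"
songs = parse_folksongs(sample)
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to `parse_folksongs`.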
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Bafna, Niyati , Žabokrtský, Zdeněk , España-Bonet, Cristina , van Genabith, Josef , Kumar, Lalit "Samyak Lalit" , Suman, Sharda , and Shivay, Rahul
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Kavita Kosh Project
Type:
text and corpus
Subject:
dialect continuum , dialect variation , Indic , Indo-Aryan , Indian , and Hindi
Language:
Hindi , Marathi , Magahi , Awadhi , Bhojpuri , Braj , Haryanvi , Rajasthani , Korku , Garhwali , Chhattisgarhi , Bhili , Sanskrit , Angika , Bundeli , Kumaoni , Bhadrawahi , Bengali , Gujarati , Panjabi , Nimadi , Kanauji , Malvi , and Uncoded languages
Description:
HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India
Languages
This is a collection of folksongs for 26 languages that form a dialect continuum in North India and nearby regions.
Namely Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit.
This data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ . Here are the main characteristics of the languages in this collection:
- They are all Indic languages except for Korku.
- The majority of them are closely related to the standard Hindi dialect genealogically (such as Haryanvi and Bhojpuri), although the collection also contains languages such as Bengali and Gujarati which are more distant relatives.
- All except Nepali are primarily spoken in (North) India.
- All except Sanskrit are living languages.
Data
Categorising them by pre-existing available NLP resources, we have:
* Band 1 languages: Hindi, Marathi, Punjabi, Sindhi, Gujarati, Bengali, Nepali. These languages already have other large datasets available. Since Kavita Kosh focusses largely on Hindi-related languages, we may have very little data for these other languages in this particular dataset.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Brajbhasha. These languages have growing interest and some datasets of a relatively small size as compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection are previously zero-resource languages. These are the languages for which this dataset is the most relevant.
Script
This dataset is entirely in Devanagari. Content in the case of languages not written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.
Format
The data is organized by language, with each folksong stored in a separate JSON file.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Bojar, Ondřej , Diatka, Vojtěch , Rychlý, Pavel , Straňák, Pavel , Suchomel, Vít , Tamchyna, Aleš , and Zeman, Daniel
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
corpus
Language:
Hindi
Description:
Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments when the individual sources are concatenated (each source being deduplicated on its own) with the number of segments when all sources are deduplicated together. The difference is just around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly.
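The overlap estimate described above can be illustrated with a small sketch. It assumes exact string matching of segments and expresses the difference relative to the concatenated total; both choices, and the name `overlap_ratio`, are assumptions rather than details given in the description.

```python
from typing import Dict, Iterable


def overlap_ratio(sources: Dict[str, Iterable[str]]) -> float:
    """Share of segments lost when all sources are deduplicated together.

    Each source is first deduplicated on its own; the per-source counts are
    summed and compared with the number of distinct segments overall.
    """
    unique_per_source = {name: set(segs) for name, segs in sources.items()}
    concatenated = sum(len(segs) for segs in unique_per_source.values())
    if concatenated == 0:
        return 0.0
    merged = len(set().union(*unique_per_source.values()))
    return (concatenated - merged) / concatenated


# Toy data: "s2" occurs in both crawls, so 1 of 4 segments is shared.
toy = {"crawl_a": ["s1", "s2", "s2"], "crawl_b": ["s2", "s3"]}
ratio = overlap_ratio(toy)
```

A value near 0.01 on the real sources would correspond to the roughly 1% difference reported above.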
HindMonoCorp contains data from:
Hindi web texts, a monolingual corpus containing mainly Hindi news articles, has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
Hindi corpora in W2C have been collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available—one from web harvest (W2C Web) and one from the Wikipedia (W2C Wiki).
SpiderLing is a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below.
CommonCrawl is a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us.
Intercorp – 7 books with their translations scanned and manually aligned per paragraph.
RSS Feeds from Webdunia.com and the Hindi version of BBC International followed by our custom crawler from September 2013 till January 2014. Project identifier: LM2010013.
Rights:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) , http://creativecommons.org/licenses/by-nc-sa/3.0/ , and PUB
Publisher:
Max Planck Institute for Psycholinguistics
Type:
corpus
Language:
Hindi and Tamil
Description:
Language Acquisition corpus
Rights:
Not specified
Publisher:
Max Planck Institute for Psycholinguistics
Type:
corpus
Language:
Hindi
Description:
Language and Cognition corpus
Rights:
Not specified
Creator:
Abdi Khojasteh, Hadi , Ansari, Ebrahim , and Bohlouli, Mahdi
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) and Institute for Advanced Studies in Basic Sciences (IASBS)
Type:
text and corpus
Subject:
PoS tagging , corpus , annotated corpus , multilingual , derivation , dependency parser , machine translation , informal language , spoken language , monolingual corpus , and bilingual corpus annotation
Language:
Persian , English , German , Czech , Italian , and Hindi
Description:
"Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with their dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity and automatic translation of the original Persian sentences into five different languages (EN, CS, DE, IT, HI).
Rights:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) , http://creativecommons.org/licenses/by-nc-nd/4.0/ , and PUB
Creator:
Zeman, Daniel
Publisher:
Charles University, Faculty of Mathematics and Physics
Type:
tool and toolService
Subject:
morphology , part of speech , conversion , and tagset
Language:
Arabic , Bulgarian , Bengali , Catalan , Czech , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Japanese , Multiple languages , and Portuguese
Description:
Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21 languages. Limited support of the older drivers for other languages (which are not included in this package but are available for download elsewhere) is also available; these will be fully ported to Interset 2 in future.
Interset is implemented as Perl libraries. It is also available via CPAN.
Rights:
Artistic License (Perl) 1.0 , http://opensource.org/licenses/Artistic-Perl-1.0 , and PUB
Creator:
Guillaume, Bruno , Ramisch, Carlos , Waszczuk, Jakub , Monti, Johanna , Di Buono, Maria Pia , Sangati, Federico , Speranza, Giulia , Carlino, Carola , Güngör, Tunga , Yirmibeşoğlu, Zeynep , Sak, Haşim , Saraçlar, Murat , Giouli, Voula , Foufi, Vassiliki , Ramisch, Renata , Rademaker, Alexandre , Vale, Oto , Wilkens, Rodrigo , Candito, Marie , Crabbé, Benoît , Segonne, Vincent , Liebeskind, Chaya , Stymne, Sara , Hajič, Jan , Ginter, Filip , Luotolahti, Juhani , Straka, Milan , Zeman, Daniel , Barbu Mititelu, Verginica , Cristescu, Mihaela , Vaidya, Ashwini , Bhatia, Archna , Lichte, Timm , Ehren, Rafael , Jiang, Menghan , Xu, Hongzhi , Walsh, Abigail , Irimia, Elena , and Dowling, Meghan
Publisher:
PARSEME
Type:
text and corpus
Subject:
morphosyntactic annotation , dependency trees , and morphological analysis
Language:
German , Modern Greek (1453-) , Basque , French , Irish , Hebrew , Hindi , Italian , Polish , Portuguese , Romanian , Swedish , Turkish , and Chinese
Description:
This multilingual resource contains corpora for 14 languages, gathered on the occasion of edition 1.2 of the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora, to help discover unseen verbal MWEs.
The corpora are provided in CONLL-U (https://universaldependencies.org/format.html) format. They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe).
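For orientation, a CoNLL-U file stores one token per line in ten tab-separated columns, with `#` comment lines and a blank line after each sentence. The reader below is a minimal sketch, not an official tool: `read_conllu` is a hypothetical name, the sample sentence is made up, and multiword-token ranges and empty nodes are passed through as ordinary rows.

```python
from typing import Dict, Iterable, Iterator, List

# Column names as defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():           # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):       # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:                       # file without a trailing blank line
        yield sentence


# Made-up two-token sentence for illustration.
sample = (
    "# text = Give up\n"
    "1\tGive\tgive\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tup\tup\tADP\t_\t_\t1\tcompound:prt\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(sample.splitlines()))
```

Dedicated CoNLL-U libraries exist and are preferable for production use; the sketch only shows the shape of the data.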
VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do).
For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CONLL-U format.
Morphological and syntactic information – not necessarily using UD tagsets – including parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training, development and test data, as well as the evaluation tools used in the PARSEME Shared Task 1.2 (2020). The annotation guidelines are available online: http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.2
Rights:
PARSEME Shared Task Raw Corpus Data (v. 1.2) Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.2-raw , and PUB
Creator:
Savary, Agata , Ramisch, Carlos , Guillaume, Bruno , Hawwari, Abdelati , Walsh, Abigail , Fotopoulou, Aggeliki , Bielinskienė, Agnė , Estarrona, Ainara , Gatt, Albert , Butler, Alexandra , Rademaker, Alexandre , Maldonado, Alfredo , Villavicencio, Aline , Farrugia, Alison , Muscat, Amanda , Gatt, Anabelle , Antić, Anđela , De Santis, Anna , Raffone, Annalisa , Riccio, Anna , Pascucci, Antonio , Gurrutxaga, Antton , Bhatia, Archna , Vaidya, Ashwini , Miral, Ayşenur , QasemiZadeh, Behrang , Priego Sanchez, Belem , Griciūtė, Bernadeta , Erden, Berna , Parra Escartín, Carla , Herrero, Carlos , Carlino, Carola , Pasquer, Caroline , Liebeskind, Chaya , Wang, Chenweng , Ben Khelil, Chérifa , Bonial, Claire , Somers, Clarissa , Aceta, Cristina , Krstev, Cvetana , Bejček, Eduard , Lindqvist, Ellinor , Erenmalm, Elsa , Palka-Binkiewicz, Emilia , Rimkute, Erika , Petterson, Eva , Cap, Fabienne , Hu, Fangyuan , Sangati, Federico , Wick Pedro, Gabriela , Speranza, Giulia , Jagfeld, Glorianna , Blagus, Goranka , Berk, Gözde , Attard, Greta , Eryiğit, Gülşen , Finnveden, Gustav , Martínez Alonso, Héctor , de Medeiros Caseli, Helena , Elyovich, Hevi , Xu, Hongzhi , Xiao, Huangyang , Miranda, Isaac , Jaknić, Isidora , El Maarouf, Ismail , Aduriz, Itziar , Gonzalez, Itziar , Matas, Ivana , Stoyanova, Ivelina , Jazbec, Ivo-Pavao , Busuttil, Jael , Waszczuk, Jakub , Findlay, Jamie , Bonnici, Janice , Šnajder, Jan , Antoine, Jean-Yves , Foster, Jennifer , Chen, Jia , Nivre, Joakim , Monti, Johanna , McCrae, John , Kovalevskaitė, Jolanta , Jain, Kanishka , Simkó, Katalin , Yu, Ke , Azzopardi, Kirsty , Adalı, Kübra , Uria, Larraitz , Zilio, Leonardo , Boizou, Loïc , van der Plas, Lonneke , Galea, Luke , Sarlak, Mahtab , Buljan, Maja , Cherchi, Manuela , Tanti, Marc , Di Buono, Maria Pia , Todorova, Maria , Candito, Marie , Constant, Matthieu , Shamsfard, Mehrnoush , Jiang, Menghan , Boz, Mert , Spagnol, Michael , Onofrei, Mihaela , Li, Minli , Elbadrashiny, Mohamed , Diab, Mona , Rizea, Monica-Mihaela , Hadj Mohamed, Najet , Theoxari, Natasa , Schneider, Nathan , Tabone, Nicole , Ljubešić, Nikola , Vale, Oto , Cook, Paul , Yan, Peiyi , Gantar, Polona , Ehren, Rafael , Fabri, Ray , Ibrahim, Rehab , Ramisch, Renata , Walles, Rinat , Wilkens, Rodrigo , Urizar, Ruben , Sun, Ruilong , Malka, Ruth , Galea, Sara Anne , Stymne, Sara , Louizou, Sevasti , Hu, Sha , Taslimipoor, Shiva , Ratori, Shraddha , Srivastava, Shubham , Cordeiro, Silvio Ricardo , Krek, Simon , Liu, Siyuan , Zeng, Si , Yu, Songping , Arhar Holdt, Špela , Markantonatou, Stella , Papadelli, Stella , Leseva, Svetlozara , Kuzman, Taja , Kavčič, Teja , Lynn, Teresa , Lichte, Timm , Pickard, Thomas , Dimitrova, Tsvetana , Yih, Tsy , Güngör, Tunga , Dinç, Tutkum , Iñurrieta, Uxoa , Tajalli, Vahide , Stefanova, Valentina , Caruso, Valeria , Puri, Vandana , Foufi, Vassiliki , Barbu Mititelu, Verginica , Vincze, Veronika , Kovács, Viktória , Shukla, Vishakha , Giouli, Voula , Ge, Xiaomin , Ha-Cohen Kerner, Yaakov , Öztürk, Yağmur , Yarandi, Yalda , Parmentier, Yannick , Zhang, Yongchen , Zhao, Yun , Urešová, Zdeňka , Yirmibeşoğlu, Zeynep , Qin, Zhenzhen , Stank , Cristescu, Mihaela , Zgreabăn, Bianca-Mădălina , Bărbulescu, Elena-Andreea , and Stanković, Ranka
Publisher:
PARSEME
Type:
text and corpus
Subject:
multiword expressions , verbal multiword expressions , light-verb constructions , verb-particle constructions , inherently reflexive verbs , verbal idioms , and multi-verb constructions
Language:
Arabic , Bulgarian , Czech , German , Modern Greek (1453-) , English , Spanish , Basque , Persian , French , Irish , Hebrew , Hindi , Croatian , Hungarian , Lithuanian , Italian , Maltese , Polish , Portuguese , Romanian , Slovenian , Serbian , Swedish , Turkish , and Chinese
Description:
This multilingual resource contains corpora in which verbal multiword expressions (VMWEs) have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). This is the first release of the corpora without an associated shared task. The previous version (1.2) was associated with the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). The data cover 26 languages, combining the corpora from all three previous editions (1.0, 1.1 and 1.2). VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME Shared Task 1.2. The annotation guidelines are available online: https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3 The .cupt format is detailed here: https://multiword.sourceforge.net/cupt-format/
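As a rough illustration of how the cupt annotations might be read, here is a minimal Python sketch. It assumes the layout commonly described for cupt: the ten tab-separated CoNLL-U columns followed by an eleventh column holding the VMWE codes, where "*" marks a token outside any VMWE, "_" marks an unspecified value, the first token of an expression carries "ID:CATEGORY", later tokens carry only "ID", and several codes are joined with ";". The function names, the helper make_row and the sample sentence are invented for this example; the format page linked above remains the authoritative reference.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (token ID, raw value of the last column)


def group_mwes(rows: List[Row]) -> Dict[str, dict]:
    """Group the tokens of one sentence by VMWE identifier."""
    mwes: Dict[str, dict] = {}
    for token_id, field in rows:
        if field in ("*", "_"):  # assumed markers: no VMWE / unspecified
            continue
        for code in field.split(";"):  # a token may belong to several VMWEs
            mwe_id, _, category = code.partition(":")
            entry = mwes.setdefault(mwe_id, {"category": None, "tokens": []})
            if category:  # the category is assumed to appear on the first token only
                entry["category"] = category
            entry["tokens"].append(token_id)
    return mwes


def read_cupt(lines: Iterable[str]) -> Iterator[Dict[str, dict]]:
    """Yield one VMWE dictionary per sentence (sentences end at a blank line)."""
    rows: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if rows:
                yield group_mwes(rows)
                rows = []
        elif not line.startswith("#"):  # skip comment and metadata lines
            cols = line.split("\t")
            rows.append((cols[0], cols[-1]))
    if rows:  # last sentence when the input lacks a trailing blank line
        yield group_mwes(rows)


def make_row(token_id: str, form: str, mwe: str) -> str:
    """Hypothetical helper: build an 11-column row with placeholder values."""
    return "\t".join([token_id, form] + ["_"] * 8 + [mwe])


# Invented sample sentence, not taken from the corpora.
SAMPLE = [
    "# text = She gave up",
    make_row("1", "She", "*"),
    make_row("2", "gave", "1:VPC.full"),
    make_row("3", "up", "1"),
    "",
]

sentences = list(read_cupt(SAMPLE))
print(sentences)
```

To process a real file, the same generator can be fed an open file handle, for example `read_cupt(open(path, encoding="utf-8"))`, where `path` points to one of the training, development or test files.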
Rights:
PARSEME Corpora v. 1.3 - Licence Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.3 , and PUB
Creator:
Rosa, Rudolf
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
Wikipedia , text corpora , and monolingual corpus
Language:
Abkhazian , Achinese , Adyghe , Afrikaans , Akan , Tosk Albanian , Amharic , Old English (ca. 450-1100) , Arabic , Official Aramaic (700-300 BCE) , Aragonese , Egyptian Arabic , Assamese , Asturian , Atikamekw , Avaric , Aymara , South Azerbaijani , Azerbaijani , Bashkir , Bambara , Bavarian , Central Bikol , Belarusian , Bengali , Bislama , Banjar , Tibetan , Bosnian , Bishnupriya , Breton , Buginese , Bulgarian , Russia Buriat , Catalan , Min Dong Chinese , Cebuano , Czech , Chamorro , Chechen , Cherokee , Church Slavic , Chuvash , Cheyenne , Central Kurdish , Cornish , Corsican , Cree , Crimean Tatar , Kashubian , Welsh , Danish , German , Dinka , Dimli (individual language) , Dhivehi , Lower Sorbian , Dzongkha , Modern Greek (1453-) , English , Esperanto , Estonian , Basque , Ewe , Extremaduran , Faroese , Persian , Fijian , Finnish , French , Arpitan , Northern Frisian , Western Frisian , Fulah , Friulian , Gagauz , Gan Chinese , Scottish Gaelic , Irish , Galician , Gilaki , Manx , Goan Konkani , Gothic , Guarani , Gujarati , Hakka Chinese , Haitian , Hausa , Hawaiian , Serbo-Croatian , Hebrew , Herero , Fiji Hindi , Hindi , Hiri Motu , Croatian , Upper Sorbian , Hungarian , Armenian , Igbo , Ido , Inuktitut , Interlingue , Iloko , Interlingua (International Auxiliary Language Association) , Indonesian , Inupiaq , Icelandic , Italian , Jamaican Creole English , Javanese , Lojban , Japanese , Kara-Kalpak , Kabyle , Kalaallisut , Kannada , Kashmiri , Georgian , Kanuri , Kazakh , Kabardian , Kabiyè , Khmer , Kikuyu , Kinyarwanda , Kirghiz , Komi-Permyak , Komi , Kongo , Korean , Karachay-Balkar , Kölsch , Kurdish , Ladino , Lao , Latin , Latvian , Lak , Lezghian , Ligurian , Limburgan , Lingala , Lithuanian , Lombard , Northern Luri , Latgalian , Luxembourgish , Ganda , Literary Chinese , Marshallese , Maithili , Malayalam , Marathi , Moksha , Eastern Mari , Minangkabau , Macedonian , Malagasy , Maltese , Mongolian , Maori , Western Mari , Malay (macrolanguage) , 
Creek , Mirandese , Burmese , Erzya , Mazanderani , Min Nan Chinese , Neapolitan , Nauru , Navajo , Ndonga , Low German , Nepali (macrolanguage) , Newari , Dutch , Norwegian Nynorsk , Norwegian , Novial , Pedi , Nyanja , Occitan (post 1500) , Livvi , Oriya (macrolanguage) , Oromo , Ossetian , Pangasinan , Pampanga , Panjabi , Papiamento , Picard , Pennsylvania German , Pfaelzisch , Pitcairn-Norfolk , Pali , Piemontese , Western Panjabi , Pontic , Polish , Portuguese , Pushto , Quechua , Vlax Romani , Romansh , Romanian , Rusyn , Rundi , Macedo-Romanian , Russian , Sango , Yakut , Sanskrit , Sicilian , Scots , Samogitian , Sinhala , Slovak , Slovenian , Northern Sami , Samoan , Shona , Sindhi , Somali , Southern Sotho , Spanish , Albanian , Sardinian , Sranan Tongo , Serbian , Swati , Saterfriesisch , Sundanese , Swahili (macrolanguage) , Swedish , Silesian , Tahitian , Tamil , Tatar , Tulu , Telugu , Tama (Colombia) , Tetum , Tajik , Tagalog , Thai , Tigrinya , Tonga (Tonga Islands) , Tok Pisin , Tswana , Tsonga , Turkmen , Tumbuka , Turkish , Twi , Tuvinian , Udmurt , Uighur , Ukrainian , Urdu , Uzbek , Venetian , Venda , Veps , Vietnamese , Vlaams , Volapük , Võro , Waray (Philippines) , Walloon , Wolof , Wu Chinese , Kalmyk , Xhosa , Mingrelian , Yiddish , Yoruba , Yue Chinese , Zeeuws , Zhuang , Chinese , Zulu , and Dotyali
Description:
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script that can be used to get a new version of the data is included. Note, however, that Wikipedia limits the download speed when many dumps are downloaded, so it takes a few days to download all of them (one or a few can be downloaded quickly).
Also, the format of the dumps changes from time to time, so the script will probably stop working at some point.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Rights:
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) , http://creativecommons.org/licenses/by-sa/3.0/ , and PUB
Format:
text/html
Type:
corpus
Language:
Hindi
Description:
ca. 7,000 tokens; linked with a relational database; XML encoding in progress
Rights:
http://titus.uni-frankfurt.de/texte/texte2.htm#Estart
Creator:
Kondratyuk, Dan and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
syntax , dependency parser , and universal dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , and Maltese
Description:
Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bosco, Cristina , Bowman, Sam , Celano, Giuseppe G. A. , Connor, Miriam , de Marneffe, Marie-Catherine , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Galbraith, Daniel , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Goldberg, Yoav , Gonzales, Berta , Guillaume, Bruno , Hajič, Jan , Haug, Dag , Ion, Radu , Irimia, Elena , Johannsen, Anders , Kanayama, Hiroshi , Kanerva, Jenna , Krek, Simon , Laippala, Veronika , Lenci, Alessandro , Ljubešić, Nikola , Lynn, Teresa , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , Mori, Shunsuke , Nurmi, Hanna , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Prokopidis, Prokopis , Pyysalo, Sampo , Ramasamy, Loganathan , Rosa, Rudolf , Saleh, Shadi , Schuster, Sebastian , Seeker, Wolfgang , Seraji, Mojgan , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Simov, Kiril , Smith, Aaron , Štěpánek, Jan , Suhr, Alane , Szántó, Zsolt , Tanaka, Takaaki , Tsarfaty, Reut , Uematsu, Sumire , Uria, Larraitz , Varga, Viktor , Vincze, Veronika , Žabokrtský, Zdeněk , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , and Tamil
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v1.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-1.2 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Riyaz Ahmad , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Çöltekin, Çağrı , Connor, Miriam , de Marneffe, Marie-Catherine , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Galbraith, Daniel , Garza, Sebastian , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gokirmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grūzītis, Normunds , Guillaume, Bruno , Hajič, Jan , Haug, Dag , Hladká, Barbora , Ion, Radu , Irimia, Elena , Johannsen, Anders , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kenney, Jessica , Krek, Simon , Laippala, Veronika , Lam, Lucia , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , Mori, Keiko Sophie , Mori, Shunsuke , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nikolaev, Vitaly , Nurmi, Hanna , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Ramasamy, Loganathan , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Saulīte, Baiba , Schuster, Sebastian , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Simov, Kiril , Smith, Aaron , Spadine, Carolyn , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Tanaka, Takaaki , Tsarfaty, 
Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Wang, Jing Xian , Washington, Jonathan North , Žabokrtský, Zdeněk , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , and Turkish
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v1.3 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-1.3 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Riyaz Ahmad , Bick, Eckhard , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Claudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kenney, Jessica , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lam, Lucia , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , Mori, Keiko Sophie , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Paiva, Valeria , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, 
Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Saulīte, Baiba , Schuster, Sebastian , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Spadine, Carolyn , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Wallin, Lars , Wang, Jing Xian , Washington, Jonathan North , Wirén, Mats , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Swedish Sign Language , Ukrainian , Uighur , and Vietnamese
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v1.4 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-1.4 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Candito, Marie , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Choi, Jinho , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , 
Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , and Urdu
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This release is special in that the treebanks will be used as training/development data in the CoNLL 2017 shared task (http://universaldependencies.org/conll17/). Test data are not released, except for the few treebanks that do not take part in the shared task. 64 treebanks will be in the shared task, and they correspond to the following 45 languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Urdu, Uyghur and Vietnamese.
This release fixes a bug in http://hdl.handle.net/11234/1-1976. Changed files: ud-tools-v2.0.tgz (conllu_to_text.pl, conllu_to_conllx.pl; added text_without_spaces.pl), ud-treebanks-conll2017.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt) and ud-treebanks-v2.0.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt, ar_nyuad-ud-dev.txt, ar_nyuad-ud-test.txt, ar_nyuad-ud-train.txt, cop-ud-dev.txt, cop-ud-test.txt, cop-ud-train.txt, sa-ud-dev.txt, sa-ud-test.txt, sa-ud-train.txt).
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Burchardt, Aljoscha , Candito, Marie , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Cinková, Silvie , Çöltekin, Çağrı , Connor, Miriam , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Eli, Marhaba , Elkahky, Ali , Erjavec, Tomaž , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Missilä, Anna , 
Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nainwani, Pinkey , Nedoluzhko, Anna , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Real, Livy , Reddy, Siva , Rehm, Georg , Rinaldi, Larissa , Rituma, Laura , Rosa, Rudolf , Rovati, Davide , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Stella, Antonio , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Northern Sami , Upper Sorbian , Russia Buriat , and Northern Kurdish
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Because of the shared task, the test data were kept hidden and were not released together with the training and development data of UD 2.0. This release therefore complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983), and together they form a full release of the UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. It also includes the development data already released with UD 2.0. Unlike regular UD releases, this one uses the folder and file structure that was visible to the systems participating in the shared task.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Aranzabe, Maria Jesus , Asahara, Masayuki , Atutxa, Aitziber , Ballesteros, Miguel , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Candito, Marie , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Chalub, Fabricio , Choi, Jinho , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Erjavec, Tomaž , Farkas, Richárd , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hà Mỹ, Linh , Haug, Dag , Hladká, Barbora , Hohle, Petter , Ion, Radu , Irimia, Elena , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lê Hồng, Phương , Lenci, Alessandro , Ljubešić, Nikola , Lyashevskaya, Olga , Lynn, Teresa , Makazhanov, Aibek , Manning, Christopher , Mărănduc, Cătălina , Mareček, David , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Missilä, Anna , Mititelu, Verginica , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Mori, Shunsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Mustafina, Nina , Müürisep, Kaili , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , 
Ramasamy, Loganathan , Real, Livy , Rituma, Laura , Rosa, Rudolf , Saleh, Shadi , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shakurova, Lena , Shen, Mo , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Uria, Larraitz , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Washington, Jonathan North , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , and Urdu
Description:
This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Nivre, Joakim , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Bauer, John , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Bick, Eckhard , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Burchardt, Aljoscha , Candito, Marie , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cinková, Silvie , Çöltekin, Çağrı , Connor, Miriam , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Erjavec, Tomaž , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Ion, Radu , Irimia, Elena , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , 
Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shinsuke , Moskalevskyi, Bohdan , Muischnek, Kadri , Müürisep, Kaili , Nainwani, Pinkey , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nurmi, Hanna , Ojala, Stina , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Pascual, Elena , Passarotti, Marco , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Popel, Martin , Pretkalniņa, Lauma , Prokopidis, Prokopis , Puolakainen, Tiina , Pyysalo, Sampo , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rinaldi, Larissa , Rituma, Laura , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Sagot, Benoît , Saleh, Shadi , Samardžić, Tanja , Sanguinetti, Manuela , Saulīte, Baiba , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Tanaka, Takaaki , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Washington, Jonathan North , Wirén, Mats , Wong, Tak-sum , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , and Telugu
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.1 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.1 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Andersen, Erik , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Ben Moshe, Yifat , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon , Davidson, Elizabeth , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, 
Neslihan , Karahóǧa, Ritván , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, 
Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Sichinava, Dmitry , Siewert, Janine 
, Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Sourov, Shafi , Spadine, Carolyn , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , and Umbrian
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.10 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.10 , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Czech , Church Slavic , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Scottish Gaelic , Irish , Galician , Gothic , Ancient Greek (to 1453) , Ancient Hebrew , Hebrew , Hindi , Croatian , Hungarian , Armenian , Western Armenian , Indonesian , Icelandic , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , and Chinese
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
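UD treebanks are distributed in the CoNLL-U format, which UDPipe also writes by default, so tagged and parsed output can be read with a few lines of code. The following is a minimal sketch, not part of this record: the function name `read_conllu` and the hand-written `SAMPLE` sentence are illustrative assumptions, and the sample is not actual output of any of these models.

```python
from typing import Dict, List

# The ten tab-separated CoNLL-U columns, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Hypothetical helper for illustration; comment lines are skipped and
    multiword-token or empty-node lines are kept as ordinary rows.
    """
    sentences: List[List[Dict[str, str]]] = []
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():              # a blank line closes a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):        # sentence-level comment line
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:                            # input without a trailing blank line
        sentences.append(tokens)
    return sentences


# Hand-written example sentence (an assumption, not model output).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for tok in sentence:
            print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

Keeping every column as a string avoids guessing at the `id` and `head` values, which are not always plain integers (multiword tokens use ranges such as `1-2`).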
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Belieni, Juan , Bengoetxea, Kepa , Ben Moshe, Yifat , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Castro, Maria Clara , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kaasen, 
Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Katz, Boris , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, 
Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pintucci, Rodrigo , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , 
Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João Ricardo , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Sonnenhauser, Barbara , Sourov, Shafi , Spadine, Carolyn , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. 
, Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Wille, Vanessa Berwanger , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , and Saya
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.11 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.11 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alnajjar, Khalid , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aoyama, Tatsuya , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Avelãs, Mariana , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Behzad, Shabnam , Bengoetxea, Kepa , Benli, İbrahim , Ben Moshe, Yifat , Berk, Gözde , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Branco, António , Brokaitė, Kristina , Burchardt, Aljoscha , Campos, Marisa , Candito, Marie , Caron, Bernard , Caron, Gauthier , Carvalheiro, Catarina , Carvalho, Rita , Cassidy, Lauren , Castro, Maria Clara , Castro, Sérgio , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Daniela , Costa, Francisco , Courtin, Marine , Cristescu, Mihaela , Dale, Ingerid Løyning , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Doyle, Adrian , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eguchi, Masaki , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Essaidi, Farah , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Fethi, Amal , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Gerardi, Fabrício Ferraz , Gerdes, Kim , Gessler, Luke , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , 
Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Kåsen, Andre , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Kyle, Kris , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Levine, Lauren , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lin, Yi-Ju Jessica , Lindén, Krister , Liu, Yang Janet , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Martins, Cláudia , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Miller, Aaron , Mischenkova, Karina , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , 
Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Peng, Siyao Logan , Pereira, Rita , Pereira, Sílvia , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Piitulainen, Jussi , Pinter, Yuval , Pinto, Clara , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Querido, Andreia , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramos, Joana , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabi, Arij , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah 
, Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João , Silveira, Aline , Silveira, Natalia , Silveira, Sara , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Sither, Ted , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Solberg, Per Erik , Sonnenhauser, Barbara , Sourov, Shafi , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , Vak, Socrates , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. 
, Wallin, Lars , Walsh, Abigail , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhu, Yilun , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , Saya , Borôro , Kirghiz , Algerian Arabic , and Old Irish (to 900)
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.12 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.12 , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Czech , Church Slavic , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Scottish Gaelic , Irish , Galician , Gothic , Ancient Greek (to 1453) , Ancient Hebrew , Hebrew , Hindi , Croatian , Hungarian , Armenian , Western Armenian , Indonesian , Icelandic , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , Chinese , Norwegian , Erzya , and Manx
Description:
Tokenizer, POS tagger, lemmatizer, and parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Akkurt, Salih Furkan , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Algom, Avner , Alnajjar, Khalid , Alzetta, Chiara , Andersen, Erik , Antonsen, Lene , Aoyama, Tatsuya , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranes, Glyd , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ásgeirsdóttir, Katla , Aslan, Deniz Baran , Asmazoğlu, Cengiz , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Avelãs, Mariana , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Behzad, Shabnam , Belieni, Juan , Bengoetxea, Kepa , Benli, İbrahim , Ben Moshe, Yifat , Berk, Gözde , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Branco, António , Brokaitė, Kristina , Burchardt, Aljoscha , Campos, Marisa , Candito, Marie , Caron, Bernard , Caron, Gauthier , Carvalheiro, Catarina , Carvalho, Rita , Cassidy, Lauren , Castro, Maria Clara , Castro, Sérgio , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chamila, Liyanage , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Corbetta, Claudia , Corbetta, Daniela , Costa, Francisco , Courtin, Marine , Crabbé, Benoît , Cristescu, Mihaela , Cvetkoski, Vladimir , Dale, Ingerid Løyning , Daniel, Philemon , Davidson, Elizabeth , de Alencar, Leonel Figueiredo , Dehouck, Mathieu , de Laurentiis, Martina , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Doyle, Adrian , Dozat, Timothy , Droganova, Kira , Duran, Magali Sanches , Dwivedi, Puneet , Ebert, Christian , Eckhoff, Hanne , Eguchi, Masaki , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Essaidi, Farah , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Favero, Federica , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Fethi, Amal , Foster, Jennifer , Fransen, Theodorus , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Gamba, Federica , Garcia, Marcos , Gärdenfors, Moa , Gerardi, Fabrício Ferraz , Gerdes, Kim , Gessler, Luke , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guiller, Kirian , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Harada, Takahiro , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huang, Yidi , Huerta Mendez, Marivel , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , 
Irimia, Elena , Ishola, Ọlájídé , Islamaj, Artan , Ito, Kaoru , Jagodzińska, Sandra , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Jiang, Katharine , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Karahóǧa, Ritván , Kåsen, Andre , Kayadelen, Tolga , Kengatharaiyer, Sarveswaran , Kettnerová, Václava , Kharatyan, Lilit , Kirchner, Jesse , Klementieva, Elena , Klyachko, Elena , Kocharov, Petr , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Koshevoy, Alexey , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuqi, Adrian , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Kyle, Kris , Laan, Käbi , Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Levine, Lauren , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yixuan , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lin, Yi-Ju Jessica , Lindén, Krister , Liu, Yang Janet , Ljubešić, Nikola , Lobzhanidze, Irina , Loginova, Olga , Lopes, Lucelene , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makarchuk, Ilya , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Markantonatou, Stella , Martínez Alonso, Héctor , Martín Rodríguez, Lorena , Martins, André , Martins, Cláudia , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Miller, Aaron , Mischenkova, Karina , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , 
Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nunes, Maria das Graças Volpe , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Óladóttir, Hulda , Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Ordan, Noam , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Paccosi, Teresa , Palmero Aprosio, Alessio , Panova, Anastasia , Pardo, Thiago Alexandre Salgueiro , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Pedonese, Giulia , Peljak-Łapińska, Angelika , Peng, Siyao , Peng, Siyao Logan , Pereira, Rita , Pereira, Sílvia , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Peverelli, Andrea , Phelan, Jason , Pierre-Louis, Claudel , Piitulainen, Jussi , Pinter, Yuval , Pinto, Clara , Pintucci, Rodrigo , Pirinen, Tommi A , Pitler, Emily , Plamada, Magdalena , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Pugh, Robert , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Querido, Andreia , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Ramos, Joana , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, 
Mathilde , Rehm, Georg , Riabi, Arij , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Roksandic, Ivan , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rozonoyer, Ben , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Sartor, Marta , Sasaki, Mitsuya , Saulīte, Baiba , Savary, Agata , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schang, Emmanuel , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Shvedova, Maria , Siewert, Janine , Sigurðsson, Einar Freyr , Silva, João , Silveira, Aline , Silveira, Natalia , Silveira, Sara , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Símonarson, Haukur Barri , Simov, Kiril , Sitchinava, Dmitri , Sither, Ted , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Solberg, Per Erik , Sonnenhauser, Barbara , Sourov, Shafi , Sprugnoli, Rachele , Stamou, Vivian , Steingrímsson, Steinþór , Stella, Antonio , Stephen, Abishek , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Swanson, Daniel , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Tamburini, Fabio , Tan, Mary Ann C. 
, Tanaka, Takaaki , Tanaya, Dipta , Tavoni, Mirko , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Tonelli, Sara , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Þórðarson, Sveinbjörn , Þorsteinsson, Vilhjálmur , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vagnoni, Elena , Vajjala, Sowmya , Vak, Socrates , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vedenina, Uliana , Venturi, Giulia , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wigderson, Shira , Wijono, Sri Hartati , Wille, Vanessa Berwanger , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Wu, Qishen , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhu, Yilun , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , Yakut , Ancient Hebrew , Cebuano , Guarani , Hittite , Madi , Emerillon , Umbrian , Abaza , Gheg Albanian , Malayalam , Nhengatu , Sinhala , Zacatlán-Ahuacatlán-Tepetzintla Nahuatl , Xavánte , Saya , Borôro , Kirghiz , Algerian Arabic , Old Irish (to 900) , Classical Armenian , Georgian , Haitian , Highland Puebla Nahuatl , Macedonian , Middle French (ca. 1400-1600) , and Veps
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.13 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.13 , and PUB
Creator:
Nivre, Joakim , Abrams, Mitchell , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Blokland, Rogier , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Celano, Giuseppe G. A. , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erjavec, Tomaž , Etienne, Aline , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ion, Radu , Irimia, Elena , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kayadelen, Tolga , Kettnerová, Václava , Kirchner, Jesse , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , 
Laippala, Veronika , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Shinsuke , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rießler, Michael , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Sadde, Shoval , Saleh, Shadi , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, 
Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tanaka, Takaaki , Tellier, Isabelle , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Vincze, Veronika , Wallin, Lars , Washington, Jonathan North , Williams, Seyi , Wirén, Mats , Woldemariam, Tsegay , Wong, Tak-sum , Yan, Chunxiao , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , and Yoruba
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
Creator:
Nivre, Joakim , Abrams, Mitchell , Agić, Željko , Ahrenberg, Lars , Antonsen, Lene , Aplonova, Katya , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Blokland, Rogier , Bobicev, Victoria , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erjavec, Tomaž , Etienne, Aline , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , Gonzáles Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kahane, 
Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Kopacewicz, Kamil , Kotsyba, Natalia , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lam, Lucia , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , Mendonça, Gustavo , Miekka, Niko , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Mori, Shinsuke , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrov, Slav , Piitulainen, Jussi , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rießler, Michael , Rinaldi, Larissa , Rituma, Laura , 
Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Rueter, Jack , Sadde, Shoval , Sagot, Benoît , Saleh, Shadi , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Saulīte, Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tanaka, Takaaki , Tellier, Isabelle , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Wang, Jing Xian , Washington, Jonathan North , Williams, Seyi , Wirén, Mats , Woldemariam, Tsegay , Wong, Tak-sum , Yan, Chunxiao , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , and Maltese
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.3 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.3 , and PUB
Creator:
Nivre, Joakim , Abrams, Mitchell , Agić, Željko , Ahrenberg, Lars , Aleksandravičiūtė, Gabrielė , Antonsen, Lene , Aplonova, Katya , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erjavec, Tomaž , Etienne, Aline , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hennig, Felix , Hladká, Barbora , 
Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ikeda, Takumi , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Kaşıkara, Hüner , Kaasen, Andre , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Köhn, Arne , Kopacewicz, Kamil , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lam, Lucia , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Li, Yuan , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Morioka, Tomohiko , Mori, Shinsuke , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrova, Daria , Petrov, Slav , Piitulainen, Jussi , 
Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Rueter, Jack , Sadde, Shoval , Sagot, Benoît , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tanaka, Takaaki , Tellier, Isabelle , Thomas, Guillaume , Torga, Liisi , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zeman, Daniel , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , and Mbyá Guaraní
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.4 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.4 , and PUB
Creator:
Straka, Milan and Straková, Jana
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Czech , Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Church Slavic , Coptic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , and Chinese
Description:
Tokenizer, POS tagger, lemmatizer, and parser models for 90 treebanks of 60 languages of Universal Dependencies 2.4 Treebanks, created solely using UD 2.4 data (http://hdl.handle.net/11234/1-2988). The model documentation, including performance, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_24_models .
To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Aepli, Noëmi , Agić, Željko , Ahrenberg, Lars , Aleksandravičiūtė, Gabrielė , Antonsen, Lene , Aplonova, Katya , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bellato, Sandra , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Chalub, Fabricio , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ikeda, Takumi , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Köhn, Arne , Kopacewicz, Kamil , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lam, Lucia , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Li, Cheuk Ying , Li, Josie , Li, Keying , Lim, KyungTae , Liovina, Maria , Li, Yuan , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa 
, Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsumoto, Yuji , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Morioka, Tomohiko , Mori, Shinsuke , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. , Olúòkun, Adédayọ̀ , Omura, Mai , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrova, Daria , Petrov, Slav , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Reddy, Siva , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Rovati, Davide , Roșca, Valentin , Rudina, Olga , Rueter, Jack , Sadde, Shoval , Sagot, Benoît , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Särg, Dage , Saulīte, 
Baiba , Sawanakunanon, Yanin , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tanaka, Takaaki , Tellier, Isabelle , Thomas, Guillaume , Torga, Liisi , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Tyers, Francis , Uematsu, Sumire , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zhang, Manying , and Zhu, Hanzhi
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , and Swiss German
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.5 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.5 , and PUB
Creator:
Straka, Milan and Straková, Jana
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Czech , Afrikaans , Arabic , Belarusian , Bulgarian , Catalan , Church Slavic , Coptic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Polish , Portuguese , Romanian , Russian , Sanskrit , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , Chinese , and Scottish Gaelic
Description:
Tokenizer, POS tagger, lemmatizer, and parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation, including performance, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models .
To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe .
In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Agić, Željko , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aranzabe, Maria Jesus , Arutie, Gashaw , Asahara, Masayuki , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bengoetxea, Kepa , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Chalub, Fabricio , Chi, Ethan , Choi, Jinho , Cho, Yongseok , Chun, Jayeol , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Farkas, Richárd , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Hwang, Jena , Ikeda, Takumi , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Kwak, Sookyoung , Laippala, Veronika , Lambertino, Lorenzo , Lam, Lucia , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , 
Li, Keying , Lim, KyungTae , Li, Yuan , Ljubešić, Nikola , Loginova, Olga , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Morioka, Tomohiko , Mori, Shinsuke , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. 
, Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özgür, Arzucan , Öztürk Başaran, Balkız , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perrier, Guy , Petrova, Daria , Petrov, Slav , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Ramasamy, Loganathan , Rama, Taraka , Ramisch, Carlos , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rudina, Olga , Rueter, Jack , Sadde, Shoval , Sagot, Benoît , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shohibussirri, Muh , Sichinava, Dmitry , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Stella, Antonio , Straka, Milan , Strnadová, Jana , Suhr, Alane , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tanaka, Takaaki , Tella, Samson , Tellier, Isabelle , Thomas, Guillaume , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, 
Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wakasa, Aya , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zeldes, Amir , Zhu, Hanzhi , and Zhuravleva, Anna
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , and Icelandic
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.6 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.6 , and PUB
Creator:
Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
tool and toolService
Subject:
tokenizer , POS tagger , lemmatization , tagger , parser , and dependency parser
Language:
Afrikaans , Arabic , Armenian , Belarusian , Bulgarian , Catalan , Czech , Church Slavic , Coptic , Welsh , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Persian , Finnish , French , Old French (842-ca. 1400) , Scottish Gaelic , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Literary Chinese , Marathi , Maltese , Dutch , Norwegian Nynorsk , Norwegian Bokmål , Old Russian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Tamil , Telugu , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , Gambian Wolof , Wolof , and Chinese
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models .
To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranzabe, Maria Jesus , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chi, Ethan , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huber, Eva , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Jelínek, Tomáš , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , 
Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. 
, Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özgür, Arzucan , Öztürk Başaran, Balkız , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shohibussirri, Muh , Sichinava, Dmitry , Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Steingrímsson, Steinþór , Stella, Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Szántó, 
Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tella, Samson , Tellier, Isabelle , Thomas, Guillaume , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yu, Zhuoran , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhu, Hanzhi , and Zhuravleva, Anna
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , and Tupinambá
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.7 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.7 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Aslan, Deniz Baran , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Cignarella, Alessandra T. , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon. 
, Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huber, Eva , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Ito, Kaoru , Jelínek, Tomáš , Jha, Apoorva , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari 
, Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. 
, Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Sichinava, Dmitry , Siewert, Janine , Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Sprugnoli, Rachele , Steingrímsson, Steinþór , Stella, 
Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , and Western Armenian
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:
Licence Universal Dependencies v2.8 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.8 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Aslan, Deniz Baran , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Cignarella, Alessandra T. , Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon. 
, Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huber, Eva , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Ito, Kaoru , Jelínek, Tomáš , Jha, Apoorva , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , Klementieva, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari 
, Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. 
, Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Sichinava, Dmitry , Siewert, Janine , Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela , Spadine, Carolyn , Sprugnoli, Rachele , Steingrímsson, Steinþór , Stella, Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , and Western Armenian
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Version 2.8.1 fixes a bug in 2.8 where a portion of the Dutch Alpino treebank was accidentally omitted.
Rights:
Licence Universal Dependencies v2.8 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.8 , and PUB
Creator:
Zeman, Daniel , Nivre, Joakim , Abrams, Mitchell , Ackermann, Elia , Aepli, Noëmi , Aghaei, Hamid , Agić, Željko , Ahmadi, Amir , Ahrenberg, Lars , Ajede, Chika Kennedy , Aleksandravičiūtė, Gabrielė , Alfina, Ika , Antonsen, Lene , Aplonova, Katya , Aquino, Angelina , Aragon, Carolina , Aranzabe, Maria Jesus , Arıcan, Bilge Nas , Arnardóttir, Þórunn , Arutie, Gashaw , Arwidarasti, Jessica Naraiswari , Asahara, Masayuki , Aslan, Deniz Baran , Ateyah, Luma , Atmaca, Furkan , Attia, Mohammed , Atutxa, Aitziber , Augustinus, Liesbeth , Badmaeva, Elena , Balasubramani, Keerthana , Ballesteros, Miguel , Banerjee, Esha , Bank, Sebastian , Barbu Mititelu, Verginica , Barkarson, Starkaður , Basile, Rodolfo , Basmov, Victoria , Batchelor, Colin , Bauer, John , Bedir, Seyyit Talha , Bengoetxea, Kepa , Berk, Gözde , Berzak, Yevgeni , Bhat, Irshad Ahmad , Bhat, Riyaz Ahmad , Biagetti, Erica , Bick, Eckhard , Bielinskienė, Agnė , Bjarnadóttir, Kristín , Blokland, Rogier , Bobicev, Victoria , Boizou, Loïc , Borges Völker, Emanuel , Börstell, Carl , Bosco, Cristina , Bouma, Gosse , Bowman, Sam , Boyd, Adriane , Braggaar, Anouck , Brokaitė, Kristina , Burchardt, Aljoscha , Candito, Marie , Caron, Bernard , Caron, Gauthier , Cassidy, Lauren , Cavalcanti, Tatiana , Cebiroğlu Eryiğit, Gülşen , Cecchini, Flavio Massimiliano , Celano, Giuseppe G. A. , Čéplö, Slavomír , Cesur, Neslihan , Cetin, Savas , Çetinoğlu, Özlem , Chalub, Fabricio , Chauhan, Shweta , Chi, Ethan , Chika, Taishi , Cho, Yongseok , Choi, Jinho , Chun, Jayeol , Chung, Juyeon , Cignarella, Alessandra T. 
, Cinková, Silvie , Collomb, Aurélie , Çöltekin, Çağrı , Connor, Miriam , Courtin, Marine , Cristescu, Mihaela , Daniel, Philemon , Davidson, Elizabeth , de Marneffe, Marie-Catherine , de Paiva, Valeria , Derin, Mehmet Oguz , de Souza, Elvis , Diaz de Ilarraza, Arantza , Dickerson, Carly , Dinakaramani, Arawinda , Di Nuovo, Elisa , Dione, Bamba , Dirix, Peter , Dobrovoljc, Kaja , Dozat, Timothy , Droganova, Kira , Dwivedi, Puneet , Eckhoff, Hanne , Eiche, Sandra , Eli, Marhaba , Elkahky, Ali , Ephrem, Binyam , Erina, Olga , Erjavec, Tomaž , Etienne, Aline , Evelyn, Wograine , Facundes, Sidney , Farkas, Richárd , Ferdaousi, Jannatul , Fernanda, Marília , Fernandez Alcalde, Hector , Foster, Jennifer , Freitas, Cláudia , Fujita, Kazunori , Gajdošová, Katarína , Galbraith, Daniel , Garcia, Marcos , Gärdenfors, Moa , Garza, Sebastian , Gerardi, Fabrício Ferraz , Gerdes, Kim , Ginter, Filip , Godoy, Gustavo , Goenaga, Iakes , Gojenola, Koldo , Gökırmak, Memduh , Goldberg, Yoav , Gómez Guinovart, Xavier , González Saavedra, Berta , Griciūtė, Bernadeta , Grioni, Matias , Grobol, Loïc , Grūzītis, Normunds , Guillaume, Bruno , Guillot-Barbance, Céline , Güngör, Tunga , Habash, Nizar , Hafsteinsson, Hinrik , Hajič, Jan , Hajič jr., Jan , Hämäläinen, Mika , Hà Mỹ, Linh , Han, Na-Rae , Hanifmuti, Muhammad Yudistira , Hardwick, Sam , Harris, Kim , Haug, Dag , Heinecke, Johannes , Hellwig, Oliver , Hennig, Felix , Hladká, Barbora , Hlaváčová, Jaroslava , Hociung, Florinel , Hohle, Petter , Huber, Eva , Hwang, Jena , Ikeda, Takumi , Ingason, Anton Karl , Ion, Radu , Irimia, Elena , Ishola, Ọlájídé , Ito, Kaoru , Jannat, Siratun , Jelínek, Tomáš , Jha, Apoorva , Johannsen, Anders , Jónsdóttir, Hildur , Jørgensen, Fredrik , Juutinen, Markus , K, Sarveswaran , Kaşıkara, Hüner , Kaasen, Andre , Kabaeva, Nadezhda , Kahane, Sylvain , Kanayama, Hiroshi , Kanerva, Jenna , Kara, Neslihan , Katz, Boris , Kayadelen, Tolga , Kenney, Jessica , Kettnerová, Václava , Kirchner, Jesse , 
Klementieva, Elena , Klyachko, Elena , Köhn, Arne , Köksal, Abdullatif , Kopacewicz, Kamil , Korkiakangas, Timo , Köse, Mehmet , Kotsyba, Natalia , Kovalevskaitė, Jolanta , Krek, Simon , Krishnamurthy, Parameswari , Kübler, Sandra , Kuyrukçu, Oğuzhan , Kuzgun, Aslı , Kwak, Sookyoung , Laippala, Veronika , Lam, Lucia , Lambertino, Lorenzo , Lando, Tatiana , Larasati, Septina Dian , Lavrentiev, Alexei , Lee, John , Lê Hồng, Phương , Lenci, Alessandro , Lertpradit, Saran , Leung, Herman , Levina, Maria , Li, Cheuk Ying , Li, Josie , Li, Keying , Li, Yuan , Lim, KyungTae , Lima Padovani, Bruna , Lindén, Krister , Ljubešić, Nikola , Loginova, Olga , Lusito, Stefano , Luthfi, Andry , Luukko, Mikko , Lyashevskaya, Olga , Lynn, Teresa , Macketanz, Vivien , Mahamdi, Menel , Maillard, Jean , Makazhanov, Aibek , Mandl, Michael , Manning, Christopher , Manurung, Ruli , Marşan, Büşra , Mărănduc, Cătălina , Mareček, David , Marheinecke, Katrin , Martínez Alonso, Héctor , Martín-Rodríguez, Lorena , Martins, André , Mašek, Jan , Matsuda, Hiroshi , Matsumoto, Yuji , Mazzei, Alessandro , McDonald, Ryan , McGuinness, Sarah , Mendonça, Gustavo , Merzhevich, Tatiana , Miekka, Niko , Mischenkova, Karina , Misirpashayeva, Margarita , Missilä, Anna , Mititelu, Cătălin , Mitrofan, Maria , Miyao, Yusuke , Mojiri Foroushani, AmirHossein , Molnár, Judit , Moloodi, Amirsaeid , Montemagni, Simonetta , More, Amir , Moreno Romero, Laura , Moretti, Giovanni , Mori, Keiko Sophie , Mori, Shinsuke , Morioka, Tomohiko , Moro, Shigeki , Mortensen, Bjartur , Moskalevskyi, Bohdan , Muischnek, Kadri , Munro, Robert , Murawaki, Yugo , Müürisep, Kaili , Nainwani, Pinkey , Nakhlé, Mariam , Navarro Horñiacek, Juan Ignacio , Nedoluzhko, Anna , Nešpore-Bērzkalne, Gunta , Nevaci, Manuela , Nguyễn Thị, Lương , Nguyễn Thị Minh, Huyền , Nikaido, Yoshihiro , Nikolaev, Vitaly , Nitisaroj, Rattima , Nourian, Alireza , Nurmi, Hanna , Ojala, Stina , Ojha, Atul Kr. 
, Olúòkun, Adédayọ̀ , Omura, Mai , Onwuegbuzia, Emeka , Osenova, Petya , Östling, Robert , Øvrelid, Lilja , Özateş, Şaziye Betül , Özçelik, Merve , Özgür, Arzucan , Öztürk Başaran, Balkız , Park, Hyunji Hayley , Partanen, Niko , Pascual, Elena , Passarotti, Marco , Patejuk, Agnieszka , Paulino-Passos, Guilherme , Peljak-Łapińska, Angelika , Peng, Siyao , Perez, Cenel-Augusto , Perkova, Natalia , Perrier, Guy , Petrov, Slav , Petrova, Daria , Phelan, Jason , Piitulainen, Jussi , Pirinen, Tommi A , Pitler, Emily , Plank, Barbara , Poibeau, Thierry , Ponomareva, Larisa , Popel, Martin , Pretkalniņa, Lauma , Prévost, Sophie , Prokopidis, Prokopis , Przepiórkowski, Adam , Puolakainen, Tiina , Pyysalo, Sampo , Qi, Peng , Rääbis, Andriela , Rademaker, Alexandre , Rahoman, Mizanur , Rama, Taraka , Ramasamy, Loganathan , Ramisch, Carlos , Rashel, Fam , Rasooli, Mohammad Sadegh , Ravishankar, Vinit , Real, Livy , Rebeja, Petru , Reddy, Siva , Regnault, Mathilde , Rehm, Georg , Riabov, Ivan , Rießler, Michael , Rimkutė, Erika , Rinaldi, Larissa , Rituma, Laura , Rizqiyah, Putri , Rocha, Luisa , Rögnvaldsson, Eiríkur , Romanenko, Mykhailo , Rosa, Rudolf , Roșca, Valentin , Rovati, Davide , Rudina, Olga , Rueter, Jack , Rúnarsson, Kristján , Sadde, Shoval , Safari, Pegah , Sagot, Benoît , Sahala, Aleksi , Saleh, Shadi , Salomoni, Alessio , Samardžić, Tanja , Samson, Stephanie , Sanguinetti, Manuela , Sanıyar, Ezgi , Särg, Dage , Saulīte, Baiba , Sawanakunanon, Yanin , Saxena, Shefali , Scannell, Kevin , Scarlata, Salvatore , Schneider, Nathan , Schuster, Sebastian , Schwartz, Lane , Seddah, Djamé , Seeker, Wolfgang , Seraji, Mojgan , Shahzadi, Syeda , Shen, Mo , Shimada, Atsuko , Shirasu, Hiroyuki , Shishkina, Yana , Shohibussirri, Muh , Sichinava, Dmitry , Siewert, Janine , Sigurðsson, Einar Freyr , Silveira, Aline , Silveira, Natalia , Simi, Maria , Simionescu, Radu , Simkó, Katalin , Šimková, Mária , Simov, Kiril , Skachedubova, Maria , Smith, Aaron , Soares-Bastos, Isabela 
, Sourov, Shafi , Spadine, Carolyn , Sprugnoli, Rachele , Steingrímsson, Steinþór , Stella, Antonio , Straka, Milan , Strickland, Emmett , Strnadová, Jana , Suhr, Alane , Sulestio, Yogi Lesmana , Sulubacak, Umut , Suzuki, Shingo , Szántó, Zsolt , Taguchi, Chihiro , Taji, Dima , Takahashi, Yuta , Tamburini, Fabio , Tan, Mary Ann C. , Tanaka, Takaaki , Tanaya, Dipta , Tella, Samson , Tellier, Isabelle , Testori, Marinella , Thomas, Guillaume , Torga, Liisi , Toska, Marsida , Trosterud, Trond , Trukhina, Anna , Tsarfaty, Reut , Türk, Utku , Tyers, Francis , Uematsu, Sumire , Untilov, Roman , Urešová, Zdeňka , Uria, Larraitz , Uszkoreit, Hans , Utka, Andrius , Vajjala, Sowmya , van der Goot, Rob , Vanhove, Martine , van Niekerk, Daniel , van Noord, Gertjan , Varga, Viktor , Villemonte de la Clergerie, Eric , Vincze, Veronika , Vlasova, Natalia , Wakasa, Aya , Wallenberg, Joel C. , Wallin, Lars , Walsh, Abigail , Wang, Jing Xian , Washington, Jonathan North , Wendt, Maximilan , Widmer, Paul , Wijono, Sri Hartati , Williams, Seyi , Wirén, Mats , Wittern, Christian , Woldemariam, Tsegay , Wong, Tak-sum , Wróblewska, Alina , Yako, Mary , Yamashita, Kayo , Yamazaki, Naoki , Yan, Chunxiao , Yasuoka, Koichi , Yavrumyan, Marat M. , Yenice, Arife Betül , Yıldız, Olcay Taner , Yu, Zhuoran , Yuliawati, Arlisa , Žabokrtský, Zdeněk , Zahra, Shorouq , Zeldes, Amir , Zhou, He , Zhu, Hanzhi , Zhuravleva, Anna , and Ziane, Rayan
Publisher:
Universal Dependencies Consortium
Type:
text and corpus
Subject:
treebank , dependency , syntax , morphology , harmonized annotation , interset , universal tagset , and stanford dependencies
Language:
Ancient Greek (to 1453) , Arabic , Basque , Bulgarian , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Gothic , Modern Greek (1453-) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Norwegian , Church Slavic , Persian , Polish , Portuguese , Romanian , Slovenian , Spanish , Swedish , Tamil , Catalan , Chinese , Galician , Kazakh , Latvian , Russian , Turkish , Coptic , Sanskrit , Slovak , Ukrainian , Uighur , Vietnamese , Belarusian , Korean , Lithuanian , Urdu , Russia Buriat , Northern Kurdish , Northern Sami , Upper Sorbian , Afrikaans , Yue Chinese , Marathi , Serbian , Swedish Sign Language , Telugu , Amharic , Armenian , Breton , Faroese , Komi-Zyrian , Nigerian Pidgin , Old French (842-ca. 1400) , Tagalog , Thai , Warlpiri , Yoruba , Akkadian , Bambara , Erzya , Maltese , Welsh , Wolof , Assyrian Neo-Aramaic , Literary Chinese , Old Russian , Karelian , Mbyá Guaraní , Bhojpuri , Komi-Permyak , Livvi , Moksha , Scottish Gaelic , Skolt Sami , Swiss German , Albanian , Icelandic , Akuntsu , Apurinã , Chukot , Khunsari , Manx , Mundurukú , Nayini , Old Turkish , Soi , South Levantine Arabic , Tupinambá , Beja , Western Frisian , Guajajára , Urubú-Kaapor , Kangri , K'iche' , Low German , Makuráp , Central Siberian Yupik , Western Armenian , Bengali , Javanese , Karo (Brazil) , Ligurian , Neapolitan , Tatar , Xibe , and Yakut
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Version 2.8.1 fixes a bug in 2.8 where a portion of the Dutch Alpino treebank was accidentally omitted.
Rights:
Licence Universal Dependencies v2.9 , https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.9 , and PUB
Creator:
Žabokrtský, Zdeněk , Bafna, Nyati , Bodnár, Jan , Kyjánek, Lukáš , Svoboda, Emil , Ševčíková, Magda , Vidra, Jonáš , Angle, Sachi , Ansari, Ebrahim , Arkhangelskiy, Timofey , Batsuren, Khuyagbaatar , Bella, Gábor , Bertinetto, Pier Marco , Bonami, Olivier , Celata, Chiara , Daniel, Michael , Fedorenko, Alexei , Filko, Matea , Giunchiglia, Fausto , Haghdoost, Hamid , Hathout, Nabil , Khomchenkova, Irina , Khurshudyan, Victoria , Levonian, Dmitri , Litta, Eleonora , Medvedeva, Maria , Muralikrishna, S. N. , Namer, Fiammetta , Nikravesh, Mahshid , Padó, Sebastian , Passarotti, Marco , Plungian, Vladimir , Polyakov, Alexey , Potapov, Mihail , Pruthwik, Mishra , Rao B, Ashwath , Rubakov, Sergei , Samar, Husain , Sharma, Dipti Misra , Šnajder, Jan , Šojat, Krešimir , Štefanec, Vanja , Talamo, Luigi , Tribout, Delphine , Vodolazsky, Daniil , Vydrin, Arseniy , Zakirova, Aigul , and Zeller, Britta
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text , lexicon , and lexicalConceptualResource
Subject:
universal segmentations , morphological segmentation , word segmentation , segmentation , morphology , morphemes , morphological dictionary , unisegments , morph , and multilingual
Language:
Czech , Catalan , German , English , Persian , Finnish , French , Serbo-Croatian , Croatian , Hungarian , Italian , Komi-Zyrian , Latin , Moksha , Mari (Russia) , Mongolian , Erzya , Polish , Portuguese , Russian , Spanish , Swedish , Tajik , Udmurt , Armenian , Bengali , Hindi , Malayalam , Marathi , and Kannada
Description:
Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation scheme for many languages. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentations, together with information about the word and the segmented units, e.g., part-of-speech categories and types of morphs/morphemes. The current public version of the collection contains 38 harmonised segmentation datasets covering 30 different languages.
Rights:
Universal Segmentations 1.0 License Terms , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-unisegs-1.0 , and PUB
Creator:
Majliš, Martin
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
multilingual corpora
Language:
Afrikaans , Tosk Albanian , Amharic , Arabic , Aragonese , Egyptian Arabic , Asturian , Azerbaijani , Belarusian , Bengali , Bosnian , Bishnupriya , Breton , Buginese , Bulgarian , Catalan , Cebuano , Czech , Chuvash , Corsican , Welsh , Danish , German , Dimli (individual language) , Modern Greek (1453-) , English , Esperanto , Estonian , Basque , Faroese , Persian , Finnish , French , Western Frisian , Gan Chinese , Scottish Gaelic , Irish , Galician , Gilaki , Gujarati , Haitian , Serbo-Croatian , Hebrew , Fiji Hindi , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Ido , Interlingua (International Auxiliary Language Association) , Indonesian , Icelandic , Italian , Javanese , Japanese , Kannada , Georgian , Kazakh , Korean , Kurdish , Latin , Latvian , Limburgan , Lithuanian , Lombard , Luxembourgish , Malayalam , Marathi , Macedonian , Malagasy , Mongolian , Maori , Malay (macrolanguage) , Burmese , Neapolitan , Low German , Nepali (macrolanguage) , Newari , Dutch , Norwegian Nynorsk , Norwegian , Occitan (post 1500) , Ossetian , Pampanga , Piemontese , Polish , Portuguese , Quechua , Romanian , Russian , Yakut , Sicilian , Scots , Slovak , Slovenian , Spanish , Albanian , Serbian , Sundanese , Swahili (macrolanguage) , Swedish , Tamil , Tatar , Telugu , Tajik , Tagalog , Thai , Turkish , Ukrainian , Urdu , Uzbek , Venetian , Vietnamese , Volapük , Waray (Philippines) , Walloon , Yiddish , Yoruba , and Chinese
Description:
A set of corpora for 120 languages, automatically collected from Wikipedia and the web.
Collected using the W2C toolset: http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1
Rights:
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) , http://creativecommons.org/licenses/by-sa/3.0/ , and PUB