Creator:
Zeman, Daniel , Potthast, Martin , Straka, Milan , Popel, Martin , Dozat, Timothy , Qi, Peng , Manning, Christopher , Shi, Tianze , Wu, Felix G. , Chen, Xilun , Cheng, Yao , Björkelund, Anders , Falenska, Agnieszka , Yu, Xiang , Kuhn, Jonas , Che, Wanxiang , Guo, Jiang , Wang, Yuxuan , Zheng, Bo , Zhao, Huaipeng , Liu, Yang , Teng, Dechuan , Liu, Ting , Lim, Kyungtae , Poibeau, Thierry , Sato, Motoki , Manabe, Hitoshi , Noji, Hiroshi , Matsumoto, Yuji , Kırnap, Ömer , Önder, Berkay Furkan , Yuret, Deniz , Straková, Jana , Vania, Clara , Zhang, Xingxing , Lopez, Adam , Heinecke, Johannes , Asadullah, Munshi , Kanerva, Jenna , Luotolahti, Juhani , Ginter, Filip , Kuan, Yu , Sofroniev, Pavel , Schill, Erik , Hinrichs, Erhard , Nguyen, Dat Quoc , Dras, Mark , Johnson, Mark , Qian, Xian , Vilares, David , Gómez-Rodríguez, Carlos , Aufrant, Lauriane , Wisniewski, Guillaume , Yvon, François , Dumitrescu, Stefan Daniel , Boroş, Tiberiu , Tufiş, Dan , Das, Ayan , Zaffar, Affan , Sarkar, Sudeshna , Wang, Hao , Zhao, Hai , Zhang, Zhisong , Hornby, Ryan , Taylor, Clark , Park, Jungyeul , de Lhoneux, Miryam , Shao, Yan , Basirat, Ali , Kiperwasser, Eliyahu , Stymne, Sara , Goldberg, Yoav , Nivre, Joakim , Akkuş, Burak Kerim , Azizoglu, Heval , Cakici, Ruket , Moor, Christophe , Merlo, Paola , Henderson, James , Wang, Haozhou , Ji, Tao , Wu, Yuanbin , Lan, Man , de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , More, Amir , Tsarfaty, Reut , Kanayama, Hiroshi , Muraoka, Masayasu , Yoshikawa, Katsumasa , Garcia, Marcos , and Gamallo, Pablo
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
dependency parser and parsebank
Language:
Arabic , Bulgarian , Russia Buriat , Czech , Catalan , Church Slavic , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , French , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Swedish , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
This package contains the system outputs from the CoNLL 2017 Shared Task in Multilingual Parsing from Raw Text to Universal Dependencies.
Rights:
Licence Universal Dependencies v2.0 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.0 , and PUB
Creator:
Zeman, Daniel , Potthast, Martin , Duthoo, Elie , Mesnard, Olivier , Rybak, Piotr , Wróblewska, Alina , Che, Wanxiang , Liu, Yijia , Wang, Yuxuan , Zheng, Bo , Liu, Ting , Li, Zuchao , He, Shexia , Zhang, Zhuosheng , Zhao, Hai , Wu, Yingting , Tong, Jia-Jun , Nguyen, Dat Quoc , Verspoor, Karin , Wan, Hui , Naseem, Tahira , Lee, Young-Suk , Castelli, Vittorio , Ballesteros, Miguel , Hershcovich, Daniel , Abend, Omri , Rappoport, Ari , Smith, Aaron , Bohnet, Bernd , de Lhoneux, Miryam , Nivre, Joakim , Shao, Yan , Stymne, Sara , Kırnap, Ömer , Dayanık, Erenay , Yuret, Deniz , Kanerva, Jenna , Ginter, Filip , Miekka, Niko , Leino, Akseli , Salakoski, Tapio , Lim, KyungTae , Park, Cheoneum , Lee, Changki , Poibeau, Thierry , Bhat, Riyaz Ahmad , Bhat, Irshad , Bangalore, Srinivas , Qi, Peng , Dozat, Timothy , Zhang, Yuhao , Manning, Christopher , Boroș, Tiberiu , Dumitrescu, Stefan Daniel , Burtica, Ruxandra , Arakelyan, Gor , Hambardzumyan, Karen , Khachatrian, Hrant , Rosa, Rudolf , Mareček, David , Straka, Milan , Seker, Amit , More, Amir , Tsarfaty, Reut , Önder, Berkay Furkan , Gümeli, Can , Jawahar, Ganesh , Muller, Benjamin , Fethi, Amal , Martin, Louis , Villemonte de la Clergerie, Eric , Sagot, Benoît , Seddah, Djamé , Özateş, Şaziye Betül , Özgür, Arzucan , Gungor, Tunga , Öztürk, Balkız , Ji, Tao , Liu, Yufang , Wang, Yijun , Wu, Yuanbin , Lan, Man , Chen, Danlu , Lin, Mengxiao , Hu, Zhifeng , and Qiu, Xipeng
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
parsed data , conllu , and universal dependencies
Language:
Afrikaans , Arabic , Breton , Bulgarian , Russia Buriat , Catalan , Czech , Church Slavic , Danish , German , Modern Greek (1453-) , English , Estonian , Basque , Faroese , Persian , Finnish , French , Old French (842-ca. 1400) , Irish , Galician , Gothic , Ancient Greek (to 1453) , Hebrew , Hindi , Croatian , Upper Sorbian , Hungarian , Armenian , Indonesian , Italian , Japanese , Kazakh , Northern Kurdish , Korean , Latin , Latvian , Dutch , Norwegian , Nigerian Pidgin , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Northern Sami , Spanish , Serbian , Swedish , Thai , Turkish , Uighur , Ukrainian , Urdu , Vietnamese , and Chinese
Description:
Test data parsed by systems submitted to the CoNLL 2018 UD parsing shared task.
Rights:
Licence Universal Dependencies v2.2 , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-UD-2.2 , and PUB
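The record above lists "conllu" among its subjects, so the parsed test data can presumably be read as CoNLL-U files. The following is a minimal sketch of such a reader, not code shipped with the package. It assumes the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with `#`, a blank line between sentences); the function name `read_conllu` and the returned dictionary keys are hypothetical.

```python
from typing import Iterable, Iterator


def read_conllu(lines: Iterable[str]) -> Iterator[list[dict]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: list[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        # Keep only ordinary words; multiword ranges ("1-2") and
        # empty nodes ("1.1") have non-integer IDs and are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence  # file did not end with a blank line


# Hypothetical sample, not taken from the shared task data.
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
sentences = list(read_conllu(sample.splitlines()))
```

Reading sentence by sentence keeps memory use flat, which matters when iterating over the outputs of many systems.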
Creator:
Náplava, Jakub , Straka, Milan , Hajič, Jan , and Straňák, Pavel
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
diacritical marks generation and natural language correction
Language:
Czech , Vietnamese , Romanian , Polish , Slovak , Spanish , Croatian , Irish , Latvian , Hungarian , French , and Turkish
Description:
Corpus of texts in 12 languages. For each language, we provide one training, one development and one testing set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except for the Wikipedia and Web training sets, which can contain similar sentences. The data are segmented into sentences, which are further tokenized into words.
All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the method called uninames, which behaves better for some languages.
The code for training a recurrent neural network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
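The diacritics-stripping step described in the record above can be illustrated with a short sketch. This is not the attached diacritization_stripping.py; it only shows two common ways to strip diacritics with Python's standard `unicodedata` module. The function names are hypothetical, and the assumption that a name-based method resembles the script's uninames mode is a guess based on the mode's name alone.

```python
import unicodedata


def strip_by_decomposition(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_by_name(text: str) -> str:
    """Replace each character by the one named before ' WITH ' in its Unicode name.

    For example, 'LATIN SMALL LETTER L WITH STROKE' maps to
    'LATIN SMALL LETTER L'. Characters without such a name are kept.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        base, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base)
            except KeyError:
                pass  # no character carries the shortened name; keep original
        out.append(ch)
    return "".join(out)


czech = strip_by_name("Příliš žluťoučký kůň")
polish_by_name = strip_by_name("łódź")
polish_by_decomposition = strip_by_decomposition("łódź")
```

The two approaches differ on letters such as Polish "ł" or Vietnamese "đ", which have no canonical decomposition: the decomposition-based function leaves them unchanged, while the name-based one maps them to "l" and "d".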
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Format:
application/octet-stream
Type:
corpus
Language:
Croatian
Description:
Manually tagged dependency treebank, analytical layer according to the PDT formalism adapted for Croatian
Rights:
Not specified
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Format:
application/octet-stream
Type:
lexicalConceptualResource
Language:
Croatian
Description:
38,573 lemmas, plain text; database file
Rights:
Not specified
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Type:
toolService
Language:
Croatian
Description:
Online service for lemmatization and full POS or MSD tagging of Croatian texts.
Rights:
Not specified
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Type:
lexicalConceptualResource
Language:
Croatian
Description:
110,000+ lemmas; 3,900,000+ word-forms; MULTEXT-East lexicon format
Rights:
Not specified
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Type:
corpus
Language:
Croatian
Description:
This is the reference corpus of standard Croatian. Version 3.0, which is accessible via NoSketch Engine, has 216.8 million tokens. The corpus is tokenised, lemmatised and tagged with MSDs (morphosyntactic descriptions).
Rights:
Not specified
Publisher:
University of Zagreb, Faculty of Humanities and Social Sciences
Type:
corpus
Language:
Croatian and English
Description:
written; domain-specific (newspaper); synchronic; bilingual; parallel; unidirectional; XML; S-alignment
Rights:
Not specified
Creator:
Kubeša, David and Straka, Milan
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
entity linking , NEL , NER , dataset , and knowledge base
Language:
Afrikaans , Arabic , Armenian , Basque , Belarusian , Bulgarian , Catalan , Chinese , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , Galician , German , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Korean , Latin , Latvian , Lithuanian , Maltese , Marathi , Modern Greek (1453-) , Northern Sami , Norwegian Nynorsk , Persian , Polish , Portuguese , Romanian , Russian , Scottish Gaelic , Serbian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , Uighur , Ukrainian , Urdu , Vietnamese , and Wolof
Description:
We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID is used as a persistent, language-agnostic identifier, which makes it possible to combine the knowledge base with the language-specific texts and information for each entity. Wikipedia documents deliberately annotate only a single mention for every entity present; we further automatically detect all mentions of named entities linked from each document. The dataset contains 27.9 million named entities in the knowledge base and 12.3 billion tokens from Wikipedia texts. The dataset is published under the CC BY-SA licence.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
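The DaMuEL record above says that the Wikidata QID ties the language-agnostic knowledge base to the per-language texts. A minimal sketch of such a join follows. The in-memory layout (plain dictionaries keyed by QID), the function name `attach_labels`, and the sample values are all hypothetical; the record does not describe the actual file format of the dataset.

```python
def attach_labels(
    kb: dict[str, dict],
    labels_by_lang: dict[str, dict[str, str]],
) -> dict[str, dict]:
    """Combine language-agnostic entity info with per-language labels by QID."""
    # Copy each knowledge-base entry and give it an empty label map.
    merged = {qid: {**info, "labels": {}} for qid, info in kb.items()}
    for lang, labels in labels_by_lang.items():
        for qid, label in labels.items():
            # Labels for QIDs missing from the knowledge base are ignored.
            if qid in merged:
                merged[qid]["labels"][lang] = label
    return merged


# Illustrative values only, not taken from the dataset.
kb = {"Q1085": {"type": "LOC"}}
labels_by_lang = {
    "cs": {"Q1085": "Praha"},
    "en": {"Q1085": "Prague", "Q0000": "unknown entity"},
}
merged = attach_labels(kb, labels_by_lang)
```

Keying everything by QID means the per-language files can be loaded independently and merged only for the languages a given experiment needs.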