Search Results
Creator:
Mareček, David , Yu, Zhiwei , Zeman, Daniel , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
part of speech , tagging , semi-supervised , and cross-language
Language:
Belarusian , Bosnian , Bulgarian , Czech , Serbo-Croatian , Croatian , Upper Sorbian , Macedonian , Polish , Russian , Slovak , Slovenian , Serbian , Ukrainian , Latvian , Lithuanian , Afrikaans , Danish , German , English , Faroese , Western Frisian , Swiss German , Icelandic , Limburgan , Luxembourgish , Low German , Dutch , Norwegian Nynorsk , Norwegian , Scots , Swedish , Yiddish , Aragonese , Asturian , Catalan , French , Galician , Haitian , Italian , Latin , Lombard , Neapolitan , Piemontese , Portuguese , Romanian , Spanish , Venetian , Walloon , Breton , Welsh , Scottish Gaelic , Irish , Modern Greek (1453-) , Armenian , Albanian , Dimli (individual language) , Persian , Gilaki , Kurdish , Tajik , Bengali , Bishnupriya , Gujarati , Fiji Hindi , Hindi , Marathi , Nepali (macrolanguage) , Urdu , Amharic , Arabic , Egyptian Arabic , Hebrew , Estonian , Finnish , Hungarian , Basque , Georgian , Chuvash , Azerbaijani , Turkish , Uzbek , Kazakh , Tatar , Yakut , Korean , Mongolian , Telugu , Kannada , Malayalam , Tamil , Newari , Vietnamese , Indonesian , Javanese , Malagasy , Maori , Malay (macrolanguage) , Pampanga , Sundanese , Tagalog , Waray (Philippines) , Swahili (macrolanguage) , Esperanto , Ido , Interlingua (International Auxiliary Language Association) , and Volapük
Description:
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset.
2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic, Germanic, and Romance languages were each tagged by a classifier trained only on the respective language group. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
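The routing rule in item 3 can be illustrated with a minimal Python sketch. The table below is hypothetical: it lists only a handful of languages, and the group labels and the function name `classifier_for` are invented for illustration, not taken from the tagger itself.

```python
# Hypothetical routing table: only a few languages are listed for illustration.
GROUP_OF = {
    "Czech": "balto-slavic",
    "Polish": "balto-slavic",
    "Lithuanian": "balto-slavic",
    "German": "germanic",
    "Swedish": "germanic",
    "French": "romance",
    "Romanian": "romance",
}


def classifier_for(language: str) -> str:
    """Return the label of the classifier that would tag `language`.

    Languages in one of the three groups get the group-specific classifier;
    every other language falls back to the one trained on all languages.
    """
    return GROUP_OF.get(language, "all-languages")


if __name__ == "__main__":
    for lang in ("Czech", "Swedish", "Romanian", "Finnish"):
        print(lang, "->", classifier_for(lang))
```

A dictionary lookup with a default keeps the fallback case ("trained on all available languages") explicit without enumerating every remaining language.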
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Jan Patočka
Publisher:
Pp. 132–159. Article. [Dedicated to F. Fajfr on his 80th birthday in 1972 and to B. Komárková on her 70th birthday in 1973.]
Type:
Text
Subject:
1975 , 1979/25 , 1981/6 , 1981/7 , 1988/28 , 1988/31 , 1988/32 , 1988/33 , 1988/34 , 1994/7 , 1996/4 , 1996/7 , 1998/3 , 1999/8 , 2001/9 , 2002/21 , 2006/1 , 2007/1 , 2008/3 , be , bg , cs , de , en , es , fr , fulltext , hu , I/1979 , it , lt , no , pl , ru , SS-3/PD-III , sv , and uk
Language:
Czech , English , Bulgarian , French , Italian , Lithuanian , Hungarian , German , Norwegian , Polish , Russian , Belarusian , Spanish , Swedish , and Ukrainian
Rights:
open access and Rights holder: Archiv Jana Patočky, z.s.
Creator:
Zeman, Daniel , Mareček, David , Mašek, Jan , Popel, Martin , Ramasamy, Loganathan , Rosa, Rudolf , Štěpánek, Jan , and Žabokrtský, Zdeněk
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
treebank , Stanford dependencies , Prague dependencies , harmonization , common annotation style , and Interset
Language:
Arabic , Bulgarian , Bengali , Catalan , Czech , Danish , German , Modern Greek (1453-) , English , Spanish , Estonian , Basque , Persian , Finnish , Ancient Greek (to 1453) , Hindi , Hungarian , Italian , Japanese , Latin , Dutch , Portuguese , Romanian , Russian , Slovak , Slovenian , Swedish , Tamil , Telugu , and Turkish
Description:
HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes.
Rights:
HamleDT 2.0 Licence Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-2.0 , and ACA
Creator:
Zeman, Daniel , Mareček, David , Mašek, Jan , Popel, Martin , Ramasamy, Loganathan , Rosa, Rudolf , Štěpánek, Jan , and Žabokrtský, Zdeněk
Publisher:
Charles University
Type:
text and corpus
Subject:
annotated corpus , morphology , syntax , dependency , treebank , harmonized annotation , and common annotation style
Language:
Arabic , Basque , Bengali , Bulgarian , Catalan , Croatian , Czech , Danish , Dutch , English , Estonian , Finnish , French , German , Modern Greek (1453-) , Ancient Greek (to 1453) , Hebrew , Hindi , Hungarian , Indonesian , Irish , Italian , Japanese , Latin , Persian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Spanish , Swedish , Tamil , Telugu , and Turkish
Description:
HamleDT (HArmonized Multi-LanguagE Dependency Treebank) is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Update (November 2017): for a current collection of harmonized dependency treebanks, we recommend using Universal Dependencies (UD). All of the corpora that are distributed in full in HamleDT are also part of the UD project; only some corpora from the Patch group (for which HamleDT provides the harmonizing scripts but not the full corpus data) are available in HamleDT but not in UD.
Rights:
HamleDT 3.0 License Terms , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-hamledt-3.0 , and PUB
Creator:
Jan Patočka
Publisher:
Pp. 160–203. Article.
Type:
Text
Subject:
1975 , 1979/25 , 1980/27 , 1981/6 , 1981/7 , 1988/28 , 1988/31 , 1988/32 , 1988/34 , 1994/7 , 1996/4 , 1996/7 , 1997/7 , 1998/3 , 1999/8 , 2001/9 , 2002/21 , 2006/1 , 2007/1 , 2008/3 , be , bg , cs , de , en , es , fr , fulltext , hu , it , lt , no , pl , ru , sl , sr , SS-3/PD-III , sv , and uk
Language:
Czech , English , Bulgarian , French , Italian , Lithuanian , Hungarian , German , Norwegian , Polish , Russian , Belarusian , Slovenian , Serbian , Spanish , Swedish , and Ukrainian
Rights:
open access and Rights holder: Archiv Jana Patočky, z.s.
Creator:
Jan Patočka
Publisher:
Pp. 89–131. Article. [The essay also includes the text To platí též..., see 1988/25H.]
Type:
Text
Subject:
1975 , 1979/25 , 1981/6 , 1981/7 , 1988/25H , 1988/28 , 1988/31 , 1988/32 , 1988/34 , 1994/7 , 1996/4 , 1996/7 , 1998/3 , 1999/8 , 2 , 2001/9 , 2002/21 , 2002/7 , 2006/1 , 2007/1 , 2008/3 , bg , cs , de , en , es , fr , fulltext , hu , it , jp , lt , no , pl , ru , SS-3/PD-III , sv , uk , and v
Language:
Czech , English , Bulgarian , French , Italian , Lithuanian , Hungarian , German , Norwegian , Polish , Russian , Spanish , Swedish , and Ukrainian
Rights:
open access and Rights holder: Archiv Jana Patočky, z.s.
Creator:
Savary, Agata , Ramisch, Carlos , Guillaume, Bruno , Hawwari, Abdelati , Walsh, Abigail , Fotopoulou, Aggeliki , Bielinskienė, Agnė , Estarrona, Ainara , Gatt, Albert , Butler, Alexandra , Rademaker, Alexandre , Maldonado, Alfredo , Villavicencio, Aline , Farrugia, Alison , Muscat, Amanda , Gatt, Anabelle , Antić, Anđela , De Santis, Anna , Raffone, Annalisa , Riccio, Anna , Pascucci, Antonio , Gurrutxaga, Antton , Bhatia, Archna , Vaidya, Ashwini , Miral, Ayşenur , QasemiZadeh, Behrang , Priego Sanchez, Belem , Griciūtė, Bernadeta , Erden, Berna , Parra Escartín, Carla , Herrero, Carlos , Carlino, Carola , Pasquer, Caroline , Liebeskind, Chaya , Wang, Chenweng , Ben Khelil, Chérifa , Bonial, Claire , Somers, Clarissa , Aceta, Cristina , Krstev, Cvetana , Bejček, Eduard , Lindqvist, Ellinor , Erenmalm, Elsa , Palka-Binkiewicz, Emilia , Rimkute, Erika , Petterson, Eva , Cap, Fabienne , Hu, Fangyuan , Sangati, Federico , Wick Pedro, Gabriela , Speranza, Giulia , Jagfeld, Glorianna , Blagus, Goranka , Berk, Gözde , Attard, Greta , Eryiğit, Gülşen , Finnveden, Gustav , Martínez Alonso, Héctor , de Medeiros Caseli, Helena , Elyovich, Hevi , Xu, Hongzhi , Xiao, Huangyang , Miranda, Isaac , Jaknić, Isidora , El Maarouf, Ismail , Aduriz, Itziar , Gonzalez, Itziar , Matas, Ivana , Stoyanova, Ivelina , Jazbec, Ivo-Pavao , Busuttil, Jael , Waszczuk, Jakub , Findlay, Jamie , Bonnici, Janice , Šnajder, Jan , Antoine, Jean-Yves , Foster, Jennifer , Chen, Jia , Nivre, Joakim , Monti, Johanna , McCrae, John , Kovalevskaitė, Jolanta , Jain, Kanishka , Simkó, Katalin , Yu, Ke , Azzopardi, Kirsty , Adalı, Kübra , Uria, Larraitz , Zilio, Leonardo , Boizou, Loïc , van der Plas, Lonneke , Galea, Luke , Sarlak, Mahtab , Buljan, Maja , Cherchi, Manuela , Tanti, Marc , Di Buono, Maria Pia , Todorova, Maria , Candito, Marie , Constant, Matthieu , Shamsfard, Mehrnoush , Jiang, Menghan , Boz, Mert , Spagnol, Michael , Onofrei, Mihaela , Li, Minli , Elbadrashiny, Mohamed , Diab, Mona , Rizea, 
Monica-Mihaela , Hadj Mohamed, Najet , Theoxari, Natasa , Schneider, Nathan , Tabone, Nicole , Ljubešić, Nikola , Vale, Oto , Cook, Paul , Yan, Peiyi , Gantar, Polona , Ehren, Rafael , Fabri, Ray , Ibrahim, Rehab , Ramisch, Renata , Walles, Rinat , Wilkens, Rodrigo , Urizar, Ruben , Sun, Ruilong , Malka, Ruth , Galea, Sara Anne , Stymne, Sara , Louizou, Sevasti , Hu, Sha , Taslimipoor, Shiva , Ratori, Shraddha , Srivastava, Shubham , Cordeiro, Silvio Ricardo , Krek, Simon , Liu, Siyuan , Zeng, Si , Yu, Songping , Arhar Holdt, Špela , Markantonatou, Stella , Papadelli, Stella , Leseva, Svetlozara , Kuzman, Taja , Kavčič, Teja , Lynn, Teresa , Lichte, Timm , Pickard, Thomas , Dimitrova, Tsvetana , Yih, Tsy , Güngör, Tunga , Dinç, Tutkum , Iñurrieta, Uxoa , Tajalli, Vahide , Stefanova, Valentina , Caruso, Valeria , Puri, Vandana , Foufi, Vassiliki , Barbu Mititelu, Verginica , Vincze, Veronika , Kovács, Viktória , Shukla, Vishakha , Giouli, Voula , Ge, Xiaomin , Ha-Cohen Kerner, Yaakov , Öztürk, Yağmur , Yarandi, Yalda , Parmentier, Yannick , Zhang, Yongchen , Zhao, Yun , Urešová, Zdeňka , Yirmibeşoğlu, Zeynep , Qin, Zhenzhen , Stank , Cristescu, Mihaela , Zgreabăn, Bianca-Mădălina , Bărbulescu, Elena-Andreea , and Stanković, Ranka
Publisher:
PARSEME
Type:
text and corpus
Subject:
multiword expressions , verbal multiword expressions , light verb construction , verb-particle constructions , inherently reflexive verbs , verbal idioms , and multi-verb constructions
Language:
Arabic , Bulgarian , Czech , German , Modern Greek (1453-) , English , Spanish , Basque , Persian , French , Irish , Hebrew , Hindi , Croatian , Hungarian , Lithuanian , Italian , Maltese , Polish , Portuguese , Romanian , Slovenian , Serbian , Swedish , Turkish , and Chinese
Description:
This multilingual resource contains corpora in which verbal multiword expressions (VMWEs) have been manually annotated. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). This is the first release of the corpora without an associated shared task. The previous version (1.2) was associated with the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). The data covers 26 languages, combining the corpora from all three previous editions (1.0, 1.1 and 1.2). VMWEs were annotated according to the universal guidelines. The corpora are provided in the cupt format, inspired by the CoNLL-U format. Morphological and syntactic information, including parts of speech, lemmas, morphological features and/or syntactic dependencies, is also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). All corpora are split into training, development and test data, following the splitting strategy adopted for the PARSEME Shared Task 1.2. The annotation guidelines are available online at https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3 and the .cupt format is detailed at https://multiword.sourceforge.net/cupt-format/
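To give a feel for the cupt format mentioned above, here is a minimal Python sketch that collects the VMWEs of one sentence. It assumes the layout as commonly described: tab-separated CoNLL-U columns followed by a final PARSEME:MWE column, where `*` means no VMWE, `_` means unannotated, the first token of an expression carries `id:category`, later tokens carry just the `id`, and `;` separates several expressions on one token. The sample sentence and the function name `extract_vmwes` are invented for illustration; the linked format description is authoritative.

```python
from collections import defaultdict

# Invented sample sentence in a cupt-like layout (tab-separated columns).
SAMPLE = """\
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE
# text = She gave up .
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\t*
2\tgave\tgive\tVERB\t_\t_\t0\troot\t_\t_\t1:VPC.full
3\tup\tup\tADP\t_\t_\t2\tcompound:prt\t_\t_\t1
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\t*
"""


def extract_vmwes(sentence_text: str) -> dict:
    """Map each VMWE id to (category, list of word forms) for one sentence."""
    forms = defaultdict(list)
    categories = {}
    for line in sentence_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        columns = line.split("\t")
        form, mwe_field = columns[1], columns[-1]
        if mwe_field in ("*", "_"):
            continue  # token is not part of any annotated VMWE
        for part in mwe_field.split(";"):
            mwe_id, _, category = part.partition(":")
            forms[mwe_id].append(form)
            if category:  # only the first token of a VMWE names its category
                categories[mwe_id] = category
    return {i: (categories.get(i), forms[i]) for i in forms}


if __name__ == "__main__":
    print(extract_vmwes(SAMPLE))
```

The sketch reads the last column by position rather than by name; a more careful reader would take the column order from the `global.columns` header line.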
Rights:
PARSEME Corpora v. 1.3 - Licence Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.3 , and PUB
Creator:
Rosa, Rudolf
Publisher:
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:
text and corpus
Subject:
Wikipedia , text corpora , and monolingual corpus
Language:
Abkhazian , Achinese , Adyghe , Afrikaans , Akan , Tosk Albanian , Amharic , Old English (ca. 450-1100) , Arabic , Official Aramaic (700-300 BCE) , Aragonese , Egyptian Arabic , Assamese , Asturian , Atikamekw , Avaric , Aymara , South Azerbaijani , Azerbaijani , Bashkir , Bambara , Bavarian , Central Bikol , Belarusian , Bengali , Bislama , Banjar , Tibetan , Bosnian , Bishnupriya , Breton , Buginese , Bulgarian , Russia Buriat , Catalan , Min Dong Chinese , Cebuano , Czech , Chamorro , Chechen , Cherokee , Church Slavic , Chuvash , Cheyenne , Central Kurdish , Cornish , Corsican , Cree , Crimean Tatar , Kashubian , Welsh , Danish , German , Dinka , Dimli (individual language) , Dhivehi , Lower Sorbian , Dzongkha , Modern Greek (1453-) , English , Esperanto , Estonian , Basque , Ewe , Extremaduran , Faroese , Persian , Fijian , Finnish , French , Arpitan , Northern Frisian , Western Frisian , Fulah , Friulian , Gagauz , Gan Chinese , Scottish Gaelic , Irish , Galician , Gilaki , Manx , Goan Konkani , Gothic , Guarani , Gujarati , Hakka Chinese , Haitian , Hausa , Hawaiian , Serbo-Croatian , Hebrew , Herero , Fiji Hindi , Hindi , Hiri Motu , Croatian , Upper Sorbian , Hungarian , Armenian , Igbo , Ido , Inuktitut , Interlingue , Iloko , Interlingua (International Auxiliary Language Association) , Indonesian , Inupiaq , Icelandic , Italian , Jamaican Creole English , Javanese , Lojban , Japanese , Kara-Kalpak , Kabyle , Kalaallisut , Kannada , Kashmiri , Georgian , Kanuri , Kazakh , Kabardian , Kabiyè , Khmer , Kikuyu , Kinyarwanda , Kirghiz , Komi-Permyak , Komi , Kongo , Korean , Karachay-Balkar , Kölsch , Kurdish , Ladino , Lao , Latin , Latvian , Lak , Lezghian , Ligurian , Limburgan , Lingala , Lithuanian , Lombard , Northern Luri , Latgalian , Luxembourgish , Ganda , Literary Chinese , Marshallese , Maithili , Malayalam , Marathi , Moksha , Eastern Mari , Minangkabau , Macedonian , Malagasy , Maltese , Mongolian , Maori , Western Mari , Malay (macrolanguage) , 
Creek , Mirandese , Burmese , Erzya , Mazanderani , Min Nan Chinese , Neapolitan , Nauru , Navajo , Ndonga , Low German , Nepali (macrolanguage) , Newari , Dutch , Norwegian Nynorsk , Norwegian , Novial , Pedi , Nyanja , Occitan (post 1500) , Livvi , Oriya (macrolanguage) , Oromo , Ossetian , Pangasinan , Pampanga , Panjabi , Papiamento , Picard , Pennsylvania German , Pfaelzisch , Pitcairn-Norfolk , Pali , Piemontese , Western Panjabi , Pontic , Polish , Portuguese , Pushto , Quechua , Vlax Romani , Romansh , Romanian , Rusyn , Rundi , Macedo-Romanian , Russian , Sango , Yakut , Sanskrit , Sicilian , Scots , Samogitian , Sinhala , Slovak , Slovenian , Northern Sami , Samoan , Shona , Sindhi , Somali , Southern Sotho , Spanish , Albanian , Sardinian , Sranan Tongo , Serbian , Swati , Saterfriesisch , Sundanese , Swahili (macrolanguage) , Swedish , Silesian , Tahitian , Tamil , Tatar , Tulu , Telugu , Tama (Colombia) , Tetum , Tajik , Tagalog , Thai , Tigrinya , Tonga (Tonga Islands) , Tok Pisin , Tswana , Tsonga , Turkmen , Tumbuka , Turkish , Twi , Tuvinian , Udmurt , Uighur , Ukrainian , Urdu , Uzbek , Venetian , Venda , Veps , Vietnamese , Vlaams , Volapük , Võro , Waray (Philippines) , Walloon , Wolof , Wu Chinese , Kalmyk , Xhosa , Mingrelian , Yiddish , Yoruba , Yue Chinese , Zeeuws , Zhuang , Chinese , Zulu , and Dotyali
Description:
Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.
The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).
For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].
The script that can be used to get a new version of the data is included. Note, however, that Wikipedia limits the download speed when many dumps are downloaded, so fetching all of them takes a few days (one or a few can be downloaded quickly).
Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.
The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plain-text output [https://github.com/ptakopysk/wikiextractor].
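As an illustration of what fetching a dump involves, the following Python sketch builds the download address for one Wikipedia. It is not the script shipped with the data. It assumes the usual layout of the public dump server, where the "latest" article dump of a wiki sits under a directory named after its database name (wiki code plus `wiki`, with hyphens replaced by underscores); the function name `dump_url` is hypothetical.

```python
DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(wiki_code: str) -> str:
    """Build the assumed URL of the latest article dump for one Wikipedia.

    `wiki_code` is the code of the Wikipedia, e.g. "cs" or "simple".
    """
    # Assumption: database names use underscores where wiki codes use hyphens.
    db_name = wiki_code.replace("-", "_") + "wiki"
    return f"{DUMP_HOST}/{db_name}/latest/{db_name}-latest-pages-articles.xml.bz2"


if __name__ == "__main__":
    for code in ("cs", "simple"):
        print(dump_url(code))
```

Because the server throttles bulk downloads, a real script would fetch these addresses one at a time and tolerate individual failures.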
Rights:
Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) , http://creativecommons.org/licenses/by-sa/3.0/ , and PUB
Creator:
Jan Patočka
Publisher:
Pp. 46–88. Article.
Type:
Text
Subject:
1975 , 1979/25 , 1981/6 , 1981/7 , 1988/28 , 1988/31 , 1988/32 , 1988/34 , 1994/7 , 1996/4 , 1996/7 , 1998/3 , 1999/8 , 2 , 2001/9 , 2002/21 , 2002/6 , 2006/1 , 2007/1 , 2008/3 , bg , cs , de , en , es , fr , fulltext , hu , it , lt , no , pl , ru , SS-3/PD-III , sv , uk , and v
Language:
Czech , English , Bulgarian , French , Italian , Lithuanian , Hungarian , German , Norwegian , Polish , Russian , Spanish , Swedish , and Ukrainian
Rights:
open access and Rights holder: Archiv Jana Patočky, z.s.
Creator:
Jan Patočka
Publisher:
Pp. 1–45. Article.
Type:
Text
Subject:
1975 , 1979/25 , 1981/6 , 1981/7 , 1988/28 , 1988/31 , 1988/32 , 1988/34 , 1994/7 , 1996/4 , 1996/7 , 1998/3 , 1999/8 , 2001/9 , 2002/1 , 2002/21 , 2002/5 , 2006/1 , 2007/1 , 2008/3 , bg , cs , de , en , es , fr , fulltext , hu , it , lt , no , pl , ru , SS-3/PD-III , sv , and uk
Language:
Czech , English , Bulgarian , French , Italian , Lithuanian , Hungarian , German , Norwegian , Polish , Russian , Spanish , Swedish , and Ukrainian
Rights:
open access and Rights holder: Archiv Jana Patočky, z.s.