Search Results
Creator:
Savary, Agata , Ramisch, Carlos , Cordeiro, Silvio Ricardo , Sangati, Federico , Vincze, Veronika , QasemiZadeh, Behrang , Candito, Marie , Cap, Fabienne , Giouli, Voula , Stoyanova, Ivelina , Doucet, Antoine , Adalı, Kübra , Barbu Mititelu, Verginica , Bejček, Eduard , El Maarouf, Ismail , Eryiğit, Gülşen , Galea, Luke , Ha-Cohen Kerner, Yaakov , Liebeskind, Chaya , Monti, Johanna , Parra Escartín, Carla , Kovalevskaitė, Jolanta , Krek, Simon , van der Plas, Lonneke , Aceta, Cristina , Aduriz, Itziar , Antoine, Jean-Yves , Attard, Greta , Azzopardi, Kirsty , Boizou, Loic , Bonnici, Janice , Boz, Mert , Bumbulienė, Ieva , Busuttil, Jael , Caruso, Valeria , Cherchi, Manuela , Constant, Matthieu , Czerepowicka, Monika , De Santis, Anna , Dimitrova, Tsvetana , Dinç, Tutkum , Elyovich, Hevi , Fabri, Ray , Farrugia, Alison , Findlay, Jamie , Fotopoulou, Aggeliki , Foufi, Vassiliki , Galea, Sara Anne , Gantar, Polona , Gatt, Albert , Gatt, Anabelle , Herrero, Carlos , Iñurrieta, Uxoa , Jagfeld, Glorianna , Hnátková, Milena , Ionescu, Mihaela , Klyueva, Natalia , Koeva, Svetla , Kovács, Viktória , Kuzman, Taja , Leseva, Svetlozara , Louisou, Sevi , Lynn, Teresa , Malka, Ruth , Martínez Alonso, Héctor , McCrae, John , de Medeiros Caseli, Helena , Miral, Ayşenur , Muscat, Amanda , Nivre, Joakim , Oakes, Michael , Onofrei, Mihaela , Parmentier, Yannick , Pasquer, Caroline , Pia di Buono, Maria , Priego Sanchez, Belem , Raffone, Annalisa , Ramisch, Renata , Rimkutė, Erika , Rizea, Monica-Mihaela , Simkó, Katalin , Spagnol, Michael , Stefanova, Valentina , Stymne, Sara , Sulubacak, Umut , Tabone, Nicole , Tanti, Marc , Todorova, Maria , Urešová, Zdenka , Villavicencio, Aline , and Zilio, Leonardo
Publisher:
PARSEME
Type:
text and corpus
Subject:
Multiword expressions , verbal multiword expressions , idioms , light-verb constructions , verb-particle constructions , and inherently reflexive verbs
Language:
Bulgarian , Czech , German , Modern Greek (1453-) , Spanish , Persian , French , Hebrew , Hungarian , Italian , Lithuanian , Maltese , Polish , Portuguese , Romanian , Slovenian , Swedish , and Turkish
Description:
The PARSEME shared task aims to identify verbal multiword expressions (VMWEs) in running text. VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CoNLL-U format.
For most languages, paired files in the CoNLL-U format (not necessarily using UD tagsets) containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, this information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
This item contains training and test data, tools and the universal guidelines file.
Rights:
PARSEME Shared Task Data (v. 1.0) Agreement , https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.0 , and PUB
Publisher:
Center of Computational Linguistics, Vytautas Magnus University
Format:
application/xml
Type:
corpus
Language:
Czech , English , and Lithuanian
Description:
A collection of parallel corpora: English-Lithuanian (2 million words), Lithuanian-English (0.06 million words), Czech-Lithuanian (0.8 million words), and Lithuanian-Czech (0.02 million words). All the corpora are searchable online through a single interface at http://donelaitis.vdu.lt/main_en.php?id=4&nr=1_2. The corpus is still being updated with new texts.
Rights:
Not specified
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) , http://creativecommons.org/licenses/by-nc/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) , http://creativecommons.org/licenses/by-nc-nd/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) , http://creativecommons.org/licenses/by-nc-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Malayalam , Macedonian , Dutch , Norwegian , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) , http://creativecommons.org/licenses/by-nd/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) , http://creativecommons.org/licenses/by-sa/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bengali , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Gujarati , Hebrew , Hindi , Croatian , Hungarian , Indonesian , Italian , Japanese , Kannada , Korean , Latvian , Lithuanian , Malayalam , Marathi , Macedonian , Nepali (macrolanguage) , Dutch , Norwegian , Panjabi , Polish , Portuguese , Romanian , Russian , Slovak , Slovenian , Somali , Spanish , Albanian , Swahili (macrolanguage) , Swedish , Tamil , Telugu , Tagalog , Thai , Turkish , Ukrainian , Undetermined , Urdu , Vietnamese , and Chinese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Creative Commons - Attribution 4.0 International (CC BY 4.0) , http://creativecommons.org/licenses/by/4.0/ , and PUB
Creator:
Gurevych, Iryna , Habernal, Ivan , and Zayed, Omnia
Publisher:
Technische Universität Darmstadt
Type:
text and corpus
Subject:
CommonCrawl , Creative Commons , Web corpus , and Amazon Web Services
Language:
Afrikaans , Arabic , Bulgarian , Czech , Danish , German , Modern Greek (1453-) , English , Estonian , Persian , Finnish , French , Croatian , Hungarian , Indonesian , Italian , Japanese , Korean , Latvian , Lithuanian , Dutch , Norwegian , Polish , Portuguese , Russian , Slovenian , Somali , Spanish , Swahili (macrolanguage) , Swedish , Tagalog , Thai , Turkish , Ukrainian , Undetermined , and Vietnamese
Description:
A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Rights:
Public Domain Mark (PD) , http://creativecommons.org/publicdomain/mark/1.0/ , and PUB
Creator:
Neporožnia, Nadija
Type:
text and study
Subject:
History of civilization. Cultural history , Komenský, Jan Amos , Vitold , Czech-Lithuanian relations , Grand Duchy of Lithuania , legislation , Hussitism , national history , Lithuanian rulers , thematic overviews , world history of the Middle Ages (to 1492) , and Lithuania
Language:
Lithuanian and Czech
Rights:
unknown