« Previous |
1 - 10 of 78
|
Next »
Number of results to display per page
Search Results
2. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.0)
- Creator:
- Savary, Agata, Ramisch, Carlos, Cordeiro, Silvio Ricardo, Sangati, Federico, Vincze, Veronika, QasemiZadeh, Behrang, Candito, Marie, Cap, Fabienne, Giouli, Voula, Stoyanova, Ivelina, Doucet, Antoine, Adalı, Kübra, Barbu Mititelu, Verginica, Bejček, Eduard, El Maarouf, Ismail, Eryiğit, Gülşen, Galea, Luke, Ha-Cohen Kerner, Yaakov, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, Kovalevskaitė, Jolanta, Krek, Simon, van der Plas, Lonneke, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Attard, Greta, Azzopardi, Kirsty, Boizou, Loic, Bonnici, Janice, Boz, Mert, Bumbulienė, Ieva, Busuttil, Jael, Caruso, Valeria, Cherchi, Manuela, Constant, Matthieu, Czerepowicka, Monika, De Santis, Anna, Dimitrova, Tsvetana, Dinç, Tutkum, Elyovich, Hevi, Fabri, Ray, Farrugia, Alison, Findlay, Jamie, Fotopoulou, Aggeliki, Foufi, Vassiliki, Galea, Sara Anne, Gantar, Polona, Gatt, Albert, Gatt, Anabelle, Herrero, Carlos, Iñurrieta, Uxoa, Jagfeld, Glorianna, Hnátková, Milena, Ionescu, Mihaela, Klyueva, Natalia, Koeva, Svetla, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Louisou, Sevi, Lynn, Teresa, Malka, Ruth, Martínez Alonso, Héctor, McCrae, John, de Medeiros Caseli, Helena, Miral, Ayşenur, Muscat, Amanda, Nivre, Joakim, Oakes, Michael, Onofrei, Mihaela, Parmentier, Yannick, Pasquer, Caroline, Pia di Buono, Maria, Priego Sanchez, Belem, Raffone, Annalisa, Ramisch, Renata, Rimkutė, Erika, Rizea, Monica-Mihaela, Simkó, Katalin, Spagnol, Michael, Stefanova, Valentina, Stymne, Sara, Sulubacak, Umut, Tabone, Nicole, Tanti, Marc, Todorova, Maria, Urešová, Zdenka, Villavicencio, Aline, and Zilio, Leonardo
- Publisher:
- PARSEME
- Type:
- text and corpus
- Subject:
- Multiword expressions, verbal multiword expressions, idioms, light-verb constructions, verb-particle constructions, and inherently reflexive verbs
- Language:
- Bulgarian, Czech, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovenian, Swedish, and Turkish
- Description:
- The PARSEME shared task aims at identifying verbal MWEs in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CONLL-U format. For most languages, paired files in the CONLL-U format - not necessarily using UD tagsets - containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training and test data, tools and the universal guidelines file.
- Rights:
- PARSEME Shared Task Data (v. 1.0) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.0, and PUB
3. Bible olomoucká 1417 :
- Type:
- text, studie, and faksimile
- Subject:
- Bible. Biblistika, Bible olomoucká, české země 1306-1419, teologie, ikonografie, zbožnost, hagiografie, and rukopisy
- Language:
- Czech and French
- Description:
- Francouzský text tištěn zvratmo. and Doplněk obsahující text listu ztraceného z bible Olomoucké za listem XXIII. od 10. verše 49. kapitoly I. knihy Mojžíšovy až po 16. verš 1. kapit. II. knihy Mojžíšovy
- Rights:
- unknown
4. Bomby na Zlín =
- Type:
- text and brožury
- Subject:
- Dějiny Česka a Slovenska, válka druhá světová (1939-1945), nálety, bombardování měst, Československo 1938-1945, vojenské operace, války, bitvy, and města, obce
- Language:
- Czech, English, French, Russian, and Spanish
- Description:
- Souběžný anglický, francouzský, ruský a španělský text
- Rights:
- unknown
5. C4Corpus (CC BY-NC part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
6. C4Corpus (CC BY-NC-ND part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB
7. C4Corpus (CC BY-NC-SA part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
8. C4Corpus (CC BY-ND part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0), http://creativecommons.org/licenses/by-nc/4.0/, and PUB
9. C4Corpus (CC BY-SA part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
10. C4Corpus (CC-BY part)
- Creator:
- Gurevych, Iryna, Habernal, Ivan, and Zayed, Omnia
- Publisher:
- Technische Universität Darmstadt
- Type:
- text and corpus
- Subject:
- CommonCrawl, Creative Commons, Web corpus, and Amazon Web Services
- Language:
- Afrikaans, Arabic, Bengali, Bulgarian, Czech, Danish, German, Modern Greek (1453-), English, Estonian, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Marathi, Macedonian, Nepali (macrolanguage), Dutch, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Spanish, Albanian, Swahili (macrolanguage), Swedish, Tamil, Telugu, Tagalog, Thai, Turkish, Ukrainian, Undetermined, Urdu, Vietnamese, and Chinese
- Description:
- A large web corpus (over 10 billion tokens) licensed under CreativeCommons license family in 50+ languages that has been extracted from CommonCrawl, the largest publicly available general Web crawl to date with about 2 billion crawled URLs.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB