Harvested from: LINDAT/CLARIAH-CZ repository / Type: corpus - LINDAT/CLARIAH-CZ Catalog Search Results

681. Treebanks for Unified Taxonomy of Deep Syntactic Relations

Creator:: Droganova, Kira and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank and semantic roles
Language:: Czech, Spanish, Catalan, and Finnish
Description:: The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms.
Rights:: Licence Universal Dependencies v2.14, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.14, and PUB

682. UFAL Parallel Corpus of North Levantine 1.0

Creator:: Sellat, Hashem, Saleh, Shadi, Krubiński, Mateusz, Pospíšil, Adam, Zemánek, Petr, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: multilingual, machine translation, parallel corpus, north levantine, and corpus
Language:: North Levantine Arabic, English, French, Spanish, Standard Arabic, Modern Greek (1453-), and German
Description:: This is the first release of the UFAL Parallel Corpus of North Levantine, compiled by the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University within the Welcome project (https://welcome-h2020.eu/). The corpus consists of 120,600 multiparallel sentences in English, French, German, Greek, Spanish, and Standard Arabic selected from the OpenSubtitles2018 corpus [1] and manually translated into the North Levantine Arabic language. The corpus was created for the purpose of training machine translation for North Levantine and the other languages.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

683. UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 1

Creator:: Zemánek, Petr, Pospíšil, Adam, Sellat, Hashem, Krubiński, Mateusz, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, speech recognition, speech-to-text translation, machine translation, multilingual, Arabic, Arabic Corpus, and north levantine
Language:: North Levantine Arabic and English
Description:: The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female; by age group: 1 aged 15-20, 7 aged 20-30, 4 aged 30-40, and 1 aged 40-50). The recordings contain both monologues and dialogues on topics of everyday life (health, education, family life, sports, culture) as well as information on both the host countries (living abroad) and the country of origin (Syria traditions, education system, etc.). Both types are spontaneous: the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect. The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID. The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav". The data provided in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
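
The file layout described in this record (parallel .apc and .eng files with one segment per line, plus offsets and durations given as seconds and milliseconds for 48 kHz mono audio) could be consumed roughly as in the sketch below. This is only an illustration under stated assumptions: the record does not give the key names used in the .yaml file, so the sketch takes the offset and duration as plain strings, and the function names `pair_segments` and `to_sample_range` are hypothetical, not part of the corpus distribution.

```python
from decimal import Decimal

# Stated in the record: audio is shared as 48 kHz mono .wav files.
SAMPLE_RATE = 48_000


def pair_segments(apc_lines, eng_lines):
    """Pair transcription (.apc) and translation (.eng) lines.

    Both inputs are iterables of text lines, one segment per line,
    as the record describes. Hypothetical helper, not a corpus API.
    """
    apc = [line.rstrip("\n") for line in apc_lines]
    eng = [line.rstrip("\n") for line in eng_lines]
    if len(apc) != len(eng):
        raise ValueError(f"line count mismatch: {len(apc)} vs {len(eng)}")
    return list(zip(apc, eng))


def to_sample_range(offset, duration, sample_rate=SAMPLE_RATE):
    """Convert 'seconds.milliseconds' strings to a (start, end) sample range.

    Decimal is used so that values such as "1.500" are not affected
    by binary floating-point rounding.
    """
    start = int(Decimal(offset) * sample_rate)
    end = int((Decimal(offset) + Decimal(duration)) * sample_rate)
    return start, end


# Usage with made-up values (not taken from the corpus):
pairs = pair_segments(["segment one\n", "segment two\n"],
                      ["translation one\n", "translation two\n"])
start, end = to_sample_range("1.500", "0.250")  # (72000, 84000)
```

With real data, the two line lists would come from opening the .apc and .eng files with UTF-8 encoding, and the offset and duration strings from whatever fields the accompanying .yaml file actually uses.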

684. UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2

Creator:: Zemánek, Petr, Pospíšil, Adam, Sellat, Hashem, Krubiński, Mateusz, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, speech recognition, speech-to-text translation, machine translation, multilingual, Arabic, Arabic Corpus, and north levantine
Language:: North Levantine Arabic and English
Description:: The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female; by age group: 1 aged 15-20, 7 aged 20-30, 4 aged 30-40, and 1 aged 40-50). The recordings contain both monologues and dialogues on topics of everyday life (health, education, family life, sports, culture) as well as information on both the host countries (living abroad) and the country of origin (Syria traditions, education system, etc.). Both types are spontaneous: the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect. The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID. The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav". The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

685. UMC 0.1: Czech-Russian-English Multilingual Corpus

Creator:: Klyueva, Natalia and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: multi-language corpus
Language:: Czech
Description:: UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian, and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation. All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes. Project: FP6-IST-5-034291-STP (EuroMatrix).
Rights:: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0), http://creativecommons.org/licenses/by-nc-nd/3.0/, and PUB

686. Uniform Meaning Representation

Creator:: Bonn, Julia, Ching-wen, Chen, Cowell, James Andrew, Croft, William, Denk, Lukas, Hajič, Jan, Lai, Kenneth, Palmer, Martha, Palmer, Alexis, Pustejovsky, James, Sun, Haibo, Vallejos Yopán, Rosa, Van Gysel, Jens, Vigus, Meagan, Xue, Nianwen, and Zhao, Jin
Publisher:: UMR Consortium
Type:: text and corpus
Subject:: uniform meaning representation and semantics
Language:: Chinese, Arapaho, English, Cocama-Cocamilla, Navajo, and Sanapaná
Description:: The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily based on Abstract Meaning Representation (AMR), an annotation framework initially designed for English, but also draws from other meaning representations. UMR extends AMR to other languages, particularly morphologically complex, low-resource languages. UMR also adds features to AMR that are critical to semantic interpretation and enhances AMR by proposing a companion document-level representation that captures linguistic phenomena such as coreference as well as temporal and modal dependencies that potentially go beyond sentence boundaries. UMR is intended to be scalable, learnable, and cross-linguistically plausible. It is designed to support both lexical and logical inference.
Rights:: Uniform Meaning Representation v1.0 License Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/license-umr-1.0, and PUB

687. Universal Dependencies 1.0

Creator:: Nivre, Joakim, Bosco, Cristina, Choi, Jinho, de Marneffe, Marie-Catherine, Dozat, Timothy, Farkas, Richárd, Foster, Jennifer, Ginter, Filip, Goldberg, Yoav, Hajič, Jan, Kanerva, Jenna, Laippala, Veronika, Lenci, Alessandro, Lynn, Teresa, Manning, Christopher, McDonald, Ryan, Missilä, Anna, Montemagni, Simonetta, Petrov, Slav, Pyysalo, Sampo, Silveira, Natalia, Simi, Maria, Smith, Aaron, Tsarfaty, Reut, Vincze, Veronika, and Zeman, Daniel
Publisher:: Universal Dependencies Consortium
Type:: text and corpus
Subject:: treebank, dependency, syntax, morphology, harmonized annotation, interset, universal tagset, and stanford dependencies
Language:: Czech, German, English, Spanish, Finnish, French, Irish, Italian, Swedish, and Hungarian
Description:: Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
Rights:: Universal Dependencies 1.0 License Set, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-1.0, and PUB

688. Universal Dependencies 1.1

689. Universal Dependencies 1.2

690. Universal Dependencies 1.3
