Rights: PUB - LINDAT/CLARIAH-CZ Catalog Search Results


71. STYX 1.0

Creator:: Hladká, Barbora, Kučera, Ondřej, and Kuchyňová, Karolína
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: annotated corpus, syntax, and sentence diagramming
Language:: Czech
Description:: STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). Sentences were included in STYX if they were suitable for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether, the STYX 1.0 corpus contains 11 655 sentences. Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

72. STYX 1.0 (2017-10-03)

Creator:: Hladká, Barbora, Kučera, Ondřej, and Kuchyňová, Karolína
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: annotated corpus, syntax, and sentence diagramming
Language:: Czech
Description:: STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency Treebank (PDT). Sentences were included in STYX if they were suitable for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether, the STYX 1.0 corpus contains 11 655 sentences. Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F).
Rights:: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0), http://creativecommons.org/licenses/by-nc-nd/4.0/, and PUB

73. The Use of Machine Translation by Ukrainian War Refugees in Czechia

Creator:: Agapova, Anna and Špačková, Stanislava
Publisher:: Oxford University Press
Type:: TEXT and Spreadsheet
Subject:: machine translation, migration, Ukrainian refugees, Russo-Ukrainian war, and Czech Republic
Language:: Ukrainian and Russian
Description:: Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns. Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for the Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey. Also, in this version of the dataset, the textual answers to the question 1.6 "How many months have you been to this country?" were replaced by numbers, so that we could separate the data of respondents who arrived in the Czech Republic in February 2022 or later from the other data (0 for those staying in Czechia before February 2022, 1 for those staying in Czechia since February 2022 or later, 2 for those staying in other countries).
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
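
The numeric recoding of questions 1.5 and 1.6 described above lends itself to a simple filter. The following is a minimal sketch, assuming the spreadsheet has been exported to rows keyed by column name; the column names `q1_5_country` and `q1_6_arrival` and the sample rows are hypothetical, and the actual headers in the dataset may differ.

```python
# Hypothetical column names; the real spreadsheet headers may differ.
COUNTRY_COL = "q1_5_country"  # 1 = Czech Republic, 2 = other countries
ARRIVAL_COL = "q1_6_arrival"  # 0 = in Czechia before Feb 2022, 1 = since Feb 2022 or later, 2 = other countries


def select_target_respondents(rows):
    """Keep respondents located in Czechia who arrived in February 2022 or later."""
    return [
        row
        for row in rows
        if int(row[COUNTRY_COL]) == 1 and int(row[ARRIVAL_COL]) == 1
    ]


# Made-up rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"respondent": "a", COUNTRY_COL: "1", ARRIVAL_COL: "1"},
    {"respondent": "b", COUNTRY_COL: "1", ARRIVAL_COL: "0"},
    {"respondent": "c", COUNTRY_COL: "2", ARRIVAL_COL: "2"},
]
selected = select_target_respondents(sample_rows)
```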

74. TMODS:ENG-CZE -- query translation

Creator:: Tamchyna, Aleš and Bojar, Ondřej
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: suiteOfTools and toolService
Subject:: machine translation and query translation
Language:: Czech and English
Description:: AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized, pruned to allow for real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a full service for Czech->English translation which can be easily utilized as a component in a larger software solution. (The required tools are freely available and an installation guide is included in the package.) The translation models were trained on CzEng 1.0 corpus and Europarl. Monolingual data for LM estimation additionally contains WMT news crawls until 2013.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB

75. Treebanks for Unified Taxonomy of Deep Syntactic Relations

Creator:: Droganova, Kira and Zeman, Daniel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: treebank and semantic roles
Language:: Czech, Spanish, Catalan, and Finnish
Description:: The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024. Four languages are included in this release. English PropBank is omitted due to its license terms.
Rights:: Licence Universal Dependencies v2.14, https://lindat.mff.cuni.cz/repository/xmlui/page/license-ud-2.14, and PUB

76. UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 1

Creator:: Zemánek, Petr, Pospíšil, Adam, Sellat, Hashem, Krubiński, Mateusz, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, speech recognition, speech-to-text translation, machine translation, multilingual, Arabic, Arabic Corpus, and north levantine
Language:: North Levantine Arabic and English
Description:: The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50). The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous: the participants were given only the general subject and either talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect. The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID. The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav". The data provided in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
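
To illustrate the "%S.%03d" offset and duration format mentioned in the description, here is a minimal sketch that converts such values to milliseconds and to sample indices in a 48 kHz mono file. It assumes the values are read from the .yaml file as strings; the key names of the .yaml file are not given in the description, so none are used here, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 48_000  # sampling rate stated in the corpus description


def to_milliseconds(value: str) -> int:
    """Convert a "%S.%03d" string such as "12.345" to integer milliseconds."""
    seconds, _, millis = value.partition(".")
    if not seconds.isdigit() or len(millis) != 3 or not millis.isdigit():
        raise ValueError(f"not in seconds.milliseconds format: {value!r}")
    return int(seconds) * 1000 + int(millis)


def sample_range(offset: str, duration: str):
    """Return (start, end) sample indices of a segment in a 48 kHz mono file."""
    start_ms = to_milliseconds(offset)
    end_ms = start_ms + to_milliseconds(duration)
    # Integer arithmetic avoids floating-point rounding of the timestamps.
    return start_ms * SAMPLE_RATE_HZ // 1000, end_ms * SAMPLE_RATE_HZ // 1000
```

Parsing the two parts as integers, rather than converting the whole value to a float, keeps the millisecond values exact.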

77. UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2

Creator:: Zemánek, Petr, Pospíšil, Adam, Sellat, Hashem, Krubiński, Mateusz, and Pecina, Pavel
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: audio and corpus
Subject:: speech corpus, speech recognition, speech-to-text translation, machine translation, multilingual, Arabic, Arabic Corpus, and north levantine
Language:: North Levantine Arabic and English
Description:: The corpus contains recordings by native speakers of North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50). The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous: the participants were given only the general subject and either talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by native speakers of the dialect. The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides a mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID. The audio data is shared in the 48 kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav". The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Rights:: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
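
Because the transcriptions (.apc) and translations (.eng) are parallel with one segment per line, they can be aligned by line number. The following is a minimal sketch under that assumption; the function name is hypothetical and the file names in the usage comment are placeholders, not names from the corpus.

```python
def pair_segments(apc_lines, eng_lines):
    """Pair each transcription line with the translation on the same line number."""
    apc = [line.rstrip("\n") for line in apc_lines]
    eng = [line.rstrip("\n") for line in eng_lines]
    if len(apc) != len(eng):
        # Parallel files should have the same number of segments.
        raise ValueError(f"line count mismatch: {len(apc)} vs {len(eng)}")
    return list(zip(apc, eng))


# Usage with placeholder file names:
# with open("example.apc", encoding="utf-8") as f_apc, \
#      open("example.eng", encoding="utf-8") as f_eng:
#     pairs = pair_segments(f_apc, f_eng)
```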

78. Ukrainian War Refugees as Self-Translators Dataset

Creator:: Agapova, Anna and Špačková, Stanislava
Publisher:: Taylor & Francis Online
Type:: TEXT and Spreadsheet
Subject:: machine translation, self-translation, migration, Ukrainian refugees, Russo-Ukrainian war, and Czech Republic
Language:: Ukrainian and Russian
Description:: Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns. Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB

79. Video699: lecture recordings and lecture materials

Creator:: Novotný, Vít
Publisher:: Faculty of Informatics, Masaryk University
Type:: video and corpus
Subject:: information retrieval, video, image, and XML
Language:: English and Czech
Description:: This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified sample of up to 25 video frames from each recording. In each video frame, we annotated lit projection screens and their condition. For each lit projection screen, we annotated lecture materials shown in the screen. The dataset contains 699 projection screen annotations, and 925 lecture materials.
Rights:: Open Data Commons Open Database License (ODbL), http://opendatacommons.org/licenses/odbl/summary/, and PUB

80. Word Importance Dataset

Creator:: Osuský, Adam and Javorský, Dávid
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: word importance, ranking, and importance ranking
Language:: English
Description:: This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators, who identified and ranked the most important words (up to 10% of each text) according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
rankings_task.csv - This CSV file contains information about the contexts to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.
rankings_ranking.csv - This CSV file includes ranking information for the assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON value giving the word positions selected by an annotator and their ordering.
- assignment_id: A reference ID linking to the assignments.
rankings_assignment.csv - This CSV file tracks the completion status of tasks by users. It contains four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier of the user who should complete the task (rank the words).
Known Issues: Each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be aware of this issue and verify the uniqueness of the annotations where necessary.
This dataset is part of the work for a bachelor thesis: OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.
Rights:: Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
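
The known issue noted in the description (possible duplicate annotations) can be checked by linking rankings to assignments through `assignment_id`. The following is a minimal sketch, assuming the column names listed in the description and treating a duplicate as more than one ranking for the same (user_id, task_id) pair; that definition, the helper names, and the sample rows are assumptions, not part of the dataset documentation.

```python
import csv
import io
from collections import Counter


def read_csv(text):
    """Parse CSV text into a list of dict rows keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def find_duplicate_rankings(assignments, rankings):
    """Return sorted (user_id, task_id) pairs that have more than one ranking."""
    pair_by_assignment = {
        a["id"]: (a["user_id"], a["task_id"]) for a in assignments
    }
    counts = Counter(
        pair_by_assignment[r["assignment_id"]]
        for r in rankings
        if r["assignment_id"] in pair_by_assignment
    )
    return sorted(pair for pair, n in counts.items() if n > 1)


# Made-up rows for illustration only; they are not taken from the dataset.
assignments_csv = "\n".join([
    "id,is_completed,task_id,user_id",
    "1,1,10,u1",
    "2,1,10,u1",
    "3,1,11,u2",
])
rankings_csv = "\n".join([
    "id,score,word_order,assignment_id",
    '100,0.5,"[3, 7]",1',
    '101,0.4,"[7, 3]",2',
    '102,0.9,"[1]",3',
])
duplicates = find_duplicate_rankings(
    read_csv(assignments_csv), read_csv(rankings_csv)
)
```

With the real files, the same functions could be applied to the contents of rankings_assignment.csv and rankings_ranking.csv; which of the duplicated entries to keep is a decision left to the user of the dataset.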
