STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency treebank. The criterion for including sentences into STYX was their suitability for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether the STYX 1.0 corpus contains 11 655 sentences.
Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F)
STYX 1.0 is a corpus of Czech sentences selected from the Prague Dependency treebank. The criterion for including sentences into STYX was their suitability for practicing Czech morphology and syntax in elementary schools. The sentences contain both the PDT annotations and the school sentence analyses. The school sentence analyses were created by transforming the PDT annotations using handcrafted rules. Altogether the STYX 1.0 corpus contains 11 655 sentences.
Originally, the STYX 1.0 corpus was an inseparable part of the Styx system (http://hdl.handle.net/11858/00-097C-0000-0001-48FB-F)
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns.
Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for the Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey. Also, in this version of the dataset, the textual answers to the question 1.6 "How many months have you been to this country?" were replaced by numbers, so that we could separate the data of respondents who arrived in the Czech Republic in February 2022 or later from the other data (0 for those staying in Czechia before February 2022, 1 for those staying in Czechia since February 2022 or later, 2 for those staying in other countries).
AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized, pruned to allow for real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a full service for Czech->English translation which can be easily utilized as a component in a larger software solution. (The required tools are freely available and an installation guide is included in the package.)
The translation models were trained on CzEng 1.0 corpus and Europarl. Monolingual data for LM estimation additionally contains WMT news crawls until 2013.
The datasets described in Droganova, Kira, and Daniel Zeman. "Towards a Unified Taxonomy of Deep Syntactic Relations." Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.
Four languages are included in this release. English PropBank is omitted due to its license terms.
The corpus contains recordings by the native speakers of the North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous, the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by the native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "Tar_13052022_Czechia-01.wav" and "Tar_13052022_Czechia-02.wav".
The data provided in this repository corresponds to the validation split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
The corpus contains recordings by the native speakers of the North Levantine Arabic (apc) acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. Altogether, there were 13 speakers (9 male and 4 female, aged 1x 15-20, 7x 20-30, 4x 30-40, and 1x 40-50).
The recordings contain both monologues and dialogues on the topics of everyday life (health, education, family life, sports, culture) as well as information on both host countries (living abroad) and country of origin (Syria traditions, education system, etc.). Both types are spontaneous, the participants were given only the general subject and talked on the topic or discussed it freely. The transcription and translation team consisted of students of Arabic at Charles University, with an additional quality check provided by the native speakers of the dialect.
The textual data is split between the (parallel) transcriptions (.apc) and translations (.eng), with one segment per line. The additional .yaml file provides mapping to the corresponding audio file (with the duration and offset in the "%S.%03d" format, i.e., seconds and milliseconds) and a unique speaker ID.
The audio data is shared in the 48kHz .wav format, with dialogues and monologues in separate folders. All of the recordings are mono, with a single channel. For dialogues, there is a separate file for each speaker, e.g., "16072022_Family-01.wav" and "16072022_Family-02.wav".
The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
Data from a questionnaire survey conducted from 2022-08-25 to 2022-11-15 and exploring the use of machine translation by Ukrainian refugees in the Czech Republic. The presented spreadsheet contains minimally processed data exported from the two questionnaires that were created in Google Forms in the Ukrainian and the Russian language. The links to these questionnaires were distributed by three methods: direct email to particular refugees whose contact details the authors obtained while volunteering; through a non-profit organisation helping refugees (Vesna women’s education institution) and on social networks by posting links to the survey in groups associating the Ukrainian community across Czech regions and towns.
Since we asked potential respondents to spread the questionnaire further, we could not prevent it from reaching Ukrainians who had arrived in Czechia previously, or received temporary protection in other countries. Due to this fact, the textual answers to the question 1.5 "Which country are you in right now?" were replaced in the dataset by numbers (1 for Czech Republic, 2 for other countries) in order for us to be able to separate the data of respondents not located in the Czech Republic, which were irrelevant for our survey.
This is an XML dataset of 17 lecture recordings randomly sampled from the lectures recorded at the Faculty of Informatics, Brno, Czechia during 2010–2016. We drew a stratified sample of up to 25 video frames from each recording. In each video frame, we annotated lit projection screens and their condition. For each lit projection screen, we annotated lecture materials shown in the screen. The dataset contains 699 projection screen annotations, and 925 lecture materials.
This dataset comprises a corpus of 50 text contexts, each about 60 words in length, sourced from five distinct domains. Each context has been evaluated by multiple annotators who identified and ranked the most important words—up to 10% of each text—according to their perceived significance. The annotators followed specific guidelines to ensure consistency in word selection and ranking. For further details, please refer to the cited source.
---
rankings_task.csv
- This csv contains information about the contexts which are to be annotated:
- id: A unique identifier for each task.
- content: The context to be ranked.
---
rankings_ranking.csv
- This csv includes ranking information for various assignments. It contains four columns:
- id: A unique identifier for each ranking entry.
- score: The score assigned to the entry.
- word_order: A JSON detailing the order of words positions. It is essentially the selected word positions and their ordering from an annotator.
- assignment_id: A reference ID linking to the assignments.
---
rankings_assignment.csv
- This csv tracks the completion status of tasks by users. It includes four columns:
- id: A unique identifier for each assignment entry.
- is_completed: A binary indicator (1 for completed, 0 for not completed).
- task_id: A reference ID linking to the tasks.
- user_id: The identifier for the user who should complete the task (rank the words).
---
Known Issues:
Please note that each annotator was intended to rank each context only once. However, due to a bug in the deployment of the annotation tool, some entries may be duplicated. Users of this dataset should be cautious of this issue and verify the uniqueness of the annotations where necessary.
---
This dataset is a part of work from a bachelor thesis:
OSUSKÝ, Adam. Predicting Word Importance Using Pre-Trained Language Models. Bachelor thesis, supervisor Javorský, Dávid. Prague: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, 2024.