DeriNet is a lexical network which models derivational and compositional relations in the lexicon of Czech. Nodes of the network correspond to Czech lexemes, while edges represent word-formational relations between a derived word and its base word / words.
The present version, DeriNet 2.2, contains:
- 1,040,127 lexemes (sampled from the MorfFlex CZ 2.0 dictionary), connected by
- 782,904 derivational,
- 50,511 orthographic variant,
- 6,336 compounding,
- 288 univerbation, and
- 135 conversion relations.
Compared to the previous version, version 2.1 contains an overhaul of the compounding annotation scheme, 4384 extra compounds, 83 more affixoid lexemes serving as bases for compounding, more parts of speech serving as bases for compounding (adverbs, pronouns, numerals), and several minor corrections of derivational relations.
The Dialogy.Org system allows users to search in transcribed audio-visual corpora. The Dialogy.Org works on the principle of web-based interface, so installation of additional programs on your computer is not necessary. You must have Flash Player for playing audio or video recordings. and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
Titles of courses possibly relevant to the Digital Humanities for 2017-2018, manually gathered from course catalogues of most Czech state colleges, including the names of the teachers, department and school names, and the school-unique course IDs. All this information was publicly available in the individual course catalogues accessed from the official websites of the individual colleges.
Modifications to DSpace made by Petr Pajas in order to support pidconsortium.eu PID handle system instead of the default handle.com system used by DSpace.
DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset defines a set of features that are encoded by the various tag sets. The set of features should be as universal as possible. It does not need to encode everything that is encoded by any tag set but it should encode all information that people may want to access and/or port from one tag set to another.
New tag sets are attached by writing a driver for them. Once the driver is ready, you can easily convert tags between the new set and any other set for which you also have a driver. This reusability is an obvious advantage over writing a targeted conversion procedure each time you need to convert between a particular pair of tag sets. and grant MSM 0021620838 of the Ministry of Education of the Czech Republic
This software package includes three tools: web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, API server and a tool for translation of documents with markup (html, docx, odt, pptx, odp,...). These tools are used in the Charles Translator service (https://translator.cuni.cz).
This software was developed within the EdUKate project, which aims to help mitigate language barriers between non-Czech-speaking children in the Czech Republic and the education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.
Extremely fast digital audio channelizer implementation, usable as a building block for experimental ASR front-ends or signal denoising applications. Also applicable in software defined radios, due to its high throughput. It comes in a form of a C/C++ library and an executable example program which reads input stream, splitting it into equidistant frequency channels, emitting their data to the output.
Features:
(1) Hand tuned SIMD-aware assembly for x86 (SSE) and IA64 (AVX) as well as for ARM (NEON) processors.
(2) Generic non-SIMD C++ implementation for other architectures.
(3) Capable of taking advantage of multicore CPUs.
(4) Fully configurable number of channels and the output decimation rate.
(5) User supplied FIR of the channel separation filter, which allows to specify the width of the channels, whether they should overlap or be separated.
(6) Input and output signal samples are treated as complex numbers.
(7) Speed over 750 complex MS/s achieved on Core i7 4710HQ @ 2.5GHz, when channelizing into 72 output channels with a FIR length of 1152 samples, using 3 computing threads.
(8) Runs under Linux OS.
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71097 and 87322 dialogue turns respectively. For Czech meetings, we provide 147 total minutes with 55 of them aligned. For English meetings, it is 256 total minutes with 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
ElixirFM is a high-level implementation of Functional Arabic
Morphology documented at http://elixir-fm.wiki.sourceforge.net/. The
core of ElixirFM is written in Haskell, while interfaces in Perl
support lexicon editing and other interactions.
Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data of 43 participants while engaged in sight translation supported by an image.
The details about the experiment and the dataset can be found in the README file.