Emil František Burian during a guest performance in Zlín in a segment from Československý zvukový týdeník Aktualita (Czechoslovak Aktualita Sound Newsreel) 1939, issue no. 41B. Burian at his desk in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1954, issue no. 25.
Athlete Emil Zátopek wins the 5,000-metre race at the 1952 Summer Olympics in Helsinki in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 37. In 1952, he also wins the 5,000-metre race in Opava in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42. Zátopek with his wife Dana Zátopková at Strahov Stadium in Prague. Zátopek accepting the Order of the Republic from the hands of Minister of Defence Alexej Čepička in 1952 in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42.
Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data recorded from 43 participants while they were engaged in sight translation supported by an image.
The details about the experiment and the dataset can be found in the README file.
EMU is a collection of software tools for the creation, manipulation and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.
The corpus presented consists of job ads in Spanish related to Engineering positions in Peru.
The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.
The corpus is divided into two components:
- POS tagging/NER training data: consists of 800 job ads, each tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: contains 9000 documents stripped of stopwords. It comes in two formats:
* Whole text documents: containing all the information originally posted in the ad.
* Extracted chunks documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
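As an illustration of how entity labels in BIO format can be read, the following minimal Python sketch groups labelled tokens into entity spans. The on-disk layout of the corpus is not described here, so the function name `bio_to_spans`, the label names `MAJOR` and `SKILL`, and the example sentence are hypothetical and serve only to show the idea; the README of the corpus should be consulted for the actual label set and file format.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_type, text) spans.

    Assumes labels of the form "B-TYPE", "I-TYPE" or "O" (hypothetical
    label names; the real corpus may use different ones).
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes the open entity; the token itself is not kept.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type = None
            current_tokens = []

    # Flush an entity that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and labels, for illustration only.
example_tokens = ["Se", "requiere", "ingeniero", "civil", "con", "AutoCAD"]
example_labels = ["O", "O", "B-MAJOR", "I-MAJOR", "O", "B-SKILL"]
example_spans = bio_to_spans(example_tokens, example_labels)
```

With the hypothetical input above, `example_spans` holds one `MAJOR` span ("ingeniero civil") and one `SKILL` span ("AutoCAD").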
Professor and Minister of Finance Karel Engliš in a garden in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1934, issue no. 10. Engliš, first on his own and later with literary critic Miloslav Hýsek on Bohumil Veselý's balcony.