Athlete Emil Zátopek wins the 5,000-metre race at the 1952 Summer Olympics in Helsinki in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 37. In 1952, he also wins the 5,000-metre race in Opava in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42. Zátopek with his wife Dana Zátopková at Strahov Stadium in Prague. Zátopek accepting the Order of the Republic from the hands of Minister of Defence Alexej Čepička in 1952 in a segment from Československé filmové noviny (Czechoslovak Film News) 1952, issue no. 42.
Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data and 4-electrode wearable electroencephalogram (EEG) data recorded from 43 participants as they performed sight translation supported by an image.
The details about the experiment and the dataset can be found in the README file.
EMU is a collection of software tools for the creation, manipulation and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.
This corpus consists of Spanish-language job ads for engineering positions in Peru.
The documents were preprocessed and annotated for POS tagging, NER, and topic modeling tasks.
The corpus is divided into two components:
- POS tagging/NER training data: 800 job ads, each tokenized and manually annotated with POS tag information (EAGLE format) and entity labels in BIO format.
- Topic modeling training data: 9000 documents with stopwords removed. It comes in two formats:
* Whole text documents: containing all the information originally posted in the ad.
* Extracted-chunk documents: containing chunks extracted by custom NER models (expected skills, tasks to perform, and preferred major), as described in "Improving Topic Coherence Using Entity Extraction Denoising" (to appear).
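As an illustration of the BIO entity labelling mentioned above, the following minimal sketch groups a per-token BIO tag sequence into entity spans. The label names (SKILL, MAJOR) are hypothetical placeholders; the corpus's actual label inventory and file layout are not specified here.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (label, start, end_exclusive) spans from a BIO tag sequence."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An open span closes on "O", on any "B-", or on an "I-" whose label differs.
        closes = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != label)
        )
        if closes and label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # A span opens on "B-", or on a stray "I-" when no span is open.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence for a five-token sentence.
example_tags = ["O", "B-SKILL", "I-SKILL", "O", "B-MAJOR"]
example_spans = bio_to_spans(example_tags)
```

Treating a stray `I-` tag as the start of a new span is one lenient convention among several; a stricter reader could reject such sequences instead.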
Professor and Minister of Finance Karel Engliš in a garden in a segment from Československý filmový týdeník (Czechoslovak Film Weekly Newsreel) 1934, issue no. 10. Engliš, first on his own and later with literary critic Miloslav Hýsek on Bohumil Veselý's balcony.
Data were collected using the Sketch Engine program.
Data were extracted from the annotated English web corpus enTenTen20.
Data collection and analysis were carried out over a period of two months, April and May 2023.
The enTenTen20 corpus has recently been updated to a newer version, enTenTen21. The older version is nevertheless still available, can still be worked on and can be compared with the newer one. The differences between the two versions of the English web corpus did not affect the results of this study. The only apparent difference was in the slightly different frequency values for specific collocations. This was expected, since the older version of the web corpus consists of 36 billion words, while the new version contains 52 billion words. As noted above, these frequency deviations were not large enough to refute the hypotheses; rather, they confirmed them once again.
This study is one of the results of work on a larger scientific research project called "Metaphorical collocations - syntagmatic relations between semantics and pragmatics". More information about the project is available at the following link: https://metakol.uniri.hr/en/opis-projekta/
The study was financed by the Croatian Science Foundation.
Working with the data/replicating the study:
Data collected for the purposes of this study is available in CSV format.
Data for each gustatory adjective (collocate) is presented in a separate CSV file.
After opening each file, widen the columns so that the data are easier to read.
The tables list the collocational bases (nouns) found in the corpus in combination with a specific gustatory adjective, their collocate.
These nouns are listed by their Mutual Information score, which expresses the extent to which words co-occur compared to the number of times they appear separately.
The tables show what type of mapping is present in each collocation (e.g., intra-modal or cross-modal).
The tables also show what type of meaning or cognitive process underlies the formation of the meaning (e.g., metonymic or metaphoric).
For every analyzed collocation, we provided a contextualized example of its use from the corpus, along with the hyperlink where it can be found.
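The Mutual Information score used to list the nouns can be illustrated with a short sketch. It assumes the common pointwise form log2(f_AB · N / (f_A · f_B)), where f_AB is the frequency of the collocation, f_A and f_B are the frequencies of the two words, and N is the corpus size; the function name and the numbers below are made up for illustration and are not taken from the data files.

```python
import math


def mi_score(f_ab: int, f_a: int, f_b: int, corpus_size: int) -> float:
    """Mutual Information score: log2 of observed over expected co-occurrence.

    f_ab        -- frequency of the collocation (both words together)
    f_a, f_b    -- frequencies of the two words on their own
    corpus_size -- total number of tokens in the corpus
    """
    return math.log2(f_ab * corpus_size / (f_a * f_b))


# Illustrative frequencies only (not real corpus counts).
example_mi = mi_score(f_ab=8, f_a=16, f_b=32, corpus_size=1024)
```

Because the corpus size enters the formula directly, the same collocation can receive somewhat different values in corpora of different sizes, which is consistent with the frequency differences between enTenTen20 and enTenTen21 noted above.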
English model for NameTag, a named entity recognition tool. The model is trained on the CoNLL-2003 training data. It recognizes PER, ORG, LOC and MISC named entities and achieves an F-measure of 84.73 on the CoNLL-2003 test data.
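For readers unfamiliar with the F-measure, the sketch below computes it as the harmonic mean of precision and recall over exactly matching entity spans. This is a generic illustration with made-up spans, not NameTag's own evaluation code, and the function name is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start token, end token exclusive)


def span_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """F-measure over exact span matches: 2PR / (P + R)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up gold and predicted spans: two of three predictions are correct.
gold_spans = {("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 10)}
predicted_spans = {("PER", 0, 2), ("LOC", 5, 6), ("MISC", 7, 8)}
example_f1 = span_f1(gold_spans, predicted_spans)
```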